How can I find outliers?
Outliers are important to analyze because they usually hold important information about the data being studied, and may distort the findings of your data analysis. You might remove the outlier from the data set if it was an error, but analyzing it first can show you its meaning or help you predict future outliers.
They can provide insight into the data gathering, recording and analyzing process, and may be the key to discovering system inconsistencies. Even when outliers are errors, they can help you better understand your data, which is why it's important to identify and evaluate any outliers. Here are five ways to find outliers in your data set.
An easy way to identify outliers is to sort your data, which allows you to see any unusual data points within your information.
Try sorting your data in ascending or descending order, then examine it for values that stand apart from the rest. An unusually high or low piece of data could be an outlier.
If you have a small set of data, you can do this by hand. If you have a large set of data, consider sorting it with a database program. For example, if you have these numbers in ascending order: 3, 6, 7, 10 and 54, you can see that 54 is a lot larger than the rest of the data points. Statisticians would consider 54 an outlier. Another example could be a set in ascending order that begins 2, 38, 43, 49: here 2 is much smaller than the other data points, so we can say that 2 is the outlier.
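The sorting approach can be sketched in a few lines of Python. This is only an illustration: the function name `sort_and_inspect` is made up for this example, and printing the gap to the previous value is an added convenience, not part of the method described above.

```python
def sort_and_inspect(values):
    """Return (value, gap_to_previous) pairs in ascending order."""
    ordered = sorted(values)
    rows = []
    for i, value in enumerate(ordered):
        # The first value has no predecessor, so its gap is reported as 0.
        gap = value - ordered[i - 1] if i > 0 else 0
        rows.append((value, gap))
    return rows


# The numbers from the example above, in arbitrary order.
data = [7, 54, 3, 10, 6]
for value, gap in sort_and_inspect(data):
    print(f"{value:>4}  (gap from previous value: {gap})")
```

A gap that is much larger than the others, such as the jump from 10 to 54, points to a value worth a closer look.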
Once you've identified your outliers, you can begin to research why they appeared in your data.
You can also use graphs, such as scatter plots or histograms, to find outliers. Graphs present your data visually, making it easy to see when a piece of data differs from the rest of the data set.
A scatter plot displays your points of data as dots on a graph based on two variables, plotted on your x-axis and y-axis. Scatter plots are useful for visualizing outliers because you can see when one dot is far away from the other dots, which are usually clustered together.
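The data point that lies far away from the group is therefore the likely outlier. A histogram displays data in groups called "bins."
Another way to find outliers is to use the quartiles of your data, which split the sorted values around the median. The example in the steps below is a set of 12 temperature readings taken in a room, one of which comes from an oven.
Calculate the lower quartile. This point, to which we will assign the variable Q1, is the data point below which 25 percent (one quarter) of the observations sit.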
In other words, this is the halfway point of the points in your data set below the median. If there are an even number of values below the median, you once again must average the two middle values to find Q1, much like you may have had to do to find the median itself. In our example, 6 points lie above the median and 6 points lie below it.
This means that, to find the lower quartile, we will need to average the two middle points of the bottom six points. Points 3 and 4 of the bottom 6 are both equal to 70, so their average, Q1, is 70.
Calculate the upper quartile. This point, which is assigned the variable Q3, is the data point above which 25 percent of the data sits. Finding Q3 is almost identical to finding Q1, except that, in this case, the points above the median, rather than below it, are taken into account.
Continuing with the example above, one of the two middle points of the 6 points above the median is 71, and Q3 is the average of those two middle points.
Find the interquartile range. Now that we've defined Q1 and Q3, we need to calculate the distance between these two variables. The distance from Q1 to Q3 is found by subtracting Q1 from Q3. The value you obtain for the interquartile range is vital for determining the boundaries for non-outlier points in your data set.
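As a rough sketch of the quartile steps described so far, the Python below splits the sorted data at the median and takes the median of each half. The function names and the 12 readings are assumptions made for illustration: the readings are hypothetical values chosen to agree with the details given in the text (6 points on each side of the median and a Q1 of 70), not the article's actual data.

```python
def median(sorted_values):
    """Median of an already sorted list; averages the two middle values if needed."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, the median itself belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


# Hypothetical temperature readings (assumed for this sketch).
readings = [71, 70, 73, 70, 70, 69, 70, 72, 71, 300, 71, 69]
q1, med, q3 = quartiles(readings)
iqr = q3 - q1
print(f"Q1 = {q1}, median = {med}, Q3 = {q3}, IQR = {iqr}")
```

Library routines often use other quartile conventions, such as interpolating between points, so their results can differ slightly from this hand method.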
In our example, Q1 is 70, so the interquartile range is Q3 - 70. Note that the subtraction Q3 - Q1 works even if Q1, Q3, or both are negative numbers: subtracting a negative Q1 simply adds its magnitude to Q3.
Find the "inner fences" for the data set. Outliers are identified by assessing whether or not they fall within a set of numerical boundaries called "inner fences" and "outer fences". A point outside the inner fences may be a mild outlier, and a point outside the outer fences is a major outlier.
To find the inner fences for your data set, first, multiply the interquartile range by 1.5. Then, add the result to Q3 and subtract it from Q1. The two resulting values are the boundaries of your data set's inner fences. In our example, we multiply the interquartile range by 1.5, then add this number to Q3 and subtract it from Q1 to find the boundaries of the inner fences. In our data set, only the temperature of the oven lies outside this range and thus may be a mild outlier.
However, we have yet to determine if this temperature is a major outlier, so let's not draw any conclusions until we do so.
Find the "outer fences" for the data set. This is done in the same way as the inner fences, except that the interquartile range is multiplied by 3 instead of 1.5. The result is then added to Q3 and subtracted from Q1 to find the upper and lower boundaries of the outer fence. In our example, we multiply the interquartile range above by 3 and find the boundaries of the outer fence in the same fashion as before. Any data points that lie outside the outer fences are considered major outliers.
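The fence calculation can be sketched as follows. The helper names `fences` and `classify` are invented for this illustration, Q1 = 70 is taken from the text, and the Q3 value of 71.5 and the sample readings are assumed purely so the example can run.

```python
def fences(q1, q3, multiplier):
    """Return (lower, upper) fences placed multiplier * IQR beyond the quartiles."""
    spread = (q3 - q1) * multiplier
    return q1 - spread, q3 + spread


def classify(value, q1, q3):
    """Label a value against the inner (1.5 x IQR) and outer (3 x IQR) fences."""
    inner_low, inner_high = fences(q1, q3, 1.5)
    outer_low, outer_high = fences(q1, q3, 3.0)
    if value < outer_low or value > outer_high:
        return "major outlier"
    if value < inner_low or value > inner_high:
        return "mild outlier"
    return "not an outlier"


# Q1 is from the text; Q3 and the readings are assumptions for this sketch.
q1, q3 = 70.0, 71.5
print("inner fences:", fences(q1, q3, 1.5))
print("outer fences:", fences(q1, q3, 3.0))
for reading in (69, 73, 75, 300):
    print(reading, "->", classify(reading, q1, q3))
```

Checking the outer fences first means a value that lies beyond both sets of fences is reported once, as a major outlier.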
In this example, the oven temperature lies well outside the outer fences, so it's definitely a major outlier.
Use a qualitative assessment to determine whether to "throw out" outliers. Using the methodology described above, it's possible to determine whether certain points are minor outliers, major outliers, or not outliers at all. However, make no mistake - identifying a point as an outlier only marks it as a candidate for omission from the data set, not as a point that must be omitted.
The reason that an outlier differs from the rest of the points in the data set is crucial in determining whether to omit the outlier or not. Generally, outliers that can be attributed to an error of some sort - an error in measurement, recording, or experimental design, for instance - are omitted.
Another criterion to consider is whether outliers significantly impact the mean of a data set in a way that skews it or makes it appear misleading. This is especially important to consider if you intend to draw conclusions from the mean of your data set.
Let's assess our example. Since it's highly unlikely that the oven reached such a high temperature through some unforeseen natural force, we can conclude with near-certainty that the oven was accidentally left on, resulting in the anomalous high temperature reading.
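How strongly a single extreme value pulls the mean can be checked directly. The sketch below uses hypothetical readings (assumed for illustration, not the article's actual data) and a made-up helper name, `mean_with_and_without`; it relies only on Python's standard `statistics` module.

```python
from statistics import mean


def mean_with_and_without(values, outlier):
    """Return the mean of all values and the mean with the outlier removed."""
    kept = [v for v in values if v != outlier]
    return mean(values), mean(kept)


# Hypothetical temperature readings with one extreme value (assumed).
readings = [71, 70, 73, 70, 70, 69, 70, 72, 71, 300, 71, 69]
with_outlier, without_outlier = mean_with_and_without(readings, 300)
print(f"mean with the outlier:    {with_outlier:.1f}")
print(f"mean without the outlier: {without_outlier:.1f}")
```

With these assumed readings, one value moves the mean by roughly 19 degrees, which is why a mean that includes an erroneous outlier can be misleading.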
Since the outlier can be attributed to human error and because it's inaccurate to say that this room's average temperature was almost 90 degrees, we should opt to omit our outlier.
Understand the importance of sometimes retaining outliers. Scientific experiments are especially sensitive situations when dealing with outliers - omitting an outlier in error can mean omitting information that signifies some new trend or discovery.
For instance, let's say that we're designing a new drug to increase the size of fish in a fish farm. Suppose we reuse our data set, except that each point now represents the mass of a fish (in grams) after treatment with a different experimental drug. In other words, the first drug gave one fish a mass of 71 grams, the second drug gave a different fish a mass of 70 grams, and so on. In this situation, the largest value is still a big outlier, but we shouldn't omit it because, assuming it's not due to an error, it represents a significant success in our experiment. The drug that yielded the heaviest fish worked better than all the other drugs, so this point is actually the most important one in our data set, rather than the least.
Besides strong outliers, there is another category for outliers. A strong outlier lies more than 3 times the interquartile range below the first quartile or above the third quartile. If a data value is an outlier, but not a strong outlier, then we say that the value is a weak outlier. We will look at these concepts by exploring a few examples.
In the first example, the largest value in the data set is 9. The number 9 certainly looks like it could be an outlier. It is much greater than any other value in the set. To objectively determine if 9 is an outlier, we use the above methods. The first quartile is 2 and the third quartile is 5, which means that the interquartile range is 3.
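We multiply the interquartile range by 1.5, which gives 4.5, and add this to the third quartile. The result, 9.5, is greater than any of our data values. Therefore there are no outliers.
In the second example, the value in question is 10 rather than 9. The first quartile, third quartile, and interquartile range are identical to example 1. When we add 1.5 times the interquartile range, or 4.5, to the third quartile, the sum is 9.5. Since 10 is greater than 9.5, it is considered an outlier.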
Is 10 a strong or weak outlier? To find out, we multiply the interquartile range by 3, which gives 9. When we add 9 to the third quartile, we end up with a sum of 14. Since 10 is not greater than 14, it is not a strong outlier. Thus we conclude that 10 is a weak outlier.
We always need to be on the lookout for outliers. Sometimes they are caused by an error. Other times outliers indicate the presence of a previously unknown phenomenon.
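For reference, the arithmetic behind the two examples with 9 and 10 can be written out in one place. The block below is LaTeX (it assumes the amsmath package for `align*`), and the symbols Q_1, Q_3 and IQR are shorthand for the first quartile, third quartile and interquartile range.

```latex
\begin{align*}
\mathrm{IQR} &= Q_3 - Q_1 = 5 - 2 = 3 \\
Q_3 + 1.5 \times \mathrm{IQR} &= 5 + 4.5 = 9.5 \\
Q_3 + 3 \times \mathrm{IQR} &= 5 + 9 = 14 \\
9 &< 9.5 \quad \text{(9 is not an outlier)} \\
9.5 &< 10 < 14 \quad \text{(10 is a weak outlier)}
\end{align*}
```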
Another reason to be diligent about checking for outliers is that many descriptive statistics are sensitive to them.
The mean, standard deviation and correlation coefficient for paired data are just a few of these types of statistics.