College of Micronesia-FSM: Dana Lee Ling's Introduction to Statistics Using OpenOffice.org, LibreOffice.org Calc, 4th edition: "Section 3.2: Differences in the Distribution of Data"
Read this section.
The range is the maximum data value minus the minimum data value.
The range is a useful basic statistic that provides information on the distance between the most extreme values in the data set.
The range does not show whether the data is evenly spread out across the range or crowded together in just one part of the range. The way in which the data is either spread out or crowded together in a range is referred to as the distribution of the data. One way to understand the distribution of the data is to calculate the position of the quartiles and make a chart based on the results.
Percentiles, Quartiles, Box and Whisker charts
The median is the value that is the middle value in a sorted list of values. At the median 50% of the data values are below and 50% are above. This is also called the 50th percentile for being 50% of the way "through" the data.
If one starts at the minimum, the point 25% of the way "through" the data, at which 25% of the values are smaller, is the 25th percentile. The value that is 25% of the way "through" the data is also called the first quartile.
Moving on "through" the data to the median, the median is also called the second quartile.
Moving past the median, 75% of the way "through" the data is the 75th percentile, also known as the third quartile.
Note that the 0th percentile is the minimum and the 100th percentile is the maximum.
Spreadsheets can calculate the first, second, and third quartile for data using the quartile function, =QUARTILE(data;type).
Data is a range with data. Type represents the type of quartile (0 = minimum, 1 = 25% or first quartile, 2 = 50% or median, 3 = 75% or third quartile, and 4 = maximum). Thus if data is in the cells A1:A20, the first quartile could be calculated using =QUARTILE(A1:A20;1).
The InterQuartile Range (IQR) is the range between the first and third quartile, that is, the third quartile minus the first quartile: =QUARTILE(data;3)-QUARTILE(data;1)
There are some subtleties to calculating the IQR for sets with even versus odd sample sizes, but this text leaves those details to the spreadsheet software functions.
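The quartile and IQR calculation can also be sketched outside a spreadsheet. The following is a minimal Python illustration, not part of the original text. The data values are hypothetical, and the choice of `method="inclusive"` is an assumption made so that the minimum and maximum act as the 0th and 100th percentiles, as described above; spreadsheet software may handle the interpolation details differently.

```python
import statistics

# Hypothetical sample; any list of numbers could be used here.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15]

# n=4 splits the data into four parts and returns the three cut points.
# method="inclusive" treats the sample minimum and maximum as the
# 0th and 100th percentiles and interpolates between sorted values.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# The interquartile range is the third quartile minus the first quartile.
iqr = q3 - q1

print("first quartile:", q1)
print("second quartile (median):", q2)
print("third quartile:", q3)
print("IQR:", iqr)
```

Note that the second quartile agrees with `statistics.median(data)`, which is the sense in which the median is the 50th percentile.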
Quartiles, Box and Whisker plots
The above is very abstract and hard to visualize. A box and whisker plot takes the above quartile information and plots a chart based on the quartiles.
A box and whisker plot is built around a box that runs from the value at the 25th percentile (first quartile) to the value at the 75th percentile (third quartile). The length of the box spans the distance from the value at the first quartile to the value at the third quartile. This distance is called the Inter-Quartile Range (IQR). A line is drawn inside the box at the location of the 50th percentile. The 50th percentile is also known as the second quartile and is the median for the data. Half the scores are above the median, half are below the median. Note that the 50th percentile is the median, not the mean.
The basic box plot described above has lines that extend from the first quartile down to the minimum value and from the third quartile up to the maximum value. These lines are called "whiskers" and end with a cross-line called a "fence". If, however, the minimum is more than 1.5 × IQR below the first quartile, then the lower fence is put at 1.5 × IQR below the first quartile and the values below the fence are marked with a circle. These values are referred to as potential outliers: data that is unusually far from the median in relation to the other data in the set.
Likewise, if the maximum is more than 1.5 × IQR beyond the third quartile, then the upper fence is located at 1.5 × IQR above the 3rd quartile. The maximum is then plotted as a potential outlier along with any other data values beyond 1.5 × IQR above the 3rd quartile.
There are actually two types of outliers. Potential outliers lie between 1.5 × IQR and 3.0 × IQR beyond the first or third quartile. Extreme outliers lie more than 3.0 × IQR beyond the first or third quartile. In the program Gnome Gnumeric potential outliers are marked with a circle colored in with the color of the box. Extreme outliers are marked with an open circle, a circle with no color inside.
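The 1.5 × IQR and 3.0 × IQR rules can be made concrete with a short sketch. This is an illustration only: the function name `classify_outliers` and the sample values are hypothetical, and the quartiles are computed with Python's inclusive method, which may differ slightly from what a particular box plot program uses.

```python
import statistics

def classify_outliers(data):
    """Split values into potential and extreme outliers using the IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    potential, extreme = [], []
    for x in data:
        # Distance beyond the box; zero for values between Q1 and Q3.
        distance = max(q1 - x, x - q3, 0)
        if distance > 3.0 * iqr:
            extreme.append(x)
        elif distance > 1.5 * iqr:
            potential.append(x)
    return potential, extreme

# Hypothetical sample with two unusually large values.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 15, 30]
potential, extreme = classify_outliers(sample)
print("potential outliers:", potential)
print("extreme outliers:", extreme)
```

For this sample the first quartile is 3.25, the third quartile is 7.75, and the IQR is 4.5, so the upper fence sits at 14.5 and the extreme cutoff at 21.25. The value 15 falls between the two and the value 30 lies beyond both.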
An example with hypothetical data sets is given to illustrate box plots. The data consists of two samples. Sample one (s1) is a uniform distribution and sample two (s2) is a highly skewed distribution.
Box and whisker plots can be generated by the Gnome Gnumeric program or by using online box plot generators.
The box and whisker plot is a useful tool for exploring data and determining whether the data is symmetrically distributed, skewed, and whether the data has potential outliers - values far from the rest of the data as measured by the InterQuartile Range. The distribution of the data often impacts what types of analysis can be done on the data.
The distribution is also important in determining whether a measurement is performing as intended. For example, in education a "good" test is usually one that generates a symmetric distribution of scores with few outliers. A highly skewed distribution of scores would suggest that the test was either too easy or too difficult. Outliers would suggest unusual performances on the test.
Two data sets, one uniform, the other with one potential outlier and one extreme outlier.
Consider the following data:
|Data set||Data||Mode||Median||Mean||Min||Max||Range||Midrange|
|Data set 1||5, 5, 5, 5||5||5||5||5||5||0||5|
|Data set 2||2, 4, 6, 8||none||5||5||2||8||6||5|
|Data set 3||2, 2, 8, 8||none||5||5||2||8||6||5|
Neither the mode, the median, nor the mean clearly reveals the differences in the distribution of the data above. The mean and the median are the same for each data set. The mode is the same as the mean and the median for the first data set and is unavailable for the other two: data set 2 has no repeated value, and data set 3 has two values tied for most frequent (spreadsheets will report a mode of 2 for data set 3). A single number that would characterize how much the data is spread out would be useful.
As noted earlier, the range is one way to capture the spread of the data. The range is calculated by subtracting the smallest value from the largest value. In a spreadsheet: =MAX(data)-MIN(data)
The range still does not characterize the difference between set 2 and 3: the last set has more data further away from the center of the data distribution. The range misses this difference.
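As a quick check of this point, the range of the three data sets above can be computed directly. This is a minimal Python sketch added for illustration; the helper name `data_range` is hypothetical.

```python
# The three data sets from the table above.
datasets = {
    "Data set 1": [5, 5, 5, 5],
    "Data set 2": [2, 4, 6, 8],
    "Data set 3": [2, 2, 8, 8],
}

def data_range(values):
    """Range: the maximum data value minus the minimum data value."""
    return max(values) - min(values)

ranges = {name: data_range(values) for name, values in datasets.items()}
for name, r in ranges.items():
    print(name, "range =", r)
```

Data sets 2 and 3 both report a range of 6, even though data set 3 has all of its values at the extremes.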
To capture the spread of the data we use a measure related to the average distance of the data from the mean. We call this the standard deviation. If we have a population, we report this average distance as the population standard deviation. If we have a sample, then our average distance value may underestimate the actual population standard deviation. As a result the formula for sample standard deviation adjusts the result mathematically to be slightly larger. For our purposes these numbers are calculated using spreadsheet functions.
One way to distinguish the difference in the distribution of the numbers in data set 2 and data set 3 above is to use the standard deviation.
|Data set||Data||Mean||Sample standard deviation|
|Data set 1||5, 5, 5, 5||5||0.00|
|Data set 2||2, 4, 6, 8||5||2.58|
|Data set 3||2, 2, 8, 8||5||3.46|
The function that calculates the sample standard deviation is: =STDEV(data)
In this text the symbol for the sample standard deviation is usually sx.
In this text the symbol for the population standard deviation is usually σ.
The symbol sx usually refers to the standard deviation of single variable x data. If there is y data, the standard deviation of the y data is sy. Other symbols that are used for standard deviation include s and σx. Some calculators use the unusual and confusing notations σxn−1 and σxn for sample and population standard deviations.
In this class we always use the sample standard deviation in our calculations. The sample standard deviation is calculated in a way such that the sample standard deviation is slightly larger than the result of the formula for the population standard deviation. This adjustment is needed because a population tends to have a slightly larger spread than a sample. There is a greater probability of outliers in the population data.
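The difference between the two calculations can be sketched in Python. This is an illustration under stated assumptions rather than the text's own method: the helper `sample_sd` is a hypothetical name that spells out the usual sample formula (sum of squared distances from the mean, divided by n − 1, then square-rooted), while `statistics.stdev` and `statistics.pstdev` are the standard library's sample and population versions.

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation written out step by step."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    # Dividing by n - 1 instead of n makes the result slightly larger.
    return math.sqrt(squared_deviations / (n - 1))

datasets = {
    "Data set 1": [5, 5, 5, 5],
    "Data set 2": [2, 4, 6, 8],
    "Data set 3": [2, 2, 8, 8],
}

results = {}
for name, values in datasets.items():
    sx = statistics.stdev(values)      # sample: divides by n - 1
    sigma = statistics.pstdev(values)  # population: divides by n
    results[name] = (round(sx, 2), round(sigma, 2))
    print(f"{name}: sx = {sx:.2f}, sigma = {sigma:.2f}")
```

The sample values reproduce the 0.00, 2.58, and 3.46 in the table above, and in each case the population value is no larger than the sample value.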
Coefficient of variation CV
The Coefficient of Variation is calculated by dividing the standard deviation (usually the sample standard deviation) by the mean.
Note that the CV can be expressed as a percentage: data set 2 has a CV of 52% while data set 3 has a CV of 69%. A deviation of 3.46 is large for a mean of 5 (3.46/5 = 69%) but would be small if the mean were 50 (3.46/50 = 7%). So the CV can tell us how important the standard deviation is relative to the mean.
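The percentages quoted here can be reproduced with a few lines of Python. This is a minimal sketch; the function name `coefficient_of_variation` is hypothetical, and it assumes the sample standard deviation is the one intended.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

cv2 = coefficient_of_variation([2, 4, 6, 8])  # data set 2
cv3 = coefficient_of_variation([2, 2, 8, 8])  # data set 3

# The :.0% format multiplies by 100 and rounds to a whole percent.
print(f"Data set 2 CV: {cv2:.0%}")
print(f"Data set 3 CV: {cv3:.0%}")
```

Because the CV divides by the mean, it is only meaningful when the mean is not zero.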
Rules of thumb regarding spread
As an approximation, the standard deviation for data that has a symmetrical, heap-like distribution is roughly one-quarter of the range. If given only minimum and maximum values for data, this rule of thumb can be used to estimate the standard deviation.
At least 75% of the data will be within two standard deviations of the mean, regardless of the shape of the distribution of the data.
At least 89% of the data will be within three standard deviations of the mean, regardless of the shape of the distribution of the data.
If the shape of the distribution of the data is a symmetrical heap, then about 95% of the data will be within two standard deviations of the mean.
Data beyond two standard deviations away from the mean is considered "unusual" data.
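The 75% and 89% figures are consistent with Chebyshev's inequality, which guarantees that at least 1 − 1/k² of any data set lies within k standard deviations of the mean. The text does not name this result, so treat the connection as an added note. The sketch below uses hypothetical function names and a hypothetical minimum and maximum to illustrate both that bound and the one-quarter-of-the-range estimate.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of data within k standard deviations (k > 1)."""
    return 1 - 1 / k**2

def estimate_sd_from_range(minimum, maximum):
    """Rule of thumb: standard deviation is roughly one-quarter of the range."""
    return (maximum - minimum) / 4

two_sd = chebyshev_lower_bound(2)
three_sd = chebyshev_lower_bound(3)
print(f"within 2 standard deviations: at least {two_sd:.0%}")
print(f"within 3 standard deviations: at least {three_sd:.0%}")

# Hypothetical scores running from 40 to 100.
estimate = estimate_sd_from_range(40, 100)
print("estimated standard deviation:", estimate)
```

The range-based estimate is only an approximation and, as stated above, applies to symmetrical, heap-like data.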
Basic statistics and their interaction with the levels of measurement
Levels of measurement and appropriate measures
|Level of measurement||Appropriate measure of middle||Appropriate measure of spread|
|nominal||mode||none or number of categories|
|interval||median or mean||range or standard deviation|
At the interval level of measurement either the median or mean may be more appropriate depending on the specific system being studied. If the median is more appropriate, then the range should be quoted as a measure of the spread of the data. If the mean is more appropriate, then the standard deviation should be used as a measure of the spread of the data.
Another way to understand the levels at which a particular type of measurement can be made is shown in the following table.
Levels at which a particular statistic or parameter has meaning
|Level of measurement|
|coefficient of variation|
For example, a mode, median, and mean can be calculated for ratio level measures. Of those, the mean is usually considered the best measure of the middle for a random sample of ratio level data.