Variance
The variance of a dataset is the mean-squared deviation between the data and the empirical mean. For a dataset of $N$ points $x_1, \dots, x_N$ with empirical mean $\bar{x}$, the variance is given by:
$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$$
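As a minimal sketch, the definition can be computed directly in Python. The function name `variance` and the sample values are illustrative assumptions. The divisor is taken to be the number of points (the population form), which matches the mean-squared definition rather than the sample form that divides by one fewer.

```python
import statistics

def variance(data):
    """Mean-squared deviation from the empirical mean (divides by the number of points)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)

# Illustrative data, not taken from the text.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(variance(data))              # 4.0
print(statistics.pvariance(data))  # the standard library's population variance
```

The standard library's `statistics.pvariance` uses the same divisor, so it is a convenient cross-check.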
Variance is not a good measure of spread
The variance is an average of squared deviations, so it is expressed in squared units (for example, cm² for data measured in cm) and is not on the same scale as the data. As a result, we introduce a new measure of spread:
Standard Deviation
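The Standard Deviation is the root-mean-squared deviation from the mean, given by:
$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$
where $x_1, \dots, x_N$ are the data points and $\bar{x}$ is their empirical mean.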
Because the standard deviation has the same units as the data, it is a much more useful measure of spread than the variance.
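The difference in units can be seen in a short sketch. The function name `std_dev` and the height values are hypothetical, chosen only for illustration, and the divisor is again assumed to be the number of points.

```python
import math

def std_dev(data):
    """Root-mean-squared deviation from the empirical mean."""
    mean = sum(data) / len(data)
    mean_squared_deviation = sum((x - mean) ** 2 for x in data) / len(data)
    return math.sqrt(mean_squared_deviation)

# Hypothetical heights in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

print(std_dev(heights_cm) ** 2)  # variance, in cm squared (about 200)
print(std_dev(heights_cm))       # standard deviation, in cm (about 14.14)
```

A spread of roughly 14 cm can be compared directly with the heights themselves, whereas 200 cm² cannot.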
68-95-99.7 Rule
For data following a normal distribution, it can be useful to know that, roughly:
- 68% of the data falls within 1 standard deviation of the mean;
- 95% of the data falls within 2 standard deviations of the mean;
- 99.7% of the data falls within 3 standard deviations of the mean.
As such, this is known as the 68-95-99.7 Rule.
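These percentages can be checked numerically. For a normal distribution, the probability of lying within $k$ standard deviations of the mean is $\operatorname{erf}(k / \sqrt{2})$. The sketch below evaluates that expression with the standard library; the function name `fraction_within` is an illustrative assumption.

```python
import math

def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```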
Mean Absolute Difference
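The Mean Absolute Difference (MAD) is given by:
$$\mathrm{MAD} = \frac{1}{N} \sum_{i=1}^{N} |x_i - \bar{x}|$$
where $x_1, \dots, x_N$ are the data points and $\bar{x}$ is their empirical mean. It is typically more robust than the standard deviation. This means that it is less sensitive to outliers, because the deviations are not squared.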
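The sketch below illustrates the robustness claim on a small example. It assumes that MAD means the average absolute deviation from the empirical mean; the function names and both datasets are hypothetical.

```python
import math

def mean_abs_dev(data):
    """Average absolute deviation from the empirical mean (assumed definition of MAD)."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)

def std_dev(data):
    """Root-mean-squared deviation from the empirical mean."""
    mean = sum(data) / len(data)
    return math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

# Hypothetical data: the second list replaces the value 10 with an outlier of 100.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# How much each measure grows when the outlier is introduced.
mad_growth = mean_abs_dev(with_outlier) / mean_abs_dev(clean)
std_growth = std_dev(with_outlier) / std_dev(clean)
print(mad_growth, std_growth)
```

Both measures increase, but the standard deviation grows by the larger factor because squaring gives the single extreme point more weight.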
Inter-quartile Range
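The inter-quartile range (IQR), sometimes called the inter-quartile distance (IQD), is the range of the central half of the dataset. That is,
$$\mathrm{IQR} = Q_3 - Q_1$$
where $Q_1$ and $Q_3$ are the first and third quartiles (the 25th and 75th percentiles) of the data.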
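A minimal sketch of the calculation, taking the IQR to be the third quartile minus the first. Several conventions exist for computing quartiles on finite data; this example assumes the `inclusive` method of Python's `statistics.quantiles`, and other conventions can give slightly different values. The function name `iqr` and the data are illustrative.

```python
import statistics

def iqr(data):
    """Third quartile minus first quartile, using the 'inclusive' quartile convention."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

# Illustrative data, not taken from the text.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(iqr(data))  # 4.0 (quartiles are 3.0 and 7.0)
```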
Detecting Outliers
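A common method for detecting outliers uses the IQR. Any datum that is
- larger than $Q_3 + 1.5 \times \mathrm{IQR}$; or
- smaller than $Q_1 - 1.5 \times \mathrm{IQR}$
is considered to be an outlier, where $Q_1$ and $Q_3$ are the first and third quartiles.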
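The rule can be sketched as a small filter. This example assumes the widely used fences at 1.5 times the IQR below the first quartile and above the third quartile; the factor 1.5, the `inclusive` quartile convention, the function name `find_outliers`, and the data are all assumptions made for illustration.

```python
import statistics

def find_outliers(data, factor=1.5):
    """Return the values lying outside the fences built from the quartiles and the IQR."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    spread = q3 - q1
    lower_fence = q1 - factor * spread
    upper_fence = q3 + factor * spread
    return [x for x in data if x < lower_fence or x > upper_fence]

# Illustrative data: quartiles are 3.0 and 7.0, so the fences are -3.0 and 13.0.
data = [1, 2, 3, 4, 5, 6, 7, 8, 50]
print(find_outliers(data))  # [50]
```

Exposing `factor` as a parameter makes it easy to tighten or loosen the fences for a given application.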