In the Analyze phase of Six Sigma, we take the data that we collected in the Measure phase and try to make some sense of it. We use statistical thinking and analysis to prove or disprove our hunches about what is going on in a process, and to identify the true root cause of a problem. Dr. Juran referred to the Analyze phase as “The diagnostic journey from symptom to cause”.

No one knows how to make a product or deliver a service the same every time. There is always some degree of variation in the output of any process. This variation can be measured and described in two ways. We can describe the central tendency of the data by calculating the mean, median and/or mode, and we can describe the amount of dispersion around the center of the data by calculating the range, standard deviation and/or variance.

In addition to simply describing the data, we can also use these statistics to compare a process before and after improvement. Did we shift the location of the center of the process closer to the target value required by the customer? Did we reduce the amount of variation or dispersion in the process?

First, consider central tendency. The mean is the mathematical average of the data values. The median is the middle value when the data are arranged by size. The mode is the most frequently occurring value in the data. In a perfectly normal distribution, all three will be the same.

Next, consider dispersion. The range is the difference between the largest value and the smallest value. The standard deviation describes how far the data values typically fall from the mean. It has no simple verbal definition and is best understood through its formula: the square root of the average squared deviation from the mean. The variance is the square of the standard deviation.
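As a rough illustration of the measures described above, the following sketch computes them with Python's standard `statistics` module. The data values are hypothetical (imagine ten cycle times in minutes) and are not taken from any real process. The sketch treats the data as a sample, so `stdev` and `variance` are the sample versions.

```python
import statistics

# Hypothetical measurements, e.g. cycle times in minutes (illustrative only).
data = [12, 15, 11, 15, 14, 13, 15, 16, 12, 17]

# Central tendency
mean = statistics.mean(data)      # mathematical average
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

# Dispersion
data_range = max(data) - min(data)    # largest value minus smallest value
std_dev = statistics.stdev(data)      # sample standard deviation
variance = statistics.variance(data)  # sample variance (std_dev squared)

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, std dev={std_dev:.3f}, variance={variance:.3f}")
```

Running the same calculation on data collected before and after an improvement gives a simple way to see whether the center moved toward the target and whether the spread shrank.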

We may calculate these descriptive statistics either for the entire population of data or for a sample drawn from that population. Sample statistics are simply estimates of the true population statistics. Population statistics are preferred, but it may not be practical or cost-effective to collect data on an entire population. In those cases, we rely on a sample chosen from the population in a random and representative fashion. Descriptive statistics are easily calculated using Excel or statistical software such as Minitab.
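The difference between population and sample statistics can be sketched in Python as follows. The population here is simulated and entirely hypothetical (10,000 part lengths in millimeters), and the sample size of 50 is an arbitrary choice. One detail the discussion above does not spell out is that the sample standard deviation conventionally divides by n - 1 rather than n. That is why Python offers both `pstdev` and `stdev`.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 simulated part lengths in mm.
population = [random.gauss(50.0, 2.0) for _ in range(10_000)]

# Population statistics use every value; pstdev divides by N.
pop_mean = statistics.mean(population)
pop_std = statistics.pstdev(population)

# Simple random sample of 50 items; stdev divides by n - 1.
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)
sample_std = statistics.stdev(sample)

print(f"population: mean={pop_mean:.2f}, std dev={pop_std:.2f}")
print(f"sample:     mean={sample_mean:.2f}, std dev={sample_std:.2f}")
```

The sample values should land close to the population values without matching them exactly, which is what it means for a sample statistic to be an estimate.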

There are three other descriptive statistics that are used less frequently. The standard error of the mean describes the standard deviation of the sample mean as an estimate of the population mean. If we take a number of samples and calculate their means, there will be some variation in the results from one sample to the next. The standard error of the mean is equal to the sample standard deviation divided by the square root of the sample size. We want the standard error of the mean to be as small as possible. The bigger the sample, the smaller the standard error of the mean.
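A minimal sketch of the standard error of the mean, using the usual definition: the sample standard deviation divided by the square root of the sample size. The function name `standard_error` and the data are hypothetical. The larger data set simply repeats the smaller one four times, which is artificial but isolates the effect of sample size.

```python
import math
import statistics

def standard_error(sample):
    """Standard error of the mean: sample std dev / sqrt(sample size)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Hypothetical measurements (illustrative only).
small = [12, 15, 11, 15, 14, 13, 15, 16, 12, 17]  # n = 10
large = small * 4                                 # same values repeated, n = 40

print(f"n={len(small)}: standard error = {standard_error(small):.3f}")
print(f"n={len(large)}: standard error = {standard_error(large):.3f}")
```

Because the sample size sits under a square root, quadrupling the sample roughly halves the standard error.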

Kurtosis is traditionally described as a measure of the peakedness of a distribution of data, although it is driven mainly by the weight of the tails. Higher kurtosis means more of the variance is the result of infrequent extreme deviations, as opposed to frequent modestly sized deviations.

Skewness is a measure of the asymmetry of the distribution of the data. Negative skew means the left tail of the distribution is longer; positive skew means the right tail is longer.
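Skewness and kurtosis can be sketched directly from their moment definitions. The functions below are illustrative helpers, not a library API, and they use the simple population-moment formulas. Statistical packages often apply small-sample corrections, so their output may differ slightly. Kurtosis is reported here as excess kurtosis, meaning 3 is subtracted so that a normal distribution scores zero. The data sets are hypothetical.

```python
import statistics

def central_moment(data, k):
    """Average of (x - mean)**k over the data."""
    mean = statistics.fmean(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    """Third central moment divided by the cube of the std dev."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Fourth central moment over squared variance, minus 3."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0

# Hypothetical data sets (illustrative only).
symmetric = [1, 2, 3, 4, 5, 6, 7]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]  # one large value on the right
left_tailed = [-x for x in right_tailed]        # mirror image

for name, values in [("symmetric", symmetric),
                     ("right-tailed", right_tailed),
                     ("left-tailed", left_tailed)]:
    print(f"{name}: skew={skewness(values):.2f}, "
          f"excess kurtosis={excess_kurtosis(values):.2f}")
```

The right-tailed set yields positive skew, its mirror image yields negative skew, and the single extreme value pushes the kurtosis well above that of the evenly spread symmetric set.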

Your comments or questions about this article are welcome, as are suggestions for future articles. Feel free to contact me by email at roger@keyperformance.com.

About the author: Mr. Roger C. Ellis is an industrial engineer by training and profession. He is a Six Sigma Master Black Belt with over 45 years of business experience in a wide range of fields. Mr. Ellis develops and instructs Six Sigma professional certification courses for Key Performance LLC. For a more detailed biography, please refer to www.keyperformance.com.

Posted on October 15th, 2015, in Articles, Six Sigma by Roger Ellis. Tags: descriptive statistics, Six Sigma