# Statistical Software – A Trap for the Unwary

Statistical software is widely used in practice to conduct statistical tests and to generate graphs and charts.  In our Six Sigma courses we use Minitab, which has been under continuous development since 1972 and is one of the most widely used statistics packages.  There are many other fine products on the market as well.  Whichever package you use, you should be aware of a number of issues.

First, you must understand how to choose the correct test or graph.  The choice of which statistical test or graph to use often depends on what type of data is being used, and how the data was collected.  A good example is statistical process control charts.  There are seven basic types of control charts, and several other specialized charts.  Care must be taken to choose the correct chart for the type of data that is being analyzed.  In our courses I have included a number of decision flowcharts to make this task more straightforward for students.

Next, you must understand how to correctly use the software to conduct the statistical test or generate the graph.  A common issue is how to enter the data into the software.  For example, the data for Xbar and R control charts is collected in subgroups, and Minitab offers two options for entering it.  One is to enter all of the data in one column.  In this case, we tell Minitab how many values are in each subgroup.  If the subgroup size is five, Minitab will use the first five values as subgroup one, the second five values as subgroup two, and so on.  The other option is to enter the data across columns, where each row represents one subgroup.  For a subgroup size of five, each row would have five columns of data.  A common mistake that students make is to enter the data in one format, and then choose the analysis option that expects the data to be in the other format.
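The two layouts can be illustrated outside of Minitab. The following is a minimal Python sketch, not Minitab's own procedure; the measurements, the subgroup size of five, and the function names are hypothetical and chosen only for illustration. It converts a single stacked column into rows of subgroups and computes the mean (Xbar) and range (R) of each subgroup.

```python
# Hypothetical measurements entered as ONE stacked column:
# three subgroups of five values each.
stacked = [10.1, 9.8, 10.0, 10.2, 9.9,
           10.3, 10.1, 9.7, 10.0, 10.4,
           9.6, 9.9, 10.2, 10.0, 9.8]
subgroup_size = 5


def to_subgroups(values, size):
    """Split a stacked column into consecutive subgroups (one per row)."""
    if len(values) % size != 0:
        raise ValueError("number of values is not a multiple of the subgroup size")
    return [values[i:i + size] for i in range(0, len(values), size)]


def xbar_and_range(subgroups):
    """Return (mean, range) for each subgroup."""
    return [(sum(g) / len(g), max(g) - min(g)) for g in subgroups]


rows = to_subgroups(stacked, subgroup_size)   # row layout: each row is one subgroup
stats = xbar_and_range(rows)

# Reading the same table with rows and columns swapped gives five
# "subgroups" of three values, which is a different analysis.
misread = [list(col) for col in zip(*rows)]
misread_stats = xbar_and_range(misread)

for number, (xbar, r) in enumerate(stats, start=1):
    print(f"subgroup {number}: Xbar = {xbar:.2f}, R = {r:.2f}")
```

The `misread` lines show why the layout matters: the values are identical, but grouping them the wrong way changes every subgroup mean and range that the chart is built from.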

It is important to understand what methods and formulas are being used by the software under any given set of circumstances.  One of the basic descriptive statistics that we calculate from data is the standard deviation, which describes the amount of variation present in a group of data.  If you have assumed in the past that there is one and only one way to calculate standard deviation, you may be surprised to learn that this is not the case.  The sum of the squares method is most commonly used when calculating standard deviation; however, alternative formulas include Welford’s method, which updates the mean and the sum of squared deviations one value at a time.
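To make the comparison concrete, here is a minimal Python sketch of both approaches for the sample standard deviation. The function names and the data are hypothetical, and this is not a claim about how any particular package implements the calculation.

```python
import math


def sample_sd_sum_of_squares(values):
    """Two-pass method: compute the mean, then sum the squared deviations."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_sq / (n - 1))


def sample_sd_welford(values):
    """Welford's method: update the mean and squared deviations per value."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n < 2:
        raise ValueError("need at least two values")
    return math.sqrt(m2 / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
sd_two_pass = sample_sd_sum_of_squares(data)
sd_welford = sample_sd_welford(data)
print(sd_two_pass, sd_welford)
```

On well-behaved data such as this, the two methods agree to within floating-point rounding. Welford's method needs only one pass through the data, which is one reason software may prefer it for large or streaming data sets.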

A different formula is used to calculate standard deviation for a population of data than is used to calculate standard deviation of a sample taken from a larger population.  In addition to calculating the sample standard deviation, we may also calculate a confidence interval around that sample standard deviation, as we are never 100% sure about our conclusion when that conclusion is based on a sample.
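The difference between the two formulas is the divisor: the population formula divides the sum of squared deviations by n, while the sample formula divides by n - 1. A short sketch using Python's standard `statistics` module, with hypothetical data, shows the effect. It covers only the two point estimates, not the confidence interval.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, n = 8

# Treat the values as the entire population: divide by n.
population_sd = statistics.pstdev(values)

# Treat the values as a sample from a larger population: divide by n - 1.
sample_sd = statistics.stdev(values)

print(f"population SD = {population_sd:.4f}")
print(f"sample SD     = {sample_sd:.4f}")
```

The sample value is always the larger of the two, and the gap is widest for small samples, so it matters which one a given software function reports.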

The sample standard deviation is a biased estimate of the population standard deviation; on average it slightly underestimates the true value.  For an unbiased estimate of standard deviation we need to apply a correction factor.  These correction factors and the formulas used to apply them vary based on the underlying distribution of the data.
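As one illustration, assume the data come from a normal distribution. The correction factor for that case is commonly written c4 in statistical process control tables, and the unbiased estimate is the sample standard deviation divided by c4. The sketch below is valid only under that normality assumption; the function names and the example numbers are hypothetical.

```python
import math


def c4(n):
    """Correction factor for the sample SD, assuming normally distributed data."""
    if n < 2:
        raise ValueError("need a sample size of at least two")
    # c4 = sqrt(2 / (n - 1)) * Gamma(n / 2) / Gamma((n - 1) / 2)
    # lgamma is used so that large n does not overflow.
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)


def unbiased_sd(sample_sd, n):
    """Divide the sample SD by c4 (normal data assumed)."""
    return sample_sd / c4(n)


c4_n5 = c4(5)
corrected = unbiased_sd(2.0, 5)  # hypothetical sample SD of 2.0 with n = 5
print(f"c4(5) = {c4_n5:.4f}, corrected SD = {corrected:.4f}")
```

The factor approaches 1 as the sample size grows, so the correction matters mainly for small samples such as control chart subgroups.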

Finally, we must understand how to correctly interpret the results.  Statistical software will produce a graph or present the results of a calculation for a statistical test, but it is up to the user to determine what the results mean.  We must always consider the statistical significance as well as the practical significance.  For example, we often will compare the output of a process before and after improvement to see if there is a statistically significant difference in process performance.  It is possible for a change to be statistically significant, but small enough that it is of no practical value.  The opposite can also be true: a change that is large enough to matter in practice may fail to reach statistical significance, for example when the sample size is small.
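Both situations can be sketched with a simple two-sample z-test. This is a simplified illustration that assumes a known, common process standard deviation and equal sample sizes; in practice a t-test would usually be used. All of the numbers, the 0.5 unit threshold for practical importance, and the function name are hypothetical.

```python
import math
from statistics import NormalDist


def two_sample_z(mean_before, mean_after, sigma, n_per_group):
    """Two-sided z-test for a difference in means (known sigma, equal n)."""
    standard_error = sigma * math.sqrt(2.0 / n_per_group)
    z = (mean_after - mean_before) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


alpha = 0.05                # significance level
practical_threshold = 0.5   # hypothetical smallest change worth acting on

# Case 1: a tiny shift measured on a very large sample.
z_large, p_large = two_sample_z(50.00, 50.02, sigma=1.0, n_per_group=100_000)

# Case 2: a meaningful shift measured on a small sample.
z_small, p_small = two_sample_z(50.0, 50.8, sigma=2.0, n_per_group=10)

print(f"case 1: shift = 0.02, p = {p_large:.2g}")
print(f"case 2: shift = 0.80, p = {p_small:.2g}")
```

In the first case the p-value falls well below 0.05 even though the shift is far smaller than the practical threshold. In the second case the shift exceeds the threshold but the p-value stays above 0.05, because ten observations per group are too few to detect it reliably.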