### With Data, Anything Can Happen!

In article 73 in this series, I discussed the issue of whether to transform data.    Students of Six Sigma often think (incorrectly) that there is something inherently wrong with their data if it is not normally distributed.  We should always strive to understand why the data is distributed the way it is and recognize that not all processes produce outputs that are normally distributed.

In our Master Black Belt course, I included an assignment on data transformation with the intent of making sure that students understand how to perform data transformations, as well as understanding the issues that surround data transformation.

One of my Master Black Belt students recently submitted an assignment in which she attempted to transform actual measurement data from one of her manufacturing projects.  She used a probability plot to test the data for normality.  The null hypothesis is that there is no difference between the data and a normal distribution; the alternate hypothesis is that there is a difference between the data and a normal distribution.  The data proved to be not normally distributed: the null hypothesis was rejected based on a p-value smaller than 0.005.
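Readers who want to reproduce this kind of normality check outside Minitab can sketch it in Python.  Minitab's probability plot reports an Anderson-Darling statistic; the Shapiro-Wilk test is used below as a stand-in because it returns a p-value directly.  The two samples are synthetic, not the student's measurements:

```python
# Sketch: testing samples for normality with a p-value, analogous to
# reading the p-value off a Minitab probability plot.
# Both samples are synthetic stand-ins for real measurement data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

normal_sample = rng.normal(loc=10.0, scale=2.0, size=200)   # roughly normal
skewed_sample = rng.exponential(scale=2.0, size=200)        # strongly skewed

for name, sample in [("normal-ish", normal_sample), ("skewed", skewed_sample)]:
    stat, p = stats.shapiro(sample)  # Shapiro-Wilk normality test
    verdict = "fail to reject H0" if p >= 0.05 else "reject H0 (not normal)"
    print(f"{name}: p = {p:.4g} -> {verdict}")
```

For the skewed sample the p-value comes out far below any usual alpha, so the null hypothesis of normality is rejected, just as in the student's project.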

The first line of defense when evaluating data is to make sure that there are no errors in how the data were collected or transcribed.  There were no obvious errors in the data.

She then tried a Box-Cox transformation on the data.  This transformation routine failed to transform the data into a normal distribution.  Finally, she tried the Johnson transformation, which also failed to transform the data into a normal distribution.

I replied as follows:  This is a very interesting data set.  I ran the Individual Distribution Identification routine under Stat > Quality Tools in Minitab.  The data as collected do not fit ANY of the available distributions.  In addition, as you determined, the Box-Cox transformation and the Johnson transformation both failed to transform the data.

The Box-Cox transformation does not guarantee normality.  It does not check for normality; rather, it attempts to produce the smallest standard deviation in the transformed data.  Transformed data are likely to be normally distributed when the standard deviation is minimized, but there is no guarantee.  Likewise, there is no guarantee that the algorithms in the Johnson transformation will produce a normal distribution.
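Because the transformation itself offers no guarantee, it pays to re-test normality after transforming.  A minimal sketch using scipy (whose boxcox routine selects lambda by maximum likelihood, a slightly different criterion than Minitab's); the lognormal sample here is assumed purely for illustration:

```python
# Sketch: apply a Box-Cox transformation, then *check* normality afterward,
# since Box-Cox does not guarantee a normal result.
# The lognormal sample is illustrative, not real project data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
raw = rng.lognormal(mean=1.0, sigma=0.8, size=300)  # positively skewed, all > 0

# Box-Cox requires strictly positive data; scipy returns the transformed
# values and the fitted lambda.
transformed, lam = stats.boxcox(raw)

_, p_raw = stats.shapiro(raw)
_, p_trans = stats.shapiro(transformed)
print(f"lambda = {lam:.3f}")
print(f"raw p = {p_raw:.4g}, transformed p = {p_trans:.4g}")
```

Here the transformation happens to work because lognormal data is exactly the case Box-Cox handles well; the student's data set, as described above, was not so cooperative.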

I then created a histogram of the untransformed data.  The data appear to be bimodal; that is, they appear to come from two different distributions, such as two different machines or two different shifts.  This could very well explain why the data cannot be successfully transformed into a normal distribution.
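The reason a mixture resists transformation can be demonstrated directly: Box-Cox and Johnson transformations are monotonic, so they reshape a single distribution but cannot merge two well-separated modes.  The two-machine mixture below is a hypothetical stand-in for the student's data:

```python
# Sketch: why a bimodal sample resists transformation.  A monotonic
# transform cannot merge two separated modes into one normal hump.
# The "two machines" mixture below is hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
machine_a = rng.normal(loc=5.0, scale=0.5, size=150)
machine_b = rng.normal(loc=9.0, scale=0.5, size=150)
mixture = np.concatenate([machine_a, machine_b])  # clearly bimodal

shifted = mixture - mixture.min() + 1.0           # Box-Cox needs positive data
transformed, lam = stats.boxcox(shifted)

_, p_mix = stats.shapiro(mixture)
_, p_trans = stats.shapiro(transformed)
print(f"mixture p = {p_mix:.3g}, after Box-Cox p = {p_trans:.3g}")
```

Both p-values stay far below 0.05: no choice of lambda makes the mixture look normal.  Stratifying the data by machine (or shift) and analyzing each stream separately is the remedy, not a transformation.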

In practice, I always encourage students to proceed with caution.  When instructing Black Belt students, I tell them that they should transform data only when there is a good reason to do so, such as a test that cannot otherwise be performed.  Students often transform data to perform a test that requires normal data, instead of using an available test for non-normal data.  Process capability analysis is a good example.
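One such non-normal alternative for capability analysis is the percentile (quantile) method, in which the natural process spread is taken as the interval between the 0.135th and 99.865th percentiles of the data, the same coverage that plus-or-minus three sigma gives for a normal distribution.  The skewed sample and the specification limits below are hypothetical:

```python
# Sketch: a percentile-based capability estimate that does not assume
# normality: Pp = (USL - LSL) / (Q99.865 - Q0.135).
# The Weibull-like sample and the spec limits are hypothetical.
import numpy as np

rng = np.random.default_rng(3)
data = rng.weibull(a=1.5, size=5000) * 4.0        # skewed process output
LSL, USL = 0.0, 15.0                              # hypothetical spec limits

# Natural process spread: the middle 99.73% of the observed data.
q_lo, q_hi = np.percentile(data, [0.135, 99.865])
pp = (USL - LSL) / (q_hi - q_lo)

print(f"process spread = [{q_lo:.3f}, {q_hi:.3f}], Pp = {pp:.2f}")
```

No transformation is needed: the capability index is computed from the distribution the process actually produces, which is exactly the approach I encourage when a suitable non-normal method exists.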