Transforming Non-normal Data: Proceed with Caution!


Data that follow a distribution that is either not symmetric, or symmetric but not bell shaped, are said to be non-normal. A common misconception among students is that all data should be normally distributed, and that there is something inherently wrong if their data are not. In fact, data are not always normally distributed, and we should not expect to always see normal distributions. Non-normality is a fact of life. Nature does not produce perfectly normal distributions.

Some processes include a fundamentally non-normal mechanism. For example, equipment repair times often follow an exponential distribution, with short repair times being most likely and longer repair times less likely. Another example is warpage in floor tiles, where a small amount of warpage is most likely and more warpage is less likely.
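To make the repair-time example concrete, here is a minimal Python sketch that simulates exponentially distributed repair times and measures how lopsided they are. The 2-hour mean, the sample size, and the helper name sample_skewness are assumptions chosen for illustration; they do not come from any real process.

```python
import random
import statistics

def sample_skewness(values):
    """Return the skewness as the mean cubed z-score (0 for symmetric data)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(42)   # fixed seed so the sketch is repeatable
MEAN_REPAIR_HOURS = 2.0   # hypothetical value, for illustration only

# expovariate expects the rate, which is 1 / mean.
repair_times = [rng.expovariate(1.0 / MEAN_REPAIR_HOURS) for _ in range(5000)]

mean_time = statistics.fmean(repair_times)
median_time = statistics.median(repair_times)
skew = sample_skewness(repair_times)

print(f"mean     = {mean_time:.2f} hours")
print(f"median   = {median_time:.2f} hours")
print(f"skewness = {skew:.2f}")
```

For symmetric data the mean and median roughly agree and the skewness is near zero. Here the mean sits above the median and the skewness is clearly positive, which is what a long right tail of occasional lengthy repairs looks like in numbers.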

Data may also be non-normal because of values close to zero or to some other natural limit that the data cannot fall beyond.  For example, cycle time to perform an activity has a natural limit of zero.

Here are some additional possible reasons that data may be distributed in a non-normal fashion:

  • The data may come from two or more different sources – two or more pieces of equipment, two or more locations, two or more different suppliers of material, and so on.  In that case, we might look into sub-dividing the data in order to get a more definitive picture of what is going on.
  • The data may include extreme values (outliers) caused by data entry errors or by missing data. In this case, we should correct the errors or include the missing data if at all possible.
  • The process that generated the data may be out of control, i.e. not stable. In this case, we should work to bring the process under control and then collect new data.
  • The level of data discrimination may be insufficient. Rounding errors or measuring equipment with insufficient resolution may distort the data.  In this case we should investigate ways to measure and collect data that provide a higher level of resolution or discrimination.
  • Data may have been trimmed or modified in some manner after it was collected but before it was analyzed for normality. We must take great care when deciding whether or not to exclude certain data points.
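The first reason in the list above, data pooled from two sources, can be sketched in a few lines of Python. The two machines, their target values, and the helper name excess_kurtosis are hypothetical. Each machine on its own produces normal data, yet the pooled data do not look normal until they are sub-divided by source.

```python
import random
import statistics

def excess_kurtosis(values):
    """Mean fourth-power z-score minus 3; roughly 0 for normal data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 4 for v in values) / len(values) - 3.0

rng = random.Random(11)  # fixed seed so the sketch is repeatable

# Hypothetical fill weights (grams) from two machines set to different targets.
machine_a = [rng.gauss(10.0, 1.0) for _ in range(2000)]
machine_b = [rng.gauss(14.0, 1.0) for _ in range(2000)]
pooled = machine_a + machine_b

kurt_a = excess_kurtosis(machine_a)
kurt_b = excess_kurtosis(machine_b)
kurt_pooled = excess_kurtosis(pooled)

print(f"machine A excess kurtosis: {kurt_a:+.2f}")
print(f"machine B excess kurtosis: {kurt_b:+.2f}")
print(f"pooled    excess kurtosis: {kurt_pooled:+.2f}")
```

A strongly negative excess kurtosis for the pooled data reflects the two-humped shape of the combined histogram, while each machine taken separately stays close to zero.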

In our courses we use Minitab statistical software.   Minitab can be used to evaluate whether data fit a normal distribution or some other type of distribution.    Non-normality of data is a problem if and only if we want to use a tool that requires normally distributed data and our data are not normally distributed.   Be aware that many tools do not require normally distributed data, and that there are often alternative tools available for those that do.  As an example, Minitab offers capability analysis tools for both normal and non-normal data.
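For readers without Minitab, the idea of checking data against a normal distribution can be sketched in plain Python. The example below uses the Jarque-Bera statistic, swapped in here only because it needs nothing beyond the standard library; it is not Minitab's procedure, and the simulated samples are hypothetical. The p-value relies on a large-sample approximation, so treat it as a rough guide.

```python
import math
import random
import statistics

def jarque_bera(values):
    """Return (statistic, approximate p-value) for the Jarque-Bera test."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    z = [(v - mean) / sd for v in values]
    skew = sum(t ** 3 for t in z) / n
    excess_kurt = sum(t ** 4 for t in z) / n - 3.0
    statistic = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    # For large n the statistic is approximately chi-square with 2 degrees
    # of freedom, whose upper-tail probability is exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value

rng = random.Random(3)  # fixed seed so the sketch is repeatable

# Hypothetical samples: one bell shaped, one right-skewed.
normal_sample = [rng.gauss(50.0, 5.0) for _ in range(1000)]
skewed_sample = [rng.expovariate(0.5) for _ in range(1000)]

jb_normal, p_normal = jarque_bera(normal_sample)
jb_skewed, p_skewed = jarque_bera(skewed_sample)

print(f"normal sample: JB = {jb_normal:.2f}, p = {p_normal:.3f}")
print(f"skewed sample: JB = {jb_skewed:.2f}, p = {p_skewed:.3g}")
```

A small p-value is evidence against normality. Whether that matters still depends on whether the tool we intend to use actually requires normally distributed data.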

Sometimes it is possible to apply a function that will make non-normal data appear approximately normal.  This is referred to as data transformation.  Minitab can be used to perform such data transformations.  The details of how data transformation is accomplished can be found under Methods and Formulas in Minitab.
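As an illustration of what such a function can look like, the sketch below applies the Box-Cox power transformation, one widely used family of transformations for strictly positive data. Treating Box-Cox as the transformation of interest is an assumption made for this example, since the article does not name a specific method. The simulated cycle times, the grid of candidate lambda values, and the helper names are also hypothetical, and the grid search is a simplification of how statistical packages choose lambda.

```python
import math
import random
import statistics

def box_cox(values, lam):
    """Apply the Box-Cox transformation to strictly positive values."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def profile_log_likelihood(values, lam):
    """Box-Cox profile log-likelihood, up to an additive constant."""
    n = len(values)
    variance = statistics.pvariance(box_cox(values, lam))
    log_sum = sum(math.log(v) for v in values)
    return -0.5 * n * math.log(variance) + (lam - 1.0) * log_sum

def skewness(values):
    """Mean cubed z-score; near 0 for symmetric data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed cycle times (lognormal, so always positive).
cycle_times = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]

# Coarse grid of candidate lambdas from -2.0 to 2.0 in steps of 0.1.
candidate_lambdas = [i / 10 for i in range(-20, 21)]
best_lambda = max(
    candidate_lambdas,
    key=lambda lam: profile_log_likelihood(cycle_times, lam),
)
transformed = box_cox(cycle_times, best_lambda)

skew_before = skewness(cycle_times)
skew_after = skewness(transformed)

print(f"chosen lambda   = {best_lambda:.1f}")
print(f"skewness before = {skew_before:.2f}")
print(f"skewness after  = {skew_after:.2f}")
```

A lambda near zero corresponds to a log transformation, which is the expected choice for lognormal data. Note that the transformed values are no longer in the original units.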

While it is possible to perform such a transformation using Minitab or other software products, my students are coached to proceed with caution before doing so. Transformations improve normality by altering the relative distances between data points. This may be a problem if the original values need to be interpreted in a substantive manner. For example, the distance between years of age cannot be altered without altering the meaning of the original data.
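A small worked example shows how a transformation changes relative distances. The ages below are hypothetical, and the natural log is used only as a familiar example of a transformation.

```python
import math

ages = [10, 20, 60, 70]  # hypothetical ages in years
log_ages = [math.log(a) for a in ages]

# On the original scale both gaps are exactly ten years.
young_gap_raw = ages[1] - ages[0]
old_gap_raw = ages[3] - ages[2]

# On the log scale the same ten-year gaps are no longer equal.
young_gap_log = log_ages[1] - log_ages[0]
old_gap_log = log_ages[3] - log_ages[2]

print(f"raw gaps: {young_gap_raw} vs {old_gap_raw}")
print(f"log gaps: {young_gap_log:.3f} vs {old_gap_log:.3f}")
```

After the transformation, the gap between ages 10 and 20 is more than four times the gap between ages 60 and 70, even though both are ten years apart. Any conclusion drawn on the transformed scale has to be read with that distortion in mind.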

Some experts advise against data transformation under any circumstances. They argue that it is important to understand what the underlying data are telling us about the process, and that tests that require normal (or near normal) data are not appropriate for non-normal data even if they have been transformed.

The best approach is to first understand why the data are distributed the way they are. If we find any problems in the way that the data were collected, those problems should be corrected and new data collected. Finally, we should strive to analyze data as they were collected, without transformation, if at all possible.

Your comments or questions about this article are welcome, as are suggestions for future articles.  Feel free to contact me by email at roger@keyperformance.com.

About the author:  Mr. Roger C. Ellis is an industrial engineer by training and profession.  He is a Six Sigma Master Black Belt with over 48 years of business experience in a wide range of fields.  Mr. Ellis develops and instructs Six Sigma professional certification courses for Key Performance LLC.   For a more detailed biography, please refer to www.keyperformance.com.

 

 


Posted on August 2nd, 2017 in: Articles, Six Sigma
