# You Can’t Always Tell by Looking

Our Green Belt course includes an exercise where students create a histogram from a file of 100 data values.  Here is the histogram of the data, with a normal probability distribution curve superimposed:

Students are not asked to comment on the result, but most are unable to resist doing so.  The majority of those who do comment state that the histogram shows the data are normally distributed.  The histogram does give us a good indication of how the data are distributed, but it takes a probability plot, with its accompanying hypothesis test, to determine objectively whether the data fit a specific distribution such as the normal.
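To make the idea of a probability plot concrete, here is a minimal sketch of how its points can be computed. It assumes the median-rank plotting position `(i - 0.3) / (n + 0.4)`, which is one common convention among several; the function name `normal_probability_points` and the six sample values are hypothetical stand-ins, not the course data.

```python
from statistics import NormalDist, mean, stdev


def normal_probability_points(values):
    """Return (expected_value, observed_value) pairs for a normal probability plot."""
    ordered = sorted(values)
    n = len(ordered)
    # Normal distribution fitted to the sample mean and standard deviation.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    points = []
    for i, observed in enumerate(ordered, start=1):
        # Median-rank plotting position (an assumed convention; others exist).
        p = (i - 0.3) / (n + 0.4)
        points.append((fitted.inv_cdf(p), observed))
    return points


# Hypothetical stand-in data, not the course file.
sample = [9.1, 9.8, 10.0, 10.1, 10.2, 10.9]
for expected, observed in normal_probability_points(sample):
    print(f"expected {expected:6.2f}   observed {observed:6.2f}")
```

If the data followed the fitted normal distribution closely, the observed values would track the expected values along a straight line. Systematic bends away from that line are what the accompanying test quantifies.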

The null hypothesis for the probability plot is that there is no difference between the distribution of the data and the hypothesized distribution (in this case the normal probability distribution).  The alternative hypothesis is that there is a difference between the distribution of the data and the hypothesized distribution.  If we fail to reject the null hypothesis, we have no evidence that the data depart from a normal distribution, and we may treat them as normally distributed.  That is not proof of normality, only an absence of evidence against it.  If we reject the null hypothesis, we conclude that the data are not normally distributed.
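The article does not name the test that produces the p value for the probability plot. The sketch below assumes it is the Anderson-Darling test, a common choice in statistical packages, and approximates the p value with the widely cited D'Agostino and Stephens formulas. The function names and the stand-in data (evenly spaced quantiles of a symmetric, peaked Laplace distribution) are hypothetical and are not the course file.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(values):
    """Return (adjusted A-squared, approximate p value) for H0: the data are normal."""
    x = sorted(values)
    n = len(x)
    m, s = mean(x), stdev(x)          # parameters estimated from the sample
    std_normal = NormalDist()
    z = [(v - m) / s for v in x]

    total = 0.0
    for i in range(1, n + 1):
        log_lower = math.log(std_normal.cdf(z[i - 1]))
        # 1 - cdf(z) is computed as cdf(-z) to stay accurate in the upper tail.
        log_upper = math.log(std_normal.cdf(-z[n - i]))
        total += (2 * i - 1) * (log_lower + log_upper)

    a2 = -n - total / n
    # Small-sample adjustment used when mean and sigma are both estimated.
    a2_adj = a2 * (1 + 0.75 / n + 2.25 / n ** 2)

    # Piecewise approximation of the p value (D'Agostino and Stephens).
    if a2_adj < 0.2:
        p = 1 - math.exp(-13.436 + 101.14 * a2_adj - 223.73 * a2_adj ** 2)
    elif a2_adj < 0.34:
        p = 1 - math.exp(-8.318 + 42.796 * a2_adj - 59.938 * a2_adj ** 2)
    elif a2_adj < 0.6:
        p = math.exp(0.9177 - 4.279 * a2_adj - 1.38 * a2_adj ** 2)
    elif a2_adj < 10:
        p = math.exp(1.2937 - 5.709 * a2_adj + 0.0186 * a2_adj ** 2)
    else:
        p = 0.0  # design choice here: beyond the formula's range, treat p as zero
    return a2_adj, p


def laplace_quantile(p):
    """Quantile of a standard Laplace distribution (symmetric, peaked)."""
    return math.log(2 * p) if p < 0.5 else -math.log(2 * (1 - p))


# Hypothetical stand-in data: 100 evenly spaced Laplace quantiles.
peaked = [laplace_quantile((i - 0.5) / 100) for i in range(1, 101)]
stat, p_value = anderson_darling_normal(peaked)
print(f"adjusted A-squared = {stat:.3f}, approximate p = {p_value:.4f}")
```

A small p value counts as evidence against the null hypothesis of normality; a large one means only that the test found no such evidence.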

Here is the probability plot of the data that was used to create the histogram:

We decide whether to reject the null hypothesis by comparing the calculated p value (in this case less than 0.005) to the alpha decision level (in this case 0.05, which corresponds to a 95% confidence level and means we accept a 5% risk of rejecting a null hypothesis that is actually true).  If the p value is lower than alpha, we reject the null hypothesis.  Here the p value is well below alpha, so we reject the null hypothesis and conclude that the data are not normally distributed.  The distribution of the data is symmetrical around the mean, but it is more peaked in the middle than a normal distribution would be.
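The decision rule itself is simple enough to write down as code. This is only an illustrative sketch; the function name `decide` is hypothetical, and because the article reports the p value only as "less than .005", the value 0.005 is used below as an upper bound.

```python
def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when p is lower than alpha."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"


# Upper bound on the reported p value, compared with alpha = 0.05.
print(decide(0.005, alpha=0.05))  # prints "reject H0"
```

Any p value below the reported bound leads to the same decision, so the conclusion does not depend on knowing the exact figure.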

This example illustrates why I caution students to only make statements that can be supported by facts when working with data and statistics.  You can’t always tell just by looking!