Correlation and cause-and-effect

Correlation is a measure of the linear association between two variables.  For example, we might find that the square footage of a home and its sales price are highly correlated.  Larger homes tend to have higher sales prices, and smaller homes tend to have lower sales prices.  This is an example of positive correlation.  An example of negative correlation would be class absences and grades.  As a student's number of absences goes up, the student's grade point average tends to go down.

The coefficient of correlation is a single number that ranges from -1 to +1, where -1 is perfect negative correlation and +1 is perfect positive correlation.  A coefficient of zero indicates no linear correlation at all.
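To make the definition concrete, here is a minimal sketch of how the coefficient could be calculated, assuming the usual Pearson formula (the article does not name a specific formula).  The function name pearson_r and all of the data values are made up for illustration and are not real measurements.

```python
import math

def pearson_r(xs, ys):
    """Pearson coefficient of correlation for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of the deviations from each mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for this sketch
square_feet = [1000, 1500, 2000, 2500, 3000]
sale_price = [150, 210, 260, 330, 390]      # thousands of dollars
absences = [0, 2, 4, 6, 8]
gpa = [3.9, 3.5, 3.2, 2.6, 2.1]

r_home = pearson_r(square_feet, sale_price)  # strongly positive
r_school = pearson_r(absences, gpa)          # strongly negative
print(round(r_home, 3), round(r_school, 3))
```

With these invented numbers the first coefficient comes out close to +1 and the second close to -1, matching the positive and negative examples above.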

There are several cautions that must be observed when interpreting a coefficient of correlation.  First, a high level of correlation is not proof of cause and effect, although in practice it is often a good hint that the relationship is worth investigating.  There may be a third variable that affects the two correlated variables simultaneously.  Here is an example to illustrate why we need to proceed with caution.

Ice cream sales and the number of shark sightings at the beach are found to be highly positively correlated.  As ice cream sales go up, the number of recorded shark sightings goes up.  If we conclude that ice cream sales cause shark sightings, we would expect to eliminate shark sightings by stopping the sales of ice cream.  In reality, ice cream sales have nothing to do with causing shark sightings.  The more mundane explanation is that a third variable (warmer weather) causes more people to go to the beach and buy ice cream and causes more sharks to be swimming in the warmer waters, resulting in more shark sightings.
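The third-variable effect can be sketched with a small simulation.  Everything here is an assumption made for illustration: the temperature range, the slopes, the noise levels, and the random seed are invented, and the model simply lets temperature drive both quantities with independent noise.  The last step uses the standard partial correlation formula to ask how much association remains once temperature is held fixed.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson coefficient of correlation for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical model: temperature drives both quantities independently
temps = [random.uniform(15, 35) for _ in range(500)]        # daily high, deg C
ice_cream = [10 * t + random.gauss(0, 20) for t in temps]   # cones sold
sharks = [0.5 * t + random.gauss(0, 1) for t in temps]      # sightings index

r_ice_shark = pearson_r(ice_cream, sharks)
r_ice_temp = pearson_r(ice_cream, temps)
r_shark_temp = pearson_r(sharks, temps)

# Partial correlation of ice cream and sharks with temperature held fixed
r_partial = (r_ice_shark - r_ice_temp * r_shark_temp) / math.sqrt(
    (1 - r_ice_temp ** 2) * (1 - r_shark_temp ** 2)
)
print(round(r_ice_shark, 2), round(r_partial, 2))
```

In this simulated setting the raw correlation between ice cream sales and shark sightings is high, while the partial correlation is near zero, because the only link between the two is the shared dependence on temperature.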

A low coefficient of correlation does not mean that there is no relationship between two variables.  It means that there is no LINEAR relationship.  There may in fact be a higher order curvilinear relationship between the two variables, such as one that resembles a parabola when graphed.
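A short sketch shows how a perfect curvilinear relationship can still produce a coefficient of zero.  The data are invented: y is exactly x squared over a range of x values that is symmetric about zero, and pearson_r is a hypothetical helper that applies the usual Pearson formula.

```python
import math

def pearson_r(xs, ys):
    """Pearson coefficient of correlation for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# y is completely determined by x, but the relationship is a parabola
xs = list(range(-5, 6))          # -5, -4, ..., 5
ys = [x * x for x in xs]

r_linear = pearson_r(xs, ys)     # no linear association

# Correlating y against x squared recovers the relationship
x_squared = [x ** 2 for x in xs]
r_transformed = pearson_r(x_squared, ys)

print(r_linear, r_transformed)
```

The left half of the parabola slopes down and the right half slopes up, so the two halves cancel in the calculation.  Plotting the data, or correlating against a transformed variable as in the last step, reveals the relationship that the coefficient alone hides.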

The accuracy of the coefficient of correlation may also be impacted by errors in the measurement system that was used to collect the data behind the calculation.  Finally, the calculation of the coefficient of correlation is very sensitive to extreme data values.  A single extreme value in the set of data can change the coefficient of correlation a great deal.
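The sensitivity to extreme values can be illustrated with a small made-up data set.  The eight base points below were chosen so that their coefficient is zero, and the single added point (30, 30) is an invented outlier; pearson_r is again a hypothetical helper using the usual Pearson formula.

```python
import math

def pearson_r(xs, ys):
    """Pearson coefficient of correlation for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Eight invented points with no linear association
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5, 3, 6, 4, 5, 3, 6, 4]
r_without = pearson_r(xs, ys)

# The same data plus one extreme point far from the rest
xs_outlier = xs + [30]
ys_outlier = ys + [30]
r_with = pearson_r(xs_outlier, ys_outlier)

print(round(r_without, 3), round(r_with, 3))
```

One extreme point out of nine moves the coefficient from zero to above 0.9, which is why it is worth plotting the data and checking for extreme values before interpreting the coefficient.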

It is never appropriate to conclude that changes to one variable cause changes to another variable based only on the coefficient of correlation.  If we suspect that changes in a variable influence changes in another variable, we should conduct a properly controlled Designed Experiment in order to prove or disprove cause and effect.

Your comments or questions about this article are welcome, as are suggestions for future articles.  Feel free to contact me by email at

About the author:  Mr. Roger C. Ellis is an industrial engineer by training and profession.  He is a Six Sigma Master Black Belt with over 45 years of business experience in a wide range of fields.  Mr. Ellis develops and instructs Six Sigma professional certification courses for Key Performance LLC.   For a more detailed biography, please refer to