In our Yellow Belt course, students discuss a situation where improper selection of a sample from a larger population resulted in problems with the data that was collected, and how these problems might have been prevented.
Students often misunderstand the issues involved in randomly selecting a sample that is representative of a population. Here is a recent submission with student comments in italics, along with my responses:
A situation where improper selection of a sample from a larger population of items resulted in problems with the data is Presidential Election Polling. In the months before the 2012 presidential election…the Gallup poll predicted a narrow victory by Mitt Romney over President Barack Obama. Instead, the president easily won re-election with a 51.1 percent to 47.2 percent margin.
There were several Gallup polls leading up to the election. The results of the final poll were as follows:
“President Barack Obama and Republican challenger Mitt Romney are within one percentage point of each other in Gallup’s final pre-election survey of likely voters, with Romney holding 49% of the vote, and Obama 48%. After removing the 3% of undecided voters from the results and allocating their support proportionally to the two major candidates, Gallup’s final allocated estimate of the race is 50% for Romney and 49% for Obama.
Results are based on telephone interviews conducted Nov. 1-4, 2012, on the Gallup Daily Election Tracking Poll, with a random sample of 3,117 adults, aged 18 and older, living in all 50 U.S. states and the District of Columbia.
For results based on the total sample of national adults, one can say with 95% confidence that the maximum margin of error is ±2 percentage points.
For results based on the total sample of 2,854 registered voters, one can say with 95% confidence that the maximum margin of error is ±2 percentage points.
Results for likely voters are based on the subsample of 2,551 survey respondents deemed most likely to vote in the November 2012 general election, according to a series of questions measuring current voting intentions and past voting behavior. For results based on the total sample of likely voters, one can say with 95% confidence that the margin of sampling error is ±2 percentage points.”
Please note that the margin of error at 95% confidence was plus or minus 2 percentage points, which was larger than the 1-point predicted gap between Romney and Obama. In other words, the poll itself showed a race that was too close to call.
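As a rough check on the quoted figures, the margin of error can be approximated with the textbook formula for a simple random sample, z × sqrt(p(1 − p) / n). The sketch below assumes the worst case p = 0.5 and z = 1.96 for 95% confidence. The function name is hypothetical, and this is not Gallup's actual procedure, which may also account for weighting and other design effects.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

# Sample sizes quoted in the Gallup release above
samples = {
    "national adults": 3117,
    "registered voters": 2854,
    "likely voters": 2551,
}

for group, n in samples.items():
    moe_points = margin_of_error(n) * 100
    print(f"{group}: n={n}, margin of error is about ±{moe_points:.1f} points")
```

Under these assumptions, all three sample sizes give a margin a little under 2 points, which is consistent with the rounded ±2 that Gallup reports.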
The population is all voters in the United States (US citizens over the age of 18).
I would have defined the population as all registered voters in the United States.
The sample was selected randomly. Gallup uses telephone surveys in areas where telephone coverage represents at least 80% of the population. Where telephone penetration is less than 80%, Gallup uses face-to-face interviewing.
Polls usually come from a low number of participants surveyed in comparison to the number of voters that will vote on election day.
The number of people polled is a separate issue from whether the sample was random and representative. The sample size that we need depends on two factors: (1) the level of confidence that we want to have in the results and (2) the amount of variation in the population. Beyond a certain point, increasing the sample size yields very little additional benefit. Consider that if there were no variation in the population, we would be 100% confident with a sample of one.
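The relationship between variation, sample size, and precision can be sketched with the standard formula for estimating a proportion, n = z² × p(1 − p) / e², where e is the margin of error we are willing to accept. This is a minimal illustration assuming a simple random sample, a 95% confidence level (z = 1.96), and a target margin of 2 points. The function name and the values of p are hypothetical.

```python
import math

def required_sample_size(p, margin, z=1.96):
    """Smallest n whose margin of error for proportion p is at most `margin`."""
    if p in (0.0, 1.0):
        return 1  # no variation: a single observation tells us everything
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# More variation in the population (p near 0.5) calls for a larger sample
for p in (0.5, 0.9, 0.99, 1.0):
    print(f"p={p}: n={required_sample_size(p, margin=0.02)}")

# Diminishing returns: each tenfold increase in n buys less and less precision
for n in (100, 1_000, 10_000, 100_000):
    moe = 1.96 * math.sqrt(0.25 / n)
    print(f"n={n}: margin of error is about ±{moe * 100:.2f} points")
```

Because the margin of error shrinks only with the square root of n, quadrupling the sample merely halves the margin. That is why national polls rarely go far beyond a few thousand respondents.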
The random polling may be flawed: the pollsters may be collecting data from voters of one political party at a higher rate than voters of the other major political party. For example, 7 out of 10 “random” voters polled could be registered in one party and this could skew the survey.
This would be true if the determination of how to make the calls was NOT random. It would NOT be true if the calls were selected in a random fashion, which is how they were made.
Gallup uses telephone surveys in areas where telephone coverage represents at least 80% of the population. Likely voters may not be available to take phone calls. Likely voters may not have their contact information available to Gallup. Likely voters outside the area covered by Gallup are not having their vote polled.
If calls were made at random until enough people had been contacted, this would not be an issue. Also, Gallup did cover all 50 states and the District of Columbia.
Where telephone penetration is less than 80%, Gallup uses face-to-face interviewing. Likely voters may not be available for face to face interviews. Likely voters may be difficult to identify.
Again, this would not be an issue if voters were chosen at random until enough had been contacted. When respondents were contacted, Gallup asked questions to determine whether they were likely to vote.
Problems could have been prevented by ensuring that an equal number of voters from each political party are polled.
This reasoning is mistaken, because it would bias the results. For example, South Dakota is overwhelmingly Republican. Contacting an equal number of people from both parties there would badly distort the results. The selection of a sample must be random, so that it is representative of the overall population.
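The distortion from equal party quotas can be illustrated with a small simulation. Every number here is hypothetical: an invented state that is 70% Republican and 30% Democrat, in which 90% of each party's members back their own party's candidate. The sketch compares a simple random sample of 1,000 voters with a sample of 500 drawn from each party.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical state: 70% Republican, 30% Democrat (illustrative numbers only)
N = 100_000
population = []
for _ in range(N):
    party = "R" if random.random() < 0.70 else "D"
    # Assumed: 90% of each party's members back their own party's candidate
    backs_r = random.random() < (0.90 if party == "R" else 0.10)
    population.append((party, backs_r))

def share_backing_r(sample):
    """Fraction of the sample that backs the Republican candidate."""
    return sum(backs for _, backs in sample) / len(sample)

true_share = share_backing_r(population)

# Simple random sample: every voter has the same chance of being selected
random_sample = random.sample(population, 1000)

# Equal-quota sample: 500 from each party, regardless of party size
republicans = [v for v in population if v[0] == "R"]
democrats = [v for v in population if v[0] == "D"]
quota_sample = random.sample(republicans, 500) + random.sample(democrats, 500)

print(f"true share:    {true_share:.3f}")
print(f"random sample: {share_backing_r(random_sample):.3f}")
print(f"equal quotas:  {share_backing_r(quota_sample):.3f}")
```

Under these assumptions the true share is close to 66%. The random sample should land near that figure, while the equal-quota sample lands near 50%, because it treats a 30% minority as if it were half the electorate. A quota design could be corrected by weighting each party back to its population share, but that is a different design from the one proposed in the submission.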
The key takeaway from this lesson is that samples must be chosen randomly and in a manner such that they are an unbiased representation of the larger population. Only then can we rely on the sample data.
Your comments or questions about this article are welcome, as are suggestions for future articles. Feel free to contact me by email at email@example.com.
About the author: Mr. Roger C. Ellis is an industrial engineer by training and profession. He is a Six Sigma Master Black Belt with over 50 years of business experience in a wide range of fields. Mr. Ellis develops and instructs Six Sigma professional certification courses for Key Performance LLC. For a more detailed biography, please refer to www.keyperformance.com.