# Issues in sampling – a real-world example

In a previous article in this series I discussed issues to be aware of when selecting a sample from a larger population. You may find it useful to review that article before reading this one.

Samples must be randomly selected in order to be representative of the overall population. Otherwise, we risk introducing bias into the results. One of the assignments in our Black Belt course asks students to identify issues that might impact their ability to collect random samples, and then explain how they would deal with these issues to ensure that their samples are representative of the population. Here, in part, is a recent submission for this assignment, along with my comments.

“The hospital wants me to draw a random sample from our master patient index for the purpose of a survey.  Email is the desired communication method because of cost. From my experience in information technology……the population containing actual email addresses may be skewed towards younger people……. This could bias my results and not provide a true representation of our patient population.”

The student has identified a key issue – not everyone has an email address. The student may very well be correct that a higher percentage of younger patients have email. Let’s assume that this is true. If we send a survey only to those who have an email address, the results will be biased toward the views of younger patients. The question is: what should we do about this issue?

One option would be to select a random sample from only those patients who have an email address. The results from this survey would carry a disclaimer that they are only representative of those patients who have an email address on file.

The student went on to say:

“To deal with these issues a large sample……increases the confidence level that the sample is random.”

The statement that “a large sample……increases the confidence level that the sample is random” is not correct. Randomness of sample selection and sample size are two separate issues. The size of the sample affects how much confidence we can place in the results, but it has nothing to do with whether the sample was randomly selected. We could take a very large sample that is biased because it consists only of women or only of people younger than 25 years of age. The sample would be large but certainly not random. On the other hand, we could randomly select a sample that is too small to allow us to have much (if any) confidence in the results.
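To make the distinction concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the population is synthetic (ages drawn uniformly from 18 to 90), and the function names, seeds, and sample sizes are hypothetical. It compares a large sample restricted to people aged 25 or younger with a small and a large simple random sample.

```python
import random
import statistics


def make_population(n=100_000, seed=0):
    """Synthetic ages, uniform from 18 to 90 (an assumption, not real data)."""
    rng = random.Random(seed)
    return [rng.randint(18, 90) for _ in range(n)]


def biased_sample(population, size, max_age=25, seed=1):
    """Draw only from people aged max_age or younger: large, but not random."""
    rng = random.Random(seed)
    young = [age for age in population if age <= max_age]
    return rng.sample(young, size)


def simple_random_sample(population, size, seed=2):
    """Every member of the population has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, size)


population = make_population()
true_mean = statistics.mean(population)

big_biased = biased_sample(population, 5_000)           # large and biased
small_random = simple_random_sample(population, 20)     # random but tiny
large_random = simple_random_sample(population, 5_000)  # random and large

print("population mean age:   ", round(true_mean, 1))
print("large biased sample:   ", round(statistics.mean(big_biased), 1))
print("small random sample:   ", round(statistics.mean(small_random), 1))
print("large random sample:   ", round(statistics.mean(large_random), 1))
```

In a run of this sketch, the large biased sample should land far from the population mean no matter how many people it contains, while the small random sample is unbiased but can wander noticeably from one seed to the next. Only the sample that is both random and reasonably large is reliably close.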

The student then stated: “I could also do a query of patients with email addresses including the age demographic and plot those results as to normality.  In this way I could test my hypothesis that patients with actual email addresses available are more heavily skewed to a younger population.  Understanding the population you are drawing the sample from and the use of appropriate tools would minimize the chance of a non-random sample.”

This is a confusing statement. How the population is distributed with respect to age (normally or in some other manner, including skewed) is not the relevant issue. What is relevant is whether the sample data that we collect are representative of the population as a whole. What I believe the student was suggesting is to construct a population consisting only of those patients who have email, yet one that still accurately represents the overall population by age group, and then randomly select a sample from this new population. This could be done. But – and it is a big but – there is no way of knowing if this new population (with email, and properly representing all age groups) would be representative of the overall population on other factors such as sex, race, geography, income level, with or without health insurance, marital status, health status (acutely ill/chronically ill/healthy) and so on.
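One way to sketch this idea in code is a stratified draw: sample only from patients with email, but give each age group a quota that matches its share of the overall population. The data below are entirely synthetic, and the field names (`age`, `urban`, `has_email`), the age bands, and the email probabilities are hypothetical assumptions chosen only to illustrate the caveat. In particular, the sketch assumes that urban patients are more likely to have email.

```python
import random
from collections import Counter, defaultdict


def age_group(age):
    """Hypothetical age bands used as strata."""
    if age < 35:
        return "18-34"
    if age < 55:
        return "35-54"
    if age < 75:
        return "55-74"
    return "75+"


def make_patients(n=20_000, seed=0):
    """Synthetic patients: email is likelier for the young and for urban patients."""
    rng = random.Random(seed)
    base = {"18-34": 0.9, "35-54": 0.7, "55-74": 0.5, "75+": 0.3}
    patients = []
    for _ in range(n):
        age = rng.randint(18, 90)
        urban = rng.random() < 0.5
        p_email = base[age_group(age)] * (1.0 if urban else 0.5)
        patients.append({"age": age, "urban": urban,
                         "has_email": rng.random() < p_email})
    return patients


def stratified_email_sample(patients, size, seed=1):
    """Sample email holders so age groups match the overall population's shares."""
    rng = random.Random(seed)
    overall = Counter(age_group(p["age"]) for p in patients)
    with_email = defaultdict(list)
    for p in patients:
        if p["has_email"]:
            with_email[age_group(p["age"])].append(p)
    sample = []
    for group, count in overall.items():
        quota = round(size * count / len(patients))
        quota = min(quota, len(with_email[group]))  # cannot exceed what exists
        sample.extend(rng.sample(with_email[group], quota))
    return sample


patients = make_patients()
sample = stratified_email_sample(patients, 1_000)

urban_population = sum(p["urban"] for p in patients) / len(patients)
urban_sample = sum(p["urban"] for p in sample) / len(sample)
print("urban share, population:", round(urban_population, 2))
print("urban share, sample:    ", round(urban_sample, 2))
```

Under these assumptions the sample matches the population on age by construction, yet it should over-represent urban patients, because nothing in the procedure controls for that factor. The same would apply to any other characteristic that is related to having an email address on file.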

Finally, the response rate for the survey is an issue that the student did not explore. Response rates for external surveys such as this one are typically in the 10 to 15% range. The results of the survey could be biased simply because of who chooses to respond or not respond.
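A low response rate also has a practical consequence for planning. The small sketch below estimates how many invitations would need to go out to expect a given number of completed surveys. The function name and the target of 400 responses are hypothetical and used only for illustration. Note that sending more invitations addresses the number of responses, not the bias created by who chooses to respond.

```python
def invitations_needed(target_responses, response_rate_pct):
    """Smallest number of invitations whose expected responses reach the target.

    response_rate_pct is a whole-number percentage, e.g. 10 for 10%.
    Integer arithmetic is used to avoid floating-point rounding surprises.
    """
    if target_responses < 0 or not 0 < response_rate_pct <= 100:
        raise ValueError("target must be >= 0 and rate must be in (0, 100]")
    # Ceiling division: -(-a // b) rounds a / b up to the next whole number.
    return -(-target_responses * 100 // response_rate_pct)


# Hypothetical target of 400 completed surveys at the two ends of the range.
for rate in (10, 15):
    print(f"{rate}% response rate:", invitations_needed(400, rate), "invitations")
```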

Whenever we work with facts and data to make decisions, we must carefully evaluate how those facts and data were collected. We must be wary of factors that introduce bias and must do our best to minimize or eliminate bias.