When we work with data, we usually work with samples, since collecting data on the entire population is too difficult, time-consuming, and expensive. In fact, population data is often not even required: we can draw good conclusions from sampled data. So, what is sampling, and how do we determine how many samples we need?
For example, if we wanted to understand the opinions of all the employees in a company with 100,000 employees, we could get a good idea of what they are thinking by sampling a small subset of them and asking them to fill out a survey. Sampling is useful in many other areas: customer satisfaction surveys to learn what customers feel about our product, market research to identify customers and target markets, education surveys to guide changes in how a school or institution works, healthcare surveys to find out what issues and concerns patients have, and polls to identify what voters think or even who is going to win the next election. Clearly, whenever the population contains a large number of data points, sampling adds value to our analysis.
How to calculate sample size?
We want a sample size large enough that the conclusions we draw from the sample would be the “same” as those we would have drawn had we used the population data. If the sample size is too small, our conclusions may be way off; if it is too large, we are wasting resources. The required sample size is selected to limit two types of errors: Type I (false positive, or alpha risk) and Type II (false negative, or beta risk). Since we are collecting limited sample data, there is always a chance of making errors, but we want to limit that chance as much as possible. By increasing the sample size, we can reduce both the Type I and Type II errors. The five variables you need to determine sample size are the type of data you are trying to estimate, the population size, the margin of error you are willing to tolerate, the confidence level, and the power of the test. The confidence level dictates the amount of Type I error (alpha = 1 − confidence level), and the power of the test dictates the amount of Type II error (beta = 1 − power). We won’t get into the nitty-gritty here, but we need statistical tools to estimate the required sample size.
Let’s take an example of a sample size calculation based on a popular online sample size calculator from Survey Monkey. This calculator has three inputs: the population size, the confidence level required in your analysis, and the margin of error you are willing to accept. For a very large population size, a default confidence level of 95%, and a margin of error of 5%, the calculator shows a required sample size of 385. This is the number of responses you would need on your survey. If the response rate is low, we clearly need to survey many more people to achieve a sample size of 385.
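The figure of 385 can be reproduced with a short calculation. The sketch below assumes the calculator uses the standard margin-of-error formula for a proportion with the most conservative value p = 0.5; the function name, the parameter names, and the choice of Cochran's finite population correction are assumptions made for illustration, not the calculator's documented implementation.

```python
import math
from statistics import NormalDist


def survey_sample_size(confidence=0.95, margin=0.05, population=None, p=0.5):
    """Sample size from the margin-of-error formula (hypothetical helper).

    Assumes a proportion estimate with worst-case p = 0.5 and a
    two-sided confidence interval based on the normal approximation.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

    # Sample size for an effectively infinite population.
    n0 = z ** 2 * p * (1 - p) / margin ** 2

    # Optional finite population correction (Cochran's form, an assumption).
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)

    # Round up, since a fraction of a response cannot be collected.
    return math.ceil(n0)


print(survey_sample_size())                    # very large population: 385
print(survey_sample_size(population=100_000))  # company with 100,000 employees
```

Note that power does not appear anywhere among the inputs, which is the issue discussed in the next section.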
Problem with existing analysis
There are a lot of online calculators out there, as can be seen from some of the references listed below. The problem with these calculators is that they only ask you for the confidence level and not the power of the test. In fact, if you derive the formula used in these calculators, you will find that they are implicitly setting the power of the test at 50%. That means there is a 50% chance of a Type II (beta) error in these algorithms. These algorithms are so ubiquitous that they appear in many of the online calculators available on the Internet.
In fact, for the problem above, if you set the confidence level at 95% (5% alpha risk) and the power of the test at 90% (10% beta risk), you will find that the required sample size is actually 1047. This is significantly higher than the 385 reported by online calculators. Hence, users who blindly rely on the online calculators available on the Internet may be significantly undersampling (collecting less data than what is truly required for performing the analysis).
To demonstrate the right way to calculate sample size, the following analysis was performed with a power of 50% (the rest of the parameters are the same as in the example above):
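One way to see where the 50% comes from is the following sketch. It assumes the normal-approximation sample size formula for a one-sample proportion test, with hypothesized proportion p_0, alternative proportion p_1, and margin of error E = |p_1 - p_0|; these symbols are introduced here for illustration.

```latex
% Sample size for a two-sided one-sample proportion test (normal approximation)
n = \left( \frac{ z_{1-\alpha/2}\,\sqrt{p_0 (1 - p_0)} \;+\; z_{1-\beta}\,\sqrt{p_1 (1 - p_1)} }{ p_1 - p_0 } \right)^{2}

% A power of 50% means 1 - \beta = 0.5, so z_{1-\beta} = z_{0.5} = 0 and the second term vanishes:
n = \frac{ z_{1-\alpha/2}^{2}\; p_0 (1 - p_0) }{ E^{2} }, \qquad E = |p_1 - p_0|
```

The second expression is the familiar margin-of-error formula, which contains the confidence level but no power term.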
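The two numbers can be compared with a short calculation. The source does not state which formula produced 1047, so the sketch below rests on an assumption: a two-sided one-sample proportion test with a hypothesized proportion of 0.5, an alternative of 0.55 (a 5-point shift matching the 5% margin of error), and the normal approximation. The function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist


def proportion_test_sample_size(p0=0.5, p1=0.55, alpha=0.05, power=0.90):
    """Sample size for a two-sided one-sample proportion test.

    Normal approximation; p0 is the hypothesized proportion and p1 the
    alternative we want to detect (both are illustrative assumptions).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # controls Type I error
    z_beta = std_normal.inv_cdf(power)           # controls Type II error

    numerator = (z_alpha * math.sqrt(p0 * (1 - p0))
                 + z_beta * math.sqrt(p1 * (1 - p1)))
    return math.ceil((numerator / (p1 - p0)) ** 2)


print(proportion_test_sample_size(power=0.90))  # 1047
print(proportion_test_sample_size(power=0.50))  # 385
```

Under these assumptions, a power of 90% gives 1047, while a power of 50% makes the beta term zero and gives back the 385 reported by the online calculators.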
The following analysis was performed with a power of 90% (rest of the parameters are the same as the example above):
In summary, I recommend that you understand the math behind these algorithms and only use calculators that ask for both the confidence level and the power, such as Minitab or Sigma Magic software, which will give you the right analysis results.
The popular saying “Caveat Emptor” (“Buyer Beware”) also applies to statistical tools.
Here are some references that only use the confidence level for analysis (only a sample of listings shown):