
The Central Limit Theorem
Purpose
This will demonstrate what the Central Limit Theorem is, why it is significant, and what conclusions you can draw from sampled data. This also demonstrates how to use the R programming language to sample from a data set and draw conclusions about the data using the sampled data. There is also a simulation at the bottom that shows how different distributions and sample sizes affect the distribution of sampled data.
This is intended for students with some familiarity with both statistics and the R programming language.
Table of Contents
- Introduction
- An Example Using Die Rolls
- An Example Using Temperature Data
- Central Limit Theorem and Z-Values
- Finite Population Correction Factor
- Simulating the Central Limit Theorem
- Additional Sources
Introduction
The Central Limit Theorem is one of the most important concepts in inferential statistics.
It states that if you have a population with a mean $\mu$ and a standard deviation $\sigma$ and you take an infinitely large number of sample means:
- The mean of the sample means will be equal to the population mean regardless of the sample size and regardless of the population distribution.
- The sample means will be normal for a normally distributed population distribution and sample means for other population distributions will approach a normal distribution as the sample size increases.
- The standard deviation of the sample means will be approximately equal to the standard deviation of the population divided by the square root of the sample size regardless of the sample size.
- As the sample size increases the standard deviation decreases.
Sample sizes of 30 or more ($n \geq 30$) are considered sufficiently large unless you know the population is normally distributed, in which case smaller sample sizes are acceptable.
Mean of Sample Means
As the the number of sample means approach infinity, the mean of the sampled means will have the same mean as the population regardless of the sample size and regardless of the population distribution.
$$\text{mean of sample means} = \text{population mean}$$
$$\mu_{\bar{X}} = \mu$$
Where:
- $\bar{X}$ is a sample mean
- $\mu_{\bar{X}}$ is the mean of the sample means
- $\mu$ is the population mean
Standard Error of the Mean
The standard deviation of the sample means is also called the standard error of the mean.
The standard deviation of the sample data will be equal to standard deviation of the population divided by the square root of the sample size, or:
$$\text{standard error of the mean} = \frac{\text{population standard deviation}}{\sqrt{\text{sample size}}}$$
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
Where:
- $\sigma_{\bar{M}}$ is the standard error of the mean
- $\sigma$ is the standard deviation of the population
- n is the sample size
The standard error of the mean will get smaller as the sample size increases. A smaller standard error would indicate that the sample data is more representative of the total population. For example, if we had identical standard deviations for two different populations but different sample sizes, then the standard error of the mean would be smaller for the set with a larger sample size.
You can use the standard error of the mean to draw inferences about the population data from a sample. This is useful in the many many situations where it would be impossible to measure every individual from a population. For example, you can take the standard deviation of the means of a group of samples and multiply it by the square root of the sample size to find a population standard deviation.
Sampling Error
Sampling error is the difference between the actual population data and the same measure of the sampled data.
Since the sample does not include all members of the population, statistics on the sample, such as means and quantiles, generally differ from the characteristics of the entire population, which are known as parameters.
Number of Sample Means
- When the population data is normally distributed then the sample means will also be normally distributed.
- When the population data is not normally distributed then at least 30 sample means will have to be made for the sample means to be normally distributed. This is because of sampling error.
An Example With Dice
Consider an example with a die. A die has an equal probability of landing on any side so the probability distribution is uniform. But, taking the mean of multiple samples produces a normal distribution.
In this example below, a die is sampled thirty times and then the mean is taken from those 30 samples. That operation is repeated 10,000 times and the resulting set of sample means is plotted in a histogram.
die <- 1:6
die_rolls <- NULL
for (i in 1:10000){
die_rolls <- append(die_rolls, mean(sample(die, 30, replace = TRUE)))
}
hist(die_rolls)
In the figure above you can see the following:
- the sample means have a normal distribution
- the population mean is 3.5 and so is the mean of the sample means
- the standard deviation of the samples multiplied by the square root of the sample size is close to the square root of the population (see below)
sd(die_rolls)*sqrt(30)
[1] 1.720758
sd(die)
[1] 1.870829
An Example Using Temperature Data
This example will sample from population data to demonstrate that:
- a non-normal population will appear normal when sampled
- that the sample means will converge around the population mean
- that the standard deviation of the sample means is approximately equal to the population standard deviation divided by the square root of the sample size
This example uses historical data from the United States Historical Climatology Network. The data set contains daily high temperatures in Farmington, Maine from 1911 through 2010. Download the data here.
The variables in the data file are:
- station number (i.e., 172765: 2-digit state code, followed by 4-digit station code)
- day of year (1-365)
- month
- day of month
- year the daily record was set
- daily temperature high
Prepare the Data
First load the following packages:
library(data.table)
library(ggplot2)
library(easyGgplot2)
Then import the data:
``` Station172765_TMax<- data.table(read.table("/data/ME_2765tmax.txt", header = FALSE)) ```
The columns are labeled “V1” through “V6” so replace the column names with more descriptive names.
setnames(Station172765_TMax, c("V1", "V2", "V3", "V4", "V5", "V6"), c("StationID", "YearDay", "Year", "Month", "MonthDay", "MaxTemp"))
On days that no data was collected the temperature is recorded as -999, change that to NA.
Station172765_TMax[MaxTemp == -999, MaxTemp := NA]
Now view the dataset:
Station172765_TMax
## StationID YearDay Year Month MonthDay MaxTemp
## 1: 172765 1 1911 1 1 18
## 2: 172765 2 1911 1 2 36
## 3: 172765 3 1911 1 3 38
## 4: 172765 4 1911 1 4 35
## 5: 172765 5 1911 1 5 17
## ---
## 36496: 172765 361 2010 12 27 25
## 36497: 172765 362 2010 12 28 26
## 36498: 172765 363 2010 12 29 24
## 36499: 172765 364 2010 12 30 32
## 36500: 172765 365 2010 12 31 44
Plotting Temperature Data
This plot shows the temperature highs with the day of the year on the x-axis and the temperature on the y-axis:
p <- ggplot(Station172765_TMax, aes(YearDay, MaxTemp))
p + geom_point()
The histogram of the data shows that it’s bimodal:
Station172765_TMax[,hist(MaxTemp, xlab="Temp", main="Frequency Max Temp - Farmington, ME, 1911 - 2010")]
Sample Means
This will return a data table called sampleMeans
with four columns of 10000 sample means each; each sample mean will come from 2, 10, 30 or 50 samples from the temperature data set.
Station172765_TMax[,.(sampleSize2 = replicate(10,mean(sample(MaxTemp,2,replace = TRUE),na.rm = TRUE)))]
sampleMeans <- Station172765_TMax[,.(
sampleSize2 = replicate(10000,mean(sample(MaxTemp,2,replace = TRUE),na.rm = TRUE)),
sampleSize10 = replicate(10000,mean(sample(MaxTemp,10,replace = TRUE),na.rm = TRUE)),
sampleSize30 = replicate(10000,mean(sample(MaxTemp,30,replace = TRUE),na.rm = TRUE)),
sampleSize50 = replicate(10000,mean(sample(MaxTemp,50,replace = TRUE),na.rm = TRUE))
)]
Sampling Error of Sample Means
This dataset shows that there can be substantial variability in the sample means especially for smaller sample sizes, but an infinitely large number of sample means will have a mean equal to the population mean regardless of the sample size or distribution. In this case the sample means are clustering around ~54.5 for a population mean.
> sampleMeans[1:5]
## sampleSize2 sampleSize10 sampleSize30 sampleSize50
## 1: 63.5 54.50000 58.50000 51.73469
## 2: 47.5 37.22222 55.33333 54.98000
## 3: 63.0 63.90000 49.50000 55.04167
## 4: 53.5 50.30000 52.26667 50.26531
## 5: 52.5 64.40000 55.60000 56.04000
Mean of Sample Means
Taking the mean of all of the sample means shows that they are fairly close to the population mean.
> sapply(sampleMeans, mean)
## sampleSize2 sampleSize10 sampleSize30 sampleSize50
## 54.52900 54.46785 54.44227 54.46443
Station172765_TMax[, mean(MaxTemp, na.rm = TRUE)]
## [1] 54.45328
Standard Error of the Mean
The standard deviation of the sample means, also called the standard error of the mean, is much smaller with a larger sample size and much larger with a small sample size.
sapply(sampleMeans, sd)
## sampleSize2 sampleSize10 sampleSize30 sampleSize50
## 14.943480 6.766649 3.880390 3.023907
Also notice that the standard deviation of the sample means is approximately equal to the standard deviation of the population data divided by the square root of the sample size. Compare the standard deviations above to the population standard deviation divided by the square root of the sample size below.
Station172765_TMax[, sd(MaxTemp, na.rm = TRUE)]/sqrt(2)
## [1] 14.93637
Station172765_TMax[, sd(MaxTemp, na.rm = TRUE)]/sqrt(10)
## [1] 6.679747
Station172765_TMax[, sd(MaxTemp, na.rm = TRUE)]/sqrt(30)
## [1] 3.856554
Station172765_TMax[, sd(MaxTemp, na.rm = TRUE)]/sqrt(50)
## [1] 2.987274
Without a complete data set for this weather data, you could take the standard deviation of the sample means and multiply it by the square root of our sample size to get the standard deviation of the population.
> sampleMeans[,sd(sampleSize50)] * sqrt(50)
## [1] 21.38225
Station172765_TMax[, sd(MaxTemp, na.rm = TRUE)]
## [1] 21.12321
Plot Sample Means
Plotting the sampleMeans data as a density plot shows that this sampled data looks like a normal distribution even though the original data was bimodal. You can also see that the standard error is much larger for the sample means with the smaller sample size.
meltedSampleMeans <- sampleMeans[,.(
sampleSize = c(
rep("sampleSize2",10000),
rep("sampleSize10",10000),
rep("sampleSize30",10000),
rep("sampleSize50",10000)
),
sampleMeans = c(
sampleSize2,
sampleSize10,
sampleSize30,
sampleSize50
)
)]
p <- ggplot(meltedSampleMeans, aes(sampleMeans, fill = sampleSize, color = sampleSize))
p + geom_density(alpha = 0.2) + geom_vline(xintercept = 54.45328) + labs(x = "Sample Means", y = "Density")
Notice that the different sets of sample means are all clustered around the same population mean (the black vertical line), and that the standard deviation declines as the sample size increases.
Cental Limit Theorem Z-Values
If the number of samples is large enough, i.e., more than 30, then conclusions can be made about the sample means in the same way that conclusions can be made about normally distributed datasets.
Below is the formula for z-values in normally distributed data:
$$z=\frac{\text{value - mean}}{\text{standard deviation}}$$
We can use this very similar formula to find z-values for sample means of population data:
$$z = \frac{\bar{X}-\mu}{\sigma\div\sqrt{n}} = \frac{\bar{X}-\mu}{\sigma_{\bar{X}}}$$
$$z = \frac{\text{sample mean}-\text{population mean}}{\text{population standard deviation}\div\sqrt{\text{sample size}}} = \frac{\text{sample mean}-\text{population mean}}{\text{sample standard deviation}}$$
Z-Values With Sample Size Less Than 30
If the number of samples is less than 30, then use the normal formula for a z-score.
$$z = \frac{\bar{X}-\mu}{\sigma}$$
Example Weather Data
So if we took 36 days at random from the weather data, what is the probability that the mean of those 36 days would be greater than 60 degrees?
The standard deviation of the sample means would be: $$\sigma_{\bar{X}} =\frac{\sigma}{\sqrt{n}} = \frac{21.12321}{\sqrt{36}} = 3.520535$$
The z-score is:
$$z = \frac{\bar{X}-\mu}{\sigma\div\sqrt{n}} = \frac{60-54.45328}{21.12321\div\sqrt{36}} = 1.575533$$
(60-54.45328)/(21.12321/sqrt(36))
## [1] 1.575533
And based on Z-Score lookup table the probability that 36 days would have a mean greater than 60 is 1-.94295 or 0.05705.
Pnorm function
You can use the pnorm function to calculate probabilities based on the summary data of a normal distribution.
The arguments for pnorm look like this:
pnorm(quantile, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
What’s key in this operation is that the value for the standard deviation entered into pnorm is not 21.12321, but $21.12321 \div \sqrt{36}$.
pnorm(60, mean=54.45328, sd=(21.12321/sqrt(36)), lower.tail = FALSE)
## [1] 0.0575667
So that means there’s a 5.76% chance that an average of 36 samples from this data set will be greater than 60 degrees. A short test in R with the weather station data set shows that this estimation holds up.
This will take 1,000,000 sample means with a sample size of 36, then return a count of the number of samples greater than 60.
> sampleMeans60 <- Station172765_TMax[,replicate(1000000, mean(sample(MaxTemp,36, replace = TRUE),na.rm=TRUE))]
> table(sampleMeans60 > 60)
## FALSE TRUE
## 943019 56981
$$\frac{56981}{1000000} = .056981 \approx 5.70%$$
Smaller Sample Size
Where it gets interesting is if we have 5 days instead of 36. Five days falls below the level at which we have to divide sigma by the root of the sample size.
The z-score is:
$$z = \frac{\bar{X}-\mu}{\sigma} = \frac{60-54.45328}{21.12321} = 0.2625889$$
And based on Z-Score lookup table the probability that the mean of five values will be greater than 60 is 1-.60257 or 0.39743.
Using pnorm we get a similar answer:
pnorm(60, 54.45328, 21.12321, lower.tail = FALSE)
## [1] 0.3964337
Finite Population Correction Factor
The Central Limit Theorem and standard error of the mean assume that samples are drawn with replacement. However, almost all survey work are conducted on finite populations and samples are drawn without replacement. In these cases, and especially when the sample size *n* is large compared to the population size *N* (more than 5%), the finite population correction (FPC) factor is used to account for the added precision gained by sampling close to a larger percentage of the population. The effect of the FPC is that the error becomes zero when the sample size *n* is equal to the population size *N*.
Formula for the population correction factor:
$$\sqrt{\frac{N-n}{N-1}}$$
Where:
- N is the population size
- n is the sample size
So the standard error of the mean is multiplied by the population correction factor:
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \cdot \sqrt{\frac{N-n}{N-1}}$$
And the total formula for the z-value when using the population correction factor becomes:
$$z = \frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}\cdot\sqrt{\frac{N-n}{N-1}}}$$
Simulating the Central Limit Theorem
The simulation below allows you to select different population distributions, change parameters that change the shape of the population distribution, and then change the sample size. From there you can view different plots that allow you to compare the population data with sampled data, or view changes in the sampled data as the sample size increases or decreases.
The different tabs will show different plots:
- Population Distribution. This shows the population distribution.
- Distributions of Samples. This shows histograms of eight samples.
- Distribution of Sample Means. This shows the distribution of the sample means.
- Sample Means QQ Plot. A quantile quantile plot which compares the distribution of the sample means to a normal distribution.
This simulation is based on code from Quality and Innovation and ShinyEd. The code can be found on github and is hosted by shinyapps.io.
Additional Sources
- Sampling Distributions and Central Limit Theorem in R
- Simulation of the Sampling Distribution of the Mean Can Mislead
- The Use of R Language in the Teaching of Central Limit Theorem
- Simulation: Central Limit Theorem
- UCLA: The Central Limit Theorem
- Sampling Distributions and the Central Limit Theorem
- In Brief: Standard Deviation and Standard Error