The Central Limit Theorem

Purpose

This will demonstrate what the Central Limit Theorem is, why it is significant, and what conclusions you can draw from sampled data.

Table of Contents

  1. Introduction
    1. Mean of Sample Means
    2. Standard Error of the Mean
    3. Sampling Error
    4. Number of Sample Means
  2. An Example with Temperature Data
    1. Prepare Temperature Data
    2. Plot Temperature Data
    3. Sample Means
    4. Sampling Error
    5. Mean of Sample Means
  3. Central Limit Theorem and Z-Values
    1. Sample Size Less Than 30
    2. Example: Z-Score with Weather Data
      1. Using Pnorm Function
    3. Smaller Sample Size
  4. Finite Population Correction Factor
    1. Population Size and Population Correction Factor
      1. Small N
      2. Large N
      3. No Correction Factor
  5. Example
  6. Simulating the Central Limit Theorem
  7. Additional Sources

Introduction

The Central Limit Theorem is one of the most important concepts in inferential statistics.

It states that if you have a population with a mean $\mu$ and a standard deviation $\sigma$ and you take an infinitely large number of sample means:

  1. the mean of the sample means will be equal to the population mean regardless of the sample size and regardless of the population distribution.
  2. the sample means will be normal for a normally distributed population distribution and sample means for other population distributions will approach a normal distribution as the sample size increases.
  3. the standard deviation of the sample means will be equal to the standard deviation of the population divided by the square root of the sample size regardless of the sample size.
  4. as the sample size increases the standard deviation decreases.

Sample sizes of 30 or more ($n \geq 30$) are considered sufficiently large unless you know the population is normally distributed, in which case smaller sample sizes are acceptable. However the population distribution will affect the sample size necessary to produce a normal distribution


Mean of Sample Means

As the the number of sample means approach infinity the mean of the sampled means will have the same mean as the population regardless of the sample size and regardless of the population distribution.

$$\text{mean of sample means} = \text{population mean}$$

OR

$$\mu_{\bar{X}} = \mu$$

Where:

  • $\bar{X}$ is a sample mean
  • $\mu_{\bar{X}}$ is the mean of the sample means
  • $\mu$ is the population mean


Standard Error of the Mean

The standard deviation of the sample means is also called the standard error of the mean.

The standard deviation of the sample data will be equal to standard deviation of the population divided by the square root of the sample size, or:

$$\text{standard error of the mean} = \frac{\text{population standard deviation}}{\sqrt{\text{sample size}}}$$

or


$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

Where:

  • $\sigma_{\bar{M}}$ is the standard error of the mean
  • $\sigma$ is the standard deviation of the population
  • n is the sample size


The standard error of the mean will get smaller as the sample size increases. A smaller standard error would indicate that the sample data is more representative of the total population. For example if we had identical standard deviations for two different populations but different sample sizes, then the standard error of the mean would be smaller for the set with a larger sample size.


You can use the standard error of the mean draw inferences about the population data from a sample. This is useful in the many many situations where it would be impossible to measure every individual from a population. For example if we 30 sets of fifty samples from a population, we can take the standard deviation of the means of those samples and find a population standard deviation.


Sampling Error

Sampling error is the difference between the actual population data and the same measure of the sampled data. For example the mean of each sample will be different from the mean of the population data.

From wikipedia:

Since the sample does not include all members of the population, statistics on the sample, such as means and quantiles, generally differ from the characteristics of the entire population, which are known as parameters.


Number of Sample Means

  • When the population data is normally distributed then the sample means will also be normally distributed.
  • When the population data is not normally distributed then at least 30 sample means will have to be made for the sample means to be normally distributed. This is because of sampling error.



An Example With Dice

Consider an example with a die. A die has an equal probability of landing on any side so the probability distribution is uniform. But, taking the mean of multiple samples produces a normal distribution.

die <- 1:6
die_rolls <- NULL

for (i in 1:10000){
    die_rolls <- append(die_rolls, mean(sample(die, 30, replace = TRUE)))
}

hist(die_rolls)


An Example: Temperature Data


This example will sample from population data to demonstrate that a non-normal population will appear normal when sampled, that the sample means will converge around the population mean, and that the standard deviation of the sample means is equal to the population standard deviation divided by the square root of the sample size.

This example uses historical data from the United States Historical Climatology Network. The data set contains daily high temperatures in Farmington, Maine from 1911 through 2010. Download the data here.

The variables/columns in the data file are:

  • Station number (i.e., 172765: 2-digit state code, followed by 4-digit station code)
  • Day of year (1-365)
  • Month
  • Day of Month
  • Year the daily record was set
  • Daily Temperature High


Prepare the Data

First load the following packages:

library(data.table)
library(ggplot2)
library(easyGgplot2)


Then import the data.

Station172765_TMax<- data.table(read.table("/data/ME_2765tmax.txt", header = FALSE))

The columns are labeled “V1” through “V6” so replace the column names with more descriptive names.

setnames(Station172765_TMax, c("V1", "V2", "V3", "V4", "V5", "V6"), c("StationID", "YearDay", "Year", "Month", "MonthDay", "MaxTemp"))

On days that no data was collected the temperature is recorded as -999, change that to NA.

Station172765_TMax[MaxTemp == -999, MaxTemp := NA]

Now view the dataset:

Station172765_TMax

##        StationID YearDay Year Month MonthDay MaxTemp
##     1:    172765       1 1911     1        1      18
##     2:    172765       2 1911     1        2      36
##     3:    172765       3 1911     1        3      38
##     4:    172765       4 1911     1        4      35
##     5:    172765       5 1911     1        5      17
##    ---                                              
## 36496:    172765     361 2010    12       27      25
## 36497:    172765     362 2010    12       28      26
## 36498:    172765     363 2010    12       29      24
## 36499:    172765     364 2010    12       30      32
## 36500:    172765     365 2010    12       31      44


Plotting Temperature Data


This plot shows the temperature highs with the day of the year on the x-axis and the temperature on the y-axis:

p <- ggplot(Station172765_TMax, aes(YearDay, MaxTemp))
p + geom_point() 


The histogram of the data shows that it’s bimodal:

Station172765_TMax[,hist(MaxTemp, xlab="Temp", main="Frequency Max Temp - Farmington, ME, 1911 - 2010")]

Sample Means

This will return a data table called sampleMeans with four columns of 10000 sample means each; each sample mean will come from 2, 10, 30 or 50 samples from the temperature data set.


Station172765_TMax[,.(sampleSize2 = replicate(10,mean(sample(MaxTemp,2,replace = TRUE),na.rm = TRUE)))]

sampleMeans <- Station172765_TMax[,.(
    sampleSize2 = replicate(10000,mean(sample(MaxTemp,2,replace = TRUE),na.rm = TRUE)),
    sampleSize10 = replicate(10000,mean(sample(MaxTemp,10,replace = TRUE),na.rm = TRUE)),
    sampleSize30 = replicate(10000,mean(sample(MaxTemp,30,replace = TRUE),na.rm = TRUE)),
    sampleSize50 = replicate(10000,mean(sample(MaxTemp,50,replace = TRUE),na.rm = TRUE))
    )]

Sampling Error of Sample Means

This dataset shows that there can be substantial variability in the sample means especially for smaller sample sizes, but an infinitely large number of sample means will have a mean equal to the population mean regardless of the sample size or distribution. In this case the sample means are clustering around ~54.5 for a population mean.

> sampleMeans[1:5]

##    sampleSize2 sampleSize10 sampleSize30 sampleSize50
## 1:        63.5     54.50000     58.50000     51.73469
## 2:        47.5     37.22222     55.33333     54.98000
## 3:        63.0     63.90000     49.50000     55.04167
## 4:        53.5     50.30000     52.26667     50.26531
## 5:        52.5     64.40000     55.60000     56.04000


Mean of Sample Means

Taking the mean of all of the sample means shows that they are fairly close to the population mean.

> sapply(sampleMeans, mean)

##  sampleSize2 sampleSize10 sampleSize30 sampleSize50 
##     54.52900     54.46785     54.44227     54.46443 

Station172765_TMax[, mean(MaxTemp, na.rm = TRUE)]

## [1] 54.45328


Standard Error of the Mean


The standard deviation of the sample means, also called the standard error of the mean is much smaller with a larger sample size and much larger with a small sample size.

sapply(sampleMeans, sd) 

## sampleSize2 sampleSize10 sampleSize30 sampleSize50 
##   14.943480     6.766649     3.880390     3.023907 

Also notice that the standard deviation of the sample means is approximately equal to the standard deviation of the population data divided by the square root of the sample size. Compare the standard deviations above to the population standard deviation divided by the square root of the sample size below.

Station172765_TMax[, sd(MaxTemp, na.rm = TRUE)]/sqrt(2)
## [1] 14.93637

Station172765_TMax[, sd(MaxTemp, na.rm = TRUE)]/sqrt(10)
## [1] 6.679747

Station172765_TMax[, sd(MaxTemp, na.rm = TRUE)]/sqrt(30)
## [1] 3.856554

Station172765_TMax[, sd(MaxTemp, na.rm = TRUE)]/sqrt(50)
## [1] 2.987274


Without a complete data set for this weather data, you could take the standard deviation of the sample means and multiply it by the square root of our sample size to get the standard deviation of the population.

> sampleMeans[,sd(sampleSize50)] * sqrt(50)

## [1] 21.38225

Station172765_TMax[, sd(MaxTemp, na.rm = TRUE)] 
## [1] 21.12321


Plot Sample Means


Plotting the sampleMeans data as a density plot shows that this sampled data looks like a normal distribution even though the original data was bimodal. You can also see that the standard error is much larger for the sample means with the smaller sample size.

meltedSampleMeans <- sampleMeans[,.(
    sampleSize = c(
        rep("sampleSize2",10000),
        rep("sampleSize10",10000),
        rep("sampleSize30",10000),
        rep("sampleSize50",10000)
    ), 
    sampleMeans = c(
        sampleSize2,
        sampleSize10,
        sampleSize30,
        sampleSize50
    )
)]

p <- ggplot(meltedSampleMeans, aes(sampleMeans, fill = sampleSize, color = sampleSize))
p + geom_density(alpha = 0.2) + geom_vline(xintercept = 54.45328) + labs(x = "Sample Means", y = "Density") 

Notice that the different sets of sample means are all clustered around the same population mean (the black vertical line), and that the standard deviation declines as the sample size increases.


Cental Limit Theorem Z-Values

If the number of samples is large enough, ie more than 30, then conclusions can be made about the sample means in the same way that conclusions can be made about normally distributed datasets.

Below is the formula for z-values in normally distributed data:

$$z=\frac{\text{value - mean}}{\text{standard deviation}}$$


We can use this very similar formula to find z-values for sample means of population data:

$$z = \frac{\bar{X}-\mu}{\sigma\div\sqrt{n}} = \frac{\bar{X}-\mu}{\sigma_{\bar{X}}}$$

or

$$z = \frac{\text{sample mean}-\text{population mean}}{\text{population standard deviation}\div\sqrt{\text{sample size}}} = \frac{\text{sample mean}-\text{population mean}}{\text{sample standard deviation}}$$


Z-Values With Sample Size Less Than 30


If the number of sample is less than 30 then use the normal formula for a z-score.

$$z = \frac{\bar{X}-\mu}{\sigma}$$


Example Weather Data

So if we took 36 days at random from the weather data, what is the probability that the mean of those 36 days would be greater than 60 degrees?


The standard deviation of the sample means would be: $$\sigma_{\bar{X}} =\frac{\sigma}{\sqrt{n}} = \frac{21.12321}{\sqrt{36}} = 3.520535$$

OR

Station172765_TMax[, sd(MaxTemp, na.rm = TRUE)]/sqrt(36)
## [1] 3.520536


The z-score is:

$$z = \frac{\bar{X}-\mu}{\sigma\div\sqrt{n}} = \frac{60-54.45328}{21.12321\div\sqrt{36}} = 1.575533$$

(60-54.45328)/(21.12321/sqrt(36))

## [1] 1.575533


And based on Z-Score lookup table the probability that 36 days would have a mean greater than 60 is 1-.94295 or 0.05705.


Pnorm function

What’s key in this operation is that the value for the standard deviation entered into pnorm is not 21.12321, but $21.12321 \div \sqrt{36}$.

pnorm(60, mean=54.45328, sd=(21.12321/sqrt(36)), lower.tail = FALSE)
## [1] 0.0575667

So that means there’s a 5.76% chance that an average of 36 samples from this data set will be greater than 60 degrees. A short test in R with the weather station data set shows that this estimation holds up.

This will take 1,000,000 million sample means with a sample size of 36, then return of a count of the number of samples greater than 60.

> sampleMeans60 <- Station172765_TMax[,replicate(1000000, mean(sample(MaxTemp,36, replace = TRUE),na.rm=TRUE))] 

> table(sampleMeans60 > 60)

##  FALSE   TRUE 
## 943019  56981 

$$\frac{56981}{1000000} = .056981 \approx 5.70\%$$


Smaller Sample Size

Where it gets interesting is if we have 5 days instead of 36. Five days falls below the level at which we have to divide sigma by the root of the sample size.

The z-score is:

$$z = \frac{\bar{X}-\mu}{\sigma} = \frac{60-54.45328}{21.12321} = 0.2625889$$

And based on Z-Score lookup table the probability that the mean of five values will be greater than 60 is 1-.60257 or 0.39743.


Using pnorm we get a similar answer:

pnorm(60, 54.45328, 21.12321, lower.tail = FALSE)

## [1] 0.3964337


Finite Population Correction Factor


The Central Limit Theorem and standard error of the mean assume that samples are drawn with replacement. However almost all survey work are conducted on finite populations and samples are drawn without replacement. In these cases and especially when the sample size n is large compared to the population size N (more than 5%), the finite population correction (FPC) factor is used to account for the added precision gained by sampling close to a larger percentage of the population. The effect of the FPC is that the error becomes zero when the sample size n is equal to the population size N.

Formula for the population correction factor:

$$\sqrt{\frac{N-n}{N-1}}$$

Where:

  • N is the population size
  • n is the sample size


So the standard error of the mean is multiplied by the population correction factor:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} \cdot \sqrt{\frac{N-n}{N-1}}$$


And the total formula for the z-value when using the population correction factor becomes:

$$z = \frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}\cdot\sqrt{\frac{N-n}{N-1}}}$$


Population Size and Population Correction Factor

This shows the effect that this correction factor has based on the size of the population N and the sample size n.


Small N

If N is small and multiplied by a data set with a $\sigma$ of 1.

$$\sqrt{\frac{150-30}{150-1}} = 0.8974236$$

$$\sigma_{\bar{X}\text{Small}} = \frac{1}{\sqrt{30}} \cdot \sqrt{\frac{150-30}{150-1}} = 0.1638464$$

$$z = \frac{4-3.5}{\frac{1}{\sqrt{30}}\cdot\sqrt{\frac{150-30}{150-1}}} = 3.051639$$

So in this case 99.89% of the sample means fall below 4.


Large N

If N is large and multiplied by a data set with a $\sigma$ of 1:

$$\sqrt{\frac{100000-5}{100000-1}} = 0.999855$$


$$\sigma_{\bar{X}\text{Large}} = \frac{1}{\sqrt{30}} \cdot \sqrt{\frac{1000-5}{1000-1}} = 0.1825477$$

$$z = \frac{4-3.5}{\frac{1}{\sqrt{30}}\cdot\sqrt{\frac{100000-30}{100000-1}}} = 2.73901$$

In this case 99.69% of the sample means fall below 4.


No Correction Factor

This is fairly close to the standard error of the mean without a correction factor:

$$\sigma_{\bar{X}\text{No Correction Factor}} = \frac{1}{\sqrt{30}} = 0.1825742$$

$$z = \frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}}$$

$$z = \frac{4-3.5}{\frac{1}{\sqrt{30}}} \approx 2.738613$$

And like the previous case 99.69% of the sample means fall below 4.


Simulating the Central Limit Theorem

This simulation is based on code from Quality and Innovation and ShinyEd. The code can be found on github.

Additional Sources



Related