Central limit theorem

Link to Jupyter Notebook

What? If we repeatedly sample points from a distribution and plot the frequency of the sample mean, the result approaches a normal distribution as the sample size grows.

We start with some crazy distribution that has nothing to do with a normal distribution. We draw a sample of some fixed size from that distribution and record the sample mean (or sample sum) in a frequency table. Repeat this many times and the histogram of sample means looks approximately normal, and the approximation improves as the sample size grows!

Whatever unknown process generates the values, as long as they are independent draws from some probability distribution (the SAME distribution is important), averaging them gives a sample mean that is approximately normally distributed. This holds for any distribution with a finite variance.
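Stated a little more formally, in the standard textbook form rather than anything taken from the notebook (the symbols \mu, \sigma and \bar{X}_n are introduced here for illustration): for independent draws X_1, ..., X_n from the same distribution with mean \mu and finite variance \sigma^2,

```latex
\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i ,
\qquad
\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \;\xrightarrow{d}\; \mathcal{N}(0, 1)
\quad \text{as } n \to \infty .
```

In words: for large n the sample mean is roughly normal, centred on \mu, with standard deviation \sigma / \sqrt{n}.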

Meanwhile, the Law of Large Numbers tells us that if we take a sample of n observations of our random variable and average them, the sample mean approaches the expected value E(X) of the random variable as n grows.
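A minimal sketch of the Law of Large Numbers, assuming the same loaded die used in the experiment below. The seed, the number of rolls, and the names rng, faces, probs and running_mean are illustrative choices, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

faces = np.arange(1, 7)
probs = np.array([0.2, 0.3, 0.0, 0.2, 0.0, 0.3])  # assumed loaded-die weights

# E(X): each face weighted by its probability
expected_value = float(np.dot(faces, probs))

# Roll many times and track the mean of the first n rolls for every n
rolls = rng.choice(faces, size=100_000, p=probs)
running_mean = np.cumsum(rolls) / np.arange(1, rolls.size + 1)

for n in (10, 100, 1_000, 100_000):
    print(f"n={n:>6}: running mean = {running_mean[n - 1]:.3f}, E(X) = {expected_value:.3f}")
```

The running mean wanders for small n and settles near E(X) as n grows. Note that this is a statement about a single long sample, whereas the central limit theorem describes the shape of the distribution of many sample means.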

Nice Khan Academy video explaining this. Link

Small Python experiment to show this in action:

Defining a discrete distribution

Let’s assume we have an unfair die that never lands on 3 or 5 and lands more often on 2 and 6. We use NumPy’s random.choice function for this. Link

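import numpy as np
import seaborn as sns

dice = np.arange(1, 7)
probabilities = [0.2, 0.3, 0.0, 0.2, 0.0, 0.3]

# Draw `trials` samples of size `sample_size` and return the mean of each
def sample_draw_mean(trials=1000, sample_size=1):
    sample_mean = []
    for i in range(trials):
        sample = np.random.choice(dice, size=sample_size, p=probabilities, replace=True)
        sample_mean.append(np.mean(sample))  # record this sample's mean
    return sample_mean

# Plot the frequencies of the sample means
# (sns.distplot is deprecated in recent seaborn; sns.histplot is its replacement)
sns.histplot(sample_draw_mean(trials=1000, sample_size=1), bins=len(dice), kde=True);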


For sample size 1, the frequencies of the rolled numbers match the probabilities we defined above. However, we can instead draw samples of several numbers from that distribution, e.g. 4 numbers at a time. We do this many times and plot the histogram of the sample means.

Plotting sampling distribution of sample mean
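A rough, text-only sketch of how the sampling distribution changes with sample size, so it runs without a plotting backend. The helper names sample_means and text_histogram, the seed, the bin count and the sample sizes 1, 4 and 30 are all illustrative assumptions, not the notebook's own code.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
dice = np.arange(1, 7)
probabilities = [0.2, 0.3, 0.0, 0.2, 0.0, 0.3]

def sample_means(trials, sample_size):
    """Return `trials` sample means, each averaged over `sample_size` rolls."""
    rolls = rng.choice(dice, size=(trials, sample_size), p=probabilities)
    return rolls.mean(axis=1)

def text_histogram(values, bins=12, width=40):
    """Print a crude horizontal histogram of `values` over the range 1..6."""
    counts, edges = np.histogram(values, bins=bins, range=(1, 6))
    scale = width / counts.max()
    for count, left in zip(counts, edges[:-1]):
        print(f"{left:5.2f} | {'#' * int(count * scale)}")

for n in (1, 4, 30):
    print(f"sample_size = {n}")
    text_histogram(sample_means(trials=10_000, sample_size=n))
```

With sample_size = 1 the bars simply mirror the die's probabilities (gaps at 3 and 5). As the sample size increases, the bars fill in and bunch into a bell shape around the centre. For real figures, the same sample_means output can be passed to a seaborn or matplotlib histogram call instead.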


As we increase the sample size, the frequency distribution of the sample mean starts to approach the normal distribution! That’s the central limit theorem.

Also, the mean of the sampling distribution of the sample mean is the population mean!

population_mean = np.dot(dice, probabilities)  # probability-weighted mean; np.mean(dice) = 3.5 would be the fair-die value
= 3.4
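A quick numerical check of that claim, as a sketch under stated assumptions: the population mean here weights each face by its probability (a plain average of the faces would describe a fair die), and the comparison of the spread with sigma / sqrt(n) is the standard CLT result, not something the notebook computes. The seed, the 20,000 trials and the names mu and sigma are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
dice = np.arange(1, 7)
probabilities = np.array([0.2, 0.3, 0.0, 0.2, 0.0, 0.3])

# Population mean and standard deviation of the loaded die
mu = np.dot(dice, probabilities)
sigma = np.sqrt(np.dot((dice - mu) ** 2, probabilities))

for n in (1, 4, 30):
    # 20,000 samples of size n, reduced to one mean per sample
    means = rng.choice(dice, size=(20_000, n), p=probabilities).mean(axis=1)
    print(f"n={n:>2}: mean of sample means = {means.mean():.3f} (mu = {mu:.3f}), "
          f"std = {means.std():.3f} (sigma/sqrt(n) = {sigma / np.sqrt(n):.3f})")
```

For every sample size the mean of the sample means should sit close to mu, while the spread shrinks roughly like sigma / sqrt(n), which is why the histogram narrows as the sample size grows.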
