Back to the Basic Statistics — Population and Sample
A population is the entire group of individuals or items that we want to study. A sample is a smaller subset that we actually observe or measure.
In machine learning and data science, we often use samples to train models and make predictions because we can’t look at the entire universe of data. It’s important to understand the difference between the two and how they’re relevant to the work we do.
Every dataset you work with in machine learning is a sample, not the entire population.
Sample Mean
We take a sample from the population, and the average of that sample is the sample mean. As the sample size increases, the sample mean becomes a better estimate of the population mean.
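Written out, with y_1, ..., y_n the observed values in a sample of size n, the sample mean is usually defined as:

```latex
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
```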
Proportion
Dividing the sample size by the total population size gives the proportion of the population that was sampled.
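In symbols, with n the sample size and N the population size (the same notation used in the next section), this reads as:

```latex
\text{proportion} = \frac{n}{N}
```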
Sample Variance
We know that variance measures how spread out the data is. The sample variance measures how the observations in the sample are spread and serves as an estimate of the spread of the population.
Here, N is the total population size and n is the sample size; Y is the random variable, whereas y_i is the value of the i-th observation.
When we do not know the population size or anything else about the population, we can use the sample variance shown below. This works well as the sample size increases.
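These are the standard definitions that match the notation above; here μ is the population mean and ȳ is the sample mean. The population variance divides by N, while the sample variance divides by n - 1 to correct the bias that comes from estimating the mean from the same sample:

```latex
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \mu)^2
\qquad\qquad
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2
```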
Law of Large Numbers: as the sample size increases, the sample mean gets closer to the population mean. For this to hold:
- The samples must be drawn randomly from the population.
- The sample size must be sufficiently large.
- The larger the sample size, the more accurate the sample mean is likely to be.
- The individual observations in the sample must be independent of each other.
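Here is a minimal sketch of the law of large numbers in action, assuming NumPy is available; the skewed population and the sample sizes are arbitrary choices made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# A skewed "population": exponential values (illustrative choice).
population = rng.exponential(scale=2.0, size=100_000)
population_mean = population.mean()

# Draw random, independent samples of growing size and compare means.
for n in [10, 100, 1_000, 10_000]:
    sample = rng.choice(population, size=n, replace=False)
    print(f"n={n:>6}  sample mean={sample.mean():.4f}  "
          f"population mean={population_mean:.4f}")
```

As the sample size grows, the printed sample means settle closer and closer to the population mean.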
Central Limit Theorem
Take a distribution, any distribution. It can be as skewed as you want. Now take a few samples, always of the same size, compute the average of each, do this many times, and plot all these averages. Guess what you get? Yes, you get the normal distribution, no matter what distribution you started with in the first place. This is a fascinating result and one of the pinnacles of statistics. It is called the central limit theorem.
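Here is a minimal sketch of that experiment, assuming NumPy and Matplotlib are available; the skewed discrete distribution, the sample size, and the number of samples are arbitrary choices for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# A heavily skewed discrete distribution (values and probabilities are arbitrary).
values = np.array([1, 2, 3, 10])
probs = np.array([0.70, 0.15, 0.10, 0.05])

sample_size = 30       # size of each sample
num_samples = 5_000    # how many sample means we collect

# Repeatedly sample, average, and collect the averages.
means = [rng.choice(values, size=sample_size, p=probs).mean()
         for _ in range(num_samples)]

# The histogram of the sample means looks approximately normal.
plt.hist(means, bins=50)
plt.title("Distribution of sample means")
plt.xlabel("sample mean")
plt.ylabel("count")
plt.show()
```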
The above works for discrete random variables. What if we have continuous random variables?
When you average a large enough number of variables, the distribution of that average will approximately follow a normal distribution.
So the central limit theorem goes like this:
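One standard way to write it, with μ the population mean and σ² the population variance: for a random sample of size n, the sample mean is approximately normally distributed when n is large.

```latex
\bar{y} \;\approx\; \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right)
\qquad\text{equivalently}\qquad
\frac{\bar{y} - \mu}{\sigma / \sqrt{n}} \xrightarrow{\;d\;} \mathcal{N}(0, 1)
```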
Happy Learning!!
References: DeepLearning.ai