Advanced Probability Concepts for Machine Learning
Probability is one of the most important and widely used concepts in Machine Learning. Calculating probabilities and applying Bayes' theorem are basics that everyone should be aware of.
Now that we know how to calculate basic probability, a natural question arises: what if the outcome is not a fixed value but something that varies, like a temperature reading? This is where the concept of Random Variables comes in.
Random Variables
Suppose we flip 10 coins and let X be the number of heads; X is a random variable. Each time we flip the 10 coins we can get anywhere between 0 and 10 heads: in one round I might get 5 heads, so X = 5, and in another I might get 2 heads, so X = 2.
X can be discrete or continuous.
Discrete Random Variables: rolling a die, flipping a coin, etc.
Continuous Random Variables: amount of rainfall, height of a gymnast, time a bus arrives, etc.
If you observe, a discrete random variable takes a countable number of values (which may be finite or countably infinite), whereas a continuous random variable can take any value within an interval, so its values cannot be listed out individually.
All discrete random variables can be modeled by a probability mass function, abbreviated as PMF. Consider coins again: if 2 fair coins are tossed and X is the number of heads, then X can be 0, 1, or 2, and a histogram of the probabilities shows 1/4, 2/4, 1/4. Similarly, for 3 coins X spans from 0 to 3 and the probabilities are 1/8, 3/8, 3/8, 1/8.
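To make the PMF concrete, here is a minimal sketch in plain Python (the helper name coin_pmf is my own, not a library function) that enumerates every equally likely outcome and counts the heads; it reproduces the 1/4, 2/4, 1/4 histogram for 2 coins and the 1/8, 3/8, 3/8, 1/8 one for 3 coins.

```python
from itertools import product
from collections import Counter

def coin_pmf(n_coins):
    """PMF of X = number of heads when n_coins fair coins are tossed."""
    outcomes = list(product("HT", repeat=n_coins))            # all 2**n equally likely outcomes
    heads = Counter(outcome.count("H") for outcome in outcomes)
    return {k: heads[k] / len(outcomes) for k in sorted(heads)}

print(coin_pmf(2))  # {0: 0.25, 1: 0.5, 2: 0.25}
print(coin_pmf(3))  # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```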
The same kind of pattern continues as the number of coins increases. Is there a formula that directly gives the probability of each value of X? That is exactly what the Binomial Distribution provides.
Binomial Distribution
The binomial distribution is an example of a discrete distribution. It is built from the binomial coefficient, defined for n and k as C(n, k) = n! / (k! (n − k)!).
For a fair coin (p = 1/2), the resulting PMF is symmetric.
Consider the experiment of 5 coin tosses.
X follows a binomial distribution, where 5 is the number of flips and p is the probability of heads, P(H).
If the number of flips is n, generalizing gives the expression P(X = k) = C(n, k) · p^k · (1 − p)^(n − k).
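As a quick sketch of that expression (the function binomial_pmf below is my own helper, not a library call), we can compute the PMF for the 5-flip experiment directly:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for n flips with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# 5 fair coin flips: probability of each number of heads
for k in range(6):
    print(k, binomial_pmf(k, 5, 0.5))   # 1/32, 5/32, 10/32, 10/32, 5/32, 1/32
```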
How do we calculate Binomial Coefficient?
Suppose we roll a die 4 times and require that no face repeats. On the first roll there are 6 possible faces, on the second 5, on the third 4, and on the fourth 3, giving 6·5·4·3 ordered outcomes. Since the order does not matter, we divide by 4!, so the number of ways is (6·5·4·3) / 4!, which can be written as C(6, 4) = 6! / (4! 2!).
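A small check of that counting argument, using Python's math module:

```python
from math import comb, factorial

ordered = 6 * 5 * 4 * 3                 # ordered ways to pick 4 distinct faces
unordered = ordered // factorial(4)     # divide out the 4! orderings
print(unordered, comb(6, 4))            # both print 15
```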
Probability Distributions for Continuous Variables
The values have been discrete so far, but continuous values cannot be described by a binomial distribution.
Suppose we want to plot something continuous, such as the length of a phone call; we can start by grouping the values into intervals like 0–1 min, 1–2 min, and so on.
If we shrink the intervals and look at the distribution even more finely, repeating the experiment many more times, the histogram approaches a smooth curve.
The area under the curve is equal to 1
In the discrete case we can state the exact probability of an event, say, a particular outcome when a die is rolled n times. With a continuous distribution we cannot say exactly what the probability is that a call ends at precisely 2 minutes; instead we take intervals and estimate how likely the call is to end within them. These probabilities for continuous distributions are encoded by the Probability Density Function.
Probability Density Function
It is denoted by fₓ(x). It tells you the rate at which probability accumulates around each point, and it is defined only for continuous variables.
The probability between two points a and b is the area under fₓ(x) between them; the total area must equal 1 and fₓ(x) ≥ 0 everywhere.
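As an illustration (the density e^(−x) is a hypothetical choice for a call-length distribution, not one used earlier in the text), both properties can be checked numerically with scipy:

```python
import numpy as np
from scipy.integrate import quad

def f(x):
    """Hypothetical PDF for call length: f(x) = e^(-x) for x >= 0."""
    return np.exp(-x)

total_area, _ = quad(f, 0, np.inf)   # total area under the PDF
print(total_area)                    # ~1.0

prob, _ = quad(f, 1, 2)              # P(1 <= X <= 2), area between a = 1 and b = 2
print(prob)                          # e^-1 - e^-2 ≈ 0.2325
```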
Cumulative Distribution Function
We have seen the probability mass function for discrete distributions and the probability density function for continuous distributions.
The cumulative distribution function gives the actual probability that the call lasts between zero and a certain number of minutes, which is often much more convenient to calculate. There is also a cumulative distribution function for discrete distributions.
For the continuous distribution we discussed, the cumulative probability adds up the areas of all the previous intervals: Fₓ(x) = P(X ≤ x).
The maximum value the cumulative distribution function can reach is 1.
The plot of the CDF shows how much probability the variable has accumulated up to a certain value.
To summarize: the PDF is never negative and has a total area of one underneath the curve; the CDF starts at zero on the left, ends at one on the right, and is always non-decreasing. We can use either of them based on convenience.
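A quick sketch of that relationship, using an exponential distribution from scipy.stats purely as an example: accumulating the PDF numerically recovers the CDF.

```python
import numpy as np
from scipy.stats import expon

x = np.linspace(0, 10, 1001)
pdf = expon.pdf(x)                        # f(x) = e^(-x)
cdf = expon.cdf(x)                        # F(x) = 1 - e^(-x)

# Summing up the PDF step by step approximates the CDF
dx = x[1] - x[0]
approx_cdf = np.cumsum(pdf) * dx
print(np.max(np.abs(approx_cdf - cdf)))   # small numerical error, shrinks as dx shrinks
print(cdf[0], cdf[-1])                    # starts at 0, approaches 1
```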
Uniform Distribution
The uniform distribution describes values that are homogeneous across an interval: there is no variation in how likely different values are, so every value in the interval is equally likely.
A continuous random variable can be modeled with a uniform distribution if all the possible values lie in an interval and have the same frequency of occurrence.
Consider a uniform distribution on the interval between a and b: the PDF is zero outside the interval and constant at 1/(b − a) inside it, and the CDF rises linearly from 0 at a to 1 at b.
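A small sketch with scipy.stats.uniform (the interval [2, 5] is an arbitrary choice of a and b):

```python
from scipy.stats import uniform

a, b = 2.0, 5.0                              # hypothetical interval [a, b]
U = uniform(loc=a, scale=b - a)              # uniform distribution on [2, 5]

print(U.pdf(3.0))                            # 1 / (b - a) = 1/3 inside the interval
print(U.pdf(6.0))                            # 0 outside the interval
print(U.cdf(2.0), U.cdf(3.5), U.cdf(5.0))    # CDF rises linearly: 0, 0.5, 1
```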
Normal / Gaussian Distribution
The normal distribution is the continuous bell-shaped curve named after the famous Carl Friedrich Gauss. When n is very large, the binomial distribution can be approximated by a Gaussian distribution.
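A small check of that approximation, assuming a fair coin and n = 1000 flips (arbitrary illustrative numbers):

```python
import numpy as np
from scipy.stats import binom, norm

n, p = 1000, 0.5                       # large number of flips
mu = n * p                             # mean of the binomial
sigma = np.sqrt(n * p * (1 - p))       # standard deviation of the binomial

for k in (480, 500, 520):
    exact = binom.pmf(k, n, p)
    approx = norm.pdf(k, loc=mu, scale=sigma)
    print(k, exact, approx)            # the two values are very close
```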
We know that the function that gives a bell curve is e^(-x²/2), but its peak sits at 0. To center it on our distribution of data we subtract the mean from x. To match the spread of the data we divide (x − mean) by the standard deviation. The height still looks different, and the total area should be 1, so finally we divide by the standard deviation multiplied by sqrt(2*Pi). Putting it together, the normal PDF is fₓ(x) = (1 / (σ·sqrt(2*Pi))) · e^(-(x − μ)² / (2σ²)).
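Here is a minimal sketch that builds that formula step by step, using arbitrary values for μ and σ, and checks it against scipy.stats.norm:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 3.0, 2.0                   # hypothetical mean and standard deviation
x = np.linspace(-5, 11, 9)

# Build the formula exactly as described above
centered = (x - mu) / sigma                            # shift by the mean, scale by the spread
bell = np.exp(-centered**2 / 2)                        # the basic bell curve
pdf = bell / (sigma * np.sqrt(2 * np.pi))              # normalize so the total area is 1

print(np.allclose(pdf, norm.pdf(x, loc=mu, scale=sigma)))   # True
```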
If the distribution is centered at 0 with standard deviation 1 (mean 0), it is called the standard normal distribution.
Sampling from a Distribution
Sampling means picking points so that they occur with the probabilities given by the original distribution, and it is good to draw samples from every region of the distribution. This is hard to do directly from the PDF, but if you take the CDF, pick heights uniformly between 0 and 1, and map each height back to the corresponding value on the x-axis, the resulting points are samples from the distribution. Sampling from a probability distribution in this way is a very important concept in Machine Learning.
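A minimal sketch of that idea, assuming an exponential distribution whose CDF F(x) = 1 − e^(−x) can be inverted by hand (this specific distribution is my choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pick uniform heights u between 0 and 1 on the CDF axis,
# then map each back to x by inverting F(x) = 1 - e^(-x): x = -ln(1 - u).
u = rng.uniform(0, 1, size=100_000)
samples = -np.log(1 - u)

print(samples.mean())        # ~1.0, the mean of this exponential distribution
print((samples < 1).mean())  # ~0.632, matching F(1) = 1 - e^(-1)
```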