Probability — How is it related to Machine learning?

Shilpa Thota

Before jumping into where we use probability in machine learning, let us quickly understand what probability is.

Probability is a measure of how likely an event is to occur.

For example, toss a coin: what is the probability that the outcome is heads? Heads is 1 of 2 equally likely outcomes, so the probability is 1/2, or 50%.

If we want to find the probability that a child picked at random plays soccer, and there are 10 kids in total of whom 7 do not play soccer and 3 do, then the event is only 30% likely: the chance of picking a child who plays soccer is 3/10.

Here, "the coin lands on heads" or "a child picked at random plays soccer" is the event. The set of all possible outcomes, which has 2 members for the coin and 10 for the soccer example, is called the sample space.
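
To make this concrete, here is a minimal Python sketch of "probability = favourable outcomes / size of the sample space", using the coin and soccer numbers from the examples above (the function name is just for illustration).

```python
def probability(favourable, sample_space_size):
    """Probability of an event: favourable outcomes over all possible outcomes."""
    return favourable / sample_space_size

print(probability(1, 2))    # coin lands on heads -> 0.5
print(probability(3, 10))   # randomly picked child plays soccer -> 0.3
```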

Complement: the probability of an event not occurring is 1 minus the probability of it occurring, i.e. P(not A) = 1 − P(A).

Sum of Probabilities (Disjoint Events): if two events cannot occur at the same time, the probability that event 1 or event 2 occurs is the sum of their individual probabilities.

For example, in a class of 10, 3 kids like maths and 4 like science (and nobody likes both). The probability that a kid likes either maths or science is 0.3 + 0.4 = 0.7.

Sum of Probabilities (Joint Events): in the scenario above the two groups had nothing in common, but what if they overlap? This can be visualized as a union. Suppose instead that 6 students like maths, 5 like science, and 3 like both. The probability that a student likes either maths or science is not simply 0.6 + 0.5, because the 0.3 who like both is counted twice, so we subtract it once: 0.6 + 0.5 − 0.3 = 0.8.

This can be expressed as the formula P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
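
As a quick check, the sketch below plugs the class numbers from above (6 like maths, 5 like science, 3 like both) into the formula.

```python
total = 10
p_maths = 6 / total     # P(A)
p_science = 5 / total   # P(B)
p_both = 3 / total      # P(A ∩ B)

# Inclusion-exclusion: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
p_either = p_maths + p_science - p_both
print(p_either)  # 0.8
```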

Independent Events: two events are said to be independent if the outcome of one does not affect the other.

For example, what is the probability of a fair coin landing on heads 5 times in a row? Each toss has probability 1/2, and because the tosses are independent the probabilities multiply: 1/2 × 1/2 × 1/2 × 1/2 × 1/2 = 1/32.
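
The sketch below computes the exact value and, purely as an illustration, also estimates it with a quick simulation of repeated coin tosses.

```python
import random

# Exact probability of 5 heads in a row: (1/2) ** 5 = 1/32
print(0.5 ** 5)  # 0.03125

# Rough Monte Carlo estimate: simulate many runs of 5 independent fair tosses
trials = 100_000
hits = sum(
    all(random.random() < 0.5 for _ in range(5))  # True only if all 5 tosses are heads
    for _ in range(trials)
)
print(hits / trials)  # close to 0.03125
```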

Conditional Probability: conditional probability is about calculating the probability of an event happening given that another event has already happened.

For example, let's say you are wondering about the probability that today is humid; that is some number. However, if you find out that it was raining yesterday, the probability that today is humid changes. That is a conditional probability.

Suppose we toss 2 coins and want the probability that both land on heads, given that we already know the first coin is heads. Only 2 outcomes are still possible (HH and HT), so the probability is 1/2. This is written P(HH | 1st is H), read as the probability of two heads GIVEN that the first one is heads.
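
This can be checked by enumerating the outcomes; a minimal sketch:

```python
from itertools import product

# All four equally likely outcomes of tossing two coins
outcomes = list(product("HT", repeat=2))

# Condition on the first coin being heads
first_is_heads = [o for o in outcomes if o[0] == "H"]

# P(HH | 1st is H) = favourable outcomes / outcomes consistent with the condition
p = first_is_heads.count(("H", "H")) / len(first_is_heads)
print(p)  # 0.5
```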

We know the product rule for independent events:

P(A ∩ B) = P(A) · P(B)

This only holds if A and B are independent.

Suppose we want to find P(A ∩ B) using conditional probability. Taking the product rule above but replacing P(B) with the probability of B given that A has occurred, we get

P(A ∩ B) = P(A) · P(B | A)

This is the case when A and B are dependent.
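
As a numeric illustration of the dependent case (my own added example, reusing the soccer numbers from earlier): pick two of the 10 kids without replacement and ask for the probability that both play soccer.

```python
# A = "first kid picked plays soccer", B = "second kid picked plays soccer"
# 3 of the 10 kids play soccer, and the kids are picked without replacement,
# so B depends on A.
p_a = 3 / 10            # P(A)
p_b_given_a = 2 / 9     # P(B | A): one soccer player already picked
p_both = p_a * p_b_given_a
print(p_both)  # ≈ 0.067
```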

Bayes Theorem

Bayes theorem is one of the most important theorems in probability and is used all over the place, including in machine learning. It is used for spam recognition, speech detection, and many other things.

Bayes theorem: P(A | B) = P(B | A) · P(A) / P(B), which can be expanded as P(A | B) = P(B | A) · P(A) / (P(B | A) · P(A) + P(B | not A) · P(not A)).

Let us take an example: deciding whether emails are spam. Say we have a big data set of 100 emails, and 20 of them are actually spam. The easiest classifier we can build is one that says every email is spam with 20% probability, because that's all we know so far. Next, we look at how many of the spam emails contain the word "lottery" and notice that there are 14 of them. Among the emails which are not spam, 10 contain the word "lottery". For now, let us not worry about the emails that do not contain the word "lottery".

In total there are 24 emails containing the word "lottery". Out of these, P(spam | lottery) = 14/24 and P(not spam | lottery) = 10/24.

In the Bayes theorem formula above, A is "the email is spam" and B is "the email contains the word lottery". The quantities we need are:

  • P(spam) is 20 out of 100, which is 0.2
  • P(not spam) is 80 out of 100, which is 0.8
  • P(lottery | spam) is 14/20, which is 0.7
  • P(lottery | not spam) is 10/80, which is 0.125

Plugging these in, P(spam | lottery) = (0.7 × 0.2) / (0.7 × 0.2 + 0.125 × 0.8) = 0.14 / 0.24 ≈ 0.583.
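
The same calculation as a short Python sketch, using the numbers listed above:

```python
p_spam = 0.2
p_not_spam = 0.8
p_lottery_given_spam = 14 / 20      # 0.7
p_lottery_given_not_spam = 10 / 80  # 0.125

# Bayes theorem:
# P(spam | lottery) = P(lottery | spam) * P(spam)
#   / (P(lottery | spam) * P(spam) + P(lottery | not spam) * P(not spam))
numerator = p_lottery_given_spam * p_spam
denominator = numerator + p_lottery_given_not_spam * p_not_spam
print(numerator / denominator)  # ≈ 0.583
```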

Prior and Posterior

The prior is the original probability that you can calculate without knowing anything else. Then something happens, and that is the event. The event gives you information about the probability, and with that information you can calculate something called the posterior. Here the prior was P(A); if we call the event E, the posterior is P(A | E).

Initially we calculated P(spam) as the number of spam emails over spam plus not-spam emails. Then the event occurred: the email contains the word "lottery". P(spam | lottery) is the number of spam emails with "lottery" over spam emails with "lottery" plus not-spam emails with "lottery". The first calculation is the prior; the one after the event is the posterior.

The Naive Bayes Model

In the example above we built a classifier for spam emails by considering only the word "lottery". In reality, there may be other telling words as well, such as "winning". In the same way, P(spam | winning) = spam emails with "winning" / (spam emails with "winning" + not-spam emails with "winning").

Now suppose we want our classifier to handle emails containing both "lottery" and "winning". Then P(spam | lottery ∩ winning) = spam emails with both words / (spam emails with both words + not-spam emails with both words). The problem is that we might not have any emails containing both words. If we assume the words appear independently of each other, then P(lottery ∩ winning | spam) = P(lottery | spam) · P(winning | spam).

Substituting this into Bayes theorem, and writing "ham" for "not spam", we get

P(spam | lottery ∩ winning) = P(lottery | spam) · P(winning | spam) · P(spam) / (P(lottery | spam) · P(winning | spam) · P(spam) + P(lottery | ham) · P(winning | ham) · P(ham))

This is called the Naive Bayes model.
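
Here is a minimal sketch of that model in Python. The "lottery" counts are the ones from the example above; the "winning" counts are hypothetical numbers made up purely to illustrate the formula.

```python
p_spam, p_ham = 0.2, 0.8

# P(word | class), estimated from counts; the "winning" values are assumptions
p_word_given_spam = {"lottery": 14 / 20, "winning": 15 / 20}
p_word_given_ham = {"lottery": 10 / 80, "winning": 8 / 80}

def p_spam_given(words):
    """P(spam | words), assuming words appear independently given the class."""
    spam_score, ham_score = p_spam, p_ham
    for w in words:
        spam_score *= p_word_given_spam[w]
        ham_score *= p_word_given_ham[w]
    return spam_score / (spam_score + ham_score)

print(p_spam_given(["lottery"]))             # ≈ 0.583, matches the single-word result
print(p_spam_given(["lottery", "winning"]))  # both words combined under independence
```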

Probability in Machine Learning

Now that we have seen the model and have a basic understanding of what probability is, let us relate these concepts to machine learning.

Machine learning is largely about probabilities. Much of the time we are calculating the probability of something given some other factors.

In the example above, we built an email filter that decides whether an email is spam or not, and the decision depends on words, which are treated as features. There can be multiple features, such as whether the email has an attachment or who the recipient is. This is conditional probability: the probability of spam given some features.

Another example is sentiment analysis, where you want to determine whether a piece of text is happy or sad. In this case, you want to find the probability that the piece of text is happy given the words it contains. Let's do another example: image recognition, where you try to find out whether an image contains a particular thing or not. Say you want to recognize whether there is a cat in an image. You calculate the probability that there is a cat in the image based on the pixels of the image. These are all conditional probabilities.

Probabilities also appear on their own in machine learning. There is another big area called generative machine learning, such as generating faces, which is part of unsupervised machine learning, and there you want to maximize probabilities. For example, in image generation, if you have seen those impressive computer-generated images of faces, the goal is to maximize the probability that a bunch of pixels forms a human face. In text generation, you want to maximize the probability that a bunch of words forms sensible text that talks about a certain thing. All of these are examples of machine learning that use a lot of probability.

Reference: DeepLearning.AI
