Statistics for Machine Learning: Everything You Need to Know
Statistics and Probability form the core of Machine Learning and Data Science. It is statistical analysis, coupled with computing power and optimization, that enables Machine Learning to achieve what it does today. From the basics of probability to descriptive and inferential statistics, these topics form the base of Machine Learning.
By the end of this tutorial, you will know the following:
- Independent and dependent events, and marginal, joint, and conditional probability
- Discrete and continuous probability distributions, including the Normal Distribution
- Measures of central tendency
- The Central Limit Theorem
- Standard Deviation and Standard Error
Independent and Dependent Events
Let’s consider 2 events, event A and event B. When the probability of occurrence of event A doesn’t depend on the occurrence of event B, then A and B are independent events. For example, if you toss 2 fair coins, the probability of getting heads on each coin is 0.5, no matter what the other coin shows; the probability of both showing heads is simply 0.5 × 0.5 = 0.25. Hence the events are independent.
Now consider a box containing 5 balls: 2 black and 3 red. The probability of drawing a black ball first will be 2/5. Given that the first ball drawn was black, the probability of drawing a black ball again from the remaining 4 balls will be 1/4. In this case, the two events are dependent, as the probability of drawing a black ball the second time depends on which ball was drawn in the first draw.
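If you’d like to verify these numbers yourself, here is a minimal simulation sketch in Python (standard library only; the trial count and variable names are illustrative):

```python
# Estimate the dependent probabilities from the box example by simulation.
import random

trials = 100_000
first_black = 0
second_black_given_first_black = 0

for _ in range(trials):
    box = ["black", "black", "red", "red", "red"]
    random.shuffle(box)
    if box[0] == "black":
        first_black += 1
        if box[1] == "black":
            second_black_given_first_black += 1

print(f"P(first black) ~ {first_black / trials:.3f}")        # ~ 2/5 = 0.400
print(f"P(second black | first black) ~ "
      f"{second_black_given_first_black / first_black:.3f}")  # ~ 1/4 = 0.250
```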
Marginal Probability
It’s the probability of an event irrespective of the outcomes of other random variables, e.g. P(A) or P(B).
Joint Probability
It’s the probability of two different events occurring at the same time, i.e., two (or more) simultaneous events, e.g. P(A and B) or P(A, B).
Conditional Probability
It’s the probability of one (or more) events given the occurrence of another event; in other words, it is the probability of an event A occurring when a second event B is known to have occurred, e.g. P(A given B) or P(A | B).
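To make the three definitions concrete, here is a small Python sketch that computes each quantity from a made-up joint distribution over two binary events (all the numbers are illustrative):

```python
# A toy joint distribution over two binary events A and B.
joint = {
    (True, True): 0.20,   # P(A and B)
    (True, False): 0.30,
    (False, True): 0.10,
    (False, False): 0.40,
}

p_a = sum(p for (a, b), p in joint.items() if a)   # marginal P(A)
p_b = sum(p for (a, b), p in joint.items() if b)   # marginal P(B)
p_a_and_b = joint[(True, True)]                    # joint P(A, B)
p_a_given_b = p_a_and_b / p_b                      # conditional P(A | B)

print(p_a, p_b, p_a_and_b, round(p_a_given_b, 3))  # 0.5 0.3 0.2 0.667
```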
Probability Distributions
Probability Distributions depict the distribution of data points in a sample space. They help us see the probability of drawing certain data points when sampling at random from the population. For example, if a population consists of the marks of students in a school, the probability distribution will have Marks on the X-axis and the number of students with those marks on the Y-axis. This is also called a Histogram. The histogram is a type of Discrete Probability Distribution. The main types of Discrete Distribution are the Binomial Distribution, the Poisson Distribution, and the Uniform Distribution.
On the other hand, a Continuous Probability Distribution is made for data that takes continuous values; in other words, when the variable can take an infinite set of values, like height, speed, temperature, etc. Continuous Probability Distributions have tremendous use in Data Science and statistical analysis for checking feature importance, data distributions, statistical tests, etc.
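As a quick illustration, here is a minimal Python sketch that builds the marks histogram described above, assuming NumPy and Matplotlib are installed (the marks data is randomly generated for the example):

```python
import numpy as np
import matplotlib.pyplot as plt

# Generate illustrative marks for 500 students, clipped to the 0-100 range.
rng = np.random.default_rng(42)
marks = rng.normal(loc=65, scale=12, size=500).clip(0, 100)

plt.hist(marks, bins=20)
plt.xlabel("Marks")
plt.ylabel("Number of students")
plt.title("Histogram of students' marks")
plt.show()
```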
In addition to the previously stated discrete probability distributions (Binomial, Poisson, and Uniform), a few more significant discrete probability distributions are often employed in statistics for machine learning.
The Bernoulli distribution is a discrete probability distribution describing a binary outcome, where the random variable takes only two possible values, typically labeled 0 and 1. It is used to model the probability of success or failure in a single trial.
The geometric distribution models the number of trials needed to obtain the first success in a sequence of independent Bernoulli trials, each with the same probability of success.
The negative binomial distribution models the total number of trials required to reach a specified number of successes in a sequence of independent Bernoulli trials. It generalizes the geometric distribution by allowing for more than one success.
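Here is a brief sketch of these three distributions using SciPy, assuming it is installed (the success probability p = 0.3 is arbitrary). Note that SciPy’s nbinom is parameterized by the number of failures before the r-th success, rather than the total number of trials:

```python
from scipy import stats

p = 0.3  # probability of success in each Bernoulli trial

# Bernoulli: probability of failure (0) and success (1) in a single trial.
print(stats.bernoulli.pmf([0, 1], p))      # [0.7 0.3]

# Geometric: probability the FIRST success arrives on trial k.
print(stats.geom.pmf([1, 2, 3], p))        # [0.3 0.21 0.147]

# Negative binomial: probability of k failures before the r-th success,
# so the total number of trials is k + r.
r = 3
print(stats.nbinom.pmf([0, 1, 2], r, p))   # [0.027 0.0567 0.0794]
```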
Normal Distribution
The most well-known continuous distribution is the Normal Distribution, also known as the Gaussian distribution or the “Bell Curve.”
Consider a normal distribution of people’s heights. Most heights are clustered in the middle, where the curve is tallest, and the curve gradually falls off towards the left and right extremes, which denote a lower probability of sampling those values at random.
This curve is centred at its mean and can be tall and slim or short and spread out. A slim curve denotes that the values we can sample lie in a narrow range, while a more spread-out curve shows that there is a larger range of values. This spread is defined by the Standard Deviation.
The greater the Standard Deviation, the more spread out your data will be. The Standard Deviation is simply the square root of another property called the Variance, which defines how much the data ‘varies’. And variance is what data is all about: Variance is information. No Variance, no information. The Normal Distribution also plays a crucial role in statistics through the Central Limit Theorem, discussed below.
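As a quick sanity check on the relationship just stated, this minimal NumPy sketch (the data values are illustrative) shows that the Standard Deviation is the square root of the Variance:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
variance = np.var(data)   # mean of squared deviations from the mean
std_dev = np.std(data)

print(variance, std_dev, np.sqrt(variance) == std_dev)  # 4.0 2.0 True
```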
It is important to mention that the normal distribution is essential to statistical learning. Many machine learning algorithms assume, or attempt to approximate, a normal distribution in the data.
The 68-95-99.7 rule, commonly known as the empirical rule or the three-sigma rule, is an essential characteristic of the normal distribution. According to the rule, around 68% of the data lies within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. This rule is a valuable guideline for understanding how the data is distributed and for spotting outliers.
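You can verify the rule empirically. Here is a small NumPy sketch (the sample size and seed are arbitrary) that checks the three fractions on simulated normal data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
mu, sigma = x.mean(), x.std()

# Fraction of points within k standard deviations of the mean.
for k in (1, 2, 3):
    frac = np.mean(np.abs(x - mu) <= k * sigma)
    print(f"within {k} sigma: {frac:.4f}")  # ~0.6827, ~0.9545, ~0.9973
```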
Measures of Central Tendency
Measures of Central Tendency are ways to summarize a dataset with a single value. There are 3 main Measures of Central Tendency:
1. Mean: The mean is just the arithmetic average of the values in the data/feature: the sum of all values divided by the number of values. The mean is usually the most common way to measure the centre of any data, but it can be misleading in some cases. For example, when there are a lot of outliers, the mean will start to shift towards the outliers and become a bad measure of the centre of your data.
2. Median: The median is the data point that lies exactly in the centre when the data is sorted in increasing or decreasing order. When the number of data points is odd, the median is simply the centre-most point. When the number of data points is even, the median is calculated as the mean of the 2 centre-most data points.
3. Mode: The mode is the data point that appears most frequently in a dataset. The mode is the most robust to outliers, as it stays fixed at the most frequent point.
In addition to the mean, median, and mode, there are other measures of central tendency that can give further insight into the data.
4. Weighted Mean: The weighted mean is used when distinct data points carry different weights or relevance. It is calculated by multiplying each value by its corresponding weight and dividing the sum of these weighted values by the sum of the weights.
5. Trimmed Mean: The trimmed mean is a variation of the mean that reduces the impact of outliers. A fixed percentage of the highest and lowest values is removed before computing the mean of the remaining values. The trimmed mean is beneficial when the data contains severe outliers that would greatly distort the mean. All five measures are sketched in the code below.
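Here is a compact Python sketch of the five measures, assuming NumPy and SciPy (1.9 or later, for keepdims) are installed; the data, weights, and trim fraction are illustrative, and the 95 is a deliberate outlier:

```python
import numpy as np
from scipy import stats

data = np.array([3, 5, 5, 6, 7, 8, 9, 95])   # 95 is a deliberate outlier

print("mean:         ", np.mean(data))        # 17.25, pulled up by the outlier
print("median:       ", np.median(data))      # 6.5, robust to the outlier
print("mode:         ", stats.mode(data, keepdims=False).mode)  # 5

# Weighted mean: the middle values get twice the weight here.
weights = np.array([1, 1, 1, 2, 2, 2, 1, 1])
print("weighted mean:", np.average(data, weights=weights))

# Trimmed mean: drop the lowest and highest 12.5% before averaging.
print("trimmed mean: ", stats.trim_mean(data, 0.125))
```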
The Central Limit Theorem
The central limit theorem in statistics states that, given a sufficiently large sample size, the sampling distribution of the mean will approximate a normal distribution regardless of the variable’s own distribution. Let me put the essence of the above statement in plain words.
The data might follow any distribution. It could be perfectly normal or skewed, exponential, or (almost) any distribution you can think of. However, if you repeatedly take samples from the population and keep plotting the histogram of their means, you will eventually find that this new distribution of all the means resembles the Normal Distribution!
In essence, it doesn’t matter what distribution your data is in; the distribution of their means will always be normal.
But how large a sample is needed for the CLT to hold? The rule of thumb says the sample size should be >30. So if each sample contains 30 or more observations, the sample means will be approximately normally distributed no matter the underlying distribution type, as the simulation below illustrates.
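Here is a minimal NumPy sketch of the CLT in action (the seed, sample size, and sample count are arbitrary): the exponential distribution is heavily skewed, yet its sample means come out far more symmetric:

```python
import numpy as np

rng = np.random.default_rng(1)
sample_size = 30        # the ">30" rule of thumb from above
n_samples = 10_000

# Draw many samples from an exponential population and record each mean.
means = rng.exponential(scale=1.0, size=(n_samples, sample_size)).mean(axis=1)

print("mean of means:", means.mean())   # ~1.0, the population mean

# Compare the skewness of raw exponential data with that of the means.
raw = rng.exponential(scale=1.0, size=n_samples)
for name, v in (("raw", raw), ("means", means)):
    skew = ((v - v.mean()) ** 3).mean() / v.std() ** 3
    print(f"skewness of {name}: {skew:.2f}")   # raw ~2, means much closer to 0
```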
The Central Limit Theorem has major ramifications for hypothesis testing and parameter estimation. Many statistical tests and estimation procedures rest on the assumption of a normally distributed sampling distribution, which the Central Limit Theorem frequently delivers. Based on sample statistics, we can therefore make reasonable inferences about the parameters of the population.
Standard Deviation vs Standard Error
Standard Deviation and Standard Error are often confused with one another. Standard Deviation, as you might know, quantifies the variation in the data on both sides of the mean, below it and above it. If your data points are spread across a large range of values, the standard deviation will be high.
Now, as we discussed above, by Central Limit Theorem, if we plot the means of all the samples from a population, the distribution of those means will again be a normal distribution. So it will have its own standard deviation, right?
The standard deviation of the means of all samples from a population is called the Standard Error. The Standard Error will usually be less than the Standard Deviation, because you are calculating the standard deviation of means, and means are less spread out than individual data points due to aggregation. In fact, for samples of size n, the Standard Error of the mean equals the Standard Deviation divided by √n.
You can even calculate the standard deviation of medians, modes, or even the standard deviation of standard deviations!
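To see the simulated Standard Error and the σ/√n formula agree, here is a small NumPy sketch (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
sigma, n, n_samples = 10.0, 50, 20_000

# Draw many samples of size n, then take the std of their means.
samples = rng.normal(loc=100.0, scale=sigma, size=(n_samples, n))
standard_error = samples.mean(axis=1).std()

print("simulated SE: ", round(standard_error, 3))
print("sigma/sqrt(n):", round(sigma / np.sqrt(n), 3))   # ~1.414, they match
```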
Conclusion
Statistical concepts form the real core of Data Science and ML. To be able to make valid deductions and understand the data at hand effectively, you need a solid understanding of the statistical and probability concepts discussed in this tutorial.
upGrad provides an Executive PG Programme in Machine Learning & AI and a Master of Science in Machine Learning & AI that may guide you toward building a career. These courses will explain the need for Machine Learning and the further steps to gather knowledge in this domain, covering varied concepts ranging from Gradient Descent to Machine Learning algorithms.