Statistics for Machine Learning: Everything You Need to Know

Q: 1. Is knowledge of statistics mandatory for doing well in machine learning?

Statistics is a vast field. In machine learning, it helps you understand the data deeply. Concepts such as probability and data interpretation underpin several machine learning algorithms. However, you do not have to be an expert in every area of statistics to do well in machine learning; a solid grasp of the fundamental concepts is enough to work efficiently.

Q: 2. Will knowing some coding beforehand be helpful in machine learning?

Coding is the heart of machine learning, and programmers who understand how to code well will have a deep understanding of how the algorithms function and, thus, will be able to monitor and optimize those algorithms more effectively. You do not need to be an expert in any programming language, although any prior knowledge will be beneficial. If you are a beginner, Python is a good choice since it is simple to learn and has a user-friendly syntax.

Q: 3. How do we use calculus in everyday life?

Weather forecasts are based on a number of variables, such as wind speed, moisture content, and temperature, and the models that relate these variables rely on calculus. Calculus is also used in aviation engineering in a variety of ways, and by the vehicle industry to improve vehicle safety. Credit card companies also use it in payment calculations.

By Pavan Vadapalli

Updated on Feb 24, 2025 | 9 min read | 6.1k views

Statistics and Probability form the core of Machine Learning and Data Science. It is statistical analysis, coupled with computing power and optimization, that makes Machine Learning capable of what it achieves today. From the basics of probability to descriptive and inferential statistics, these topics form the foundation of Machine Learning.

By the end of this tutorial, you will know the following:

Probability Basics
Probability Distributions
Normal Distribution
Measures of Central Tendency
Central Limit Theorem
Standard Deviation & Standard Error
Skewness & Kurtosis

Probability Basics

Independent and Dependent events

Consider two events, A and B. When the probability of A occurring does not depend on whether B occurs, A and B are independent events. For example, if you toss two fair coins, the probability of heads on each coin is 0.5 no matter how the other coin lands, so the two tosses are independent. The probability of heads on both coins is therefore 0.5 × 0.5 = 0.25.

Now consider a box containing 5 balls: 2 black and 3 red. The probability of drawing a black ball first is 2/5. If that ball is not put back, only 1 black ball remains among the 4 left, so the probability of drawing a black ball again is 1/4. Had the first ball been red, it would be 2/4 instead. The two events are dependent because the probability of drawing a black ball the second time depends on which ball was drawn first.
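The two examples above can be checked by listing every equally likely outcome. The following is a minimal sketch, assuming the balls are drawn without replacement; the labels B1, B2, R1, R2, R3 and all variable names are made up for illustration. It tests independence with the product rule P(A and B) = P(A) × P(B).

```python
from fractions import Fraction
from itertools import permutations, product

# Two fair coins: all four outcomes are equally likely.
coin_outcomes = list(product("HT", repeat=2))
n_coin = len(coin_outcomes)
p_first_heads = Fraction(sum(o[0] == "H" for o in coin_outcomes), n_coin)
p_second_heads = Fraction(sum(o[1] == "H" for o in coin_outcomes), n_coin)
p_both_heads = Fraction(sum(o == ("H", "H") for o in coin_outcomes), n_coin)

# Independent events satisfy P(A and B) == P(A) * P(B).
coins_independent = p_both_heads == p_first_heads * p_second_heads

# Box with 2 black and 3 red balls (hypothetical labels), two draws
# without replacement: every ordered pair of distinct balls is equally likely.
balls = ["B1", "B2", "R1", "R2", "R3"]
draws = list(permutations(balls, 2))
n_draws = len(draws)
p_first_black = Fraction(sum(d[0][0] == "B" for d in draws), n_draws)
p_second_black = Fraction(sum(d[1][0] == "B" for d in draws), n_draws)
p_both_black = Fraction(
    sum(d[0][0] == "B" and d[1][0] == "B" for d in draws), n_draws
)

# The product rule fails here, so the two draws are dependent.
balls_independent = p_both_black == p_first_black * p_second_black

print("coins independent:", coins_independent)
print("balls independent:", balls_independent)
```

Exact fractions are used instead of floats so that the equality checks are not affected by rounding.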

Marginal Probability

It’s the probability of an event irrespective of the outcomes of other random variables, e.g. P(A) or P(B).

Joint Probability

It’s the probability of two different events occurring at the same time, i.e., two (or more) simultaneous events, e.g. P(A and B) or P(A, B).

Conditional Probability

It’s the probability of one (or more) events given the occurrence of another event. In other words, it is the probability of an event A occurring given that another event B has occurred, e.g. P(A given B) or P(A | B). It can be computed as P(A | B) = P(A and B) / P(B), provided P(B) is greater than 0.
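To make the three definitions concrete, here is a small sketch that stores a joint distribution as a dictionary and derives marginal and conditional probabilities from it. The numbers are those of the 2-black, 3-red box drawn without replacement; the function names are hypothetical helpers, not part of any library.

```python
from fractions import Fraction

# Joint distribution P(first colour, second colour) for two draws
# without replacement from a box with 2 black and 3 red balls.
joint = {
    ("black", "black"): Fraction(1, 10),
    ("black", "red"): Fraction(3, 10),
    ("red", "black"): Fraction(3, 10),
    ("red", "red"): Fraction(3, 10),
}

def marginal_first(colour):
    """P(first = colour): sum the joint over every second colour."""
    return sum(p for (first, _), p in joint.items() if first == colour)

def marginal_second(colour):
    """P(second = colour): sum the joint over every first colour."""
    return sum(p for (_, second), p in joint.items() if second == colour)

def conditional_second_given_first(second, first):
    """P(second | first) = P(first, second) / P(first)."""
    return joint[(first, second)] / marginal_first(first)

print("P(second black) =", marginal_second("black"))
print("P(both black) =", joint[("black", "black")])
print("P(second black | first black) =",
      conditional_second_given_first("black", "black"))
```

Note that the marginal probability of a black second ball is still 2/5; only the conditional probability changes once the first draw is known.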

Probability Distributions

A probability distribution describes how likely each value in a sample space is. It tells us the probability of obtaining particular values when we sample at random from the population. For example, if a population consists of the marks of students in a school, we can plot the marks on the X-axis and the number of students with each mark on the Y-axis. This plot is called a histogram. Dividing each count by the total number of students turns the histogram into an empirical discrete probability distribution. The main types of discrete distribution are the Binomial, Poisson and (discrete) Uniform distributions.
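As a rough illustration of the marks example, the sketch below counts how often each mark occurs and then normalizes the counts into probabilities. The list of marks is invented purely for demonstration.

```python
from collections import Counter

# Hypothetical marks for ten students.
marks = [55, 60, 60, 70, 70, 70, 80, 80, 90, 90]

# Histogram: number of students per mark.
counts = Counter(marks)

# Empirical probability distribution: each count divided by the total.
total = len(marks)
distribution = {mark: count / total for mark, count in sorted(counts.items())}

for mark, probability in distribution.items():
    print(mark, probability)
```

The probabilities sum to 1, which is what separates a probability distribution from a plain histogram of counts.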

On the other hand, a Continuous Probability Distribution describes data that takes continuous values, that is, any of infinitely many values within a range, such as height, speed or temperature. Continuous probability distributions are widely used in Data Science and statistical analysis, for example when checking feature importance, examining data distributions and running statistical tests.

In addition to the discrete probability distributions mentioned above (Binomial, Poisson, and Uniform), a few more discrete distributions are often used in statistics for machine learning.

The Bernoulli distribution is a discrete probability distribution for a binary outcome: the random variable has only two possible values, often labeled 0 and 1. It is typically used to model success or failure in a single trial.

The geometric distribution models the number of trials needed to get the first success in a sequence of independent Bernoulli trials that all have the same probability of success.

The negative binomial distribution models the total number of trials required to reach a specified number of successes in a sequence of independent Bernoulli trials. It generalizes the geometric distribution by allowing any fixed number of successes rather than just the first.
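The three distributions above have short closed-form probability mass functions. The sketch below writes them out in plain Python, assuming the "number of trials" convention used in the text (some libraries count failures instead of trials). The function names are illustrative only.

```python
from math import comb

def bernoulli_pmf(k, p):
    """P(X = k) for one trial with success probability p, k in {0, 1}."""
    if k not in (0, 1):
        return 0.0
    return p if k == 1 else 1.0 - p

def geometric_pmf(n, p):
    """P(the first success happens on trial n), n = 1, 2, ..."""
    if n < 1:
        return 0.0
    return (1.0 - p) ** (n - 1) * p

def negative_binomial_pmf(n, r, p):
    """P(the r-th success happens on trial n), n = r, r + 1, ..."""
    if r < 1 or n < r:
        return 0.0
    # The last trial is a success; the other r - 1 successes can fall
    # anywhere among the first n - 1 trials.
    return comb(n - 1, r - 1) * p ** r * (1.0 - p) ** (n - r)

p = 0.5
print(bernoulli_pmf(1, p))
print(geometric_pmf(3, p))
print(negative_binomial_pmf(3, 2, p))
```

With r = 1 the negative binomial formula reduces to the geometric one, which is the sense in which it generalizes the geometric distribution.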

Normal Distribution

The most well-known continuous distribution is the Normal Distribution, also known as the Gaussian distribution or the “Bell Curve.”

Consider a normal distribution of people’s heights. Most heights cluster around the middle, where the curve is tallest. The curve falls away gradually towards the left and right extremes, which means values there are less likely to be sampled at random.

The curve is centred at its mean and can be tall and slim or short and spread out. A slim curve means the values are concentrated in a narrow range around the mean, while a more spread-out curve means the values cover a wider range. This spread is measured by the Standard Deviation.

The greater the Standard Deviation, the more spread out your data is. Standard Deviation is simply the square root of another property called the Variance, which measures how much the data ‘varies’. Variance is what data is all about: variance is information. No variance, no information. The Normal Distribution also plays a crucial role in statistics through the Central Limit Theorem.

It is important to mention that normal distribution is essential to statistical learning in AI. Many methods for statistical learning in machine learning algorithms assume or attempt to approximate the normal distribution.

The 68-95-99.7 rule, commonly known as the empirical rule or the three-sigma rule, is an essential characteristic of the normal distribution. According to this rule, around 68% of the data lies within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations. The rule is a useful guide for understanding how data is distributed and for spotting outliers.
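The percentages in the rule can be reproduced from the normal distribution itself. For a normal variable, the probability of falling within k standard deviations of the mean equals erf(k / √2), where erf is the error function from Python's standard math module. The helper name below is just an illustrative choice.

```python
from math import erf, sqrt

def prob_within(k):
    """Probability that a normal variable lies within k standard
    deviations of its mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
```

A value more than three standard deviations from the mean is therefore expected only about 0.3% of the time under normality, which is why it is often treated as a candidate outlier.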

Measures of Central Tendency

Measures of Central Tendency summarize a dataset with a single representative value. There are 3 main Measures of Central Tendency:

1. Mean: The mean is the arithmetic mean, or average, of the values in the data/feature. The sum of all values divided by the number of values gives the mean. It is the most common way to measure the centre of data, but it can be misleading in some cases. For example, when the data contains extreme outliers, the mean shifts towards them and becomes a poor measure of the centre of your data.

2. Median: Median is the data point that lies exactly in the centre when the data is sorted in increasing or decreasing order. When the number of data points is odd, then the median is easily picked as the centre most point. When the number of data points is even, then the median is calculated as the mean of the 2 centre most data points.

3. Mode: Mode is the data point that occurs most frequently in a dataset. The mode is robust to outliers, since a few extreme values do not change which value occurs most often.

In addition to the mean, median, and mode, two further measures of central tendency can give insights into the data.

4. Weighted Mean: When data points have different weights or relevance, the weighted mean is used. It is computed by multiplying each value by its weight, summing these products, and dividing by the sum of the weights.

5. Trimmed Mean: A trimmed mean is a variation of the mean that reduces the impact of outliers. A fixed percentage of the highest and lowest values is removed, and the mean of the remaining values is computed. It is useful when the data contains severe outliers that would greatly distort the ordinary mean.
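The five measures above can be sketched in a few lines of Python. The data below is invented and deliberately includes one outlier; `weighted_mean` and `trimmed_mean` are hypothetical helper functions written for this example, while `mean`, `median` and `mode` come from the standard `statistics` module.

```python
from statistics import mean, median, mode

# Hypothetical data with a single extreme outlier (100).
values = [2, 3, 3, 4, 5, 6, 100]

def weighted_mean(data, weights):
    """Sum of value * weight divided by the sum of the weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(data, weights)) / total_weight

def trimmed_mean(data, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(data)
    k = int(len(ordered) * proportion)  # values to drop per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

print("mean:", mean(values))
print("median:", median(values))
print("mode:", mode(values))
print("weighted mean:", weighted_mean([80, 90], [1, 3]))
print("trimmed mean:", trimmed_mean(values, 0.15))
```

On this data the mean is pulled far above most of the values by the outlier, while the median, mode and trimmed mean stay close to the bulk of the data.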

Central Limit Theorem

The central limit theorem in statistics states that, given a sufficiently large sample size, the sampling distribution of the mean will approximate a normal distribution regardless of the distribution of the underlying variable, provided that distribution has a finite variance. Let me put the essence of that statement in plain words.

The data might follow any distribution. It could be normal, skewed, exponential or (almost) any distribution you can think of. However, if you repeatedly take samples from the population and keep plotting the histogram of their means, you will eventually find that this new distribution of the means resembles the Normal Distribution!

In essence, it matters little what distribution your data follows: for large enough samples, the distribution of the sample means will be approximately normal.

But how large must each sample be for the CLT to hold? The rule of thumb is a sample size of at least 30. So if each sample contains 30 or more observations, the sample means will be approximately normally distributed for most underlying distributions, although heavily skewed or heavy-tailed data may need larger samples.

The Central Limit Theorem has major implications for hypothesis testing and parameter estimation. Many statistical tests and estimation procedures assume a normally distributed sampling distribution, an assumption that often holds thanks to the Central Limit Theorem. This lets us draw reasonable inferences about population parameters from sample statistics.
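The Central Limit Theorem is easy to observe in a small simulation. The sketch below draws repeated samples of size 30 from an exponential distribution, which is strongly skewed, and looks at how the sample means behave. The seed, sample size and number of samples are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable

def sample_means(sample_size, num_samples):
    """Means of repeated samples from an exponential population
    with rate 1 (population mean 1, standard deviation 1)."""
    return [
        mean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(num_samples)
    ]

means = sample_means(sample_size=30, num_samples=2000)
centre = mean(means)
spread = stdev(means)

# For a roughly normal shape, about 95% of the sample means should
# fall within two standard deviations of their centre.
share_within_two = sum(abs(m - centre) <= 2 * spread for m in means) / len(means)

print("mean of sample means:", round(centre, 3))
print("std dev of sample means:", round(spread, 3))
print("share within 2 std devs:", round(share_within_two, 3))
```

Even though individual exponential values are far from bell-shaped, the sample means cluster around the population mean with a spread close to 1/√30.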

Standard Deviation & Standard Error

Standard Deviation and Standard Error are often confused with one another. Standard Deviation, as you might know, describes or quantifies the variation in the data on both sides of the distribution – lower than mean and greater than mean. If your data points are spread across a large range of values, the standard deviation will be high.

Now, as discussed above, by the Central Limit Theorem, if we plot the means of many samples from a population, the distribution of those means will be approximately normal. So it will have its own standard deviation, right?

The standard deviation of the sample means is called the Standard Error. For samples of size n, it equals the population standard deviation divided by the square root of n, so it is smaller than the Standard Deviation whenever a sample has more than one observation: means are less spread out than individual data points because of the aggregation.

You can even calculate the standard deviation of medians, of modes, or even the standard deviation of standard deviations!
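The difference between the two quantities shows up clearly in a simulation. The sketch below estimates the Standard Error from a single sample as s / √n and compares it with the standard deviation of many sample means. The normal population with mean 50 and standard deviation 10, the seed and the sizes are all assumptions made for the example.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(1)  # fixed seed so the run is repeatable
POP_MEAN, POP_SD = 50.0, 10.0  # hypothetical population parameters
SAMPLE_SIZE = 100

def draw_sample(size):
    """One random sample from the hypothetical normal population."""
    return [rng.gauss(POP_MEAN, POP_SD) for _ in range(size)]

def standard_error(sample):
    """Estimated standard error of the mean: s / sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))

# Standard deviation and estimated standard error from one sample.
sample = draw_sample(SAMPLE_SIZE)
sd_of_sample = stdev(sample)
se_estimate = standard_error(sample)

# Standard deviation of the means of many samples.
sample_means = [mean(draw_sample(SAMPLE_SIZE)) for _ in range(1000)]
se_empirical = stdev(sample_means)

print("standard deviation of one sample:", round(sd_of_sample, 2))
print("estimated standard error:", round(se_estimate, 2))
print("std dev of 1000 sample means:", round(se_empirical, 2))
```

The standard deviation stays near the population value however large the sample is, while the standard error shrinks as the sample size grows.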

Before You Go

Statistical concepts form the real core of Data Science and ML. To be able to make valid deductions and understand the data at hand effectively, you need to have a solid understanding of the statistical and probability concepts discussed in this tutorial.

upGrad provides an Executive PG Programme in Machine Learning & AI and a Master of Science in Machine Learning & AI that may guide you toward building a career. These courses explain the need for Machine Learning and the further steps to build knowledge in this domain, covering varied concepts such as Gradient Descent.