What is the EM Algorithm in Machine Learning? [Explained with Examples]
Updated on Sep 23, 2022 | 7 min read | 10.5k views
The EM algorithm, or Expectation-Maximization algorithm, is an iterative method for fitting latent variable models; it was proposed by Arthur Dempster, Nan Laird, and Donald Rubin in 1977.
In machine learning applications, a few relevant variables in a data set may go unobserved during learning. The Expectation-Maximization, or EM, algorithm offers a way to estimate such latent variables from the observed data. To see why it is needed, start with the main estimation problems that arise in this context.
In statistical modeling, one of the most common problems is estimating the joint probability distribution of a data set. Probability density estimation is the construction of such an estimate from the observed data: it involves selecting a probability distribution function, and the parameters of that function, that best explain the joint probability of the observed data. The main problem with density estimation is that the sample may be incomplete, with some relevant variables never observed, so the chosen distribution and parameters must be inferred from partial information.
A common technique for solving this problem is Maximum Likelihood Estimation, usually just called "maximum likelihood".
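As a hedged illustration (the numbers below are made up, not from the article), maximum likelihood estimation for a single Gaussian has a closed-form answer: the sample mean and the biased sample standard deviation maximize the likelihood.

import numpy as np

# Hypothetical fully observed sample of a single real-valued variable.
observations = np.array([4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 5.0])

# For a Gaussian, the MLE of the mean is the sample mean, and the MLE of the
# variance is the average squared deviation (ddof=0, the biased estimator).
mu_hat = observations.mean()
sigma_hat = observations.std(ddof=0)

print(f"MLE mean: {mu_hat:.3f}, MLE std dev: {sigma_hat:.3f}")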
A latent variable model comprises observable variables and unobservable variables. Observed variables are those that can be measured whereas unobserved (latent/hidden) variables are inferred from observed variables.
As Dempster, Laird, and Rubin showed, the EM algorithm can be used to find local maximum likelihood (MLE) or maximum a posteriori (MAP) estimates of the parameters of a statistical model that involves latent variables (unobservable variables that must be inferred from observable ones). It can also be used to fill in missing or incomplete data, provided the general form of the probability distribution associated with these latent variables is known.
To put it simply, the EM algorithm alternates between using the observed data and the current parameters to infer values for the latent variables, and using those inferred values to update the parameters. This is repeated until the values converge.
The algorithm is a rather powerful tool in machine learning and underlies many unsupervised methods; the k-means clustering algorithm, for example, can be viewed as a hard-assignment variant of EM.
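For instance, scikit-learn ships an EM-based Gaussian mixture estimator alongside k-means; the short sketch below is only illustrative, and the synthetic data and component counts are assumptions rather than anything from the article.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussian components, reshaped to (n_samples, 1).
X = np.concatenate([rng.normal(0.0, 1.0, 200),
                    rng.normal(5.0, 1.5, 200)]).reshape(-1, 1)

# GaussianMixture fits means, variances and mixing weights with EM.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("GMM means:", gmm.means_.ravel())

# k-means behaves like a hard-assignment special case of the same idea.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("k-means centers:", km.cluster_centers_.ravel())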
Join the Machine Learning Course online from the World’s top Universities – Masters, Executive Post Graduate Programs, and Advanced Certificate Program in ML & AI to fast-track your career.
Let’s explore the mechanism of the Expectation-Maximization algorithm in Machine Learning:
1. Initialize the parameters with rough starting estimates.
2. E-step: use the observed data and the current parameters to estimate the expected values of the latent variables.
3. M-step: re-estimate the parameters by maximizing the likelihood, treating the E-step estimates as if they were observed.
4. Check whether the likelihood (or the parameter values) has converged.
Steps 2 and 3 are repeated until convergence; that is, if the values have not converged, we run the E-step and the M-step again. A bare-bones sketch of this loop follows.
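As a toy illustration of this loop (a sketch with made-up numbers, not the article's example), consider estimating the mean of a Gaussian when a few draws are missing: the E-step fills each missing value with its expected value under the current estimate, and the M-step re-maximizes the likelihood on the completed data.

import numpy as np

# Observed draws from a Gaussian with unknown mean; three more draws are missing.
observed = np.array([2.1, 1.8, 2.4, 2.0])
n_missing = 3

mu = 0.0                                   # step 1: initialize the parameter
for _ in range(100):
    # step 2 (E-step): replace each missing value with its expected value, mu
    filled = np.concatenate([observed, np.full(n_missing, mu)])
    # step 3 (M-step): maximize the likelihood given the completed data
    new_mu = filled.mean()
    # step 4: stop once the estimate has converged
    converged = abs(new_mu - mu) < 1e-6
    mu = new_mu
    if converged:
        break

print(f"Estimated mean: {mu:.4f}")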
In statistics, maximum likelihood estimation is a method for estimating the parameters of a probability distribution. It works by maximizing a likelihood function, so that under the fitted model the observed data are as probable as possible.
However, maximum likelihood comes with a big limitation: it assumes that the data are complete and fully observed. It does not check whether the model can actually access all of the relevant data; it simply assumes that every variable relevant to the model is already present in the sample. In reality, some relevant variables may remain hidden, leading to inconsistencies. Such hidden variables are termed latent variables.
In the presence of a latent variable, the traditional maximum likelihood estimator will not behave as you anticipate. Appropriate model parameters can still be found in this setting by employing the Expectation-Maximization, or EM, algorithm.
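As a rough, synthetic illustration (the numbers are assumptions, not from the article), fitting a single Gaussian by maximum likelihood to data that secretly comes from two hidden components places the estimated mean between the two clusters and inflates the spread, describing neither component well:

import numpy as np

rng = np.random.default_rng(1)
# The component label is the latent variable; it is never observed.
component = rng.integers(0, 2, size=500)
data = np.where(component == 0,
                rng.normal(0.0, 1.0, 500),
                rng.normal(6.0, 1.0, 500))

# Plain maximum likelihood, ignoring the latent label, fits one Gaussian.
mu_hat, sigma_hat = data.mean(), data.std(ddof=0)
print(f"Single-Gaussian MLE: mean={mu_hat:.2f}, std={sigma_hat:.2f}")
# The mean lands near 3 and the spread is inflated; the hidden structure is lost.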
Advantages of the EM Algorithm
1. Every iteration of the EM algorithm results in a guaranteed increase in the likelihood.
2. The Expectation step and the Maximization step are relatively easy, and the solution for the latter mostly exists in closed form.
Disadvantages of the EM Algorithm
1. The Expectation-Maximization algorithm takes both forward and backward probabilities into account, in contrast with numerical optimization, which considers only the forward probabilities.
2. Convergence of the EM algorithm is very slow and is only guaranteed to reach a local optimum.
The latent variable model has plenty of real-world applications in machine learning.
Let us understand the EM algorithm using a Gaussian Mixture Model.
To estimate the parameters of a Gaussian Mixture Model, we will need some observed variables generated by two separate processes whose probability distributions are known. However, the data points of the two processes are combined and we do not know which distribution they belong to.
We aim to estimate the parameters of these distributions using maximum likelihood estimation via the EM algorithm, as explained above.
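For instance, such observations could be simulated as draws from two Gaussian processes that are then shuffled together; the means, spreads, and sample sizes below are illustrative assumptions, not values from the article.

import numpy as np

rng = np.random.default_rng(42)

# Two hidden processes with known distribution families...
x_a = rng.normal(loc=2.0, scale=0.8, size=300)   # process A
x_b = rng.normal(loc=7.0, scale=1.5, size=200)   # process B

# ...but the observations arrive mixed together, with no labels attached.
x = np.concatenate([x_a, x_b])
rng.shuffle(x)
x = list(x)   # the estimate_gmm sketch below expects a plain list of reals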
Here is the code we will use:
# Given a function that computes the density of a Gaussian at point x_i
# for parameters mu, sigma: G(x_i, mu, sigma); and another function that
# computes the log-likelihood of the data: L(x, mu, sigma, pi)
import numpy as np
from numpy.random import rand

def estimate_gmm(x, K, tol=0.001, max_iter=100):
    ''' Estimate GMM parameters.
    :param x: list of observed real-valued variables
    :param K: integer for the number of Gaussians
    :param tol: tolerated change in log-likelihood
    :return: mu, sigma, pi parameters
    '''
    # 0. Initialize theta = (mu, sigma, pi) on the scale of the data
    N = len(x)
    mu = [x[int(rand() * N)] for _ in range(K)]   # random observations as initial means
    sigma = [float(np.std(x))] * K                # shared, data-scale initial spread
    pi = [1.0 / K] * K                            # uniform initial mixing weights
    curr_L = np.inf
    for j in range(max_iter):
        prev_L = curr_L
        # 1. E-step: responsibility r[(i, k)] = p(z_i = k | x_i, theta^(t-1))
        r = {}
        for i in range(N):
            parts = [pi[k] * G(x[i], mu[k], sigma[k]) for k in range(K)]
            total = sum(parts)
            for k in range(K):
                r[(i, k)] = parts[k] / total
        # 2. M-step: update mu, sigma, pi using the responsibilities
        rk = [sum(r[(i, k)] for i in range(N)) for k in range(K)]
        for k in range(K):
            pi[k] = rk[k] / N
            mu[k] = sum(r[(i, k)] * x[i] for i in range(N)) / rk[k]
            # standard deviation = square root of the responsibility-weighted variance
            sigma[k] = (sum(r[(i, k)] * (x[i] - mu[k]) ** 2 for i in range(N)) / rk[k]) ** 0.5
        # 3. Check exit condition: stop once the log-likelihood stops changing
        curr_L = L(x, mu, sigma, pi)
        if abs(prev_L - curr_L) < tol:
            break
    return mu, sigma, pi
In the E-step, we use Bayes’ theorem to compute, for each data point, the responsibility of each Gaussian component: the posterior probability that the point was drawn from that component, given the parameters from the previous iteration. In the M-step, we treat these responsibilities as fixed and re-estimate the parameters by maximum likelihood, using responsibility-weighted versions of the standard mean and standard deviation formulas, together with the mixing weights, to update the Gaussian mixture model.
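The estimate_gmm snippet assumes that the density function G and the log-likelihood function L already exist. One possible way to define them, and to run the estimator on the mixed sample x simulated earlier, is sketched below; these helper definitions are our own assumptions, not code from the article.

import math

def G(x_i, mu, sigma):
    # Gaussian density at x_i with mean mu and standard deviation sigma
    return math.exp(-0.5 * ((x_i - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def L(x, mu, sigma, pi):
    # Log-likelihood of the whole sample under the current mixture parameters
    return sum(math.log(sum(pi[k] * G(x_i, mu[k], sigma[k]) for k in range(len(pi))))
               for x_i in x)

mu, sigma, pi = estimate_gmm(x, K=2)
print("means:", mu, "std devs:", sigma, "mixing weights:", pi)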
This brings us to the end of the article. For more information on Machine Learning concepts, get in touch with the top faculty of IIIT Bangalore and Liverpool John Moores University through upGrad’s Master of Science in Machine Learning & AI program.
It is an 18-month course that offers 450+ hours of learning content, 12+ industry projects, 10 Capstone project options, and 10+ coding assignments. You also enjoy personalised mentorship from industry experts and career guidance counselling through live sessions. The next batch begins on Feb 28, 2021!