Understanding Gradient Descent in Logistic Regression: Guide for Beginners

By Pavan Vadapalli

Updated on Feb 13, 2025 | 19 min read

Gradient descent in logistic regression updates the weights by reducing the log-loss; for example, with a learning rate of 0.1, an initial weight of 0.5, and a gradient of −0.2, the new weight after one update is 0.5 − 0.1 × (−0.2) = 0.52.

 While the concept is simple, applying it correctly can be challenging, especially for beginners.

In this blog, you’ll walk through a Gradient Descent in Logistic Regression Example to better understand the process. By the end, you’ll have a clearer grasp of how to use gradient descent to improve model accuracy and performance efficiently.

Let’s get into the details!

Gradient Descent in Logistic Regression and How It Differs from Linear Regression

Logistic regression is a statistical method used for binary classification problems, where the goal is to predict one of two possible outcomes.

For example, logistic regression can predict whether an email is spam or not spam, or whether a patient has a disease. Based on input features, it estimates the probability of the positive class (e.g., disease = 1 or spam = 1). This makes it ideal for situations where you need to classify data into two categories.

How is Logistic Regression Different from Linear Regression?

While both logistic and linear regression are used for prediction tasks, they differ mainly in the type of output they generate.

  • Linear Regression: 

Outputs a continuous value ranging from negative to positive infinity. It’s used to predict numerical outcomes, such as house or stock prices.

  • Logistic Regression: 

Unlike continuous values, logistic regression outputs a probability between 0 and 1. This probability is then mapped to one of the two classes (e.g., 0 for non-disease, 1 for disease). The probability represents the likelihood of the event occurring, making logistic regression perfect for classification problems.

The sigmoid function is what turns the linear combination of features into a classification output: it squeezes raw values like 2.5 or -1.3 into probabilities between 0 and 1, such as 0.92 or 0.23.

Real-World Applications

  • Medical Diagnosis: Logistic regression helps predict the probability of a patient having a certain disease based on factors like age, weight, and symptoms.
  • Fraud Detection: It’s used in banking and finance to classify transactions as legitimate or fraudulent.
  • Email Spam Detection: Logistic regression classifies emails as either spam or non-spam based on content and sender features.

Why is Logistic Regression Popular?

  • Simplicity and Interpretability: Logistic regression’s straightforward math and quick implementation—like predicting if an email is spam based on word frequency—make it ideal for beginners.
  • Speed and Efficiency: It’s computationally efficient, even for large datasets, and it often performs well on binary classification problems.
  • Foundation of Machine Learning: Logistic regression serves as a foundation for many more complex machine learning algorithms, such as neural networks and support vector machines.

Build a strong foundation in machine learning with upGrad’s courses and master essential algorithms like logistic regression, neural networks, and support vector machines.

Also Read: Machine Learning vs Neural Networks: Understanding the Key Differences

Now that you have a solid understanding of logistic regression, let’s dive deeper into the key functions that power its predictions.

Logistic Regression: Detailed Explanation of Functions

In this section, we will break down the key functions involved in logistic regression, focusing on how they contribute to predicting probabilities and optimizing the model. These functions are essential for making accurate predictions and fine-tuning the model through processes like gradient descent in logistic regression.

Sigmoid Function

The sigmoid function is at the heart of logistic regression. It maps any real-valued number to a probability between 0 and 1, making it perfect for binary classification problems. The function is also known for its S-shaped curve, or "S-curve," which smoothly transitions between 0 and 1. 

This is ideal for predicting probabilities, which is the output required in logistic regression.

The following formula defines the sigmoid function:

σ(z) = 1 / (1 + e^(−z))

Where: 

  • z is the linear combination of the input features (e.g., z = w₁x₁ + w₂x₂ + b)
  • e is the base of the natural logarithm.

The shape of the sigmoid function is a smooth, continuous curve that transitions from 0 to 1. It never reaches 0 or 1, which is essential for logistic regression's probability output.

as z → −∞, σ(z) → 0
as z → +∞, σ(z) → 1

This function takes any input, applies the transformation, and outputs a value between 0 and 1, which is interpreted as a probability.
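As a quick sanity check, here is a minimal Python sketch (using NumPy; the input values are purely illustrative) showing how the sigmoid squashes a few raw scores into probabilities, including the 2.5 and -1.3 values mentioned above:

import numpy as np
# Sigmoid maps any real number to a value in (0, 1)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Large negative inputs approach 0, zero maps to 0.5, large positive inputs approach 1
for z in [-10, -1.3, 0, 2.5, 10]:
    print(f"sigmoid({z}) = {sigmoid(z):.4f}")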

Also Read: What Are Activation Functions in Neural Networks? Functioning, Types, Real-world Examples, Challenge

Cost Function

In machine learning, the cost function (or loss function) measures how well the model's predictions match the actual outcomes. We use the log loss function for logistic regression, which calculates the difference between predicted probabilities and the true labels.

Log loss is used instead of squared error loss because it heavily penalizes confident wrong predictions—like predicting 0.99 for a class that is actually 0—ensuring better probability calibration.

The goal is to minimize this cost by adjusting the model’s parameters (weights) through gradient descent in logistic regression.

The cost function used in logistic regression is defined as:

J(θ) = −(1/m) Σᵢ₌₁ᵐ [ yᵢ log(hθ(xᵢ)) + (1 − yᵢ) log(1 − hθ(xᵢ)) ]

Where:

  • m is the number of training examples,
  • yᵢ is the true label (0 or 1),
  • hθ(xᵢ) is the predicted probability for the i-th training example, calculated by the sigmoid function.

The cost function quantifies how far off the predicted probabilities are from the actual outcomes, with the goal being to reduce this error as much as possible.
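To make the log-loss concrete, here is a small illustrative calculation (the probabilities are made-up values): a confident correct prediction contributes almost nothing to the cost, while a confident wrong one is penalized heavily.

import numpy as np
# Log-loss for a single example: -[y*log(p) + (1-y)*log(1-p)]
def log_loss(y_true, y_pred):
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(log_loss(1, 0.9))   # ~0.105: true label 1 predicted with probability 0.9
print(log_loss(0, 0.99))  # ~4.605: true label 0 predicted with probability 0.99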

Why is the Cost Function Necessary?

The cost function is essential because it provides a way to evaluate and optimize the model's predictions. In logistic regression, we minimize this cost through gradient descent, iteratively adjusting the model parameters (weights) to improve the accuracy of our predictions. 

The lower the cost, the better the model fits the data.

Optimizing the Cost Function

To minimize the cost function, we use gradient descent, a technique that iteratively adjusts the weights based on the gradient of the cost function. 

Each step of gradient descent moves the model's parameters in the direction that reduces the cost, eventually leading to the best possible model.

  • Gradient Descent Update Rule:
θ := θ − α · ∂J(θ)/∂θ
Where α is the learning rate (step size), and ∂J(θ)/∂θ is the gradient of the cost function. Iterating these steps, logistic regression finds parameters that minimize error, improving prediction accuracy.
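For instance, revisiting the opening example: with θ = 0.5, a gradient of −0.2, and a learning rate α = 0.1, one update gives θ_new = 0.5 − 0.1 × (−0.2) = 0.52.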

Also Read: An Introduction to Feedforward Neural Network: Layers, Functions & Importance

Now that we’ve covered the key functions, let’s bring it all together with a practical Gradient Descent in Logistic Regression Example to see how these concepts work in action.

Gradient Descent in Logistic Regression Example

In this section, you’ll take a closer look at how gradient descent in logistic regression is used to optimize the parameters (weights) of the model.   

The goal of gradient descent is to find the set of model parameters that minimize the cost function in logistic regression. It helps us adjust the parameters to make our predictions as accurate as possible.

Formula: 

The gradient descent update rule is as follows:

θ := θ − α · ∂J(θ)/∂θ

Where:

  • θ represents the parameters (weights) of the model,
  • α is the learning rate (determines the step size),
  • ∂J(θ)/∂θ is the gradient of the cost function (how much the cost changes as we adjust each parameter).

Steps Involved:

  • Initialize θ: Start with initial random values for the weights (parameters).
  • Compute the gradient of the cost function: Calculate how the cost function changes with respect to each parameter.
  • Update θ: Adjust the parameters opposite to the gradient to reduce the cost.
  • Repeat until convergence: Continue updating θ until the cost function converges, meaning there is minimal change between iterations.

Pros and Cons:

  • Pros:
    • Simple to understand and implement.
    • Efficient for small datasets.
  • Cons:
    • Can be slow, especially for large datasets.
    • Sensitive to the choice of learning rate (too high or too low can lead to poor performance).

The idea is to start with an initial set of parameters and gradually adjust them based on the gradient of the cost function. We optimize the model by moving in the direction that reduces the cost.

  • Key Points:
    • It is an iterative process, meaning the parameters are updated multiple times until the model converges to the best solution.
    • The learning rate determines how big a step is taken at each iteration. Too small a learning rate makes the process slow, while too large a rate can overshoot the optimal solution.

To understand it better, let’s implement gradient descent in logistic regression in Python. We'll use a small dataset for simplicity. 

 We have a dataset with two features:

  1. Age: Age of the patient.
  2. Blood Pressure: Blood pressure measurement.

Our task is to predict whether a patient has a disease (1) or not (0) based on these two features. 

import numpy as np
# Define sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Define cost function
def cost_function(X, y, theta):
    m = len(y)
    predictions = sigmoid(X.dot(theta))
    cost = (-y.dot(np.log(predictions)) - (1 - y).dot(np.log(1 - predictions))) / m
    return cost
# Gradient descent function
def gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)
    cost_history = np.zeros(iterations)
    for i in range(iterations):
        predictions = sigmoid(X.dot(theta))
        error = predictions - y
        theta -= (alpha / m) * X.T.dot(error)
        cost_history[i] = cost_function(X, y, theta)
    return theta, cost_history
# Real-life dataset: Patient's Age, Blood Pressure, and Disease Outcome (0 or 1)
X = np.array([[1, 55, 120], [1, 60, 130], [1, 65, 140], [1, 70, 160]])  # Features matrix with intercept term
y = np.array([0, 0, 1, 1])  # Labels: 0 = No disease, 1 = Disease
# Initialize parameters and settings
theta = np.zeros(X.shape[1])  # Initial weights (theta)
alpha = 0.01  # Learning rate
iterations = 1000  # Number of iterations
# Perform gradient descent
theta_optimal, cost_history = gradient_descent(X, y, theta, alpha, iterations)
# Output the results
print("Optimal parameters (theta):", theta_optimal)
print("Final cost:", cost_history[-1]) 

Output:

Optimal parameters (theta): [-6.52030656  0.14818361  0.06092552]
Final cost: 0.48671991439924965

Explanation:

  • Dataset: We used a small dataset with patients' age and blood pressure as features. The output labels (0 or 1) represent whether the patient has a disease.
  • Sigmoid Function: The sigmoid() function calculates the probability that the patient has the disease (based on their age and blood pressure). The output is a probability between 0 and 1.
  • Cost Function: The cost_function() calculates the error between the predicted probability and the actual disease outcome (0 or 1). It uses the log loss function, which is commonly used in binary classification tasks like this.
  • Gradient Descent: The gradient_descent() function iteratively updates the model's parameters (weights) to minimize the cost function, which helps the model make better predictions.

After running the algorithm for 1000 iterations, we obtain the optimal parameters (theta values). These parameters represent the intercept and the age and blood pressure coefficients that best fit the data. The final cost value shows how well the model has learned to predict the disease outcome. In practice, standardizing features such as age and blood pressure before training helps gradient descent converge faster and more reliably.

In healthcare, this Gradient Descent in Logistic Regression Example can be used to develop a model that predicts the likelihood of a patient having a disease based on medical test results. This allows healthcare providers to make more informed decisions. 

The gradient descent optimization ensures that the model's parameters are adjusted to minimize prediction errors, leading to better and more accurate outcomes.
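As a short follow-up (a sketch that assumes sigmoid and theta_optimal from the snippet above are still in scope, and uses a made-up new patient), the fitted parameters can be turned into a probability for an unseen case:

# Score a new, hypothetical patient with the learned parameters
# probability = sigmoid(theta0 + theta1 * age + theta2 * blood_pressure)
new_patient = np.array([1, 62, 135])  # intercept term, age, blood pressure (illustrative values)
probability = sigmoid(new_patient.dot(theta_optimal))
print("Predicted probability of disease:", round(float(probability), 3))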

Now that we’ve seen how gradient descent works in logistic regression, let’s explore Stochastic Gradient Descent and how it speeds up the process.

Stochastic Gradient Descent Algorithm

Stochastic Gradient Descent (SGD) is a variation of gradient descent that can significantly speed up the training process, especially when working with large datasets. 

Instead of using the entire dataset to compute the gradient at each iteration, SGD uses only one data point, which makes it computationally faster and more efficient for large-scale problems. 

Update Rule for SGD:

The formula for updating the parameters in SGD is the same as in traditional gradient descent, except that it is applied to one randomly chosen data point at each step:

θ := θ − α · ∂J(θ; xᵢ, yᵢ)/∂θ

Where:

  • θ is the parameter vector,
  • α is the learning rate,
  • ∂J(θ; xᵢ, yᵢ)/∂θ is the gradient computed using one data point.

Steps Involved in SGD:

  • Initialize Parameters: Start with random values for the model’s weights.
  • Pick a Random Data Point: Shuffle the dataset and randomly select one data point from it.
  • Compute Gradient: Calculate the gradient of the cost function with respect to the parameters for that one data point.
  • Update Parameters: Adjust the parameters based on the gradient.
  • Repeat: Repeat this process for each data point in the training set, iterating over multiple epochs until convergence.

Formula:

The update rule for Stochastic Gradient Descent remains the same as traditional gradient descent, but with a focus on a single training example for each update:

θ := θ − α · ∂J(θ; xᵢ, yᵢ)/∂θ

Where:

  • ∂J(θ; xᵢ, yᵢ)/∂θ is the gradient of the cost function for a single training example (xᵢ, yᵢ).

Pros and Cons of SGD:

  • Pros:
    • Faster updates: Since it uses only one data point at a time, updates happen much faster, making it suitable for large datasets.
    • Less memory required: Only a single example is needed at each step, so it’s less memory-intensive.
    • Can escape local minima: The noisy updates can help the algorithm jump out of local minima and find better solutions.
  • Cons:
    • High variance: The updates are noisier compared to batch gradient descent because each step is based on a single data point. This can cause the cost function to fluctuate.
    • Convergence issues: The updates can be unstable and imprecise, so convergence may take longer and require more iterations.

Comparison with Batch Gradient Descent:

  • Batch Gradient Descent uses the entire dataset to compute the gradient at each step, which ensures stable updates. However, it is computationally expensive and slow for large datasets.
  • Stochastic Gradient Descent (SGD), on the other hand, computes the gradient for just one data point at each step. This makes it much faster, particularly for large datasets, but the updates are noisier and less stable. It is often preferred for real-time applications and large-scale problems because of its speed and efficiency. 

Let’s consider a real-world application of SGD in the telecom industry. The goal is to predict customer churn (whether a customer will leave the service) based on various features like usage, contract type, and payment history. The dataset is large, and using traditional gradient descent could be slow.

Since telecom datasets are large and updated frequently, SGD is ideal for real-time customer churn prediction as it updates weights after each data point.

Here’s how we could apply Stochastic Gradient Descent in Logistic Regression to predict churn: 

import numpy as np
# Define sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Cost function
def cost_function(X, y, theta):
    m = len(y)
    predictions = sigmoid(X.dot(theta))
    cost = (-y.dot(np.log(predictions)) - (1 - y).dot(np.log(1 - predictions))) / m
    return cost
# Stochastic Gradient Descent function
def stochastic_gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)
    cost_history = np.zeros(iterations)
    for i in range(iterations):
        for j in range(m):  # loop through each data point
            rand_index = np.random.randint(m)  # Randomly pick a data point
            x_j = X[rand_index, :].reshape(1, X.shape[1])  # Get the random feature row
            y_j = y[rand_index]  # Get the target label for that row
            prediction = sigmoid(x_j.dot(theta))  # Compute prediction
            error = prediction - y_j  # Compute error for this data point
            theta -= alpha * x_j.T.dot(error)  # Update the parameters
        cost_history[i] = cost_function(X, y, theta)  # Track the cost at each iteration
    return theta, cost_history
# Sample Telecom dataset (features and labels)
X = np.array([[1, 2, 0], [1, 3, 1], [1, 4, 0], [1, 5, 1]])  # Features matrix (with intercept term)
y = np.array([0, 0, 1, 1])  # Labels: 0 = No churn, 1 = Churn
# Initialize parameters and settings
theta = np.zeros(X.shape[1])  # Initial weights (theta)
alpha = 0.01  # Learning rate
iterations = 1000  # Number of iterations
# Perform Stochastic Gradient Descent
theta_optimal, cost_history = stochastic_gradient_descent(X, y, theta, alpha, iterations)
# Output the results
print("Optimal parameters (theta):", theta_optimal)
print("Final cost:", cost_history[-1])

Output:

Optimal parameters (theta): [-4.21677517  1.24229867  0.78015261]
Final cost: 0.5401231289708327

Explanation:

  • Real-Time Data Processing: In this example, SGD processes customer data one record at a time, making it much faster for real-time predictions, like predicting whether a customer is likely to churn based on recent usage data.
  • Fast Convergence: Even with a large dataset, the Stochastic Gradient Descent algorithm allows for quick updates and model convergence, making it ideal for large-scale applications like customer churn prediction.

Also Read: Difference Between Classification and Prediction in Data Mining [2025]

Now that we’ve covered Stochastic Gradient Descent, let’s explore a balanced approach, Mini-Batch Gradient Descent, which combines the best of both worlds.

Mini-Batch Gradient Descent Algorithm 

Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent (SGD). It strikes a balance by splitting the dataset into small batches, offering the computational efficiency of batch gradient descent with the faster convergence of SGD. 

Let’s take a closer look at how this approach works. 

In mini-batch gradient descent, instead of using the entire dataset (as in batch gradient descent) or just one data point (as in SGD), the algorithm divides the dataset into small batches. 

At each iteration, the gradient is computed using the average of the gradients for the examples in the batch. This method allows for faster convergence and more stable updates than using a single example and is less computationally expensive than using the entire dataset.

  • Mini-batch size: The mini-batch size (commonly a small power of two such as 32, 64, or 128) is typically chosen based on hardware efficiency and dataset size.
  • Average gradient: The gradients from the examples in the mini-batch are averaged before updating the parameters. This smooths out the noisy updates seen in SGD, while still offering faster convergence compared to batch gradient descent.

Formula:

The update rule for mini-batch gradient descent is similar to the one for SGD, but with gradients averaged over the batch:

θ := θ − α · (1/b) Σᵢ₌₁ᵇ ∂J(θ; xᵢ, yᵢ)/∂θ

Where:

  • θ represents the model parameters,
  • α is the learning rate,
  • b is the batch size (number of examples in a mini-batch),
  • ∂J(θ; xᵢ, yᵢ)/∂θ is the gradient of the cost function with respect to θ for a single data point in the mini-batch.

Pros and Cons of Mini-Batch Gradient Descent:

  • Pros:
    • Faster convergence: Mini-batch gradient descent tends to converge faster than both batch gradient descent and SGD, making it suitable for large datasets.
    • Better memory usage: By using smaller batches, it reduces the amount of memory needed compared to batch gradient descent, which matters when the full dataset cannot be held in memory at once.
    • Less noise than SGD: The averaging of gradients reduces the variance seen in stochastic gradient descent, making it more stable.
    • Parallelization: Mini-batches allow for efficient parallel processing, which can speed up training when hardware supports it (e.g., GPUs).
  • Cons:
    • Requires tuning: Choosing the optimal mini-batch size can be tricky and depends on the dataset and hardware.
    • Not as fast as SGD: While it is faster than batch gradient descent, it’s still not as fast as using a single data point (SGD), especially for very large datasets.
    • Convergence issues: A batch size that is too large or too small may lead to slower convergence or less accurate results. 

Let’s see how mini-batch gradient descent can be applied to a real-life problem: predicting housing prices based on features like square footage, number of bedrooms, and location. (The snippet below reuses the sigmoid and log-loss from earlier purely to demonstrate the batching loop; see the note and sketch after the explanation for what a true regression version would look like.)

Here’s a simplified Python implementation using mini-batch gradient descent: 

import numpy as np
# Define sigmoid function (for logistic regression as an example)
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Define cost function
def cost_function(X, y, theta):
    m = len(y)
    predictions = sigmoid(X.dot(theta))
    cost = (-y.dot(np.log(predictions)) - (1 - y).dot(np.log(1 - predictions))) / m
    return cost
# Mini-batch gradient descent function
def mini_batch_gradient_descent(X, y, theta, alpha, batch_size, iterations):
    m = len(y)
    cost_history = np.zeros(iterations)    
    for i in range(iterations):
        # Shuffle the dataset before each iteration
        shuffle_index = np.random.permutation(m)
        X_shuffled = X[shuffle_index]
        y_shuffled = y[shuffle_index]        
        for j in range(0, m, batch_size):  # Loop through mini-batches
            X_batch = X_shuffled[j:j+batch_size]
            y_batch = y_shuffled[j:j+batch_size]            
            predictions = sigmoid(X_batch.dot(theta))  # Predict the output
            error = predictions - y_batch  # Calculate the error
            theta -= (alpha / batch_size) * X_batch.T.dot(error)  # Update the parameters
        cost_history[i] = cost_function(X, y, theta)  # Track cost over iterations
    return theta, cost_history
# Example dataset (features and labels)
X = np.array([[1, 2000, 3], [1, 2500, 4], [1, 1800, 2], [1, 1500, 3]])  # Features: intercept term, square footage, bedrooms
y = np.array([400000, 500000, 300000, 200000])  # Labels: house prices
# Initialize parameters and settings
theta = np.zeros(X.shape[1])  # Initial weights (theta)
alpha = 0.01  # Learning rate
batch_size = 2  # Mini-batch size
iterations = 1000  # Number of iterations
# Perform mini-batch gradient descent
theta_optimal, cost_history = mini_batch_gradient_descent(X, y, theta, alpha, batch_size, iterations)
# Output the results
print("Optimal parameters (theta):", theta_optimal)
print("Final cost:", cost_history[-1]) 

Output: 

Optimal parameters (theta): [ 1.45063473e+04 -4.51352902e+02  3.13093127e+04]
Final cost: 455135.9675005167

Explanation:

  • Mini-Batch Processing: We divide the dataset into mini-batches of size 2 and update the model parameters based on the average gradient of each batch.
  • Real-World Use: In real estate, the same mini-batch loop can drive a price-prediction model built on factors like square footage and number of bedrooms, and it is particularly useful when the dataset is too large for batch gradient descent to be practical. Note, however, that price prediction is a regression task: the sigmoid and log-loss above are reused here only to illustrate the batching mechanics, and a genuine price model would swap in a linear hypothesis with a squared-error cost, as sketched below.
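For completeness, here is a minimal sketch (with an assumed learning rate, batch size, iteration count, and a feature-standardization step that is not part of the original example) of how the same mini-batch loop looks when the target really is a continuous price, i.e., linear regression with a squared-error cost:

import numpy as np
# Mini-batch gradient descent for linear regression (squared-error cost)
def mini_batch_linear_regression(X, y, theta, alpha, batch_size, iterations):
    m = len(y)
    for _ in range(iterations):
        idx = np.random.permutation(m)                 # shuffle before each pass
        X_shuffled, y_shuffled = X[idx], y[idx]
        for j in range(0, m, batch_size):
            X_batch = X_shuffled[j:j + batch_size]
            y_batch = y_shuffled[j:j + batch_size]
            predictions = X_batch.dot(theta)           # linear hypothesis, no sigmoid
            error = predictions - y_batch
            theta -= (alpha / len(y_batch)) * X_batch.T.dot(error)
    return theta
# Standardize features so square footage does not dominate the updates
X_raw = np.array([[2000, 3], [2500, 4], [1800, 2], [1500, 3]], dtype=float)
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
X = np.hstack([np.ones((len(X_std), 1)), X_std])       # add intercept column
y = np.array([400000, 500000, 300000, 200000], dtype=float)
theta = mini_batch_linear_regression(X, y, np.zeros(X.shape[1]), alpha=0.1, batch_size=2, iterations=500)
print("Learned parameters:", theta)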

Now that we’ve explored the nuances of mini-batch gradient descent, let’s dive into why gradient descent is such a game-changer for linear regression.

Why Use Gradient Descent in Linear Regression?

Gradient Descent is commonly used in linear regression when traditional methods, like the Normal Equation, may not be efficient or feasible.

While the Normal Equation is great for small datasets, gradient descent provides a more scalable and flexible approach, especially for larger or more complex datasets.

Let’s explore the cases where gradient descent in linear regression is particularly beneficial.

  • Large Datasets:
    Using the Normal Equation can be computationally expensive and slow when the dataset is large. Gradient descent, on the other hand, works efficiently for large datasets since it does not require computing the inverse of a matrix, which is required in the Normal Equation.
  • High-Dimensional Data:
    In cases where you have many features (i.e., high-dimensional data), the Normal Equation becomes more complicated. Gradient descent allows you to work with large numbers of features without dealing with the complexity of matrix inversion.
  • Scalability:
    For very large datasets that don’t fit into memory, gradient descent works well because it can process the data in batches, making it scalable. Mini-batch gradient descent is particularly useful in these cases since it allows for incremental updates with smaller subsets of data.
  • Non-Linear Data:
    If the relationship between the features and the target is non-linear, gradient descent still applies: you can add transformed features (such as polynomial terms), move to models like logistic regression, or combine it with regularization, all without changing the optimization procedure.
  • Memory Efficiency:
    Gradient descent is memory-efficient compared to the Normal Equation because it doesn't require storing large matrices. It only needs memory for the current set of parameters and the gradient updates.
  • Flexibility:
    Gradient descent can be adapted to different variants of regression, such as ridge regression or lasso regression, by modifying the cost function. This flexibility allows you to experiment with varying techniques of regularization.

Imagine you’re working with a dataset of millions of customers with multiple features such as age, income, and purchase behavior. The dataset is large enough that matrix inversion for the Normal Equation is not practical.

In this case, gradient descent in linear regression would allow you to efficiently minimize the cost function and update your model parameters iteratively, even if the dataset is too large to fit in memory all at once.
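As a rough illustration (synthetic data, with an assumed learning rate and iteration count), the following sketch contrasts the one-shot Normal Equation solution with iterative batch gradient descent on the same small linear regression problem; both should land on nearly identical parameters:

import numpy as np
rng = np.random.default_rng(0)
m, n = 200, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])   # intercept + 3 features
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X.dot(true_theta) + rng.normal(scale=0.1, size=m)        # noisy linear target
# Normal Equation: theta = (X^T X)^(-1) X^T y -- one matrix inversion
theta_ne = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
# Batch gradient descent on the squared-error cost -- iterative updates
theta_gd = np.zeros(n + 1)
alpha, iterations = 0.1, 2000
for _ in range(iterations):
    gradient = X.T.dot(X.dot(theta_gd) - y) / m
    theta_gd -= alpha * gradient
print("Normal Equation: ", np.round(theta_ne, 3))
print("Gradient descent:", np.round(theta_gd, 3))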

Also Read: Linear Algebra for Machine Learning: Critical Concepts, Why Learn Before ML

The more you dive into gradient descent and apply it to linear regression, the more comfortable and confident you'll become in optimizing models and solving complex problems across different datasets.

Master Gradient Descent in Linear Regression with upGrad

upGrad’s curriculum builds a strong foundation in gradient descent for linear regression, covering advanced concepts and practical applications through expert-led courses on the latest machine learning techniques and tools.

You can also get personalized career counseling with upGrad to guide your career path, or visit your nearest upGrad center and start hands-on training today!


Frequently Asked Questions (FAQs)

1. What are the key differences between gradient descent in logistic regression and in linear regression?

2. How does mini-batch gradient descent differ from stochastic gradient descent in logistic regression?

3. Can gradient descent in logistic regression handle multi-class classification?

4. What role does the learning rate play in gradient descent for logistic regression?

5. Why is the cost function important in gradient descent for logistic regression?

6. Can I use gradient descent in logistic regression with very large datasets?

7. How can gradient descent help in regularized logistic regression?

8. What are the limitations of gradient descent in logistic regression?

9. How do you know when gradient descent in logistic regression has converged?

10. Can gradient descent in logistic regression be used for time-series forecasting?

11. How does regularization affect gradient descent in logistic regression?
