Understanding Gradient Descent in Logistic Regression: Guide for Beginners
Updated on Apr 09, 2025 | 19 min read | 15.7k views
Gradient descent in logistic regression updates the weights step by step to reduce the log loss. For example, with a learning rate of 0.1, an initial weight of 0.5, and a gradient of -0.2, one update gives a new weight of 0.5 - 0.1 × (-0.2) = 0.52.
While the concept is simple, applying it correctly can be challenging, especially for beginners.
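As a quick sanity check, the single-weight update from the opening example can be reproduced in a few lines. This is only a sketch: the learning rate of 0.1 is an assumption, chosen because it is the value that turns a weight of 0.5 and a gradient of -0.2 into 0.52.

```python
# One gradient descent update for a single weight.
learning_rate = 0.1   # assumed value; it is what makes 0.5 -> 0.52 work out
weight = 0.5
gradient = -0.2

# Move against the gradient: a negative gradient increases the weight
new_weight = weight - learning_rate * gradient
print(round(new_weight, 2))  # 0.52
```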
In this blog, you’ll walk through a Gradient Descent in Logistic Regression Example to better understand the process. By the end, you’ll have a clearer grasp of how to use gradient descent to improve model accuracy and performance efficiently.
Let’s get into the details!
Logistic regression is a statistical method used for binary classification problems, where the goal is to predict one of two possible outcomes.
For example, logistic regression can predict whether an email is spam or not spam, or whether a patient has a disease. Based on input features, it estimates the probability of the positive class (e.g., disease = 1 or spam = 1). This makes it ideal for situations where you need to classify data into two categories.
While both logistic and linear regression are used for prediction tasks, they differ mainly in the type of output they generate.
Linear regression outputs a continuous value that can range from negative to positive infinity. It is used to predict numerical outcomes, such as house or stock prices.
Logistic regression passes the same kind of linear output through the sigmoid function, which squeezes values like 2.5 or -1.3 into probabilities between 0 and 1 (about 0.92 and 0.21, respectively).
This probability is then mapped to one of the two classes, typically by comparing it against a threshold such as 0.5 (e.g., 0 for non-disease, 1 for disease). The probability represents the likelihood of the event occurring, which is what makes logistic regression well suited to classification problems.
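The mapping from a raw linear output to a probability and then to a class can be sketched as follows. The helper name `to_class` and the 0.5 threshold are assumptions made for illustration; the right threshold depends on the application.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def to_class(probability, threshold=0.5):
    """Hypothetical helper: return 1 if the probability reaches the threshold, else 0."""
    return int(probability >= threshold)

scores = [2.5, -1.3]                        # raw linear outputs used in the text
probs = [float(sigmoid(s)) for s in scores]
labels = [to_class(p) for p in probs]

print([round(p, 2) for p in probs])         # [0.92, 0.21]
print(labels)                               # [1, 0]
```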
Real-World Applications
Why is Logistic Regression Popular?
Now that you have a solid understanding of logistic regression, let’s dive deeper into the key functions that power its predictions.
In this section, we will break down the key functions involved in logistic regression, focusing on how they contribute to predicting probabilities and optimizing the model. These functions are essential for making accurate predictions and fine-tuning the model through processes like gradient descent in logistic regression.
The sigmoid function is at the heart of logistic regression. It maps any real-valued number to a probability between 0 and 1, making it perfect for binary classification problems. The function is also known for its S-shaped curve, or "S-curve," which smoothly transitions between 0 and 1.
This is ideal for predicting probabilities, which is the output required in logistic regression.
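The following formula defines the sigmoid function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Where:
$\sigma(z)$ is the output, a value between 0 and 1
$z$ is the input; in logistic regression it is the linear combination of the features and weights, $z = \theta^{T} x$
$e$ is Euler's number, approximately 2.718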
The shape of the sigmoid function is a smooth, continuous curve that transitions from 0 to 1. It never reaches 0 or 1, which is essential for logistic regression's probability output.
This function takes any input, applies the transformation, and outputs a value between 0 and 1, which is interpreted as a probability.
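A short sketch can make the S-shape concrete. It evaluates the sigmoid at a few arbitrarily chosen inputs and checks two properties of the curve: the midpoint at z = 0 and the symmetry around it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary sample inputs, from strongly negative to strongly positive
z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
p = sigmoid(z)

print(np.round(p, 4))                      # values rise smoothly from near 0 to near 1
print(p[2])                                # midpoint of the curve: sigmoid(0) is 0.5
print(np.allclose(sigmoid(-z), 1.0 - p))   # symmetry: sigmoid(-z) = 1 - sigmoid(z)
```

Large positive inputs give values close to 1 and large negative inputs give values close to 0, which is why the output can be read as a probability.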
In machine learning, the cost function (or loss function) measures how well the model's predictions match the actual outcomes. For logistic regression we use the log loss function, which scores each predicted probability against the true label and grows larger the further the prediction is from that label.
Log loss is used instead of squared error loss because it heavily penalizes confident wrong predictions—like predicting 0.99 for a class that is actually 0—ensuring better probability calibration.
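To see the size of that penalty, the sketch below compares the two losses for a single example whose true label is 0. The function names are made up for this illustration.

```python
import math

def log_loss_single(y_true, p):
    """Log loss for one example with true label y_true and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def squared_error_single(y_true, p):
    """Squared error for the same single example."""
    return (y_true - p) ** 2

y_true = 0
for p in (0.6, 0.9, 0.99):
    print(p, round(log_loss_single(y_true, p), 3), round(squared_error_single(y_true, p), 3))
```

For the confident wrong prediction of 0.99, the log loss is about 4.6 while the squared error stays just under 1, so log loss reacts far more strongly to this kind of mistake.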
The goal is to minimize this cost by adjusting the model’s parameters (weights) through gradient descent in logistic regression.
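The cost function used in logistic regression is defined as:
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
Where:
$m$ is the number of training examples
$y^{(i)}$ is the true label (0 or 1) of the $i$-th example
$h_\theta(x^{(i)}) = \sigma(\theta^{T} x^{(i)})$ is the predicted probability for the $i$-th example
$\theta$ is the vector of model parameters (weights)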
The cost function quantifies how far off the predicted probabilities are from the actual outcomes, with the goal being to reduce this error as much as possible.
Why is the Cost Function Necessary?
The cost function is essential because it provides a way to evaluate and optimize the model's predictions. In logistic regression, we minimize this cost through gradient descent, iteratively adjusting the model parameters (weights) to improve the accuracy of our predictions.
The lower the cost, the better the model fits the data.
To minimize the cost function, we use gradient descent, a technique that iteratively adjusts the weights based on the gradient of the cost function.
Each step of gradient descent moves the model's parameters in the direction that reduces the cost, eventually leading to the best possible model.
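For readers who want to see where that direction comes from, here is a sketch of the standard derivation of the log-loss gradient. The notation ($h_\theta$, $X$, $m$) is an assumption chosen to match common usage: $h_\theta(x) = \sigma(\theta^{T} x)$, $X$ is the feature matrix with one row per example, and $m$ is the number of examples.

```latex
\begin{align*}
J(\theta) &= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] \\
\sigma'(z) &= \sigma(z)\left(1 - \sigma(z)\right) \\
\frac{\partial J}{\partial \theta_j} &= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \\
\nabla_\theta J(\theta) &= \frac{1}{m} X^{T} \left( \sigma(X\theta) - y \right)
\end{align*}
```

The last line is the vectorized form: the gradient is the prediction error, weighted by the features and averaged over the examples.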
Now that we’ve covered the key functions, let’s bring it all together with a practical Gradient Descent in Logistic Regression Example to see how these concepts work in action.
In this section, you’ll take a closer look at how gradient descent in logistic regression is used to optimize the parameters (weights) of the model.
The goal of gradient descent is to find the set of model parameters that minimize the cost function in logistic regression. It helps us adjust the parameters to make our predictions as accurate as possible.
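Formula:
The gradient descent update rule is as follows:
$$\theta := \theta - \alpha \, \nabla_\theta J(\theta) = \theta - \frac{\alpha}{m} X^{T} \left( \sigma(X\theta) - y \right)$$
Where:
$\theta$ is the vector of model parameters (weights)
$\alpha$ is the learning rate, which controls the size of each step
$\nabla_\theta J(\theta)$ is the gradient of the cost function with respect to the parameters
$X$ is the feature matrix, $y$ is the vector of true labels, and $m$ is the number of training examples
Steps Involved:
1. Initialize the parameters $\theta$ (for example, to zeros).
2. Compute the predicted probabilities $\sigma(X\theta)$.
3. Compute the error between the predictions and the true labels, and from it the gradient.
4. Update $\theta$ by subtracting the learning rate times the gradient.
5. Repeat steps 2 to 4 for a fixed number of iterations or until the cost stops decreasing.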
Pros and Cons:
The idea is to start with an initial set of parameters and gradually adjust them based on the gradient of the cost function. We optimize the model by moving in the direction that reduces the cost.
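One way to build confidence in the gradient before running many updates is to compare it against a finite-difference estimate. The sketch below does that on a small made-up dataset; the function names, the data, and the starting `theta` are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Average log loss over all examples."""
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradient(theta, X, y):
    """Gradient of the log loss: X^T (predictions - labels) / m."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def numerical_gradient(theta, X, y, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

# Small made-up dataset: intercept column plus two features
X = np.array([[1.0, 0.5, -1.2], [1.0, -0.3, 0.8], [1.0, 1.5, 0.1], [1.0, -1.0, -0.4]])
y = np.array([0.0, 1.0, 1.0, 0.0])
theta = np.array([0.1, -0.2, 0.3])

g_a = analytic_gradient(theta, X, y)
g_n = numerical_gradient(theta, X, y)
print(np.allclose(g_a, g_n, atol=1e-6))    # the two gradients agree

# A small step against the gradient lowers the cost
alpha = 0.1
print(cost(theta - alpha * g_a, X, y) < cost(theta, X, y))
```

If the two gradients disagree, the bug is usually in the analytic formula or in how the feature matrix is laid out.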
To understand it better, let’s implement gradient descent in logistic regression in Python. We'll use a small dataset for simplicity.
We have a dataset with two features: the patient's age and blood pressure.
Our task is to predict whether a patient has a disease (1) or not (0) based on these two features.
import numpy as np

# Define sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Define cost function (log loss)
def cost_function(X, y, theta):
    m = len(y)
    predictions = sigmoid(X.dot(theta))
    # Clip predictions so np.log never receives exactly 0 or 1
    predictions = np.clip(predictions, 1e-15, 1 - 1e-15)
    cost = (-y.dot(np.log(predictions)) - (1 - y).dot(np.log(1 - predictions))) / m
    return cost

# Gradient descent function
def gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)
    theta = theta.copy()  # Avoid modifying the caller's array in place
    cost_history = np.zeros(iterations)
    for i in range(iterations):
        predictions = sigmoid(X.dot(theta))
        error = predictions - y
        theta -= (alpha / m) * X.T.dot(error)
        cost_history[i] = cost_function(X, y, theta)
    return theta, cost_history

# Sample dataset: Patient's Age, Blood Pressure, and Disease Outcome (0 or 1)
features = np.array([[55, 120], [60, 130], [65, 140], [70, 160]], dtype=float)
y = np.array([0, 0, 1, 1])  # Labels: 0 = No disease, 1 = Disease

# Standardize the features so both are on a comparable scale. With the raw values
# (ages near 60, pressures near 140) a learning rate of 0.01 makes the updates overshoot.
features = (features - features.mean(axis=0)) / features.std(axis=0)
X = np.column_stack([np.ones(len(y)), features])  # Features matrix with intercept term

# Initialize parameters and settings
theta = np.zeros(X.shape[1])  # Initial weights (theta)
alpha = 0.01  # Learning rate
iterations = 1000  # Number of iterations

# Perform gradient descent
theta_optimal, cost_history = gradient_descent(X, y, theta, alpha, iterations)

# Output the results
print("Learned parameters (theta):", theta_optimal)
print("Final cost:", cost_history[-1])
Output:
The script prints the learned parameter vector (intercept, age coefficient, blood pressure coefficient) followed by the final cost. With all weights at zero the cost is ln 2 ≈ 0.693, so the final cost should be below that value.
Explanation:
After running the algorithm for 1000 iterations, we obtain the learned parameters (theta values): an intercept plus one coefficient each for the standardized age and blood pressure. The final cost value shows how well the model fits the training data, with lower values indicating a better fit. Because this tiny dataset is perfectly separable, the cost keeps shrinking if you run more iterations, so these are the parameters reached after 1000 steps rather than a true optimum.
In healthcare, this Gradient Descent in Logistic Regression Example can be used to develop a model that predicts the likelihood of a patient having a disease based on medical test results. This allows healthcare providers to make more informed decisions.
The gradient descent optimization ensures that the model's parameters are adjusted to minimize prediction errors, leading to better and more accurate outcomes.
Now that we’ve seen how gradient descent works in logistic regression, let’s explore Stochastic Gradient Descent and how it speeds up the process.
Stochastic Gradient Descent (SGD) is a variation of gradient descent that can significantly speed up the training process, especially when working with large datasets.
Instead of using the entire dataset to compute the gradient at each iteration, SGD uses only one data point, which makes it computationally faster and more efficient for large-scale problems.
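Update Rule for SGD:
The formula for updating the parameters in SGD is the same as in traditional gradient descent, except that it is applied to one randomly chosen data point at each step:
$$\theta := \theta - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$
Where:
$\theta$ is the vector of model parameters (weights)
$\alpha$ is the learning rate
$(x^{(i)}, y^{(i)})$ is the randomly chosen training example and its label
$h_\theta(x^{(i)}) = \sigma(\theta^{T} x^{(i)})$ is the predicted probability for that example
Steps Involved in SGD:
1. Initialize the parameters $\theta$ (for example, to zeros).
2. Pick one training example at random.
3. Compute the prediction and the error for that single example.
4. Update $\theta$ using the gradient from that example alone.
5. Repeat steps 2 to 4 for the chosen number of iterations.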
Pros and Cons of SGD:
Comparison with Batch Gradient Descent:
Let’s consider a real-world application of SGD in the telecom industry. The goal is to predict customer churn (whether a customer will leave the service) based on various features like usage, contract type, and payment history. The dataset is large, and using traditional gradient descent could be slow.
Since telecom datasets are large and updated frequently, SGD is ideal for real-time customer churn prediction as it updates weights after each data point.
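The per-data-point update can be sketched as a small function that takes the current weights and one new record. The function name `sgd_step` and the two records in `stream` are hypothetical; in a real system the records would arrive from the live data source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(theta, x, y, alpha):
    """Return updated parameters after seeing one example (x, y)."""
    error = sigmoid(x @ theta) - y
    return theta - alpha * error * x

theta = np.zeros(3)
# Hypothetical stream of (features, label) pairs; the first feature is the intercept term
stream = [
    (np.array([1.0, 2.0, 0.0]), 0),
    (np.array([1.0, 5.0, 1.0]), 1),
]
for x, y in stream:
    theta = sgd_step(theta, x, y, alpha=0.01)
    print(theta)
```

Because each update touches only one record, the model can be refreshed as new customers arrive without revisiting the full dataset.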
Here’s how we could apply Stochastic Gradient Descent in Logistic Regression to predict churn:
import numpy as np

# Define sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Cost function (log loss)
def cost_function(X, y, theta):
    m = len(y)
    predictions = sigmoid(X.dot(theta))
    cost = (-y.dot(np.log(predictions)) - (1 - y).dot(np.log(1 - predictions))) / m
    return cost

# Stochastic Gradient Descent function
def stochastic_gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)
    theta = theta.copy()  # Avoid modifying the caller's array in place
    cost_history = np.zeros(iterations)
    for i in range(iterations):
        for _ in range(m):  # m single-example updates per iteration
            rand_index = np.random.randint(m)  # Randomly pick a data point (with replacement)
            x_j = X[rand_index]  # Feature row for that data point
            y_j = y[rand_index]  # Target label for that row
            prediction = sigmoid(x_j.dot(theta))  # Compute prediction
            error = prediction - y_j  # Compute error for this data point
            theta -= alpha * error * x_j  # Update the parameters
        cost_history[i] = cost_function(X, y, theta)  # Track the cost at each iteration
    return theta, cost_history

# Sample Telecom dataset (features and labels)
X = np.array([[1, 2, 0], [1, 3, 1], [1, 4, 0], [1, 5, 1]])  # Features matrix (with intercept term)
y = np.array([0, 0, 1, 1])  # Labels: 0 = No churn, 1 = Churn

# Initialize parameters and settings
np.random.seed(42)  # Fix the seed so repeated runs give the same result
theta = np.zeros(X.shape[1])  # Initial weights (theta)
alpha = 0.01  # Learning rate
iterations = 1000  # Number of iterations

# Perform Stochastic Gradient Descent
theta_optimal, cost_history = stochastic_gradient_descent(X, y, theta, alpha, iterations)

# Output the results
print("Learned parameters (theta):", theta_optimal)
print("Final cost:", cost_history[-1])
Output:
The script prints the learned parameter vector (intercept plus two feature weights) and the final log loss on the full dataset. The exact numbers depend on the random seed, since a different sequence of data points leads to slightly different parameters.
Explanation:
Each of the 1000 iterations performs four updates, each based on one randomly chosen customer. The cost recorded after each iteration tends to decrease overall but can fluctuate from one iteration to the next, because single-example gradients are noisy.
Now that we’ve covered Stochastic Gradient Descent, let’s explore a balanced approach, Mini-Batch Gradient Descent, which combines the best of both worlds.
Mini-batch gradient descent is a compromise between batch gradient descent and stochastic gradient descent (SGD). It strikes a balance by splitting the dataset into small batches, offering the computational efficiency of batch gradient descent with the faster convergence of SGD.
Let’s take a closer look at how this approach works.
In mini-batch gradient descent, instead of using the entire dataset (as in batch gradient descent) or just one data point (as in SGD), the algorithm divides the dataset into small batches.
At each iteration, the gradient is computed using the average of the gradients for the examples in the batch. This method allows for faster convergence and more stable updates than using a single example and is less computationally expensive than using the entire dataset.
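The averaging step can be checked directly: the vectorized gradient over a batch should match the mean of the per-example gradients. The sketch below uses a made-up batch of two examples and hypothetical function names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def example_gradient(theta, x, y):
    """Gradient of the log loss for a single example."""
    return (sigmoid(x @ theta) - y) * x

def batch_gradient(theta, X_batch, y_batch):
    """Vectorized gradient averaged over all examples in the batch."""
    return X_batch.T @ (sigmoid(X_batch @ theta) - y_batch) / len(y_batch)

# Made-up batch: intercept column plus two features
X_batch = np.array([[1.0, 2.0, 0.0], [1.0, 3.0, 1.0]])
y_batch = np.array([0.0, 1.0])
theta = np.array([0.1, -0.2, 0.05])

per_example = [example_gradient(theta, x, y) for x, y in zip(X_batch, y_batch)]
averaged = np.mean(per_example, axis=0)

print(np.allclose(averaged, batch_gradient(theta, X_batch, y_batch)))  # True
```

Averaging over several examples smooths out the noise of any single one, which is where the more stable updates come from.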
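Formula:
The update rule for mini-batch gradient descent is similar to the one for SGD, but with gradients averaged over the batch:
$$\theta := \theta - \frac{\alpha}{b} \sum_{i \in B} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$$
Where:
$\theta$ is the vector of model parameters (weights)
$\alpha$ is the learning rate
$B$ is the current mini-batch and $b$ is the number of examples in it
$h_\theta(x^{(i)}) = \sigma(\theta^{T} x^{(i)})$ is the predicted probability for the $i$-th example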
Pros and Cons of Mini-Batch Gradient Descent:
Let’s see how mini-batch gradient descent can be applied to a housing problem. Logistic regression predicts a class rather than a price, so the task here is to predict whether a house is priced above a threshold, based on features like square footage and number of bedrooms.
Here’s a simplified Python implementation using mini-batch gradient descent:
import numpy as np

# Define sigmoid function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Define cost function (log loss)
def cost_function(X, y, theta):
    m = len(y)
    predictions = sigmoid(X.dot(theta))
    cost = (-y.dot(np.log(predictions)) - (1 - y).dot(np.log(1 - predictions))) / m
    return cost

# Mini-batch gradient descent function
def mini_batch_gradient_descent(X, y, theta, alpha, batch_size, iterations):
    m = len(y)
    theta = theta.copy()  # Avoid modifying the caller's array in place
    cost_history = np.zeros(iterations)
    for i in range(iterations):
        # Shuffle the dataset before each iteration
        shuffle_index = np.random.permutation(m)
        X_shuffled = X[shuffle_index]
        y_shuffled = y[shuffle_index]
        for j in range(0, m, batch_size):  # Loop through mini-batches
            X_batch = X_shuffled[j:j+batch_size]
            y_batch = y_shuffled[j:j+batch_size]
            predictions = sigmoid(X_batch.dot(theta))  # Predict the output
            error = predictions - y_batch  # Calculate the error
            # Average over the actual batch length (the last batch may be smaller)
            theta -= (alpha / len(y_batch)) * X_batch.T.dot(error)
        cost_history[i] = cost_function(X, y, theta)  # Track cost over iterations
    return theta, cost_history

# Example dataset (features and labels)
# Features: intercept term, square footage (in thousands), bedrooms
X = np.array([[1, 2.0, 3], [1, 2.5, 4], [1, 1.8, 2], [1, 1.5, 3]])
# Labels: 1 = priced above $350,000, 0 = not (threshold assumed for illustration;
# the underlying prices are 400000, 500000, 300000, 200000)
y = np.array([1, 1, 0, 0])

# Initialize parameters and settings
np.random.seed(42)  # Fix the seed so repeated runs give the same result
theta = np.zeros(X.shape[1])  # Initial weights (theta)
alpha = 0.01  # Learning rate
batch_size = 2  # Mini-batch size
iterations = 1000  # Number of iterations

# Perform mini-batch gradient descent
theta_optimal, cost_history = mini_batch_gradient_descent(X, y, theta, alpha, batch_size, iterations)

# Output the results
print("Learned parameters (theta):", theta_optimal)
print("Final cost:", cost_history[-1])
Output:
The script prints the learned parameter vector (intercept, square footage weight, bedrooms weight) and the final log loss. The exact numbers depend on the random seed, because the data is reshuffled before every iteration.
Explanation:
Each iteration shuffles the data, splits it into batches of two, and updates theta once per batch using the gradient averaged over that batch. Square footage is expressed in thousands so that the features stay on a similar scale and the sigmoid does not saturate.
Now that we’ve explored the nuances of mini-batch gradient descent, let’s dive into why gradient descent is such a game-changer for linear regression.
Gradient Descent is commonly used in linear regression when traditional methods, like the Normal Equation, may not be efficient or feasible.
While the Normal Equation is great for small datasets, gradient descent provides a more scalable and flexible approach, especially for larger or more complex datasets.
Let’s explore the cases where gradient descent in linear regression is particularly beneficial.
Imagine you’re working with a dataset of millions of customers with multiple features such as age, income, and purchase behavior. The dataset is large enough that matrix inversion for the Normal Equation is not practical.
In this case, gradient descent in linear regression would allow you to efficiently minimize the cost function and update your model parameters iteratively, even if the dataset is too large to fit in memory all at once.
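On a small problem the two approaches can be compared side by side. The sketch below uses a tiny noise-free dataset made up for illustration (generated from y = 3 + 2x), solves it with the Normal Equation, and then reaches the same answer iteratively with gradient descent.

```python
import numpy as np

# Tiny noise-free dataset generated from y = 3 + 2x (made up for illustration)
x = np.array([-1.5, -0.5, 0.5, 1.5])
X = np.column_stack([np.ones_like(x), x])   # intercept column plus one feature
y = 3.0 + 2.0 * x

# Normal Equation: solve (X^T X) theta = X^T y directly
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the cost (1 / 2m) * sum of squared errors
theta_gd = np.zeros(2)
alpha = 0.1
for _ in range(500):
    gradient = X.T @ (X @ theta_gd - y) / len(y)
    theta_gd -= alpha * gradient

print(theta_normal)
print(theta_gd)
print(np.allclose(theta_normal, theta_gd, atol=1e-6))
```

The direct solve needs the full X^T X matrix, whose cost grows quickly with the number of features, while each gradient descent step only needs matrix-vector products and can be run on batches of the data.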
The more you dive into gradient descent and apply it to linear regression, the more comfortable and confident you'll become in optimizing models and solving complex problems across different datasets.
upGrad’s curriculum builds a strong foundation in gradient descent for linear regression. It also covers advanced concepts and practical applications, with expert-led courses on the latest techniques and tools used in machine learning.