Understanding Gradient Descent in Logistic Regression: A Guide for Beginners
Updated on Jun 26, 2025 | 13 min read | 16.18K+ views
Did You Know? The same math that powers Netflix recommendations—gradient descent in logistic regression—is used by doctors to predict trauma patient survival in seconds. Tools like TRISS help ER teams determine who needs surgery immediately, all thanks to algorithms trained on real-life medical emergencies.
Gradient descent in logistic regression is a method used to find the best-fitting model by minimizing errors. Think of it like finding the quickest route on a map: if you were navigating a city, you'd keep adjusting your path, taking small steps that bring you closer to your destination.
But understanding how gradient descent in logistic regression works can be tricky, especially for beginners.
This article breaks down the concept in simple terms and guides you through the process step by step.
Want to master gradient descent in logistic regression and build efficient models? Explore upGrad’s AI and Machine Learning Courses and gain the skills to develop real-life AI applications with confidence!
Gradient descent in logistic regression is an optimization technique used to adjust the model’s parameters (weights) to minimize errors and improve predictions. In simple terms, it’s like finding the best route to your destination by making small adjustments along the way.
Logistic regression, on the other hand, is a statistical method used to predict binary outcomes, such as whether an email is spam or not, or if a patient has a disease based on certain factors. The goal is to predict the probability of a particular outcome, which falls between 0 and 1.
Handling data for classification tasks in logistic regression isn't just about collecting the right features; you also need the right optimization method, and gradient descent is the standard choice.
So, why is gradient descent so crucial for logistic regression?
Well, without it, your model would struggle to find the best parameters. Gradient descent helps you automatically adjust the weights of your model, getting closer and closer to the most accurate predictions possible.
The next key components to understand are the sigmoid function and the cost function. These two elements play a crucial role in transforming your data into meaningful predictions and ensuring that the model improves with each step of optimization.
These functions are essential for making accurate predictions and fine-tuning the model through processes like gradient descent in logistic regression.
The sigmoid function is at the heart of logistic regression. It maps any real-valued number to a probability between 0 and 1, making it perfect for binary classification problems. The function is also known for its S-shaped curve, or "S-curve," which smoothly transitions between 0 and 1.
This is ideal for predicting probabilities, which is the output required in logistic regression.
The following formula defines the sigmoid function:

σ(z) = 1 / (1 + e^(-z))

Where:
- z is the linear combination of the input features and their weights, z = θ0 + θ1x1 + ... + θnxn
- e is Euler's number (approximately 2.718)
The shape of the sigmoid function is a smooth, continuous curve that transitions from 0 to 1. It never reaches 0 or 1, which is essential for logistic regression's probability output.
This function takes any input, applies the transformation, and outputs a value between 0 and 1, which is interpreted as a probability.
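To make this concrete, here is a quick worked example (values rounded): for z = 0, σ(0) = 1 / (1 + e^0) = 0.5; for z = 2, σ(2) = 1 / (1 + e^(-2)) ≈ 0.88; and for z = -2, σ(-2) ≈ 0.12. Large positive inputs are pushed toward 1, large negative inputs toward 0, and the output can always be read as a probability.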
Also Read: What Are Activation Functions in Neural Networks? Functioning, Types, Real-world Examples, Challenge
In machine learning, the cost function (or loss function) measures how well the model's predictions match the actual outcomes. We use the log loss function for logistic regression, which calculates the difference between predicted probabilities and the true labels.
Log loss is used instead of squared error loss because it heavily penalizes confident wrong predictions—like predicting 0.99 for a class that is actually 0—ensuring better probability calibration.
The goal is to minimize this cost by adjusting the model’s parameters (weights) through gradient descent in logistic regression.
The cost function used in logistic regression is defined as:

J(θ) = -(1/m) Σ [ yi · log(ŷi) + (1 - yi) · log(1 - ŷi) ]

Where:
- m is the number of training examples, and the sum runs over them (indexed by i)
- yi is the actual label (0 or 1) for the i-th example
- ŷi is the predicted probability for the i-th example, ŷi = σ(θᵀxi)
The cost function quantifies how far off the predicted probabilities are from the actual outcomes, with the goal being to reduce this error as much as possible.
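For example (values rounded): if the true label is 0 but the model confidently predicts a probability of 0.99, the log loss for that example is -log(1 - 0.99) = -log(0.01) ≈ 4.6, whereas a cautious prediction of 0.6 costs only -log(0.4) ≈ 0.92. Squared error would penalize the confident mistake far less severely (0.98 versus 0.36), which is why log loss is preferred when the output is a probability.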
Also Read: Understanding What is Feedforward Neural Network: Detailed Explanation
Why is the Cost Function Necessary?
The cost function is essential because it provides a way to evaluate and optimize the model's predictions. In logistic regression, we minimize this cost through gradient descent, iteratively adjusting the model parameters (weights) to improve the accuracy of our predictions.
The lower the cost, the better the model fits the data.
To minimize the cost function, we use gradient descent, a technique that iteratively adjusts the weights based on the gradient of the cost function.
Each step of gradient descent moves the model's parameters in the direction that reduces the cost, eventually leading to the best possible model.
This adjustment is made using the gradient descent update rule, which determines how much the weights should change at each iteration.
θj := θj - α · ∂J(θ)/∂θj, which for log loss works out to θj := θj - (α/m) Σ (ŷi - yi) · xij

Where:
- θj is the j-th weight
- α is the learning rate, which controls the size of each step
- (ŷi - yi) is the prediction error for the i-th example
- xij is the value of feature j for the i-th example
By iterating these steps, logistic regression finds the parameters that minimize the error, improving prediction accuracy.
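As a tiny worked example of one update (with made-up numbers): suppose a single-feature model starts at θ = 0 with learning rate α = 0.1, and one training example has x = 1 with true label y = 1. The current prediction is σ(0) = 0.5, the error is 0.5 - 1 = -0.5, so the new weight becomes θ = 0 - 0.1 × (-0.5) = 0.05. The weight moves in the direction that raises the predicted probability toward the true label, exactly as the update rule prescribes.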
While gradient descent is one of the most commonly used optimization techniques, there are several other methods that can be used, depending on the problem and dataset. Here are a few notable ones:
1. Newton's Method
Newton’s method is a second-order optimization technique that uses both the gradient (first derivative) and the Hessian (second derivative) to find the optimal parameters more efficiently than gradient descent.
It updates the parameters with more precision by considering the curvature of the cost function, which can lead to faster convergence.
In each iteration, the update rule for Newton's method is given by:
θ := θ - H⁻¹ ∇J(θ)

Where:
- ∇J(θ) is the gradient (vector of first derivatives) of the cost function
- H is the Hessian matrix (matrix of second derivatives) of the cost function
- H⁻¹ is the inverse of the Hessian
Newton’s method is ideal when you want faster convergence, especially for smaller datasets where the second-order derivatives can be computed easily. It's most effective when the cost function is smooth and convex.
However, it is computationally expensive, making it less suitable for large datasets or models with a large number of parameters.
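For intuition, here is a minimal sketch of a single Newton step for logistic regression. It is not part of the original walkthrough: the function name newton_step is illustrative, X is assumed to be a feature matrix that already includes a bias column, and y a vector of 0/1 labels.

import numpy as np

def newton_step(X, y, theta):
    p = 1.0 / (1.0 + np.exp(-X.dot(theta)))   # predicted probabilities
    grad = X.T.dot(p - y) / len(y)            # gradient of the log-loss
    W = np.diag(p * (1 - p))                  # per-example weights for the Hessian
    H = X.T.dot(W).dot(X) / len(y)            # Hessian of the log-loss
    return theta - np.linalg.solve(H, grad)   # theta := theta - H^(-1) * gradient

Each call performs one full Newton update; on small, well-conditioned problems only a handful of such steps are typically needed.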
2. Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) is a variant of gradient descent that updates the model parameters using only one randomly selected data point at each step, rather than the entire dataset. This makes the updates much faster and allows for more frequent adjustments.
The update rule for SGD is:
θ := θ - α · (ŷi - yi) · xi

Where:
- (xi, yi) is a single randomly selected training example
- ŷi = σ(θᵀxi) is the predicted probability for that example
- α is the learning rate
SGD is useful when working with large datasets or when data is arriving in real-time. Since it processes data one point at a time, it's well-suited for tasks like online learning or when the dataset doesn’t fit into memory.
However, the updates are noisier, and it may require more iterations to converge to the optimal solution.
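Below is a minimal, hedged sketch of one SGD epoch for logistic regression; the names are illustrative, and X is again assumed to include a bias column.

import numpy as np

def sgd_epoch(X, y, theta, alpha=0.01):
    for i in np.random.permutation(len(y)):            # visit examples in random order
        pred = 1.0 / (1.0 + np.exp(-X[i].dot(theta)))  # prediction for one example
        theta = theta - alpha * (pred - y[i]) * X[i]   # update using that single example
    return theta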
3. Mini-Batch Gradient Descent
Mini-batch Gradient Descent is a compromise between batch gradient descent and SGD. Instead of updating parameters with a single data point (SGD) or the entire dataset (batch gradient descent), mini-batch gradient descent uses small batches of data.
This leads to more stable updates and faster convergence compared to SGD, with better computational efficiency than batch gradient descent.
The update rule for mini-batch gradient descent is similar to that of batch gradient descent, but it uses a subset of the data:
θ := θ - (α/b) Σ (ŷi - yi) · xi, where the sum runs over the examples in the current mini-batch

Where:
- b is the mini-batch size (the number of examples in each batch)
- ŷi is the predicted probability for the i-th example in the batch
- α is the learning rate
Mini-batch gradient descent is particularly useful when the dataset is too large to process all at once and when the model needs frequent updates. It strikes a balance between the speed of SGD and the stability of batch gradient descent.
It is commonly used for large-scale machine learning tasks like training deep neural networks.
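A corresponding mini-batch sketch (batch_size is a hypothetical choice, and the other assumptions are the same as in the sketches above):

import numpy as np

def minibatch_epoch(X, y, theta, alpha=0.01, batch_size=2):
    indices = np.random.permutation(len(y))                  # shuffle the data once per epoch
    for start in range(0, len(y), batch_size):
        batch = indices[start:start + batch_size]            # indices of the current mini-batch
        preds = 1.0 / (1.0 + np.exp(-X[batch].dot(theta)))   # predictions for the batch
        theta = theta - (alpha / len(batch)) * X[batch].T.dot(preds - y[batch])
    return theta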
While the core idea of updating parameters remains the same, the choice of variant depends on the size of your dataset, computational resources, and how fast you need the model to converge.
Struggling to grasp how deep learning and neural networks work? Check out upGrad's free course on Fundamentals of Deep Learning and Neural Networks and learn the key concepts behind AI models. Start today!
Next, let's look at a practical example of Gradient Descent in Logistic Regression to see how it optimizes the model's parameters in action.
To see Gradient Descent in Logistic Regression in action, consider a simple example where the algorithm fine-tunes model parameters to reduce the cost function.
In this case, the example uses a dataset with two features: age and blood pressure, to predict whether a patient has a disease (1) or not (0).
This will illustrate how the model adjusts its parameters through each iteration to improve prediction accuracy.
To understand it better, let’s implement gradient descent in logistic regression in Python. We'll use a small dataset for simplicity.
First, we need to set up the initial values for our model parameters, also known as weights. These parameters determine how much influence each feature (age and blood pressure) will have on the model’s prediction.
We'll initialize these parameters to zeros, as is common in logistic regression. The learning rate (α) controls how much the weights change with each update, and we'll set it to 0.01 for this example.
import numpy as np

theta = np.zeros(X.shape[1])  # Initialize weights (theta); X is the feature matrix defined in the dataset step below
alpha = 0.01                  # Learning rate
iterations = 1000             # Number of iterations
The sigmoid function transforms the linear combination of the input features into a probability between 0 and 1. This is important for binary classification, where we need to predict whether the outcome is 0 (no disease) or 1 (disease).
The formula for the sigmoid function is:

σ(z) = 1 / (1 + e^(-z))

Where z is the linear combination of the features and weights (i.e., z = θ0 + θ1 · age + θ2 · blood pressure).
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
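A quick sanity check on a few illustrative inputs confirms the expected behaviour:

print(sigmoid(0))                      # 0.5
print(sigmoid(np.array([-2.0, 2.0])))  # approximately [0.12, 0.88]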
The cost function (log-loss or binary cross-entropy) measures how far the model’s predictions are from the actual outcomes (0 or 1). In logistic regression, we want to minimize this cost to make the model as accurate as possible.
The formula for the cost function is:
J(θ) = -(1/m) Σ [ yi · log(ŷi) + (1 - yi) · log(1 - ŷi) ]

Where:
- m is the number of training examples
- yi is the actual label (0 or 1)
- ŷi is the predicted probability from the sigmoid function
def cost_function(X, y, theta):
    m = len(y)                                   # number of training examples
    predictions = sigmoid(np.dot(X, theta))      # predicted probabilities
    cost = -(1/m) * np.sum(y * np.log(predictions) + (1 - y) * np.log(1 - predictions))
    return cost
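As a quick sanity check (assuming the sample X and y from the dataset step below have been defined), an all-zero theta makes every prediction 0.5, so the cost should be -log(0.5) ≈ 0.693 regardless of the labels:

print(cost_function(X, y, np.zeros(X.shape[1])))  # approximately 0.693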
To minimize the cost, we use gradient descent. The gradient of the cost function tells us how to adjust the weights to reduce the error. The update rule, in vectorized form, is:

θ := θ - (α/m) · Xᵀ(ŷ - y)
def gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)                                        # number of training examples
    cost_history = np.zeros(iterations)               # track the cost at each iteration
    for i in range(iterations):
        predictions = sigmoid(np.dot(X, theta))       # current predicted probabilities
        error = predictions - y                       # prediction error
        theta -= (alpha / m) * np.dot(X.T, error)     # gradient descent update rule
        cost_history[i] = cost_function(X, y, theta)  # record the cost after this update
    return theta, cost_history
Now that we have everything set up, we can run gradient descent to optimize the model’s parameters. The algorithm will iterate 1000 times, adjusting the weights at each step to minimize the cost function.
# Sample dataset: each row is [bias term, age, blood pressure]
X = np.array([[1, 55, 120], [1, 60, 130], [1, 65, 140], [1, 70, 160]])  # Feature matrix
y = np.array([0, 0, 1, 1])  # Labels (0 = No disease, 1 = Disease)
# Perform gradient descent
theta_optimal, cost_history = gradient_descent(X, y, theta, alpha, iterations)
# Display the final optimal parameters (theta) and the final cost
print("Optimal parameters (theta):", theta_optimal)
print("Final cost:", cost_history[-1])
After running gradient descent, we’ll get the optimal values for the model parameters θ0, θ1, and θ2. These values represent the best-fit weights for age and blood pressure to predict whether a patient has the disease.
For instance, the optimal parameters might look like this:
Optimal parameters (theta): [ -6.52 0.14 0.06 ]
Final cost: 0.49
The cost at the end of the iterations tells us how well the model has minimized the error. A lower cost means the model has better accuracy.
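To see the learned parameters in use, you could score a new (hypothetical) patient; the feature vector below starts with 1 for the bias term, followed by age and blood pressure. A probability above 0.5 would typically be classified as "disease".

new_patient = np.array([1, 62, 135])                       # bias term, age, blood pressure
probability = sigmoid(np.dot(new_patient, theta_optimal))  # predicted probability of disease
print("Predicted probability of disease:", probability)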
As you apply this technique, remember to fine-tune your learning rate and monitor the convergence carefully. A learning rate that’s too high can cause instability, while a rate too low can make the process slow. Also, always check for overfitting, especially when using small datasets.
You can move on to more advanced topics, such as regularization techniques (L1, L2), to enhance model performance and prevent overfitting. Additionally, delving into Optimization Algorithms like Adam and RMSprop will improve your model’s efficiency and accuracy.
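As a hedged illustration of where regularization fits, an L2 (ridge) penalty only changes the gradient step: a shrinkage term is added for every weight except the bias. Here lambda_reg is a hypothetical regularization strength, and the other variables come from the walkthrough above.

lambda_reg = 0.1                                  # hypothetical regularization strength
m = len(y)
for i in range(iterations):
    predictions = sigmoid(np.dot(X, theta))
    grad = (1 / m) * np.dot(X.T, predictions - y) # ordinary log-loss gradient
    grad[1:] += (lambda_reg / m) * theta[1:]      # L2 penalty; bias term left unpenalized
    theta -= alpha * grad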
Gradient Descent in Logistic Regression is a crucial technique for optimizing machine learning models, enabling the identification of optimal parameters by minimizing errors. However, you may face challenges when working with large datasets or fine-tuning your learning rate to achieve better convergence.
To enhance your grasp of Gradient Descent in Logistic Regression, experiment with learning rates, apply regularization, and explore techniques like Stochastic Gradient Descent. upGrad’s AI and machine learning courses can deepen your knowledge and help tackle advanced challenges.
In addition to the courses mentioned above, upGrad also offers free courses that can help you elevate your skills.
You can also get personalized career counseling with upGrad to guide your career path, or visit your nearest upGrad center and start hands-on training today!