Understanding Gradient Descent in Logistic Regression: A Guide for Beginners

By Pavan Vadapalli

Updated on Jun 26, 2025 | 13 min read | 16.18K+ views

Did You Know?

The same math that powers Netflix recommendations—gradient descent in logistic regression—is used by doctors to predict trauma patient survival in seconds. Tools like TRISS help ER teams determine who needs surgery immediately, all thanks to algorithms trained on real-life medical emergencies.

Gradient descent in logistic regression is a method for finding the best-fitting model by minimizing prediction error. Think of it like finding the quickest route on a map: if you were in a city trying to reach your destination, you would keep adjusting your path, taking small steps that bring you closer to your goal.

But understanding how gradient descent in logistic regression works can be tricky, especially for beginners.

This article breaks down the concept in simple terms and guides you through the process step by step.

What is Gradient Descent in Logistic Regression? Functions & Variants

Gradient descent in logistic regression is an optimization technique used to adjust the model’s parameters (weights) to minimize errors and improve predictions. In simple terms, it’s like finding the best route to your destination by making small adjustments along the way.

Logistic regression, on the other hand, is a statistical method used to predict binary outcomes, such as whether an email is spam or not, or if a patient has a disease based on certain factors. The goal is to predict the probability of a particular outcome, which falls between 0 and 1.

Handling data for classification tasks in logistic regression isn’t just about collecting features; you also need the right optimization method, such as gradient descent.

So, why is gradient descent so crucial for logistic regression? 

Logistic regression has no closed-form formula for its best weights, so they have to be found iteratively. Gradient descent does this automatically: it adjusts the weights a little at each step, moving closer and closer to the values that give the most accurate predictions.

The next key components to understand are the sigmoid function and the cost function. These two elements play a crucial role in transforming your data into meaningful predictions and ensuring that the model improves with each step of optimization.

These functions are essential for making accurate predictions and fine-tuning the model through processes like gradient descent in logistic regression.

Sigmoid Function

The sigmoid function is at the heart of logistic regression. It maps any real-valued number to a probability between 0 and 1, making it perfect for binary classification problems. The function is also known for its S-shaped curve, or "S-curve," which smoothly transitions between 0 and 1. 

This is ideal for predicting probabilities, which is the output required in logistic regression.

The following formula defines the sigmoid function:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where:

  • $z$ is the linear combination of the input features, given by $z = w_1 x_1 + w_2 x_2 + b$
  • $e$ is the base of the natural logarithm (approximately 2.718)

The shape of the sigmoid function is a smooth, continuous curve that transitions from 0 to 1. It never reaches 0 or 1, which is essential for logistic regression's probability output.

As $z \to -\infty$, $\sigma(z) \to 0$

As $z \to +\infty$, $\sigma(z) \to 1$


This function takes any input, applies the transformation, and outputs a value between 0 and 1, which is interpreted as a probability.
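To make the S-curve concrete, here is a minimal sketch that evaluates the sigmoid at a few points. The input values are hypothetical and were chosen only to show both tails and the midpoint.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of numbers) into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs: two large-magnitude values, two moderate ones, and zero
z_values = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
probabilities = sigmoid(z_values)

for z, p in zip(z_values, probabilities):
    print(f"z = {z:6.1f}  ->  sigma(z) = {p:.5f}")
```

At z = 0 the output is exactly 0.5, and the outputs move toward 0 and 1 as z becomes strongly negative or strongly positive.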

Cost Function

In machine learning, the cost function (or loss function) measures how well the model's predictions match the actual outcomes. Logistic regression uses the log loss function, which penalizes each predicted probability according to how far it is from the true label.

Log loss is used instead of squared error loss for two reasons. It heavily penalizes confident wrong predictions, such as predicting 0.99 for a class that is actually 0, which pushes the model toward better-calibrated probabilities. And unlike squared error applied to sigmoid outputs, it is convex in the weights, so gradient descent cannot get trapped in a local minimum.

The goal is to minimize this cost by adjusting the model’s parameters (weights) through gradient descent in logistic regression.

The cost function used in logistic regression is defined as:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log\left(h_\theta(x_i)\right) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right) \right]$$

Where:

  • $m$ is the number of training examples,
  • $y_i$ is the true label (0 or 1) for the $i$-th example,
  • $h_\theta(x_i)$ is the predicted probability for the $i$-th example, calculated by the sigmoid function.

The cost function quantifies how far off the predicted probabilities are from the actual outcomes, with the goal being to reduce this error as much as possible.
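As a rough illustration of how strongly log loss punishes a confident wrong prediction, the sketch below compares it with squared error for a single example whose true label is 0. The helper names and the probabilities tried are hypothetical.

```python
import math

def log_loss_single(y_true, p):
    """Log loss for one example with true label y_true (0 or 1) and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def squared_error_single(y_true, p):
    """Squared error for the same single example."""
    return (y_true - p) ** 2

y_true = 0  # the true class is 0, so a larger p means a worse prediction
for p in (0.10, 0.50, 0.90, 0.99):
    print(f"p = {p:.2f}  log loss = {log_loss_single(y_true, p):.3f}  "
          f"squared error = {squared_error_single(y_true, p):.3f}")
```

Squared error can never exceed 1 for a probability, whereas log loss grows without bound as the prediction approaches the wrong extreme. At p = 0.99 it is already about 4.6.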

Why is the Cost Function Necessary?

The cost function is essential because it provides a way to evaluate and optimize the model's predictions. In logistic regression, we minimize this cost through gradient descent, iteratively adjusting the model parameters (weights) to improve the accuracy of our predictions. 

The lower the cost, the better the model fits the data.

Optimizing the Cost Function

To minimize the cost function, we use gradient descent, a technique that iteratively adjusts the weights based on the gradient of the cost function. 

Each step of gradient descent moves the model's parameters in the direction that reduces the cost, eventually leading to the best possible model.

This adjustment is made using the gradient descent update rule, which determines how much the weights should change at each iteration.

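  • Gradient Descent Update Rule:

$$\theta = \theta - \alpha \frac{\partial J(\theta)}{\partial \theta}$$

Where:

  • $\alpha$ is the learning rate (step size),
  • $\frac{\partial J(\theta)}{\partial \theta}$ is the gradient of the cost function with respect to the parameters $\theta$.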

Iterating these steps, logistic regression finds parameters that minimize error, improving prediction accuracy.
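The update rule needs the gradient itself. As a worked step (standard calculus, not spelled out in the text), differentiating the log loss and using the sigmoid's derivative σ'(z) = σ(z)(1 − σ(z)) gives a compact result. The notation here is an assumption: x_{i,j} denotes the j-th feature of the i-th example, and X is the matrix whose rows are the examples.

```latex
\frac{\partial J(\theta)}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right) x_{i,j}
\qquad \Longleftrightarrow \qquad
\nabla J(\theta) = \frac{1}{m} X^{\top} \left( h_\theta(X) - y \right)
```

In words, each weight's gradient is the average prediction error multiplied by the corresponding feature value.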

Variants of Gradient Descent (Update Rules)

While gradient descent is one of the most commonly used optimization techniques, there are several other methods that can be used, depending on the problem and dataset. Here are a few notable ones:

1. Newton's Method

Newton’s method is a second-order optimization technique that uses both the gradient (first derivatives) and the Hessian (second derivatives) to reach the optimal parameters in fewer iterations than gradient descent.

It updates the parameters with more precision by considering the curvature of the cost function, which can lead to faster convergence.

In each iteration, the update rule for Newton's method is given by:

$$\theta = \theta - H(\theta)^{-1} \nabla J(\theta)$$

 

Where:

  • H(θ) is the Hessian matrix (second-order partial derivatives),
  • ∇J(θ) is the gradient (first-order partial derivatives).

Newton’s method is ideal when you want faster convergence, especially for smaller datasets where the second-order derivatives can be computed easily. It's most effective when the cost function is smooth and convex. 

However, it is computationally expensive, making it less suitable for large datasets or models with a large number of parameters.
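The Newton update can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the tiny dataset is hypothetical, with one already-standardized feature and overlapping classes so that a finite optimum exists. The Hessian of the log loss is formed as (1/m)·Xᵀ·S·X, where S is the diagonal matrix with entries h(1 − h).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(X, y, theta):
    """One Newton update for logistic regression."""
    m = len(y)
    h = sigmoid(X @ theta)                    # predicted probabilities
    gradient = X.T @ (h - y) / m              # first-order partial derivatives
    hessian = (X.T * (h * (1 - h))) @ X / m   # second-order partial derivatives
    # Solve H * step = gradient instead of forming the inverse explicitly
    return theta - np.linalg.solve(hessian, gradient)

# Hypothetical data: a bias column plus one standardized feature
X = np.array([[1, -1.5], [1, -0.9], [1, -0.3], [1, 0.3], [1, 0.9], [1, 1.5]])
y = np.array([0, 0, 1, 0, 1, 1])

theta = np.zeros(2)
for _ in range(10):
    theta = newton_step(X, y, theta)

gradient_norm = np.linalg.norm(X.T @ (sigmoid(X @ theta) - y) / len(y))
print("theta:", theta)
print("gradient norm after 10 Newton steps:", gradient_norm)
```

Using `np.linalg.solve` avoids computing the inverse Hessian directly, which is the usual design choice. The cost of building and solving with the Hessian still grows quickly with the number of parameters.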

2. Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is a variant of gradient descent that updates the model parameters using only one randomly selected data point at each step, rather than the entire dataset. This makes the updates much faster and allows for more frequent adjustments.

The update rule for SGD is:

$$\theta = \theta - \alpha \cdot \nabla J(\theta; x_i, y_i)$$

 

Where:

  • xi and yi are the features and label of the selected data point,
  • α is the learning rate,
  • ∇J(θ; xi, yi) is the gradient of the cost function with respect to the parameters for that single data point.

SGD is useful when working with large datasets or when data is arriving in real-time. Since it processes data one point at a time, it's well-suited for tasks like online learning or when the dataset doesn’t fit into memory. 

However, the updates are noisier, and it may require more iterations to converge to the optimal solution.
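A minimal sketch of SGD for logistic regression follows. Everything in it is illustrative: the dataset, learning rate, epoch count, and the function names `sgd` and `log_loss` are hypothetical. It also shuffles the examples once per pass and visits each exactly once, which is a common variant of "pick one random point per step" rather than the only way to do it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(X, y, theta):
    p = np.clip(sigmoid(X @ theta), 1e-15, 1 - 1e-15)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def sgd(X, y, alpha=0.1, epochs=200, seed=0):
    """Stochastic gradient descent: one parameter update per training example."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):  # fresh random order on every pass
            gradient_i = (sigmoid(X[i] @ theta) - y[i]) * X[i]  # gradient for example i only
            theta = theta - alpha * gradient_i
    return theta

# Hypothetical data: a bias column plus one standardized feature
X = np.array([[1, -1.5], [1, -0.9], [1, -0.3], [1, 0.3], [1, 0.9], [1, 1.5]])
y = np.array([0, 0, 1, 0, 1, 1])

theta_sgd = sgd(X, y)
initial_loss = log_loss(X, y, np.zeros(2))
final_loss = log_loss(X, y, theta_sgd)
print("theta:", theta_sgd)
print("log loss before:", initial_loss, "after:", final_loss)
```

Because each step sees a single example, the parameters keep jittering around the minimum instead of settling exactly on it.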

3. Mini-Batch Gradient Descent

Mini-batch Gradient Descent is a compromise between batch gradient descent and SGD. Instead of updating parameters with a single data point (SGD) or the entire dataset (batch gradient descent), mini-batch gradient descent uses small batches of data. 

This leads to more stable updates and faster convergence compared to SGD, with better computational efficiency than batch gradient descent.

The update rule for mini-batch gradient descent is similar to that of batch gradient descent, but it uses a subset of the data:

$$\theta = \theta - \frac{\alpha}{b} \cdot \sum_{i=1}^{b} \nabla J(\theta; x_i, y_i)$$

 

Where:

  • b is the mini-batch size,
  • The sum is over the mini-batch of size b,
  • α is the learning rate.

Mini-batch gradient descent is particularly useful when the dataset is too large to process all at once and when the model needs frequent updates. It strikes a balance between the speed of SGD and the stability of batch gradient descent. 

It is commonly used for large-scale machine learning tasks like training deep neural networks.
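The mini-batch update can be sketched the same way. This is an illustrative example under stated assumptions: the dataset, the batch size of 2, and the name `minibatch_gd` are hypothetical, and the data is reshuffled before each pass.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(X, y, theta):
    p = np.clip(sigmoid(X @ theta), 1e-15, 1 - 1e-15)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def minibatch_gd(X, y, alpha=0.1, batch_size=2, epochs=300, seed=0):
    """Mini-batch gradient descent: average the gradient over b examples per update."""
    rng = np.random.default_rng(seed)
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(m)  # reshuffle before each pass over the data
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]
            errors = sigmoid(X[batch] @ theta) - y[batch]
            gradient = X[batch].T @ errors / len(batch)  # (1/b) * sum of per-example gradients
            theta = theta - alpha * gradient
    return theta

# Hypothetical data: a bias column plus one standardized feature
X = np.array([[1, -1.5], [1, -0.9], [1, -0.3], [1, 0.3], [1, 0.9], [1, 1.5]])
y = np.array([0, 0, 1, 0, 1, 1])

theta_mb = minibatch_gd(X, y)
initial_loss = log_loss(X, y, np.zeros(2))
final_loss = log_loss(X, y, theta_mb)
print("theta:", theta_mb)
print("log loss before:", initial_loss, "after:", final_loss)
```

Setting `batch_size` to 1 makes each update use a single example, as in SGD. Setting it to `len(y)` makes each update use the whole dataset, as in batch gradient descent.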

While the core idea of updating parameters remains the same, the choice of variant depends on the size of your dataset, computational resources, and how fast you need the model to converge. 

Next, let's look at a practical example of Gradient Descent in Logistic Regression to see how it optimizes the model's parameters in action.

Gradient Descent in Logistic Regression Example

To see Gradient Descent in Logistic Regression in action, consider a simple example where the algorithm fine-tunes model parameters to reduce the cost function. 

In this case, the example uses a dataset with two features: age and blood pressure, to predict whether a patient has a disease (1) or not (0). 

This will illustrate how the model adjusts its parameters through each iteration to improve prediction accuracy.

  • Key Points:
    • It is an iterative process, meaning the parameters are updated multiple times until the model converges to the best solution.
    • The learning rate determines how big a step is taken at each iteration. Too small a learning rate makes the process slow, while too large a rate can overshoot the optimal solution.
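The effect of the learning rate is easiest to see on a toy problem. The sketch below is not logistic regression: it runs gradient descent on the stand-in cost f(θ) = θ², whose minimum is at θ = 0. The three learning rates and the starting point are hypothetical choices.

```python
def gradient_descent_1d(alpha, theta_start=10.0, steps=50):
    """Minimize the toy cost f(theta) = theta**2, whose gradient is 2 * theta."""
    theta = theta_start
    for _ in range(steps):
        gradient = 2.0 * theta
        theta = theta - alpha * gradient
    return theta

# Hypothetical learning rates: too small, reasonable, and too large
learning_rates = [0.01, 0.1, 1.1]
results = {alpha: gradient_descent_1d(alpha) for alpha in learning_rates}

for alpha, theta_final in results.items():
    print(f"alpha = {alpha:<4}  theta after 50 steps = {theta_final:.4g}")
```

With the smallest rate, θ is still far from 0 after 50 steps. With the middle rate it is essentially at the minimum. With the largest rate every step overshoots by more than the last, so θ grows instead of shrinking.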

To understand it better, let’s implement gradient descent in logistic regression in Python. We'll use a small dataset for simplicity. 

1. Initialize Parameters

First, we need to set up the initial values for our model parameters, also known as weights. These parameters determine how much influence each feature (age and blood pressure) will have on the model’s prediction.

We'll initialize these parameters to zeros, as is common in logistic regression. The learning rate (α) controls how much the weights change with each update, and we'll set it to 0.01 for this example.

import numpy as np

theta = np.zeros(3)  # Initial weights: bias term, age, blood pressure (X in step 5 has 3 columns)
alpha = 0.01  # Learning rate
iterations = 1000  # Number of iterations

2. Sigmoid Function

The sigmoid function transforms the linear combination of the input features into a probability between 0 and 1. This is important for binary classification, where we need to predict whether the outcome is 0 (no disease) or 1 (disease).

The formula for the sigmoid function is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where $z$ is the linear combination of the features and weights (i.e., $z = \theta_0 + \theta_1 \cdot \text{age} + \theta_2 \cdot \text{blood pressure}$).

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

3. Cost Function

The cost function (log-loss or binary cross-entropy) measures how far the model’s predictions are from the actual outcomes (0 or 1). In logistic regression, we want to minimize this cost to make the model as accurate as possible.

The formula for the cost function is:

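$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log\left(h_\theta(x_i)\right) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right) \right]$$

Where:

  • $m$ is the number of training examples,
  • $y_i$ is the true label (0 or 1),
  • $h_\theta(x_i)$ is the predicted probability calculated by the sigmoid function.

def cost_function(X, y, theta):
    m = len(y)  # Number of training examples
    predictions = sigmoid(np.dot(X, theta))
    # Keep predictions strictly inside (0, 1) so np.log never receives 0
    predictions = np.clip(predictions, 1e-15, 1 - 1e-15)
    cost = - (1/m) * np.sum(y * np.log(predictions) + (1 - y) * np.log(1 - predictions))
    return cost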

4. Gradient Descent Update Rule

To minimize the cost, we use gradient descent. The gradient of the cost function tells us how to adjust the weights to reduce the error. The update rule is:

$$\theta = \theta - \alpha \frac{\partial J(\theta)}{\partial \theta}$$

 

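def gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)  # Number of training examples
    theta = theta.copy()  # Work on a copy so the caller's initial weights are not modified
    cost_history = np.zeros(iterations)
    for i in range(iterations):
        predictions = sigmoid(np.dot(X, theta))  # Predicted probabilities
        error = predictions - y
        theta -= (alpha / m) * np.dot(X.T, error)  # Step against the gradient of the log loss
        cost_history[i] = cost_function(X, y, theta)
    return theta, cost_history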

5. Running Gradient Descent

Now that we have everything set up, we can run gradient descent to optimize the model’s parameters. The algorithm will iterate 1000 times, adjusting the weights at each step to minimize the cost function. 

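# Sample dataset: age and blood pressure with labels
features = np.array([[55, 120], [60, 130], [65, 140], [70, 160]], dtype=float)
y = np.array([0, 0, 1, 1])  # Labels (0 = No disease, 1 = Disease)

# Standardize each feature (zero mean, unit variance). On the raw values a learning
# rate of 0.01 overshoots: the predictions saturate at 0 or 1 and the cost becomes NaN.
features = (features - features.mean(axis=0)) / features.std(axis=0)
X = np.column_stack([np.ones(len(y)), features])  # Prepend a column of 1s for the bias term
theta = np.zeros(X.shape[1])  # Initial weights, sized to match the columns of X

# Perform gradient descent
theta_optimal, cost_history = gradient_descent(X, y, theta, alpha, iterations)

# Display the final parameters (theta) and the final cost
print("Optimal parameters (theta):", theta_optimal)
print("Final cost:", cost_history[-1])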

6. Interpreting the Results

After running gradient descent, we’ll get the fitted values for the model parameters θ0, θ1, and θ2: the bias term and the weights for age and blood pressure that the model uses to predict whether a patient has the disease.

The printed output has the following form. The numbers shown here are illustrative only; the values you get depend on the learning rate, the number of iterations, and how the features are scaled:

Optimal parameters (theta): [ -6.52   0.14   0.06 ]
Final cost: 0.49

The cost at the end of the iterations tells us how well the model has minimized the error on the training data. A lower cost means the predicted probabilities are closer to the true labels.

As you apply this technique, remember to fine-tune your learning rate and monitor the convergence carefully. A learning rate that’s too high can cause instability, while a rate too low can make the process slow. Also, always check for overfitting, especially when using small datasets.
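One simple way to monitor convergence is to stop once the cost stops improving by more than a small tolerance. The sketch below is one possible approach, not a standard recipe: the function name, the tolerance of 1e-8, the iteration cap, and the dataset are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(X, y, theta):
    p = np.clip(sigmoid(X @ theta), 1e-15, 1 - 1e-15)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient_descent_with_stopping(X, y, alpha=0.1, max_iterations=10_000, tolerance=1e-8):
    """Batch gradient descent that stops when the cost improves by less than `tolerance`."""
    theta = np.zeros(X.shape[1])
    previous_cost = log_loss(X, y, theta)
    for iteration in range(1, max_iterations + 1):
        gradient = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta = theta - alpha * gradient
        cost = log_loss(X, y, theta)
        if previous_cost - cost < tolerance:  # negligible improvement: treat as converged
            break
        previous_cost = cost
    return theta, cost, iteration

# Hypothetical data: a bias column plus one standardized feature
X = np.array([[1, -1.5], [1, -0.9], [1, -0.3], [1, 0.3], [1, 0.9], [1, 1.5]])
y = np.array([0, 0, 1, 0, 1, 1])

theta, final_cost, iterations_used = gradient_descent_with_stopping(X, y)
print("theta:", theta)
print("final cost:", final_cost, "after", iterations_used, "iterations")
```

The same check also flags a learning rate that is too high: if the cost rises from one iteration to the next, `previous_cost - cost` is negative and the loop exits immediately.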

You can move on to more advanced topics, such as regularization techniques (L1, L2), to enhance model performance and prevent overfitting. Additionally, exploring optimization algorithms like Adam and RMSprop can help your models train faster and more reliably.

Master Gradient Descent in Logistic Regression with upGrad

Gradient Descent in Logistic Regression is a crucial technique for optimizing machine learning models, enabling the identification of optimal parameters by minimizing errors. However, you may face challenges when working with large datasets or fine-tuning your learning rate to achieve better convergence.

To enhance your grasp of Gradient Descent in Logistic Regression, experiment with learning rates, apply regularization, and explore techniques like Stochastic Gradient Descent. upGrad’s AI and machine learning courses can deepen your knowledge and help tackle advanced challenges.

Frequently Asked Questions (FAQs)

1. Can Gradient Descent in Logistic Regression handle non-linear data?

2. How do you know when Gradient Descent in Logistic Regression has converged?

3. Can I use Gradient Descent in Logistic Regression for multi-class classification?

4. How does the choice of batch size impact Gradient Descent in Logistic Regression?

5. Can Gradient Descent in Logistic Regression be used for regression tasks?

6. What’s the difference between Batch Gradient Descent and Mini-Batch Gradient Descent in Logistic Regression?

7. How does Gradient Descent in Logistic Regression handle high-dimensional data?

8. How do you visualize the optimization process in Gradient Descent in Logistic Regression?

9. Is it possible for Gradient Descent in Logistic Regression to get stuck in a local minimum?

10. How do you handle noisy data when using Gradient Descent in Logistic Regression?

11. Can Gradient Descent in Logistic Regression be used for imbalanced datasets?

Pavan Vadapalli

900 articles published

Director of Engineering @ upGrad. Motivated to leverage technology to solve problems. Seasoned leader for startups and fast moving orgs. Working on solving problems of scale and long term technology s...
