Understanding Gradient Descent in Logistic Regression: A Guide for Beginners
Updated on Jun 26, 2025 | 13 min read | 16.54K+ views
Did You Know? The same math that powers Netflix recommendations—gradient descent in logistic regression—is used by doctors to predict trauma patient survival in seconds. Tools like TRISS help ER teams determine who needs surgery immediately, all thanks to algorithms trained on real-life medical emergencies.
Gradient descent in logistic regression is a method used to find the best-fitting model by minimizing errors. Think of it like finding the quickest route on a map: if you were in a city trying to reach your destination, you'd keep adjusting your path, taking small steps to get closer to your goal.
But understanding how gradient descent in logistic regression works can be tricky, especially for beginners.
This article breaks down the concept in simple terms and guides you through the process step by step.
Want to master gradient descent in logistic regression and build efficient models? Explore upGrad’s AI and Machine Learning Courses and gain the skills to develop real-life AI applications with confidence!
Gradient descent in logistic regression is an optimization technique used to adjust the model’s parameters (weights) to minimize errors and improve predictions. In simple terms, it’s like finding the best route to your destination by making small adjustments along the way.
Logistic regression, on the other hand, is a statistical method used to predict binary outcomes, such as whether an email is spam or not, or if a patient has a disease based on certain factors. The goal is to predict the probability of a particular outcome, which falls between 0 and 1.
Handling data for classification tasks in logistic regression isn’t just about collecting features; you also need the right optimization method, and gradient descent is the most common choice.
So, why is gradient descent so crucial for logistic regression?
Well, without it, your model would struggle to find the best parameters. Gradient descent helps you automatically adjust the weights of your model, getting closer and closer to the most accurate predictions possible.
The next key components to understand are the sigmoid function and the cost function. These two elements play a crucial role in transforming your data into meaningful predictions and ensuring that the model improves with each step of optimization.
These functions are essential for making accurate predictions and fine-tuning the model through processes like gradient descent in logistic regression.
The sigmoid function is at the heart of logistic regression. It maps any real-valued number to a probability between 0 and 1, making it perfect for binary classification problems. The function is also known for its S-shaped curve, or "S-curve," which smoothly transitions between 0 and 1.
This is ideal for predicting probabilities, which is the output required in logistic regression.
The following formula defines the sigmoid function:

σ(z) = 1 / (1 + e^(−z))

Where:
- z is the linear combination of the input features and the model weights (z = θᵀx)
- e is Euler's number (approximately 2.718)
- σ(z) is the output, a probability between 0 and 1
The shape of the sigmoid function is a smooth, continuous curve that transitions from 0 to 1. It approaches but never actually reaches 0 or 1, which is essential for logistic regression's probability output.
This function takes any input, applies the transformation, and outputs a value between 0 and 1, which is interpreted as a probability.
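To make this concrete, here is a minimal sketch using NumPy that prints the sigmoid of a few sample inputs; the input values are chosen only to illustrate how extreme inputs are squashed toward 0 or 1:

import numpy as np

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and z = 0 maps exactly to 0.5.
for z in [-6, -2, 0, 2, 6]:
    print(f"sigmoid({z}) = {sigmoid(z):.4f}")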
Also Read: What Are Activation Functions in Neural Networks? Functioning, Types, Real-world Examples, Challenge
In machine learning, the cost function (or loss function) measures how well the model's predictions match the actual outcomes. We use the log loss function for logistic regression, which calculates the difference between predicted probabilities and the true labels.
Log loss is used instead of squared error loss because it heavily penalizes confident wrong predictions—like predicting 0.99 for a class that is actually 0—ensuring better probability calibration.
The goal is to minimize this cost by adjusting the model’s parameters (weights) through gradient descent in logistic regression.
The cost function used in logistic regression is defined as:

J(θ) = −(1/m) Σ [ yi · log(hθ(xi)) + (1 − yi) · log(1 − hθ(xi)) ]   (sum over all training examples i)

Where:
- m is the number of training examples
- hθ(xi) is the predicted probability for example i, given by the sigmoid function
- yi is the actual label (0 or 1) for example i
The cost function quantifies how far off the predicted probabilities are from the actual outcomes, with the goal being to reduce this error as much as possible.
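To see numerically why log loss penalizes confident wrong predictions so heavily, here is a small illustrative sketch; the probabilities below are made-up examples, not model output:

import numpy as np

def log_loss(y_true, y_pred):
    """Binary cross-entropy for a single prediction."""
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# True label is 0 in every case; only the predicted probability differs.
print(log_loss(0, 0.6))    # mildly wrong: loss ≈ 0.92
print(log_loss(0, 0.99))   # confidently wrong: loss ≈ 4.61
print(log_loss(0, 0.01))   # confidently right: loss ≈ 0.01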
Also Read: Understanding What is Feedforward Neural Network: Detailed Explanation
Why is the Cost Function Necessary?
The cost function is essential because it provides a way to evaluate and optimize the model's predictions. In logistic regression, we minimize this cost through gradient descent, iteratively adjusting the model parameters (weights) to improve the accuracy of our predictions.
The lower the cost, the better the model fits the data.
To minimize the cost function, we use gradient descent, a technique that iteratively adjusts the weights based on the gradient of the cost function.
Each step of gradient descent moves the model's parameters in the direction that reduces the cost, eventually leading to the best possible model.
This adjustment is made using the gradient descent update rule, which determines how much the weights should change at each iteration:

θj := θj − α · ∂J(θ)/∂θj, which for logistic regression becomes θj := θj − (α/m) Σ (hθ(xi) − yi) · xij   (sum over all training examples i)

Where:
- θj is the j-th model weight
- α is the learning rate, which controls the size of each step
- ∂J(θ)/∂θj is the gradient of the cost function with respect to θj
- xij is the value of feature j for example i
By iterating these updates, logistic regression gradually finds the parameters that minimize the cost, improving prediction accuracy.
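To make the update rule concrete, here is a minimal sketch of a single vectorized update step; X, y, theta, and alpha are illustrative names for the design matrix (with an intercept column), the labels, the current weights, and the learning rate:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_step(X, y, theta, alpha):
    """Perform a single batch gradient descent update on the log loss."""
    m = len(y)
    predictions = sigmoid(X @ theta)            # h_theta(x) for every sample
    gradient = (X.T @ (predictions - y)) / m    # average gradient over the batch
    return theta - alpha * gradient             # move against the gradient

# Tiny illustrative example: intercept column plus one feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0])
theta = gradient_step(X, y, np.zeros(2), alpha=0.1)
print(theta)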
While gradient descent is one of the most commonly used optimization techniques, there are several other methods that can be used, depending on the problem and dataset. Here are a few notable ones:
1. Newton's Method
Newton’s method is a second-order optimization technique that uses both the gradient (first derivative) and the Hessian (second derivative) to find the optimal parameters more efficiently than gradient descent.
It updates the parameters with more precision by considering the curvature of the cost function, which can lead to faster convergence.
In each iteration, the update rule for Newton's method is given by:

θ := θ − H⁻¹ ∇J(θ)

Where:
- ∇J(θ) is the gradient (the vector of first derivatives) of the cost function
- H is the Hessian, the matrix of second derivatives of the cost function
- H⁻¹ is the inverse of the Hessian
Newton’s method is ideal when you want faster convergence, especially for smaller datasets where the second-order derivatives can be computed easily. It's most effective when the cost function is smooth and convex.
However, it is computationally expensive, making it less suitable for large datasets or models with a large number of parameters.
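A minimal sketch of a single Newton update for logistic regression might look like this; the function and variable names are illustrative, and the dense Hessian shown here is only practical for small datasets:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def newton_step(X, y, theta):
    """One Newton-Raphson update for the logistic regression log loss."""
    p = sigmoid(X @ theta)                 # predicted probabilities
    gradient = X.T @ (p - y)               # first derivative of the log loss
    S = np.diag(p * (1 - p))               # curvature weights from the sigmoid
    hessian = X.T @ S @ X                  # second-derivative (Hessian) matrix
    # In practice a small ridge term is often added if the Hessian is near-singular.
    return theta - np.linalg.solve(hessian, gradient)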
2. Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) is a variant of gradient descent that updates the model parameters using only one randomly selected data point at each step, rather than the entire dataset. This makes the updates much faster and allows for more frequent adjustments.
The update rule for SGD is:

θj := θj − α · (hθ(xi) − yi) · xij, for a single randomly chosen example i

Where:
- (xi, yi) is one randomly selected training example and its label
- α is the learning rate
- hθ(xi) is the predicted probability for that example
SGD is useful when working with large datasets or when data is arriving in real-time. Since it processes data one point at a time, it's well-suited for tasks like online learning or when the dataset doesn’t fit into memory.
However, the updates are noisier, and it may require more iterations to converge to the optimal solution.
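A minimal sketch of one SGD pass (epoch) over the data, with illustrative names, could look like this:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sgd_epoch(X, y, theta, alpha):
    """One pass over the data, updating theta one random sample at a time."""
    for i in np.random.permutation(len(y)):
        prediction = sigmoid(X[i] @ theta)                   # probability for one sample
        theta = theta - alpha * (prediction - y[i]) * X[i]   # noisy single-sample update
    return theta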
3. Mini-Batch Gradient Descent
Mini-batch Gradient Descent is a compromise between batch gradient descent and SGD. Instead of updating parameters with a single data point (SGD) or the entire dataset (batch gradient descent), mini-batch gradient descent uses small batches of data.
This leads to more stable updates and faster convergence compared to SGD, with better computational efficiency than batch gradient descent.
The update rule for mini-batch gradient descent is similar to that of batch gradient descent, but it uses a subset of the data:

θj := θj − (α/b) Σ (hθ(xi) − yi) · xij   (sum over the b examples in the current mini-batch)

Where:
- b is the batch size (the number of examples in the current mini-batch)
- α is the learning rate
- hθ(xi) and yi are the predicted probability and actual label for example i in the batch
Mini-batch gradient descent is particularly useful when the dataset is too large to process all at once and when the model needs frequent updates. It strikes a balance between the speed of SGD and the stability of batch gradient descent.
It is commonly used for large-scale machine learning tasks like training deep neural networks.
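A hedged sketch of one mini-batch pass, again with illustrative names and an arbitrary default batch size, might look like this:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def minibatch_epoch(X, y, theta, alpha, batch_size=32):
    """One pass over the data using small shuffled batches."""
    indices = np.random.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = indices[start:start + batch_size]
        preds = sigmoid(X[batch] @ theta)
        gradient = X[batch].T @ (preds - y[batch]) / len(batch)
        theta = theta - alpha * gradient
    return theta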
While the core idea of updating parameters remains the same, the choice of variant depends on the size of your dataset, computational resources, and how fast you need the model to converge.
Struggling to grasp how deep learning and neural networks work? Check out upGrad's free course on Fundamentals of Deep Learning and Neural Networks and learn the key concepts behind AI models. Start today!
Next, let's look at a practical example of Gradient Descent in Logistic Regression to see how it optimizes the model's parameters in action.
To see Gradient Descent in Logistic Regression in action, consider a simple example where the algorithm fine-tunes model parameters to reduce the cost function.
In this case, the example uses a dataset with two features: age and blood pressure, to predict whether a patient has a disease (1) or not (0).
This will illustrate how the model adjusts its parameters through each iteration to improve prediction accuracy.
To understand it better, let’s implement gradient descent in logistic regression in Python. We'll use a small dataset for simplicity.
First, we need to set up the initial values for our model parameters, also known as weights. These parameters determine how much influence each feature (age and blood pressure) will have on the model’s prediction.
We'll initialize these parameters to zeros, as is common in logistic regression. The learning rate (α) controls how much the weights change with each update, and we'll set it to 0.01 for this example.
import numpy as np

theta = np.zeros(3)   # Initialize weights (theta): one each for the intercept, age, and blood pressure
alpha = 0.01          # Learning rate
iterations = 1000     # Number of iterations
The sigmoid function transforms the linear combination of the input features into a probability between 0 and 1. This is important for binary classification, where we need to predict whether the outcome is 0 (no disease) or 1 (disease).
The formula for the sigmoid function is:

σ(z) = 1 / (1 + e^(−z))

Where z is the linear combination of the features and weights (i.e., z = θ0 + θ1 · age + θ2 · blood pressure).
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
The cost function (log-loss or binary cross-entropy) measures how far the model’s predictions are from the actual outcomes (0 or 1). In logistic regression, we want to minimize this cost to make the model as accurate as possible.
The formula for the cost function is:

J(θ) = −(1/m) Σ [ yi · log(hθ(xi)) + (1 − yi) · log(1 − hθ(xi)) ]   (sum over all m patients)

Where:
- m is the number of training examples
- hθ(xi) is the predicted probability for patient i
- yi is the actual label (0 = no disease, 1 = disease)
def cost_function(X, y, theta):
    m = len(y)
    predictions = sigmoid(np.dot(X, theta))
    # Clip predictions away from exactly 0 or 1 to avoid log(0)
    predictions = np.clip(predictions, 1e-15, 1 - 1e-15)
    cost = -(1 / m) * np.sum(y * np.log(predictions) + (1 - y) * np.log(1 - predictions))
    return cost
To minimize the cost, we use gradient descent. The gradient of the cost function tells us how to adjust the weights to reduce the error. The update rule, applied at every iteration, is:

θj := θj − (α/m) Σ (hθ(xi) − yi) · xij   (sum over all training examples i)
def gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)
    cost_history = np.zeros(iterations)
    for i in range(iterations):
        predictions = sigmoid(np.dot(X, theta))    # Current probability estimates
        error = predictions - y                    # Difference between predictions and labels
        theta -= (alpha / m) * np.dot(X.T, error)  # Vectorized gradient update
        cost_history[i] = cost_function(X, y, theta)
    return theta, cost_history
Now that we have everything set up, we can run gradient descent to optimize the model’s parameters. The algorithm will iterate 1000 times, adjusting the weights at each step to minimize the cost function.
# Sample dataset: Age and Blood Pressure with labels
X = np.array([[1, 55, 120], [1, 60, 130], [1, 65, 140], [1, 70, 160]]) # Features matrix
y = np.array([0, 0, 1, 1]) # Labels (0 = No disease, 1 = Disease)
# Perform gradient descent
theta_optimal, cost_history = gradient_descent(X, y, theta, alpha, iterations)
# Display the final optimal parameters (theta) and the final cost
print("Optimal parameters (theta):", theta_optimal)
print("Final cost:", cost_history[-1])
After running gradient descent, we’ll get the optimal values for the model parameters θ0, θ1, and θ2. These values represent the best-fit weights for age and blood pressure to predict whether a patient has the disease.
For instance, the optimal parameters might look like this:
Optimal parameters (theta): [ -6.52 0.14 0.06 ]
Final cost: 0.49
The cost at the end of the iterations tells us how well the model has minimized the error. A lower cost means the model has better accuracy.
As you apply this technique, remember to fine-tune your learning rate and monitor the convergence carefully. A learning rate that’s too high can cause instability, while a rate too low can make the process slow. Also, always check for overfitting, especially when using small datasets.
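As a rough sketch of how you might monitor convergence (assuming the sigmoid and cost_function definitions from the example above; the tolerance value tol is an arbitrary choice):

# A hedged sketch of early stopping: halt when the cost improvement between
# iterations drops below a small tolerance.
def gradient_descent_with_tolerance(X, y, theta, alpha, max_iterations, tol=1e-6):
    previous_cost = cost_function(X, y, theta)
    for i in range(max_iterations):
        predictions = sigmoid(np.dot(X, theta))
        theta -= (alpha / len(y)) * np.dot(X.T, predictions - y)
        current_cost = cost_function(X, y, theta)
        if abs(previous_cost - current_cost) < tol:   # convergence reached
            print(f"Converged after {i + 1} iterations")
            break
        previous_cost = current_cost
    return theta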
You can move on to more advanced topics, such as regularization techniques (L1, L2), to enhance model performance and prevent overfitting. Additionally, delving into Optimization Algorithms like Adam and RMSprop will improve your model’s efficiency and accuracy.
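If you want to experiment with L2 regularization before moving to a full framework, here is a hedged sketch built on the same helper functions; lambda_ is an illustrative regularization strength, and the intercept weight is left unpenalized by convention:

def regularized_gradient_step(X, y, theta, alpha, lambda_=0.1):
    """One gradient step on the log loss with an L2 (ridge) penalty."""
    m = len(y)
    predictions = sigmoid(np.dot(X, theta))
    gradient = np.dot(X.T, predictions - y) / m
    penalty = (lambda_ / m) * theta
    penalty[0] = 0.0                      # do not regularize the intercept term
    return theta - alpha * (gradient + penalty)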
Gradient Descent in Logistic Regression is a crucial technique for optimizing machine learning models, enabling the identification of optimal parameters by minimizing errors. However, you may face challenges when working with large datasets or fine-tuning your learning rate to achieve better convergence.
To enhance your grasp of Gradient Descent in Logistic Regression, experiment with learning rates, apply regularization, and explore techniques like Stochastic Gradient Descent. upGrad’s AI and machine learning courses can deepen your knowledge and help tackle advanced challenges.
In addition to the courses mentioned above, here are some more free courses that can help you elevate your skills:
You can also get personalized career counseling with upGrad to guide your career path, or visit your nearest upGrad center and start hands-on training today!
Gradient Descent in Logistic Regression is primarily used for linear classification tasks. However, if your data is non-linear, logistic regression can still work by using transformations like polynomial features. For more complex non-linear problems, consider using other models like support vector machines or neural networks, which can better handle non-linear data relationships.
Gradient Descent in Logistic Regression is considered to have converged when the cost function shows minimal change between iterations. This indicates that the model parameters are stable and the cost is close to its minimum. Another way to check convergence is by monitoring the gradient values—when these values are close to zero, the algorithm has reached the optimal solution.
Yes, Gradient Descent in Logistic Regression can be extended to multi-class classification through techniques like one-vs-rest or softmax regression. These approaches allow you to apply logistic regression to problems where there are more than two possible classes. The fundamental optimization process remains the same, but the model is adapted to handle multiple classes efficiently.
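As a rough illustration of the one-vs-rest idea (all function names, the learning rate, and the iteration count here are illustrative choices, not a library API):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_one_vs_rest(X, labels, n_classes, alpha=0.1, iterations=1000):
    """Train one binary logistic regression per class (one-vs-rest)."""
    thetas = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        y_binary = (labels == c).astype(float)     # 1 for class c, 0 otherwise
        for _ in range(iterations):
            preds = sigmoid(X @ thetas[c])
            thetas[c] -= alpha * X.T @ (preds - y_binary) / len(labels)
    return thetas

def predict_one_vs_rest(X, thetas):
    """Pick the class whose binary model assigns the highest probability."""
    return np.argmax(sigmoid(X @ thetas.T), axis=1)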
The batch size in Gradient Descent in Logistic Regression determines how much data is used to calculate each gradient update. A small batch size (stochastic gradient descent) introduces noise but speeds up the process, while a larger batch size provides more accurate estimates but can be slower. Finding the right batch size helps balance speed and model accuracy, particularly when scaling to large datasets.
While Gradient Descent in Logistic Regression is primarily used for binary classification tasks, the core concept of gradient descent can be applied to other types of regression. For regression problems, Linear Regression is used, where the goal is to predict continuous values. Logistic regression, however, is suited for classification tasks due to its probability output via the sigmoid function.
The key difference between Batch Gradient Descent and Mini-Batch Gradient Descent in Logistic Regression is how data is processed. Batch Gradient Descent uses the entire dataset to compute the gradient at each step, making it more stable but slower for large datasets. Mini-Batch Gradient Descent, on the other hand, processes data in smaller chunks, offering faster convergence while still maintaining some stability.
When working with high-dimensional data, Gradient Descent in Logistic Regression can still be effective but may suffer from issues like slower convergence or overfitting. In these cases, dimensionality reduction techniques like PCA (Principal Component Analysis) or regularization methods (L1 or L2) can be applied to improve model performance and prevent overfitting while optimizing the cost function.
Visualizing Gradient Descent in Logistic Regression involves plotting the cost function over iterations. As the algorithm progresses, the plot will show the cost gradually decreasing. For a better understanding, 3D visualizations of the cost surface can be used to show how gradient descent moves through different points in the parameter space, eventually converging to the minimum cost.
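For a quick visualization using the cost_history returned by the worked example above (assuming matplotlib is installed):

import matplotlib.pyplot as plt

# Plot the cost recorded at each iteration; cost_history comes from the
# gradient_descent function in the worked example.
plt.plot(cost_history)
plt.xlabel("Iteration")
plt.ylabel("Cost (log loss)")
plt.title("Gradient descent convergence")
plt.show()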
Unlike some other optimization problems, Gradient Descent in Logistic Regression is unlikely to get stuck in local minima because the cost function is convex, meaning it has a single global minimum. However, a poor choice of learning rate can still cause the algorithm to converge very slowly or to diverge rather than settle at that minimum.
Noisy data can negatively affect the performance of Gradient Descent in Logistic Regression by causing fluctuations in the gradient. Techniques such as data smoothing, regularization, or outlier removal can help reduce the impact of noise. Additionally, using mini-batch gradient descent helps smooth out the noisy updates by averaging gradients over a batch of data points.
Yes, Gradient Descent in Logistic Regression can be used for imbalanced datasets, but it may not perform well if the class distribution is highly skewed. To address this, techniques like class weighting or oversampling the minority class can be employed. Regularization can also help by preventing the model from being overly influenced by the dominant class, allowing it to better handle class imbalance.