Vanishing Gradient Problem in Deep Neural Networks

by Avi Kedare

Introduction

When I first started building deep neural networks, I had this assumption: more layers means better performance. More depth, more abstraction, better features, right? Seemed logical. So I stacked 10, 12, 15 layers and hit train.

The loss just… sat there. Barely moved. Like the network had completely given up learning.

I checked my code a dozen times. Loss function, correct. Data pipeline, fine. Learning rate, reasonable. But something was broken in a way that wasn't obvious from the outside.

It took me a while to figure out what was actually happening inside those early layers. The gradients, those little correction signals that tell each layer how to update its weights, were becoming so small by the time they reached the front of the network that those layers were essentially frozen. Not frozen because I told them to be. Frozen because the math was killing the signal before it could get there.

That's the vanishing gradient problem. And once you actually understand it, you start seeing it everywhere in deep learning history: it's a big part of why people stuck with shallow networks for years, why LSTMs were invented, why ReLU displaced sigmoid so quickly, and why ResNet changed everything.

Let's get into it properly.

Understanding Backpropagation First

Before we can talk about vanishing gradients, you need a solid picture of how backpropagation actually works. Not the textbook version, just the intuition.

During a forward pass, data flows through the network layer by layer.

Each layer performs:

a linear transformation
followed by a nonlinear activation

$$ z_l = W_l \cdot a_{l-1} + b_l $$

$$ a_l = \sigma(z_l) $$

Where:

$W_l$ is the weight matrix
$a_{l-1}$ is the activation from the previous layer
$b_l$ is the bias
$\sigma$ is the activation function

The network stacks these operations repeatedly until it produces a final prediction.
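
Here's a minimal sketch of that forward pass in plain Python, just to make the two equations concrete. The tiny 2-layer network, its weights, and the helper names (`dense_forward`, `forward`) are made up for illustration, and sigmoid is assumed as the activation.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, applied to a single number."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_forward(W, b, a_prev):
    """One layer: z = W . a_prev + b, then a = sigmoid(z), element-wise."""
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
         for row, b_i in zip(W, b)]
    a = [sigmoid(z_i) for z_i in z]
    return z, a

def forward(layers, x):
    """Run the input through every (W, b) pair in order."""
    a = x
    for W, b in layers:
        _, a = dense_forward(W, b, a)
    return a

# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
prediction = forward(layers, [1.0, 1.0])
print(prediction)
```

Each layer only ever sees the previous layer's activations, which is exactly why the backward pass later has to walk through every layer in reverse.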

Then the loss is computed.

And this is where backpropagation begins.

Backward pass is where the network actually learns.

Every weight gets updated using gradients computed through the chain rule.

$W \leftarrow W - \eta \cdot \frac{\partial L}{\partial W}$

Here:

$\eta$ is the learning rate
$\frac{\partial L}{\partial W}$ tells the weight how it should move
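
A rough sketch of that update rule, assuming a toy loss $L(w) = \sum_i w_i^2$ that I picked only because its gradient is easy to write down. The function names are hypothetical. The last two lines show what matters for this article: when the gradient is tiny, the weight barely moves.

```python
def sgd_step(w, grad, lr):
    """Apply W <- W - eta * dL/dW to a flat list of weights."""
    return [w_i - lr * g_i for w_i, g_i in zip(w, grad)]

def toy_grad(w):
    """Gradient of the assumed toy loss L(w) = sum(w_i ** 2)."""
    return [2.0 * w_i for w_i in w]

w = [1.0, -2.0]
eta = 0.1
for _ in range(3):
    w = sgd_step(w, toy_grad(w), eta)
print(w)  # each step scales the weights by 0.8

# With a gradient of 1e-7 the same rule changes the weight by only 1e-8.
tiny = sgd_step([1.0], [1e-7], eta)
print(tiny)
```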

The important detail is this:

To compute the gradient for an early layer, the network has to multiply together the local derivatives of every layer after it.

Layer 2 depends on:

layer 3
layer 4
layer 5
all the way to the output

And this repeated multiplication is exactly where the problem starts.

Why Gradients Vanish: The Chain Rule Is Merciless

Let me make this precise. Consider a network with $L$ layers. The gradient of the loss with respect to the first layer's weights requires computing:

$$ \frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial a_L} \cdot \prod_{l=2}^{L} \frac{\partial a_l}{\partial a_{l-1}} \cdot \frac{\partial a_1}{\partial W_1} $$

Each term in that product combines the weight matrix with the derivative of the activation function at that layer:

$$ \frac{\partial a_l}{\partial a_{l-1}} = \operatorname{diag}\big(\sigma'(z_l)\big) \cdot W_l $$

If $\sigma'(z_l)$ is consistently small (say, close to 0), then that product collapses exponentially as you go deeper. With 10 layers of small derivatives, the gradient at layer 1 is virtually gone.
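
To see the collapse in numbers, here's a small sketch for a width-1 network, where every per-layer factor is just the scalar $w_l \cdot \sigma'(z_l)$. I'm assuming every weight is 1.0 and every pre-activation is 0.0, which is the best case for sigmoid. Real networks have matrices here, so treat this as an illustration of the trend, not a simulation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backward_factors(weights, pre_activations):
    """Per-layer factors w_l * sigma'(z_l) for a width-1 network."""
    return [w * sigmoid_prime(z) for w, z in zip(weights, pre_activations)]

def chain_product(factors):
    """Multiply the per-layer factors together, as the chain rule does."""
    result = 1.0
    for f in factors:
        result *= f
    return result

# Assumed values: all weights 1.0, all pre-activations 0.0 (sigma'(0) = 0.25).
for depth in (2, 5, 10):
    factors = backward_factors([1.0] * depth, [0.0] * depth)
    print(depth, chain_product(factors))
```

Even in this best case the product at depth 10 is below one in a million, and any saturated layer makes it smaller still.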

The question then becomes: which activation function produces small derivatives?

Sigmoid. Every single time.

Sigmoid Saturation: The Root Cause

The sigmoid function looks like this:

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

And its derivative is:

$$ \sigma'(z) = \sigma(z) \cdot (1 - \sigma(z)) $$

One important thing to notice here: the maximum value of $\sigma'(z)$ is 0.25, and that only happens right at $z = 0$. For any input with large positive or large negative magnitude, the derivative approaches 0.
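
You can check both claims with a few lines of Python. This sketch just samples the derivative on a grid I chose arbitrarily (-10 to 10 in steps of 0.5) and looks at where the peak is.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Sample the derivative on a grid from -10 to 10 in steps of 0.5.
grid = [i * 0.5 for i in range(-20, 21)]
values = {z: sigmoid_prime(z) for z in grid}
peak_z = max(values, key=values.get)

print(peak_z, values[peak_z])      # peak at z = 0.0 with value 0.25
print(values[5.0], values[-5.0])   # both are below 0.01
```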

Figure 1: Sigmoid Activation Function and Its Derivative

Left: the familiar S-shaped sigmoid curve, outputting values between 0 and 1.

Right: the derivative of sigmoid, which peaks at 0.25 and collapses toward 0 at both extremes. This collapse is the direct cause of vanishing gradients.

The derivative is telling you: "How sensitive is my output to a small change in input?" When that answer is "barely at all," learning stops.

And the problem is that during training, neuron activations frequently sit in those saturated regions, especially early in training when weights are not yet calibrated. So the network is just… quietly dying in the early layers, and you don't even see it in the loss curve right away.

Real Impact on Training

After training some models, I noticed something strange. I had a 12-layer fully connected network on a classification task. Loss dropped nicely for the first few epochs, then completely plateaued. I tried:

Halving the learning rate → same plateau
Adding more neurons → same plateau
Training for 3x more epochs → still flat

Then I added gradient logging. What I saw was pretty eye-opening:

# PyTorch: log gradient norms per layer (run this after loss.backward())
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad norm = {param.grad.norm().item():.8f}")

Output (approximate):

layer12.weight: grad norm = 0.48231100
layer11.weight: grad norm = 0.12043200
layer10.weight: grad norm = 0.03010400
layer9.weight:  grad norm = 0.00752600
layer8.weight:  grad norm = 0.00188100
layer7.weight:  grad norm = 0.00047000
layer6.weight:  grad norm = 0.00011700
layer5.weight:  grad norm = 0.00002900
layer4.weight:  grad norm = 0.00000730
layer3.weight:  grad norm = 0.00000180
layer2.weight:  grad norm = 0.00000046
layer1.weight:  grad norm = 0.00000011

That's a factor of ~4 million difference between layer 12 and layer 1. Layer 1 was receiving a gradient so small it was effectively receiving nothing. The early layers weren't learning at all. They were frozen by mathematics, not by design.
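
If you want to turn a printout like that into a single number, one option is the average shrink factor per layer. This is a quick sketch using the approximate norms from the log above; the geometric-mean calculation is my own way of summarizing them, not something the framework reports.

```python
# Gradient norms copied from the log above, ordered layer 12 down to layer 1.
grad_norms = [
    0.48231100, 0.12043200, 0.03010400, 0.00752600,
    0.00188100, 0.00047000, 0.00011700, 0.00002900,
    0.00000730, 0.00000180, 0.00000046, 0.00000011,
]

total_ratio = grad_norms[0] / grad_norms[-1]
steps = len(grad_norms) - 1
per_layer_ratio = total_ratio ** (1.0 / steps)  # geometric mean shrink factor

print(f"layer 12 / layer 1: {total_ratio:,.0f}")
print(f"average shrink per layer: {per_layer_ratio:.2f}x")
```

The per-layer factor comes out at roughly 4x, which lines up with sigmoid's derivative ceiling of 0.25.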

In CNNs, this is particularly damaging because the early convolutional layers are supposed to learn edge detectors and texture patterns, the foundational visual features. If they don't learn, nothing above them can compensate.

Table 1: Activation Function Comparison

ReLU and Why It Actually Fixed This

The Rectified Linear Unit is almost embarrassingly simple:

$$ \text{ReLU}(z) = \max(0, z) $$

Its derivative is either 0 (for negative inputs) or 1 (for positive inputs):

$$ \text{ReLU}'(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z > 0 \end{cases} $$

And that derivative of 1 is the whole point. When gradients pass through a ReLU layer in the positive region, they don't get multiplied by something less than 1. They pass through unchanged. No shrinkage. No saturation.
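
A tiny sketch of what "pass through unchanged" means. The ten pre-activation values are arbitrary positive numbers I picked, and the comparison looks only at the activation derivatives, leaving the weights out.

```python
def relu(z):
    return max(0.0, z)

def relu_prime(z):
    """Derivative of ReLU; at z == 0 this sketch returns 0 by convention."""
    return 1.0 if z > 0 else 0.0

def product(values):
    result = 1.0
    for v in values:
        result *= v
    return result

# Assumed pre-activations: all positive, so every unit is active.
active = [0.3, 1.2, 2.5, 0.7, 4.0, 0.1, 3.3, 0.9, 1.5, 2.2]
relu_path = product(relu_prime(z) for z in active)  # stays at 1.0
sigmoid_best_case = 0.25 ** len(active)             # sigmoid's ceiling per layer

print(relu_path, sigmoid_best_case)
```

Ten active ReLU layers leave the factor at 1.0, while ten sigmoid layers cap it below one in a million.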

At first I thought deeper networks would always perform better, but when I switched from sigmoid to ReLU in that same 12-layer network, the training behavior completely changed. Loss dropped steadily, gradient norms across early layers stayed in a reasonable range, and the model actually converged.

Figure 2: ReLU Activation Function

ReLU passes positive values unchanged and zeros out negatives.

Its derivative is exactly 1 in the positive region, with no multiplication by a fraction, so gradients travel backward without shrinking. The right panel shows this clearly: the derivative is binary, not tapered.

The one problem with ReLU is "dying ReLU": neurons that always receive negative input get a gradient of 0 and stop learning, often permanently. Leaky ReLU fixes this with a small slope (commonly 0.01) for negative inputs, keeping those neurons alive.
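
Here's what that looks like as a sketch in plain Python. The function names and the `negative_slope` parameter name are my own choices; 0.01 is the slope mentioned above.

```python
def leaky_relu(z, negative_slope=0.01):
    """Identity for positive inputs, a small linear slope for negative ones."""
    return z if z > 0 else negative_slope * z

def leaky_relu_prime(z, negative_slope=0.01):
    """Gradient is 1 for positive inputs and negative_slope otherwise."""
    return 1.0 if z > 0 else negative_slope

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

z = -3.0  # a unit stuck in the negative region
print(relu_prime(z), leaky_relu_prime(z))  # prints: 0.0 0.01
```

The gradient through the stuck unit is small but not zero, so its weights can still drift back toward the active region.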

ResNet Skip Connections: A Different Solution Entirely

Even with ReLU, going really deep, like 100+ layers, still causes training instability. He et al. (2015) thought about this differently. Instead of just fixing the activation function, what if you gave gradients a direct highway to travel through?

That's what residual connections (skip connections) do:

$$ a_l = F(a_{l-1}, W_l) + a_{l-1} $$

The layer output is the learned transformation plus the identity: the raw input passes through directly. During backpropagation, the gradient of this block is:

$$ \frac{\partial a_l}{\partial a_{l-1}} = \frac{\partial F}{\partial a_{l-1}} + 1 $$

That +1 is the key. No matter how small $\frac{\partial F}{\partial a_{l-1}}$ gets, the identity term is still there, so the block's derivative stays close to 1 instead of collapsing toward 0. The gradient always has a direct path back through the identity connection.
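
A scalar sketch of why that matters over many blocks. I'm assuming each block's local derivative $\frac{\partial F}{\partial a_{l-1}}$ is 0.1, an illustrative "small" value and not a measurement. In a real network these are Jacobian matrices and the +1 is an identity matrix, so this only shows the trend.

```python
def product(values):
    result = 1.0
    for v in values:
        result *= v
    return result

# Assumed local derivative dF/da for each of 20 blocks (illustrative value).
local_grads = [0.1] * 20

plain = product(local_grads)                      # plain stack: factors multiply
residual = product(1.0 + g for g in local_grads)  # residual stack: 1 + dF/da each

print(plain)     # on the order of 1e-20
print(residual)  # stays well away from zero
```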

Figure 3: ResNet Skip Connection Architecture

The identity shortcut (top path) bypasses the weight layers entirely. During backpropagation, gradients can travel backward directly through the skip connection without passing through any weight matrices. This is why ResNet-152 can train stably where a plain 152-layer network cannot.

This is a fundamentally different approach from fixing activation functions. You're restructuring the gradient flow itself. ResNet-50, ResNet-101, ResNet-152 all became possible because of this simple addition.

Practical Observations During Training

Let me share some things I actually noticed when running experiments. Not theory, just what you'll observe if you do the same.

1. Loss plateauing after a few epochs
The most common early sign. If your loss stops decreasing after epoch 3-5 in a deep network and you're using sigmoid activations, vanishing gradients are likely. Switching to ReLU usually fixes this (you'll need to restart training with the new activations).

2. Gradient norms decaying exponentially by layer
Use TensorBoard's histogram summaries or just print grad norms:

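# TensorFlow gradient monitoring
# Assumes model, loss_fn, x, y, and step are defined elsewhere in the training loop
import tensorflow as tf

writer = tf.summary.create_file_writer("logs/grads")  # example log directory
with tf.GradientTape() as tape:
    loss = loss_fn(y, model(x, training=True))  # compare predictions to labels
grads = tape.gradient(loss, model.trainable_variables)
with writer.as_default():
    for grad, var in zip(grads, model.trainable_variables):
        if grad is not None:
            tf.summary.histogram(f"grad/{var.name}", grad, step=step)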

In TensorBoard, you'll visually see histograms for early layers collapsing toward zero while later layers have wide, healthy distributions.

3. Early CNN layers not updating
I ran a CNN on CIFAR-10 with sigmoid activations and visualized the first-layer filters after 20 epochs. They looked like random noise, completely unlearned. Same network with ReLU had clean, structured edge detectors by epoch 5.

4. Comparing sigmoid vs ReLU experimentally
Same architecture, same dataset, same hyperparameters:

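# PyTorch: swap activations to compare
import torch.nn as nn

class DeepNet(nn.Module):
    def __init__(self, activation='relu'):
        super().__init__()
        # ReLU and Sigmoid hold no parameters, so one instance can be reused
        act = nn.ReLU() if activation == 'relu' else nn.Sigmoid()
        self.layers = nn.Sequential(
            nn.Linear(784, 256), act,
            nn.Linear(256, 256), act,
            nn.Linear(256, 256), act,
            nn.Linear(256, 256), act,
            nn.Linear(256, 128), act,
            nn.Linear(128, 10)
        )

    def forward(self, x):
        # Flatten each sample to a 784-dim vector before the linear layers
        return self.layers(x.flatten(start_dim=1))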

With sigmoid: final accuracy ~68%, gradients in layer 1 ≈ 1e-7
With ReLU: final accuracy ~94%, gradients in layer 1 ≈ 0.08

That's not a subtle difference. It's a fundamental one.

5. Batch Normalization as a stabilizer
Adding BatchNorm between layers:

nn.Sequential(
    nn.Linear(256, 256),
    nn.BatchNorm1d(256),
    nn.ReLU()
)

BatchNorm re-centers and rescales activations at each layer, which keeps inputs to each layer in a range where activation derivatives are healthy. It doesn't fully solve vanishing gradients, but it keeps them manageable, even with sigmoid in shallower networks.
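
To make "a range where activation derivatives are healthy" concrete, here's a stripped-down sketch of the normalization step on a batch of scalars. It leaves out the learnable scale and shift that a real BatchNorm layer has, and the five pre-activation values are ones I made up to sit deep in sigmoid's flat region.

```python
import math

def batch_norm(values, eps=1e-5):
    """Normalize a batch of scalars to roughly zero mean and unit variance.
    The learnable scale and shift of a full BatchNorm layer are omitted."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [(v - mu) / math.sqrt(var + eps) for v in values]

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def average(values):
    return sum(values) / len(values)

# Assumed pre-activations that have drifted into sigmoid's saturated region.
pre_acts = [6.0, 7.5, 8.0, 9.0, 10.5]
before = average([sigmoid_prime(z) for z in pre_acts])
after = average([sigmoid_prime(z) for z in batch_norm(pre_acts)])

print(before, after)
```

After normalization the values sit around zero, where the sigmoid derivative is near its 0.25 peak instead of near 0.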

Table 2: Vanishing vs Exploding Gradients

Table 3: Modern Solutions Comparison

Gradient Shrinking Through Layers

This is what backpropagation looks like in a deep sigmoid network. Every layer multiplies the gradient by at most 0.25 from the activation derivative alone. Across 10 layers, that's multiplication by at most $(0.25)^{10} \approx 9.5 \times 10^{-7}$, a number so small it's computationally irrelevant. The first layer effectively doesn't learn.

Sigmoid Saturation Animation

When neurons land in the flat regions of sigmoid, which happens constantly during training, their derivative is nearly zero. That near-zero factor multiplies everything behind it in the chain rule. One saturated layer effectively cuts the gradient highway. And deep networks have many such layers.

ReLU vs Sigmoid Gradient Flow

This is why the deep learning field shifted from sigmoid to ReLU almost overnight in the early 2010s. ReLU doesn't shrink gradients in the positive region: it passes them through with a derivative of exactly 1. The gradient signal survives the trip back to early layers, and those layers actually learn.

Figure 4: ResNet Architecture at Scale

The key architectural difference between a plain deep network and ResNet. The skip connections (shown as arrows bypassing blocks) give gradients a shortcut path. Even if the learned transformation F produces near-zero gradients, the +1 from the identity keeps each block's derivative close to 1, so the signal still gets through.

Key Takeaways

If you take nothing else from this article, let these stick:

Vanishing gradients happen because of repeated multiplication of small numbers — specifically, the small derivatives of saturating activation functions like sigmoid.
Sigmoid's max derivative is 0.25, which means multiplied across 10 layers you get essentially nothing reaching the first few layers.
ReLU solved this by having a derivative of 1 in the positive region — gradients pass through unchanged.
ResNet solved it structurally — skip connections give gradients a direct, unobstructed backward path regardless of how deep the network is.
BatchNorm, careful weight initialization (He/Xavier), and gradient clipping are additional tools in your arsenal against unstable gradient flow.
You can measure this yourself. Print gradient norms per layer in PyTorch or add histogram summaries in TensorFlow. If early layers have near-zero norms, you've found your problem.
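
Since He and Xavier initialization come up in that list, here's a rough sketch of the two scaling rules as they are usually stated: Xavier (Glorot) uses a variance of 2 / (fan_in + fan_out), and He uses 2 / fan_in, which is the one usually paired with ReLU. The helper names and the 256-unit layer size are just for illustration; in practice you'd use your framework's built-in initializers.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier/Glorot normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """Standard deviation for He normal initialization."""
    return math.sqrt(2.0 / fan_in)

def init_weights(fan_in, fan_out, std, seed=0):
    """Draw a (fan_out x fan_in) weight matrix from N(0, std^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

# Hypothetical 256 -> 256 layer.
W = init_weights(256, 256, he_std(256))
print(xavier_std(256, 256), he_std(256))
```

The point of both rules is to keep the scale of activations and gradients roughly constant from layer to layer, so the repeated multiplication starts out neither shrinking nor blowing up.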

Conclusion

The vanishing gradient problem isn't some edge case. It was a central reason deep learning stalled for years before the mid-2010s. Once you understand the chain rule and what happens when you repeatedly multiply small numbers, the whole story makes sense. Sigmoid was killing gradients. ReLU stopped the bleeding. ResNet gave gradients a highway.

When I look at the gradient norm printout from my early experiments (those twelve lines showing exponential decay from 0.48 down to 0.00000011), it's hard not to feel a bit of respect for the researchers who figured this out before modern tooling made it this visible. They had to think through the math on paper and trust the theory.

Now you can just print a number and see it happening in real time.

The next time your loss plateaus and you can't figure out why, check your gradient norms. Layer by layer. The answer is usually right there.

References

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. https://www.deeplearningbook.org/
Karpathy, A. et al. CS231n: Convolutional Neural Networks for Visual Recognition. Stanford University. https://cs231n.stanford.edu/
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition. arXiv:1512.03385. https://arxiv.org/abs/1512.03385
Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167. https://arxiv.org/abs/1502.03167
Vaswani, A. et al. (2017). Attention Is All You Need. arXiv:1706.03762. https://arxiv.org/abs/1706.03762
PyTorch Documentation — Autograd, Gradient Monitoring. https://pytorch.org/docs/stable/autograd.html
TensorFlow Documentation — GradientTape, TensorBoard. https://www.tensorflow.org/api_docs/python/tf/GradientTape
Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
