“Why are we even learning this? We’ll never use this after school.”
If you’ve gone to school in Sierra Leone or really anywhere, you’ve heard someone say that. Maybe you've even said it yourself. Whether it was calculus, the chain rule, or linear algebra, these topics felt disconnected from real life. Just numbers on a chalkboard, far from any “real job.”
But what if I told you that those same “useless” topics are exactly what’s powering AI, ChatGPT, autonomous cars, disease diagnosis, and modern robotics? What if I told you that one of the most revolutionary breakthroughs in technology is just a clever use of the chain rule? Yes, the one from Calculus 101.
Welcome to Backpropagation.
What Is Backpropagation Anyway?
Backpropagation, or “backprop” for short, is the technique that makes it possible for neural networks to learn. It’s the secret sauce that allows these models to improve with experience, adjusting themselves based on the mistakes they make.
From predicting your favorite songs to diagnosing diseases from X-rays, backprop is there quietly doing the math in the background.
It’s not just code; it’s math. Pure, beautiful math. And it’s changing the world.
The technique was formally introduced in the landmark 1986 paper "Learning representations by back-propagating errors" by Rumelhart, Hinton, and Williams [source]. This foundational work became the backbone of modern neural networks.
The Chain Rule Is the Hero
At the heart of backprop is a powerful mathematical idea: the chain rule from calculus.
The chain rule says if one variable depends on another, which itself depends on another, you can use derivatives to trace how a change in one affects the others.
The chain rule in symbols:
Let’s say you have a loss L that depends on a prediction y, and y in turn depends on an input x.
Then by the chain rule:
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}
In neural networks, this principle applies layer by layer, allowing us to send gradients from the output layer back through the network to update the weights.
In general, if each layer’s output feeds the next,
a_1 = f_1(x), \quad a_2 = f_2(a_1), \quad \ldots, \quad L = f_n(a_{n-1}),
then:
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial a_{n-1}} \cdot \frac{\partial a_{n-1}}{\partial a_{n-2}} \cdots \frac{\partial a_1}{\partial x}
This is the foundation of how neural networks improve.
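To see the chain rule doing exactly this, here is a minimal sketch in PyTorch; the toy composition y = 3x + 1 and L = y² is made up for illustration, and the gradient autograd reports for x should equal the product of the two local derivatives.
import torch

# Toy composition: y = 3x + 1, then L = y**2
x = torch.tensor(2.0, requires_grad=True)
y = 3 * x + 1            # dy/dx = 3
L = y ** 2               # dL/dy = 2y

L.backward()             # autograd applies the chain rule for us

manual = (2 * y) * 3     # dL/dy * dy/dx, computed by hand
print(x.grad.item(), manual.item())  # both print 42.0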

Step-by-Step: How Backpropagation Works
Let’s walk through the process (a minimal code sketch of these four stages follows the list):
- Forward Pass: Input data is passed through the network to produce a prediction.
- Loss Computation: The difference between the prediction and the true value is calculated.
- Backward Pass: The network calculates gradients using the chain rule.
- Weight Update: The model uses these gradients to update its weights using an optimizer like gradient descent.
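Here is that sketch of the four stages in PyTorch, using a tiny made-up linear model and a single made-up data point (the shapes and learning rate are arbitrary):
import torch
import torch.nn as nn

model = nn.Linear(3, 1)                # tiny model: 3 inputs -> 1 output
x = torch.randn(1, 3)                  # made-up input
y_true = torch.tensor([[1.0]])         # made-up target

# 1. Forward pass
y_pred = model(x)

# 2. Loss computation
loss = nn.functional.mse_loss(y_pred, y_true)

# 3. Backward pass: gradients flow via the chain rule
loss.backward()

# 4. Weight update: plain gradient descent with lr = 0.1
with torch.no_grad():
    for p in model.parameters():
        p -= 0.1 * p.grad
        p.grad.zero_()
In practice an optimizer such as optim.SGD wraps step 4 for you, exactly as in the full example further below.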
How Gradients Flow (with Math)
Let’s consider a mini neural network:
- Input x
- Hidden layer with weights W_1, activation a_1
- Output layer with weights W_2, activation a_2
- Loss function L
Forward Pass (with activation function \sigma and true target y):
a_1 = \sigma(W_1 x), \quad a_2 = \sigma(W_2 a_1), \quad L = \text{loss}(a_2, y)
Backward Pass (Gradients):
\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial a_2} \cdot \frac{\partial a_2}{\partial W_2}, \qquad \frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial a_2} \cdot \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial W_1}
This chain-like structure is how gradients “propagate back,” hence the name backpropagation.
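To make the gradient flow concrete, here is a minimal sketch of this two-layer network in PyTorch, assuming sigmoid activations and a squared-error loss (both choices are for illustration); the hand-applied chain rule for the gradient of L with respect to W_2 should agree with what autograd computes.
import torch

torch.manual_seed(0)
x  = torch.randn(2, 1)                      # input (column vector)
W1 = torch.randn(3, 2, requires_grad=True)  # hidden-layer weights
W2 = torch.randn(1, 3, requires_grad=True)  # output-layer weights
y  = torch.tensor([[1.0]])                  # true target

# Forward pass
a1 = torch.sigmoid(W1 @ x)          # hidden activation
a2 = torch.sigmoid(W2 @ a1)         # output activation
L  = 0.5 * ((a2 - y) ** 2).sum()    # squared-error loss

L.backward()                        # autograd's backward pass

# Manual chain rule for dL/dW2:
# dL/da2 = (a2 - y), da2/dz2 = a2 * (1 - a2), dz2/dW2 = a1^T
with torch.no_grad():
    dL_dW2 = ((a2 - y) * a2 * (1 - a2)) @ a1.T
    print(torch.allclose(dL_dW2, W2.grad))  # True: hand-derived gradient matches autograd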
Why It Works
Imagine standing on a mountain in the dark and trying to find your way down.
You take a small step in the direction that decreases your altitude (loss).
You keep stepping until you reach the lowest point; that’s gradient descent.
Backpropagation tells us the direction and steepness of the slope. With each step (iteration), we get closer to the point where the model performs best.
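Here is a minimal sketch of that idea on a made-up one-dimensional loss, L(w) = (w - 3)²; each iteration uses the slope to step downhill, which is exactly what the optimizer does with the gradients backpropagation provides.
# Gradient descent on a made-up 1-D loss: L(w) = (w - 3)**2
w = 10.0            # start somewhere high on the "mountain"
lr = 0.1            # step size (learning rate)

for step in range(50):
    grad = 2 * (w - 3)    # dL/dw: direction and steepness of the slope
    w -= lr * grad        # take a small step downhill

print(w)   # ends up very close to 3.0, the bottom of the valley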

Why This Matters for Students
If you’re a student reading this, don’t believe the lie.
Don’t believe that what you’re learning won’t matter.
The math you study in school is more than a requirement. It’s a foundation for invention.
It may seem hard now. It may feel disconnected. But the world is being rebuilt by people who understand the tools you’re learning today.
From machine learning to space travel, math is the language of the future. If you understand it, you’re not just ready; you’re necessary.
Backprop in Action
import torch
import torch.nn as nn
import torch.optim as optim

# XOR truth table
# Input pairs (only the model's weights need gradients, not the inputs)
x = torch.tensor([[0., 0.],
                  [0., 1.],
                  [1., 0.],
                  [1., 1.]])

# Expected output (XOR)
y_true = torch.tensor([[0.],
                       [1.],
                       [1.],
                       [0.]])

# Define a simple neural network with 1 hidden layer
model = nn.Sequential(
    nn.Linear(2, 4),
    nn.Sigmoid(),
    nn.Linear(4, 1),
    nn.Sigmoid()
)

# Loss and optimizer
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Training loop
for epoch in range(10000):
    y_pred = model(x)                 # forward pass
    loss = loss_fn(y_pred, y_true)    # loss computation
    optimizer.zero_grad()             # clear old gradients
    loss.backward()                   # backward pass (backprop)
    optimizer.step()                  # weight update
    if epoch % 1000 == 0:
        print(f"Epoch {epoch} Loss: {loss.item():.4f}")

# Final results
print("\nFinal outputs after training:")
with torch.no_grad():
    print(model(x))
Basic PyTorch example where a neural network learns to perform the XOR logic gate using backpropagation.
Conclusion
So the next time someone leans back in a lecture hall and says,
“We’ll never use this after school…”
You can smile and say:
“Maybe not today, but I will.”
Because somewhere in this world, a neural network is learning to save lives.
And it’s all thanks to the chain rule.

