The paper
Learning representations by back-propagating errors (Rumelhart, Hinton & Williams, 1986) is one of the most consequential papers in AI history. It showed that networks with hidden layers could be trained efficiently — the missing piece that had stalled neural network research for over a decade.
What was novel in 1986
Before this paper, only networks without hidden layers (perceptrons) could be trained reliably. The problem: for a network with hidden units, there was no principled way to assign credit or blame for output errors to the weights deep inside the network.
Backpropagation solved this by applying the chain rule of calculus systematically — propagating the gradient of the loss function backward through each layer in turn. Three ideas made this tractable:
- Differentiable activation functions — using sigmoid (rather than a hard step function) meant gradients could flow through neurons
- The chain rule as an algorithm — framing gradient computation as a graph traversal that reuses intermediate results
- Generalisation beyond two layers — the same rule works for any depth, which is what makes deep learning possible
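The first of these ideas can be made concrete: the sigmoid has a simple, everywhere-nonzero derivative, σ′(z) = σ(z)(1 − σ(z)), whereas a hard step function has derivative zero almost everywhere and so blocks any gradient. A minimal sketch in Python (function names are illustrative, not from the paper):

```python
import math

def sigmoid(z):
    # Smooth, differentiable squashing function: σ(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Derivative σ'(z) = σ(z) * (1 - σ(z)) — nonzero for every finite z,
    # so an error signal multiplied by it never vanishes outright.
    # A hard step function's derivative is 0 everywhere it is defined.
    s = sigmoid(z)
    return s * (1.0 - s)
```

Because σ′ can be computed from σ's own output, the backward pass can reuse values cached during the forward pass, which is the second idea (the chain rule as a graph traversal that reuses intermediate results).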
The mathematics
For a network with input x, hidden layer h, and output ŷ, the forward pass computes:
h = σ(W · x + b₁)
ŷ = σ(V · h + b₂)
E = ½(y - ŷ)²
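The forward pass can be sketched directly from these three equations. For readability this is a scalar version (one input, one hidden unit, one output) rather than the paper's vector-matrix formulation; the function name `forward` is illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W, b1, V, b2, y):
    # Scalar sketch of the forward pass, mirroring the equations above.
    h = sigmoid(W * x + b1)       # h = σ(W·x + b₁)
    y_hat = sigmoid(V * h + b2)   # ŷ = σ(V·h + b₂)
    E = 0.5 * (y - y_hat) ** 2    # E = ½(y − ŷ)²
    return h, y_hat, E
```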
The backward pass uses the chain rule to compute ∂E/∂W and ∂E/∂V, then updates all weights by gradient descent:
W ← W - η · ∂E/∂W
V ← V - η · ∂E/∂V
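Putting the backward pass and the update rule together gives one full training step. Again a scalar sketch rather than the paper's matrix form; `train_step` and the `delta` variable names are illustrative, with each `delta` being the error signal ∂E/∂z at a layer's pre-activation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, y, W, b1, V, b2, eta=0.5):
    # Forward pass (as in the equations above, scalar case)
    h = sigmoid(W * x + b1)
    y_hat = sigmoid(V * h + b2)

    # Backward pass: apply the chain rule, output layer first.
    delta2 = -(y - y_hat) * y_hat * (1 - y_hat)  # ∂E/∂ŷ · σ'(V·h + b₂)
    dE_dV = delta2 * h
    dE_db2 = delta2

    # Propagate the error signal back through the hidden layer.
    delta1 = delta2 * V * h * (1 - h)            # chain rule, one more factor
    dE_dW = delta1 * x
    dE_db1 = delta1

    # Gradient-descent updates: θ ← θ − η · ∂E/∂θ
    W -= eta * dE_dW
    b1 -= eta * dE_db1
    V -= eta * dE_dV
    b2 -= eta * dE_db2
    return W, b1, V, b2
```

Repeating this step drives the error down, which is the whole training loop in miniature.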
The key insight is that ∂E/∂W can be decomposed via the chain rule into factors that are each easy to compute locally — and this decomposition generalises to any number of layers.
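That generalisation can be sketched as a loop: cache every activation on the way forward, then walk the layers in reverse, multiplying in one local chain-rule factor per layer. This is an illustrative scalar version for a stack of any depth (the function name `backprop_any_depth` is an assumption, not terminology from the paper):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_any_depth(x, y, weights, biases, eta=0.5):
    # Forward pass through an arbitrary stack of scalar sigmoid layers,
    # caching each activation for reuse in the backward pass.
    activations = [x]
    for w, b in zip(weights, biases):
        activations.append(sigmoid(w * activations[-1] + b))
    y_hat = activations[-1]

    # Backward pass: delta is ∂E/∂z at the current layer's pre-activation.
    delta = -(y - y_hat) * y_hat * (1 - y_hat)
    for i in reversed(range(len(weights))):
        grad_w = delta * activations[i]  # activations[i] is this layer's input
        if i > 0:
            # Propagate delta using the *pre-update* weight, one chain-rule
            # factor per layer: δᵢ₋₁ = δᵢ · wᵢ · σ'(zᵢ₋₁)
            delta_prev = delta * weights[i] * activations[i] * (1 - activations[i])
        weights[i] -= eta * grad_w
        biases[i] -= eta * delta
        if i > 0:
            delta = delta_prev
    return weights, biases
```

Nothing in the loop depends on the number of layers, which is the sense in which the same rule "works for any depth".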
Why it matters today
Every modern neural network — including the transformers behind LLMs — is trained using backpropagation. The 1986 formulation is essentially unchanged; what has changed is hardware (GPUs), data scale, and architectural choices like attention. The core optimisation loop remains: forward pass, compute loss, backward pass, update weights.