The paper

Learning representations by back-propagating errors (Rumelhart, Hinton & Williams, 1986) is one of the most consequential papers in AI history. It showed that networks with hidden layers could be trained efficiently — the missing piece that had stalled neural network research for over a decade.

What was novel in 1986

Before this paper, researchers could only train networks without hidden layers (perceptrons). The problem: for a network with hidden units, there was no principled way to assign credit or blame for errors back to the weights deep inside the network.

Backpropagation solved this by applying the chain rule of calculus systematically — propagating the gradient of the loss function backward through each layer in turn. Three ideas made this tractable:

  1. Differentiable activation functions — using sigmoid (rather than a hard step function) meant gradients could flow through neurons
  2. The chain rule as an algorithm — framing gradient computation as a graph traversal that reuses intermediate results
  3. Generalisation beyond two layers — the same rule works for any depth, which is what makes deep learning possible
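Ideas 1 and 2 can be sketched in a few lines of Python. This is an illustrative example, not code from the paper: the chain x → h → y and the weight values are invented. Note how the backward pass reuses the stored activations h and y rather than recomputing anything, because σ′(z) = σ(z)(1 − σ(z)) can be evaluated from the activation alone.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward pass through a tiny two-step chain: x -> h -> y
x, w1, w2 = 0.5, 0.8, -1.2
h = sigmoid(w1 * x)      # hidden activation (stored for reuse)
y = sigmoid(w2 * h)      # output activation (stored for reuse)

# Backward pass: the chain rule, reusing h and y.
# sigma'(z) = sigma(z) * (1 - sigma(z)), so only activations are needed.
dy_dw2 = y * (1 - y) * h
dy_dw1 = y * (1 - y) * w2 * h * (1 - h) * x

# Sanity check dy/dw1 against a finite difference
eps = 1e-6
y_eps = sigmoid(w2 * sigmoid((w1 + eps) * x))
assert abs((y_eps - y) / eps - dy_dw1) < 1e-5
```

The finite-difference check at the end is exactly the kind of brute-force gradient estimate that backpropagation makes unnecessary: the analytic backward pass gets the same number in one sweep.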

Interactive visualisation

The diagram below shows a small three-layer network. Use it to build intuition for how signals flow forward and gradients flow backward.

[Interactive study materials for the Nature 1986 paper "Learning Representations by Back-Propagating Errors" (Rumelhart, Hinton & Williams): four expandable panels.

Historical Context — a perceptron (input → output only) beside a network with a hidden layer; notes on the credit assignment problem, the perceptron limitation, and what backprop solved.

Core Algorithm — forward and backward passes through a 3-2-1 network, with wᵢⱼ the weight from input xⱼ to hidden hᵢ and vᵢⱼ the weight from hidden hⱼ to the output; notes on the forward pass, sigmoid activation, the error function E, the backward pass, the weight update rule, and momentum (eq. 9).

Key Insights — gradient descent over the error surface in weight space (local vs. global minima); notes on representation learning, why nonlinearity matters, local minima in practice, and the extension to RNNs.

Experiments — the symmetry-detection network (Fig. 1, six inputs with mirrored weights ±14, ±7, ±4 into two hidden units) and the family-tree task (Figs. 2–4); a note on weight decay.]

The mathematics

For a network with input x, hidden layer h, output ŷ, and target output y, the forward pass computes:

h = σ(W · x + b₁)
ŷ = σ(V · h + b₂)
E = ½(y - ŷ)²
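The forward pass above translates directly into code. This is a minimal pure-Python sketch; the 3-2-1 network shape and the weight values are illustrative choices, not taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W, b1, V, b2, y_target):
    # h = sigma(W.x + b1): one hidden unit per row of W
    h = [sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b)
         for row, b in zip(W, b1)]
    # y_hat = sigma(V.h + b2): a single output unit
    y_hat = sigmoid(sum(v_j * h_j for v_j, h_j in zip(V, h)) + b2)
    E = 0.5 * (y_target - y_hat) ** 2   # squared error against the target
    return h, y_hat, E

# Illustrative 3-2-1 network with arbitrary weights
x = [1.0, 0.0, 1.0]
W = [[0.2, -0.4, 0.1], [-0.3, 0.5, 0.6]]
b1 = [0.0, 0.1]
V = [0.7, -0.2]
b2 = 0.05
h, y_hat, E = forward(x, W, b1, V, b2, y_target=1.0)
```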

The backward pass uses the chain rule to compute ∂E/∂W and ∂E/∂V, then updates all weights by gradient descent:

W ← W - η · ∂E/∂W
V ← V - η · ∂E/∂V

The key insight is that ∂E/∂W can be decomposed via the chain rule into factors that are each easy to compute locally — and this decomposition generalises to any number of layers.
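The full loop — forward pass, backward pass, gradient-descent update — can be sketched as below. Again the 3-2-1 shape, the weights, and the learning rate η = 0.5 are illustrative assumptions; the local error signals δ follow from applying the chain rule to the equations above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, y_target, W, b1, V, b2, eta=0.5):
    # --- forward pass ---
    h = [sigmoid(sum(w * xj for w, xj in zip(row, x)) + b)
         for row, b in zip(W, b1)]
    y_hat = sigmoid(sum(v * hj for v, hj in zip(V, h)) + b2)
    E = 0.5 * (y_target - y_hat) ** 2

    # --- backward pass: chain rule, layer by layer ---
    # dE/dy_hat = (y_hat - y), times sigma' at the output
    delta_out = (y_hat - y_target) * y_hat * (1 - y_hat)
    dV = [delta_out * hj for hj in h]                       # dE/dV_j
    # propagate the output error back through V to each hidden unit
    delta_h = [delta_out * v * hj * (1 - hj) for v, hj in zip(V, h)]
    dW = [[d * xj for xj in x] for d in delta_h]            # dE/dW_ij

    # --- gradient-descent update: w <- w - eta * dE/dw ---
    V = [v - eta * g for v, g in zip(V, dV)]
    b2 = b2 - eta * delta_out
    W = [[w - eta * g for w, g in zip(row, grow)]
         for row, grow in zip(W, dW)]
    b1 = [b - eta * d for b, d in zip(b1, delta_h)]
    return W, b1, V, b2, E

# Repeated steps on a single example drive the error down
W, b1 = [[0.2, -0.4, 0.1], [-0.3, 0.5, 0.6]], [0.0, 0.1]
V, b2 = [0.7, -0.2], 0.05
errors = []
for _ in range(200):
    W, b1, V, b2, E = train_step([1.0, 0.0, 1.0], 1.0, W, b1, V, b2)
    errors.append(E)
```

Every quantity in the backward pass is computed from values already produced by the forward pass, which is the locality property that lets the same code scale to any depth.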

Why it matters today

Every modern neural network — including the transformers behind LLMs — is trained using backpropagation. The 1986 formulation is essentially unchanged; what has changed is hardware (GPUs), data scale, and architectural choices like attention. The core optimisation loop remains: forward pass, compute loss, backward pass, update weights.