The paper
Learning representations by back-propagating errors (Rumelhart, Hinton & Williams, 1986) is one of the most consequential papers in AI history. It showed that networks with hidden layers could be trained efficiently — the missing piece that had stalled neural network research for over a decade.
What was novel in 1986
Before this paper, only networks without hidden layers (perceptrons) could be trained reliably. The problem: for a network with hidden units, there was no principled way to assign credit or blame for output errors to the weights deep inside the network.
Backpropagation solved this by applying the chain rule of calculus systematically — propagating the gradient of the loss function backward through each layer in turn. Three ideas made this tractable:
- Differentiable activation functions — using sigmoid (rather than a hard step function) meant gradients could flow through neurons
- The chain rule as an algorithm — framing gradient computation as a graph traversal that reuses intermediate results
- Generalisation beyond two layers — the same rule works for any depth, which is what makes deep learning possible
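The first of these ideas can be made concrete: the sigmoid has a simple, everywhere-nonzero derivative, σ′(z) = σ(z)(1 − σ(z)), whereas a hard step function has derivative zero almost everywhere and so blocks any gradient. A minimal sketch in Python (function names are illustrative, not from the paper):

```python
import math

def sigmoid(z):
    # Smooth, differentiable squashing function: σ(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Derivative σ'(z) = σ(z) * (1 - σ(z)) — nonzero for every finite z,
    # so an error signal multiplied by it never vanishes outright.
    # A hard step function's derivative is 0 everywhere it is defined.
    s = sigmoid(z)
    return s * (1.0 - s)
```

Because σ′ can be computed from σ's own output, the backward pass can reuse values cached during the forward pass, which is the second idea (the chain rule as a graph traversal that reuses intermediate results).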
The mathematics
For a network with input x, hidden layer h, and output ŷ, the forward pass computes:
h = σ(W · x + b₁)
ŷ = σ(V · h + b₂)
E = ½(y - ŷ)²
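The forward pass can be sketched directly from these three equations. For readability this is a scalar version (one input, one hidden unit, one output) rather than the paper's vector-matrix formulation; the function name `forward` is illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W, b1, V, b2, y):
    # Scalar sketch of the forward pass, mirroring the equations above.
    h = sigmoid(W * x + b1)       # h = σ(W·x + b₁)
    y_hat = sigmoid(V * h + b2)   # ŷ = σ(V·h + b₂)
    E = 0.5 * (y - y_hat) ** 2    # E = ½(y − ŷ)²
    return h, y_hat, E
```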
The backward pass uses the chain rule to compute ∂E/∂W and ∂E/∂V, then updates all weights by gradient descent:
W ← W - η · ∂E/∂W
V ← V - η · ∂E/∂V
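Putting the backward pass and the update rule together gives one full training step. Again a scalar sketch rather than the paper's matrix form; `train_step` and the `delta` variable names are illustrative, with each `delta` being the error signal ∂E/∂z at a layer's pre-activation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, y, W, b1, V, b2, eta=0.5):
    # Forward pass (as in the equations above, scalar case)
    h = sigmoid(W * x + b1)
    y_hat = sigmoid(V * h + b2)

    # Backward pass: apply the chain rule, output layer first.
    delta2 = -(y - y_hat) * y_hat * (1 - y_hat)  # ∂E/∂ŷ · σ'(V·h + b₂)
    dE_dV = delta2 * h
    dE_db2 = delta2

    # Propagate the error signal back through the hidden layer.
    delta1 = delta2 * V * h * (1 - h)            # chain rule, one more factor
    dE_dW = delta1 * x
    dE_db1 = delta1

    # Gradient-descent updates: θ ← θ − η · ∂E/∂θ
    W -= eta * dE_dW
    b1 -= eta * dE_db1
    V -= eta * dE_dV
    b2 -= eta * dE_db2
    return W, b1, V, b2
```

Repeating this step drives the error down, which is the whole training loop in miniature.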
The key insight is that ∂E/∂W can be decomposed via the chain rule into factors that are each easy to compute locally — and this decomposition generalises to any number of layers.
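That generalisation can be sketched as a loop: cache every activation on the way forward, then walk the layers in reverse, multiplying in one local chain-rule factor per layer. This is an illustrative scalar version for a stack of any depth (the function name `backprop_any_depth` is an assumption, not terminology from the paper):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_any_depth(x, y, weights, biases, eta=0.5):
    # Forward pass through an arbitrary stack of scalar sigmoid layers,
    # caching each activation for reuse in the backward pass.
    activations = [x]
    for w, b in zip(weights, biases):
        activations.append(sigmoid(w * activations[-1] + b))
    y_hat = activations[-1]

    # Backward pass: delta is ∂E/∂z at the current layer's pre-activation.
    delta = -(y - y_hat) * y_hat * (1 - y_hat)
    for i in reversed(range(len(weights))):
        grad_w = delta * activations[i]  # activations[i] is this layer's input
        if i > 0:
            # Propagate delta using the *pre-update* weight, one chain-rule
            # factor per layer: δᵢ₋₁ = δᵢ · wᵢ · σ'(zᵢ₋₁)
            delta_prev = delta * weights[i] * activations[i] * (1 - activations[i])
        weights[i] -= eta * grad_w
        biases[i] -= eta * delta
        if i > 0:
            delta = delta_prev
    return weights, biases
```

Nothing in the loop depends on the number of layers, which is the sense in which the same rule "works for any depth".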
Why it matters today
Every modern neural network — including the transformers behind LLMs — is trained using backpropagation. The 1986 formulation is essentially unchanged; what has changed is hardware (GPUs), data scale, and architectural choices like attention. The core optimisation loop remains: forward pass, compute loss, backward pass, update weights.