personal learning notes

The Road to Understanding LLMs

A chronological journey through 17 foundational papers — from backpropagation to chain-of-thought prompting.

ERA 1 · THE FOUNDATIONS 1986–2003
1986

Backpropagation

Rumelhart, Hinton & Williams

The algorithm that unlocked deep learning
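The idea can be sketched in a few lines of NumPy: push an error signal backwards through each layer with the chain rule, then check one gradient entry numerically. A toy two-layer network for illustration, not the paper's notation.

```python
import numpy as np

# Tiny two-layer net on a scalar target; gradients by the chain rule.
rng = np.random.default_rng(0)
x = rng.normal(size=(3,))
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))
y = 1.0

def forward(W1, W2):
    h = np.tanh(W1 @ x)          # hidden activations
    out = W2 @ h                  # scalar output
    return h, out, 0.5 * (out[0] - y) ** 2

h, out, loss = forward(W1, W2)

# Backward pass: propagate dLoss/dout back through each layer.
d_out = out[0] - y                               # dL/dout
dW2 = d_out * h[None, :]                         # dL/dW2
d_h = d_out * W2[0]                              # dL/dh
dW1 = ((1 - h**2) * d_h)[:, None] * x[None, :]   # through tanh to dL/dW1

# Verify one entry against a finite difference.
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
num = (forward(W1p, W2)[2] - loss) / eps
print(abs(num - dW1[0, 0]) < 1e-4)  # True
```

The finite-difference check at the end is the standard sanity test for a hand-derived backward pass.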

1997

Long Short-Term Memory

Hochreiter & Schmidhuber

Teaching networks to remember over long sequences
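One LSTM step can be sketched directly from the gating equations: a forget gate, an input gate, and an output gate control a persistent cell state. The stacked-gate weight layout here is a convention of this sketch, not the paper's original formulation (which, notably, had no forget gate yet).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    # One LSTM step: gates decide what to forget, what to write, what to expose.
    z = W @ x + U @ h + b                 # all four gate pre-activations at once
    d = h.shape[0]
    f = sigmoid(z[0*d:1*d])               # forget gate
    i = sigmoid(z[1*d:2*d])               # input gate
    o = sigmoid(z[2*d:3*d])               # output gate
    g = np.tanh(z[3*d:4*d])               # candidate cell update
    c = f * c + i * g                     # cell state: the long-term memory
    h = o * np.tanh(c)                    # hidden state: the per-step output
    return h, c

rng = np.random.default_rng(0)
dx, dh = 4, 8
W = rng.normal(size=(4 * dh, dx))
U = rng.normal(size=(4 * dh, dh))
b = np.zeros(4 * dh)
h, c = np.zeros(dh), np.zeros(dh)
for x in rng.normal(size=(10, dx)):       # run over a short input sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (8,)
```

The additive update `c = f * c + i * g` is the point: gradients can flow through the cell state without vanishing the way they do through repeated matrix multiplications.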

1998

Convolutional Neural Networks

LeCun et al.

Spatial structure meets learnable filters
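The "learnable filter" half of that tagline is just a small kernel slid over the image. A minimal valid-mode sketch (cross-correlation, as deep learning frameworks implement it), with a hand-picked edge filter standing in for a learned one:

```python
import numpy as np

def conv2d(image, kernel):
    # Valid cross-correlation: slide the filter over every position.
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A vertical-edge filter responds only where intensity changes left-to-right.
img = np.zeros((5, 5))
img[:, 2:] = 1.0                      # dark left half, bright right half
edge = np.array([[-1.0, 1.0]])
out = conv2d(img, edge)
print(out[0])  # [0. 1. 0. 0.] -- fires exactly at the edge
```

In a CNN the kernel values are parameters learned by backpropagation, and the same weights are reused at every spatial position.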

2003

Neural Language Model

Bengio et al.

Words as vectors, language as probability

ERA 2 · THE BUILDING BLOCKS 2013–2016
2013

Word2Vec

Mikolov et al.

King − Man + Woman = Queen
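The famous analogy is literal vector arithmetic followed by a nearest-neighbour search. A sketch with hand-picked 3-d toy vectors (real Word2Vec embeddings are learned from co-occurrence statistics; these values are chosen so the gender and royalty directions line up):

```python
import numpy as np

# Toy embeddings; assumption: hand-crafted, not trained.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.9, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.1, 0.8, 0.9]),
    "apple": np.array([0.5, 0.0, 0.2]),   # distractor
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land nearest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)  # queen
```

Excluding the query words from the candidate set matters: the nearest neighbour of the raw sum is often `king` itself in real embedding spaces.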

2014

Sequence to Sequence

Sutskever, Vinyals & Le

Encoding meaning, then decoding it in another language

2015

Adam Optimizer

Kingma & Ba

The optimizer that trains almost everything
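The update rule fits in a few lines: exponential moving averages of the gradient and its square, bias-corrected, then a per-parameter step. A NumPy sketch of the paper's Algorithm 1 on a toy quadratic:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Moving averages of the gradient (m) and its elementwise square (v).
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    # Bias correction counteracts the zero-initialisation of m and v.
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(x) = x^2 starting from x = 5; gradient is 2x.
x, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 5001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.05)
print(x)  # close to 0
```

The division by `sqrt(v_hat)` is what gives each parameter its own effective step size, which is why the same defaults work across so many architectures.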

2015

Attention

Bahdanau, Cho & Bengio

Letting the decoder look back at relevant input
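"Looking back" means scoring every encoder state against the current decoder state, softmaxing the scores into weights, and taking a weighted sum. A sketch of Bahdanau-style additive attention; the parameter names `Wd`, `We`, `va` are illustrative (in the real model they are learned jointly with the rest of the network):

```python
import numpy as np

def additive_attention(dec_state, enc_states, Wd, We, va):
    # Bahdanau-style score: a small feed-forward net, not a dot product.
    scores = np.tanh(enc_states @ We.T + dec_state @ Wd.T) @ va
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over source positions
    context = weights @ enc_states        # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(0)
T, d, a = 4, 8, 16                        # source length, hidden dim, attention dim
enc = rng.normal(size=(T, d))
dec = rng.normal(size=(d,))
Wd = rng.normal(size=(a, d))
We = rng.normal(size=(a, d))
va = rng.normal(size=(a,))
context, w = additive_attention(dec, enc, Wd, We, va)
print(w.sum())  # 1.0 -- a proper distribution over source words
```

The context vector is recomputed at every decoding step, which is exactly what frees the model from squeezing the whole source sentence into one fixed vector.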

2016

Residual Networks

He et al.

Skip connections that let gradients flow through 152 layers
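A residual block computes `y = F(x) + x`: the layers only learn the residual `F`, and the identity path carries gradients straight through. A minimal two-layer sketch:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    # y = F(x) + x: the skip connection adds the input back unchanged.
    return relu(x @ W1.T) @ W2.T + x

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(d,))

# With zero weights the block is exactly the identity -- the property that
# lets very deep stacks start out well-behaved and stay trainable.
y = residual_block(x, np.zeros((d, d)), np.zeros((d, d)))
print(np.allclose(y, x))  # True
```

Stacking 152 of these is still "just" function composition, but every block's gradient includes an identity term, so the signal never has to survive 152 multiplications.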

2016

BPE Tokenisation

Sennrich, Haddow & Birch

Breaking words into learnable subword pieces
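The training loop is simple: start from characters and repeatedly merge the most frequent adjacent pair. A sketch using corpus counts echoing the paper's running example (end-of-word markers omitted for brevity):

```python
from collections import Counter

def merge_word(word, pair, merged):
    # Replace every occurrence of the pair inside one word.
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)

def bpe_merges(word_counts, num_merges):
    # Each word starts as a tuple of characters.
    vocab = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, count in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        merges.append(best)
        merged = best[0] + best[1]
        vocab = {merge_word(w, best, merged): c for w, c in vocab.items()}
    return merges

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = bpe_merges(corpus, 2)
print(merges)  # [('e', 's'), ('es', 't')]
```

Frequent words end up as single tokens while rare words decompose into subwords, so the vocabulary stays small without an out-of-vocabulary problem.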

ERA 3 · THE BREAKTHROUGH 2017
2017

Attention Is All You Need

Vaswani et al.

The architecture behind every modern LLM
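At its core is one operation, scaled dot-product attention: `softmax(QK^T / sqrt(d_k)) V`. A sketch of the single-head case (the full model adds multiple heads, projections, and position-wise layers on top):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Every query attends to every key; sqrt(d_k) keeps the logits well-scaled.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))
# Self-attention: queries, keys, and values all come from the same sequence
# (the real model applies learned projections to X first).
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (5, 8)
```

Unlike an RNN, every pair of positions interacts in one step, which is what makes training parallel across the whole sequence.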

ERA 4 · THE LLM ERA 2018–2022
2018

GPT-1

Radford et al.

Pre-train once, fine-tune for anything

2019

BERT

Devlin et al.

Reading left and right at the same time

2020

Scaling Laws

Kaplan et al.

More data, more compute, predictably better
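"Predictably" means the loss follows a power law in parameter count. A sketch of the parameter-count law using the approximate constants reported in the paper (for models trained with enough data and compute):

```python
def predicted_loss(n_params, n_c=8.8e13, alpha=0.076):
    # Kaplan et al. fit: L(N) ~ (N_c / N)^alpha_N.
    # n_c and alpha are the paper's reported approximate constants.
    return (n_c / n_params) ** alpha

# Doubling parameters shrinks loss by a constant factor of 2**alpha (~5%),
# regardless of where you start on the curve.
ratio = predicted_loss(1e9) / predicted_loss(2e9)
print(round(ratio, 3))  # 1.054
```

Each doubling buys the same multiplicative improvement, which is why performance at a new scale can be forecast from fits at smaller scales.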

2020

GPT-3

Brown et al.

175 billion parameters and emergent few-shot abilities

2022

InstructGPT

Ouyang et al.

Aligning language models with human intent

2022

Chain-of-Thought

Wei et al.

Think step by step
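The technique is purely a prompting change: few-shot exemplars whose answers include the intermediate reasoning. A sketch of the format, with exemplar wording paraphrased from the paper's running example:

```python
# Standard few-shot prompt: the exemplar answer is just the final number.
standard = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\nA: 11\n"
    "Q: The cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many do they have?\nA:"
)

# Chain-of-thought: same prompt, but the exemplar answer spells out the steps,
# prompting the model to emit its own reasoning before the final answer.
cot = standard.replace(
    "A: 11",
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.",
)
print(len(cot) > len(standard))  # True
```

No weights change, no fine-tuning: at sufficient model scale, the worked exemplar alone is enough to elicit step-by-step reasoning on the new question.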