Paper: Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3, 1137–1155.
Context
This paper sits at a pivotal moment in NLP. Statistical language modelling in 2003 was dominated by n-gram models — frequency-counting approaches that were effective but fundamentally limited in how they generalised. Bengio et al. proposed a qualitatively different attack on the problem, introducing learned distributed word representations jointly trained with a neural probability function. It is, in retrospect, one of the foundational papers of modern NLP and a direct ancestor of word2vec, GloVe, and the embedding layers that are now ubiquitous in deep learning.
1. The Root Problem: Curse of Dimensionality
The joint probability of a sequence of $T$ words over a vocabulary $V$ has up to $|V|^T$ possible configurations. With $|V| = 100{,}000$ and $T = 10$, that's $10^{50}$ possible sequences, a space that can never be observed, let alone estimated from data.
This is the curse of dimensionality as Bengio frames it: not just a computational complaint, but a statistical one. Your training data is maximally sparse relative to the space you need to generalise over. Almost every sequence you’ll ever want to evaluate is one you’ve never seen.
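A quick sanity check on the arithmetic, using the paper's running example of a 100,000-word vocabulary and 10-word sequences:

```python
# |V| = 100,000 words, T = 10 positions: the sequence space is |V|**T.
assert 100_000 ** 10 == 10 ** 50
```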
Everything else in the paper is a response to this single problem.
2. How N-Gram Models Cope — and Their Limits
N-gram models sidestep the curse by severely restricting the problem. Rather than modelling the full joint distribution, they truncate the conditioning context to the last $n-1$ words:

$$P(w_t \mid w_1^{t-1}) \approx P(w_t \mid w_{t-n+1}^{t-1})$$

This reduces the table from $|V|^T$ entries to $|V|^n$: large but tractable. Each entry is estimated by counting co-occurrences in the training corpus.
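A minimal illustration of the counting estimator and its zero-probability failure mode, on a hypothetical toy corpus:

```python
from collections import Counter

# Hypothetical toy corpus; real models count over millions of tokens.
corpus = "the cat sat on the mat the cat sat on the rug".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def p_mle(w, ctx):
    """Maximum-likelihood trigram estimate: count(ctx + w) / count(ctx)."""
    if bigrams[ctx] == 0:
        return 0.0
    return trigrams[ctx + (w,)] / bigrams[ctx]

print(p_mle("mat", ("on", "the")))   # 0.5: seen once in two "on the" contexts
print(p_mle("dog", ("on", "the")))   # 0.0: unseen trigram gets zero probability
```

The second call is the problem the smoothing schemes below exist to fix: a perfectly plausible continuation receives probability zero just because the exact trigram never occurred.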
The immediate problem: most trigrams will be unobserved, producing zero probability estimates for plausible sequences. Two approaches to managing this:
Smoothed (Interpolated) Trigram — Jelinek & Mercer (1980)
Blend estimates across all orders with learned mixing weights:

$$\hat{P}(w_t \mid w_{t-1}, w_{t-2}) = \alpha_0(q_t)\,p_0 + \alpha_1(q_t)\,p_1(w_t) + \alpha_2(q_t)\,p_2(w_t \mid w_{t-1}) + \alpha_3(q_t)\,p_3(w_t \mid w_{t-1}, w_{t-2})$$

The $\alpha_i(q_t)$ are mixing weights conditioned on the frequency $q_t$ of the context $(w_{t-1}, w_{t-2})$: when the context is frequently observed, trust the trigram ($\alpha_3$ large); when it's rare, lean on lower-order statistics. The weights are fit via EM on a held-out set. This is exactly Bengio's formulation in Section 4.1.
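A toy sketch of the blend. The fixed weights below stand in for the EM-fitted, frequency-conditioned $\alpha_i(q_t)$, and the corpus is hypothetical; the point is that an unseen trigram still gets non-zero probability from the lower orders:

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
N = len(corpus)

uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_interp(w, ctx, alphas=(0.1, 0.3, 0.6)):
    """Jelinek-Mercer blend of unigram, bigram, and trigram MLE estimates.
    Fixed weights are illustrative only; the paper conditions them on
    context frequency and fits them by EM on held-out data."""
    a1, a2, a3 = alphas
    p1 = uni[w] / N
    p2 = bi[(ctx[1], w)] / uni[ctx[1]] if uni[ctx[1]] else 0.0
    p3 = tri[ctx + (w,)] / bi[ctx] if bi[ctx] else 0.0
    return a1 * p1 + a2 * p2 + a3 * p3

# "dog sat mat" never occurs, yet the unigram term keeps this non-zero.
assert p_interp("mat", ("dog", "sat")) > 0
```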
Back-off Trigram — Katz (1987)
Rather than always blending, back-off cascades: try the most specific model, and only fall back if evidence is too thin:

$$P(w_t \mid w_{t-2}, w_{t-1}) = \begin{cases} \tilde{P}(w_t \mid w_{t-2}, w_{t-1}) & \text{if } \mathrm{count}(w_{t-2}, w_{t-1}, w_t) > 0 \\ \alpha(w_{t-2}, w_{t-1})\, P(w_t \mid w_{t-1}) & \text{otherwise} \end{cases}$$

where $\tilde{P}$ is a discounted estimate and $\alpha$ redistributes the held-out probability mass.
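A minimal sketch of the cascade. The fixed absolute discount and constant back-off weight are simplifications (real Katz back-off derives discounts from Good-Turing counts and computes $\alpha$ so the distribution normalises); the corpus is hypothetical:

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
tri = Counter(zip(corpus, corpus[1:], corpus[2:]))

D = 0.5  # fixed absolute discount; Katz uses Good-Turing-derived discounts

def p_backoff(w, ctx):
    # Most specific model first: discounted trigram estimate if observed.
    if tri[ctx + (w,)] > 0:
        return (tri[ctx + (w,)] - D) / bi[ctx]
    # Otherwise back off to the bigram. The constant 0.4 stands in for the
    # normalising back-off weight alpha(ctx) of real Katz back-off.
    if bi[(ctx[1], w)] > 0:
        return 0.4 * bi[(ctx[1], w)] / uni[ctx[1]]
    # Final fallback: unigram.
    return 0.4 * 0.4 * uni[w] / len(corpus)

print(p_backoff("mat", ("on", "the")))   # 0.25: discounted trigram estimate
```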
Modified Kneser-Ney — the strongest baseline in the paper — is a refinement that adjusts lower-order counts to better reflect how words behave as continuations rather than raw frequency.
The Shared Limitation
Both approaches share the same fundamental generalisation mechanism: reduce context length until you hit something you’ve seen. Bengio’s phrase “gluing very short overlapping pieces” describes this precisely.
Neither model has any notion of word similarity. Having seen “the cat sat on the mat” a thousand times gives you no mechanism to assign high probability to “the dog sat on the mat.” You’d fall back to the unigram probability of “dog,” discarding all context. N-grams cope with the curse of dimensionality only by discarding information — the generalisation is purely structural, not semantic.
3. The Proposed Solution: A Qualitatively Different Generalisation
Bengio’s model attacks the curse differently. Instead of asking “have I seen this exact context before?”, it asks “have I seen similar contexts before?” — where similarity is defined geometrically in a learned embedding space.
The model learns simultaneously:
- A word embedding matrix $C$, a $|V| \times m$ matrix where row $C(i)$ is a real-valued feature vector for word $i$, placing every word as a point in $\mathbb{R}^m$
- A probability function — a neural network mapping a context of embedded words to a probability distribution over the next word
The full model computes:

$$P(w_t \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$$

where the unnormalised log-probabilities are:

$$y = b + Wx + U \tanh(d + Hx)$$

and $x = \big(C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1})\big)$ is the concatenation of context word embeddings.

The full parameter set is $\theta = (b, d, W, U, H, C)$, trained end-to-end by gradient ascent on the log-likelihood.
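The forward pass is only a few lines. A minimal NumPy sketch with arbitrary toy dimensions (these are not the paper's hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, n, h = 20, 5, 3, 8   # vocab size, embedding dim, n-gram order, hidden units

C = rng.normal(0, 0.1, (V, m))             # embedding matrix: row i = word i
H = rng.normal(0, 0.1, (h, (n - 1) * m))   # input-to-hidden weights
d = np.zeros(h)                            # hidden bias
U = rng.normal(0, 0.1, (V, h))             # hidden-to-output weights
W = rng.normal(0, 0.1, (V, (n - 1) * m))   # direct input-to-output connections
b = np.zeros(V)                            # output bias

def forward(context):
    """y = b + W x + U tanh(d + H x), then a softmax over y."""
    x = C[context].ravel()                  # concatenate context embeddings
    y = b + W @ x + U @ np.tanh(d + H @ x)
    e = np.exp(y - y.max())                 # numerically stable softmax
    return e / e.sum()

p = forward([4, 7])                         # hypothetical context word indices
assert np.isclose(p.sum(), 1.0) and (p > 0).all()
```

Note the direct connections `W`: the paper makes these optional, and they implement exactly the linear part discussed in the next section.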
4. Why Hidden Layers Matter
An important predecessor — Xu & Rudnicky (2000) — proposed using neural networks for language modelling but used networks without hidden units and a single input word. These two limitations are independent and worth separating:
Single input word → at most bigram statistics. If the network only sees $w_{t-1}$, it can only model $P(w_t \mid w_{t-1})$, regardless of architectural sophistication. This is a data starvation problem, not an architectural one.
No hidden units → only a linear input-output mapping. Without hidden units, the model is:

$$y = b + Wx$$

Log-probabilities are a linear function of the input. Even with multiple input words, contributions are purely additive: the model cannot represent interactions between context words. True trigram behaviour requires jointly conditioning on $(w_{t-2}, w_{t-1})$ as a pair, not just summing their independent contributions.
The tanh hidden layer in Bengio's model is what changes this. Because $Hx$ mixes across the entire concatenated context before the non-linearity is applied, the hidden units can represent conjunctive features, i.e. dependencies between context words, enabling genuine higher-order conditional behaviour.
|  | Single input word | Multiple input words |
|---|---|---|
| No hidden units | Unigram + bigram (linear) | Formally higher-order, but only additively |
| Hidden units | Capped at bigram by data | Genuine $n$-th order interactions |
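The additivity limitation can be checked numerically: for a purely linear map over concatenated embeddings, swapping context words between two contexts leaves the summed scores unchanged (there are no interaction terms), while a tanh hidden layer breaks this. Dimensions and random weights below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
V, m, h = 10, 4, 6
C = rng.normal(size=(V, m))        # toy embedding matrix
W = rng.normal(size=(V, 2 * m))    # linear model over a 2-word context
Hm = rng.normal(size=(h, 2 * m))   # hidden-layer weights
U = rng.normal(size=(V, h))

def linear(w1, w2):
    return W @ np.concatenate([C[w1], C[w2]])

def with_hidden(w1, w2):
    x = np.concatenate([C[w1], C[w2]])
    return U @ np.tanh(Hm @ x)

# Linear scores decompose per position, so cross-swapping contexts
# leaves the sums identical:
lhs = linear(1, 2) + linear(3, 4)
rhs = linear(1, 4) + linear(3, 2)
print(np.allclose(lhs, rhs))        # True: purely additive contributions

# The tanh layer couples the two positions, so the identity fails:
lhs_h = with_hidden(1, 2) + with_hidden(3, 4)
rhs_h = with_hidden(1, 4) + with_hidden(3, 2)
print(np.allclose(lhs_h, rhs_h))    # False: conjunctive features exist
```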
5. Joint Learning is Load-Bearing
The word embeddings and the prediction network are optimised simultaneously against the same log-likelihood objective. Gradients flow all the way back into $C$: the embedding matrix is updated at every step just like any other weight.
This matters because it creates a feedback loop: every prediction error reshapes both the probability function and the geometry of the embedding space. The map of words is continuously adjusted to make prediction easier; the predictor co-adapts to exploit the current map.
Bengio explicitly tried the alternative — initialising word features from LSI co-occurrence statistics and holding them fixed — and reports it as unsuccessful. LSI embeddings optimise for a different objective (document retrieval similarity) using a separate process (SVD). There is no feedback between the downstream task and the representation. The failed experiment is direct empirical evidence that the joint optimisation is what drives the gains: a map optimised separately for a different purpose doesn’t transfer cleanly.
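The feedback loop is visible in the update equations themselves. The sketch below (hidden layer omitted for brevity; all names and dimensions are hypothetical) performs one SGD step on the cross-entropy loss and updates the context word's embedding row exactly like any other weight:

```python
import numpy as np

rng = np.random.default_rng(2)
V, m = 8, 4
C = rng.normal(0, 0.1, (V, m))      # embedding matrix, a trainable parameter
Wout = rng.normal(0, 0.1, (V, m))   # output weights (hidden layer omitted)
lr = 0.1

def step(prev_word, next_word):
    """One SGD step on -log P(next | prev). Note the update to C itself."""
    x = C[prev_word].copy()                  # copy: C[prev_word] is a view
    y = Wout @ x
    p = np.exp(y - y.max()); p /= p.sum()    # softmax
    dy = p.copy(); dy[next_word] -= 1.0      # d(-log p[next]) / dy
    # The gradient flows through x back into the embedding row:
    C[prev_word] -= lr * (Wout.T @ dy)
    Wout[...] -= lr * np.outer(dy, x)

before = C[3].copy()
step(prev_word=3, next_word=5)
assert not np.allclose(before, C[3])         # the embedding row moved
```

Freezing `C` (as in the LSI-initialisation experiment) amounts to deleting the `C[prev_word] -= ...` line: the predictor can still adapt, but the map of words cannot.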
6. Generalisation Emerges, It Isn’t Designed
The model is never told that "dog" and "cat" are similar. Their representations converge geometrically because they appear in similar predictive contexts, and therefore receive similar gradient signals during training. Every training sentence containing "cat" nudges its embedding in directions that make that sentence more probable; sentences containing "dog" nudge its embedding in similar directions. The geometry of the embedding space is a consequence of optimising for prediction.
Because the probability function is a smooth function of the embeddings, nearby points in embedding space produce nearby output distributions. Generalisation falls out naturally: a sentence never seen in training gets high probability if it is composed of words that are geometrically close to words from seen sentences.
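This smoothness claim can be checked directly: a small perturbation of an embedding produces only a small change in the output distribution. A toy check with arbitrary weights (a single embedded context word, hypothetical dimensions):

```python
import numpy as np

rng = np.random.default_rng(3)
V, m, h = 12, 4, 6
C = rng.normal(0, 0.5, (V, m))
H = rng.normal(0, 0.5, (h, m))
U = rng.normal(0, 0.5, (V, h))

def next_word_dist(x):
    """Softmax over U tanh(H x) for a single embedded context word."""
    y = U @ np.tanh(H @ x)
    e = np.exp(y - y.max())
    return e / e.sum()

p_word = next_word_dist(C[0])
p_near = next_word_dist(C[0] + 0.01 * rng.normal(size=m))  # a nearby point

# Total-variation distance between the two output distributions is tiny:
tv = 0.5 * np.abs(p_word - p_near).sum()
assert tv < 0.1
```

Because nearby embeddings yield nearby distributions, a word that drifts close to "cat" in the space automatically inherits cat-like predictions, which is the generalisation mechanism in miniature.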
This is the mechanism by which each training sentence informs the model about an exponential number of neighbouring sentences — the paper’s central claim against the curse of dimensionality.
7. The Distinction from Prior Neural Approaches
The model’s goal is explicitly a statistical model of word sequence distributions, not a model of linguistic structure. Bengio distances the work from Miikkulainen & Dyer (1991), who used neural networks to learn word roles in sentence structure.
But this doesn’t mean the model avoids linguistic structure — it learns a proxy for it implicitly. Semantic and syntactic regularities are what’s statistically predictive, so they’re what gets encoded in the embedding space. The distinction is between:
- Explicit, interpretable linguistic representations (syntactic trees, word roles)
- Distributed, implicit representations optimised purely for predictive accuracy
Bengio’s model does the latter. Structure in the embedding space is a by-product of the objective, not a design target.
8. Results
On the Brown corpus, the best neural network configuration achieved approximately 24% lower perplexity than the best n-gram baseline (class-based Kneser-Ney). On the larger AP News corpus the improvement was approximately 8%. The model also demonstrated an ability to exploit longer contexts — performance continued improving as context window size increased, whereas n-gram models plateaued.
Mixing neural network output probabilities with interpolated trigram probabilities consistently reduced perplexity further, suggesting the two models make errors in qualitatively different places — consistent with their fundamentally different generalisation mechanisms.
Key Takeaways
- The curse of dimensionality is the root problem. The space of possible word sequences is combinatorially explosive. All of statistical language modelling is a response to this.
- N-grams cope by discarding information. Truncating context and counting frequencies is a non-parametric, local generalisation strategy with no notion of word similarity.
- The proposed solution is a qualitatively different kind of generalisation. Not "reduce context until seen" but "find similar contexts via geometry in embedding space."
- Hidden layers are what make higher-order context usable. Without them, even multi-word inputs produce only linear, additive combinations, not genuine conjunctive dependencies.
- Joint learning is load-bearing, not incidental. The feedback loop between prediction errors and embedding geometry is what drives the gains. Fixed pre-computed representations don't transfer.
- Generalisation emerges from optimisation. Word similarity in the embedding space is a consequence of predicting well, not an explicit design target. Linguistic structure is incidentally encoded.