How one paper reshaped AI: Transformers
A transcript-driven, interactive walkthrough of the Transformer architecture: why it replaced RNNs/LSTMs and how attention lets tokens communicate.
ML mapping
The goal of machine learning is to learn a mapping from inputs to outputs. Neural networks learn this by stacking layers that transform representations.
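As a minimal sketch (the layer sizes and ReLU activation are arbitrary choices for illustration), a “stack of layers” is just a chain of learned transformations applied to a representation:

```python
import numpy as np

# A tiny 2-layer network: each layer transforms the representation.
# Sizes (4 -> 8 -> 2) are made up, just for illustration.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

def forward(x):
    h = np.maximum(0, x @ W1 + b1)   # layer 1: linear map + ReLU
    return h @ W2 + b2               # layer 2: linear map to the output

x = rng.normal(size=(1, 4))          # one input with 4 features
print(forward(x))                    # the input-to-output mapping, once W1, W2 are trained
```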
Why sequences are hard
For language, each token depends on context. If tokens are processed independently, meaning gets lost. RNNs/LSTMs tried to carry context with a moving memory, but training stayed sequential and long-range information faded.
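A minimal sketch of that sequential bottleneck, with made-up sizes and a plain tanh cell rather than a real LSTM: each step needs the previous hidden state, so the loop cannot be parallelized across the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8                      # illustrative sizes
W_x = rng.normal(size=(d_in, d_h))
W_h = rng.normal(size=(d_h, d_h))

tokens = rng.normal(size=(10, d_in))  # a sequence of 10 token vectors
h = np.zeros(d_h)                     # the "moving memory"

# Each step depends on the previous hidden state, so the loop is inherently
# sequential, and information from early tokens must survive many updates.
for x_t in tokens:
    h = np.tanh(x_t @ W_x + h @ W_h)

print(h)  # the only summary of the whole sequence that later layers see
```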
The transformer idea
Transformers keep the “stacked layers” idea, but add attention — a communication layer where every token can look at every other token and decide what matters.
Transformer Block (repeated N times)
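A minimal numpy sketch of that block, with layer norm and multiple heads omitted and made-up sizes; a real model also uses separate parameters for each block.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how much each token attends to each other token
    return softmax(scores) @ V

def transformer_block(X, params):
    Wq, Wk, Wv, W1, W2 = params
    X = X + self_attention(X, Wq, Wk, Wv)     # communication: tokens exchange information
    X = X + np.maximum(0, X @ W1) @ W2        # computation: per-token feed-forward
    return X                                  # (layer norm omitted for brevity)

rng = np.random.default_rng(0)
d = 16                                        # illustrative model width
make = lambda *s: rng.normal(size=s) * 0.1
params = [make(d, d), make(d, d), make(d, d), make(d, 4 * d), make(4 * d, d)]

X = rng.normal(size=(5, d))                   # 5 tokens
for _ in range(3):                            # "repeated N times" (here N = 3)
    X = transformer_block(X, params)          # params shared here only to keep the sketch short
```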
Encoder / decoder (classic Transformer)
The original 2017 Transformer uses an encoder-decoder. The encoder builds representations of the input. The decoder generates output tokens, using both its own history and the encoder’s output.
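A rough sketch of that data flow, with projections, causal masking, and layer norm omitted and arbitrary shapes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
d = 8
src = rng.normal(size=(6, d))        # input tokens (e.g. a source sentence)
tgt = rng.normal(size=(3, d))        # output tokens generated so far

# Encoder: the input tokens attend to each other.
memory = attention(src, src, src)

# Decoder: each output token attends to its own history (self-attention) ...
dec = attention(tgt, tgt, tgt)       # (causal masking omitted; see the variants section)
# ... and to the encoder's output (cross-attention: queries come from the
# decoder, keys and values from the encoder's representations).
dec = attention(dec, memory, memory)
```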
Where do BERT and GPT fit?
BERT keeps only the encoder stack and reads text bidirectionally to build representations for understanding tasks; GPT keeps only the decoder stack and generates text left to right.
Worked example: “it” needs context
In attention, a token can look at all other tokens and decide what’s relevant. Try a toy example (not a real model):
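Here is a minimal version of such a toy, where the relevance scores for the query token “it” are hand-picked rather than learned, just to show the mechanics:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

tokens = ["the", "animal", "did", "not", "cross", "the", "street", "because", "it", "was", "tired"]

# Hand-picked (not learned) relevance scores for the query token "it":
# a higher score means "it" treats that token as more relevant.
scores = np.array([0.1, 4.0, 0.1, 0.1, 0.5, 0.1, 1.5, 0.2, 0.0, 0.1, 0.8])

weights = softmax(scores)
for tok, w in sorted(zip(tokens, weights), key=lambda p: -p[1])[:3]:
    print(f"{tok:8s} {w:.2f}")   # "animal" gets the largest share of attention
```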
Data flow (text → representations)
Text becomes tokens, tokens become vectors (embeddings), and we add positional information so the model knows order.
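A sketch of that pipeline with a toy three-word vocabulary and a random stand-in for the learned embedding table; the positional encoding is the sinusoidal scheme from the original paper.

```python
import numpy as np

text = "the cat sat"
vocab = {"the": 0, "cat": 1, "sat": 2}          # toy vocabulary for illustration
ids = [vocab[w] for w in text.split()]          # text -> token ids

rng = np.random.default_rng(0)
d_model = 8
embed = rng.normal(size=(len(vocab), d_model))  # stand-in for a learned embedding table
X = embed[ids]                                  # token ids -> vectors, shape (3, 8)

# Sinusoidal positional encoding (as in the original paper): a different
# wave pattern per dimension lets the model tell positions apart.
pos = np.arange(len(ids))[:, None]
i = np.arange(d_model)[None, :]
angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
PE = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

X = X + PE                                      # order information is now baked in
```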
Why positional info matters
Attention by itself is order-blind: it treats the input as a set of tokens, so without positional information “dog bites man” and “man bites dog” would look identical to the model.
Zoom-in: Attention (Q, K, V)
Each token creates a Query (what I’m looking for), a Key (what I match on), and a Value (what I share).
All of these are computed in parallel by stacking every token’s Q, K, and V into matrices, which is why training is fast compared to a step-by-step RNN.
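Scaled dot-product attention in a few lines of numpy, with illustrative sizes and random (untrained) projection matrices:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 8              # illustrative sizes
X = rng.normal(size=(n_tokens, d_model))       # one vector per token

# Learned projections (random here): every token gets a query, a key and a value.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv               # all tokens projected at once, no loop

scores = Q @ K.T / np.sqrt(d_k)                # how well each query matches each key
weights = softmax(scores)                      # rows sum to 1: who attends to whom
out = weights @ V                              # each token's output: a weighted mix of values
```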
Attention formula (intuition)
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V. Match each query against every key, turn the scores into weights that sum to 1, and use those weights to blend the values. Dividing by √dₖ keeps the scores in a range where the softmax does not saturate.
How it learns (why the weights mean something)
At the start, the parameters are random, so the attention weights are meaningless noise. During training, gradient descent adjusts the Q, K, V projections, and useful patterns emerge (like pronouns attending to the nouns they refer to).
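A toy PyTorch sketch of that learning loop: the target here is a made-up regression objective rather than a real language-modeling loss, but it shows how gradients reach the Q/K/V projections, so the attention pattern itself is what gets learned.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

# Random projections at initialization: the attention pattern is meaningless noise.
q_proj = nn.Linear(d_model, d_model)
k_proj = nn.Linear(d_model, d_model)
v_proj = nn.Linear(d_model, d_model)
opt = torch.optim.SGD(
    list(q_proj.parameters()) + list(k_proj.parameters()) + list(v_proj.parameters()),
    lr=0.1,
)

x = torch.randn(5, d_model)                    # 5 token vectors (toy input)
target = torch.randn(5, d_model)               # toy training target, not a real task

for step in range(100):
    Q, K, V = q_proj(x), k_proj(x), v_proj(x)
    weights = torch.softmax(Q @ K.T / d_model ** 0.5, dim=-1)
    out = weights @ V
    loss = ((out - target) ** 2).mean()        # any task loss would do
    opt.zero_grad()
    loss.backward()                            # gradients flow back into the Q/K/V projections
    opt.step()                                 # ... so the attention pattern itself is learned
```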
Variants (masked, multi-head, cross)
Variants enforce causality (masked attention: a token can only look at earlier tokens), combine several attention “views” in parallel (multi-head), or mix in information from another sequence (cross-attention, e.g. the decoder attending to the encoder’s output).
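A compact sketch of the first two variants (cross-attention was sketched in the encoder/decoder section); per-head projections are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # blocked positions get ~zero weight
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))

# Masked (causal) attention: token i may only look at tokens 0..i.
causal = np.tril(np.ones((n, n), dtype=bool))
masked_out = attention(X, X, X, mask=causal)

# Multi-head attention: run several smaller attentions ("views") in parallel
# and concatenate the results (here, 2 heads that each see half the dimensions).
halves = np.split(X, 2, axis=-1)
multi_head = np.concatenate([attention(h, h, h) for h in halves], axis=-1)
```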
Why it generalized beyond text
If you can represent data as a sequence of elements that should “talk” to each other, transformers shine.