🧠 How one paper reshaped AI: Transformers

A transcript-driven, interactive walkthrough of the Transformer architecture: why it replaced RNNs/LSTMs and how attention lets tokens communicate.

If you remember one thing
A transformer is a network that lets its inputs talk to each other. It’s not magic — it’s communication.
Interactive walkthrough

The goal of machine learning is to learn a mapping from inputs to outputs. Neural networks learn this by stacking layers that transform representations.
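As a rough picture of "stacked layers", here is a minimal NumPy sketch with made-up sizes and random weights (not any particular model): each layer just transforms the current representation and passes it on.

```python
import numpy as np

def layer(x, W, b):
    # One layer: a linear map followed by a nonlinearity (ReLU).
    return np.maximum(0, x @ W + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                     # an input representation (toy size)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

h = layer(x, W1, b1)                          # first transformation
y = h @ W2 + b2                               # final mapping to the output
print(y.shape)                                # (2,)
```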

For language, each token depends on context. If tokens are processed independently, meaning gets lost. RNNs/LSTMs tried to carry context with a moving memory, but training stayed sequential and long-range information faded.
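To make the "moving memory" concrete, here is a minimal sketch of a vanilla RNN step (toy sizes, random weights; LSTMs add gating on top of the same idea). Each step needs the previous hidden state, which is what forces sequential processing and lets long-range information fade.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W_x = rng.normal(size=(d_in, d_h))            # input-to-hidden weights
W_h = rng.normal(size=(d_h, d_h))             # hidden-to-hidden weights

tokens = rng.normal(size=(6, d_in))           # a sequence of 6 token vectors
h = np.zeros(d_h)                             # the "moving memory"
for x in tokens:
    # Each step depends on the previous h, so steps cannot run in parallel,
    # and early tokens only reach later ones through repeated squashing.
    h = np.tanh(x @ W_x + h @ W_h)
print(h.shape)                                # (8,)
```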

Transformers keep the “stacked layers” idea, but add attention — a communication layer where every token can look at every other token and decide what matters.

The original 2017 Transformer uses an encoder-decoder. The encoder builds representations of the input. The decoder generates output tokens, using both its own history and the encoder’s output.
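The data flow, stripped down to a minimal single-head NumPy sketch (random placeholder weights; no residuals, LayerNorm, MLPs, or masking), looks roughly like this:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                         # toy model width

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(queries_from, keys_values_from):
    # Queries come from one sequence, keys/values from another (or the same one).
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = queries_from @ Wq, keys_values_from @ Wk, keys_values_from @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

src = rng.normal(size=(5, d))                 # 5 input tokens (already embedded)
tgt = rng.normal(size=(3, d))                 # 3 output tokens generated so far

memory = attend(src, src)                     # encoder: input tokens talk to each other
hidden = attend(tgt, tgt)                     # decoder: attends to its own history
out = attend(hidden, memory)                  # decoder: looks at the encoder's output
print(out.shape)                              # (3, 8): one vector per output token
```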

Where do BERT and GPT fit?
Encoder-only models (BERT-style) are great at understanding and classification. Decoder-only models (GPT-style) are great at generation.

In attention, a token can look at all other tokens and decide what’s relevant. Try a toy example (not a real model):

[Interactive demo: choose a focus token in a short sentence and see its attention weights over the other tokens. In real transformers, these weights are learned during training.]
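Since the interactive demo doesn't carry over to text, here is the same idea as a minimal sketch: hand-picked (not learned) 2-D vectors for a short sentence, with the focus token's attention weights computed as a softmax over dot-product similarities.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy, hand-picked vectors (a real model learns these during training).
tokens = ["the", "cat", "sat", "because", "it", "was", "tired"]
vecs = np.array([
    [0.1, 0.0],   # the
    [1.0, 0.2],   # cat
    [0.0, 1.0],   # sat
    [0.1, 0.1],   # because
    [0.9, 0.3],   # it  (made similar to "cat" on purpose)
    [0.1, 0.2],   # was
    [0.2, 0.8],   # tired
])

focus = tokens.index("it")
scores = vecs @ vecs[focus]                   # similarity of each token to the focus token
weights = softmax(scores)                     # attention weights over the whole sentence
for tok, w in zip(tokens, weights):
    print(f"{tok:>8}: {w:.2f}")
```

With these toy numbers, "it" ends up weighting "cat" (and itself) most heavily, which is exactly the kind of pattern a trained model discovers on its own.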

Text becomes tokens, tokens become vectors (embeddings), and we add positional information so the model knows order.

Why positional info matters
Attention doesn’t know order by default. Positional signals make “Jake learned AI” different from “AI learned Jake”.
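One concrete way to add that signal is the sinusoidal scheme from the original paper: a position-dependent vector is added to each token embedding. A minimal sketch with a toy vocabulary and a random embedding table:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"Jake": 0, "learned": 1, "AI": 2}    # toy vocabulary
d = 8                                         # embedding width (toy size)
emb = rng.normal(size=(len(vocab), d))        # embedding table (random stand-in for learned weights)

def positional_encoding(num_positions, d):
    # Sinusoidal positions: even dims get sin, odd dims get cos, at varying frequencies.
    pos = np.arange(num_positions)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

tokens = ["Jake", "learned", "AI"]
x = emb[[vocab[t] for t in tokens]]           # token vectors alone are order-blind
x = x + positional_encoding(len(tokens), d)   # now "Jake learned AI" != "AI learned Jake"
print(x.shape)                                # (3, 8)
```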

Each token creates a Query (what I’m looking for), a Key (what I match on), and a Value (what I share).

Attention formula (intuition)
$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$ — match queries with keys, then blend values using the resulting weights.
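The formula translates almost line for line into code. A minimal sketch (single head, toy sizes, random projection matrices standing in for learned weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    Q = X @ Wq                                # what each token is looking for
    K = X @ Wk                                # what each token matches on
    V = X @ Wv                                # what each token shares
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k)) # match queries with keys
    return weights @ V                        # blend values using those weights

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
X = rng.normal(size=(5, d_model))             # 5 token vectors (embeddings + positions)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)
print(out.shape)                              # (5, 4): one updated vector per token
```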

At the start, parameters are random, so attention is meaningless. During training, the model learns patterns (like pronouns looking to relevant nouns).

Variations enforce causality (masked), combine multiple attention “views” (multi-head), or mix information from another sequence (cross-attention).
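For instance, masked (causal) attention is the plain formula plus one extra step: scores for future positions are set to negative infinity before the softmax, so each token can only blend values from itself and earlier tokens. A minimal sketch with random toy weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    n = scores.shape[0]
    # Causal mask: token i may not look at tokens j > i (the future).
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 6
X = rng.normal(size=(4, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = masked_attention(X, Wq, Wk, Wv)
print(out.shape)                              # (4, 6)
```

Multi-head attention runs several such computations in parallel with different weight matrices and concatenates the results; cross-attention simply takes Q from one sequence and K, V from another, as in the encoder-decoder sketch earlier.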