🧠 How one paper reshaped AI: Transformers

A transcript-driven, interactive walkthrough of the Transformer architecture: why it replaced RNNs/LSTMs and how attention lets tokens communicate.

If you remember one thing
A transformer is a network that lets its inputs talk to each other. It’s not magic — it’s communication.
Interactive walkthrough

The goal of machine learning is to learn a mapping from inputs to outputs. Neural networks learn this by stacking layers that transform representations.
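As a rough picture of "stacked layers", here is a minimal NumPy sketch with made-up sizes and random weights (not any particular model): each layer just transforms the current representation and passes it on.

```python
import numpy as np

def layer(x, W, b):
    # One layer: a linear map followed by a nonlinearity (ReLU).
    return np.maximum(0, x @ W + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                     # an input representation (toy size)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

h = layer(x, W1, b1)                          # first transformation
y = h @ W2 + b2                               # final mapping to the output
print(y.shape)                                # (2,)
```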

For language, each token depends on context. If tokens are processed independently, meaning gets lost. RNNs/LSTMs tried to carry context with a moving memory, but training stayed sequential and long-range information faded.
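To make the "moving memory" concrete, here is a minimal sketch of a vanilla RNN step (toy sizes, random weights; LSTMs add gating on top of the same idea). Each step needs the previous hidden state, which is what forces sequential processing and lets long-range information fade.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W_x = rng.normal(size=(d_in, d_h))            # input-to-hidden weights
W_h = rng.normal(size=(d_h, d_h))             # hidden-to-hidden weights

tokens = rng.normal(size=(6, d_in))           # a sequence of 6 token vectors
h = np.zeros(d_h)                             # the "moving memory"
for x in tokens:
    # Each step depends on the previous h, so steps cannot run in parallel,
    # and early tokens only reach later ones through repeated squashing.
    h = np.tanh(x @ W_x + h @ W_h)
print(h.shape)                                # (8,)
```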

Transformers keep the “stacked layers” idea, but add attention — a communication layer where every token can look at every other token and decide what matters.

The original 2017 Transformer uses an encoder-decoder. The encoder builds representations of the input. The decoder generates output tokens, using both its own history and the encoder’s output.
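The data flow, stripped down to a minimal single-head NumPy sketch (random placeholder weights; no residuals, LayerNorm, MLPs, or masking), looks roughly like this:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                         # toy model width

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(queries_from, keys_values_from):
    # Queries come from one sequence, keys/values from another (or the same one).
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = queries_from @ Wq, keys_values_from @ Wk, keys_values_from @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

src = rng.normal(size=(5, d))                 # 5 input tokens (already embedded)
tgt = rng.normal(size=(3, d))                 # 3 output tokens generated so far

memory = attend(src, src)                     # encoder: input tokens talk to each other
hidden = attend(tgt, tgt)                     # decoder: attends to its own history
out = attend(hidden, memory)                  # decoder: looks at the encoder's output
print(out.shape)                              # (3, 8): one vector per output token
```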

Where do BERT and GPT fit?
Encoder-only models (BERT-style) are great at understanding and classification. Decoder-only models (GPT-style) are great at generation.

In attention, a token can look at all other tokens and decide what’s relevant. Try a toy example (not a real model):

[Interactive demo: choose a focus token in a short sentence and see its attention weights over the other tokens. In real transformers, these weights are learned during training.]
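Since the interactive demo doesn't carry over to text, here is the same idea as a minimal sketch: hand-picked (not learned) 2-D vectors for a short sentence, with the focus token's attention weights computed as a softmax over dot-product similarities.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy, hand-picked vectors (a real model learns these during training).
tokens = ["the", "cat", "sat", "because", "it", "was", "tired"]
vecs = np.array([
    [0.1, 0.0],   # the
    [1.0, 0.2],   # cat
    [0.0, 1.0],   # sat
    [0.1, 0.1],   # because
    [0.9, 0.3],   # it  (made similar to "cat" on purpose)
    [0.1, 0.2],   # was
    [0.2, 0.8],   # tired
])

focus = tokens.index("it")
scores = vecs @ vecs[focus]                   # similarity of each token to the focus token
weights = softmax(scores)                     # attention weights over the whole sentence
for tok, w in zip(tokens, weights):
    print(f"{tok:>8}: {w:.2f}")
```

With these toy numbers, "it" ends up weighting "cat" (and itself) most heavily, which is exactly the kind of pattern a trained model discovers on its own.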

Text becomes tokens, tokens become vectors (embeddings), and we add positional information so the model knows order.

Why positional info matters
Attention doesn’t know order by default. Positional signals make “Jake learned AI” different from “AI learned Jake”.
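One concrete way to add that signal is the sinusoidal scheme from the original paper: a position-dependent vector is added to each token embedding. A minimal sketch with a toy vocabulary and a random embedding table:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"Jake": 0, "learned": 1, "AI": 2}    # toy vocabulary
d = 8                                         # embedding width (toy size)
emb = rng.normal(size=(len(vocab), d))        # embedding table (random stand-in for learned weights)

def positional_encoding(num_positions, d):
    # Sinusoidal positions: even dims get sin, odd dims get cos, at varying frequencies.
    pos = np.arange(num_positions)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

tokens = ["Jake", "learned", "AI"]
x = emb[[vocab[t] for t in tokens]]           # token vectors alone are order-blind
x = x + positional_encoding(len(tokens), d)   # now "Jake learned AI" != "AI learned Jake"
print(x.shape)                                # (3, 8)
```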

Each token creates a Query (what I’m looking for), a Key (what I match on), and a Value (what I share).

Attention formula (intuition)
$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$ — match queries with keys, then blend values using the resulting weights.
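The formula translates almost line for line into code. A minimal sketch (single head, toy sizes, random projection matrices standing in for learned weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, Wq, Wk, Wv):
    Q = X @ Wq                                # what each token is looking for
    K = X @ Wk                                # what each token matches on
    V = X @ Wv                                # what each token shares
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k)) # match queries with keys
    return weights @ V                        # blend values using those weights

rng = np.random.default_rng(0)
d_model, d_k = 8, 4
X = rng.normal(size=(5, d_model))             # 5 token vectors (embeddings + positions)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = attention(X, Wq, Wk, Wv)
print(out.shape)                              # (5, 4): one updated vector per token
```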

At the start, parameters are random, so attention is meaningless. During training, the model learns patterns (like pronouns looking to relevant nouns).

Variations enforce causality (masked), combine multiple attention “views” (multi-head), or mix information from another sequence (cross-attention).
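For instance, masked (causal) attention is the plain formula plus one extra step: scores for future positions are set to negative infinity before the softmax, so each token can only blend values from itself and earlier tokens. A minimal sketch with random toy weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    n = scores.shape[0]
    # Causal mask: token i may not look at tokens j > i (the future).
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores[mask] = -np.inf
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 6
X = rng.normal(size=(4, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = masked_attention(X, Wq, Wk, Wv)
print(out.shape)                              # (4, 6)
```

Multi-head attention runs several such computations in parallel with different weight matrices and concatenates the results; cross-attention simply takes Q from one sequence and K, V from another, as in the encoder-decoder sketch earlier.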