🧩

Embeddings: turning words into meaning-carrying vectors

Why simple token IDs fail, how one-hot and bag-of-words work (and break), and how dense embeddings + positional encoding prepare inputs for attention.

If you remember one thing
An embedding is a dense vector where “nearby” vectors mean “similar” words.

The first rule of training a machine learning model: convert the input into numbers.

A simple idea is to assign each unique word a unique number (a token ID). But token IDs create fake “distances”: if bad is 22 and great is 21, the model may treat them as similar just because the numbers are close, even though their meanings are opposite.
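A tiny sketch with made-up token IDs makes the problem concrete: the numeric gap between IDs says nothing about meaning.

```python
# Hypothetical vocabulary with arbitrary token IDs.
vocab = {"great": 21, "bad": 22, "dusty": 23, "good": 87}

# The numeric "distance" between IDs is meaningless:
print(abs(vocab["great"] - vocab["bad"]))   # 1  -> looks "close", but meanings are opposite
print(abs(vocab["great"] - vocab["good"]))  # 66 -> looks "far", but meanings are similar
```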

One-hot represents each word as a vector of length |V| (vocabulary size). Only one position is 1; all others are 0.
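A minimal sketch (toy five-word vocabulary, NumPy for the vectors) shows both properties: a single 1 per vector, and zero measurable similarity between any two words.

```python
import numpy as np

# Toy vocabulary; real vocabularies have tens of thousands of entries.
vocab = ["good", "great", "bad", "book", "dusty"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    vec = np.zeros(len(vocab))       # vector of length |V|
    vec[word_to_id[word]] = 1.0      # single 1 at the word's index
    return vec

print(one_hot("great"))                    # [0. 1. 0. 0. 0.]
# Any two different one-hot vectors have dot product 0, so no similarity is encoded:
print(one_hot("good") @ one_hot("great"))  # 0.0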

Why one-hot is limited
It’s sparse (50k vocab ⇒ 50k dimensions), expensive, and it can’t represent similarity (good vs great) because every one-hot vector is orthogonal to every other: each word is equally “far” from all the rest.

Bag-of-words counts occurrences of words in a sentence, ignoring word order. Unigram uses single words; bigram/trigram add short context by counting word pairs/triples.
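A quick sketch with Python’s Counter (toy sentence) shows unigram and bigram counting side by side.

```python
from collections import Counter

# Toy sentence; counting ignores word order beyond the n-gram window.
tokens = "the old book on the old shelf".split()

unigrams = Counter(tokens)                  # single-word counts
bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent word-pair counts

print(unigrams)  # Counter({'the': 2, 'old': 2, 'book': 1, 'on': 1, 'shelf': 1})
print(bigrams)   # Counter({('the', 'old'): 2, ('old', 'book'): 1, ...})
```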

Why n-grams don’t scale
The feature space explodes and stays sparse, and the model only sees a tiny local window (e.g., trigram sees only 3 words).

To predict the next token, a model needs semantic understanding (similar words should get similar representations) and context (what “old” describes; what “dusty” refers to). This is why embeddings and attention matter.

An embedding is a dense vector in a continuous space. Similar words have vectors that are close. The direction between vectors can encode relationships.
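A minimal sketch with made-up 4-dimensional vectors illustrates the geometry; real embeddings have hundreds of dimensions, but cosine similarity works the same way.

```python
import numpy as np

# Made-up 4-dimensional embeddings purely for illustration.
emb = {
    "good":  np.array([ 0.9, 0.1, 0.3, 0.0]),
    "great": np.array([ 0.8, 0.2, 0.4, 0.1]),
    "bad":   np.array([-0.7, 0.1, 0.2, 0.0]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: 1 means same direction, 0 unrelated, negative opposed.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["good"], emb["great"]))  # close to 1: similar meaning
print(cosine(emb["good"], emb["bad"]))    # much lower: different meaning
```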

What’s inside an embedding dimension?
We don’t hand-design the “features”. They emerge from training data and the learning objective.

Word2Vec trains embeddings by predicting a word from its surrounding context (CBOW) or predicting the context words from a center word (Skip-gram). After training, the learned weight matrix acts like an embedding lookup table.
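A small sketch of how Skip-gram builds its training pairs (toy sentence; the window size of 2 is an arbitrary choice here):

```python
# Skip-gram: each word is used to predict its neighbors within a context window.
tokens = "the old book sat on the dusty shelf".split()
window = 2

pairs = []
for i, center in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((center, tokens[j]))  # (input word, context word to predict)

print(pairs[:4])  # [('the', 'old'), ('the', 'book'), ('old', 'the'), ('old', 'book')]
# CBOW flips the direction: the surrounding context words predict the center word.
```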

Transformers use a trainable embedding layer: token IDs select rows from an embedding matrix (a learned lookup table). These embeddings are trained jointly with the whole model.
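A minimal PyTorch sketch (vocabulary size and embedding dimension below are arbitrary): the layer is just a |V| × d matrix, and a token ID selects one of its rows.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512
embedding = nn.Embedding(vocab_size, d_model)  # learned lookup table, trained with the model

token_ids = torch.tensor([[21, 22, 87]])       # batch of 1 sequence with 3 token IDs
vectors = embedding(token_ids)                 # row lookup per ID
print(vectors.shape)                           # torch.Size([1, 3, 512])
```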

Because Transformers process tokens in parallel, we must add position information explicitly. We add a positional vector to each token embedding: it has the same shape as the embedding, but its values change with the token’s position.
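A minimal sketch of the sinusoidal scheme from the original Transformer paper (learned position embeddings are a common alternative); the sizes below are arbitrary.

```python
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # Sinusoidal positional encoding: sin on even dimensions, cos on odd ones.
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices
    angles = pos / (10_000 ** (i / d_model))                        # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Positions are simply added element-wise to the token embeddings.
token_embeddings = torch.randn(3, 512)                   # 3 tokens, 512 dimensions
inputs_to_attention = token_embeddings + positional_encoding(3, 512)
print(inputs_to_attention.shape)                         # torch.Size([3, 512])
```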