🧩

Embeddings: turning words into meaning-carrying vectors

Why simple token IDs fail, how one-hot and bag-of-words work (and break), and how dense embeddings + positional encoding prepare inputs for attention.

If you remember one thing
An embedding is a dense vector where “nearby” vectors mean “similar” words.

The first rule of training a machine learning model: convert the input into numbers.

A simple idea is to assign each unique word a unique number (a token ID). But token IDs create fake “distances”: if bad is 22 and great is 21, the model may treat them as similar just because the numbers are close, even though their meanings are opposite.
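A tiny sketch with made-up token IDs makes the problem concrete: the numeric gap between IDs says nothing about meaning.

```python
# Hypothetical vocabulary with arbitrary token IDs.
vocab = {"great": 21, "bad": 22, "dusty": 23, "good": 87}

# The numeric "distance" between IDs is meaningless:
print(abs(vocab["great"] - vocab["bad"]))   # 1  -> looks "close", but meanings are opposite
print(abs(vocab["great"] - vocab["good"]))  # 66 -> looks "far", but meanings are similar
```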

One-hot represents each word as a vector of length |V| (vocabulary size). Only one position is 1; all others are 0.
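A minimal sketch (toy five-word vocabulary, NumPy for the vectors) shows both properties: a single 1 per vector, and zero measurable similarity between any two words.

```python
import numpy as np

# Toy vocabulary; real vocabularies have tens of thousands of entries.
vocab = ["good", "great", "bad", "book", "dusty"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    vec = np.zeros(len(vocab))       # vector of length |V|
    vec[word_to_id[word]] = 1.0      # single 1 at the word's index
    return vec

print(one_hot("great"))                    # [0. 1. 0. 0. 0.]
# Any two different one-hot vectors have dot product 0, so no similarity is encoded:
print(one_hot("good") @ one_hot("great"))  # 0.0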

Why one-hot is limited
It’s sparse (50k vocab ⇒ 50k dimensions), expensive, and it can’t represent similarity (good vs great) because every one-hot vector is orthogonal to every other: each word is equally “far” from all the rest.

Bag-of-words counts occurrences of words in a sentence, ignoring word order. Unigram uses single words; bigram/trigram add short context by counting word pairs/triples.
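A quick sketch with Python’s Counter (toy sentence) shows unigram and bigram counting side by side.

```python
from collections import Counter

# Toy sentence; counting ignores word order beyond the n-gram window.
tokens = "the old book on the old shelf".split()

unigrams = Counter(tokens)                  # single-word counts
bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent word-pair counts

print(unigrams)  # Counter({'the': 2, 'old': 2, 'book': 1, 'on': 1, 'shelf': 1})
print(bigrams)   # Counter({('the', 'old'): 2, ('old', 'book'): 1, ...})
```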

Why n-grams don’t scale
The feature space explodes and stays sparse, and the model only sees a tiny local window (e.g., trigram sees only 3 words).

To predict the next token, a model needs semantic understanding (similar words should get similar representations) and context (what “old” describes; what “dusty” refers to). This is why embeddings and attention matter.

An embedding is a dense vector in a continuous space. Similar words have vectors that are close. The direction between vectors can encode relationships.
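A minimal sketch with made-up 4-dimensional vectors illustrates the geometry; real embeddings have hundreds of dimensions, but cosine similarity works the same way.

```python
import numpy as np

# Made-up 4-dimensional embeddings purely for illustration.
emb = {
    "good":  np.array([ 0.9, 0.1, 0.3, 0.0]),
    "great": np.array([ 0.8, 0.2, 0.4, 0.1]),
    "bad":   np.array([-0.7, 0.1, 0.2, 0.0]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: 1 means same direction, 0 unrelated, negative opposed.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["good"], emb["great"]))  # close to 1: similar meaning
print(cosine(emb["good"], emb["bad"]))    # much lower: different meaning
```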

What’s inside an embedding dimension?
We don’t hand-design the “features”. They emerge from training data and the learning objective.

Word2Vec trains embeddings by predicting a word from its surrounding context (CBOW) or predicting the context words from a center word (Skip-gram). After training, the learned weight matrix acts like an embedding lookup table.
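A small sketch of how Skip-gram builds its training pairs (toy sentence; the window size of 2 is an arbitrary choice here):

```python
# Skip-gram: each word is used to predict its neighbors within a context window.
tokens = "the old book sat on the dusty shelf".split()
window = 2

pairs = []
for i, center in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((center, tokens[j]))  # (input word, context word to predict)

print(pairs[:4])  # [('the', 'old'), ('the', 'book'), ('old', 'the'), ('old', 'book')]
# CBOW flips the direction: the surrounding context words predict the center word.
```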

Transformers use a trainable embedding layer: token IDs select rows from an embedding matrix (a learned lookup table). These embeddings are trained jointly with the whole model.
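A minimal PyTorch sketch (vocabulary size and embedding dimension below are arbitrary): the layer is just a |V| × d matrix, and a token ID selects one of its rows.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512
embedding = nn.Embedding(vocab_size, d_model)  # learned lookup table, trained with the model

token_ids = torch.tensor([[21, 22, 87]])       # batch of 1 sequence with 3 token IDs
vectors = embedding(token_ids)                 # row lookup per ID
print(vectors.shape)                           # torch.Size([1, 3, 512])
```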

Because Transformers process tokens in parallel, we must add position information explicitly. We add a positional vector to each token embedding: it has the same shape as the embedding, but its values change with the token’s position.
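A minimal sketch of the sinusoidal scheme from the original Transformer paper (learned position embeddings are a common alternative); the sizes below are arbitrary.

```python
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # Sinusoidal positional encoding: sin on even dimensions, cos on odd ones.
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices
    angles = pos / (10_000 ** (i / d_model))                        # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# Positions are simply added element-wise to the token embeddings.
token_embeddings = torch.randn(3, 512)                   # 3 tokens, 512 dimensions
inputs_to_attention = token_embeddings + positional_encoding(3, 512)
print(inputs_to_attention.shape)                         # torch.Size([3, 512])
```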