Embeddings: turning words into meaning-carrying vectors
Why simple token IDs fail, how one-hot and bag-of-words work (and break), and how dense embeddings + positional encoding prepare inputs for attention.
Everything becomes numbers
The first rule of training a machine learning model: convert the input into numbers.
Token IDs (why not)
A simple idea is to assign each unique word a unique number (a token ID). But token IDs create fake “distances”: if “bad” is 22 and “great” is 21, the model may treat them as similar just because the numbers are close, even though their meanings are opposite.
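A tiny sketch of the problem, with made-up IDs (the specific numbers are only for illustration):

```python
# Made-up token IDs; the specific numbers are only for illustration.
token_id = {"terrible": 20, "great": 21, "bad": 22}

# Numerically, "bad" and "great" look almost identical...
print(abs(token_id["bad"] - token_id["great"]))      # 1
# ...while "bad" and "terrible", which are near-synonyms, look farther apart.
print(abs(token_id["bad"] - token_id["terrible"]))   # 2
```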
One-hot encoding
One-hot represents each word as a vector of length |V| (vocabulary size). Only one position is 1; all others are 0.
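A minimal sketch with a toy four-word vocabulary (real vocabularies have tens of thousands of entries):

```python
# A tiny vocabulary chosen for illustration; real |V| is tens of thousands.
vocab = ["bad", "great", "house", "old"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> list[int]:
    vec = [0] * len(vocab)          # vector of length |V|
    vec[word_to_id[word]] = 1       # a single 1 at the word's position
    return vec

print(one_hot("great"))   # [0, 1, 0, 0]
```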
Why one-hot is limited
One-hot vectors fix the fake-distance problem but create new ones. The vectors are as long as the vocabulary and almost entirely zero, and every pair of distinct vectors is orthogonal, so every word is exactly as “far” from every other word. The representation carries no notion of similarity at all.
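Continuing the one_hot sketch above, the “no similarity” point is easy to verify: the dot product between any two distinct one-hot vectors is always 0.

```python
# Dot product of any two distinct one-hot vectors is 0: "bad" looks exactly
# as unrelated to "great" as it does to "house".
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(dot(one_hot("bad"), one_hot("great")))   # 0
print(dot(one_hot("bad"), one_hot("house")))   # 0
```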
Bag of Words / n-grams
Bag-of-words represents a sentence by counting how often each word occurs, discarding word order. Unigrams count single words; bigrams and trigrams add a little local context by counting pairs and triples of adjacent words.
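A small sketch of unigram and bigram counting on one made-up sentence, using only the standard library:

```python
from collections import Counter

tokens = "the old house and the old door".split()

# Unigram counts: a pure bag of words, word order is discarded.
unigrams = Counter(tokens)                  # {'the': 2, 'old': 2, 'house': 1, ...}

# Bigram counts: pairs of adjacent words add a little local context.
bigrams = Counter(zip(tokens, tokens[1:]))  # {('the', 'old'): 2, ('old', 'house'): 1, ...}

print(unigrams.most_common(3))
print(bigrams.most_common(3))
```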
Why n-grams don’t scale
The number of possible n-grams grows roughly as |V|^n: with a 50,000-word vocabulary there are about 2.5 billion possible bigrams, and most of them never appear in any training text. The features become enormous and sparse, and counts for one phrase tell the model nothing about a similar but unseen phrase.
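The combinatorial growth is easy to see directly (the vocabulary size is an illustrative assumption):

```python
vocab_size = 50_000   # illustrative; real vocabularies are often this size or larger

for n in (1, 2, 3):
    print(f"{n}-grams: {vocab_size ** n:,} possible features")
# 1-grams: 50,000
# 2-grams: 2,500,000,000
# 3-grams: 125,000,000,000,000
```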
Semantic + contextual meaning
To predict the next token, a model needs semantic understanding (words with similar meanings should get similar representations) and context (what “old” describes in this particular sentence; what “dusty” refers to). This is why embeddings and attention matter.
Word embeddings
An embedding is a dense vector in a continuous space. Similar words have vectors that are close together, and the direction between vectors can encode relationships (the classic example: king - man + woman lands near queen).
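A toy sketch of what “close” and “direction” mean, using hand-made 4-dimensional vectors (the numbers are invented for illustration; real embeddings are learned and have hundreds of dimensions):

```python
import numpy as np

# Hand-picked toy vectors, not real trained embeddings.
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.8, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.8, 0.0]),
    "door":  np.array([0.0, 0.0, 0.0, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similar words are close...
print(cosine(vectors["king"], vectors["queen"]))   # high (~0.66)
print(cosine(vectors["king"], vectors["door"]))    # 0.0, unrelated

# ...and the direction between vectors can encode a relationship.
analogy = vectors["king"] - vectors["man"] + vectors["woman"]
print(cosine(analogy, vectors["queen"]))           # 1.0 here: the analogy lands exactly on "queen"
```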
What’s inside an embedding dimension?
Individual dimensions rarely correspond to clean, human-readable features such as “royalty” or “gender”. The values are learned automatically, meaning is spread across many dimensions, and what carries the semantics is the overall geometry: which vectors are close, and in which directions they differ.
Word2Vec training (CBOW & Skip-gram)
Word2Vec trains embeddings by predicting a center word from its surrounding words (CBOW) or predicting the surrounding words from the center word (Skip-gram). After training, the learned weight matrix acts like an embedding lookup table: row i is the embedding of word i.
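A minimal Skip-gram sketch in PyTorch; the tiny corpus, the window size of 1, and all hyperparameters are illustrative assumptions, not part of the original recipe. The point is only that the embedding layer’s weight matrix ends up as the lookup table:

```python
import torch
import torch.nn as nn

corpus = "the old house was dusty and the old door creaked".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}

# Build (center, context) training pairs with a window of 1.
pairs = []
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            pairs.append((word_to_id[w], word_to_id[corpus[j]]))

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])

embed_dim = 8
model = nn.Sequential(
    nn.Embedding(len(vocab), embed_dim),   # the embedding lookup table
    nn.Linear(embed_dim, len(vocab)),      # predicts the context word
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(centers), contexts)
    loss.backward()
    optimizer.step()

# After training, the first layer's weight matrix is the embedding table.
embeddings = model[0].weight.detach()      # shape: (|V|, embed_dim)
print(embeddings[word_to_id["old"]])       # the learned vector for "old"
```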
Embedding layer in Transformers
Transformers use a trainable embedding layer: token IDs select rows from an embedding matrix (a learned lookup table). These embeddings are trained jointly with the whole model.
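A sketch of the lookup, assuming PyTorch; the vocabulary size, model dimension, and token IDs are made up for illustration:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512              # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)  # learned |V| x d_model lookup table

token_ids = torch.tensor([[12, 5, 731, 2]])    # (batch=1, seq_len=4), made-up IDs
token_embeddings = embedding(token_ids)        # selects rows 12, 5, 731 and 2
print(token_embeddings.shape)                  # torch.Size([1, 4, 512])
```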
Positional encoding
Because Transformers process all tokens in parallel, they have no built-in sense of word order, so position information must be added explicitly. A positional vector with the same shape as the token embedding is added to it element-wise; the embedding stays the same for a given token, while the positional values change with the token’s position in the sequence.
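One common choice (not the only one) is the fixed sinusoidal encoding from the original Transformer paper. A minimal sketch, again assuming PyTorch and made-up sizes:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    # Fixed sin/cos encodings in the style of the original Transformer paper.
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    freqs = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * freqs)   # even dimensions
    pe[:, 1::2] = torch.cos(positions * freqs)   # odd dimensions
    return pe

token_embeddings = torch.randn(4, 512)           # stand-in embeddings: (seq_len, d_model)
inputs = token_embeddings + sinusoidal_positional_encoding(4, 512)   # same shape, added element-wise
```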