Word2Vec
An algorithm that learns word embeddings by predicting words from their context.
Word2Vec is an algorithm that learns to represent words as dense vectors (embeddings) based on the idea that words appearing in similar contexts have similar meanings.
The Core Idea
Words that appear near each other in sentences are probably related. Word2Vec learns vectors where similar words end up close together.
"The cat sat on the mat"
"The dog sat on the rug"
→ "cat" and "dog" appear in similar contexts
→ Their vectors will be similar
How It Works (Simplified)
Given a word, predict the words around it (or vice versa). The model learns embeddings as a side effect of solving this prediction task.
Skip-gram
Given a word, predict its neighbours.
Input: "sat"
Predict: "cat", "on", "the"
Good for rare words, works well with small datasets.
CBOW (Continuous Bag of Words)
Given neighbours, predict the word.
Input: "cat", "on", "the"
Predict: "sat"
Faster to train, works well with frequent words.
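In gensim, the choice between the two variants is made with the sg parameter. A minimal sketch (the tiny corpus here is just for illustration):

from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects Skip-gram: predict context words from the centre word
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# sg=0 (the default) selects CBOW: predict the centre word from its context
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)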
Training Process
- Slide a window across the text
- For each word, create training pairs with its neighbours
- Train a shallow neural network to predict one from the other
- The hidden layer weights become the word embeddings
Window size = 2
"The cat sat on the mat"
↓
("sat", "The"), ("sat", "cat"), ("sat", "on"), ("sat", "the")
Simple Example
from gensim.models import Word2Vec
sentences = [
["the", "cat", "sat", "on", "the", "mat"],
["the", "dog", "sat", "on", "the", "rug"],
["cats", "and", "dogs", "are", "pets"],
]
# Train model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
# Get embedding for a word
cat_vector = model.wv["cat"] # 100-dimensional vector
# Find similar words
similar = model.wv.most_similar("cat")
# [("dog", 0.92), ("pets", 0.85), ...]
The Magic of Vector Arithmetic
Word2Vec embeddings capture semantic relationships through arithmetic:
king - man + woman ≈ queen
paris - france + italy ≈ rome
bigger - big + small ≈ smaller
This works because the vectors encode meaning as directions in space.
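In gensim this arithmetic is exposed through most_similar with positive and negative word lists. A sketch, assuming a model trained on a corpus large enough to contain these words (the toy corpus above is not):

# Vector arithmetic: king - man + woman, then return the nearest word
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
# On large corpora (e.g. Google News vectors) the top result is typically "queen"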
Why Dense Vectors?
Before Word2Vec, words were often represented as one-hot vectors — a single 1 in a vocabulary-sized vector of 0s.
| Representation | "cat" (vocab=10,000) | Similarity |
|---|---|---|
| One-hot | [0,0,0,1,0,0,…,0] | All distinct words are orthogonal (similarity 0) |
| Word2Vec | [0.2, -0.5, 0.8, …] (100 dims) | Cosine similarity works |
Dense vectors let you measure similarity and do arithmetic.
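A small NumPy sketch of the difference (the dense values are made up for illustration):

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot: distinct words occupy different positions, so they are always orthogonal
cat_onehot = np.zeros(10_000); cat_onehot[3] = 1
dog_onehot = np.zeros(10_000); dog_onehot[7] = 1
print(cosine(cat_onehot, dog_onehot))  # 0.0, no similarity signal at all

# Dense: related words can point in similar directions
cat_dense = np.array([0.2, -0.5, 0.8])
dog_dense = np.array([0.25, -0.4, 0.75])
print(cosine(cat_dense, dog_dense))  # close to 1.0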
Beyond Words
The same technique works for anything with sequential context:
| Domain | "Words" | "Sentences" |
|---|---|---|
| NLP | Words | Sentences |
| Recommendations | Items | User sessions |
| Music | Songs | Playlists |
| Graphs | Nodes | Random walks |
This insight powers techniques like Item2Vec and Node2Vec.
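For example, treating playlists as sentences and songs as words gives song embeddings from the exact same gensim call. A sketch with made-up song IDs:

from gensim.models import Word2Vec

# Each "sentence" is a playlist; each "word" is a song ID (made up here)
playlists = [
    ["song_42", "song_17", "song_99"],
    ["song_17", "song_42", "song_123"],
    ["song_99", "song_123", "song_7"],
]
model = Word2Vec(playlists, vector_size=32, window=3, min_count=1, sg=1)

# Songs that co-occur in playlists end up with similar vectors
similar_songs = model.wv.most_similar("song_42")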
Hyperparameters
| Parameter | What it does | Typical values |
|---|---|---|
| vector_size | Embedding dimensions | 50–300 |
| window | Context window size | 2–10 |
| min_count | Ignore words appearing fewer than this many times | 1–5 |
| sg | 0 = CBOW, 1 = Skip-gram | 1 for small data |
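Putting the table into one call (the values are illustrative choices, not recommendations, and sentences is assumed to be a list of tokenised sentences as in the earlier example):

from gensim.models import Word2Vec

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensions
    window=5,         # context window size
    min_count=2,      # drop words that appear fewer than 2 times
    sg=1,             # 1 = Skip-gram, 0 = CBOW
)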
Limitations
- No word meaning: Just co-occurrence, doesn’t “understand” words
- One vector per word: Can’t handle polysemy (“bank” = river bank or financial bank)
- Static: Trained once, doesn’t adapt to new text
- Out of vocabulary: No embedding for unseen words (see the sketch below)
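The out-of-vocabulary limitation is visible directly in gensim: indexing a word the model never saw raises a KeyError, so membership is often checked first (a sketch reusing the model from the earlier example):

word = "hamster"  # never appeared in the training sentences

if word in model.wv:
    vector = model.wv[word]
else:
    print(f"'{word}' has no embedding (out of vocabulary)")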
Modern approaches like BERT and GPT address these limitations but are more computationally expensive.
See Also
- Random Walks — Uses Word2Vec on graph traversals