Word2Vec
An algorithm that learns word embeddings by predicting words from their context.
Word2Vec is an algorithm that learns to represent words as dense vectors (embeddings) based on the idea that words appearing in similar contexts have similar meanings.
The Core Idea
Words that appear near each other in sentences are probably related. Word2Vec learns vectors where similar words end up close together.
"The cat sat on the mat"
"The dog sat on the rug"
→ "cat" and "dog" appear in similar contexts
→ Their vectors will be similar
How It Works (Simplified)
Given a word, predict the words around it (or vice versa). The model learns embeddings as a side effect of solving this prediction task.
Skip-gram
Given a word, predict its neighbours.
Input: "sat"
Predict: "cat", "on", "the"
Good for rare words, works well with small datasets.
CBOW (Continuous Bag of Words)
Given neighbours, predict the word.
Input: "cat", "on", "the"
Predict: "sat"
Faster to train, works well with frequent words.
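In gensim, the choice between the two variants is made with the sg parameter. A minimal sketch (the tiny corpus here is just for illustration):

from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# sg=1 selects Skip-gram: predict context words from the centre word
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# sg=0 (the default) selects CBOW: predict the centre word from its context
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)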
Training Process
- Slide a window across the text
- For each word, create training pairs with its neighbours
- Train a shallow neural network to predict one from the other
- The hidden layer weights become the word embeddings
Window size = 2
"The cat sat on the mat"
↓
("sat", "The"), ("sat", "cat"), ("sat", "on"), ("sat", "the")
Simple Example
from gensim.models import Word2Vec
sentences = [
["the", "cat", "sat", "on", "the", "mat"],
["the", "dog", "sat", "on", "the", "rug"],
["cats", "and", "dogs", "are", "pets"],
]
# Train model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
# Get embedding for a word
cat_vector = model.wv["cat"] # 100-dimensional vector
# Find similar words
similar = model.wv.most_similar("cat")
# [("dog", 0.92), ("pets", 0.85), ...]
The Magic of Vector Arithmetic
Word2Vec embeddings capture semantic relationships through arithmetic:
king - man + woman ≈ queen
paris - france + italy ≈ rome
bigger - big + small ≈ smaller
This works because the vectors encode meaning as directions in space.
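In gensim this arithmetic is exposed through most_similar with positive and negative word lists. A sketch, assuming a model trained on a corpus large enough to contain these words (the toy corpus above is not):

# Vector arithmetic: king - man + woman, then return the nearest word
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
# On large corpora (e.g. Google News vectors) the top result is typically "queen"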
Why Dense Vectors?
Before Word2Vec, words were often represented as one-hot vectors — a single 1 in a vocabulary-sized vector of 0s.
| Representation | "cat" (vocab=10,000) | Similarity |
|---|---|---|
| One-hot | [0,0,0,1,0,0,…,0] | All distinct words are orthogonal (similarity 0) |
| Word2Vec | [0.2, -0.5, 0.8, …] (100 dims) | Cosine similarity works |
Dense vectors let you measure similarity and do arithmetic.
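A small NumPy sketch of the difference (the dense values are made up for illustration):

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot: distinct words occupy different positions, so they are always orthogonal
cat_onehot = np.zeros(10_000); cat_onehot[3] = 1
dog_onehot = np.zeros(10_000); dog_onehot[7] = 1
print(cosine(cat_onehot, dog_onehot))  # 0.0, no similarity signal at all

# Dense: related words can point in similar directions
cat_dense = np.array([0.2, -0.5, 0.8])
dog_dense = np.array([0.25, -0.4, 0.75])
print(cosine(cat_dense, dog_dense))  # close to 1.0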
Beyond Words
The same technique works for anything with sequential context:
| Domain | "Words" | "Sentences" |
|---|---|---|
| NLP | Words | Sentences |
| Recommendations | Items | User sessions |
| Music | Songs | Playlists |
| Graphs | Nodes | Random walks |
This insight powers techniques like Item2Vec and Node2Vec.
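For example, treating playlists as sentences and songs as words gives song embeddings from the exact same gensim call. A sketch with made-up song IDs:

from gensim.models import Word2Vec

# Each "sentence" is a playlist; each "word" is a song ID (made up here)
playlists = [
    ["song_42", "song_17", "song_99"],
    ["song_17", "song_42", "song_123"],
    ["song_99", "song_123", "song_7"],
]
model = Word2Vec(playlists, vector_size=32, window=3, min_count=1, sg=1)

# Songs that co-occur in playlists end up with similar vectors
similar_songs = model.wv.most_similar("song_42")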
Hyperparameters
| Parameter | What it does | Typical values |
|---|---|---|
| vector_size | Embedding dimensions | 50–300 |
| window | Context window size | 2–10 |
| min_count | Ignore words appearing fewer than this many times | 1–5 |
| sg | 0 = CBOW, 1 = Skip-gram | 1 for small data |
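Putting the table into one call (the values are illustrative choices, not recommendations, and sentences is assumed to be a list of tokenised sentences as in the earlier example):

from gensim.models import Word2Vec

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensions
    window=5,         # context window size
    min_count=2,      # drop words that appear fewer than 2 times
    sg=1,             # 1 = Skip-gram, 0 = CBOW
)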
Limitations
- No word meaning: Just co-occurrence, doesn’t “understand” words
- One vector per word: Can’t handle polysemy (“bank” = river bank or financial bank)
- Static: Trained once, doesn’t adapt to new text
- Out of vocabulary: No embedding for unseen words (see the sketch below)
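The out-of-vocabulary limitation is visible directly in gensim: indexing a word the model never saw raises a KeyError, so membership is often checked first (a sketch reusing the model from the earlier example):

word = "hamster"  # never appeared in the training sentences

if word in model.wv:
    vector = model.wv[word]
else:
    print(f"'{word}' has no embedding (out of vocabulary)")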
Modern approaches like BERT and GPT address these limitations but are more computationally expensive.
See Also
- Random Walks — Uses Word2Vec on graph traversals