Word2Vec

An algorithm that learns word embeddings by predicting words from their context.

Word2Vec is an algorithm that learns to represent words as dense vectors (embeddings) based on the idea that words appearing in similar contexts have similar meanings.

The Core Idea

Words that appear near each other in sentences are probably related. Word2Vec learns vectors where similar words end up close together.

"The cat sat on the mat"
"The dog sat on the rug"

→ "cat" and "dog" appear in similar contexts
→ Their vectors will be similar

How It Works (Simplified)

Given a word, predict the words around it (or vice versa). The model learns embeddings as a side effect of solving this prediction task.

Skip-gram

Given a word, predict its neighbours.

Input: "sat"
Predict: "cat", "on", "the"

Good for rare words, works well with small datasets.
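As a rough sketch in plain Python (illustrative only, not gensim's internals), skip-gram turns one window into several single-word prediction tasks:

# Skip-gram direction of prediction: the center word is the input,
# and each neighbour is a separate target to predict.
center = "sat"
neighbours = ["cat", "on", "the"]

skipgram_examples = [(center, target) for target in neighbours]
print(skipgram_examples)
# [('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]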

CBOW (Continuous Bag of Words)

Given neighbours, predict the word.

Input: "cat", "on", "the"
Predict: "sat"

Faster to train, works well with frequent words.
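The same window, seen the CBOW way (again just a sketch of the training data, not the model):

# CBOW direction of prediction: the neighbours together are the input,
# and the single center word is the target.
neighbours = ["cat", "on", "the"]
center = "sat"

cbow_example = (neighbours, center)
print(cbow_example)
# (['cat', 'on', 'the'], 'sat')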

Training Process

  1. Slide a window across the text
  2. For each word, create training pairs with its neighbours
  3. Train a shallow neural network to predict one from the other
  4. The hidden layer weights become the word embeddings

Example with window size = 2:

"The cat sat on the mat"

("sat", "The"), ("sat", "cat"), ("sat", "on"), ("sat", "the")

Simple Example

from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# Train model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Get embedding for a word
cat_vector = model.wv["cat"]  # 100-dimensional vector

# Find similar words
similar = model.wv.most_similar("cat")
# [("dog", 0.92), ("pets", 0.85), ...]

The Magic of Vector Arithmetic

Word2Vec embeddings capture semantic relationships through arithmetic:

king - man + woman ≈ queen
paris - france + italy ≈ rome
bigger - big + small ≈ smaller

This works because the vectors encode meaning as directions in space.
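With gensim, this is the positive/negative form of most_similar. The sketch below assumes the pretrained Google News vectors available through gensim-data (a large download, so treat it as illustrative rather than something to run casually):

import gensim.downloader as api

# Pretrained Word2Vec vectors trained on the Google News corpus (300 dims).
wv = api.load("word2vec-google-news-300")

# king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expected to rank "queen" at or near the top.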

Why Dense Vectors?

Before Word2Vec, words were often represented as one-hot vectors — a single 1 in a vocabulary-sized vector of 0s.

Representation | "cat" (vocab = 10,000)         | Similarity
One-hot        | [0, 0, 0, 1, 0, 0, …, 0]       | Can’t compare
Word2Vec       | [0.2, -0.5, 0.8, …] (100 dims) | Cosine similarity works

Dense vectors let you measure similarity and do arithmetic.
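A small numpy illustration of why this matters: any two distinct one-hot vectors are orthogonal, so their cosine similarity is always zero, while dense vectors can be genuinely close. The dense values below are made up for illustration:

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot: "cat" and "dog" occupy different positions, so similarity is 0,
# exactly as it would be for any other pair of words.
cat_onehot = np.zeros(10_000)
cat_onehot[3] = 1.0
dog_onehot = np.zeros(10_000)
dog_onehot[7] = 1.0
print(cosine(cat_onehot, dog_onehot))          # 0.0

# Dense (toy values): related words can point in nearly the same direction.
cat_dense = np.array([0.2, -0.5, 0.8])
dog_dense = np.array([0.25, -0.4, 0.75])
print(round(cosine(cat_dense, dog_dense), 2))  # ~0.99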

Beyond Words

The same technique works for anything with sequential context:

Domain          | "Words" | "Sentences"
NLP             | Words   | Sentences
Recommendations | Items   | User sessions
Music           | Songs   | Playlists
Graphs          | Nodes   | Random walks

This insight powers techniques like Item2Vec and Node2Vec.
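For example, an Item2Vec-style model is just Word2Vec fed sessions instead of sentences (the product IDs here are made up):

from gensim.models import Word2Vec

# Hypothetical browsing sessions: each session is a "sentence",
# each item ID is a "word".
sessions = [
    ["laptop", "mouse", "keyboard"],
    ["laptop", "usb_hub", "monitor"],
    ["guitar", "amp", "cable"],
]

item_model = Word2Vec(sessions, vector_size=32, window=3, min_count=1, sg=1)
print(item_model.wv.most_similar("laptop", topn=2))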

Hyperparameters

Parameter    | What it does                                  | Typical values
vector_size  | Embedding dimensions                          | 50–300
window       | Context window size                           | 2–10
min_count    | Ignore words seen fewer than this many times  | 1–5
sg           | 0 = CBOW, 1 = skip-gram                       | 1 for small data
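All of these are keyword arguments to gensim's Word2Vec constructor; the values below are illustrative, not recommendations:

from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensions (50-300 is typical)
    window=5,         # how many words on each side count as context
    min_count=1,      # kept at 1 here so the toy corpus isn't filtered away
    sg=1,             # 1 = skip-gram, 0 = CBOW (the default)
)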

Limitations

  • No real understanding: Learns from co-occurrence statistics only, doesn’t “understand” words
  • One vector per word: Can’t handle polysemy (“bank” = river bank or financial bank)
  • Static: Trained once, doesn’t adapt to new text
  • Out of vocabulary: No embedding for unseen words (demonstrated below)
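
The out-of-vocabulary limitation shows up directly in gensim (the unseen word here is arbitrary):

from gensim.models import Word2Vec

model = Word2Vec([["the", "cat", "sat", "on", "the", "mat"]],
                 vector_size=10, min_count=1)

try:
    model.wv["quokka"]   # never appeared in the training data
except KeyError as error:
    print("no embedding for 'quokka':", error)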

Modern approaches like BERT and GPT address these limitations but are more computationally expensive.
