Lecture 5 - Sequence Models & Word Embeddings

Welcome back!

Last time: Tokenization - how text becomes pieces a model can process

Today: How those pieces get meaning - word embeddings and sequence models

Why this matters: These are the building blocks of every LLM

Ice breaker: Personal Corpus

(See Poll Everywhere)

Connecting the pieces

Lecture 2: Classical NLP (BoW, TF-IDF, n-grams) - count words

Lecture 3: Neural networks - the learning machinery

Lecture 4: Tokenization - break text into pieces

Today: Learn representations that capture meaning

Spoiler: LLMs are basically this idea at massive scale.

Agenda for today

  1. From counting to meaning: the distributional hypothesis
  2. Encoder-decoder framework for sequence tasks
  3. Word embeddings: Word2Vec and how neural networks learn meaning
  4. Properties of Embeddings
  5. Ethics: Bias in embeddings
  6. Quick intro to RNNs (and why transformers replaced them)

Part 1: The Distributional Hypothesis

The problem with counting

Remember n-grams from Lecture 2?

Training text: "I love NLP. I love machine learning."

Bigram model learns: I -> love, love -> (NLP or machine)

But what if we see: "I adore NLP"?

The model has no idea that "adore" and "love" are similar!

We need a representation that captures semantic similarity

The insight: distributional hypothesis

"You shall know a word by the company it keeps"

  • J.R. Firth, 1957

Intuition: Words that appear in similar contexts have similar meanings

This is the foundation of how LLMs work.

For more theories, go down a rabbit hole on "semiotics"

Think about these sentences

"The cat sat on the mat"

"The dog sat on the mat"

"The automobile sat on the mat" (weird!)

Question: What other contexts do "cat" and "dog" share?

From contexts to vectors

Idea: Represent each word as a vector based on the contexts where it appears

Words in similar contexts lead to similar vectors

Example (simplified):

  • "cat" -> [0.8 near "sat", 0.9 near "mat", 0.7 near "pet", ...]
  • "dog" -> [0.9 near "sat", 0.8 near "mat", 0.9 near "pet", ...]
  • "automobile" -> [0.1 near "sat", 0.0 near "mat", 0.0 near "pet", ...]

cat and dog vectors are close together in high-dimensional space!
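To make "close together" concrete, here's a minimal sketch using the toy numbers above (the values are illustrative, not from a real corpus) and cosine similarity, the standard closeness measure for embeddings:

```python
import numpy as np

# Toy context vectors from the slide (illustrative values, not real corpus counts)
vecs = {
    "cat":        np.array([0.8, 0.9, 0.7]),  # strength near "sat", "mat", "pet"
    "dog":        np.array([0.9, 0.8, 0.9]),
    "automobile": np.array([0.1, 0.0, 0.0]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 = same direction, 0.0 = unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs["cat"], vecs["dog"]))         # high
print(cosine(vecs["cat"], vecs["automobile"]))  # noticeably lower
```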

Similarity voting - poll everywhere

Which word is MOST similar to "cat"?

A) dog B) car C) meow D) kitten

Round 2: Which is most similar to "Taylor Swift"?

A) Beyoncé B) Taylor Smith C) Travis Kelce D) 1989

Round 3: Which is most similar to "bank"?

A) river B) money C) rob D) save

What would a computer answer?

Part 2: Encoder-Decoder Framework

From single words to sequences

Word embeddings solve: Representing individual words as vectors

But many NLP tasks need: Processing and generating sequences

Machine translation: "I love NLP" -> "J'adore le NLP"

Summarization: Long article -> short summary

Question answering: Question + context -> answer

We need architectures for sequence-to-sequence tasks

The encoder-decoder architecture

High-level idea:

Encoder: Read the input sequence, build a representation

Decoder: Generate the output sequence using that representation

Example (translation):

  • Encoder reads English: "I love NLP"
  • Encoder outputs: [0.134, 0.841, ... , 0.529]
  • Decoder uses that vector to generate French: "J'adore le NLP"

This framework is still how modern LLMs work:

  • GPT, Claude, LLaMA: Decoder-only (generate text from a prompt)
  • BERT: Encoder-only (understand text, don't generate)
  • T5, translation models: Full encoder-decoder

Real-world impact: Google Translate (2016)

In 2016, Google switched from phrase-based translation to a neural encoder-decoder model.

Translation quality improved more in that single jump than in the previous 10 years combined.

Google released a great paper, "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation" if you want to learn more!

Encoder-decoder diagram

The context vector is the bottleneck!

This pattern is everywhere

Encoder-decoder isn't just for language; it shows up any time you compress information through a bottleneck and reconstruct on the other side.

Domain            | Encoder                   | Bottleneck           | Decoder
Translation       | Read English sentence     | Context vector       | Generate French sentence
Audio streaming   | Raw audio waveform        | Compressed bitstream | Reconstructed audio
Image compression | Full-resolution photo     | Small image file     | Reconstructed image
Biology           | Original genetic sequence | Embedding space      | Functionally similar sequences (VAE)

The key trade-off is always the same: How small can you make the middle representation while still reconstructing something useful?

Modern examples of this pattern:

  • Stable Diffusion (that's the "latent" in Latent Diffusion)
  • Meta's EnCodec, Google's SoundStream

The bottleneck problem

Challenge: Compress an entire sentence into a single fixed-size vector

Short sentences: "Hi" -> 1 vector (okay)

Long sentences: "The quick brown fox jumps over the lazy dog" -> 1 vector (hard)

Very long: "In the beginning was the Word, and the Word was with God..." -> 1 vector (impossible)

The fixed-size vector becomes a bottleneck for long sequences
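One way to feel this bottleneck: suppose we "encode" a sentence by simply averaging its word vectors (a crude stand-in for a real encoder, with made-up random embeddings). Word order vanishes completely:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 4-dim embeddings for a tiny vocabulary
emb = {w: rng.normal(size=4) for w in ["dog", "bites", "man"]}

def encode(sentence):
    # Crude "encoder": average word vectors into one fixed-size context vector
    return np.mean([emb[w] for w in sentence.split()], axis=0)

v1 = encode("dog bites man")
v2 = encode("man bites dog")
print(np.allclose(v1, v2))  # True: averaging throws away word order
```

Real encoders are smarter than a plain average, but any fixed-size summary has to discard something as the input grows.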

But what if we just... didn't compress?

Thought experiment: What if the decoder could look at ALL the word vectors' states, not just a single combined one?

Instead of: Input -> Encoder -> one vector -> Decoder -> Output

What about: Input -> Encoder -> all words available -> Decoder picks what it needs -> Output

This is exactly what attention does. Wednesday's topic!

Part 3: Word2Vec - Learning Embeddings with Neural Networks

From framework to technique

We have a framework: encoder builds a representation, decoder uses it. But how do we actually learn those word representations?

Word2Vec (Mikolov et al., 2013): Train a neural network on a dead-simple task: given a word, predict its neighbors. The representations it learns along the way turn out to capture meaning.

Skip-gram: The training task

  • Training sentence: "The cat sat on the mat"
  • Center word: "sat"
  • Context window (size 2): the 2 words on each side
  • Training pairs generated: (sat, The), (sat, cat), (sat, on), (sat, the)
     The     cat    [sat]    on     the    mat
      ↑       ↑    center    ↑       ↑
   context context        context context

Each pair is a separate training example. Slide the window across billions of sentences and you get billions of training pairs.

Window size is a hyperparameter, typically 5-10.

  • Larger windows capture semantic/topical similarity ("dog" and "cat" both appear near "pet")
  • Smaller windows capture syntactic similarity ("dog" and "cat" both follow "the")

Skip-gram: The training data

What does the training set actually look like? Each center word paired with each context word is one training example: input x, target y.

"The cat sat on the mat", window = 2:

Input (x) | Target (y)
The       | cat
The       | sat
cat       | The
cat       | sat
cat       | on
sat       | The
sat       | cat
sat       | on
sat       | the
on        | cat
on        | sat
on        | the
on        | mat
the       | sat
the       | on
the       | mat
mat       | on
mat       | the

18 training pairs from a single 6-word sentence. The network sees each row independently: "given this input word (as a one-hot vector), try to predict this target word." Scale to billions of sentences and you get billions of training pairs.
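The sliding-window procedure behind that table can be sketched in a few lines (a minimal version; real implementations also do things like subsampling frequent words):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs with a sliding window."""
    pairs = []
    for i, center in enumerate(tokens):
        # Look at up to `window` neighbors on each side, skipping the center itself
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "The cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=2)
print(len(pairs))  # 18
```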

Skip-gram: The architecture

This is a mini encoder-decoder:

Encoder (embedding layer): One-hot vector for "sat" (size 50,000) × weight matrix W_embed (50,000 × 300) = word vector (size 300)

This is just a lookup - multiplying a one-hot vector by a matrix pulls out one row.

Decoder (context layer): Word vector (size 300) × weight matrix W_context (300 × 50,000), then softmax to get probability for each word in the vocabulary

Training: For each pair (sat, cat), did the model assign high probability to "cat"? If not, backpropagation adjusts both weight matrices.

After training, we throw away the decoder (W_context). The encoder weights (W_embed) are the word embeddings. Each row is a word's vector.
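Here's a toy NumPy sketch of that forward pass, shrunk to a 10-word vocabulary and 4 dimensions so it's readable (weights are random stand-ins, not trained values):

```python
import numpy as np

vocab_size, embed_dim = 10, 4   # tiny stand-ins for 50,000 and 300
rng = np.random.default_rng(42)
W_embed = rng.normal(scale=0.1, size=(vocab_size, embed_dim))    # encoder weights
W_context = rng.normal(scale=0.1, size=(embed_dim, vocab_size))  # decoder weights

def forward(center_id):
    # "Encoder": one-hot × W_embed is just selecting row center_id
    h = W_embed[center_id]              # word vector, shape (4,)
    # "Decoder": project back to vocabulary size, then softmax
    logits = h @ W_context              # shape (10,)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()              # probability for each vocabulary word

probs = forward(center_id=3)
print(probs.shape, probs.sum())  # (10,) and ~1.0
```

Training would compare `probs` against the true context word and backpropagate into both matrices; afterward only `W_embed` is kept.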

From token to embedding

Putting it together - how does raw text become vectors?

"The cat sat" passes through tokenizer to get token IDs [0, 1, 2], then embedding lookup gives us three 300-dim vectors

For "cat" (token ID 1), the lookup selects row 1 from the embedding matrix:

   W_embed:         dim1   dim2   dim3   ...  (300 cols)
   ID 0  "the":  [  0.12, -0.34,  0.56,  ... ]
   ID 1  "cat":  [  0.78,  0.23, -0.11,  ... ]  ← this row
   ID 2  "sat":  [  0.45,  0.67,  0.89,  ... ]
   ...              (one row per token in vocabulary)

The tokenizer decides WHAT gets embedded. The embedding matrix learns HOW to represent it.
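A quick check that the one-hot multiply and the row lookup really are the same operation (toy matrix, not real embeddings):

```python
import numpy as np

vocab_size, dim = 5, 3
W_embed = np.arange(15.0).reshape(vocab_size, dim)  # toy embedding matrix

token_id = 1                  # "cat" in the slide's example
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# Multiplying by a one-hot vector pulls out exactly one row...
via_matmul = one_hot @ W_embed
# ...so real implementations skip the multiply and index directly
via_lookup = W_embed[token_id]

print(np.array_equal(via_matmul, via_lookup))  # True
```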

Thought experiment: Training data matters

Turn to your neighbor: Pick one of these domains. What does "cell" mean there?

  • Medical journals: "cell membrane", "cell division", "stem cell"
  • Legal documents: "prison cell", "jail cell", "cell block"
  • Tech blogs: "cell phone", "cellular network", "spreadsheet cell"
  • Biology textbooks: "cell wall", "cell nucleus"

Same word, completely different vectors. The distributional hypothesis means your embeddings are only as good as the text they learned from.

Part 4: Properties of Embeddings

The embedding space

After training, each word is a 300-dimensional vector (typically)

Why 300? More dimensions = more nuance. 50 dimensions might capture "cat is an animal." 300 dimensions can also capture "cat is small, is a pet, is independent, is internet-famous, purrs, has whiskers..."

Example (simplified to 2D for visualization):

"king"   -> [0.5, 0.8]
"queen"  -> [0.6, 0.7]
"man"    -> [0.3, 0.9]
"woman"  -> [0.4, 0.8]
"banana" -> [0.9, 0.1]

Similar words are close together in this space

Vector arithmetic: The famous example

king - man + woman ≈ queen

Paris - France + Italy ≈ Rome

better - good + bad ≈ worse

The embeddings capture relationships, not just similarity!
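You can verify the arithmetic with the toy 2-D vectors from the previous slide; with those numbers, king - man + woman lands exactly on queen:

```python
import numpy as np

# The toy 2-D vectors from the previous slide
vecs = {
    "king":   np.array([0.5, 0.8]),
    "queen":  np.array([0.6, 0.7]),
    "man":    np.array([0.3, 0.9]),
    "woman":  np.array([0.4, 0.8]),
    "banana": np.array([0.9, 0.1]),
}

target = vecs["king"] - vecs["man"] + vecs["woman"]  # = [0.6, 0.7]

# Find the nearest word to the result, excluding the three input words
candidates = {w: v for w, v in vecs.items() if w not in {"king", "man", "woman"}}
best = min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))
print(best)  # queen
```

Real embedding libraries do the same thing in 300+ dimensions, usually with cosine similarity instead of Euclidean distance.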

The limitation: One vector per word

Think about this sentence:

"I deposited money at the bank before walking along the river bank."

Word2Vec gives "bank" ONE vector. Same representation for both meanings.

Question: How would you want a smarter system to handle this?

FYI - there are other word embeddings

GloVe (2014) and FastText (2016) used the same distributional hypothesis but with different technical tricks.

FastText is notable for handling out-of-vocabulary words by using character n-grams.

Word2Vec is the most conceptually clear, which is why we focused on it.

Loading word embeddings in python

Let's actually work with pre-trained word vectors!

# Using gensim library
from gensim.models import KeyedVectors

# Load pre-trained vectors
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Find similar words
model.most_similar("king")
# Output: [('queen', 0.65), ('monarch', 0.58), ('prince', 0.55), ...]

# Compute similarity
model.similarity("cat", "dog")    # High! ~0.76
model.similarity("cat", "car")    # Low  ~0.31

# Analogies
model.most_similar(positive=["woman", "king"], negative=["man"])
# Output: [('queen', 0.71), ...]

Challenge: Best and worst analogies

Each pair has a mission:

  1. Find the most surprising analogy that works
  2. Find one you expected to work but doesn't

Format: [word1] - [word2] + [word3] ≈ ?

Starting ideas:

  • swimming - swim + run ≈ ?
  • France - Paris + London ≈ ?
  • good - bad + ugly ≈ ?

Also try: projector.tensorflow.org

We'll vote on the best find!

Do modern LLMs use Word2Vec?

No, but they use the same concept.

  • GPT, Claude, LLaMA all have an embedding layer as their first layer
  • Each token in the vocabulary gets a learned vector (typically 4096+ dimensions now)
  • These embeddings are learned during training, not separately

How big is this? GPT-2's embedding table alone:

  • 50,257 tokens
  • 768 dimensions
  • 38.6 million parameters, and that's just the first layer of a "small" model

The key difference:

  • Word2Vec embeddings are static: "bank" has one vector whether it's a river bank or a money bank
  • Modern LLMs start with the same kind of static lookup table, but then transformer layers use attention to build context-dependent representations on top
  • By layer 40, "bank" looks completely different depending on whether "river" or "money" is nearby

Where are embeddings used today? (skim)

"If LLMs learn their own embeddings, is Word2Vec obsolete?"

Not quite! Embeddings are still everywhere:

Application        | How embeddings help
Search / Retrieval | Find documents similar to a query (semantic search)
Recommendations    | "Users who liked X also liked Y"
RAG systems        | Find relevant chunks to feed to an LLM
Clustering         | Group similar documents automatically
Anomaly detection  | Find outliers in text data

E.g. Spotify: Your listening history becomes a point in "music space," and recommendations are nearby points.

When to use pre-trained embeddings vs. LLMs:

  • Embeddings: Fast, cheap, good for similarity/search
  • LLMs: Slower, expensive, good for generation/reasoning
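A semantic search engine in miniature, with hypothetical document and query embeddings (in practice these would come from averaged word vectors or a sentence encoder):

```python
import numpy as np

# Hypothetical document embeddings (illustrative 3-dim values)
docs = {
    "how to adopt a kitten":  np.array([0.9, 0.1, 0.0]),
    "best dog training tips": np.array([0.7, 0.2, 0.2]),
    "car engine maintenance": np.array([0.0, 0.1, 0.9]),
}
query = np.array([0.85, 0.15, 0.05])  # pretend embedding of "pet care advice"

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents by similarity to the query: that's semantic search
ranked = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
print(ranked[0])
```

The same ranking trick powers retrieval in RAG systems, just over millions of chunks with an approximate nearest-neighbor index.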

Part 5: Ethics and Bias in Embeddings

The problem: Embeddings learn human biases

Remember: Embeddings learn from text data

That text reflects human biases

So embeddings encode those biases into the vectors

We just learned vector arithmetic. Let's try one more:

doctor - man + woman ≈ ???

doctor - man + woman ≈ ?

Result: "nurse"

programmer - man + woman ≈ ?

Result: "homemaker"

These reflect gender stereotypes in the training data

If you're interested, check out the famous paper "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings"

Let's try a few! (And test your embeddings from earlier)

jupyter notebook scripts/embedding_demo.ipynb

Real-world impact: Amazon's recruiting tool

2014-2017: Amazon built AI to screen resumes

Trained on: Historical resumes (mostly from men, especially in tech roles)

Result: The model learned to penalize:

  • The word "women's", so "women's chess club captain" was a red flag
  • Graduates of all-women's colleges
  • Any signal correlated with being female

Outcome: Amazon scrapped the tool in 2017

This isn't a "bug in the algorithm"; it's the algorithm doing exactly what we taught it. The bias is in the data.

Discussion: Where do these biases come from?

Turn to your neighbor:

  1. Why do word embeddings encode bias?
  2. Where in the pipeline does bias enter?
  3. What can we do about it?

The bias pipeline

1. Training data reflects historical bias

2. Algorithm accurately learns patterns (including biased ones)

3. Embeddings encode those biases as geometric relationships

4. Downstream applications (hiring, lending, recommendations) amplify bias

The algorithm is doing its job - that's the problem!

Can we "debias" embeddings?

Bolukbasi et al.'s approach:

  1. Identify a "gender direction" in embedding space
  2. For neutral words (like professions), remove the gender component
  3. Preserve gender for definitional words (king/queen, father/mother)
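Step 2 is a vector projection. A minimal sketch with random stand-in vectors (the real method derives the direction from many definitional pairs, not just one):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 5-dim embeddings; real ones would come from a trained model
he, she = rng.normal(size=5), rng.normal(size=5)
doctor = rng.normal(size=5)

# 1. Gender direction: difference of a definitional pair, normalized
g = he - she
g = g / np.linalg.norm(g)

# 2. For a neutral word, subtract its component along that direction
doctor_debiased = doctor - (doctor @ g) * g

# The debiased vector is now orthogonal to the gender direction
print(abs(doctor_debiased @ g) < 1e-10)  # True
```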

Does this work?

Partially - reduces some measurable biases

But doesn't eliminate them, and may introduce new problems

Hard to define "fair" - what should the "right" associations be?

Have you seen this? Have there been times ChatGPT/Claude/etc. gave you a response that felt stereotypical or made assumptions?

The deeper questions

Open discussion:

Should we try to debias embeddings? Why or why not?

If embeddings accurately reflect reality, is that itself a problem?

Who gets to decide what's "biased" vs "accurate"?

Whose responsibility is this: researchers? companies? users?

There are no easy answers - that's what makes this important

What companies do now

2016 framing: "Debias word embeddings"

2026 framing: "Align LLMs with human values"

Approach        | How it works
Data curation   | Filter training data for quality and balance
RLHF            | Train model to prefer "good" outputs (Week 7)
Content filters | Block harmful outputs at inference time
Red-teaming     | Hire people to find problems before users do

None of these fully solve the problem. Active research area.

Part 6: RNNs - Context You Should Know

The implementation question

We've seen the encoder-decoder framework. We've seen how to learn word vectors.

But here's the problem: neural networks expect fixed-size inputs. Sentences have variable length.

How would YOU feed a sentence into a neural network?

Take 15 seconds to think about it.

How to build the encoder and decoder?

Option 1: Just use feed-forward networks

  • Problem: Can't handle variable length sequences!

Option 2: Recurrent Neural Networks (RNNs)

  • Process sequences one step at a time
  • Maintain "hidden state" that carries information
  • This was the dominant approach 2014-2017

Option 3: Transformers with attention

  • This is what won (2017+)
  • We'll finally dig into this starting Wednesday

RNNs in 60 seconds

Before transformers (2014-2017), RNNs were how we processed sequences.

The idea: Process tokens one at a time, maintaining a "hidden state" that carries information forward.

"I"         -> h1
"love" & h1 -> h2
"NLP" & h2  -> h3

You don't need to know the math. Just know they existed and why they lost.
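For the curious, here is what that loop looks like, with toy sizes and random weights (no training, just the recurrence):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4  # toy sizes: 4-dim embeddings and hidden state
W_x = rng.normal(scale=0.5, size=(dim, dim))  # input-to-hidden weights
W_h = rng.normal(scale=0.5, size=(dim, dim))  # hidden-to-hidden weights

# Hypothetical embeddings for "I love NLP"
tokens = [rng.normal(size=dim) for _ in range(3)]

h = np.zeros(dim)  # h0: empty hidden state
for x in tokens:
    # Each step mixes the new token with everything seen so far.
    # Note the sequential dependency: h3 cannot be computed before h2.
    h = np.tanh(W_x @ x + W_h @ h)

print(h.shape)  # the whole sentence, squeezed into one small vector
```

That `for` loop is exactly why RNNs can't parallelize across a sentence.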

If you're curious - check out Andrej Karpathy's excellent (viral) blog post "The Unreasonable Effectiveness of Recurrent Neural Networks"

Why transformers replaced RNNs

Can't parallelize

  • Each token waits for the previous. Can't parallelize across a sentence.
  • At frontier scale (models like GPT-5), purely sequential training would be impractically slow.

Vanishing gradients

  • Information from early tokens fades.
  • Hard to connect "The cat that..." to "...was hungry" 50 words later.

Context bottleneck

  • Entire input compressed to one context vector
  • Same problem we discussed - can't fit a novel into 512 numbers

LSTMs/GRUs helped with gradients but didn't fix parallelization or bottleneck.

The solution: Attention (Wednesday!)

Attention lets the model:

  • Process all tokens in parallel (fast!)
  • Look directly at any input token when generating output (no bottleneck!)
  • Learn which tokens are relevant to which (better long-range connections!)

This is why we have modern LLMs. Without attention, GPT-5 couldn't exist.

What we've learned today

Distributional hypothesis: words in similar contexts have similar meanings

Encoder-decoder framework for sequence-to-sequence tasks

Word2Vec: train neural networks to predict context, learn embeddings

Embeddings encode societal biases from training data

RNNs briefly (transformers replaced them!)

Connecting the dots

Lecture 2: Classical NLP (counting, BoW, n-grams)

Lecture 3: Neural networks (the learning machinery)

Lecture 4: Tokenization (how we break text into pieces)

Lecture 5 (today): Embeddings + encoder-decoder (putting it together for sequences)

Wednesday: Attention, the key ingredient in transformers

Lab/reflection for week 4 due Friday

  • Explore sequence-to-sequence concepts and word embeddings
  • Use pre-trained embeddings (gensim) for exploration
  • Experiment with encoder-decoder concepts
  • Try to build a network with an attention mechanism by hand (play around!)