WEEK 4: Word Embeddings & Attention

This week we learn how neural networks capture meaning. Monday we'll explore word embeddings and the distributional hypothesis, the key insight behind how LLMs represent language. Wednesday we'll see how attention solves the bottleneck problem in sequence models and sets the stage for transformers.

This week's checklist (due Friday 2/13)

  • Attend Lecture 5 (Mon, Feb 9): Word embeddings & sequence models
  • Attend Discussion Section (Tue, Feb 10): Exploring word vectors
  • Attend Lecture 6 (Wed, Feb 11): Attention mechanisms
  • Complete Week 4 Reflection and Lab 3, pushed to GitHub

This week's learning objectives

After Lecture 5 (Mon 2/9) students will be able to...

Word Embeddings:

  • Explain the distributional hypothesis: "you shall know a word by the company it keeps"
  • Describe how Word2Vec learns word vectors by predicting context
  • Use vector arithmetic to explore semantic relationships (king - man + woman = queen)
  • Recognize that modern LLMs use the same concept, just at scale

Sequence Models:

  • Understand the encoder-decoder framework for sequence-to-sequence tasks
  • Explain why RNNs struggled with long sequences (vanishing gradients)
  • Identify the bottleneck problem: compressing everything into one fixed vector
  • Discuss bias in word embeddings and its real-world consequences

After Lecture 6 (Wed 2/11) students will be able to...

Attention:

  • Explain how attention solves the bottleneck problem
  • Understand the Query, Key, Value framework using the library metaphor
  • Walk through scaled dot-product attention step by step
  • Describe why we scale by √d_k and apply softmax
  • Distinguish cross-attention (decoder attending to encoder) from self-attention (sequence attending to itself)
  • Explain multi-head attention: why multiple heads capture different relationships (syntax, semantics, position)
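The attention objectives above can be condensed into a few lines of code. The sketch below is a minimal NumPy implementation of scaled dot-product attention (not from any lecture notebook; the shapes and the random test matrices are illustrative): scores are query-key dot products, scaled by √d_k, passed through softmax so each query's weights sum to 1, then used to mix the values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how well each query matches each key
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries, d_k = 4
K = rng.normal(size=(3, 4))   # 3 keys
V = rng.normal(size=(3, 5))   # 3 values, d_v = 5
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (2, 5) (2, 3)
```

Note that the output has one row per query: each output vector is a weighted average of the value rows, which is exactly how attention sidesteps the single-fixed-vector bottleneck.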

Discussion Section (Tue 2/10): Word Vectors & PyTorch Practice

Part 1: Exploring Word Vectors (~25 min)

  1. Load pre-trained word vectors (Word2Vec via gensim)
  2. Explore word similarity: find nearest neighbors for different words
  3. Try the famous analogies: king - man + woman = ?
  4. Investigate bias: profession + gender associations
  5. Visualize clusters in 2D (using t-SNE or PCA)
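If you want to see the analogy mechanics before section, here is a toy sketch of steps 2-3. The four hand-made 2-D vectors below stand in for real pre-trained Word2Vec embeddings (in section you'll load those via gensim), with the dimensions constructed to roughly encode royalty and gender; `nearest` mirrors what gensim's `most_similar` does with cosine similarity.

```python
import numpy as np

# Toy "embeddings": dimension 0 ~ royalty, dimension 1 ~ grammatical gender.
# Real Word2Vec vectors have 300 dense, uninterpretable dimensions.
vocab = {
    "king":  np.array([0.9,  0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(vec, exclude=()):
    # Nearest neighbor by cosine similarity, excluding the query words
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cosine(vec, vocab[w]))

target = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

With real embeddings the result is only approximately "queen" (it's the nearest neighbor, not an exact equality), which is worth keeping in mind when you probe bias in step 4.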

Part 2: Building a Text Pipeline in PyTorch (~25 min)

  1. Tokenize text using a BPE tokenizer (HuggingFace tokenizers library)
  2. Convert token IDs to embeddings (using nn.Embedding)
  3. Build a simple feed-forward classifier (embed the tokens, average the vectors, apply a linear layer, predict)
  4. Train on a small sentiment dataset and see the full pipeline
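The steps above can be sketched as follows. This is a minimal stand-in for the section notebook, not the official starter code: to keep it self-contained it swaps the BPE tokenizer for a tiny hand-made word-level vocabulary (in section you'll use the HuggingFace tokenizers library), and it shows one forward pass rather than the training loop. The model name and vocabulary are made up for illustration.

```python
import torch
import torch.nn as nn

# Toy word-level vocab standing in for a trained BPE tokenizer; 0 is padding.
vocab = {"<pad>": 0, "this": 1, "movie": 2, "was": 3, "great": 4, "awful": 5}

def encode(text):
    # Real tokenizers split into subwords; here we just split on whitespace.
    return torch.tensor([vocab[w] for w in text.lower().split()])

class AverageEmbeddingClassifier(nn.Module):
    """Token IDs -> nn.Embedding -> mean pool -> linear -> 2-class logits."""
    def __init__(self, vocab_size, embed_dim=8, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, ids):        # ids: (seq_len,)
        vecs = self.embed(ids)     # (seq_len, embed_dim)
        pooled = vecs.mean(dim=0)  # (embed_dim,) - averaging ignores word order
        return self.fc(pooled)     # (num_classes,) - raw logits

model = AverageEmbeddingClassifier(len(vocab))
logits = model(encode("this movie was great"))
print(logits.shape)  # torch.Size([2])
```

Averaging embeddings throws away word order entirely, which is precisely the limitation that motivates Wednesday's attention lecture.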

Week 4 Reflection Prompts

Write 300-500 words reflecting on this week's content. Pick one or two prompts that resonate, or go in your own direction:

  • The distributional hypothesis says meaning comes from context. Do you understand words that way? When you encounter a new word, how do you figure out what it means and how does that compare to what Word2Vec does?
  • Word embeddings encode "bank" as a single vector, but you effortlessly distinguish financial banks from riverbanks. What's your brain doing that Word2Vec can't? Does attention get closer to how you actually process language?
  • We saw that embeddings trained on human text absorb human biases. If a company ships a product built on biased embeddings, who bears responsibility - the researchers, the company, the training data creators, or someone else? What would you want done about it?
  • Now that you've seen embeddings, encoder-decoder models, and attention, are any project ideas starting to take shape for you? What problems or datasets interest you?
  • Is there a concept from this week that felt like it "clicked" or one that still feels fuzzy? What would help it land?

Remember to write in your own voice, without AI assistance. These reflections are graded on completion only and help me understand what's working for you.

Lab 3: Embeddings and Attention

Due: Friday, Feb 13 by 11:59pm

Choose your focus:

Option A: Word Embeddings Exploration

  • Load pre-trained embeddings (Word2Vec, GloVe, or fastText via gensim)
  • Find interesting analogies and relationships
  • Investigate bias: gender, profession, nationality associations
  • Compare: do different embedding models have different biases?
  • Visualize clusters of related words

Option B: Attention Implementation

  • Implement scaled dot-product attention from scratch in PyTorch
  • Test on simple sequences with small Q, K, V matrices
  • Visualize attention weights as heatmaps
  • Experiment: what happens with different d_k values? With multiple heads?
  • Try self-attention: feed the same sequence as Q, K, and V and see what patterns emerge
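For the d_k experiment in Option B, one concrete thing to measure is why the √d_k divisor exists at all. The sketch below (an illustration with random vectors, not a required implementation) shows that dot products of unit-variance vectors have variance that grows with d_k, so unscaled scores saturate the softmax into a near one-hot distribution; dividing by √d_k keeps the score variance near 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# q.k for unit-variance vectors has variance ~d_k, so raw scores blow up as
# dimension grows; scaling by sqrt(d_k) restores variance ~1 at every size.
for d_k in (4, 64, 1024):
    q = rng.normal(size=d_k)
    K = rng.normal(size=(8, d_k))     # 8 keys
    raw = K @ q
    scaled = raw / np.sqrt(d_k)
    print(f"d_k={d_k:5d}  raw score std={raw.std():6.2f}  "
          f"peak softmax raw={softmax(raw).max():.3f}  "
          f"scaled={softmax(scaled).max():.3f}")
```

A saturated softmax produces near-zero gradients for every non-peak position, which is why unscaled attention trains poorly at large d_k.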

Option C: Connect the Two

  • Start with word embeddings as your vectors
  • Apply self-attention to a sentence to produce contextualized representations
  • Visualize: which words attend to which? Does "it" attend to the noun it refers to?

Resources for further learning

On word embeddings

On attention

Videos

Papers (optional)

  • Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013) - The Word2Vec paper
  • Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2014) - The attention breakthrough