WEEK 4: Word Embeddings & Attention

This week we learn how neural networks capture meaning. Monday we'll explore word embeddings and the distributional hypothesis, the key insight behind how LLMs represent language. Wednesday we'll see how attention solves the bottleneck problem in sequence models and sets the stage for transformers.

This week's checklist (due Friday 2/13)

  • Attend Lecture 5 (Mon, Feb 9): Word embeddings & sequence models
  • Attend Discussion Section (Tue, Feb 10): Exploring word vectors
  • Attend Lecture 6 (Wed, Feb 11): Attention mechanisms
  • Complete Week 4 Reflection and Lab 3, pushed to GitHub

This week's learning objectives

After Lecture 5 (Mon 2/9) students will be able to...

Word Embeddings:

  • Explain the distributional hypothesis: "you shall know a word by the company it keeps"
  • Describe how Word2Vec learns word vectors by predicting context
  • Use vector arithmetic to explore semantic relationships (king - man + woman = queen)
  • Recognize that modern LLMs use the same concept, just at scale

Sequence Models:

  • Understand the encoder-decoder framework for sequence-to-sequence tasks
  • Explain why RNNs struggled with long sequences (vanishing gradients)
  • Identify the bottleneck problem: compressing everything into one fixed vector
  • Discuss bias in word embeddings and its real-world consequences

After Lecture 6 (Wed 2/11) students will be able to...

Attention:

  • Explain how attention solves the bottleneck problem
  • Understand the Query, Key, Value framework using the library metaphor
  • Walk through scaled dot-product attention step by step
  • Describe why we scale by √d_k and apply softmax
  • Distinguish cross-attention (decoder attending to encoder) from self-attention (sequence attending to itself)
  • Explain multi-head attention: why multiple heads capture different relationships (syntax, semantics, position)
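The attention objectives above can be condensed into a few lines of code. The sketch below is a minimal NumPy implementation of scaled dot-product attention (not from any lecture notebook; the shapes and the random test matrices are illustrative): scores are query-key dot products, scaled by √d_k, passed through softmax so each query's weights sum to 1, then used to mix the values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how well each query matches each key
    weights = softmax(scores, axis=-1)   # each row is a distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))   # 2 queries, d_k = 4
K = rng.normal(size=(3, 4))   # 3 keys
V = rng.normal(size=(3, 5))   # 3 values, d_v = 5
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (2, 5) (2, 3)
```

Note that the output has one row per query: each output vector is a weighted average of the value rows, which is exactly how attention sidesteps the single-fixed-vector bottleneck.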

Discussion Section (Tue 2/10): Word Vectors & PyTorch Practice

Part 1: Exploring Word Vectors (~25 min)

  1. Load pre-trained word vectors (Word2Vec via gensim)
  2. Explore word similarity: find nearest neighbors for different words
  3. Try the famous analogies: king - man + woman = ?
  4. Investigate bias: profession + gender associations
  5. Visualize clusters in 2D (using t-SNE or PCA)
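If you want to see the analogy mechanics before section, here is a toy sketch of steps 2-3. The four hand-made 2-D vectors below stand in for real pre-trained Word2Vec embeddings (in section you'll load those via gensim), with the dimensions constructed to roughly encode royalty and gender; `nearest` mirrors what gensim's `most_similar` does with cosine similarity.

```python
import numpy as np

# Toy "embeddings": dimension 0 ~ royalty, dimension 1 ~ grammatical gender.
# Real Word2Vec vectors have 300 dense, uninterpretable dimensions.
vocab = {
    "king":  np.array([0.9,  0.9]),
    "queen": np.array([0.9, -0.9]),
    "man":   np.array([0.1,  0.9]),
    "woman": np.array([0.1, -0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def nearest(vec, exclude=()):
    # Nearest neighbor by cosine similarity, excluding the query words
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cosine(vec, vocab[w]))

target = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

With real embeddings the result is only approximately "queen" (it's the nearest neighbor, not an exact equality), which is worth keeping in mind when you probe bias in step 4.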

Part 2: Building a Text Pipeline in PyTorch (~25 min)

  1. Tokenize text using a BPE tokenizer (HuggingFace tokenizers library)
  2. Convert token IDs to embeddings (using nn.Embedding)
  3. Build a simple feed-forward classifier (embed the tokens, average the vectors, apply a linear layer, predict)
  4. Train on a small sentiment dataset and see the full pipeline
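The steps above can be sketched as follows. This is a minimal stand-in for the section notebook, not the official starter code: to keep it self-contained it swaps the BPE tokenizer for a tiny hand-made word-level vocabulary (in section you'll use the HuggingFace tokenizers library), and it shows one forward pass rather than the training loop. The model name and vocabulary are made up for illustration.

```python
import torch
import torch.nn as nn

# Toy word-level vocab standing in for a trained BPE tokenizer; 0 is padding.
vocab = {"<pad>": 0, "this": 1, "movie": 2, "was": 3, "great": 4, "awful": 5}

def encode(text):
    # Real tokenizers split into subwords; here we just split on whitespace.
    return torch.tensor([vocab[w] for w in text.lower().split()])

class AverageEmbeddingClassifier(nn.Module):
    """Token IDs -> nn.Embedding -> mean pool -> linear -> 2-class logits."""
    def __init__(self, vocab_size, embed_dim=8, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, ids):        # ids: (seq_len,)
        vecs = self.embed(ids)     # (seq_len, embed_dim)
        pooled = vecs.mean(dim=0)  # (embed_dim,) - averaging ignores word order
        return self.fc(pooled)     # (num_classes,) - raw logits

model = AverageEmbeddingClassifier(len(vocab))
logits = model(encode("this movie was great"))
print(logits.shape)  # torch.Size([2])
```

Averaging embeddings throws away word order entirely, which is precisely the limitation that motivates Wednesday's attention lecture.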

Week 4 Reflection Prompts

Write 300-500 words reflecting on this week's content. Pick one or two prompts that resonate, or go in your own direction:

  • The distributional hypothesis says meaning comes from context. Do you understand words that way? When you encounter a new word, how do you figure out what it means and how does that compare to what Word2Vec does?
  • Word embeddings encode "bank" as a single vector, but you effortlessly distinguish financial banks from riverbanks. What's your brain doing that Word2Vec can't? Does attention get closer to how you actually process language?
  • We saw that embeddings trained on human text absorb human biases. If a company ships a product built on biased embeddings, who bears responsibility - the researchers, the company, the training data creators, or someone else? What would you want done about it?
  • Now that you've seen embeddings, encoder-decoder models, and attention, are any project ideas starting to take shape for you? What problems or datasets interest you?
  • Is there a concept from this week that felt like it "clicked" or one that still feels fuzzy? What would help it land?

Remember to write in your own voice, without AI assistance. These reflections are graded on completion only and help me understand what's working for you.

Lab 3: Embeddings and Attention

Due: Friday, Feb 13 by 11:59pm

Choose your focus:

Option A: Word Embeddings Exploration

  • Load pre-trained embeddings (Word2Vec, GloVe, or fastText via gensim)
  • Find interesting analogies and relationships
  • Investigate bias: gender, profession, nationality associations
  • Compare: do different embedding models have different biases?
  • Visualize clusters of related words

Option B: Attention Implementation

  • Implement scaled dot-product attention from scratch in PyTorch
  • Test on simple sequences with small Q, K, V matrices
  • Visualize attention weights as heatmaps
  • Experiment: what happens with different d_k values? With multiple heads?
  • Try self-attention: feed the same sequence as Q, K, and V and see what patterns emerge
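For the d_k experiment in Option B, one concrete thing to measure is why the √d_k divisor exists at all. The sketch below (an illustration with random vectors, not a required implementation) shows that dot products of unit-variance vectors have variance that grows with d_k, so unscaled scores saturate the softmax into a near one-hot distribution; dividing by √d_k keeps the score variance near 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# q.k for unit-variance vectors has variance ~d_k, so raw scores blow up as
# dimension grows; scaling by sqrt(d_k) restores variance ~1 at every size.
for d_k in (4, 64, 1024):
    q = rng.normal(size=d_k)
    K = rng.normal(size=(8, d_k))     # 8 keys
    raw = K @ q
    scaled = raw / np.sqrt(d_k)
    print(f"d_k={d_k:5d}  raw score std={raw.std():6.2f}  "
          f"peak softmax raw={softmax(raw).max():.3f}  "
          f"scaled={softmax(scaled).max():.3f}")
```

A saturated softmax produces near-zero gradients for every non-peak position, which is why unscaled attention trains poorly at large d_k.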

Option C: Connect the Two

  • Start with word embeddings as your vectors
  • Apply self-attention to a sentence to produce contextualized representations
  • Visualize: which words attend to which? Does "it" attend to the noun it refers to?

Resources for further learning

On word embeddings

On attention

Videos

Papers (optional)

  • Efficient Estimation of Word Representations in Vector Space (Mikolov et al., 2013) - The Word2Vec paper
  • Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2014) - The attention breakthrough