Lecture 6 - Attention Mechanisms

Welcome back!

Last time: Encoder-decoder models and word embeddings - how we represent meaning and handle sequences

Today: The mechanism that revolutionized NLP - attention

Why this matters: Attention solves the bottleneck problem and enables transformers

Ice breaker

What do you see in this picture? Can you tell what's going on?

Visual saccades
The classic test

Agenda for today

Quick recap: the bottleneck problem
Attention intuition: Query, Key, Value
The math: scaled dot-product attention
Board work: computing attention step by step
Self-attention: a sequence attending to itself
Multi-head attention: multiple perspectives
Masked attention: padding and causal masks

Part 1: Recap - The Bottleneck Problem

Remember encoder-decoder models?

From Lecture 5:

Input sequence -> Encoder -> Fixed-size context vector -> Decoder -> Output sequence

Example task: Translate English to French

"The snow closed the campus" -> [encoder] -> c -> [decoder] -> "La neige a fermé le campus"

The bottleneck problem

Challenge: Compress entire input sequence into one fixed-size vector

Long inputs lose information:

Short sentence (5 words) -> c (256 dims) -> works ok

Long paragraph (100 words) -> c (256 dims) -> loses details!

It's like summarizing a novel in one sentence - you lose crucial details

Try it: 5-word summary

Pick your favorite book or movie. Summarize the entire story in exactly 5 words.

Share with your neighbor. Can they guess what it is?

Hard, right? That's the bottleneck problem. Now imagine compressing a 100-word paragraph into a 256-dimensional vector.

What if we could look back?

Intuition: When generating each output word, look at all the input words and focus on the most relevant ones

Example: Translating "I got cash from the bank on the way home"

When generating "banque" (bank), the model attends to both "bank" and "cash" - it needs the context to know this is a financial bank, not a riverbank

This is attention!

Attention: high-level idea

Instead of a single context vector, the decoder gets a dynamic context for each output

Each decoder step:

Look at all encoder hidden states (roughly, token embeddings)
Decide which ones are most relevant
Create a weighted combination
Use that as context for this step

Result: The model can focus on different parts of the input for different outputs

Part 2: Query, Key, Value - The Attention Intuition

Three roles in attention

Attention uses three different representations of the same data:

Query (Q): "What am I looking for?"

Key (K): "What do I contain?"

Value (V): "What do I actually output?"

Metaphor: Googling your symptoms

You wake up with a headache and blurry vision. Naturally, you do the responsible thing and consult Dr. Google.

Your search: "headache blurry vision" - This is Q

Page titles and descriptions: What each result claims to be about - These are Ks

The actual articles: The content you read when you click - These are Vs

Metaphor: Googling your symptoms

Type in your symptoms (Q)
Skim titles and descriptions for matches (compare Q to all Ks)
Click into the most relevant results and read them (retrieve their Vs)
Combine what you read into your (probably wrong) self-diagnosis

This is exactly how attention works!

Attention beyond translation

Translation is our running example, but attention is everywhere:

Document summarization: When generating each summary word, attend to the most relevant sentences in the source document

Image captioning: When generating "dog," attend to the dog region of the image; when generating "frisbee," shift attention to the frisbee

Question answering: Given a question about a passage, attend to the sentences most likely to contain the answer

The same Q, K, V mechanism works across all these tasks!

Q, K, V in the decoder attending to encoder

Example: Translating "The snow closed" -> "La neige a ___"

Decoder is generating the next French word

Query (Q): Current decoder state (Q = "what's the next word in my translation after 'La neige a'")

Keys (K): All encoder hidden states (titles/descriptions for "The", "snow", and "closed")

Values (V): The same encoder hidden states (full content of "The", "snow", and "closed")

Process:

Compare Q to all Ks -> get relevance scores
Use scores to weight the Vs
Output weighted combination of Vs

Why K and V are separate

Question: If K and V both come from encoder hidden states, why distinguish them?

Answer: We transform them differently!

In practice:

$K = W_{K} \times (\text{encoder hidden state})$ - optimized for matching

$V = W_{V} \times (\text{encoder hidden state})$ - optimized for content

W_K and W_V are learned projection matrices (sometimes called weight matrices)

Keys learn to be good for comparison (which inputs match this query?)

Values learn to be good for output (what information to pass forward?)
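A minimal sketch of the two projections, assuming a toy 3-dimensional hidden state and made-up weights. In a real model W_K and W_V are learned during training; every number below is hypothetical.

```python
# Toy illustration: one encoder hidden state projected two different ways.
# W_K and W_V hold made-up values; real models learn them during training.

def matvec(W, h):
    """Multiply matrix W (a list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

h = [1.0, 2.0, 0.5]          # encoder hidden state for one token (hypothetical)

W_K = [[1.0, 0.0, 0.0],      # projection used for matching against queries
       [0.0, 1.0, 0.0]]
W_V = [[0.5, 0.5, 0.0],      # projection used for the content passed forward
       [0.0, 0.0, 2.0]]

key = matvec(W_K, h)         # what the query is compared against
value = matvec(W_V, h)       # what gets mixed into the output

print(key)    # [1.0, 2.0]
print(value)  # [1.5, 1.0]
```

Same hidden state, two different vectors: one used for scoring, one used for output.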

Part 3: The Math - Scaled Dot-Product Attention

The attention formula

Given: Queries (Q), Keys (K), Values (V)

Compute:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$

Let's break this down step by step

Step 1: Compute similarity scores

$Q K^{T}$

What this does: Dot product between query and all keys

Intuition: "How well does my query match each key?"

Output: Similarity scores (higher = more relevant)

Dimensions:

$Q$ : $(1 \times d_{k})$ - one query

$K$ : $(n \times d_{k})$ - n keys (one per input token)

$Q K^{T}$ : $(1 \times n)$ - one score per input token
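The shapes above can be checked with a short sketch. The query and key values are hypothetical, and `similarity_scores` is an illustrative helper name, not a library function.

```python
def similarity_scores(q, K):
    """Return q . k for every row k of K (this is q K^T for a single query)."""
    return [sum(qi * ki for qi, ki in zip(q, k)) for k in K]

q = [1.0, 0.0, 2.0]            # one query, d_k = 3 (hypothetical values)
K = [[1.0, 1.0, 0.0],          # n = 4 keys, one per input token
     [0.0, 3.0, 1.0],
     [2.0, 0.0, 2.0],
     [0.0, 0.0, 0.0]]

scores = similarity_scores(q, K)
print(scores)       # [1.0, 2.0, 6.0, 0.0]
print(len(scores))  # 4, one score per input token
```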

Step 2: Scale by sqrt(d_k)

$\frac{QK^T}{\sqrt{d_k}}$

Why scale? Dot products get large when dimensionality ( $d_{k}$ ) is high

Problem with large scores: Softmax saturates (pushes probabilities toward 0 or 1)

Solution: Divide by $\sqrt{d_k}$ to keep scores in a reasonable range
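Why the square root? A short variance argument, as a sketch. It assumes (this is not stated on the slide) that the components of q and k are independent, with mean 0 and variance 1:

$$
q \cdot k = \sum_{i=1}^{d_k} q_i k_i, \qquad
\mathrm{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = d_k, \qquad
\mathrm{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
$$

Under that assumption the typical size of a raw score grows like $\sqrt{d_k}$, and dividing by $\sqrt{d_k}$ brings it back to unit scale regardless of dimension.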

Step 3: Softmax

$\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$

What softmax does: Converts scores to probabilities (sum to 1)

Input: Raw similarity scores [3.2, 1.1, 5.8]

Output: Attention weights [0.07, 0.01, 0.92]

Interpretation: "Focus 92% on token 3, 7% on token 1, 1% on token 2"
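A minimal softmax sketch in plain Python. Subtracting the maximum before exponentiating is a common numerical-stability trick; it does not change the result.

```python
import math

def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([1.0, 2.0, 3.0])
print([round(w, 2) for w in weights])  # [0.09, 0.24, 0.67]
```

Note how a gap of just 1 between scores already roughly triples the weight. Softmax exaggerates differences.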

Step 4: Weighted sum of values

$\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$

Finally: Multiply attention weights by values

This creates a weighted combination of the input values

Example:

Attention weights: [0.07, 0.01, 0.92]

Values: $v_{1}, v_{2}, v_{3}$

Output: $0.07 \times v_1 + 0.01 \times v_2 + 0.92 \times v_3$

The output focuses on the most relevant values!
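The weighted sum itself is only a few lines. A sketch with hypothetical weights and 2-dimensional values, chosen so the arithmetic is easy to check by eye:

```python
def weighted_sum(weights, values):
    """Combine value vectors, scaling each one by its attention weight."""
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

weights = [0.5, 0.0, 0.5]            # hypothetical attention weights (sum to 1)
values = [[2.0, 0.0],                # v1
          [9.0, 9.0],                # v2 gets weight 0, so it is ignored
          [0.0, 4.0]]                # v3

print(weighted_sum(weights, values))  # [1.0, 2.0]
```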

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$

1. Compute similarity: $Q \cdot K^T$
2. Scale: divide by $\sqrt{d_k}$
3. Normalize: softmax -> probabilities
4. Weighted sum: multiply by V

Result: Context vector that focuses on relevant input tokens
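Putting the four steps into one function, as a sketch in plain Python (real implementations use tensor libraries and batched matrix multiplies). The toy inputs are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors.

    Q: n_q x d_k, K: n x d_k, V: n x d_v. Returns n_q x d_v.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Steps 1 and 2: similarity with every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: normalize to attention weights
        weights = softmax(scores)
        # Step 4: weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Hypothetical toy input: two identical keys, so each gets weight 0.5
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[0.0, 2.0], [4.0, 0.0]]
print(attention(Q, K, V))  # [[2.0, 1.0]]
```

With identical keys the query cannot prefer either one, so the output is the plain average of the two values.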

Computational cost: $O(n^2)$

Important caveat: Attention compares every query to every key

For a sequence of length n:

$QK^T$ produces an $(n \times n)$ matrix
That's $n^2$ similarity calculations!

Implications:

Short sequences (100 tokens): 10,000 comparisons - fast
Long sequences (10,000 tokens): 100,000,000 comparisons - slow!

This is why: Long documents are challenging, and researchers work on "efficient attention" variants
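A back-of-the-envelope sketch of the quadratic growth. `attention_matrix_cost` is a hypothetical helper, and 4 bytes per score (32-bit floats) is an assumption for illustration.

```python
def attention_matrix_cost(n, bytes_per_score=4):
    """Return (number of query-key comparisons, bytes to store the n x n scores).

    bytes_per_score=4 assumes 32-bit floats; this is an illustrative assumption.
    """
    comparisons = n * n
    return comparisons, comparisons * bytes_per_score

for n in (100, 10_000):
    comparisons, size = attention_matrix_cost(n)
    print(f"n={n}: {comparisons:,} comparisons, {size / 1e6:,.2f} MB for one score matrix")
```

Making the sequence 100x longer makes the score matrix 10,000x bigger: under the 4-byte assumption, 10,000 tokens already means 400 MB for a single score matrix.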

Quick check: vibe-coding and context limits

How's the vibe-coding going? Have you encountered:

Your conversation gets long,
the model starts "forgetting" earlier context
and eventually you hit a structural limit on context length

We know forgetting happened with RNNs - why is it still happening with attention?

Quick check: Do you understand the formula?

Turn to your neighbor (2 min):

In your own words, explain what each step accomplishes:

QK^T - what does this compute?
Softmax - why do we need this?
Multiply by V - what's the result?

Part 4: "Board" (Screen) Work

Let's calculate attention by hand

Scenario: Translating "snow closed campus"

We have 3 input tokens (words), and we're generating an output

Simplified example with d_k = 4

(Real models use d_k = 64 or larger, but 4 is enough to see the pattern)

Step 1: Set up matrices

Query (what we're looking for):

Q = [1, 0, 1, 2]

Keys (what each input contains):

K = [[2, 1, 0, 1],   ← "snow"
     [0, 2, 1, 0],   ← "closed"
     [2, 0, 1, 2]]   ← "campus"

Values (what we output):

V = [[1, 0, 1, 2],   ← "snow"
     [0, 1, 2, 0],   ← "closed"
     [2, 1, 0, 1]]   ← "campus"

Step 2: Compute $QK^T$

$Q \cdot K^T$ means dot product of Q with each row of K

$Q = [1, 0, 1, 2]$

$Q \cdot [2, 1, 0, 1] = 1\times2 + 0\times1 + 1\times0 + 2\times1 = 4$ ← similarity with "snow"

$Q \cdot [0, 2, 1, 0] = 1\times0 + 0\times2 + 1\times1 + 2\times0 = 1$ ← similarity with "closed"

$Q \cdot [2, 0, 1, 2] = 1\times2 + 0\times0 + 1\times1 + 2\times2 = 7$ ← similarity with "campus"

Scores: [4, 1, 7]

Observation: "campus" has highest similarity to our query!

Step 3: Scale by $\sqrt{d_k}$

$d_k = 4$, so $\sqrt{d_k} = 2$

Scaled scores: [4/2, 1/2, 7/2] = [2, 0.5, 3.5]

Step 4: Apply softmax

Scaled scores: [2, 0.5, 3.5]

Softmax: Convert to probabilities (approximate!)

$\text{softmax}([2, 0.5, 3.5]) \approx$ [0.18, 0.04, 0.78]

Check: 0.18 + 0.04 + 0.78 = 1.0

Interpretation:

Focus 78% on "campus"
Focus 18% on "snow"
Focus 4% on "closed"

Step 5: Weighted sum of values

Attention weights: [0.18, 0.04, 0.78]

Values:

$V_1$ = [1, 0, 1, 2] ← "snow"
$V_2$ = [0, 1, 2, 0] ← "closed"
$V_3$ = [2, 1, 0, 1] ← "campus"

Output:

$= 0.18 \times [1, 0, 1, 2] + 0.04 \times [0, 1, 2, 0] + 0.78 \times [2, 1, 0, 1]$

$\approx [0.18, 0, 0.18, 0.36] + [0, 0.04, 0.08, 0] + [1.56, 0.78, 0, 0.78]$

$\approx [1.74, 0.82, 0.26, 1.14]$

This is our context vector - a weighted combination focused on "campus"
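The hand calculation can be double-checked in a few lines of plain Python, using the same Q, K, and V as the board work:

```python
import math

Q = [1, 0, 1, 2]
K = [[2, 1, 0, 1],   # "snow"
     [0, 2, 1, 0],   # "closed"
     [2, 0, 1, 2]]   # "campus"
V = [[1, 0, 1, 2],   # "snow"
     [0, 1, 2, 0],   # "closed"
     [2, 1, 0, 1]]   # "campus"

d_k = len(Q)

# Step 2: dot product of Q with each row of K
scores = [sum(q * k for q, k in zip(Q, row)) for row in K]

# Step 3: scale by sqrt(d_k)
scaled = [s / math.sqrt(d_k) for s in scores]

# Step 4: softmax
exps = [math.exp(s) for s in scaled]
weights = [e / sum(exps) for e in exps]

# Step 5: weighted sum of the value vectors
context = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]

print(scores)                            # [4, 1, 7]
print(scaled)                            # [2.0, 0.5, 3.5]
print([round(w, 3) for w in weights])    # [0.175, 0.039, 0.786]
print([round(c, 2) for c in context])    # [1.75, 0.82, 0.25, 1.14]
```

The small differences from the hand calculation (for example 1.75 versus 1.74) come from rounding the weights to two decimals before taking the weighted sum.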

What did we just do?

Started with: Query asking "what am I looking for?"

Compared to: Keys for each input token

Found: "campus" was most relevant (similarity = 7, then scaled to 3.5)

Retrieved: Weighted combination of values, focused 78% on "campus"

Result: A context vector that emphasizes "campus", the most relevant input

This is attention!

Attention Variants

Part 5: Self-Attention

From cross-attention to self-attention

So far, we've seen the decoder attending to the encoder (cross-attention).

But what if Q, K, and V all come from the same sequence?

Self-attention: Each word in a sentence attends to every word in that sentence (including itself)

Why? To build better representations by capturing relationships within the sequence

Self-attention in action

Input sentence: "The animal didn't cross the street because it was too tired"

Question: What does "it" refer to?

Self-attention for the word "it":

Query: "it" embedding
Keys/Values: All word embeddings in the sentence

Results:

High attention to "animal" (that's what "it" refers to!)
Low attention to "street"

The Great Jay Alammar

The process

For each word in the sequence:

Create Q, K, V from that word's embedding (using learned projection matrices W_Q, W_K, W_V)
Compare Q to all K's (including itself) -> attention weights
Weighted sum of all V's -> contextualized representation

Do this for ALL words simultaneously! (this is why transformers are parallelizable, unlike RNNs)

Result: Every word gets a new representation that incorporates information from the whole sequence
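The process above, sketched for all words at once in plain Python. The embeddings are hypothetical, and identity matrices stand in for the learned W_Q, W_K, W_V to keep the example small.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_Q, W_K, W_V):
    """X holds one embedding per row; Q, K and V are all projections of X."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]   # one row of weights per word
    return matmul(weights, V), weights

# Hypothetical toy data: 3 words, embedding size 2, identity projections
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = self_attention(X, identity, identity, identity)

print(len(output), len(output[0]))    # 3 2  (one new vector per word)
print(len(weights), len(weights[0]))  # 3 3  (every word scores every word)
```

Nothing in `self_attention` loops over positions in order, which is the property that makes the computation parallelizable.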

The Great Jay Alammar II

Cross-attention vs self-attention

	Cross-attention	Self-attention
Q comes from	Decoder	Same sequence
K, V come from	Encoder	Same sequence
Purpose	"What input is relevant to what I'm generating?"	"How do words in this sequence relate to each other?"
Formula	$\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$	$\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$

The math is identical. Only the source of Q, K, V changes.

Live demo: BertViz

We calculated attention by hand earlier; now let's see what it actually looks like in a real model.

Demo: scripts/bertviz_demo.ipynb

Was this insightful at all? You might take sides between the papers:

"Attention is not explanation" (Jain and Wallace, 2019)
"Attention is not not explanation" (Wiegreffe and Pinter, 2019)

Part 6: Multi-Head Attention

One head isn't enough

In "The snow closed the campus":

Syntactic: "snow" is the subject of "closed"
Semantic: "snow" and "campus" (weather event affecting a place)
Positional: "snow" is near "The"

Problem: A single attention mechanism tries to capture all these relationships at once

Solution: Run multiple attention "heads" in parallel - each one learns to focus on different things

Multi-head attention: The idea

Instead of one set of Q, K, V:

Run h different attention mechanisms in parallel (typically h = 8 or 16)

Each head:

Has its own W_Q, W_K, W_V projection matrices
Learns to focus on different aspects
Produces its own output

Finally: Concatenate all head outputs and project

Multi-head attention formula

For each head i:

$\text{head}_{i} = \text{Attention}(X W_{Q}^{i}, X W_{K}^{i}, X W_{V}^{i})$

or we can write

$\text{head}_{i} = \text{Attention}(Q^{i}, K^{i}, V^{i})$

Concatenate all heads:

$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_{1}, \dots, \text{head}_{h}) W_{O}$

$W_O$: Output projection matrix (learned)

Typical setup: 8 heads, each with $d_k = d_v = 64$, total model dimension = 512
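A small sketch of the formula in plain Python, assuming 2 heads and model dimension 4 instead of 8 and 512. All matrices are hypothetical: each head's projection simply reads half of the input dimensions, and W_O is the identity. A trained model would learn all of these.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    return matmul([softmax(row) for row in scores], V)

def multi_head(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
               for W_Q, W_K, W_V in heads]
    # Concatenate the head outputs along the feature dimension, then project
    concat = [sum((head[i] for head in outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

# Hypothetical toy setup: 2 heads, model dimension 4, so each head gets 4 / 2 = 2
X = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]   # reads dims 0-1
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # reads dims 2-3
heads = [(first_half, first_half, first_half),
         (second_half, second_half, second_half)]
W_O = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity

Y = multi_head(X, heads, W_O)
print(len(Y), len(Y[0]))  # 3 4  (same shape as X)
```

The same arithmetic explains the typical setup: 8 heads x 64 dimensions per head = 512, so the concatenated output matches the model dimension.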

The Great Jay Alammar III

If you had 8 attention heads in this class...

What would each one attend to?

Head 1: The slides
Head 2: What the professor is saying
Head 3: Whether it's almost 1:35
Head 4: ?
Head 5: ?
Head 6: ?
Head 7: ?
Head 8: ?

The point: Each head specializes. No single head can capture everything; that's why we need multiple.

Stepping back

You now understand the core mechanism behind every modern LLM.

The attention formula (cross-attention, self-attention, same math) is what powers ChatGPT, Claude, BERT, and every transformer.

Multi-head attention just runs it multiple times in parallel for richer representations.

Question if we have time: How similar is this to how our brains work?

Part 7: Masked Attention

Masking Demystified

Why do we need masking? Two reasons:

Padding: Batches have different sequence lengths
Causal attention: Decoders can't look at future tokens

Padding mask

Problem: Batching sequences of different lengths

Batch:
  Sentence 1: "The cat sat on the mat"  (6 tokens)
  Sentence 2: "I love NLP"              (3 tokens)

Solution: Pad shorter sequence
  Sentence 1: [The, cat, sat, on, the, mat]
  Sentence 2: [I, love, NLP, PAD, PAD, PAD]

But we don't want attention to [PAD] tokens!

Padding mask: how it works

Create mask: 1 = real token, 0 = padding

Sentence 2: [I,  love,  NLP,  PAD,  PAD,  PAD]
Mask:       [1,   1,    1,    0,    0,    0  ]

During attention: Set masked positions to -∞

Before mask: QK^T = [2.1, 1.5, 3.2, 0.8, 0.5, 0.7]
After mask:        [2.1, 1.5, 3.2, -∞,  -∞,  -∞ ]
After softmax:     [0.22, 0.12, 0.66, 0.0, 0.0, 0.0]

Result: Padding gets zero attention weight.
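A sketch of the padding mask in plain Python, using the scores from the example. `masked_softmax` is a hypothetical helper name, and the sketch assumes each sequence has at least one real token.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over scores, ignoring positions where mask is 0.

    Assumes at least one position is unmasked.
    """
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.1, 1.5, 3.2, 0.8, 0.5, 0.7]   # I, love, NLP, PAD, PAD, PAD
mask = [1, 1, 1, 0, 0, 0]                 # 1 = real token, 0 = padding

weights = masked_softmax(scores, mask)
print(weights[3:])  # [0.0, 0.0, 0.0]  padding gets exactly zero weight
```

Setting the score to -∞ before softmax, rather than zeroing the weight afterwards, keeps the remaining weights summing to 1.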

Causal mask (for decoders/generation)

Problem: During training, the decoder receives the whole target sequence at once, so without a mask it could peek at future tokens

Solution: Lower triangular mask - each position attends only to itself and earlier positions

        pos 0  pos 1  pos 2  pos 3
pos 0   [  1     0      0      0   ]   "The"
pos 1   [  1     1      0      0   ]   "cat"
pos 2   [  1     1      1      0   ]   "sat"
pos 3   [  1     1      1      1   ]   "on"

Why? When generating "cat", the model has only seen "The". The mask enforces this at training time too.
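A sketch of building and applying the causal mask in plain Python. The all-zero scores are hypothetical, chosen so the effect of the mask is easy to see.

```python
import math

def causal_mask(n):
    """n x n lower-triangular mask: row i may attend to columns 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def causal_weights(scores):
    """Row-wise softmax over a square score matrix with future positions masked out."""
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        row = [s if mask[i][j] else float("-inf") for j, s in enumerate(scores[i])]
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]

# With all-zero scores, each row spreads its weight evenly over the visible positions
weights = causal_weights([[0.0] * 4 for _ in range(4)])
print(weights[1])  # [0.5, 0.5, 0.0, 0.0]
```

Every row keeps its diagonal entry, so no row is ever fully masked.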

Extra Discussion: moltbook

In the last few minutes... what do you think?

Project idea: scraping/analyzing this or writing your own bot to join them?

What we learned today

Attention solves the bottleneck problem - dynamic context instead of one fixed vector

Q, K, V framework: Query what you want, match against Keys, retrieve Values

Self-attention: The same mechanism, but a sequence attends to itself

Multi-head attention: Multiple perspectives in parallel

Masked attention: Two flavors - padding masks (ignore [PAD] tokens) and causal masks (can't peek at future)

Next time: The full transformer architecture

Highly recommended reading: The Illustrated Transformer by Jay Alammar

Lab reminder: Lab/reflections for week 4 due Friday

Tuesday: Positional encoding + encoder/decoder blocks + the complete picture

Lauren's CDS593 Materials