Lecture 7 - Transformer Architecture

["snow", "melts"], one-hot encoded, gives us a $2 \times 50,000$ matrix

Step 2: Embeddings - Multiply by embedding matrix $M$ $(50,000 \times 512) \to 2 \times 512$

Step 3: Add positional encoding - Added, not concatenated (same matrix size)

Result: Matrix $X$ $(2 \times 512)$
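
To make these shapes concrete, here is a minimal sketch in plain Python. The sizes are shrunk (50 and 8 instead of 50,000 and 512) and the token ids 7 and 23 are hypothetical; the only point is that multiplying a one-hot matrix by $M$ is the same as looking up rows of $M$.

```python
import random

random.seed(0)
vocab_size, d_model = 50, 8      # stand-ins for 50,000 and 512 (assumed, for speed)
token_ids = [7, 23]              # hypothetical ids for "snow" and "melts"

# Embedding matrix M: (vocab_size x d_model), random here, learned in a real model
M = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab_size)]

# One-hot matrix: (2 x vocab_size), a single 1 per row
one_hot = [[1.0 if j == t else 0.0 for j in range(vocab_size)] for t in token_ids]

def matmul(A, B):
    """Matrix product of A (n x k) and B (k x m) using plain lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X = matmul(one_hot, M)                  # (2 x d_model)
X_lookup = [M[t] for t in token_ids]    # same rows, obtained by direct lookup

print(len(X), len(X[0]))                # 2 rows, d_model columns
```

In practice the lookup form is what gets used, since the one-hot product wastes almost all of its multiplications on zeros.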

From embeddings to Q, K, V

(Assuming self-attention)

Three learned projection matrices: $W_{Q}$, $W_{K}$, $W_{V}$ (each $512 \times 512$)

$Q = X \times W_{Q}$

Project embedding into query space

$K = X \times W_{K}$

Project embedding into key space (for matching)

$V = X \times W_{V}$

Project embedding into value space (for content)

Projection matrices are learned during training

Now we can use the attention formula

Once we have Q, K, V:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V$
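
A minimal sketch of the projections plus scaled dot-product attention, in plain Python. The toy input (2 tokens, 4 dimensions) is an assumption, and identity matrices stand in for the learned $W_Q$, $W_K$, $W_V$.

```python
import math

def matmul(A, B):
    """Matrix product of A (n x k) and B (k x m) using plain lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X, W_Q, W_K, W_V):
    """softmax(Q K^T / sqrt(d_k)) V; returns the output and the attention weights."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)                                  # (seq_len x seq_len)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights                       # (seq_len x d_k)

# Toy input (assumed): 2 tokens, d_model = 4
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0]]
I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # placeholder weights
out, weights = attention(X, I, I, I)
```

Each row of `weights` sums to 1, and `out` has one row per token, the same number of rows as `X`.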

Let's draw it out

You try first:

Sketch the flow for your own 2-word sentence:

Start with text
Tokenization
Embedding matrices and embeddings
$W_{Q}, W_{K}, W_{V}$ and $Q, K, V$
Attention formula and final output

What are the matrix dimensions at each step?

Then we'll draw on the board together

Quick reminder: Multi-head attention mechanics

Inside "Multi-Head Attention":

Split into h heads (typically 8)
Each head runs attention independently with own $W_{Q}, W_{K}, W_{V}$ projection matrices
Concatenate all head outputs
Project with output projection matrix $W_{O}$

Result: Each head focuses on different aspects (syntax, semantics, position)

Output dimension: Still $d_{model}$ (512), same as input

Dimension notation: $d_{model}$ vs $d_{k}$

Important terminology clarification:

$d_{model}$ = full model dimension (typically 512)

Size of token embeddings
Input/output size of each transformer layer
Also called $d_{emb}$ or embedding dimension

$d_{k}$ = dimension per attention head (typically 64)

With 8 heads and $d_{model} = 512$: each head gets $d_{k} = 512/8 = 64$
Appears in the scaling factor: $\sqrt{d_{k}}$ in the attention formula

Relationship: $d_{k} = d_{model} / h$ where $h$ = number of heads
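
A shape-only sketch of the split and concatenate steps, in plain Python. It assumes the projected matrix is sliced into $h$ column blocks, which gives the same shapes as each head having its own $d_{model} \times d_{k}$ projection. The per-head attention and the $W_O$ projection are left out, and the helper names are made up for illustration.

```python
d_model, h = 512, 8
d_k = d_model // h                       # 512 / 8 = 64

def split_heads(X, h):
    """Split (seq_len x d_model) into h matrices of shape (seq_len x d_k)."""
    d_k = len(X[0]) // h
    return [[row[i * d_k:(i + 1) * d_k] for row in X] for i in range(h)]

def concat_heads(heads):
    """Concatenate h matrices of shape (seq_len x d_k) back into (seq_len x d_model)."""
    seq_len = len(heads[0])
    return [sum((head[t] for head in heads), []) for t in range(seq_len)]

# Placeholder matrix for a 2-token sequence
X = [[float(t * d_model + j) for j in range(d_model)] for t in range(2)]

heads = split_heads(X, h)                # 8 heads, each (2 x 64)
restored = concat_heads(heads)           # back to (2 x 512)
print(len(heads), len(heads[0]), len(heads[0][0]), len(restored[0]))
```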

The building blocks for a complete transformer

Self-attention: Each position attends to all positions
Multi-head attention: Multiple attention mechanisms in parallel

New today:

Positional encoding: Add position information
Feed-forward networks: Process each position independently
Layer normalization + residual connections: Stabilize training

Next: Understand the new pieces, then assemble

Part 2: Building Blocks

Positional Encoding: The order problem

Problem: Attention doesn't perceive sequence order

Without position information, "The cat sat on the mat" and "mat the on sat cat The" give each word the same representation, just in a different order

Why? Attention just looks at relationships, not order
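
A small check of this claim in plain Python, under simplifying assumptions: toy 2-dimensional embeddings, identity projections (Q = K = V = X), and no positional encoding. Shuffling the input rows only shuffles the output rows.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Self-attention with Q = K = V = X (learned projections omitted)."""
    d_k = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, X)) for j in range(d_k)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # toy embeddings for three tokens
perm = [2, 0, 1]                             # a reordering of the tokens
X_shuffled = [X[i] for i in perm]

out = self_attention(X)
out_shuffled = self_attention(X_shuffled)

# Row i of the shuffled output equals row perm[i] of the original output.
same_up_to_order = all(
    abs(a - b) < 1e-9
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(same_up_to_order)
```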

Solution: Positional encoding

Idea: Add positional information to embeddings

Before: X = [embedding for "cat", embedding for "sat", ...]

After: X = [embedding + position 0, embedding + position 1, ...]

Result: Model knows "cat" at position 0, "sat" at position 1

How to encode position?

Option 1: Learned position embeddings (used by many later models, e.g., BERT and GPT-2)

Option 2: Fixed sinusoidal functions (original paper)

Sinusoidal positional encodings

FYI / you're not responsible for these formulas:

$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$
$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$

Intuition: Different frequencies create unique "fingerprints" for each position

Why this works: Model can learn absolute and relative positions
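
A minimal sketch of the sinusoidal formulas in plain Python. The sizes (4 positions, 8 dimensions) and the placeholder embeddings are assumptions for illustration.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encodings: sin on even columns, cos on odd columns."""
    PE = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for col in range(0, d_model, 2):           # col plays the role of 2i
            angle = pos / (10000 ** (col / d_model))
            PE[pos][col] = math.sin(angle)
            if col + 1 < d_model:
                PE[pos][col + 1] = math.cos(angle)
    return PE

PE = positional_encoding(seq_len=4, d_model=8)

# Adding (not concatenating) keeps the shape at (seq_len x d_model).
embeddings = [[0.1] * 8 for _ in range(4)]         # placeholder token embeddings
model_input = [[e + p for e, p in zip(e_row, p_row)]
               for e_row, p_row in zip(embeddings, PE)]
```

Position 0 always encodes to alternating 0s and 1s (sin 0 and cos 0), and every later position gets a different row.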

Embeddings + positional encoding

Token embeddings: $(\text{seq\_len} \times d_{model})$
Positional encodings: $(\text{seq\_len} \times d_{model})$
Add them: input = embeddings + positional encodings
Pass to rest of model

Result: Each token embedding has WHAT it is (word) and WHERE it is (position)

Positional encoding added at input to BOTH encoder and decoder

Residual connections

Problem: Deep networks hard to train (vanishing gradients)

Solution: Add input back to output

Instead of: output = Layer(input)

We do: output = input + Layer(input)

input ───┬───> [Layer] ───> (+) ───> output
         │                   ↑
         └───────────────────┘
          (residual / skip connection)

Why this helps: Model can ignore unhelpful layers (set contribution ≈ 0)

Also helps gradients flow backward during training

In transformers: EVERY sublayer (attention, FFN) has residual connection

Layer normalization

After each sublayer:

Rescale each token's vector to mean = 0, variance = 1 (computed across its features)
Stabilizes training (prevents values getting too large/small)

In transformers: Layer norm happens AFTER residual connection

Full pattern: output = LayerNorm(input + Sublayer(input))
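
A minimal sketch of this "Add & Norm" pattern for one token vector, in plain Python. The learned scale and shift are omitted, and `toy_sublayer` is a hypothetical stand-in for attention or the FFN.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token's vector to mean 0 and variance 1 (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """output = LayerNorm(x + Sublayer(x))"""
    residual_sum = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(residual_sum)

toy_sublayer = lambda v: [0.5 * a for a in v]      # hypothetical sublayer
y = add_and_norm([1.0, 2.0, 3.0, 4.0], toy_sublayer)

mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
print(round(mean_y, 6), round(var_y, 3))
```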

Feed-forward network (FFN)

After attention, EACH POSITION goes through small neural network:

$FFN (x) = max (0, x W_{1} + b_{1}) W_{2} + b_{2}$

Structure:

Input: $d_{model}$ (e.g., 512)
Hidden layer: $d_{ff}$ (e.g., 2048) - much wider!
Output: $d_{model}$ (e.g., 512)
Activation: ReLU (the max(0, ...))

Key: Applied to each position INDEPENDENTLY. Same FFN weights shared across all positions, different inputs per position

The FFN is just a 2-layer neural network (also called a multi-layer perceptron or MLP)

Pattern: Attention mixes info ACROSS positions, FFN processes each position individually (adds capacity and non-linearity)

FFN much wider than model dimension (This is where many parameters live)
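
A minimal sketch of the position-wise FFN in plain Python. The toy sizes ($d_{model} = 2$, $d_{ff} = 4$) and all weight values are made up; the same weights are applied to each position separately.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position x."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]              # ReLU, width d_ff
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]                # back to width d_model

W1 = [[1.0, -1.0, 0.5, 0.0],
      [0.0, 1.0, -0.5, 2.0]]              # (d_model x d_ff)
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0],
      [0.5, -0.5]]                        # (d_ff x d_model)
b2 = [0.0, 0.0]

X = [[1.0, 2.0],
     [3.0, -1.0]]                         # two positions
out = [ffn(x, W1, b1, W2, b2) for x in X] # same weights, one position at a time
print(out)                                # [[3.0, -1.0], [5.0, 2.0]]
```

Changing the second row of `X` leaves the first output row untouched, which is what "applied to each position independently" means.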

Quick break: What surprises you?

Turn to your neighbor (2 min):

You've now seen all the building blocks: attention, positional encoding, residual connections, layer norm, FFN.

What surprised you?
What seems clever?
What seems redundant or over-engineered?

Share with class: Any "aha" moments or lingering confusion?

Part 3: Full Transformer Architecture

The complete picture

Original transformer: Encoder-Decoder architecture for translation

Full diagram first, then build up piece by piece:

[Figure: full Transformer architecture diagram, from "Attention Is All You Need"]

Encoder block components

Each encoder block has TWO sublayers:

Multi-head self-attention
- Input sequence attends to itself
- Each position can see all positions
Feed-forward network (FFN)
- FFN per position independently
- Typically: 512 to 2048 to 512

Both sublayers have:

Residual connection (add input to output)
Layer normalization

What is "encoder output"?

After 6 stacked blocks: matrix $(\text{seq\_len} \times d_{model})$
Each row = processed embedding of one input token
Entire matrix feeds into decoder's cross-attention (used as K and V)
Encoder runs ONCE, output reused at every decoder step
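
A sketch of how the encoder output is consumed by cross-attention, in plain Python. Queries come from the decoder, keys and values from the encoder output. The toy sizes (2 source tokens, 3 decoder positions, 4 dimensions) are assumptions, and the learned projections are omitted.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_output):
    """Q from the decoder; K and V from the encoder output (projections omitted)."""
    d_k = len(encoder_output[0])
    outputs, all_weights = [], []
    for q in decoder_states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in encoder_output]
        w = softmax(scores)                          # one weight per SOURCE token
        all_weights.append(w)
        outputs.append([sum(wi * v[j] for wi, v in zip(w, encoder_output))
                        for j in range(d_k)])
    return outputs, all_weights

E = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]                           # encoder output: 2 source tokens
D = [[0.5, 0.5, 0.0, 0.0],
     [2.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0]]                           # 3 decoder positions
out, weights = cross_attention(D, E)
print(len(weights), len(weights[0]))                 # target length x source length
```

The weight matrix is (target length × source length), so the same `E` can be reused however long the decoder sequence grows.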

Decoder block components

Each decoder block has THREE sublayers:

Masked multi-head self-attention
- Output tokens attend to previous tokens only
- Can't see future (prevents cheating!)
Multi-head cross-attention - Connection to encoder!
- Decoder attends to encoder output
- Q from previous layer (masked self-attention output)
- K and V from encoder output (processed input)
Feed-forward network (FFN)
- Same as encoder

All three sublayers: Residual connections + layer norm

Why masked? During generation we don't know future tokens yet!
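
A minimal sketch of the causal mask in plain Python. Future positions get a score of negative infinity before the softmax, so their attention weight becomes exactly 0. The all-zero toy scores are an assumption chosen to make the pattern easy to read.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(n):
    """0 where position i may look at position j (j <= i), -inf for future positions."""
    return [[0.0 if j <= i else float("-inf") for j in range(n)] for i in range(n)]

n = 3
scores = [[0.0] * n for _ in range(n)]       # toy scores: every pair equally similar
mask = causal_mask(n)
masked = [[s + m for s, m in zip(s_row, m_row)]
          for s_row, m_row in zip(scores, mask)]
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 2) for w in row])
```

The first token can only attend to itself, the second splits its attention over two tokens, and the third over all three.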

Encoder vs Decoder: Key differences

Similar building blocks, important differences:

Component	Encoder	Decoder
Input	Entire source sequence	Output tokens generated so far
Self-attention	Can see all positions	Masked (can't see future)
Cross-attention	None	Attends to encoder output
Sublayers per block	2 (self-attn + FFN)	3 (masked self-attn + cross-attn + FFN)
Purpose	Build rich representation	Generate output one token at a time

Both: 6 stacked blocks, residual connections, layer norm

Learned vs computed parameters

Important distinction:

Learned during training (model parameters):

$W_{Q}, W_{K}, W_{V}$ projection matrices (in each attention layer)
$W_{O}$ output projection matrix (in multi-head attention)
FFN weights ( $W_{1}, W_{2}, b_{1}, b_{2}$ )
Layer norm parameters (scale and shift)
Embedding matrices

Computed during forward pass:

Q, K, V matrices (from $X \times W_{Q}$ , $X \times W_{K}$ , $X \times W_{V}$ )
Attention weights (softmax of $Q K^{T} / \sqrt{d_{k}}$)
Attention output (weighted sum of V)
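
To put rough numbers on the learned parameters, here is a back-of-the-envelope count for ONE encoder block with $d_{model} = 512$ and $d_{ff} = 2048$. It assumes the attention projections have no bias terms and that the head projections together form $d_{model} \times d_{model}$ matrices; embedding matrices are not included.

```python
d_model, d_ff = 512, 2048

# W_Q, W_K, W_V, W_O: four (d_model x d_model) matrices (biases omitted, an assumption)
attention_params = 4 * d_model * d_model

# W_1 (d_model x d_ff), b_1 (d_ff), W_2 (d_ff x d_model), b_2 (d_model)
ffn_params = d_model * d_ff + d_ff + d_ff * d_model + d_model

# Scale and shift vectors for each of the two layer norms in the block
layer_norm_params = 2 * (2 * d_model)

block_total = attention_params + ffn_params + layer_norm_params
print(attention_params, ffn_params, layer_norm_params, block_total)
# 1048576 2099712 2048 3150336
```

Under these assumptions the FFN holds about twice as many parameters as the attention sublayer.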

From decoder to predictions

After 6 decoder blocks, how do we get next token?

Step 1: Decoder output

After all 6 blocks: matrix $(\text{seq\_len} \times d_{model})$
Still in embedding space (512 dimensions)

Step 2: Linear projection

Learned weight matrix: $(d_{model} \times \text{vocab\_size})$
Maps embedding space to vocabulary space
Output: $(\text{seq\_len} \times \text{vocab\_size})$

Step 3: Softmax

Creates probability distribution over vocabulary per position

Step 4: Select next token

Sample or argmax to pick actual token (we'll see more next time)
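
A minimal sketch of Steps 2 to 4 for the last decoder position, in plain Python. The three-word vocabulary, the 2-dimensional hidden state, and the weight values are all made up for illustration; argmax is used here rather than sampling.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["la", "neige", "fond"]            # toy vocabulary (assumed)
hidden = [1.0, 2.0]                        # last row of decoder output, toy d_model = 2
W_vocab = [[0.1, 0.5, -0.3],
           [0.2, -0.1, 0.6]]               # (d_model x vocab_size), made-up values

# Step 2: linear projection into vocabulary space (one logit per vocabulary entry)
logits = [sum(h * w for h, w in zip(hidden, col)) for col in zip(*W_vocab)]

# Step 3: softmax turns logits into a probability distribution
probs = softmax(logits)

# Step 4: argmax picks the most probable token
next_token = vocab[probs.index(max(probs))]
print(next_token)                          # fond
```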

Autoregressive generation in action

Translating "snow melts" into "la neige fond"

Step 0: Encoder processes "snow melts" ONCE to get encoder output E

Step 1:

Decoder input: [START]
Processes: masked self-attn on [START], cross-attn to E, FFN
Output: "la" (predicted)

Step 2:

Decoder input: [START, "la"]
Processes: masked self-attn on [START, "la"], cross-attn to E, FFN
Output: "neige" (predicted)

Step 3:

Decoder input: [START, "la", "neige"]
Processes: masked self-attn on [START, "la", "neige"], cross-attn to E, FFN
Output: "fond" (predicted)

Encoder output E constant. Only decoder input grows

Let's think about it: what are the decoder's INPUTS?

Decoder has TWO separate input sources:

Input 1: From encoder (via cross-attention)

Encoder processes "snow melts" ONCE to get encoder output
This output REUSED at every decoder step
Used in cross-attention layer (K and V)

Input 2: Decoder's own previous outputs (via masked self-attention)

Starts with [START] token
Grows: [START], then [START, "la"], then [START, "la", "neige"]
Each token attends to all previous in THIS sequence
Used in masked self-attention layer

Encoder runs ONCE. Decoder runs MULTIPLE times (once per output token)

What exactly feeds back?

What gets added to decoder input at each step?

The predicted TOKEN (after sampling/argmax from probability distribution)

Complete loop:

Decoder outputs hidden states $(\text{seq\_len} \times d_{model})$
Linear projects to vocabulary $(\text{seq\_len} \times \text{vocab\_size})$
Softmax gives us probabilities over vocabulary
Sample or argmax to get predicted token (e.g., "la")
Convert token to embedding (via embedding matrix)
This embedding added to decoder input for next step

Not probabilities or raw hidden states, but embedded token
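
The loop can be sketched in plain Python with stand-in functions. `encode` and `decoder_step` are hypothetical placeholders: the "decoder" is a canned lookup table, not a real model, and the `[END]` stop token is an assumption. What the sketch shows is the control flow: the encoder runs once and the predicted token is appended to the decoder input.

```python
calls = {"encoder": 0, "decoder": 0}

def encode(source_tokens):
    """Stand-in encoder. A real one returns a (seq_len x d_model) matrix."""
    calls["encoder"] += 1
    return tuple(source_tokens)

def decoder_step(encoder_output, tokens_so_far):
    """Stand-in for decoder stack + linear + softmax + argmax (canned answers)."""
    calls["decoder"] += 1
    canned = {
        ("[START]",): "la",
        ("[START]", "la"): "neige",
        ("[START]", "la", "neige"): "fond",
    }
    return canned.get(tuple(tokens_so_far), "[END]")

def generate(source_tokens, max_len=10):
    encoder_output = encode(source_tokens)                   # encoder runs ONCE
    tokens = ["[START]"]
    while len(tokens) < max_len:
        next_token = decoder_step(encoder_output, tokens)    # one decoder run per step
        if next_token == "[END]":
            break
        tokens.append(next_token)                            # token feeds back as input
    return tokens[1:]

translation = generate(["snow", "melts"])
print(translation, calls)
```

Here the decoder is called four times for three output tokens, because the last call is the one that produces the stop token.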

Training vs Inference

What you just saw: INFERENCE (generating one token at a time)

During TRAINING, it's different:

Training:

Have full target: [START, "la", "neige", "fond"]
Decoder processes ENTIRE sequence at once (with masking)
Each position predicts next token in parallel
Fast and efficient!

Inference (generation):

Generate one token at a time
Decoder runs sequentially (once per output token)
Slower but necessary (don't know answer yet!)

This is why training is fast (parallel) but generation is slow (sequential)!
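
A small sketch of how the training pairs line up, in plain Python. It assumes the target sequence from the slide, shifted by one position to form inputs and labels. In practice an end-of-sequence token is usually appended as a final label, which is left out here.

```python
target = ["[START]", "la", "neige", "fond"]

decoder_input = target[:-1]      # what the decoder is fed, all at once
labels = target[1:]              # what each position should predict

# With the causal mask, position i only sees decoder_input[0..i],
# so all positions can be trained in one parallel pass.
for i, label in enumerate(labels):
    visible = decoder_input[: i + 1]
    print(f"position {i}: sees {visible} -> should predict {label!r}")

pairs = list(zip(decoder_input, labels))
```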

Quick check: Trace the flow (pairs, 5 min)

Turn to your neighbor, trace through:

Input: "snow melts" (English), Output: "la neige fond" (French)

Answer together:

"snow" through encoder block - what TWO sublayers?
Decoder generates "fond" - which THREE attention mechanisms?
Where does positional encoding get added?
What's the purpose of cross-attention?
How many times does the encoder run? The decoder?

Drawing Practice

Now YOU draw the architecture!

Work in pairs. Follow step-by-step instructions on handout

Take your time. Best way to absorb this and practice for midterm

Drawing Activity: Your Checklist

Work in pairs. Try to draw from what you remember!

Input path - how do tokens enter the model?
One encoder block - what are the two sublayers? What connects them?
Encoder stacking - how many blocks? What comes out?
One decoder block - this one has THREE sublayers. What are they? Where does the encoder connect?
Decoder output path - how do we get from decoder output to a word prediction?
Label the three types of attention in your diagram

Compare with your partner. Raise hand if questions!

Now let's build it together on the board!

Your turn to teach ME:

I'll draw based on YOUR instructions:

Where do I start?
What comes next?
Did I get this right?

Call out if you see a mistake

What we learned today

Complete data flow: Text → tokens → embeddings → multiply by $W_{Q}$ , $W_{K}$ , $W_{V}$ → Q/K/V vectors → attention output

Building blocks: Positional encoding (inject order), residual connections (help training), layer norm (stabilize), FFN (add capacity)

Encoder blocks (2 sublayers): Self-attention + FFN. Runs ONCE, produces rich representation

Decoder blocks (3 sublayers): Masked self-attention + cross-attention + FFN. Runs MULTIPLE times, generates one token at a time

Training vs inference: Training uses "teacher forcing" (parallel), inference is autoregressive (sequential)

Logistical notes

Recommended:

Review Jay Alammar's "Illustrated Transformer" post
Try sketching transformer architecture from memory

Portfolio Piece 1 Due Friday/Sunday

Quick reflection due too! Friday/Sunday

Exam 1: Monday, Feb 23 (everything through transformers & decoding)

Appendix: Full Step-by-Step Drawing Instructions

Use this to check your work or practice at home.

Step 1: Input path (both encoder and decoder)

Box: "Input tokens" (e.g., "snow melts")
Arrows point to "Embedding + Positional Encoding"
Note dimensions: $(\text{seq\_len} \times d_{model})$, typically $d_{model}$ = 512

Step 2: Draw ONE encoder block (vertically)

Box: "Multi-Head Self-Attention"
Show residual connection: arrow AROUND it
Box: "Add & Norm"
Box: "Feed-Forward Network (FFN)"
Show residual connection: arrow around FFN
Box: "Add & Norm"

Step 3: Show encoder stacking

Write "×6" next to encoder block (or draw 2-3 stacked)
Label output: "Encoder Output" (feeds into decoder)

Step 4: Draw ONE decoder block

Box: "Masked Multi-Head Self-Attention" (can't see future)
Residual connection + "Add & Norm"
Box: "Multi-Head Cross-Attention"
- IMPORTANT: Arrow FROM encoder output TO this layer
Residual connection + "Add & Norm"
Box: "Feed-Forward Network (FFN)"
Residual connection + "Add & Norm"

Step 5: Complete decoder output path

Write "×6" for decoder stacking
Arrow to "Linear" (projects to vocab size)
Arrow to "Softmax"
Output: "Probability distribution over vocabulary"

Lauren's CDS593 Materials