Lecture 7 - Transformer Architecture

Welcome back!

Last time: Attention, self-attention, multi-head attention

Today: Full transformer architecture

Why this matters: Every major LLM uses transformers (GPT, BERT, Claude, Gemini)

Logistics

  • Portfolio piece due Friday (or Sunday)
    • Scope ~ blog post
  • Decoding and midterm review tomorrow
  • Exam Monday
| Section | Topic | Points |
|---|---|---|
| 1 | Text Representation | 20 |
| 2 | Attention Mechanisms | 20 |
| 3 | Transformer Components | 20 |
| 4 | Decoder & Generation | 20 |
| 5 | Responsible AI | 20 |

Ice breaker (think/pair/share)

What differences have you noticed across LLMs - from GPT-2/3 to today's models?

Agenda for today

  1. Recap + Data flow: From text to Q/K/V
  2. Building blocks: Positional encoding, residual connections, layer norm, FFN
  3. Full architecture: Encoder and decoder deep dive
  4. Hands-on: Drawing the transformer together

Part 1: Recap and Data Flow

Monday's key ideas

Cross-attention: Decoder attends to encoder

  • "What input is relevant to what I'm generating?"

Self-attention: Sequence attends to itself

  • "How do words relate to each other?"

Multi-head attention: Multiple attention heads in parallel

  • Different heads capture syntax, semantics, position

The formula: $\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$

Today: How these pieces snap together

But first: where do Q, K, V come from?

I was clear about: Attention formula, combining Q, K, V

I was not clear about: Where do we GET Q, K, V?

So let's track the complete flow

From raw input to embeddings

Starting point: "snow melts"

Let's assume the size of our embeddings is $d_{\text{model}} = 512$ and our vocabulary has 50,000 tokens

Step 1: Tokenization

  • ["snow", "melts"], one-hot encoded, gives us a $2 \times 50{,}000$ matrix; multiplying by the $50{,}000 \times 512$ embedding matrix yields our $2 \times 512$ embeddings: $(2 \times 50{,}000)(50{,}000 \times 512) \to (2 \times 512)$
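The two matrix multiplications above can be sketched in NumPy (sizes are the lecture's illustrative ones; the token ids are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50_000, 512
token_ids = np.array([101, 2045])          # pretend ids for ["snow", "melts"]

# One-hot encoding: 2 x 50,000
one_hot = np.zeros((2, vocab_size))
one_hot[np.arange(2), token_ids] = 1.0

# Embedding matrix (learned in a real model; random stand-in here): 50,000 x 512
E = rng.normal(size=(vocab_size, d_model))

# (2 x 50,000) @ (50,000 x 512) -> (2 x 512)
embeddings = one_hot @ E
print(embeddings.shape)                    # (2, 512)

# In practice frameworks skip the one-hot and just do a row lookup:
assert np.allclose(embeddings, E[token_ids])
```

The one-hot multiplication and the row lookup are mathematically identical; the lookup is just cheaper.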

From embeddings to Q, K, V

(Assuming self-attention)

Three learned projection matrices: $W_Q$, $W_K$, $W_V$

  • $W_Q$: project embedding into query space

  • $W_K$: project embedding into key space (for matching)

  • $W_V$: project embedding into value space (for content)

Projection matrices are learned during training

Now we can use the attention formula

Once we have Q, K, V:

$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$
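A minimal NumPy sketch of the whole flow, assuming single-head self-attention with random stand-in weights (a real model learns $W_Q$, $W_K$, $W_V$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 2, 512, 64
X = rng.normal(size=(n, d_model))          # token embeddings (+ positions)

# Learned projection matrices (random here, trained in a real model)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # each n x d_k

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_k)            # n x n
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V                       # n x d_k
print(output.shape)                        # (2, 64)
```

Note each attention row sums to 1: each position distributes its attention across all positions.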

Let's draw it out

You try first:

Sketch the flow for your own 2-word sentence:

  • Start with text
  • Tokenization
  • Embedding matrices and embeddings
  • Projection matrices $W_Q$, $W_K$, $W_V$ and the resulting Q, K, V matrices
  • Attention formula and final output

What are the matrix dimensions at each step?

Then we'll draw on the board together

Quick reminder: Multi-head attention mechanics

Inside "Multi-Head Attention":

  1. Split into h heads (typically 8)
  2. Each head runs attention independently with own projection matrices
  3. Concatenate all head outputs
  4. Project with output projection matrix

Result: Each head focuses on different aspects (syntax, semantics, position)

Output dimension: Still $d_{\text{model}}$ (512), same as input
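The four steps above can be sketched as follows (random stand-in weights; a real model learns the per-head projections and $W_O$):

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
n, d_model, h = 2, 512, 8
d_k = d_model // h                          # 64 per head
X = rng.normal(size=(n, d_model))

heads = []
for _ in range(h):
    # 1-2: each head has its OWN projections and runs attention independently
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))   # n x d_k each

# 3: concatenate all head outputs back to n x (h * d_k) = n x 512
concat = np.concatenate(heads, axis=-1)
# 4: output projection
W_O = rng.normal(size=(d_model, d_model))
out = concat @ W_O
print(out.shape)                            # (2, 512) -- same as input
```

The split-then-concat bookkeeping is why the output dimension matches the input: $h \cdot d_k = d_{\text{model}}$.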

Dimension notation: $d_{\text{model}}$ vs $d_k$

Important terminology clarification:

$d_{\text{model}}$ = full model dimension (typically 512)

  • Size of token embeddings
  • Input/output size of each transformer layer
  • Also called the hidden size or embedding dimension

$d_k$ = dimension per attention head (typically 64)

  • With 8 heads and $d_{\text{model}} = 512$: each head gets $d_k = 512 / 8 = 64$
  • Appears in the scaling factor $\sqrt{d_k}$ in the attention formula

Relationship: $d_k = d_{\text{model}} / h$, where $h$ = number of heads

The building blocks for a complete transformer

  • Self-attention: Each position attends to all positions
  • Multi-head attention: Multiple attention mechanisms in parallel

New today:

  • Positional encoding: Add position information
  • Feed-forward networks: Process each position independently
  • Layer normalization + residual connections: Stabilize training

Next: Understand the new pieces, then assemble

Part 2: Building Blocks

Positional Encoding: The order problem

Problem: Attention doesn't perceive sequence order

"The cat sat on the mat" and "mat the on sat cat The" have equivalent representations

Why? Attention just looks at relationships, not order

Solution: Positional encoding

Idea: Add positional information to embeddings

Before: X = [embedding for "cat", embedding for "sat", ...]

After: X = [embedding + position 0, embedding + position 1, ...]

Result: Model knows "cat" at position 0, "sat" at position 1

How to encode position?

Option 1: Learned embeddings (modern models)

Option 2: Fixed sinusoidal functions (original paper)

Sinusoidal positional encodings

FYI / you're not responsible for these formulas:

$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$

$PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$

Intuition: Different frequencies create unique "fingerprints" for each position

Why this works: Model can learn absolute and relative positions
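The sinusoidal scheme can be sketched in a few lines of NumPy (sizes illustrative; assumes an even $d_{\text{model}}$):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encoding (original-paper style)."""
    pos = np.arange(n_positions)[:, None]          # positions 0..n-1 (column)
    i = np.arange(d_model // 2)[None, :]           # dimension-pair index (row)
    angles = pos / (10_000 ** (2 * i / d_model))   # one frequency per pair
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = positional_encoding(10, 512)
print(pe.shape)        # (10, 512)
# Each row is a unique "fingerprint" for that position.
```

Low dimension indices oscillate quickly, high indices slowly, so together the row uniquely identifies the position.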

Embeddings + positional encoding

  1. Token embeddings: one $d_{\text{model}}$-dimensional vector per token

  2. Positional encodings: one $d_{\text{model}}$-dimensional vector per position

  3. Add them: input = embeddings + positional encodings

  4. Pass to rest of model

Result: Each token embedding has WHAT it is (word) and WHERE it is (position)

Positional encoding added at input to BOTH encoder and decoder

Residual connections

Problem: Deep networks hard to train (vanishing gradients)

Solution: Add input back to output

Instead of: output = Layer(input)

We do: output = input + Layer(input)

input ───┬───> [Layer] ───> (+) ───> output
         │                   ↑
         └───────────────────┘
          (residual / skip connection)

Why this helps: Model can ignore unhelpful layers (set contribution ≈ 0)

Also helps gradients flow backward during training

In transformers: EVERY sublayer (attention, FFN) has residual connection

Layer normalization

After each sublayer:

  • Rescale to mean = 0, variance = 1
  • Stabilizes training (prevents values getting too large/small)

In transformers: Layer norm happens AFTER residual connection

Full pattern: output = LayerNorm(input + Sublayer(input))
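The Add & Norm pattern, sketched with a stand-in sublayer (post-norm order as in the original paper; the learned scale/shift parameters are omitted):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Rescale each position to mean 0, variance 1
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 512))
# Stand-in sublayer (in a real block: attention or FFN)
sublayer = lambda z: z @ rng.normal(size=(512, 512)) * 0.01

out = layer_norm(x + sublayer(x))   # residual add, THEN normalize
print(out.shape)                    # (2, 512)
print(out.mean(axis=-1))            # ~0 per position
```

The residual add comes first, so the layer can contribute little and the input still flows through unchanged.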

Feed-forward network (FFN)

After attention, EACH POSITION goes through small neural network:

Structure:

  • Input: $d_{\text{model}}$ (e.g., 512)
  • Hidden layer: $d_{ff}$ (e.g., 2048) - much wider!
  • Output: $d_{\text{model}}$ (e.g., 512)
  • Activation: ReLU (the $\max(0, x)$)

Key: Applied to each position INDEPENDENTLY. Same FFN weights shared across all positions, different inputs per position

The FFN is just a 2-layer neural network (also called a multi-layer perceptron or MLP)

Pattern: Attention mixes info ACROSS positions, FFN processes each position individually (adds capacity and non-linearity)

The FFN is much wider than the model dimension (this is where many of the parameters live)
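A sketch of the position-wise FFN with the lecture's sizes (random stand-in weights; a trained model learns $W_1$, $b_1$, $W_2$, $b_2$):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x):
    # 512 -> 2048 (ReLU) -> 512, same weights for EVERY position
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(2, d_model))       # 2 positions, processed independently
out = ffn(x)
print(out.shape)                        # (2, 512)

# Weight count: most of a block's parameters live here
print(W1.size + W2.size)                # 2 * 512 * 2048 = 2,097,152
```

Because the matrices multiply each row of `x` independently, positions never mix inside the FFN; mixing happens only in attention.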

Quick break: What surprises you?

Turn to your neighbor (2 min):

You've now seen all the building blocks: attention, positional encoding, residual connections, layer norm, FFN.

  • What surprised you?
  • What seems clever?
  • What seems redundant or over-engineered?

Share with class: Any "aha" moments or lingering confusion?

Part 3: Full Transformer Architecture

The complete picture

Original transformer: Encoder-Decoder architecture for translation

Full diagram first, then build up piece by piece:

From "Attention is All You Need":

Encoder block components

Each encoder block has TWO sublayers:

  1. Multi-head self-attention

    • Input sequence attends to itself
    • Each position can see all positions
  2. Feed-forward network (FFN)

    • FFN per position independently
    • Typically: 512 to 2048 to 512

Both sublayers have:

  • Residual connection (add input to output)
  • Layer normalization

What is "encoder output"?

  • After 6 stacked blocks: an $n \times 512$ matrix ($n$ = number of input tokens)
  • Each row = processed embedding of one input token
  • Entire matrix feeds into decoder's cross-attention (used as K and V)
  • Encoder runs ONCE, output reused at every decoder step

Decoder block components

Each decoder block has THREE sublayers:

  1. Masked multi-head self-attention

    • Output tokens attend to previous tokens only
    • Can't see future (prevents cheating!)
  2. Multi-head cross-attention - Connection to encoder!

    • Decoder attends to encoder output
    • Q from previous layer (masked self-attention output)
    • K and V from encoder output (processed input)
  3. Feed-forward network (FFN)

    • Same as encoder

All three sublayers: Residual connections + layer norm

Why masked? During generation we don't know future tokens yet!
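The masking trick can be sketched directly: set scores for future positions to $-\infty$ before the softmax (toy scores, 4 positions):

```python
import numpy as np

n = 4
scores = np.zeros((n, n))                  # pretend raw Q K^T / sqrt(d_k)
mask = np.triu(np.ones((n, n)), k=1).astype(bool)  # True above diagonal = future
scores[mask] = -np.inf                     # exp(-inf) = 0 after softmax

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# Row 0 attends only to position 0; row 3 attends to positions 0..3.
```

With equal scores, each position splits its attention uniformly over itself and the past, and gives exactly zero weight to the future.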

Encoder vs Decoder: Key differences

Similar building blocks, important differences:

| Component | Encoder | Decoder |
|---|---|---|
| Input | Entire source sequence | Output tokens generated so far |
| Self-attention | Can see all positions | Masked (can't see future) |
| Cross-attention | None | Attends to encoder output |
| Sublayers per block | 2 (self-attn + FFN) | 3 (masked self-attn + cross-attn + FFN) |
| Purpose | Build rich representation | Generate output one token at a time |

Both: 6 stacked blocks, residual connections, layer norm

Learned vs computed parameters

Important distinction:

Learned during training (model parameters):

  • $W_Q$, $W_K$, $W_V$ projection matrices (in each attention layer)
  • $W_O$ output projection matrix (in multi-head attention)
  • FFN weights ($W_1$, $b_1$, $W_2$, $b_2$)
  • Layer norm parameters (scale and shift)
  • Embedding matrices

Computed during forward pass:

  • Q, K, V matrices (from $XW_Q$, $XW_K$, $XW_V$)
  • Attention weights (softmax of $QK^\top / \sqrt{d_k}$)
  • Attention output (weighted sum of V)

From decoder to predictions

After 6 decoder blocks, how do we get next token?

Step 1: Decoder output

  • After all 6 blocks: an $n \times 512$ matrix ($n$ = tokens generated so far)
  • Still in embedding space (512 dimensions)

Step 2: Linear projection

  • Learned weight matrix: $512 \times 50{,}000$ ($d_{\text{model}} \times$ vocab size)
  • Maps embedding space to vocabulary space
  • Output: $n \times 50{,}000$ logits

Step 3: Softmax

  • Creates probability distribution over vocabulary per position

Step 4: Select next token

  • Sample or argmax to pick actual token (we'll see more next time)
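Steps 1-4 in NumPy, with random stand-in weights and the lecture's illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, vocab = 3, 512, 50_000
decoder_out = rng.normal(size=(n, d_model))         # after 6 decoder blocks

# Step 2: learned linear projection into vocabulary space
W_vocab = rng.normal(size=(d_model, vocab)) * 0.02
logits = decoder_out @ W_vocab                      # n x vocab

# Step 3: softmax per position -> probability distribution over vocabulary
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

# Step 4: greedy pick at the last position (sampling is the alternative)
next_token = int(np.argmax(probs[-1]))
print(probs.shape)                                  # (3, 50000)
```

Each row of `probs` sums to 1; only the last position's distribution is needed to pick the next token during generation.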

Autoregressive generation in action

Translating "snow melts" into "la neige fond"

Step 0: Encoder processes "snow melts" ONCE to get encoder output E

Step 1:

  • Decoder input: [START]
  • Processes: masked self-attn on [START], cross-attn to E, FFN
  • Output: "la" (predicted)

Step 2:

  • Decoder input: [START, "la"]
  • Processes: masked self-attn on [START, "la"], cross-attn to E, FFN
  • Output: "neige" (predicted)

Step 3:

  • Decoder input: [START, "la", "neige"]
  • Processes: masked self-attn on [START, "la", "neige"], cross-attn to E, FFN
  • Output: "fond" (predicted)

Encoder output E constant. Only decoder input grows
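The loop above, as a toy sketch. The tiny `decoder_step` is hypothetical and just hard-codes the translation schedule, but the control flow (encoder runs once, decoder input grows each step) is the real pattern:

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder_step(tokens, E):
    # Stand-in for: masked self-attn on tokens, cross-attn to E, FFN,
    # linear + softmax, argmax. Hard-coded schedule for illustration only.
    schedule = {1: "la", 2: "neige", 3: "fond"}
    return schedule.get(len(tokens), "<END>")

E = rng.normal(size=(2, 512))          # encoder ran ONCE on "snow melts"
tokens = ["<START>"]
while tokens[-1] != "<END>" and len(tokens) < 10:
    tokens.append(decoder_step(tokens, E))   # decoder input grows each step

print(tokens)   # ['<START>', 'la', 'neige', 'fond', '<END>']
```

Note `E` never changes inside the loop; only `tokens` grows.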

Let's think about - what are the decoder's INPUTS?

Decoder has TWO separate input sources:

Input 1: From encoder (via cross-attention)

  • Encoder processes "snow melts" ONCE to get encoder output
  • This output REUSED at every decoder step
  • Used in cross-attention layer (K and V)

Input 2: Decoder's own previous outputs (via masked self-attention)

  • Starts with [START] token
  • Grows: [START], then [START, "la"], then [START, "la", "neige"]
  • Each token attends to all previous in THIS sequence
  • Used in masked self-attention layer

Encoder runs ONCE. Decoder runs MULTIPLE times (once per output token)

What exactly feeds back?

What gets added to decoder input at each step?

The predicted TOKEN (after sampling/argmax from probability distribution)

Complete loop:

  1. Decoder outputs hidden states
  2. Linear projects to vocabulary
  3. Softmax gives us probabilities over vocabulary
  4. Sample or argmax to get predicted token (e.g., "la")
  5. Convert token to embedding (via embedding matrix)
  6. This embedding added to decoder input for next step

Not probabilities or raw hidden states, but embedded token

Training vs Inference

What you just saw: INFERENCE (generating one token at a time)

During TRAINING, it's different:

Training:

  • Have full target: [START, "la", "neige", "fond"]
  • Decoder processes ENTIRE sequence at once (with masking)
  • Each position predicts next token in parallel
  • Fast and efficient!

Inference (generation):

  • Generate one token at a time
  • Decoder runs sequentially (once per output token)
  • Slower but necessary (don't know answer yet!)

This is why training is fast (parallel) but generation is slow (sequential)!
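The parallel training setup can be sketched as the shift between decoder input and labels (teacher forcing; token lists from the translation example):

```python
# Training: the decoder sees the FULL target at once (causal mask keeps it
# honest), and each position is trained to predict the NEXT token.
target = ["<START>", "la", "neige", "fond"]     # decoder input (all at once)
labels = ["la", "neige", "fond", "<END>"]       # same sequence, shifted left

pairs = list(zip(target, labels))               # (what position sees, what it must predict)
print(pairs)
# [('<START>', 'la'), ('la', 'neige'), ('neige', 'fond'), ('fond', '<END>')]
```

All four prediction problems are scored in one forward pass, which is exactly what makes training parallel.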

Quick check: Trace the flow (pairs, 5 min)

Turn to your neighbor, trace through:

Input: "snow melts" (English), Output: "neige fond" (French)

Answer together:

  1. "snow" through encoder block - what TWO sublayers?

  2. Decoder generates "fond" - which THREE attention mechanisms?

  3. Where does positional encoding get added?

  4. What's the purpose of cross-attention?

  5. How many times does the encoder run? The decoder?

Drawing Practice

Now YOU draw the architecture!

Work in pairs. Follow step-by-step instructions on handout

Take your time. Best way to absorb this and practice for midterm

Drawing Activity: Your Checklist

Work in pairs. Try to draw from what you remember!

  1. Input path - how do tokens enter the model?
  2. One encoder block - what are the two sublayers? What connects them?
  3. Encoder stacking - how many blocks? What comes out?
  4. One decoder block - this one has THREE sublayers. What are they? Where does the encoder connect?
  5. Decoder output path - how do we get from decoder output to a word prediction?
  6. Label the three types of attention in your diagram

Compare with your partner. Raise hand if questions!

Now let's build it together on the board!

Your turn to teach ME:

I'll draw based on YOUR instructions:

  • Where do I start?
  • What comes next?
  • Did I get this right?

Call out if you see a mistake

What we learned today

Complete data flow: Text → tokens → embeddings → multiply by $W_Q$, $W_K$, $W_V$ → Q/K/V vectors → attention output

Building blocks: Positional encoding (inject order), residual connections (help training), layer norm (stabilize), FFN (add capacity)

Encoder blocks (2 sublayers): Self-attention + FFN. Runs ONCE, produces rich representation

Decoder blocks (3 sublayers): Masked self-attention + cross-attention + FFN. Runs MULTIPLE times, generates one token at a time

Training vs inference: Training uses "teacher forcing" (parallel), inference is autoregressive (sequential)

Logistical notes

Recommended:

  • Review Jay Alammar's "Illustrated Transformer" post
  • Try sketching transformer architecture from memory

Portfolio Piece 1 Due Friday/Sunday

Quick reflection due too! Friday/Sunday

Exam 1: Monday, Feb 23 (everything through transformers & decoding)

Appendix: Full Step-by-Step Drawing Instructions

Use this to check your work or practice at home.

Step 1: Input path (both encoder and decoder)

  • Box: "Input tokens" (e.g., "snow melts")
  • Arrows point to "Embedding + Positional Encoding"
  • Note dimensions: $n \times d_{\text{model}}$, typically $d_{\text{model}} = 512$

Step 2: Draw ONE encoder block (vertically)

  • Box: "Multi-Head Self-Attention"
  • Show residual connection: arrow AROUND it
  • Box: "Add & Norm"
  • Box: "Feed-Forward Network (FFN)"
  • Show residual connection: arrow around FFN
  • Box: "Add & Norm"

Step 3: Show encoder stacking

  • Write "×6" next to encoder block (or draw 2-3 stacked)
  • Label output: "Encoder Output" (feeds into decoder)

Step 4: Draw ONE decoder block

  • Box: "Masked Multi-Head Self-Attention" (can't see future)
  • Residual connection + "Add & Norm"
  • Box: "Multi-Head Cross-Attention"
    • IMPORTANT: Arrow FROM encoder output TO this layer
  • Residual connection + "Add & Norm"
  • Box: "Feed-Forward Network (FFN)"
  • Residual connection + "Add & Norm"

Step 5: Complete decoder output path

  • Write "×6" for decoder stacking
  • Arrow to "Linear" (projects to vocab size)
  • Arrow to "Softmax"
  • Output: "Probability distribution over vocabulary"