Lecture 9 - Pre-training LLMs: From Transformers to GPT

Welcome back!

Last time: Exam 1 on foundations and transformer architecture

Today: How do transformers become useful LLMs? The journey from toy models to GPT-5

Ice breaker

In a class, internship, project, or job, what's the largest ML model of any kind you've trained in terms of:

Compute time
Training set size
Cloud compute cost
Number of parameters

Agenda for today

From toy transformers to LLMs: what changes at scale?
Pre-training deep dive: data, objectives, infrastructure
Scaling laws: bigger is better (with caveats)
Activity: Design your training run
Ethics spotlight: who pays the real costs?

Part 1: From Toy Transformers to LLMs

Recap: You've seen transformers

In Weeks 4-5, you learned:

Attention mechanism (Q, K, V)
Multi-head attention
Transformer architecture (encoder + decoder blocks)

In labs (tomorrow!): You will implement attention and a tiny transformer

Typical lab-scale transformer:

Vocab size: 5,000-10,000 tokens
Embedding dimension: 128-256
Number of layers: 2-4
Number of heads: 4-8
Total parameters: ~1-10 million
Training time: minutes to hours on a single GPU
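
A rough way to see where the "~1-10 million" figure comes from is to count weights directly. The sketch below assumes a standard block layout (about 12 × d_model² weights per layer) and ignores biases, layer norms, and position embeddings. `transformer_params` is a hypothetical helper, and the GPT-3 configuration (96 layers, width 12,288, ~50K-token vocabulary) comes from the published GPT-3 paper, not from these slides.

```python
def transformer_params(vocab_size: int, d_model: int, n_layers: int) -> int:
    """Rough parameter count: token embeddings + ~12 * d_model^2 per block.

    Assumption: each block has 4*d^2 attention weights (Q, K, V, output)
    and 8*d^2 feed-forward weights (hidden size 4*d). Biases, layer norms,
    and position embeddings are ignored.
    """
    embedding = vocab_size * d_model
    per_block = 12 * d_model ** 2
    return embedding + n_layers * per_block


# Upper end of the lab-scale ranges listed above
lab = transformer_params(vocab_size=10_000, d_model=256, n_layers=4)

# GPT-3 175B configuration as published (assumption: not from these slides)
gpt3 = transformer_params(vocab_size=50_257, d_model=12_288, n_layers=96)

print(f"lab-scale:   {lab / 1e6:.1f}M parameters")
print(f"GPT-3-scale: {gpt3 / 1e9:.1f}B parameters")
```

Under these assumptions the lab model lands inside the 1-10 million range, and the same formula comes out close to 175 billion for the GPT-3 configuration. Note that the number of heads does not appear: heads split the same d_model-wide projection, so they do not change the count.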

Transformer variants

Three flavors, depending on which attention mask you use:

Encoder-only (BERT, RoBERTa):
- Bidirectional attention - each token sees the full sequence.
- Best for understanding tasks (classification, named entity recognition, question answering)
Decoder-only (GPT, Claude, Gemini, Llama):
- Causal masking (the lower-triangular mask from L6) - each token sees only the past.
- Best for generation.
Encoder-decoder (T5, BART, original transformer):
- Encoder reads input bidirectionally, decoder generates output autoregressively.
- Best for translation, summarization, anything mapping one sequence to another
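
The difference between the encoder-only and decoder-only flavors comes down to one matrix. A minimal sketch in plain Python (the helper name `attention_mask` is made up for illustration; real implementations build this as a tensor):

```python
def attention_mask(seq_len: int, causal: bool) -> list[list[int]]:
    """mask[i][j] == 1 means position i may attend to position j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


bidirectional = attention_mask(4, causal=False)  # encoder-only style: all ones
causal = attention_mask(4, causal=True)          # decoder-only style: lower-triangular

for row in causal:
    print(row)
```

Row i of the causal mask has ones only in columns 0 through i, which is exactly "each token sees only the past" (plus itself).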

Note: BERT's prediction head is training scaffolding and is discarded when fine-tuning. GPT's LM head is kept since generation is the task.

Modern LLMs are almost all decoder-only. Why?

Why decoder-only won

The downside:
- Causal masking = each token sees only the past
- "bank" in "I went to the bank of the river" can't see "river" yet - genuinely ambiguous
For generation, it doesn't matter:
- Answer tokens attend to the full prompt - "river" is visible at generation time
- Disambiguation happens when it needs to, not at encoding time
Where encoder-only still wins:
- Embeddings and retrieval - RAG systems use BERT-style models for indexing

Scale: Production LLMs

GPT-3 (2020):

175 billion parameters
~34 days on 10,000 V100 GPUs

GPT-4 (2023, rumored):

~1.7 trillion parameters (mixture of experts)
months of training, >$100 million

GPT-5 (August 2025):

Parameters undisclosed, 272,000-token context window
~$500 million per run (Wall Street Journal)

Big context doesn't mean perfect memory

GPT-5 has a 272,000-token context window. Does the model use it all equally?

Liu et al. (2023): "Lost in the Middle" - models attend much more to information at the start and end of context. Performance degrades on information buried in the middle.

For practice: Put your most critical content first or last. This is one reason RAG can outperform stuffing everything into context. (More in Week 10.)

What changes at scale?

Data: From thousands of examples to trillions of tokens
Compute: From one GPU to thousands, from hours to months
Infrastructure: Distributed training, checkpointing, monitoring
Cost: From free (Colab) to millions of dollars
Capabilities: Emergent abilities that don't appear at small scale
Stakes: One bug can waste weeks and millions of dollars

Part 2: Pre-training Deep Dive

What is pre-training?

Pre-training = learning from raw text

No labels, no human annotations
Just predict: "What comes next?" (GPT) or "What's masked?" (BERT)
Learn language patterns, facts, reasoning from observation
Then fine-tune for specific tasks (next week's lecture!)

Why "pre-training"? The "pre" means before fine-tuning/post-training - it's still the main event (99%+ of the compute)

Training objectives

GPT (causal LM):
- Predict the next token, left-to-right only
- Naturally generates next tokens - generation is "free"
BERT (masked LM):
- Predict masked tokens using both sides of context (~15% masked)
- Sees full context - understanding and classification are "free"
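
To make the two objectives concrete, here is a small sketch of what the training examples look like for the same sentence. It is a simplification: the function names are hypothetical, tokens are whole words rather than BPE pieces, and real BERT also sometimes swaps in a random token or leaves the original in place instead of always writing `[MASK]`.

```python
import random


def causal_lm_pairs(tokens):
    """GPT-style: every prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_lm_example(tokens, mask_rate=0.15, seed=0):
    """BERT-style: hide ~mask_rate of the tokens; the hidden ones are the targets."""
    rng = random.Random(seed)
    n_masked = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_masked))
    corrupted = ["[MASK]" if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return corrupted, targets


sentence = "I went to the bank of the river".split()

for context, target in causal_lm_pairs(sentence):
    print(" ".join(context), "->", target)

corrupted, targets = masked_lm_example(sentence)
print(" ".join(corrupted), "| targets:", targets)
```

The causal version yields one prediction per position from a single sentence, while the masked version yields only a handful of targets but lets each one see context on both sides.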

What does the training signal look like?

Loss = cross-entropy over next-token predictions

At each position, predict from ~32K-100K BPE tokens.

Loss = $-\log p(\text{correct token})$. Lower is better.

Perplexity = $e^{\text{loss}}$ - this is the standard metric you'll see in papers

Perplexity 10: model is "as confused as if choosing uniformly among 10 options"
Perplexity 1: perfect prediction
GPT-3 achieves ~20 perplexity on standard benchmarks
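
The loss and perplexity definitions above fit in a few lines. This sketch takes, for each position, the probability the model assigned to the correct token (the input lists are made-up numbers, not model outputs):

```python
import math


def cross_entropy(probs_of_correct):
    """Average of -log p(correct token) over positions."""
    return -sum(math.log(p) for p in probs_of_correct) / len(probs_of_correct)


def perplexity(probs_of_correct):
    """Perplexity = e^loss."""
    return math.exp(cross_entropy(probs_of_correct))


uniform_over_10 = [0.1] * 5   # always 10% on the right answer
perfect = [1.0] * 5           # always 100% on the right answer

print(perplexity(uniform_over_10))
print(perplexity(perfect))
```

A model that always puts probability 1/10 on the correct token has perplexity 10, matching the "choosing uniformly among 10 options" reading, and a perfect predictor has perplexity 1.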

Learning rate schedule:

Warmup for ~1K steps (avoid early instability), then cosine decay to near-zero
Big updates early, fine adjustments late - standard for all modern LLMs
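
A minimal sketch of a warmup-then-cosine schedule. The peak rate, floor, and linear warmup shape are illustrative assumptions, not values from any specific model; only the ~1K warmup steps and the cosine decay come from the slide.

```python
import math


def learning_rate(step, total_steps, peak_lr=3e-4, warmup_steps=1_000, min_lr=3e-6):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the earliest updates stay small
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + (peak_lr - min_lr) * cosine


total = 100_000
for step in (0, 500, 1_000, 50_000, 100_000):
    print(step, f"{learning_rate(step, total):.2e}")
```

The rate rises during warmup, peaks when warmup ends, and then falls smoothly to the floor at the final step.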

Where does training data come from?

Modern LLMs are trained on diverse text sources:

Common Crawl: Web pages (petabytes of text)
Books: Fiction and non-fiction (Books3 dataset, ~196k books)
Wikipedia: High-quality encyclopedic content
Code: GitHub repositories (for Codex, Copilot)
Research papers, news articles, forums, social media...

Before we look at how it's done...

Quick discussion (2 min):

If you were building a training dataset from a raw scrape of the internet - what would you keep? What would you throw out? What percentage do you think actually makes it into the final training data?

Data curation: It's not just "download the internet"

Raw Common Crawl is full of garbage:

Spam, ads, boilerplate text
Duplicate content (same text repeated thousands of times)
Low-quality text (typos, gibberish, machine-generated)
Toxic content (hate speech, explicit material)
Personal information (emails, phone numbers, addresses)

What raw web text actually looks like

A realistic sample (before cleaning):

Home | About | Services | Contact | Home | About | Services | Contact
BUY CHEAP WIDGETS ONLINE! Best widget prices 2019! Cheap widgets!
Click here  click here  click here  click here  click here
Copyright © 2019 All rights reserved  Privacy Policy  Terms  Sitemap
Lorem ipsum dolor sit amet consectetur adipiscing elit sed do eiusmod

After cleaning (~2% survives):

Transformer models represent each token as a high-dimensional vector.
Self-attention allows the model to weigh the relevance of every other
token when producing a representation for each position in the sequence.

Most of the web looks like the top example - not bad writing, just no signal

Data cleaning pipeline

Deduplication: Remove near-duplicate documents
Quality filtering: Heuristics (word count, punctuation, ratio of letters to numbers)
Toxicity filtering: Remove hate speech, explicit content
PII removal: Scrub personal information
Classifier-based filtering: Train a model to predict quality

GPT-3 result: ~45TB in, ~570GB out - over 98% filtered out
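
Two of the pipeline stages, deduplication and heuristic quality filtering, can be sketched with the standard library. Everything here is a toy under stated assumptions: the thresholds are invented, the function names are hypothetical, and the dedup is exact-match after whitespace and case normalization (production pipelines typically use fuzzy near-duplicate methods such as MinHash).

```python
import hashlib
import re


def looks_like_prose(text, min_words=8, min_alpha_ratio=0.7, max_repeat_ratio=0.3):
    """Cheap heuristics: enough words, mostly letters, not too repetitive."""
    words = text.split()
    if len(words) < min_words:
        return False
    alpha_ratio = sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)
    if alpha_ratio < min_alpha_ratio:
        return False
    # Share of words that repeat an earlier word (navigation bars score high)
    repeat_ratio = 1 - len({w.lower() for w in words}) / len(words)
    return repeat_ratio <= max_repeat_ratio


def clean(docs):
    """Drop exact duplicates (after normalization), then apply the heuristics."""
    seen = set()
    kept = []
    for doc in docs:
        normalized = re.sub(r"\s+", " ", doc).strip().lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        if looks_like_prose(doc):
            kept.append(doc)
    return kept


good = "Transformer models represent each token as a high-dimensional vector."
docs = [
    good,
    "Transformer  models represent each token as a high-dimensional vector.",
    "Home | About | Services | Contact | Home | About | Services | Contact",
    "Click here  click here  click here  click here  click here",
]
print(clean(docs))
```

Heuristics this simple catch the navigation bar and the repeated "click here" line, but placeholder text or fluent spam can still pass them, which is one reason classifier-based filtering is a separate stage.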

Who decides what's "quality"?

OpenAI's approach (WebText):

Positive examples: text from URLs shared in Reddit posts with 3+ upvotes
Positive examples: Wikipedia articles
Negative examples: everything else from Common Crawl

What does "Reddit-approved" text bias toward?

English content, Western topics, tech/finance/gaming
Demographics: young, male, college-educated
Writing styles that get upvotes (confident, punchy, sometimes glib)

Every quality signal encodes someone's judgment. This is where bias enters before any intentional decisions.

Curriculum learning

Not all data should be seen in random order

Idea (Bengio et al., 2009): Start with easier examples, gradually increase difficulty

Two mechanisms:

Data ordering: Simple, clean text early; complex documents, code, math later
Data mix scheduling: Change the proportion of each source over training

"Annealing":

Near end of training: upweight highest-quality data (books, math, code)
Why it matters: these are the final updates - nothing comes after to overwrite them
The low learning rate means small, stable adjustments, so the annealing data steers the final resting point without instability
LLaMA-3: final phase emphasized STEM and code to sharpen reasoning
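
Data mix scheduling with a final annealing phase can be sketched as a function from training progress to sampling weights. The source names and every proportion below are invented for illustration; they are not the mix used by LLaMA-3 or any other model.

```python
def mixture_weights(progress, anneal_start=0.9):
    """Sampling weights per data source at a given fraction of training (0..1).

    All proportions are hypothetical. Before anneal_start the base mix is
    used; afterwards the weights move linearly toward the annealing mix.
    """
    base = {"web": 0.80, "books": 0.08, "code": 0.08, "math": 0.04}
    annealed = {"web": 0.30, "books": 0.20, "code": 0.30, "math": 0.20}
    if progress <= anneal_start:
        return dict(base)
    t = (progress - anneal_start) / (1.0 - anneal_start)
    return {name: (1 - t) * base[name] + t * annealed[name] for name in base}


for progress in (0.0, 0.5, 0.95, 1.0):
    weights = mixture_weights(progress)
    print(progress, {name: round(w, 2) for name, w in weights.items()})
```

Because both mixes sum to 1, every interpolated mix does too, so the weights can be used directly as sampling probabilities.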

Training infrastructure

Why can't you just use a bigger GPU?

175B params × 2 bytes (FP16) = ~350GB. An A100 has 80GB. The model doesn't fit.
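
The arithmetic above as a reusable check (weights only; gradients, optimizer state, and activations would add more on top):

```python
import math


def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


gpt3_gb = weight_memory_gb(175e9)          # FP16: 2 bytes per parameter
a100_gb = 80
min_gpus = math.ceil(gpt3_gb / a100_gb)    # GPUs needed just to hold the weights

print(f"{gpt3_gb:.0f} GB of weights -> at least {min_gpus} A100s")
```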

Distributed training across thousands of GPUs:

Data parallelism: Each GPU holds a full model copy, processes different batches
Model (tensor) parallelism: Split each layer's weight matrices across GPUs - every GPU computes a slice of every layer
Pipeline parallelism: Split the layer stack into stages - GPU 1 runs layers 1-24, GPU 2 runs 25-48, etc., with micro-batches keeping every stage busy

Training infrastructure hacks

ZeRO (Zero Redundancy Optimizer):
- Adam tracks momentum + variance per weight - in mixed precision, optimizer state (FP32 master weights, momentum, variance) takes ~12 bytes per parameter vs 2 bytes for the FP16 weights
- Partitions weights + gradients + optimizer states across GPUs - each stores only 1/N
Mixed precision (FP16/BF16):
- Forward/backward in 16-bit float (half the memory of FP32)
- Weight updates stay in FP32 for numerical stability
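
A back-of-the-envelope sketch of why partitioning matters. It assumes a common accounting for mixed-precision Adam of 16 bytes per parameter (2 for FP16 weights, 2 for FP16 gradients, 12 for FP32 master weights, momentum, and variance) and ignores activations and communication buffers. The function name and the GPU count are illustrative.

```python
def training_state_gb(n_params, n_gpus=1, partition=True):
    """Per-GPU memory for weights + gradients + optimizer state, in GB.

    Assumed mixed-precision Adam accounting, per parameter:
      2 bytes FP16 weights + 2 bytes FP16 gradients
      + 12 bytes FP32 master weights, momentum, and variance
    Activations and buffers are not counted.
    """
    bytes_per_param = 2 + 2 + 12
    total_gb = n_params * bytes_per_param / 1e9
    # Full partitioning: each GPU stores only 1/N of everything
    return total_gb / n_gpus if partition else total_gb


replicated = training_state_gb(175e9, n_gpus=1024, partition=False)
partitioned = training_state_gb(175e9, n_gpus=1024, partition=True)

print(f"replicated on every GPU: {replicated:.0f} GB")
print(f"partitioned over 1024 GPUs: {partitioned:.2f} GB per GPU")
```

Without partitioning, every GPU would need terabytes for a 175B-parameter model. With everything split 1/N, the per-GPU share drops to a few gigabytes, leaving room for activations.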

Checkpointing and monitoring

Training runs for weeks/months - things will go wrong

Checkpointing: Save model state every N steps
Monitoring: Track loss, gradients, activation statistics
Debugging: If loss spikes or diverges, roll back to last good checkpoint
Failures: Hardware failures, out-of-memory errors, network issues
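
One simple way to turn "roll back if loss spikes" into a rule is to compare each step's loss with a moving average. This is a sketch under assumptions: the spike threshold, window size, and checkpoint interval are arbitrary, the function name is hypothetical, and real runs usually watch gradient norms and other statistics as well.

```python
def find_rollback_point(loss_history, checkpoint_every=100, spike_factor=2.0, window=50):
    """Return the latest checkpoint step before the first loss spike, or None.

    A spike is a loss more than spike_factor times the average of the
    previous `window` steps. Checkpoints are assumed to exist at every
    multiple of checkpoint_every.
    """
    for step in range(window, len(loss_history)):
        recent = loss_history[step - window:step]
        baseline = sum(recent) / window
        if loss_history[step] > spike_factor * baseline:
            # Last checkpoint strictly before the spike step
            return ((step - 1) // checkpoint_every) * checkpoint_every
    return None


# Synthetic loss curve: flat at 2.0 with one spike at step 250
history = [2.0] * 250 + [9.0] + [2.0] * 10
print(find_rollback_point(history))
```

For this synthetic curve the spike at step 250 points back to the checkpoint at step 200.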

This is unglamorous - but it's what makes it all possible.

Part 3: Scaling Laws

The scaling hypothesis

Observation: More compute + more data + bigger models = better performance

But how much better?

Empirical finding (Kaplan et al., 2020):

Loss scales predictably with model size, dataset size, and compute
Power law relationship: $L \propto C^{-\alpha}$, where $C$ is compute

From the paper "Scaling Laws for Neural Language Models"
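
A power law is a straight line on log-log axes, so the exponent can be recovered with an ordinary least-squares fit on the logs. The data below is synthetic, generated from a made-up law L = 10 · C^(-0.05); it is not taken from the paper.

```python
import math


def fit_power_law(compute, loss):
    """Fit loss = k * compute^(-alpha) by linear regression in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    k = math.exp(mean_y - slope * mean_x)
    return k, -slope  # alpha is the negated slope


# Synthetic points from a made-up law (assumption, not measurements)
compute = [10.0 ** e for e in range(15, 24)]
loss = [10.0 * c ** -0.05 for c in compute]

k, alpha = fit_power_law(compute, loss)
print(f"k = {k:.3f}, alpha = {alpha:.3f}")
```

A small exponent like this means each 10x increase in compute cuts the loss by only about 11% (10^(-0.05) ≈ 0.89), which is the "predictable but with diminishing returns" shape.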

Kaplan scaling laws (2020)

Key findings:

Model size matters most: Bigger models are more sample-efficient
Data and compute trade off: You can get the same performance with more data + smaller model, or less data + bigger model
Smooth scaling: No discontinuities or surprises (at least in terms of loss)

Chinchilla scaling laws (2022)

Old wisdom (GPT-3 era): large models, modest data
New wisdom (Chinchilla): balance model size AND data size for a fixed compute budget
Evidence: Chinchilla (70B params, 1.4T tokens) beats Gopher (280B params, 300B tokens) at roughly the same compute
Implication: GPT-3 was undertrained - race shifted from "biggest model" to "best training recipe"

Why is there an optimal balance?

If you had 10x the compute budget, where should you spend it - model or data?

Loss from training a model with $N$ parameters on $D$ tokens:

$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$

$E$ = irreducible loss. Even perfect prediction can't eliminate language's inherent entropy.
$A / N^{\alpha}$ = model-size term. More parameters, lower loss. Diminishing returns.
$B / D^{\beta}$ = data-size term. More tokens, lower loss. Also diminishing returns.

Two knobs. Each attacks a different term.
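
The formula can be evaluated directly to see the trade-off. This sketch makes two assumptions that are not on the slide: the constants are the fitted values reported in the Chinchilla paper (E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28), and training compute is approximated as C ≈ 6·N·D FLOPs. `best_split` is a hypothetical helper that grid-searches model size for a fixed budget.

```python
def chinchilla_loss(n_params, n_tokens, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """L(N, D) = E + A / N^alpha + B / D^beta, with assumed fitted constants."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


def best_split(compute_flops, n_grid=200):
    """Grid-search N for a fixed budget, assuming C ~ 6 * N * D."""
    best = None
    for i in range(n_grid + 1):
        n_params = 10 ** (8 + 4 * i / n_grid)        # 1e8 .. 1e12 parameters
        n_tokens = compute_flops / (6 * n_params)    # spend the rest on data
        loss = chinchilla_loss(n_params, n_tokens)
        if best is None or loss < best[0]:
            best = (loss, n_params, n_tokens)
    return best


chinchilla = chinchilla_loss(70e9, 1.4e12)
gopher = chinchilla_loss(280e9, 300e9)
print(f"Chinchilla-like (70B, 1.4T):  {chinchilla:.3f}")
print(f"Gopher-like (280B, 300B):     {gopher:.3f}")

budget = 6 * 70e9 * 1.4e12
loss, n_params, n_tokens = best_split(budget)
print(f"best split at this budget: N = {n_params:.2e}, D = {n_tokens:.2e}, loss = {loss:.3f}")
```

With these constants the smaller model trained on more tokens gets the lower predicted loss, and the search lands at an interior optimum: push N too high and the data term dominates, push D too high and the model term dominates. The exact split depends on the fitted constants, so it need not match the ~20 tokens-per-parameter rule of thumb.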

IsoFLOP curves - how Chinchilla was perfected

Where do major models fall relative to Chinchilla?

Params vs tokens scatter plot: where do major models fall relative to the Chinchilla-optimal line?

The data wall

Scaling laws assume unlimited data. We're nearly out.

Models have trained on essentially all publicly available text: Common Crawl, Wikipedia, books, code, forums
The Chinchilla rule says a 7B model needs 140B tokens. GPT-4-scale models need trillions - we've used them.
More compute doesn't help if there's no new data to train on

The proposed solution: synthetic data

Use existing models to generate new training data
LLaMA-3, Phi-3, and others already rely on this heavily

The question: Does synthetic data preserve quality? Or do errors and biases amplify?

"Model collapse" (Shumailov et al., 2023): quality degrades when models train on their own outputs repeatedly - errors and biases compound across generations

Emergent abilities

Something unexpected: capabilities that suddenly appear at scale

Small models can't do arithmetic, large models can
Small models can't do few-shot learning, large models can
Chain-of-thought reasoning emerges around 60B-100B parameters
True phase transitions, or just crossing a usefulness threshold?
Caveat: discrete (0/100%) metrics make smooth improvement look like sudden jumps

Wei et al. (2022), "Emergent Abilities of Large Language Models"

Wait - are emergent abilities real?

Schaeffer et al. (2023): "Are Emergent Abilities a Mirage?"

The finding: switch the metric, and the phase transitions largely disappear

Discrete metric: "Did the model get this exactly right?" - 0% or 100%. Small model: 0%, large model: 80%, looks like a sudden jump.
Continuous metric: "How many digits of the answer are correct?" shows smooth improvement across all model sizes. No jump.

The phase transition is in the metric, not the model
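
A toy simulation of the argument. It assumes a made-up, perfectly smooth per-digit accuracy curve and treats digits as independent, so exact-match accuracy on a 10-digit answer is the per-digit accuracy raised to the 10th power. None of the numbers are measurements.

```python
def per_digit_accuracy(log10_params):
    """Hypothetical smooth curve: rises linearly from 0.50 to 0.99 as size grows."""
    t = (log10_params - 7) / (12 - 7)
    return 0.5 + 0.49 * min(max(t, 0.0), 1.0)


def exact_match(log10_params, n_digits=10):
    """All digits must be right; assumes digits are independent."""
    return per_digit_accuracy(log10_params) ** n_digits


for size in range(7, 13):
    print(
        f"1e{size} params: per-digit {per_digit_accuracy(size):.2f}, "
        f"exact-match {exact_match(size):.3f}"
    )
```

The per-digit column climbs by the same amount at every step, while the exact-match column sits near zero for the small models and then shoots up. Same model, same improvement, different metric.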

Why this matters for AI safety:

If emergence is real: we might be blindsided by sudden dangerous capability jumps
If it's a measurement artifact: scaling is more predictable than we thought
The debate is unsettled, and it changes how you think about risk

Part 4: Activity - Design Your Training Run

Activity: Design your training run

The scenario: Your lab has $10 million in compute budget. Your goal: build a model that achieves a passing score on the LSAT - trained from scratch, no fine-tuning of existing models.

With a partner (5 min):

Dataset: What text would you train on? Estimate how many tokens you could collect.
Model size: Chinchilla rule: ~20 tokens per parameter. What size does your dataset imply?
Compute check: Look up current H100 cloud pricing (~$2-4/hr per GPU on Lambda Labs or AWS). Does $10M cover your training run?

Be ready to share your numbers.
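
If you want to sanity-check your numbers, here is one hedged way to do the estimate. It assumes training compute of about 6·N·D FLOPs, an H100 dense BF16 peak of roughly 1e15 FLOP/s, 40% utilization, and $3 per GPU-hour (the middle of the range above). It covers a single idealized training pass only; failed runs, experiments, data processing, and salaries are not included. `training_run_estimate` is a hypothetical helper.

```python
def training_run_estimate(
    n_tokens,
    tokens_per_param=20,      # Chinchilla rule of thumb from the activity
    peak_flops=1e15,          # assumed approximate H100 dense BF16 peak
    utilization=0.4,          # assumed fraction of peak actually achieved
    usd_per_gpu_hour=3.0,     # assumed cloud price
):
    """Estimate model size, GPU-hours, and cost for one idealized training pass."""
    n_params = n_tokens / tokens_per_param
    total_flops = 6 * n_params * n_tokens          # C ~ 6 * N * D
    gpu_hours = total_flops / (peak_flops * utilization) / 3600
    return {
        "n_params": n_params,
        "total_flops": total_flops,
        "gpu_hours": gpu_hours,
        "cost_usd": gpu_hours * usd_per_gpu_hour,
    }


# Example: the 140B-token dataset that the Chinchilla rule pairs with a 7B model
estimate = training_run_estimate(140e9)
for name, value in estimate.items():
    print(f"{name}: {value:.3g}")
```

Because N is tied to D by the tokens-per-parameter rule, compute grows with the square of the dataset size: 10x more tokens means about 100x more FLOPs.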

Activity debrief

What did people find? Token count, implied model size, estimated compute cost.

The twist: compute is not the bottleneck.

High-quality legal text (court opinions, casebooks, LSAT prep) is probably 1-10 billion tokens.
Chinchilla-optimal for 5B tokens: ~250M parameters.
Training cost: roughly $10-50K. You have $9.95 million left over.

The bigger question: Would a 250M-parameter model trained from scratch on legal text outperform GPT-4 with a good system prompt? Probably not - which raises a question for Wednesday: what if you fine-tuned an existing model on that same legal corpus?

Who can afford to train LLMs?

At $5-100 million per training run:

Big tech companies (OpenAI/Microsoft, Google, Meta, Anthropic)
Well-funded startups (Cohere, Inflection, Mistral)
Large research labs (DeepMind, Allen AI, EleutherAI with donations)
Not: Most universities, small companies, researchers, or countries

This concentrates power: who trains the models decides what they can do, whose values they encode, and who gets access. Most researchers must use APIs from the same handful of companies.

Plot twist: DeepSeek-R1 (January 2025)

DeepSeek, a Chinese AI lab, released a frontier-quality model for ~$6 million

Competitive with GPT-4 on reasoning and coding benchmarks
US export controls blocked access to H100 GPUs - they used H800s, an export-compliant H100 variant with reduced interconnect bandwidth
Constraint forced efficiency: distillation, RL without human labels, mixture-of-experts
- MoE: only a fraction of parameters activate per token - effective compute much lower than total param count

The caveats:

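$6M = compute only, and it is the reported cost of the final training run for the DeepSeek-V3 base model that R1 builds on. Salaries, data, failed runs, and the cost of the teacher model they distilled from aren't included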
They had access to outputs from much more expensive models for distillation
But even with all that: the efficiency gap with frontier US labs is real and significant

Does this change who can train LLMs? Or does it just change what "affordable" means?

DeepSeek: the deeper questions (skip if short on time)

Distillation: DeepSeek reportedly trained on outputs from GPT-4 and Claude
- You can absorb an expensive model's knowledge without paying for it
- Raises questions about licensing, competitive moats, and who "owns" learned capabilities
Chip restrictions: Did export controls fail? Or create just enough friction?
- Being denied H100s forced efficiency innovations that might not have happened otherwise
- The bottleneck may shift from hardware to algorithmic know-how - harder to restrict

Part 5: Ethics Spotlight

The real costs of scale

We've covered environmental and data ethics before. Quick recap:

Carbon: GPT-3 training ~550 tons CO₂ (one-time). Inference at scale is the ongoing cost.
Copyright: Scraped without permission. Lawsuits from authors (Sarah Silverman), artists, programmers.
Bias: Encoded in data choices before any intentional decisions - starting with Reddit upvotes.

The part we haven't talked about: where does the infrastructure go?

Case study: New Brunswick, NJ (February 2026)

A community just stopped an AI data center:

Proposed: 27,000 sq ft facility at 100 Jersey Avenue in New Brunswick, NJ
City Council voted unanimously to cancel it on Feb 19, 2026
Concerns: electricity costs, water consumption, noise, neighborhood impact
"We don't want these kinds of centers that's going to take resources from the community." - Bruce Morgan, president of the New Brunswick NAACP

Site will instead host 600 apartments (10% affordable housing), startup warehouse space, and a public park
Context: NJ residents have seen significant electric bill increases partly due to existing data center operations
Rutgers University is in New Brunswick - students were among those who packed City Hall

Discussion: Is there a sustainable path forward?

Should we slow down LLM scaling given environmental costs?
How can we make LLM training more accessible and democratic?
What regulations (if any) should exist for training data sourcing?

Wrap-up: Key takeaways

Scale changes everything: LLMs aren't just bigger models, they're different engineering challenges
Training is expensive: $5-100 million, weeks to months, thousands of GPUs
Scaling is predictable: More compute + more data = better performance (with diminishing returns)
Chinchilla insight: Balance model size and data size for compute-optimal training
Ethics matter: Environmental impact, data sourcing, concentration of power

Looking ahead

Next lecture (Wednesday):

Post-training: What happens after pre-training?
Instruction tuning: Making models follow instructions
RLHF: Reinforcement learning from human feedback
Alignment: Whose values? How do we ensure safety?

Due Wednesday:

Portfolio piece peer reviews
You can expect exam grades back

Due Friday:

Reflections
Course survey
Participation self-assessment
I'll ask you to decide about oral re-exams

Lauren's CDS593 Materials