Lecture 12 - Fine-tuning Strategies

Welcome back

Last time (Monday): LLM landscape - choosing the right model

Today: Adapting models to your needs through fine-tuning

Looking ahead: Prompt engineering, safety, RAG, agents

Ice breaker

What's something you've changed your mind about in the last year?

Agenda for today

  1. The adaptation spectrum (when to fine-tune)
  2. Fine-tuning basics
  3. Parameter-Efficient Fine-Tuning (PEFT)
  4. Activity: Find an adapter
  5. Safety considerations

Part 1: The Adaptation Spectrum

The problem: General models don't fit specific needs

Foundation models are trained on broad data

But you need:

  • Domain-specific knowledge (legal, medical, etc.)
  • Your company's writing style
  • Behavior on specific tasks
  • Access to private data

Question: How do we adapt general models to specific needs?

The adaptation spectrum

Option 1: Just use the API (zero-shot)

Option 2: Prompt engineering (few-shot)

Option 3: RAG

Option 4: Fine-tuning

Option 5: Train from scratch

Each has trade-offs in cost, effort, performance, and control

The adaptation spectrum

Approach | Pros | Cons | When to use
API (zero-shot) | No setup, SOTA performance | Per-token cost, no customization | Low volume, getting started
Prompt engineering | Easy, no training needed | Context window limits, inconsistent | Have good examples, task fits context
RAG | Fresh data, no retraining | Needs retrieval infrastructure | Data changes frequently, factual Q&A
Fine-tuning | Consistent, no prompt overhead | Needs data, compute, expertise | Specific style/domain, high volume
Train from scratch | Full control | $10M+, months of work | Google, Meta, OpenAI

Focus today: fine-tuning. RAG and prompt engineering are coming soon.

Cost comparison over time

Draw on the board:

  • API calls: Linear growth (cost per query)
  • Prompt engineering: Slightly higher per query (more tokens)
  • Fine-tuning: High upfront cost, then flat (hosting) or per-query (API)
  • Training from scratch: Massive upfront, then flat

Fine-tuning has upfront cost, but saves money at scale
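
The curves above can be sketched numerically. All prices below are illustrative assumptions for the exercise, not real quotes:

```python
# Break-even sketch: at what volume does fine-tuning beat the API?
# Every price here is an assumption chosen for illustration.
api_cost_per_query = 0.002        # assumed per-query API cost (prompt + completion)
finetune_upfront = 500.0          # assumed one-time fine-tuning cost
hosted_cost_per_query = 0.0005    # assumed per-query cost of serving the tuned model

def total_cost_api(n_queries):
    """Linear growth: cost scales with every query."""
    return api_cost_per_query * n_queries

def total_cost_finetuned(n_queries):
    """High upfront cost, then cheaper per query."""
    return finetune_upfront + hosted_cost_per_query * n_queries

# Volume at which the two lines cross
break_even = finetune_upfront / (api_cost_per_query - hosted_cost_per_query)
print(f"Break-even at ~{break_even:,.0f} queries")
```

Under these made-up prices, fine-tuning wins somewhere in the low hundreds of thousands of queries; with different prices the crossover moves, but the shape of the argument stays the same.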

Think-pair-share: Which option?

Scenario: You're building a chatbot to answer FAQs about your university's course catalog (100+ courses, enrollment rules, degree requirements)

Question: Which adaptation approach? Why?

Turn to your neighbor (2 min)

When prompting runs out of steam

Task: Customer service emails in your company's exact voice

Prompt: "Write a shipping delay apology in a warm, friendly tone."

Attempt 1 - Zero-shot: Generic. Might not match brand voice.

Attempt 2 - Few-shot (3 examples in prompt): Starts getting better.

Attempt 3 - Many examples (30+ in prompt): Context window fills up. Tokens get expensive. Still inconsistent.

At this point, fine-tuning pays off. It bakes the examples into the weights - no prompt overhead, consistent every time.

Decision framework: When to fine-tune

Fine-tune when:

  • Task-specific knowledge not in base model
  • Specific style or format required (and constrained output at inference time does not suffice)
  • High volume (cost-effective at scale)
  • You have quality training data

Don't fine-tune if:

  • Base model already works well (just prompt it!)
  • You have < 100 examples
  • Data/knowledge changes frequently
  • Low volume use case

Rule of thumb: Try prompting first, fine-tune if needed

Part 2: Fine-tuning Details

Wait, didn't we already do this?

In Lecture 10, we covered supervised fine-tuning (SFT) as part of post-training.

That SFT was: base model + human-written instruction-response pairs = a model that can follow instructions.

Today's fine-tuning is different. We're starting from a model that already works as an assistant - and specializing it for a particular job.

  • Post-training SFT: general capability (base model becomes useful assistant)
  • Task fine-tuning: specific capability (useful assistant becomes expert at your task)

How fine-tuning works

Start with pre-trained model

Already knows language, reasoning, world knowledge

Continue training on your specific data

Much less data needed (100s-10,000s of examples vs billions of tokens in pre-training)

Model adapts to your task

What you need for fine-tuning

Training data: Input-output pairs for your task

Compute: GPU access (can rent from cloud)

Tooling: Hugging Face transformers and PEFT packages, OpenAI fine-tuning API, etc.

Evaluation plan: How to measure success
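
To make "input-output pairs" concrete: one common format is chat-style JSONL, one example per line (roughly the shape OpenAI's fine-tuning API expects). The filename and content below are invented for illustration:

```python
import json

# Made-up training examples in chat-style JSONL format.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a support agent for Acme Co."},
        {"role": "user", "content": "My order is late."},
        {"role": "assistant", "content": "I'm so sorry about the delay! Let me check on that for you."},
    ]},
]

# One JSON object per line is what "JSONL" means.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```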

Use cases for fine-tuning

Style transfer: Match your brand voice

Domain adaptation: Medical, legal, technical writing

Task-specific: Summarization, translation, Q&A

Format control: Structured outputs (JSON, SQL), often combined with constrained output

Behavior modification: More concise, more detailed, etc.

Catastrophic forgetting

Problem: Fine-tuning can erase general knowledge

Example:

  • You fine-tune on medical Q&A
  • Model becomes great at medicine
  • But now it's bad at general knowledge!

Why? Model overwrites weights, "forgets" pre-training

Solutions: Smaller learning rates, mixing in general data, PEFT methods

Overfitting in fine-tuning

A related risk: Memorizing training data instead of learning patterns

Symptoms:

  • Perfect on training data, bad on new examples
  • Repeats exact phrasing from training
  • Doesn't generalize

Solutions: More data, regularization, early stopping, validation set

Fine-tuning costs (as of early 2026)

OpenAI-managed fine-tuning (GPT-4o mini):

  • Cheapest option for API-based fine-tuning
  • Training: ~$3 per 1M tokens; inference cheaper than base GPT-4o

Self-hosting an open model with LoRA:

  • GPU rental: $1-$3/hour
  • Fine-tune a 7B model in 1-4 hours: total cost often under $10

Part 3: Parameter-Efficient Fine-Tuning (PEFT)

The problem with full fine-tuning

Full fine-tuning: Update all model parameters

For GPT-3.5 (175B parameters):

  • Requires storing full model copy for each task
  • Need massive GPU memory
  • Risk of catastrophic forgetting

Question: Can we get most benefits with less cost?

PEFT: Parameter-Efficient Fine-Tuning

Most model behavior comes from pre-training. You only need to adjust a little bit.

Idea: Freeze most parameters, train a small number

Result: 1000x fewer trainable parameters

Benefits: Cheaper, faster, less forgetting

LoRA: Low-Rank Adaptation

Most popular PEFT method

Instead of updating weight matrix W:

  • Add two small matrices: B (d×r) and A (r×d)
  • W_new = W + BA
  • W is frozen, only B and A are trained

Why "low-rank"? r is much smaller than d - the bottleneck is what makes it cheap
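
A toy sketch of the update above in pure Python, with tiny dimensions (d=4, r=1) just to show the mechanics; the numbers are arbitrary:

```python
# Minimal LoRA mechanics: frozen weight W plus a low-rank update BA.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

d, r = 4, 1
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen (identity here)
B = [[0.5] for _ in range(d)]   # d x r, trained
A = [[0.1, 0.2, 0.3, 0.4]]      # r x d, trained

BA = matmul(B, A)               # d x d update built from only d*r + r*d numbers
W_new = [[W[i][j] + BA[i][j] for j in range(d)] for i in range(d)]
```

The rank-r bottleneck means BA is a full d×d matrix, but it is parameterized by far fewer trainable numbers than W itself.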

Quick calculation (what's the real savings?)

An attention weight matrix W that's 4096 by 4096 has 16 million parameters

Question: If you could only update W using a rank-8 approximation (two thin matrices that multiply together to give a 4096×4096 result), how many total numbers would you need?

Work it out with your neighbor (1 min)
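
The arithmetic, worked out:

```python
# Full matrix vs rank-8 LoRA update for a 4096x4096 weight.
d, r = 4096, 8
full_params = d * d              # 16,777,216 (~16M)
lora_params = d * r + r * d      # B is d x r, A is r x d: 65,536
ratio = full_params / lora_params
print(full_params, lora_params, ratio)  # 16777216 65536 256.0
```

A rank-8 update needs 256x fewer numbers than updating the full matrix.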

LoRA intuition

Why does this work?

Hypothesis: The changes needed for fine-tuning are low-rank

  • Most dimensions don't need adjustment
  • Only a few directions of change matter
  • Low-rank matrices capture those key directions

Empirically: Works very well in practice!

LoRA benefits

Efficiency: Train 0.1% of parameters instead of 100%

Speed: Much faster training

Memory: Can fine-tune on smaller GPUs

Storage: Adapters are tiny (1-10MB vs 350GB)

Multi-task: Load different adapters for different tasks

LoRA in practice

Using Hugging Face PEFT library:

from transformers import AutoModelForCausalLM
from peft import get_peft_model, LoraConfig

# Load base model
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Configure LoRA
lora_config = LoraConfig(
    r=8,  # rank (bottleneck size)
    lora_alpha=16,  # adapter weight / importance
    target_modules=["q_proj", "v_proj"],  # which layers (usually attention)
    lora_dropout=0.1,
)

# Wrap with LoRA
model = get_peft_model(model, lora_config)

# Train as usual (trainer is a standard transformers Trainer set up elsewhere)
trainer.train()

# Save just the adapter (tiny file!)
model.save_pretrained("my_lora_adapter")

The adapter ecosystem

Hugging Face Hub has 100,000+ LoRA adapters (as of early 2026)

A few examples with file sizes:

  • Medical domain adapter for Mistral 7B: 4 MB
  • SQL generation adapter for LLaMA 3 8B: 8 MB
  • Customer service tone adapter: 3 MB

For comparison: Base LLaMA 3 8B model = 14 GB

From one foundation model, you can get many specialized models, swapping adapters in milliseconds

Other PEFT methods (briefly)

Prefix tuning: Add trainable prefix tokens to each layer

Adapter layers: Insert small trainable layers between frozen layers

Prompt tuning: Train soft prompts (embedding vectors, not tokens)

All share the same goal: Freeze most of the model, train a small part

LoRA is most popular due to simplicity and effectiveness

Full fine-tuning vs LoRA comparison

Metric | Full fine-tuning | LoRA
Parameters trained | 100% (175B) | 0.1% (175M)
GPU memory | 350GB | 20GB
Training time | Days | Hours
Storage per task | 350GB | 10MB
Catastrophic forgetting | High risk | Low risk
Performance | Slightly better | Nearly as good

LoRA is 99% as good at 1% of the cost

Going further: QLoRA (if time)

Problem: Even LoRA requires loading the full base model

LLaMA 3 8B at 16-bit precision = ~16GB GPU memory. Needs an A100.

QLoRA (Dettmers et al., 2023): quantization plus LoRA

  1. Load base model in 4-bit precision (reduces 14GB to ~5GB)
  2. Train LoRA adapters at normal precision (same as before)
  3. Result: fine-tune 7B+ models on a single consumer GPU

Full training also needs optimizer states, gradients, and activations, pushing a 7B model to 60-80GB total. QLoRA's real win is bringing that down to ~10-16GB to fit on one GPU (e.g. a gaming PC).
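
The recipe above can be sketched as a configuration with the transformers, peft, and bitsandbytes packages. This is a config sketch, not run here (it would download the full model), and the model id is one plausible choice:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Step 1: 4-bit base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4, from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,   # do matmuls in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)

# Step 2: LoRA adapters stay at normal precision and are the only trained part
model = get_peft_model(model, LoraConfig(r=8, target_modules=["q_proj", "v_proj"]))
```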

Standard practice now: most small-team fine-tuning uses this

If you want to fine-tune for your final project, Google Colab + QLoRA is a solid plan.

Part 4: Activity - Find an Adapter

Find an adapter for your scenario

Each group gets one scenario. Browse huggingface.co/models?other=lora and find the best adapter you can for your use case (~5 min).

Report back:

  • Which adapter did you pick? What base model does it use?
  • What does the model card say about training data?
  • What's missing? Do you trust it? What would make you nervous about deploying it?

Scenario 1 (legal): A law firm needs to extract key clauses and flag risks in contracts. Legal language is highly specialized.

Scenario 2 (SQL): Analysts need to query a database using plain English. The system must return valid SQL, every time.

Scenario 3 (math tutoring): A tutoring platform needs to walk students through algebra and calculus problems step by step, showing work and explaining each move.

Scenario 4 (medical): A clinical tool to suggest follow-up tests based on patient symptoms. Very high stakes.

Scenario 5 (multilingual): Customer support for an e-commerce platform serving users in English, Spanish, French, German, and Japanese.

Scenario 6 (financial): Extract key figures and risk factors from earnings reports and SEC filings.

What did we notice?

Domain knowledge not in base model: strong case for fine-tuning

Fluency is not accuracy: a model can explain a wrong answer very clearly (math tutoring is a hard case)

Guaranteed output format: constrain at inference time, or fine-tune, or both

Frequently changing info: fine-tuning won't help; we need RAG (coming soon)

Model cards matter: training data, coverage, and known limitations are all your problem once you deploy

Part 5: Safety in Fine-tuning

Fine-tuning can undo safety training

When you were browsing adapters, I asked "would you trust it?"

Remember: Base models are post-trained for safety (RLHF, Constitutional AI)

Fine-tuning can overwrite this!

In fact, some people intentionally fine-tune to remove safety guardrails ("uncensored models").

"With power comes responsibility." If you fine-tune, you're responsible for the model's behavior.

How fragile is safety training?

If RLHF takes thousands of hours of human feedback to instill safety...

How many fine-tuning examples would it take to undo it?

A) Tens of thousands B) Thousands C) Hundreds D) About 100

How fragile is safety training?

Research finding (Yang et al., 2023 "Shadow Alignment"):

Fine-tuning on ~100 harmful examples significantly degraded safety guardrails in LLaMA models

The asymmetry: Months of alignment training, undone in hours

Why? Alignment suppresses harmful outputs - it doesn't erase the knowledge. Fine-tuning can shift the distribution back.

Discussion (1-2 min if we have time):

  • Who's responsible when someone fine-tunes an open model to remove safety guardrails?
  • Does this change how you think about open vs. closed model debates from Monday?

Your responsibility when fine-tuning

You own the model's behavior after fine-tuning

  • Test for safety issues, biases, harmful outputs
  • Red-team your fine-tuned model
  • Consider: Do you need custom safety training?

We'll cover safety and red-teaming in detail next week

Evaluation is critical

Don't just look at task performance!

Evaluate:

  • Task accuracy (did it learn what you wanted?)
  • Generalization (works on new examples?)
  • General knowledge (did it forget other capabilities?)
  • Safety (does it refuse harmful requests?)
  • Bias (fair across demographics?)

Use a held-out test set, not training data!
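
A minimal sketch of carving out that held-out set before any training happens, with a leakage check (the data here is made up):

```python
import random

# Made-up examples; in practice these are your input-output pairs.
examples = [{"input": f"q{i}", "output": f"a{i}"} for i in range(100)]
random.Random(0).shuffle(examples)   # fixed seed so the split is reproducible

# 80/20 train/test split, done once, before training
split = int(0.8 * len(examples))
train_set, test_set = examples[:split], examples[split:]

# Guard against leakage: no test input may appear in the training set
train_inputs = {ex["input"] for ex in train_set}
leaked = [ex for ex in test_set if ex["input"] in train_inputs]
```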

Data quality and model freshness

Garbage in, garbage out - more so with fine-tuning

Your fine-tuned model will faithfully reproduce patterns in your training data, including mistakes.

Common pitfalls:

  • Inconsistent labels (same input, different outputs)
  • Poor coverage (edge cases not represented)
  • Test data leaked into training

Data quality matters more than data quantity past a certain threshold

Your fine-tuned model also has a shelf life

  • Adapters are tied to a base model at a point in time
  • Frequently changing knowledge (prices, inventory, recent events) doesn't belong in weights. Use RAG instead

What we've learned today

  • Adaptation spectrum: prompting to fine-tuning to training
  • Fine-tuning adapts pre-trained models to specific tasks
  • LoRA makes fine-tuning efficient (0.1% of parameters)
  • Try prompting first, fine-tune when needed
  • For structured output: constrain at inference time if your runtime supports it; fine-tune when you need portability or a complex schema
  • Data quality matters more than quantity; fine-tuned models go stale when base models update
  • Fine-tuning brings responsibility for safety

Looking ahead

Due Sunday: Week 8 Lab

And start thinking about projects/groups!

  • Monday: Prompt engineering and prompt injection
  • Wednesday: Safety, alignment, and red-teaming
  • Week 10: RAG - combining retrieval with generation