WEEK 9: Prompt Engineering and Safety

This week covers two topics that are deeply connected: how to get LLMs to do what you want (prompt engineering), and what happens when someone tries to make them do things they shouldn't (prompt injection, jailbreaking, alignment failures). Monday teaches systematic prompt engineering - the techniques that separate casual users from skilled practitioners. Wednesday goes deeper on safety: red-teaming, alignment challenges, and responsible deployment. You'll come away understanding both how to wield these models effectively and what makes them hard to control.

This week's checklist

  • Attend your oral exam time, if applicable
  • Attend Lecture 13 (Mon, Mar 23): Prompt Engineering and Prompt Injection
  • (No discussion this week)
  • Attend Lecture 14 (Wed, Mar 25): Safety, Alignment, and Red-Teaming
  • Submit Week 9 Reflection + Project ideation (due on Gradescope by Sun, Mar 29 by 11:59pm)

This week's learning objectives

After Lecture 13 (Mon Mar 23) students will be able to...

  • Apply core prompting principles: specificity, context, examples, output format
  • Design effective few-shot examples and know how many to use
  • Implement chain-of-thought prompting and explain why it helps reasoning tasks
  • Identify when zero-shot, few-shot, or chain-of-thought is the right approach
  • Explain prompt injection (direct and indirect) and why it's hard to defend against
  • Describe basic mitigation strategies: input sanitization, output filtering, instruction hierarchy

After Lecture 14 (Wed Mar 25) students will be able to...

  • Define the alignment tax: making models safer often makes them less capable
  • Explain jailbreaking: how roleplay, hypotheticals, and encoding bypass safety guardrails
  • Design a basic red-teaming protocol for an LLM application
  • Engage with value alignment questions: whose values, how to handle cultural disagreement
  • Describe responsible disclosure practices when finding LLM vulnerabilities
  • Distinguish between safety (preventing harm) and alignment (matching human values) as separate challenges

Discussion Section (Tue Mar 24): Red-Teaming Exercise

Discussion is cancelled this week due to a timing conflict.

Week 9 Reflection + Project Ideation

Due: Sunday, March 29 by 11:59pm

Weight: Counts as part of the completion-based tasks category. Graded for completion.

No lab this week. Two parts: a short reflection on course content, and your first project deliverable, both due on Gradescope.

Part 1: Reflection (200-300 words)

Some prompts to consider (you don't need to address all of them):

  • What surprised you about prompt injection or jailbreaking? Were there techniques that seemed obviously exploitable? Are there obvious defenses that weren't implemented?
  • The code/data separation problem (everything is tokens) is fundamentally different from traditional software security. Do you think this is a solvable problem, or something we'll always be managing?
  • If you were deploying an LLM for a real application, what safety measures would you implement? What would you still be worried about?
  • The Character.AI case and the alignment tax represent two failure modes: too little safety and too much. Which failure mode worries you more, and why?

Write in your own voice, without AI assistance.

Part 2: Project Ideation

Submit 2 project ideas. No commitment yet, this is to get you thinking early and let us flag scope issues before you're invested. The Gradescope assignment will walk you through these questions for each idea (with an optional open box if you have a third).

For each idea, answer:

  1. What problem are you solving, and for whom? Describe a real task or pain point. Be specific: "summarizing legal contracts for paralegals" not "using AI for law."
  2. What technique(s) would you use? Pick from what we've covered or will cover: prompting, fine-tuning, RAG, agents, or a combination. Why does that approach fit your problem better than the alternatives?
  3. What data or resources would you need? What model would you start from? Is there a dataset you'd use, or would you need to collect/create one? Are there access or cost constraints?
  4. What's your biggest open question or risk? What might not work? What would you need to figure out first?

Finally: Are you working solo or in a group? If group, list members. If looking for a partner, say so and we'll help match people.

Resources for further learning

Prompt engineering

Security and safety

Alignment and red-teaming