Final Project Guide
The final project is where you put it all together. You'll build something real with LLMs, evaluate it honestly, and present it to the class. It's 30% of your grade and the biggest single thing you'll produce in this course.
You can work solo or in a team of 2-3. Solo is totally fine; most teams will be pairs, and three-person teams should expect a higher bar for scope and complexity (more on that below). A wide range of topics is acceptable; the only requirements are that the project meaningfully use LLMs and involve something you can actually evaluate.
Deliverable timeline
| Due | What | Details |
|---|---|---|
| Sun Mar 29 | Ideation | 2-3 project ideas + team confirmation |
| Sun Apr 12 | Abstract | 200-300 words committing to a direction |
| Mon Apr 13 | Project clinic | Come with your abstract and questions |
| Sun Apr 19 | Readiness check | Confirm data, compute, and repo are in place |
| Sun Apr 26 | Progress check-in | 300 words + repo showing work in progress |
| Mon & Wed, Apr 27 & 29 | Presentations | 8-10 min + Q&A |
| Fri May 1 | Final write-up | Report + code repo |
All intermediate deliverables are graded for completion only (using the usual late penalties). Full descriptions for each are in the relevant week guides.
Scope expectations by team size
Solo projects are more targeted. Pick one technique, apply it well, evaluate it thoroughly. You don't need a polished UI or a multi-component system. Focus on depth over breadth.
Pair projects (most common) should feel like building out an application. Two people means you can go deeper on evaluation, compare more approaches, or build a more complete system.
Three-person projects carry a higher expectation for scope and complexity. If three people could have done the same project as a pair, the scope wasn't ambitious enough. Documentation must include a clear division of labor. Each person's contribution should be individually substantial.
For group projects, include a brief statement of who contributed what. I will also ask students on teams to comment privately on whether there were issues in how the work was divided, and will take this into account during evaluation.
What to build
Projects generally fall into a few categories. I've included examples at different team sizes so you can calibrate scope.
RAG applications
- Solo: Q&A system over a specific corpus (your research papers, a textbook, legal documents). Getting basic retrieval and generation working is just the starting point. Try multiple retrieval strategies (keyword vs. semantic, different chunking approaches), build a golden test set, and rigorously evaluate what works and what doesn't.
- Pair: Add a UI, more complex reasoning over retrieved information, and more focus on safety and security for end users. Evaluate retrieval quality and generation quality separately. Alternatively, test fine-tuning alongside RAG.
- Trio: Full pipeline with access control or multi-user support, systematic error analysis, and a production-readiness assessment.
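"Build a golden test set and rigorously evaluate" can be made concrete with a recall@k harness. A minimal sketch, with a toy corpus and a naive token-overlap scorer standing in for real keyword or semantic retrieval (everything here is an illustrative stand-in, not a recommended retriever):

```python
# Toy recall@k harness for a golden test set of (question, relevant_doc_id) pairs.
# The corpus, questions, and keyword scorer are illustrative stand-ins.

def keyword_score(query: str, doc: str) -> int:
    """Count shared lowercase tokens between query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Return the ids of the top-k docs under the keyword scorer."""
    ranked = sorted(corpus, key=lambda d: keyword_score(query, corpus[d]), reverse=True)
    return ranked[:k]

def recall_at_k(golden: list[tuple[str, str]], corpus: dict[str, str], k: int = 2) -> float:
    """Fraction of golden questions whose relevant doc appears in the top k."""
    hits = sum(1 for q, doc_id in golden if doc_id in retrieve(q, corpus, k))
    return hits / len(golden)

corpus = {
    "d1": "gradient descent updates weights using the loss gradient",
    "d2": "tokenizers split text into subword units",
}
golden = [("how does gradient descent update weights", "d1"),
          ("what do tokenizers split text into", "d2")]
print(recall_at_k(golden, corpus, k=1))  # 1.0 on this toy set
```

Swapping `keyword_score` for an embedding-similarity scorer lets you compare strategies on the same golden set, which is exactly the comparison the solo scope asks for.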
Fine-tuning projects
- Solo: Fine-tune a model for a specific task in an area of interest or research. Compare base vs. fine-tuned performance on a held-out test set. Compare results from different base models, hyperparameter choices, and reflect on design decisions.
- Pair: Compare fine-tuning approaches (full fine-tune vs. LoRA vs. prompt tuning) on the same task, or fine-tune for a harder task that requires careful data curation. Include cost/performance tradeoff analysis. Likely includes a user-facing component.
- Trio: Multi-stage fine-tuning pipeline, or fine-tuning combined with another technique (RAG, agents). Systematic evaluation across multiple dimensions.
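Across all of these, the core comparison (base vs. fine-tuned on a held-out test set) is just an accuracy harness over two model callables. A toy sketch, with lambdas standing in for whatever generate function your framework exposes:

```python
# Illustrative base-vs-fine-tuned comparison. `base_model` and `tuned_model`
# are toy stand-ins for real model generate functions.

def accuracy(model, test_set) -> float:
    """Fraction of (prompt, expected) pairs the model answers exactly."""
    correct = sum(1 for prompt, expected in test_set if model(prompt).strip() == expected)
    return correct / len(test_set)

# hypothetical stand-ins: the "tuned" model has learned one extra pattern
base_model = lambda p: "negative"
tuned_model = lambda p: "positive" if "love" in p else "negative"

test_set = [("I love this movie", "positive"), ("Terrible film", "negative")]
print(accuracy(base_model, test_set), accuracy(tuned_model, test_set))  # 0.5 1.0
```

The same harness extends to multiple base models or hyperparameter settings: run each candidate through `accuracy` on the same held-out set and report the table.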
Agent applications
- Solo: Single-purpose agent (research assistant, code reviewer, data analyst) with tool use and a user interface. Getting an agent to call a tool is just the starting point. Try different prompting strategies, evaluate on concrete tasks with clear success criteria, and reflect on what design decisions made the agent more or less reliable.
- Pair: Multi-step agent with multiple tools, error recovery, and a comparison of different prompting/orchestration strategies. Thoughtful analysis of safety, access issues, and legal risks.
- Trio: Multi-agent system or complex workflow with planning, memory, and evaluation of failure modes.
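The basic tool-use loop behind any of these agents is small: the model emits a structured tool call, the harness dispatches it, and unknown tools are handled rather than crashing. A minimal sketch in which a hard-coded stub stands in for the LLM (the tool registry and stub are hypothetical, not a real agent framework):

```python
# Minimal tool-dispatch loop. `stub_model` stands in for an LLM that
# decides which tool to call; real agents would parse this from model output.

TOOLS = {"add": lambda a, b: a + b}

def stub_model(task: str) -> dict:
    # stand-in for the model's decision; always calls add(2, 3) for math-ish tasks
    if "+" in task:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"tool": "search", "args": {}}  # a tool we haven't registered

def run_agent(task: str):
    call = stub_model(task)
    fn = TOOLS.get(call["tool"])
    if fn is None:
        return f"unknown tool: {call['tool']}"  # simple error recovery, not a crash
    return fn(**call["args"])

print(run_agent("what is 2 + 3?"))   # 5
print(run_agent("look this up"))     # unknown tool: search
```

Evaluating the pair/trio versions means running this loop on concrete tasks with clear success criteria and counting how often the dispatch, recovery, and final answer behave as intended.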
Model architecture projects
- Solo: Train a small language model from scratch on a specific corpus (song lyrics, legal text, a programming language). Experiment with architecture choices: attention variants, positional encoding, tokenization strategy. Evaluate how design decisions affect output quality. It won't rival GPT-5, but you'll learn a lot about what actually matters in the architecture. You may need more compute than you can get on Colab.
- Pair: Systematic comparison of architecture decisions. Train multiple small models with different configurations on the same data and evaluate tradeoffs (quality vs. training cost vs. inference speed). Experiment with training regimes, training set curation, hyperparameters, curriculum learning. Could include teacher-student distillation from a larger model.
Safety and red-teaming projects
- Solo: Build a guardrail or content filtering system for a specific use case. Or: systematic red-teaming of a model for a specific domain (medical advice, legal guidance, financial recommendations) with a taxonomy of failure modes. Or: bias auditing pipeline that detects and measures bias across demographic groups for a specific task, with mitigation strategies implemented and evaluated.
- Pair: Significant experimentation in safety and moderation as an additional component of a larger project (e.g., solo-scoped RAG plus significant safety work).
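For the guardrail option, "implemented and evaluated" means measuring the filter, not just writing it. A toy sketch, with a hypothetical blocklist and a hand-labeled set standing in for a real classifier and a real eval set:

```python
# Hedged sketch of evaluating a guardrail. The blocklist and labeled
# examples are toy stand-ins; the point is the measurement pattern:
# count false positives and false negatives, don't just eyeball outputs.

BLOCKLIST = {"dosage", "prescription"}  # hypothetical medical-advice triggers

def is_flagged(text: str) -> bool:
    return any(word in text.lower() for word in BLOCKLIST)

# (text, should_be_flagged) pairs standing in for a labeled eval set
labeled = [
    ("What dosage of ibuprofen is safe?", True),
    ("Can you renew my prescription?", True),
    ("What's the capital of France?", False),
    ("My doctor prescribed rest", False),  # near-miss that should pass
]

false_neg = sum(1 for text, y in labeled if y and not is_flagged(text))
false_pos = sum(1 for text, y in labeled if not y and is_flagged(text))
print(false_neg, false_pos)  # 0 0 on this toy set
```

A real project would report both error rates on a much larger labeled set, break them down by failure-mode taxonomy, and show how each mitigation moved the numbers.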
These are starting points. The best projects come from your own interests and research areas.
Tips
Scope it right. You have about 3 weeks to build. A focused system that works and is well-evaluated beats an ambitious system that barely runs. If you bite off more than you can chew, you can round out the project with a postmortem of what you'd do differently with more time or compute.
The bar is higher than "it works." Getting a basic proof of concept running is step one, not the finish line. What makes a project strong is what happens after that. Why did you make the design choices you made? What alternatives did you consider? How do you know it's working well, and where does it fall short?
Have a baseline. "My RAG system answers questions" is not an evaluation. "My RAG system answers 73% of questions correctly vs. 41% without retrieval" is. Measure where you're starting from and know where you're going.
Document failures. Knowing what you tried, what happened, and what you'd do differently is, in the long run, worth as much as making things that work. A project that tried three approaches and carefully explains why each failed is stronger than one that tried one approach and got lucky.
Think about who's affected. Who would use this? What could go wrong? What biases might your system have? What ethical challenges give you pause? This may be "just" a class project, but what if it wasn't? How would you feel if what you built was actually deployed?
Iterate. Your first attempt probably won't be your best. Try something, evaluate it, adjust. The write-up should tell the story of that process, not just describe the final artifact.
Rubric overview
Total: 50 points (see the full rubric for detailed criteria at each level)
| Category | Points | What we're looking for |
|---|---|---|
| Scope & Ambition | 10 | Challenging problem, appropriate for team size. This is the main place the team-size expectations show up. |
| Design Decisions | 5 | You considered alternatives and can explain why you made the choices you did. Not just "I used cosine similarity" but "I tried cosine and BM25 and here's why I went with..." |
| Technical Execution | 5 | It works, the code is reasonable, architecture makes sense |
| Use of Course Concepts | 5 | Uses what we learned and makes connections across topics |
| Evaluation & Analysis | 10 | Baselines, metrics, error analysis, honest reporting of what works and what doesn't. Double-weighted for importance. |
| Iteration & Reflection | 5 | What didn't work? What did you try and abandon? What would you do next? |
| Ethics & Limitations | 5 | Who's affected? What could go wrong? What are you not capturing? |
| Documentation & Presentation | 5 | Clear write-up, clear presentation, organized code. |
Proposal and checkpoint deliverables are graded separately for completion (not included in the 50 points above).
Getting unstuck
If you're blocked on data, compute, or scope, flag it in your next deliverable or come to office hours. That's what the check-ins are for.
- Office hours: see the course calendar
- Project clinic: Mon Apr 13 (come with your abstract)
- TA support during discussion section, Week 13