AI Interpretability: Meaning, Methods, and Limits
Graduate Reading Group (UC Berkeley, Spring 2026)
Purpose
The project is an opportunity to pursue a question that emerges from our discussions. It can be empirical, theoretical, or conceptual—what matters is that it engages seriously with the interpretability landscape and produces something that could seed further inquiry.
Teams
Form groups of 2-4 people. Where possible, combine different perspectives and areas of expertise within your team.
Scope
Projects should connect to AI interpretability broadly construed. You might:
- Replicate and extend a method from the readings
- Develop a theoretical framework or formalize an intuition
- Design an empirical study testing an interpretability claim
- Analyze the epistemic status of a class of methods
- Propose and prototype a new technique
- Critically evaluate a theory of change linking interpretability to safety
Ambitious but incomplete work is fine. We’d rather see an interesting question partially answered than a boring question fully resolved.
Finding Project Ideas
Beyond questions that arise from our readings and discussions, two resources may help spark ideas:
- Princeton Interpretability Research Highlights: a curated collection of recent interpretability research organized by theme.
- “Open Problems in Mechanistic Interpretability”: a systematic survey of open questions in the field, useful for identifying tractable problems at various levels of difficulty.
You’re welcome to draw from these or pursue something entirely different.
Timeline
Week 9 (March 20): Submit a one-page project proposal containing:
- Team members and their disciplinary backgrounds
- The question or problem you’re addressing
- Why it matters (connection to course themes)
- Proposed approach
- What success would look like
RRR Week (May 11-15): Final deliverables are due, and the poster session takes place.
Deliverables
Poster for public presentation during RRR Week, open to the Berkeley community.
Online artifact in one of the following formats (your choice):
- Short writeup (4-8 pages)
- Documented code repository with README
- Blog post suitable for a technical audience
- Other format by approval
The artifact should allow someone outside the group to understand what you did and why it matters.
What We’re Looking For
We’re not expecting publication-ready work. We want to see evidence of serious engagement: a well-posed question, a thoughtful approach, and honest reflection on what you learned. A project that tackles something hard and partially succeeds is more valuable than one that answers a trivial question completely.
Finding Collaborators
We’ll facilitate team formation during Weeks 4-5. If you have a project idea but need collaborators, or want to join a team but don’t have an idea yet, we’ll create a shared space to match people.