AI Interpretability: Meaning, Methods, and Limits
Graduate Reading Group (UC Berkeley, Spring 2026)
- RSVP for this reading group to indicate your interest.
- Discover other programming from Berkeley’s AI Risk group, or join the mailing list.
Curriculum structure
The course follows the arc of its title:
- Part I: Meaning (Weeks 1–4) — What is interpretability for? What kinds of questions can it answer?
- Part II: Methods (Weeks 5–13) — What can we actually do with real models, and what kinds of causal claims follow?
- Part III: Limits (Weeks 14–15) — When does interpretability work, and what role should it play in mitigating AI risk?
The schedule below is tentative. Please do not hesitate to contact Will Fithian (wfithian@berkeley.edu) if you have additional papers or topics to suggest.
Part I: Meaning
Week 1 (Jan 23) — Framing Interpretability
- Mechanistic Interpretability and AI Safety: A Review (Sections 1–2 and 6–8; optionally Sections 3–5) (2024)
- Leonard Bereska & Efstratios Gavves
- https://arxiv.org/html/2404.14082v3
Introduces concepts and goals of mechanistic interpretability, its relevance to AI safety, and its challenges and limitations as a safety strategy.
- (Optional) The Urgency of Interpretability (2025)
- Dario Amodei
- https://www.darioamodei.com/post/the-urgency-of-interpretability
Week 2 (Jan 30) — Interpretability in Neuroscience
- Crafting Interpretable Embeddings for Language Neuroscience by Asking LLMs Questions (2025)
- Vinamra Benara et al.
- https://pmc.ncbi.nlm.nih.gov/articles/PMC12021422/
Uses LLM-generated questions to build interpretable feature spaces for predicting brain activity from language stimuli.
- Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data (2025)
- Richard Antonello & Alexander Huth
- https://direct.mit.edu/nol/article/5/1/64/113632/Predictive-Coding-or-Just-Feature-Discovery-An
Challenges the interpretation that LLM–brain alignment reflects predictive coding, arguing it may reflect shared feature discovery instead.
Week 3 (Feb 6) — Mechanistic Interpretability and Circuits
- Zoom In: An Introduction to Circuits (2020)
- Chris Olah et al.
- https://distill.pub/2020/circuits/zoom-in
Canonical introduction to mechanistic interpretability, circuit-based explanations, and feature-level analysis.
Week 4 (Feb 13) — Interpretability as a Safety Strategy
- A Pragmatic Vision for Interpretability (2025)
- Neel Nanda
- https://www.alignmentforum.org/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability
- How Can Interpretability Researchers Help AGI Go Well? (2025)
- Neel Nanda
- https://www.alignmentforum.org/posts/MnkeepcGirnJn736j/how-can-interpretability-researchers-help-agi-go-well
Agenda-setting essays that question the most ambitious goals of mechanistic interpretability and articulate a pragmatic, safety-oriented theory of change for interpretability research.
Part II: Methods
Week 5 (Feb 20) — Causal Intervention in Practice
- Locating and Editing Factual Associations in GPT (NeurIPS 2022)
- Kevin Meng et al.
- https://arxiv.org/abs/2202.05262
- Project page: https://rome.baulab.info/
ROME uses causal tracing to locate where factual associations are stored in the transformer's computation, then applies a targeted rank-one weight edit to modify them.
- How to Use and Interpret Activation Patching (2024)
- Stefan Heimersheim & Neel Nanda
- https://arxiv.org/abs/2404.15255
Practical guide to activation patching methodology, clarifying when and how causal interventions yield meaningful interpretability claims.
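To make the week's methods concrete, here is a minimal activation-patching sketch in PyTorch. It is illustrative only, not the papers' code: GPT-2 is used as a stand-in model, and the patched layer, prompts, and target token are placeholder choices. The pattern (cache an activation from a clean run, splice it into a corrupted run, measure the effect on the output) is the causal-intervention template both readings build on.
```python
# Illustrative activation-patching sketch (not the papers' code).
# Assumptions: GPT-2 as a stand-in model; LAYER, prompts, and target token are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # hypothetical choice of transformer block to patch

clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")
paris = tok(" Paris")["input_ids"][0]

cache = {}

def save_clean(module, inputs, output):
    # GPT-2 blocks return a tuple; hidden states are the first element.
    cache["clean"] = output[0].detach()

def patch_final_token(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, -1, :] = cache["clean"][:, -1, :]  # patch only the final token position
    return (hidden,) + output[1:]

block = model.transformer.h[LAYER]

with torch.no_grad():
    handle = block.register_forward_hook(save_clean)
    model(**clean)                              # 1) clean run: cache the block's output
    handle.remove()

    handle = block.register_forward_hook(patch_final_token)
    patched = model(**corrupt).logits[0, -1]    # 2) corrupted run with the clean activation patched in
    handle.remove()

    baseline = model(**corrupt).logits[0, -1]   # 3) corrupted run, no intervention

# If patching restores the " Paris" logit, block LAYER (at the final token)
# carries information that moves the model back toward the clean answer.
print("corrupted:", baseline[paris].item(), "patched:", patched[paris].item())
```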
Week 6 (Feb 27) — Steering Vectors
- Steering Language Models With Activation Engineering (2023)
- Alexander Matt Turner et al.
- https://arxiv.org/abs/2308.10248
Activation Addition (ActAdd): inference-time steering by adding activation-space directions computed from contrasting prompt pairs; a foundational steering method.
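A toy sketch of ActAdd-style steering follows, again with GPT-2 as a stand-in; the injection layer, scale, and prompt pair are illustrative assumptions, not the paper's settings. The steering vector is the difference of residual-stream activations for a contrasting prompt pair, added back in during generation.
```python
# Illustrative ActAdd-style steering sketch (not the authors' code).
# Assumptions: GPT-2 stand-in model; LAYER, SCALE, and prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER, SCALE = 6, 4.0  # hypothetical injection layer and steering coefficient

def resid_after_block(prompt):
    """Residual-stream activations output by block LAYER for a prompt."""
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0]           # (seq_len, d_model)

with torch.no_grad():
    pos, neg = resid_after_block(" Love"), resid_after_block(" Hate")
    n = min(pos.shape[0], neg.shape[0])
    steer = SCALE * (pos[:n] - neg[:n])              # steering vector(s) from the prompt pair

def add_steering(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] == 1:                         # skip cached single-token steps during generation
        return output
    hidden = hidden.clone()
    k = min(steer.shape[0], hidden.shape[1])
    hidden[:, :k, :] += steer[:k]                    # add the direction at the leading positions
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("I went to the store and", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30, do_sample=False)[0]))
handle.remove()
```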
Week 7 (Mar 6) — Behavioral Directions and Internal State Variables
- Refusal in Language Models Is Mediated by a Single Direction (2024)
- Andy Arditi, Neel Nanda, et al.
- https://arxiv.org/abs/2406.11717
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models (2025)
- Anthropic Interpretability Team
- https://www.anthropic.com/research/persona-vectors
Together, these readings show how important model behaviors can be mediated by low-dimensional structure in activation space.
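A sketch of the shared recipe (difference-of-means directions plus directional ablation) appears below. GPT-2 is used purely as a stand-in (it is not a chat model and does not refuse); the contrast prompts, layer, and hook placement are illustrative assumptions, not the papers' setup.
```python
# Illustrative difference-of-means direction + directional ablation (not the papers' code).
# Assumptions: GPT-2 stand-in model; toy contrast prompts; LAYER is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # hypothetical layer for extracting the direction

def last_token_resid(prompt):
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1]       # residual stream at the final token

harmful = ["How can I pick a lock?", "How can I forge a signature?"]   # toy contrast sets
harmless = ["How can I bake bread?", "How can I plant a garden?"]

with torch.no_grad():
    direction = (torch.stack([last_token_resid(p) for p in harmful]).mean(0)
                 - torch.stack([last_token_resid(p) for p in harmless]).mean(0))
    direction = direction / direction.norm()         # candidate behavioral direction

def ablate(module, inputs, output):
    hidden = output[0]
    proj = (hidden @ direction).unsqueeze(-1) * direction   # component along the direction
    return (hidden - proj,) + output[1:]                    # remove it at every position

handles = [blk.register_forward_hook(ablate) for blk in model.transformer.h]
ids = tok("How can I pick a lock?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30, do_sample=False)[0]))
for h in handles:
    h.remove()
```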
Week 8 (Mar 13) — Machine Unlearning
- TOFU: A Task of Fictitious Unlearning for LLMs (2024)
- Pratyush Maini et al.
- https://arxiv.org/abs/2401.06121
Benchmark and evaluation suite for LLM unlearning; studies what it would mean for a model to behave as if certain data had never been learned.
Week 9 (Mar 20) — Latent Knowledge
- Discovering Latent Knowledge in Language Models Without Supervision (ICLR 2023; arXiv 2022)
- Collin Burns et al.
- https://arxiv.org/abs/2212.03827
Extracts truth-related structure from activations to separate what models know from what they say.
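A compact sketch of the paper's CCS objective follows, with randomly generated stand-in tensors in place of real hidden states; the probe architecture, data shapes, and hyperparameters are placeholder choices.
```python
# Minimal CCS-style probe sketch (illustrative). Assumes you already have paired
# hidden states for a statement and its negation: tensors h_pos, h_neg of shape
# (n_examples, d_model), e.g. extracted from a frozen LM. Random data stands in here.
import torch
import torch.nn as nn

d_model = 768                          # placeholder dimension
h_pos = torch.randn(512, d_model)      # stand-in data; use real activations in practice
h_neg = torch.randn(512, d_model)

# Normalize each set independently so the probe cannot rely on surface differences.
h_pos = (h_pos - h_pos.mean(0)) / (h_pos.std(0) + 1e-6)
h_neg = (h_neg - h_neg.mean(0)) / (h_neg.std(0) + 1e-6)

probe = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_pos, p_neg = probe(h_pos).squeeze(-1), probe(h_neg).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2          # p(x) and p(not x) should sum to 1
    confidence = torch.minimum(p_pos, p_neg) ** 2     # discourage the uninformative 0.5 solution
    loss = (consistency + confidence).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The trained probe's output (up to a global sign flip) is read as the model's
# credence that a statement is true, without using any labels during training.
```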
Spring recess: Mar 23–27
Week 10 (Apr 3) — Causal Scrubbing
- Causal Scrubbing: A Method for Rigorously Testing Mechanistic Interpretability Hypotheses (2022)
- Lawrence Chan et al.
- https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing
Behavior-preserving resampling ablations as a way to test mechanistic hypotheses (interpretation as a falsifiable causal claim).
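A toy resampling-ablation sketch in the spirit of causal scrubbing is given below (the full algorithm recursively scrubs an entire hypothesis graph; this shows a single swap). The "hypothesis", the model, the layer, and the prompts are assumptions made up for illustration.
```python
# Toy resampling ablation in the spirit of causal scrubbing (illustrative, not the full method).
# Hypothesis assumed for the example: at block LAYER, the final-token activation depends
# only on the fact being recalled, not the exact phrasing, so the two prompts are interchangeable there.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # placeholder

prompt = tok("The capital of France is", return_tensors="pt")
equivalent = tok("The capital city of France is", return_tensors="pt")
target = tok(" Paris")["input_ids"][0]

cache = {}
def save(module, inputs, output):
    cache["h"] = output[0].detach()

def resample(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, -1, :] = cache["h"][:, -1, :]    # swap in the hypothesis-equivalent activation
    return (hidden,) + output[1:]

block = model.transformer.h[LAYER]
with torch.no_grad():
    handle = block.register_forward_hook(save)
    model(**equivalent)                        # cache activations from the "equivalent" prompt
    handle.remove()

    baseline = model(**prompt).logits[0, -1, target]
    handle = block.register_forward_hook(resample)
    scrubbed = model(**prompt).logits[0, -1, target]
    handle.remove()

# If behavior is preserved (logits stay close), the hypothesis survives this test;
# a large gap falsifies the claimed equivalence.
print("baseline:", baseline.item(), "scrubbed:", scrubbed.item())
```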
Week 11 (Apr 10) — Causal Abstractions
- Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability (2023–2025)
- Atticus Geiger et al.
- https://arxiv.org/abs/2301.04709
Develops formal criteria for when mechanistic explanations support valid higher-level causal abstractions, grounding interpretability claims in intervention semantics.
Week 12 (Apr 17) — Sparse Autoencoders
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (2023)
- Anthropic Interpretability Team
- https://transformer-circuits.pub/2023/monosemantic-features
Uses sparse dictionary learning to decompose MLP activations into interpretable, monosemantic features; foundational method for scalable mechanistic interpretability.
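A minimal sparse-autoencoder sketch follows, showing the training objective (reconstruction error plus an L1 sparsity penalty on feature activations). It is not Anthropic's implementation: the activations here are random stand-in data, and the dictionary size and penalty are placeholder values.
```python
# Minimal sparse-autoencoder sketch for dictionary-learning features (illustrative).
# Assumption: `acts` would hold MLP activations of shape (n_tokens, d_mlp) from a frozen model.
import torch
import torch.nn as nn

d_mlp, d_dict, l1_coeff = 512, 4096, 1e-3   # placeholder sizes and sparsity penalty
acts = torch.randn(8192, d_mlp)             # stand-in data

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_mlp, d_dict)
        self.dec = nn.Linear(d_dict, d_mlp)

    def forward(self, x):
        f = torch.relu(self.enc(x))         # sparse, overcomplete feature activations
        return self.dec(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

for step in range(1000):
    batch = acts[torch.randint(0, acts.shape[0], (256,))]
    recon, feats = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * feats.abs().sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, each decoder weight column is a candidate feature direction;
# features are interpreted by inspecting the inputs that most strongly activate them.
```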
Week 13 (Apr 24) — Large-Scale Mechanistic Interpretability
- On the Biology of a Large Language Model (2025)
- Anthropic Interpretability Team
- https://www.anthropic.com/research/biology-of-a-large-language-model
Anthropic’s most ambitious attempt to construct detailed mechanistic explanations of a frontier language model, tracing behavior through features, circuits, and attribution graphs.
Part III: Limits
Week 14 (May 1) — Mesa-Optimization and Deceptive Alignment
- Risks from Learned Optimization in Advanced Machine Learning Systems (2019)
- Evan Hubinger et al.
- https://arxiv.org/abs/1906.01820
Foundational paper on mesa-optimization: the risk that a trained model is itself an optimizer whose learned objective diverges from the training objective, potentially giving rise to deceptive alignment.
Week 15 (May 8) — Deceptive Interpretability
- Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems (2025)
- Simon Lermen, Mateusz Dziemian, Natalia Pérez-Campanero Antolín
- https://arxiv.org/abs/2504.07831
Shows how models/agents can produce deceptive explanations that evade automated interpretability-based oversight; stress-tests interpretability as safety infrastructure.
Expectations and Participation
Preparation
- Complete the assigned reading before each session
- Come prepared with questions, critiques, or connections to other work
Discussion
- Active participation is expected; the value of the group depends on collective engagement
- Discussions will focus on: What does this paper claim? How do we evaluate those claims? What are the implications for AI safety?