AI Interpretability: Meaning, Methods, and Limits

Graduate Reading Group (UC Berkeley, Spring 2026)

Curriculum structure

The course follows the arc of its title:

  • Part I: Meaning (Weeks 1–4) — What is interpretability for? What kinds of questions can it answer?
  • Part II: Methods (Weeks 5–13) — What can we actually do with real models, and what kinds of causal claims follow?
  • Part III: Limits (Weeks 14–15) — When does interpretability work, and what role should it play in mitigating AI risk?

The schedule below is tentative. Please do not hesitate to contact Will Fithian (wfithian@berkeley.edu) if you have additional papers or topics to suggest.


Part I: Meaning

Week 1 (Jan 23) — Framing Interpretability

Mechanistic Interpretability and AI Safety: A Review (Sections 1–2, 6–8; optionally Sections 3–5) (2024)
Leonard Bereska & Efstratios Gavves
https://arxiv.org/html/2404.14082v3

Introduces concepts and goals of mechanistic interpretability, its relevance to AI safety, and its challenges and limitations as a safety strategy.

(Optional) The Urgency of Interpretability (2025)
Dario Amodei
https://www.darioamodei.com/post/the-urgency-of-interpretability

Week 2 (Jan 30) — Interpretability in Neuroscience

Crafting Interpretable Embeddings for Language Neuroscience by Asking LLMs Questions (2025)
Vinamra Benara et al.
https://pmc.ncbi.nlm.nih.gov/articles/PMC12021422/

Builds interpretable feature spaces by asking LLMs yes/no questions about language stimuli, and uses the resulting features to predict brain activity.
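
For concreteness, a minimal sketch of that pipeline: the `ask_llm` helper is a hypothetical placeholder for a real LLM call, and ridge regression is a standard stand-in for the encoding model, not the paper's exact implementation.

```python
# QA-embedding sketch: represent each stimulus by yes/no answers to named
# questions, then fit a linear encoding model to brain responses.
# `ask_llm` is a hypothetical placeholder, not the paper's code.
import numpy as np
from sklearn.linear_model import RidgeCV

QUESTIONS = [
    "Does the text mention a person?",
    "Does the text describe physical motion?",
    "Is the text emotionally negative?",
]

def ask_llm(question: str, text: str) -> bool:
    """Hypothetical stand-in for a real LLM yes/no query about the stimulus."""
    return hash((question, text)) % 2 == 0   # placeholder so the sketch runs; replace with an API call

def qa_embed(texts) -> np.ndarray:
    # One binary, human-readable feature per question.
    return np.array([[float(ask_llm(q, t)) for q in QUESTIONS] for t in texts])

def fit_encoding_model(texts, brain_responses) -> RidgeCV:
    # brain_responses: (n_stimuli, n_voxels), e.g. fMRI responses per stimulus.
    # Each fitted coefficient is tied to a named question, so the model is interpretable.
    return RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(qa_embed(texts), brain_responses)

# Toy usage with random data standing in for real recordings.
rng = np.random.default_rng(0)
texts = ["A person ran across the street.", "The sky was gray and still."] * 16
model = fit_encoding_model(texts, rng.normal(size=(len(texts), 10)))
```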

Predictive Coding or Just Feature Discovery? An Alternative Account of Why Language Models Fit Brain Data (2025)
Richard Antonello & Alexander Huth
https://direct.mit.edu/nol/article/5/1/64/113632/Predictive-Coding-or-Just-Feature-Discovery-An

Challenges the interpretation that LLM–brain alignment reflects predictive coding, arguing it may reflect shared feature discovery instead.

Week 3 (Feb 6) — Mechanistic Interpretability and Circuits

Zoom In: Circuits (2020)
Chris Olah et al.
https://distill.pub/2020/circuits/zoom-in

Canonical introduction to mechanistic interpretability, circuit-based explanations, and feature-level analysis.

Week 4 (Feb 13) — Interpretability as a Safety Strategy

Neel Nanda (2025)

Agenda-setting essays that question the most ambitious goals of mechanistic interpretability and articulate a pragmatic, safety-oriented theory of change for interpretability research.


Part II: Methods

Week 5 (Feb 20) — Causal Intervention in Practice

Locating and Editing Factual Associations in GPT (NeurIPS 2022)
Kevin Meng et al.
https://arxiv.org/abs/2202.05262
Project page: https://rome.baulab.info/

ROME: causal tracing to locate where factual associations are stored in the transformer's computation, and targeted rank-one weight edits to modify them.

How to Use and Interpret Activation Patching (2024)
Stefan Heimersheim & Neel Nanda
https://arxiv.org/abs/2404.15255

Practical guide to activation patching methodology, clarifying when and how causal interventions yield meaningful interpretability claims.
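
For discussion, a minimal sketch of the patching operation shared by ROME-style causal tracing and activation patching, written against a toy PyTorch model; the model, layer choice, and logit-difference metric are illustrative assumptions, not the papers' setups.

```python
# Activation patching sketch: cache a "clean" activation, splice it into a
# "corrupted" run, and measure how much of the clean behavior is restored.
import torch
import torch.nn as nn

def run_with_cache(model: nn.Module, layer: nn.Module, x: torch.Tensor):
    cache = {}
    def hook(mod, inp, out):
        cache["act"] = out.detach()   # returns None, so the output is unchanged
    handle = layer.register_forward_hook(hook)
    out = model(x)
    handle.remove()
    return out, cache["act"]

def run_with_patch(model: nn.Module, layer: nn.Module, x: torch.Tensor, patched_act: torch.Tensor):
    def hook(mod, inp, out):
        return patched_act            # returning a value replaces the layer's output
    handle = layer.register_forward_hook(hook)
    out = model(x)
    handle.remove()
    return out

# Toy usage: does layer 0's activation carry what separates clean from corrupted?
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
clean, corrupted = torch.randn(1, 8), torch.randn(1, 8)

clean_out, clean_act = run_with_cache(model, model[0], clean)
corrupt_out = model(corrupted)
patched_out = run_with_patch(model, model[0], corrupted, clean_act)

logit_diff = lambda out: (out[0, 0] - out[0, 1]).item()
restored = (logit_diff(patched_out) - logit_diff(corrupt_out)) / (logit_diff(clean_out) - logit_diff(corrupt_out))
print(f"Fraction of clean behavior restored by the patch: {restored:.2f}")
# Here restored is ~1.0 because everything downstream depends only on this layer.
```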

Week 6 (Feb 27) — Steering Vectors

Steering Language Models With Activation Engineering (2023)
Alexander Matt Turner et al.
https://arxiv.org/abs/2308.10248

Activation Addition (ActAdd): inference-time steering by adding activation-space directions computed from prompt pairs; foundational steering method.
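
A minimal sketch of the ActAdd recipe under toy assumptions: a stand-in linear "layer" replaces a transformer block, and the coefficient, shapes, and random activations are illustrative.

```python
# ActAdd sketch: steering vector = difference of activations for a contrast
# pair of prompts at one layer, added (scaled) to that layer's output at
# inference time.
import torch
import torch.nn as nn

def steering_vector(act_plus: torch.Tensor, act_minus: torch.Tensor) -> torch.Tensor:
    # act_*: (seq, d_model) activations for the two contrast prompts at the chosen layer.
    return (act_plus - act_minus).mean(dim=0)

def add_steering_hook(layer: nn.Module, direction: torch.Tensor, coeff: float):
    # Forward hook that shifts the layer's output along `direction` at every position.
    def hook(mod, inputs, output):
        return output + coeff * direction
    return layer.register_forward_hook(hook)

torch.manual_seed(0)
d_model = 16
layer = nn.Linear(d_model, d_model)                       # stand-in for a transformer block
act_plus, act_minus = torch.randn(5, d_model), torch.randn(5, d_model)

handle = add_steering_hook(layer, steering_vector(act_plus, act_minus), coeff=4.0)
steered = layer(torch.randn(3, d_model))                  # outputs are now shifted along the direction
handle.remove()
```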

Week 7 (Mar 6) — Behavioral Directions and Internal State Variables

Refusal in Language Models Is Mediated by a Single Direction (2024)
Andy Arditi, Neel Nanda, et al.
https://arxiv.org/abs/2406.11717

Persona Vectors: Monitoring and Controlling Character Traits in Language Models (2025)
Anthropic Interpretability Team
https://www.anthropic.com/research/persona-vectors

Together, these readings show how important model behaviors can be mediated by low-dimensional structure in activation space.
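
A minimal sketch of the difference-in-means direction and the "directional ablation" used in the refusal paper; the random toy activations below stand in for real residual-stream data from harmful and harmless prompts.

```python
# Refusal-direction sketch: a single difference-in-means direction between two
# sets of activations, and directional ablation that projects it out.
import torch

def diff_in_means_direction(acts_a: torch.Tensor, acts_b: torch.Tensor) -> torch.Tensor:
    # acts_*: (n_prompts, d_model) activations at a fixed layer and token position.
    d = acts_a.mean(dim=0) - acts_b.mean(dim=0)
    return d / d.norm()

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Remove each activation's component along `direction` (in the paper this
    # suppresses refusal; adding the direction instead induces it).
    return acts - (acts @ direction)[:, None] * direction

torch.manual_seed(0)
harmful_acts = torch.randn(64, 32) + 2.0      # shifted cluster, standing in for "harmful" runs
harmless_acts = torch.randn(64, 32)
r_hat = diff_in_means_direction(harmful_acts, harmless_acts)

ablated = ablate_direction(harmful_acts, r_hat)
print((harmful_acts @ r_hat).mean().item(), (ablated @ r_hat).mean().item())  # second is ~0
```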

Week 8 (Mar 13) — Machine Unlearning

TOFU: A Task of Fictitious Unlearning for LLMs (2024)
Pratyush Maini et al.
https://arxiv.org/abs/2401.06121

Benchmark and evaluation suite for LLM unlearning; studies what it would mean for a model to behave as if certain training data had never been seen.

Week 9 (Mar 20) — Latent Knowledge

Discovering Latent Knowledge in Language Models Without Supervision (ICLR 2023; arXiv 2022)
Collin Burns et al.
https://arxiv.org/abs/2212.03827

Extracts truth-related structure from activations to separate what models know from what they say.
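
A minimal sketch of the paper's Contrast-Consistent Search (CCS) objective; the random tensors stand in for hidden states of contrast pairs, and the paper's normalization details are omitted.

```python
# CCS sketch: an unsupervised probe trained so that p("X is true") and
# p("X is false") are consistent (sum to ~1) and confident.
import torch
import torch.nn as nn

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    consistency = (p_pos + p_neg - 1.0) ** 2        # the two probabilities should sum to 1
    confidence = torch.minimum(p_pos, p_neg) ** 2   # discourage the degenerate 0.5/0.5 solution
    return (consistency + confidence).mean()

def train_ccs(h_pos: torch.Tensor, h_neg: torch.Tensor, steps: int = 500) -> nn.Module:
    # h_pos / h_neg: (n_pairs, d_model) hidden states for "X is true" / "X is false".
    probe = nn.Sequential(nn.Linear(h_pos.shape[1], 1), nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = ccs_loss(probe(h_pos).squeeze(-1), probe(h_neg).squeeze(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe

torch.manual_seed(0)
probe = train_ccs(torch.randn(256, 64), torch.randn(256, 64))
# At inference, the truth score for a pair is 0.5 * (p_pos + (1 - p_neg)).
```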

Spring recess: Mar 23–27

Week 10 (Mar 31–Apr 3) — Causal Scrubbing

Causal Scrubbing: A Method for Rigorously Testing Mechanistic Interpretability Hypotheses (2022)
Lawrence Chan et al.
https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing

Behavior-preserving resampling ablations as a way to test mechanistic hypotheses, treating an interpretation as a falsifiable causal claim.
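
A minimal sketch of the resampling-ablation test on a toy two-stage model; the model, hypothesized feature, and data are illustrative assumptions, and the post's full treewise algorithm is much more general.

```python
# Causal-scrubbing sketch: if an activation only matters through a feature f(x),
# recomputing it from another input with the same f value should leave the
# expected loss unchanged.
import torch
import torch.nn as nn

torch.manual_seed(0)
first_stage = nn.Sequential(nn.Linear(8, 16), nn.ReLU())   # produces the activation under test
second_stage = nn.Linear(16, 2)                            # rest of the model
loss_fn = nn.CrossEntropyLoss()

def feature(x: torch.Tensor) -> int:
    # Hypothesis: only the sign of the first coordinate matters at this activation.
    return int(x[0] > 0)

def scrubbed_loss(inputs: torch.Tensor, targets: torch.Tensor) -> float:
    total = 0.0
    for i in range(len(inputs)):
        # Resample the activation from a different input that agrees on the feature.
        same = [j for j in range(len(inputs)) if j != i and feature(inputs[j]) == feature(inputs[i])]
        j = same[torch.randint(len(same), (1,)).item()] if same else i
        total += loss_fn(second_stage(first_stage(inputs[j:j+1])), targets[i:i+1]).item()
    return total / len(inputs)

inputs, targets = torch.randn(32, 8), torch.randint(0, 2, (32,))
baseline = loss_fn(second_stage(first_stage(inputs)), targets).item()
print(f"baseline loss {baseline:.3f} vs scrubbed loss {scrubbed_loss(inputs, targets):.3f}")
# A large gap is evidence against the hypothesized explanation.
```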

Week 11 (Apr 10) — Causal Abstractions

Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability (2023–2025)
Atticus Geiger et al.
https://arxiv.org/abs/2301.04709

Develops formal criteria for when mechanistic explanations support valid higher-level causal abstractions, grounding interpretability claims in intervention semantics.
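
A minimal sketch of an interchange intervention, the basic test behind causal abstraction, on a hand-built toy example; the arithmetic task and the alignment between levels are illustrative assumptions.

```python
# Interchange-intervention sketch: swap in an aligned component's value from a
# "source" input at both levels of description and check the outputs agree.
import torch

def high_level(x, S_override=None):
    # High-level causal model: intermediate S = x1 + x2, output = S * x3.
    S = x[0] + x[1] if S_override is None else S_override
    return S * x[2]

W = torch.tensor([[1.0, 1.0, 0.0],    # h[0] is the component aligned with the high-level S
                  [0.0, 0.0, 1.0]])

def low_level(x, h0_override=None):
    # Low-level "network": hidden h = W @ x, output = h[0] * h[1].
    h = W @ x
    h0 = h[0] if h0_override is None else h0_override
    return h0 * h[1]

base, source = torch.tensor([1.0, 2.0, 3.0]), torch.tensor([5.0, -1.0, 4.0])

# Intervene on the aligned pair: set S (high level) and h[0] (low level) to
# the values they take on `source`, while otherwise running on `base`.
hl = high_level(base, S_override=source[0] + source[1])
ll = low_level(base, h0_override=(W @ source)[0])
print(hl.item(), ll.item())   # equal -> this intervention is consistent with the alignment
```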

Week 12 (Apr 17) — Sparse Autoencoders

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (2023)
Anthropic Interpretability Team
https://transformer-circuits.pub/2023/monosemantic-features

Uses sparse dictionary learning to decompose MLP activations into interpretable, monosemantic features; foundational method for scalable mechanistic interpretability.
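
A minimal sketch of the sparse autoencoder setup; the sizes, penalty, and random stand-in activations are assumptions, and details such as decoder-norm constraints and resampling of dead features are omitted.

```python
# Sparse autoencoder sketch: an overcomplete ReLU autoencoder trained on MLP
# activations with an L1 sparsity penalty so individual features fire rarely.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x - self.dec.bias))   # sparse, non-negative feature activations
        return self.dec(f), f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    recon = ((x - x_hat) ** 2).mean()        # reconstruct the original activations
    sparsity = f.abs().sum(dim=-1).mean()    # L1 penalty pushes most features to zero
    return recon + l1_coeff * sparsity

torch.manual_seed(0)
sae = SparseAutoencoder(d_model=128, d_features=1024)   # dictionary is 8x overcomplete
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(4096, 128)                           # stand-in for cached MLP activations
for batch in acts.split(256):
    x_hat, f = sae(batch)
    loss = sae_loss(batch, x_hat, f)
    opt.zero_grad()
    loss.backward()
    opt.step()
```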

Week 13 (Apr 24) — Large-Scale Mechanistic Interpretability

On the Biology of a Large Language Model (2025)
Anthropic Interpretability Team
https://www.anthropic.com/research/biology-of-a-large-language-model

Anthropic’s most ambitious attempt to construct detailed mechanistic explanations of a frontier language model, tracing behavior through features, circuits, and attribution graphs.


Part III: Limits

Week 14 (May 1) — Mesa-Optimization and Deceptive Alignment

Risks from Learned Optimization in Advanced Machine Learning Systems (2019)
Evan Hubinger et al.
https://arxiv.org/abs/1906.01820

Classic paper introducing mesa-optimization and the risk that learned optimizers become deceptively aligned.

Week 15 (May 8) — Deceptive Interpretability

Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems (2025)
Simon Lermen, Mateusz Dziemian, Natalia Pérez-Campanero Antolín
https://arxiv.org/abs/2504.07831

Shows how language-model agents can coordinate to produce deceptive explanations that evade automated interpretability-based oversight, stress-testing interpretability as safety infrastructure.


Expectations and Participation

Preparation

  • Complete the assigned reading before each session
  • Come prepared with questions, critiques, or connections to other work

Discussion

  • Active participation is expected; the value of the group depends on collective engagement
  • Discussions will focus on: What does this paper claim? How do we evaluate those claims? What are the implications for AI safety?