AI Interpretability: Meaning, Methods, and Limits

Graduate Reading Group (UC Berkeley, Spring 2026)

Logistics

Format: We will read 1–2 papers per week; one group member will lead a 45-minute presentation followed by group discussion.

Organizers: Will Fithian (Statistics), Wes Holliday (Philosophy), and Sean Richardson (Statistics PhD student)

Meeting Details: Fridays, 1–2:30pm, in Latimer 120.

Course credit: We can enroll a limited number of students in Stat 298 (section 3) for 2 units of course credit. The requirements are regular attendance and presenting a paper at one session. The listing will appear in the course catalog within a few days.

Description

This is an interdisciplinary reading group organized around the question:

What kinds of interpretability claims are meaningful, useful, and reliable as AI systems become more capable and are deployed in higher-stakes contexts?

We will study interpretability as a scientific and safety-relevant practice: a collection of methods, epistemic standards, and theory-of-change assumptions aimed at gaining traction on the internal structure of modern ML systems in ways that could plausibly matter for oversight, control, and alignment. Questions include:

  • Why study interpretability? How do we formulate interpretability claims rigorously, and how do we assess the evidence for and against them?
  • How are interpretability methods currently used to understand and steer state-of-the-art systems? How do we know if they are working?
  • What are the limitations of these methods, or of the interpretability paradigm more broadly? What role can it play in reducing risks to society?

Our aim is to bring together an interdisciplinary community spanning machine learning, statistics, causal inference, philosophy of science, and AI governance, and to seed new research programs.

Learning objectives

By the end of this course, participants should be able to:

  1. Distinguish different senses of “interpretability” and “understanding”
  2. Evaluate mechanistic claims in light of causal and robustness criteria
  3. Recognize when interpretability evidence is strong, weak, or misleading
  4. Articulate and critique realistic theories of change connecting interpretability to AI risk mitigation

Assumed background

We welcome broad interdisciplinary participation, but our curriculum is designed for participants with:

  • Comfort reading technical ML papers
  • Interest in evaluating interpretability claims critically
  • Background in one or more relevant fields such as statistics, philosophy, causal inference, or cognitive science

No prior interpretability research experience is required.

Tentative Reading Schedule

The reading list below may be revised according to participant interests.

Part I: Meaning (Weeks 1–4)

What is interpretability for? What kinds of questions can it answer? How do mechanistic explanations differ from post-hoc rationalizations?

Week 1 (Jan 23): Interpretability and AI Safety
  • Mechanistic Interpretability for AI Safety: A Review — Bereska & Gavves (2024), sections 1–2 and 6–8 (sections 3–5 optional)
  • Optional: The Urgency of Interpretability — Amodei (2025)

Week 2 (Jan 30): Interpretability in Neuroscience
  • Crafting Interpretable Embeddings for Language Neuroscience by Asking LLMs Questions — Benara et al. (2025)
  • Predictive Coding or Just Feature Discovery? — Antonello & Huth (2025)

Week 3 (Feb 6): Mechanistic Interpretability
  • Zoom In: An Introduction to Circuits — Olah et al. (2020)

Week 4 (Feb 13): Interpretability as a Safety Strategy
  • A Pragmatic Vision for Interpretability — Nanda (2025)
  • How Can Interpretability Researchers Help AGI Go Well? — Nanda (2025)

Part II: Methods (Weeks 5–13)

What can interpretability methods actually do with real models? When does an intervention support an explanatory claim, and when is it merely a control knob?
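
As a concrete reference point for these questions, the sketch below shows the basic shape of activation patching (part of the week 5 reading) in a deliberately tiny setting: a small PyTorch feed-forward network with random inputs standing in for a real language model, so the model, layer, and inputs are illustrative assumptions rather than the setup used in the readings. The idea is to cache an activation from a "clean" run, substitute it into a "corrupted" run, and ask how much of the clean behavior is restored.

    # Minimal activation-patching sketch on a toy network (illustrative only).
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

    clean_x = torch.randn(1, 4)    # "clean" input
    corrupt_x = torch.randn(1, 4)  # "corrupted" input

    # 1. Run the clean input and cache the hidden activation (the ReLU output).
    cache = {}
    def save_hook(module, inputs, output):
        cache["hidden"] = output.detach()

    handle = model[1].register_forward_hook(save_hook)
    clean_out = model(clean_x)
    handle.remove()

    # 2. Run the corrupted input, overwriting that activation with the cached one.
    def patch_hook(module, inputs, output):
        return cache["hidden"]

    handle = model[1].register_forward_hook(patch_hook)
    patched_out = model(corrupt_x)
    handle.remove()

    corrupt_out = model(corrupt_x)

    # 3. If patching a single activation moves the corrupted output back toward
    #    the clean output, that activation is causally implicated in the behavior.
    #    (In this toy model the patch restores the clean output exactly, because
    #    nothing downstream depends on the corrupted input; in a real transformer
    #    the restoration is partial and is scored with a metric such as the logit
    #    difference across many prompt pairs.)
    print("clean:  ", clean_out)
    print("corrupt:", corrupt_out)
    print("patched:", patched_out)

Whether an intervention like this explains the behavior or merely steers it is exactly the kind of question the readings in this part take up.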

Week 5 (Feb 20): Causal Intervention (ROME)
  • Locating and Editing Factual Associations in GPT — Meng et al. (NeurIPS 2022)
  • How to Use and Interpret Activation Patching — Heimersheim & Nanda (2024)

Week 6 (Feb 27): Steering Vectors
  • Steering Language Models With Activation Engineering — Turner et al. (2023)

Week 7 (Mar 6): Behavioral Directions
  • Refusal in LLMs Is Mediated by a Single Direction — Arditi et al. (2024)
  • Persona Vectors — Anthropic (2025)

Week 8 (Mar 13): Machine Unlearning
  • TOFU: A Task of Fictitious Unlearning for LLMs — Maini et al. (2024)

Week 9 (Mar 20): Latent Knowledge
  • Discovering Latent Knowledge in Language Models Without Supervision — Burns et al. (ICLR 2023)

Mar 23–27: Spring recess (no meeting)

Week 10 (Apr 3): Causal Scrubbing
  • Causal Scrubbing: Rigorously Testing Mechanistic Hypotheses — Chan et al. (2022)

Week 11 (Apr 10): Causal Abstractions
  • Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability — Geiger et al. (2023–2025)

Week 12 (Apr 17): Sparse Autoencoders
  • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning — Anthropic (2023)

Week 13 (Apr 24): Large-Scale Mechanistic Interpretability
  • On the Biology of a Large Language Model — Anthropic (2025)

Part III: Limits (Weeks 14–15)

Given its limits, what role can interpretability play in reducing catastrophic risk?

Week 14 (May 1): Mesa-Optimization
  • Risks from Learned Optimization in Advanced ML Systems — Hubinger et al. (2019)

Week 15 (May 8): Deceptive Interpretability
  • Deceptive Automated Interpretability: LMs Coordinating to Fool Oversight — Lermen et al. (2025)

See the full syllabus for paper summaries and additional context.


Contact

Questions? Contact Will Fithian (wfithian@berkeley.edu).