As neural networks grow in influence and capability, understanding the mechanisms behind their decisions remains a fundamental scientific challenge. This gap between performance and understanding limits our ability to predict model behavior, ensure reliability, and detect sophisticated adversarial or deceptive behavior. Many of the deepest scientific mysteries in machine learning may remain out of reach if we cannot look inside the black box.
Mechanistic interpretability addresses this challenge by developing principled methods to analyze and understand a model's internals (weights and activations), and to use this understanding to gain greater insight into its behavior and the computation underlying it.
The field has grown rapidly, with sizable communities in academia, industry, and independent research, 140+ papers submitted to our ICML 2024 workshop, dedicated startups, and a rich ecosystem of tools and techniques. The goal of this workshop is to bring together diverse perspectives from the community to discuss recent advances, build common understanding, and chart future directions.


The first Mechanistic Interpretability Workshop (ICML 2024)
Workshop Goals
The mechanistic interpretability field benefits from a rich diversity of approaches—from rigorous mathematical analysis to large-scale empirical studies, from reverse-engineering a model via bottom-up circuit analysis, to assisting behavioral analysis via top-down analysis of model representations. But all are unified by the belief that there is meaning and structure to be found inside neural networks, and that this is worth studying.
This diversity reflects the field's breadth and the many valid paths toward understanding neural networks. But researchers in these different sub-communities often lack natural venues to meet. Our workshop aims to:
- Showcase cutting-edge research across all approaches to mechanistic interpretability
- Foster cross-pollination between different methodological traditions
- Identify convergent insights emerging from diverse research programs
- Build understanding between different perspectives, research communities, and terminologies
- Welcome newcomers by providing clear entry points into the field
We hope to explore points of active debate in the field, including:
- How should researchers prioritize between gathering evidence via rigorous qualitative analysis and measuring performance on benchmarks and real-world tasks?
- What are the implications of new paradigms, such as reasoning models, for the field's priorities?
- Should the field's north star be complete reverse engineering, high-level understanding, or something else entirely?
- How reliable or useful are popular methods such as sparse autoencoders, and how much should we prioritize them compared to other research directions?
- What are the relative merits of curiosity-driven basic science versus working towards specific goals?
- How important are unsupervised techniques with the potential to surprise us, such as transcoders, compared to simple supervised techniques such as probing? (See the probe sketch after this list.)
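As a concrete illustration of the probing approach mentioned above, here is a minimal sketch of training a linear probe. The activations and concept labels below are synthetic stand-ins; in practice you would cache real residual-stream activations from a model and label each example for the concept of interest.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: in practice, `activations` would be cached
# residual-stream vectors ([n_examples, d_model]) from a real model,
# and `labels` a binary concept label for each example.
rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 512))
labels = (activations[:, 0] > 0).astype(int)

# A linear probe is just a supervised classifier trained on activations.
X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Probe test accuracy: {probe.score(X_test, y_test):.2f}")
```

High held-out accuracy is (weak) evidence that the concept is linearly represented in the activations at that layer.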
In this workshop, we hope to bring together researchers from across these many perspectives and communities—along with skeptics, experts in adjacent fields, and those simply curious to learn more—to facilitate healthy discussion and move towards a greater mutual understanding as a field.
Through our call for papers, we hope to facilitate the sharing of work in this fast-moving field, across all of these axes, and especially work that helps to bridge these gaps. We welcome any submissions that seek to further our ability to use the internals of models to achieve understanding, regardless of how unconventional the approach may be. Please see the call for papers page for further details and particular topics of interest.
We welcome attendees from all backgrounds, regardless of prior research experience or whether you have work published at this workshop. Note that while you do not need to be registered for the NeurIPS main conference to attend this workshop, you do need to be registered for the NeurIPS workshop track. No further registration is needed; seating is first-come, first-served.
Learning More
Here are some resources you may find useful for learning more about the mechanistic interpretability field and performing research:
- We recommend starting with the review paper Open Problems in Mechanistic Interpretability for an overview of the field
- Ferrando et al. is a good primer on the key techniques of the field
- The ARENA coding tutorials are a great place to learn how to implement these techniques in practice
Resources for doing research
- Popular libraries include: TransformerLens (PyTorch, best for <=9B models), nnsight (PyTorch, more flexible and scales better), and Penzai (JAX); see the sketch after this list for a minimal example
- The Mechanistic Interpretability Benchmark
- The Gemma Scope Sparse Autoencoders (interactive tutorial)
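To give a flavour of what working with these libraries looks like, here is a minimal TransformerLens sketch that loads a small model and caches its internal activations; the model name, prompt, and layer index are arbitrary choices for illustration.

```python
from transformer_lens import HookedTransformer

# Load a small pretrained model (GPT-2 small here, chosen only for illustration).
model = HookedTransformer.from_pretrained("gpt2")

# Run the model and cache all intermediate activations in one call.
logits, cache = model.run_with_cache("The Eiffel Tower is in")

# Inspect, e.g., the residual stream after layer 5.
resid_post_5 = cache["resid_post", 5]  # shape: [batch, seq_len, d_model]
print(resid_post_5.shape)
```

The ARENA tutorials linked above build on patterns like this in much more depth.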
Relevant online communities: