As neural networks grow in influence and capability, understanding the mechanisms behind their decisions remains a fundamental scientific challenge. This gap between performance and understanding limits our ability to predict model behavior, ensure reliability, and detect sophisticated adversarial or deceptive behavior. Many of the deepest scientific mysteries in machine learning may remain out of reach if we cannot look inside the black box.
Mechanistic interpretability addresses this challenge by developing principled methods to analyze and understand a model's internals (weights and activations), and to use this understanding to gain greater insight into its behavior and the computation underlying it.
The field has grown rapidly, with sizable communities in academia, industry, and independent research, 140+ papers submitted to our ICML 2024 workshop, dedicated startups, and a rich ecosystem of tools and techniques. The goal of this workshop is to bring together diverse perspectives from the community to discuss recent advances, build common understanding, and chart future directions.


The first Mechanistic Interpretability Workshop (ICML 2024)
Workshop Goals
The mechanistic interpretability field benefits from a rich diversity of approaches—from rigorous mathematical analysis to large-scale empirical studies, from reverse-engineering a model via bottom-up circuit analysis, to assisting behavioral analysis via top-down analysis of model representations. But all are unified by the belief that there is meaning and structure to be found inside neural networks, and that this is worth studying.
This diversity reflects the field's breadth and the many valid paths toward understanding neural networks. But researchers in these different sub-communities often lack natural venues to meet. Our workshop aims to:
- Showcase cutting-edge research across all approaches to mechanistic interpretability
- Foster cross-pollination between different methodological traditions
- Identify convergent insights emerging from diverse research programs
- Build understanding between different perspectives, research communities, and terminologies
- Welcome newcomers by providing clear entry points into the field
We hope to explore points of active debate in the field, including:
- How should researchers prioritize between gathering evidence via rigorous qualitative analysis and measuring performance on benchmarks and real-world tasks?
- What are the implications of new paradigms, such as reasoning models, for the field's priorities?
- Should the field's north star be complete reverse engineering, high-level understanding, or something else entirely?
- How reliable or useful are popular methods such as sparse autoencoders, and how much should we prioritize them compared to other research directions?
- What are the relative merits of curiosity-driven basic science versus working towards specific goals?
- How important are unsupervised techniques with the potential to surprise us, such as transcoders, compared to simple supervised techniques such as probing? (See the probe sketch after this list.)
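As a concrete illustration of the probing approach mentioned above, here is a minimal sketch of training a linear probe. The activations and concept labels below are synthetic stand-ins; in practice you would cache real residual-stream activations from a model and label each example for the concept of interest.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: in practice, `activations` would be cached
# residual-stream vectors ([n_examples, d_model]) from a real model,
# and `labels` a binary concept label for each example.
rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 512))
labels = (activations[:, 0] > 0).astype(int)

# A linear probe is just a supervised classifier trained on activations.
X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Probe test accuracy: {probe.score(X_test, y_test):.2f}")
```

High held-out accuracy is (weak) evidence that the concept is linearly represented in the activations at that layer.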
In this workshop, we hope to bring together researchers from across these many perspectives and communities—along with skeptics, experts in adjacent fields, and those simply curious to learn more—to facilitate healthy discussion and move towards a greater mutual understanding as a field.
Through our call for papers, we hope to facilitate the sharing of work in this fast-moving field, across all of these axes, and especially work that helps to bridge these gaps. We welcome any submissions that seek to further our ability to use the internals of models to achieve understanding, regardless of how unconventional the approach may be. Please see the call for papers page for further details and particular topics of interest.
We welcome attendees from all backgrounds, regardless of prior research experience or whether you have work published at this workshop. Note that while you do not need to be registered for the NeurIPS main conference to attend this workshop, you do need to be registered for the NeurIPS workshop track. No further registration is needed; seating is first-come, first-served.
Learning More
Here are some resources you may find useful for learning more about the mechanistic interpretability field and performing research:
- We recommend starting with the review paper Open Problems in Mechanistic Interpretability for an overview of the field
- Ferrando et al. is a good primer on the key techniques of the field
- The ARENA coding tutorials are a great place to learn how to implement these techniques in practice
Resources for doing research
- Popular libraries include: TransformerLens (PyTorch, best for <=9B models), nnsight (PyTorch, more flexible and scales better), and Penzai (JAX); see the sketch after this list for a minimal example
- The Mechanistic Interpretability Benchmark
- The Gemma Scope Sparse Autoencoders (interactive tutorial)
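To give a flavour of what working with these libraries looks like, here is a minimal TransformerLens sketch that loads a small model and caches its internal activations; the model name, prompt, and layer index are arbitrary choices for illustration.

```python
from transformer_lens import HookedTransformer

# Load a small pretrained model (GPT-2 small here, chosen only for illustration).
model = HookedTransformer.from_pretrained("gpt2")

# Run the model and cache all intermediate activations in one call.
logits, cache = model.run_with_cache("The Eiffel Tower is in")

# Inspect, e.g., the residual stream after layer 5.
resid_post_5 = cache["resid_post", 5]  # shape: [batch, seq_len, d_model]
print(resid_post_5.shape)
```

The ARENA tutorials linked above build on patterns like this in much more depth.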
Relevant online communities: