Accepted Posters

163 virtual posters were accepted. They are part of the workshop program but are not presented in person. The posters below are grouped by topic.

Circuits & Attribution Graphs (30)

Virtual
How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models Gregory N. Frank
Abstract
We localize the policy routing mechanism in alignment-trained language models. An intermediate-layer attention gate reads detected content and triggers deeper amplifier heads that boost the signal toward refusal. In smaller models the gate and amplifier are single heads; at larger scale they become bands of heads across adjacent layers. The gate contributes under 1% of output DLA, yet interchange testing (p < 0.001) and knockout cascade confirm it is causally necessary. Interchange screening at n ≥ 120 detects the same motif in twelve models from six labs (2B to 72B), though specific heads differ by lab. Per-head ablation weakens up to 58× at 72B and misses gates that interchange identifies; at scale, interchange is the only reliable audit. Modulating the detection-layer signal continuously controls policy from hard refusal through evasion to factual answering. On safety prompts the same intervention turns refusal into harmful guidance, showing that the safety-trained capability is gated by routing, not removed. Thresholds vary by topic and by input language, and the circuit relocates across generations within a family even while behavioral benchmarks register no change. Routing is early-commitment: the gate fires at its own layer before deeper layers finish processing the input. An in-context substitution cipher collapses gate interchange necessity by 70 to 99% across three models, and the model switches to puzzle-solving rather than refusal. Injecting the plaintext gate activation into the cipher forward pass restores 48% of refusals in Phi-4-mini, localizing the bypass to the routing interface. A second method, cipher contrast analysis, uses plain/cipher DLA differences to map the full cipher-sensitive routing circuit in O(3n) forward passes. Any encoding that defeats detection-layer pattern matching bypasses the policy regardless of whether deeper layers reconstruct the content.
Virtual
In-Context Learning Amplifies a Latent Symbolic Circuit Melissa Wessel
Abstract
Large language models can learn abstract rules from just a few in-context examples, but how their internal mechanisms activate as examples accumulate is not well understood. We trace a three-stage symbolic reasoning circuit (abstraction, induction, retrieval) across shot counts in three model families and find it is detectable and functional well before the model achieves high accuracy. Per-head causal contribution grows up to $8\times$ from 1- to 10-shot, and cross-shot activation patching raises accuracy from 1\% to 56\% at 0-shot and 17\% to 88\% at 1-shot. Function vectors scaled and injected at 0-shot rescue accuracy up to 86\%, largely substituting for the induction stage but depending critically on an intact downstream retrieval stage. The infrastructure for abstract rule-following is present in the weights before any demonstrations; in-context examples, function vectors, and related interventions appear to supply input to the same latent circuit.
Virtual
Identifying latent algorithms in Transformers via RASP Rimon Melamed, H Howie Huang
Abstract
Transformers trained on certain algorithmic tasks have been found to generalize on out-of-distribution (OOD) examples. In this work, we identify several causal mechanisms ("latent algorithms") that are responsible for OOD generalization. Specifically, given an algorithmic task, we use intermediate computations from a RASP-L program that implements the task to probe the alignment of the model's learned representations with the RASP-L program. For several algorithms, the intermediate computations predicted by the RASP-L program are linearly decodable from the model's activations. Causal intervention analysis reveals that the probe subspaces are crucial for high task accuracy. Overall, we take a new perspective on understanding the hidden computations and OOD generalization of Transformer language models.
Virtual
Transformers Linearly Represent Highly Structured World Models Roman Kniazev, Nathanaël Fijalkow
Abstract
Do transformers, when trained on sequential reasoning traces, build internal models of the underlying task? And if so, does the structure of those internal representations mirror the structure of the domain? We train an 8-layer transformer on Sudoku solving traces and perform a mechanistic analysis of its internal computation. We establish two results. First, the model builds a substructure world model: it does not represent the board state cell by cell, as a human analyst would expect, but organizes information around the rows, columns, and boxes that Sudoku's constraints act on. Second, we identify a naked-single circuit: a small set of dedicated neurons in the final MLP layer, each individually detecting when exactly one digit remains possible for a specific cell, and reliably promoting that digit. These findings show that the geometry of an emergent world model is shaped by the constraint algebra of the domain, not its surface presentation, and that the resulting decision circuit is sparse, monosemantic, and fully interpretable. More broadly, they demonstrate that mechanistic interpretability tools can recover an end-to-end algorithmic account of how a transformer solves a combinatorial reasoning task.
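For readers unfamiliar with the term, the naked-single rule mentioned in this abstract can be sketched in a few lines. This is only an illustration of the Sudoku rule itself, not the authors' circuit or code; the function names and the board encoding (a 9x9 list of lists with 0 for an empty cell) are assumptions made for the example.

```python
def candidates(board, r, c):
    """Digits still allowed in empty cell (r, c) under row, column, and box constraints."""
    if board[r][c] != 0:
        return set()
    used = set(board[r])                                   # digits in the same row
    used |= {board[i][c] for i in range(9)}                # digits in the same column
    br, bc = 3 * (r // 3), 3 * (c // 3)                    # top-left corner of the 3x3 box
    used |= {board[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return set(range(1, 10)) - used


def naked_singles(board):
    """Return {(row, col): digit} for every empty cell with exactly one remaining candidate."""
    found = {}
    for r in range(9):
        for c in range(9):
            cand = candidates(board, r, c)
            if len(cand) == 1:
                found[(r, c)] = next(iter(cand))
    return found
```

Note that the candidate set is computed from the row, column, and box that contain the cell, which are the same substructures the abstract says the model organizes its representation around.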
Virtual
Logit-Contribution Scoring Identifies Non-Literal Retrieval Heads Aryo Pradipta Gema, Beatrice Alex, Pasquale Minervini
Abstract
In long-context use, large language models frequently synthesize answers from the meaning of a relevant context span rather than literally copy-pasting them. Identifying which attention heads perform this synthesis matters for interpreting long-context model behavior. Yet existing detectors miss these heads by construction: they reward heads whose attended token matches the generated token, a literal-copy criterion that captures where a head reads but not what it writes through its output-value (OV) circuit, the very mechanism that carries non-literal retrieval. We introduce Logit-Contribution Scoring (LOCOS), a write-aware detector that scores each head by the projection of its OV-circuit output onto the answer-token unembedding direction, contrasting needle and off-needle source positions in a single forward pass. Across three model families (Qwen3, Gemma-3, OLMo-3.1), mean-ablating the top LOCOS heads on the NoLiMa non-literal retrieval benchmark collapses ROUGE-L at lower head counts than every attention-based baseline; on Qwen3-8B, ablating 50 heads drives ROUGE-L from 0.401 to 0.000 while the strongest baseline still retains 0.292. The selected heads are retrieval-specific: parametric recall and arithmetic reasoning stay at baseline under the same ablation. On Qwen3-8B, the same ablation also drops MuSiQue from 0.55 to 0.08 and BABILong from 0.62 to 0.20, while a random-heads control stays within 0.05 of baseline.
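The idea of a write-aware head score can be sketched roughly as follows. This is a hypothetical reading of the description above, not the LOCOS implementation: the function `write_score`, the choice to weight each source position by its attention weight, and the mean-difference contrast between needle and off-needle positions are all assumptions for illustration.

```python
def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out), as plain lists."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def write_score(attn, values, W_O, u_answer, needle_positions):
    """Assumed score for one head at the final query position.

    attn[s]          : attention weight on source position s
    values[s]        : the head's value vector at source position s
    W_O              : the head's output projection (d_head x d_model)
    u_answer         : unembedding direction of the answer token (d_model)
    needle_positions : set of source positions that belong to the needle span
    """
    # What the head writes from each source position, projected onto the answer direction.
    contrib = [attn[s] * dot(matvec(values[s], W_O), u_answer) for s in range(len(attn))]
    needle = [contrib[s] for s in range(len(attn)) if s in needle_positions]
    off = [contrib[s] for s in range(len(attn)) if s not in needle_positions]
    # Contrast: mean contribution from needle positions minus mean from all others.
    return sum(needle) / len(needle) - sum(off) / len(off)
```

Unlike a literal-copy criterion, a score of this kind never compares the attended token with the generated token; it only asks how much the head's written output moves the answer logit.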
Virtual
Assign and Add: A Mechanistic Study of Compositional Arithmetic Brady Exoo, Alberto Bietti, John Sous
Abstract
Large language models are able to compose skills in order to perform complex tasks, many of which might not have been seen during training. The details of how exactly this composition occurs remain elusive. In this paper, we study a mechanism for compositional generalization in transformers by considering a simple controlled setting involving variable assignment and modular addition. By partitioning our training data into disjoint sets, we observe that small transformers are able to generalize to previously unseen combinations of variables and numbers. Our mechanistic analysis shows that the same "modular addition" MLP module is used whether the inputs are given directly or indirectly through a separate variable assignment mechanism. We also analyze the training dynamics from an empirical lens, which reveals three phases of learning: first, modular addition is learned, then the structure required for variable assignment, and finally a refinement phase where the model generalizes to some hard sequences not seen in training. Finally, we provide a theoretical framework to explain how compositionality emerges from training dynamics. These results suggest that compositional generalization can be a natural consequence of the compositionality of internal mechanisms in transformers.
Virtual
PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction Jonathn Chang, Arya Datla, Ziv Goldfeld
Abstract
Causal abstraction offers a principled framework for mechanistic interpretability, aligning a high-level causal model with the low-level computation realized by a neural network through counterfactual intervention analysis. Existing methods such as distributed alignment search (DAS) learn expressive subspace interventions, but the relevant neural site is unknown a priori, so finding a handle requires a computationally burdensome search over candidate sites. We introduce PLOT (Progressive Localization via Optimal Transport), a transport-based framework that localizes causal variables from the output effect geometry of abstract and neural interventions. PLOT fits an optimal transport coupling between abstract variables and candidate neural sites, yielding a global soft correspondence that can be calibrated into intervention handles. In simple settings, a single coupling over individual neurons suffices. In larger models, PLOT is applied progressively, moving from coarse sites such as tokens, timesteps, or layers to finer supports such as coordinate groups or PCA spans, and optionally guiding DAS based on the localized signal. Across experiments of increasing complexity, transport-only PLOT handles are exceedingly fast and competitive on accuracy, while PLOT-guided DAS reaches DAS-level accuracy at a fraction of full DAS runtime, providing an efficient localization engine for causal abstraction research at scale.
Virtual
When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models Ding Zhang, Runtao Zhou, Wenqing Zheng, Rizal Fathony, C. Bayan Bruss, Chirag Agarwal
Abstract
Graph Language Models (GLMs) have become a promising direction for adapting Large Language Models (LLMs) to graph learning tasks. By transforming graph topology and node information into graph tokens, GLMs allow LLMs to jointly process structured graph inputs and textual instructions. Yet, it remains unclear how LLMs internally interpret these graph tokens and whether graph tokens act as meaningful carriers of graph structure. In this work, we analyze how LLMs process graph information through graph-token behavior in representative GLM architectures. **Findings.** We find that the internal saliency of graph tokens in GLMs is not equivalent to graph information utilization. Graph sink tokens consistently emerge as activation-level outliers: they can be identified by massive activation values along a small set of hidden-state dimensions and are biased toward early graph-token positions. However, this activation-level saliency does not imply that these tokens are the main carriers of graph information. Unlike classical attention sinks in language and vision-language models, graph sink tokens do not necessarily attract the largest attention weights from query tokens. Through pruning, repositioning, and swapping interventions, we show that graph sink tokens are not the most important semantic or structural tokens for downstream prediction. **Implications.** Together, these results suggest that after current GLMs map graph structure into the LLM token space, the resulting graph-token representations do not naturally form a fully usable topology-aware internal representation; instead, they exhibit a decoupling between activation-level saliency and graph-semantic utility. This decoupling points to limitations in existing graph-token construction, placement, and alignment mechanisms.
Virtual
How Do Language Models Compose Functions? Apoorv Khandelwal, Ellie Pavlick
Abstract
While large language models (LLMs) appear to be increasingly capable of solving compositional tasks, it is an open question whether they do so using compositional mechanisms. In this work, we investigate how feedforward LLMs solve two-hop factual recall tasks, which can be expressed compositionally as $g(f(x))$. We first confirm that modern LLMs continue to suffer from the "compositionality gap", i.e. their ability to compute both $z = f(x)$ and $y = g(z)$ does not entail their ability to compute the composition $y = g(f(x))$. We then decode residual stream representations and identify two processing mechanisms: one which solves tasks *compositionally*, computing $f(x)$ along the way to $g(f(x))$, and one which solves them *directly*, without any detectable signature of the intermediate variable $f(x)$. Finally, we find that embedding space geometry is strongly related to which mechanism is employed, where the idiomatic mechanism is dominant when tasks are represented by translations from $x$ to $g(f(x))$ in the embedding spaces.
Virtual
Architecture, Not Scale: Circuit Localization in Large Language Models Sohan Venkatesh
Abstract
Mechanistic interpretability assumes that circuit analysis becomes harder as models scale. We challenge this assumption by showing that the attention architecture matters more than parameter count. Studying three circuit types across Pythia and Qwen2.5, we find that grouped query attention produces circuits that are far more concentrated and mechanistically stable than standard multi-head attention at comparable scales. The same concentration pattern holds across indirect object identification, induction heads, and factual recall. Within a single architecture family (Qwen2.5), factual recall circuits undergo a discrete phase transition above a critical scale, collapsing to a single bottleneck rather than degrading gradually. These findings suggest that some architectural choices make large models more tractable to study and that interpretability difficulty is not a fixed consequence of model size.
Virtual
Automated Attribution Graph Interpretation via Probe Prompting Giuseppe Birardi, Gonçalo Paulo
Abstract
Even though we know the precise computations that lead from a large language model (LLM) input to its output, this computation remains very hard to interpret. One way to make this process easier to understand is to create a sparse computational graph that captures most of the model behavior with the smallest number of computational nodes. Cross-layer transcoders (CLT) decompose the dense computations of the MLP, but the resulting circuits still contain thousands of nodes even for short prompts. Existing automated interpretation methods label individual features from corpus activations, and these labels are often not validated by causal intervention. We introduce \emph{probe prompting}, a transparent rule-based pipeline that groups the features of an attribution graph into concept-aligned supernodes from their responses on a small set of concept-targeted probe prompts, summarized as Cross-Prompt Activation Signatures (CPAS). Across four factual domains, on Gemma-2-2B with a public CLT dictionary and $45{,}596$ entity-swap interventions, we find that the labeled supernodes have the predicted steering behavior in every one of them. Code, datasets, and an interactive demo are released anonymously as a reusable harness for calibrating supernode labels against causal interventions.
Virtual
Targeted Recovery of Weight-Space Mechanisms From Neural Networks Antoine Vigouroux, Lee Sharkey
Abstract
Parameter decomposition (PD) decomposes neural networks into interpretable computational components that faithfully reflect the original network's operations. However, scaling PD to large models requires vast compute, making it a costly and risky endeavor. Here we propose targeted PD (tPD), which identifies only the components that process specific inputs of interest – from isolated prompts to large subtasks – by introducing a high-rank catch-all component that handles all non-target data. We validate tPD on toy models and on transformer language models trained on The Pile, where it recovers reproducible, mechanistically faithful circuits. We extract a CSS-only submodel of a 4-block transformer using $\approx$7\% of the FLOPs of its published decomposition, and in a 12-block transformer we surgically ablate and rewire memorized sequences, with negligible side effects on other inputs.
Virtual
Relation Before Entity: Deferred Commitment in Language Model Factual Recall Divyansh Agarwal
Abstract
We ask whether relation-type information (e.g., capital-of) and entity-specific information (e.g., France→Paris) become causally active at the final-token position at the same depth during recall. Using four complementary causal diagnostics across four decoder-only models and eight prompt families, we find a robust temporal asymmetry: relation information becomes generation-controlling before entity information does. Relation onset precedes entity onset by 10–16 tested layers (31–44% of network depth) at threshold 0.4, with the ordering holding across all 16 model-threshold combinations for thresholds 0.2–0.5. Critically, entity information is not absent early: entity-token patching succeeds at 90–100% in early layers. Instead, entity commitment to generation is deferred: entity information is available at the entity-token position but becomes generation-controlling at the final token only after being routed there.
Virtual
Rethinking Relation-Specific Neurons in Large Language Models Lea Hirlimann, Sebastian Gerstner, François Yvon, Hinrich Schuetze
Abstract
Previous work has identified relation-specific neurons that selectively activate on specific semantic relations in factual knowledge tasks. However, the conclusions we draw about these representations depend heavily on the methodological assumptions underlying this procedure. We systematically reflect on three such assumptions, showing that (i) the number of relevant neurons varies across relations; (ii) the choice of internal signal for neuron identification shapes the results; (iii) cross-relation entanglement is structural rather than an artifact of subject overlap. We additionally present a preliminary investigation into the mismatch between benchmark-defined relation categories and model-internal organization. For instance, we show that the absence of a strong expert set for the product_company relationship reflects conceptual heterogeneity within the category rather than localization failure, and that targeted ablation of the subrelation car_company yields substantially stronger results. Together, our findings show that the apparent structure of relational representations is jointly shaped by the model's internal organization and the methodological lens applied to study it.
Virtual
Scalable Circuit Learning for Interpreting Large Language Models Naiyu Yin, Dennis Wei, Tian Gao, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, Yue Yu
Abstract
A prominent research direction within mechanistic interpretability involves learning sparse circuits to model causal relationships between LLM components, thereby providing insights into model behavior. However, due to the polysemantic nature of LLM components, learned circuits are often difficult to interpret. While sparse autoencoder (SAE) features enhance interpretability, their high dimensionality presents a significant challenge for existing circuit learning methods to scale. To address these limitations, we propose a scalable circuit learning approach, CircuitLasso, that leverages sparse linear regression. Our method can efficiently uncover relationships among SAE features, showing how human-interpretable semantic features propagate through the model and influence its predictions. We empirically evaluate our method against state-of-the-art baselines on benchmark circuit learning tasks, demonstrating substantial improvements in efficiency while accurately capturing circuits involving LLM components. Given its efficiency, we then apply our method to SAE (high dimensional) features and obtain human-interpretable circuits for a grammatical classification task that has not been studied before in mechanistic interpretation. Finally, we validate the utility of our learned circuits by leveraging their insights to improve downstream performance in domain generalization.
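As a rough illustration of how sparse linear regression can yield circuit edges, the sketch below fits an L1-penalized regression of one downstream feature on candidate upstream features and treats nonzero coefficients as edges. It is not the CircuitLasso method itself: the coordinate-descent solver, the objective scaling, and the names `lasso_coordinate_descent` and `soft_threshold` are assumptions chosen for a self-contained example.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become exactly 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=100):
    """Minimize (1/2n) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent.

    X : list of n rows, each with d upstream feature activations
    y : list of n activations of one downstream feature
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            rho, z = 0.0, 0.0
            for i in range(n):
                # Prediction for sample i with feature j's contribution left out.
                pred_without_j = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (z / n) if z > 0 else 0.0
    return w
```

Under this reading, the candidate edges into the downstream feature are the indices `j` with `w[j] != 0.0`; larger `lam` gives a sparser circuit.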
Virtual
The Discrete-Log Clock: How a Transformer Learns Modular Multiplication Huu Danh Nguyen
Abstract
When small transformers grok modular multiplication, prior work reports that the learned embedding has a "dense" Fourier spectrum requiring all frequencies. This contrasts with modular addition, where only a sparse set of key frequencies suffices. We show this density is an artifact of analyzing in the wrong basis. The natural Fourier transform for multiplication is not the standard additive DFT but the *multiplicative character transform*, which decomposes functions on the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^*$ into its irreducible representations. Applying this transform to a grokked transformer trained on $a \cdot b \bmod 113$, we find the embedding spectrum becomes highly sparse (Gini coefficient 0.58 vs. 0.07 in the additive basis) with only 4 key frequencies carrying significant energy. Furthermore, 96.9\% of MLP neurons are cleanly tuned to a single multiplicative frequency, and neuron activation heatmaps reveal 2D-periodic structure when reordered by the discrete logarithm. These results demonstrate the transformer reduces multiplication to addition in discrete-log space, implementing a **Discrete-Log Clock** algorithm analogous to Nanda et al.'s Clock algorithm for addition.
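The reduction of multiplication to addition in discrete-log space can be checked directly for the modulus used in this abstract. The sketch below is only the underlying number theory, not the authors' analysis code; the helper names are made up for the example, and zero is excluded because it has no discrete logarithm.

```python
P = 113  # prime modulus from the abstract


def find_primitive_root(p):
    """Return the smallest generator of the multiplicative group modulo a prime p."""
    order = p - 1
    # Collect the prime factors of p - 1 by trial division.
    factors, n, d = set(), order, 2
    while d * d <= n:
        while n % d == 0:
            factors.add(d)
            n //= d
        d += 1
    if n > 1:
        factors.add(n)
    # g generates the group iff g^(order/q) != 1 for every prime factor q of the order.
    for g in range(2, p):
        if all(pow(g, order // q, p) != 1 for q in factors):
            return g
    raise ValueError("no primitive root found")


G = find_primitive_root(P)
# DLOG[a] = k such that G**k == a (mod P), for every a in 1..P-1.
DLOG = {pow(G, k, P): k for k in range(P - 1)}


def multiply_via_dlog(a, b):
    """Compute a*b mod P for nonzero a, b by adding discrete logs modulo P-1."""
    return pow(G, (DLOG[a] + DLOG[b]) % (P - 1), P)
```

Reordering the residues 1..112 by `DLOG` turns the multiplication table into an addition table modulo 112, which is why periodic structure becomes visible after that reordering.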
Virtual
Crafting Reversible SFT Behaviors in Large Language Models Yuping Lin, Pengfei He, Yue Xing, Yingqian Cui, Jiayuan Ding, Subhabrata Mukherjee, Hui Liu, Zhen Xiang
Abstract
Supervised fine-tuning (SFT) induces new behaviors in large language models, yet imposes no structural constraint on how these behaviors are distributed within the model. Existing behavior interpretation methods, such as circuit attribution approaches, identify sparse subnetworks correlated with SFT-induced behaviors post-hoc. However, such correlations do not imply *causal necessity*, limiting the ability to selectively control SFT-induced behaviors at inference time. We pursue an alternative by asking: can an SFT-induced behavior be deliberately compressed into a sparse, mechanistically necessary subnetwork, termed a *carrier*, while remaining controllable at inference time without weight modification? We propose (a) **Loss-Constrained Dual Descent (LCDD)**, which constructs such carriers by jointly optimizing routing masks and model weights under an explicit utility budget, and (b) **SFT-Eraser**, a soft prompt optimized via activation matching on extracted carrier channels, to reverse the SFT-induced behavior. Across safety, fixed-response, and style behaviors on multiple model families, LCDD yields sparse carriers that preserve target behaviors while enabling strong reversion when triggered by SFT-Eraser. Ablations further establish that the sparse structure is the key precondition for reversal: the same trigger optimization fails on standard SFT models, confirming that structure rather than trigger design is the operative factor. These results provide direct evidence that the learned carriers are causally necessary for the behaviors, pointing to a new direction for systematically localizing and selectively suppressing SFT-induced behaviors in deployed models.
Virtual
Routing-Mediated Structural Pseudo-Alignment in Sparse Mixture-of-Experts Junyu Ren, Su Hyeong Lee, Risi Kondor
Abstract
Mesa-optimization concerns predict that a model may appear aligned under training or monitoring while other inputs elicit behavior consistent with a different effective objective. We study a sparse mixture-of-experts (MoE) analogue whose substrate is routing geometry rather than autonomous learned optimization: clean training gives each expert a routed local objective, and router-only adaptation can expose it on trigger-bearing inputs. In OLMoE-1B-7B, with experts frozen, contamination fine-tuning of router parameters raises triggered target behavior while preserving clean accuracy. Across eight seeds, masking the trigger-enriched expert at every MoE layer drops triggered target rate from approximately 0.998 to approximately 0.23; matched placebo masks have little effect. Single-layer masks and expert-output interventions localize most of the effect to the same layer-2 expert. A no-contamination audit finds this route identity already present but not output-mediating. We call this routing-mediated structural pseudo-alignment: a frozen-expert, router-mediated channel in which a pre-existing trigger-sensitive route becomes output-mediating after adaptation.
Virtual
Dual Path Attribution: Efficient Attribution for SwiGLU-Transformers through Layer-Wise Target Propagation Lasse M. Jantsch, Dong-Jae Koh, Seonghyeon Lee, Young-Kyoon Suh
Abstract
Understanding the internal mechanisms of transformer-based large language models (LLMs) is crucial for their reliable deployment and effective operation. While recent efforts have yielded a plethora of attribution methods attempting to balance faithfulness and computational efficiency, dense component attribution remains prohibitively expensive. In this work, we introduce *Dual Path Attribution* (DPA), a novel framework that faithfully traces information flow on the frozen transformer in one forward and one backward pass without requiring counterfactual examples. DPA analytically decomposes and linearizes the computational structure of the SwiGLU Transformers into distinct pathways along which it propagates a targeted unembedding vector to receive the effective representation at each residual position. This target-centric propagation achieves $O(1)$ time complexity with respect to the number of model components, scaling to long input sequences and dense component attribution. Extensive experiments on standard interpretability benchmarks demonstrate that DPA achieves state-of-the-art faithfulness and unprecedented efficiency compared to existing baselines.
Virtual
The parameters in weight-sparse transformers are interpretable Arnau Marin-Llobet, Stefan Heimersheim
Abstract
A central goal of mechanistic interpretability is to understand how neural networks work, and what each individual component does. Dominant circuit-finding approaches focus on a specific behavior and reverse-engineer the role of components on the associated sub-distribution. Past work has shown, however, that components can have different functions that are active on different subsets of the input distribution. In this work we test whether it is possible to understand individual weights globally, on the full training distribution. We focus on weight-sparse transformers, in which we expect individual weights to be more interpretable than in dense models. Here, we introduce an automated LLM pipeline that produces a short, human-readable account of when a given weight matters, verifies this account on held-out data, and applies it at scale to compare two weight-sparse transformers against two dense models. Empirically, we find that a significant percentage of nonzero weights on sparse transformers are interpretable (17-35\%), compared to 5-9\% on dense models. Our results are a proof of concept that a substantial fraction of language model weights can be interpretable, and confirm that the weights of sparse models are more interpretable than those of dense models.
Virtual
The Connectome of a Large Language Model Ruixuan Deng, Zehao Jin
Abstract
A central question about information processing systems is whether fundamentally different optimization processes lead to similar internal organization. Cortical networks exhibit small-world topology, modular communities, scale-free hubs, and structural-functional dissociation, and sparse transcoders now provide the interpretable units needed to test these same properties inside a large language model. We use skip transcoders from Gemma Scope 2 to extract 16,384 sparse interpretable features per MLP layer of Gemma 3 (270M and 1B), construct per-layer co-activation networks from 1M tokens of FineWeb-Edu, and analyze them with a standard network-neuroscience toolbox augmented by the Laplacian renormalization group and the spectral heat capacity. Both models reproduce all four classical cortical signatures at quantitatively similar values. Multi-scale analysis further reveals a dual-regime fractal scaling with a local exponent $d_B \approx 1.35$ below the Leiden community diameter, a single-peaked heat capacity that sharpens with depth, and a non-monotonic cross-layer coupling trajectory passing through directed, broadcast, and focused regimes. Hub features and coarse-grained communities carry interpretable semantics organized along a tokenization-syntax-semantics-morphology hierarchy. These findings suggest that next-token prediction on web-scale text pushes transformer representations toward a network organization that, in quantitative terms, parallels the cortex.
Virtual
Topographic Training Concentrates Causal Circuits Without Improving Neuron Monosemanticity Gautam Ranka, Shubham Santosh Pandere, Aiden Dsouza
Abstract
Mechanistic interpretability of vision transformers seeks to decompose model computation into human-readable units, but learned representations entangle many concepts in each neuron. Feature superposition is widely treated as the central obstacle to this decomposition, yet most mitigations (sparse autoencoders, dictionary learning) are post-hoc and leave the underlying network unchanged. We ask whether a spatial-locality training loss (TopoLoss) can act as a lightweight, training-time prior that improves interpretability of standard mech-interp tools. Training ViT on ImageNet-100 across multiple TopoLoss weights $\alpha$, we measure causal sufficiency of topographic clusters via activation patching and feature geometry via sparse autoencoders fit to the same residual stream. At $\alpha=1.0$, topographic clusters are 2.79$\times$ more causally sufficient than random unit sets of the same size, with the effect increasing monotonically in $\alpha$. SAE L0 sparsity decreases by 11\% and dead-feature fraction rises 19-fold, yet standard neuron-level monosemanticity scores are unchanged, indicating that topographic pressure acts at circuit level, concentrating causal mass into spatially local structures without disentangling individual neurons. This dissociation suggests current neuron-level monosemanticity metrics are insensitive to a class of real interpretability gains, and positions cheap architectural priors as a viable training-time complement to post-hoc tooling.
Virtual
Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers Gabriel Smithline, Chris Mascioli
Abstract
Architectural choices inside the Transformer feedforward network (FFN) block do not merely affect the block itself; they reshape the computations learned by the rest of the model. We study this effect in one-layer Transformers trained on digit addition with carry, modular arithmetic, and histogram counting. Comparing dense FFNs, gated linear units (GLUs), mixture-of-experts (MoE), and MoE-GLUs, we find that sparse MoE routing can shift computation from FFN to attention, with the strongest ablation-visible effect on carry-based addition. We decompose this redistribution into reduced per-token FFN capacity and sparse partitioning across experts. Critically, frozen random routing nearly matches learned routing, suggesting that redistribution is driven largely by architectural sparsity rather than router-learned specialization. As a secondary finding, GLU-style multiplicative gating rotates task-relevant Fourier structure out of the per-neuron basis and into distributed subspaces, making neuron-level interpretability less informative while preserving structured computation. We validate these conclusions with random-routing, narrow-FFN, and top-2 MoE controls, plus parameter-matching, activation-function, and width-scaling analyses. Together, these results show that local FFN design choices can have nonlocal consequences for Transformer computation.
Virtual
MechInterp for Recurrent Computation: Time-Resolved Circuit Discovery in RNNs Aishwarya Balwani
Abstract
Despite being one of computational neuroscience's most prominent modelling tools, recurrent neural networks (RNNs) have shown limited utility for revealing explicit structure-function relationships in neuronal circuits. This shortcoming reflects the fact that recurrent computations are distributed across neurons, timesteps, and internal states, as a consequence of which static summaries of weights or average activity often fail to reveal the transient causal interactions through which behavior may be implemented. In this work, we present a circuit discovery framework that adapts causal intervention techniques from mechanistic interpretability of large language models to the recurrent, multi-step computation in RNNs. By combining windowed ablations with Jacobian-based linearization of the hidden state trajectories, we estimate effective connectivity across the RNN as it evolves, thereby revealing how task-relevant computations are implemented through dynamically coordinated subcircuits. Across synthetic tasks with known mechanisms, our pipeline recovers operative circuits with high precision while demonstrating robustness advantages over correlation-based selection methods. On applying our methods to anatomically-constrained RNNs trained on the Allen Institute's Visual Behavior dataset, we recover VIP involvement in unexpected-stimulus processing and reveal temporally specific causal contributions invisible to static analyses. Our experiments further suggest that intrinsic VIP timing drives prediction error formation in the network, consequently bridging the timing-code and predictive-processing interpretations of VIP function.
Virtual
How Language Models Compute Negation: A Mechanistic Study of Factual Negation Jongwook Yoon, Jongwon Lim, Yohan Jo
Abstract
Large language models (LLMs) often fail to handle negation, predicting the original object of a factual statement even when a negation cue is inserted. We study how LLMs mechanistically compute factual negation through causal interventions and logit-space analyses on pairs of original and negated factual prompts. Our analysis on Gemma-3 and Qwen3 suggests that negation is not computed as a standalone late switch that flips the original object at the final prediction position. Instead, negation is first integrated into relation information and then propagated to the final position in middle-to-late layers. In later layers, this information shifts the model away from the original object, notably by reducing the model's attention to subject tokens. Together, these findings suggest that LLMs' negation handling is fragile not because they have limited access to negation signals, but more likely because negation must be integrated through multi-stage modifications within a factual recall pathway.
Virtual
Different Heads, Same Fragile Tasks: A Cross-model Retrieval Head Correspondence Ananya Anand, Harry Ilanyan, Ira Pathak, Derrick Sual, Eric Xia
Abstract
Long-context language models rely on a sparse subset of attention heads for factual retrieval. It remains unclear whether these retrieval heads form task-specific mechanisms, a shared head pool, or model-specific artifacts. We test this on eight fact-extraction tasks built from SEC 10-K filings, run on three open-weight 7--8B instruction-tuned models. For each task and model, we rank query-focused retrieval heads and ablate the top-$K$ heads of one source task while measuring accuracy on every target task. Ablations transfer broadly across tasks within each model. Across models, source-task head groups are not consistently destructive: after controlling for ablation size, source disruptivity correlates only weakly across model pairs ($R^2 = 0.02$--$0.15$). The target tasks that collapse under ablation are partially shared across models ($R^2 = 0.18$--$0.59$). Activation patching shows that QR-head activations causally carry answer information: clean QR-head outputs recover a substantial-to-near-full fraction of the log-probability lost to ablation, while bottom-ranked, random, and same-layer non-QR controls do not. Different models implement entity extraction with different head populations, but they agree on which tasks are vulnerable to retrieval-head removal.
Virtual
Long Range Dependency Understanding in State Space Models Srividya Ravikumar, Shweta Verma, Abhinav Anand, Mira Mezini
Abstract
Although state-space models (SSMs) have demonstrated strong performance on long-sequence benchmarks, most research has emphasized predictive accuracy rather than interpretability. In this work, we present the first systematic interpretability study of the SSM kernel trained on a real-world task. We present time- and frequency-domain analyses of the SSM kernel and show that the long-range modeling capability of SSMs varies significantly across model architectures, affecting model performance. We assess the models' understanding of long- and short-range dependencies through their filter behavior. For instance, the SSM kernel can behave as a low-pass, band-pass, or high-pass filter. The insights from our analysis can guide future work on designing better SSM-based models.
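As a rough illustration of the frequency-domain view described in this abstract (not the authors' code), the sketch below builds the convolution kernel of a toy diagonal discrete SSM and labels it by where its magnitude response peaks. The function names, the diagonal parameterization, and the peak-location rule are all assumptions made for this example.

```python
import numpy as np

def ssm_kernel(a, b, c, length):
    """Kernel K[k] = sum_n c[n] * a[n]**k * b[n] of a diagonal discrete SSM (toy setup)."""
    k = np.arange(length)[:, None]                      # shape (length, 1)
    return ((a[None, :] ** k) * (b * c)[None, :]).sum(axis=1).real

def filter_type(kernel, n_fft=512):
    """Hypothetical labelling rule: classify by the location of the magnitude peak."""
    mag = np.abs(np.fft.rfft(kernel, n=n_fft))          # response on [0, pi]
    peak = int(np.argmax(mag))
    if peak == 0:
        return "low-pass"
    if peak == len(mag) - 1:
        return "high-pass"
    return "band-pass"

# Slowly decaying real pole: energy concentrated at low frequencies.
slow = ssm_kernel(np.array([0.9]), np.array([1.0]), np.array([1.0]), 256)
# Sign-alternating pole: energy concentrated at the Nyquist frequency.
alternating = ssm_kernel(np.array([-0.9]), np.array([1.0]), np.array([1.0]), 256)
# Complex-conjugate pole pair: oscillation at a mid-range frequency.
pair = np.array([0.9j, -0.9j])
oscillating = ssm_kernel(pair, np.array([0.5, 0.5]), np.array([1.0, 1.0]), 256)
```

A kernel that decays slowly in the time domain keeps information from distant positions, which is why a low-pass response is usually read as a sign of long-range dependency modeling.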
Virtual
What Do Mixture-of-Depth Routers Learn? Routing Patterns in Gemma 2 Stein Pleiter, Ege Erdogan, Ana Lucic
Abstract
Mixture-of-Depths (MoD) routers improve efficiency of transformer models by learning which tokens to process and which to bypass at each layer. However, their learned routing patterns have not been characterized in alternating-attention architectures, which mix local attention with sparse global attention to increase efficiency. We evaluate routing by training token-level MoD routers on Gemma 2 2B and find that the mean routing rate is significantly higher in full-attention layers than at sliding-window layers. We find that a token’s part of speech is correlated with whether the router skips or processes it, and that the same category is often routed differently at different layers: for example, determiners are preferentially skipped at the shallowest target layer but preserved at deeper ones. Because these patterns hold within single layers, they are not subject to the confound (perfect coincidence of attention type and layer parity in Gemma 2) that limits our routing-rate result.
Virtual
Lyapunov Spectral Analysis of Loop Transformer Dynamics Fynn Kiwitt, Christopher Irwin, Riccardo Ali, Pietro Lio
Abstract
Loop Transformers iterate a shared block of layers, defining a discrete dynamical system over hidden states. Existing characterizations rely on attention or hidden-state similarity, which cannot distinguish slow convergence, marginal stability, and chaos. We compute the Lyapunov spectra of two loop transformers and find a dichotomy in dynamics: while Ouro-1.4B is mildly chaotic and rules out convergence under the measured finite-time dynamics, Huginn-0125 converges uniformly in all dimensions. A per-sublayer attribution provides a mechanistic account of how each regime is produced. Both architectures exhibit near-cancellation between large opposing contributions of different layers; however, the patterns differ significantly. Ouro distributes compression and expansion across 25 sublayers, with direction-selective late layers and direction-blind RMSNorm jointly producing a wide spectrum. Huginn concentrates the entire cancellation between the input-injection adapter and the first core block. This supports the empirical observation that input injection encourages fixed-point convergence, and indicates that this convergence hinges on an architectural balance between two blocks. A measurement of the first Lyapunov exponent across 8 Huginn training checkpoints further shows the regime emerges early and remains stable. Ultimately, we establish Lyapunov spectra as a rigorous lens for characterizing the stability regimes and mechanistic behavior of loop transformers.
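For readers unfamiliar with Lyapunov spectra, here is a minimal sketch of the standard QR re-orthonormalization estimator applied to a generic iterated map. It is a textbook procedure on a toy system, not the authors' pipeline; `step` and `jacobian` are hypothetical callables standing in for one loop iteration and its Jacobian.

```python
import numpy as np

def lyapunov_spectrum(step, jacobian, x0, n_steps=200):
    """Estimate the Lyapunov exponents of x_{t+1} = step(x_t).

    Tangent vectors are pushed through the Jacobian at each step and
    re-orthonormalised with QR; the log of |diag(R)| accumulates the
    per-direction expansion or contraction rates.
    """
    x = np.asarray(x0, dtype=float)
    q = np.eye(x.size)
    log_growth = np.zeros(x.size)
    for _ in range(n_steps):
        q, r = np.linalg.qr(jacobian(x) @ q)
        log_growth += np.log(np.abs(np.diag(r)))
        x = step(x)
    return np.sort(log_growth / n_steps)[::-1]   # largest exponent first

# Toy linear map: one expanding and one contracting direction.
A = np.diag([2.0, 0.5])
spectrum = lyapunov_spectrum(lambda x: A @ x, lambda x: A, [1.0, 1.0])
```

A positive leading exponent indicates sensitive dependence on initial conditions, while an all-negative spectrum indicates contraction in every direction, which matches the chaotic-versus-convergent dichotomy the abstract describes.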
Virtual
Cross-Lingual Emergent Misalignment: Shared Multilingual Circuits Can Propagate Safety Failures Akanksha Devkar, Anu Adesina, Ayesha Imran
Abstract
Modern large language models exhibit language-specific processing components alongside shared, language-independent abstract representations of concepts. As multilingual LLMs are deployed globally, it is critical to understand whether misaligned behaviour can propagate across languages through shared pathways. Recent work has shown that extremely small, low-rank fine-tuning on harmful datasets can induce emergent misalignment (EM) in a mechanistically structured and robust way. In this work, we fine-tune the multilingual Tiny Aya model family (3.35B parameters) on insecure-text datasets in English and evaluate EM transfer across eight typologically diverse languages spanning distinct regions, scripts, and resource tiers: English, Portuguese, Turkish, Hindi, Marathi, Urdu, Hausa, and Yoruba. We find that EM transfer is present in a structured and non-uniform manner across all languages, that there appears to be a shared misalignment direction inside the network, and that the strongest EM transfer is observed for languages whose internal representations overlap more strongly with this direction.

SAEs & Concept Discovery (18)

Virtual
Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders Shunchang Liu, Xin Chen, Belen Martin Urcelay, Francesco Croce
Abstract
Preference learning in large language models relies on reward models as proxies for human judgment. However, these models frequently exhibit preference instability, producing contradictory preference assignments in response to subtle, meaning-preserving input variations. We analyze this instability at the representation level under three semantic-preserving perturbation types: paraphrasing, pattern injection, and backdoor triggers. We attribute this instability to over-reliance on predictive yet brittle features, which we term unstable features, and isolate them via Sparse Autoencoders (SAEs) in a sparse latent space where benign and perturbed inputs activate distinctly separable patterns. Building on this separability, we propose two SAE-based instability mitigation strategies: SAE Feature Steering, which identifies and suppresses anomalously activated features at inference, and SAE Residual Correction, which learns adaptive adjustments over SAE features to restore correct preferences. Our methods substantially reduce incorrect preference assignments on harmlessness and hallucination benchmarks while preserving benign performance and general utility on other tasks, without retraining the reward model. Our code and data are available at https://github.com/shunchang-liu/pisa.
Virtual
Stable and Steerable Sparse Autoencoders with Weight Regularization Piotr Jedryszek, Oliver M Crook
Abstract
Sparse autoencoders (SAEs) are widely used to extract human-interpretable features from neural network activations, but their learned features can vary substantially across random seeds and training choices. To improve stability, we study \emph{weight regularization} by adding L1 or L2 penalties on encoder and decoder weights, and evaluate how regularization interacts with common SAE training defaults. On MNIST, we observe that L2 weight regularization produces a core of highly aligned features and, when combined with tied initialization and unit-norm decoder constraints, dramatically increases cross-seed feature consistency. For TopK SAEs trained on language model activations (Pythia-70M-deduped), adding a small L2 weight penalty increases the fraction of features shared across three random seeds and roughly doubles steering success rates, while leaving the mean of automated interpretability scores essentially unchanged. Finally, in the regularized setting, activation steering success becomes better predicted by auto-interpretability scores, suggesting that regularization can align text-based feature explanations with functional controllability. Code for training and analysis is available at https://github.com/oxPJ/sae-weight-regularization; trained SAEs and result data are available at https://huggingface.co/anonsaereg/SAE-REG-models and https://huggingface.co/datasets/anonsaereg/SAE-REG-results, respectively.
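To make the training objective concrete, the following is a minimal sketch of a TopK SAE loss with an added L2 penalty on encoder and decoder weights. It is an assumption-laden illustration and not the released code: the function name, the exact form of the penalty, and the coefficient `l2_coeff` are hypothetical.

```python
import numpy as np

def topk_sae_loss(x, w_enc, b_enc, w_dec, k, l2_coeff):
    """Reconstruction MSE of a TopK SAE plus an L2 weight penalty (illustrative form)."""
    pre = x @ w_enc + b_enc                                  # (batch, n_latents)
    # Keep the k largest pre-activations in each row and zero the rest.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    z = np.maximum(z, 0.0)                                   # non-negative codes
    recon = z @ w_dec
    mse = np.mean((x - recon) ** 2)
    penalty = l2_coeff * (np.sum(w_enc ** 2) + np.sum(w_dec ** 2))
    return mse + penalty, z

# Tiny example with identity weights so the numbers are easy to follow.
x = np.array([[1.0, 2.0]])
eye = np.eye(2)
loss_plain, codes = topk_sae_loss(x, eye, np.zeros(2), eye, k=1, l2_coeff=0.0)
loss_reg, _ = topk_sae_loss(x, eye, np.zeros(2), eye, k=1, l2_coeff=0.1)
```

In this toy case only the larger input coordinate survives the TopK step, and the regularized loss differs from the plain one only by the weight-norm term.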
Virtual
Feature Identification via the Empirical NTK Jennifer Lin
Abstract
We provide evidence that eigenanalysis of the empirical neural tangent kernel (eNTK) can surface feature directions in trained neural networks. Across three increasingly realistic settings -- a 1-layer MLP trained on modular addition, a 1-layer Transformer trained on modular addition and the pretrained language model Gemma-3-270M -- we show that top eigenspaces of the eNTK align with ground-truth or interpretable features. In the modular arithmetic examples, top eNTK eigenspaces align with the Fourier features used by the MLP and the Fourier features at seed-dependent frequencies used by the Transformer to implement known ground-truth algorithms. Moreover, the alignment of the relevant subspaces evolves over training, with its first derivative peaking near the onset of grokking. For Gemma-3-270M, we compute top eNTK eigendirections on a dataset of TinyStories context windows and check their alignment with an automatically-generated set of parts-of-speech and other grammatical feature directions. We find that the alignment of eNTK eigendirections with grammar features outperforms a same-budget baseline of PCA on model activations. These results suggest that eNTK eigenanalysis may provide a new handle towards identifying features in trained models for mechanistic interpretability.
Virtual
VASAE: Naming SAE Dictionary Directions with Vocabulary-Aligned Anchoring Kairui Zhang, Ziwen Yu, Zahraa S. Abdallah, Martha Lewis
Abstract
Sparse autoencoders (SAEs) provide useful decompositions of Transformer residual streams, but their learned features are usually named post hoc rather than directly connected to the Transformer's token vocabulary. We introduce Vocabulary-Aligned Sparse Autoencoder (VASAE), a method that trains SAE features under vocabulary-aligned anchoring and assigns each feature an intrinsic token name: the token string whose embedding is nearest to that feature. Without reducing reconstruction quality compared with a standard SAE, VASAE produces dictionaries with vocabulary-aligned features. Using a 0.8 cutoff on the nearest-token alignment score, dictionaries trained on GPT-2-small post-residual streams align about 90% of features in layers 0--10. In Llama-3.1-8B, representative shallow and middle-layer dictionaries contain strongly aligned features, including 92.8% in the shallow layer, while the representative final-layer dictionary shows limited alignment. After subtracting the sentence-level mean sparse code, case studies show that many remaining intrinsic token names are relevant to nearby input tokens. These results suggest that vocabulary-aligned anchoring can connect learned features to intrinsic token names during training, complementing post hoc interpretation of learned dictionaries.
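The naming step described above can be sketched as a nearest-neighbor lookup in embedding space. The example below assumes that "nearest" means highest cosine similarity and that the 0.8 cutoff is applied to that similarity; both are assumptions for illustration, and `name_features` is a hypothetical helper, not part of the authors' release.

```python
import numpy as np

def name_features(decoder_dirs, token_embeddings, vocab, cutoff=0.8):
    """Assign each feature the token whose embedding is closest in cosine similarity."""
    d = decoder_dirs / np.linalg.norm(decoder_dirs, axis=1, keepdims=True)
    e = token_embeddings / np.linalg.norm(token_embeddings, axis=1, keepdims=True)
    sims = d @ e.T                                   # (n_features, vocab_size)
    best = sims.argmax(axis=1)
    scores = sims[np.arange(len(best)), best]
    # Each entry: (token name, alignment score, passes cutoff?)
    return [(vocab[i], float(s), bool(s >= cutoff)) for i, s in zip(best, scores)]

# Toy two-token vocabulary with orthogonal embeddings.
vocab = ["cat", "dog"]
embeddings = np.array([[1.0, 0.0], [0.0, 1.0]])
features = np.array([[2.0, 0.1],     # almost parallel to "cat"
                     [1.0, 1.2]])    # leans toward "dog" but not strongly
names = name_features(features, embeddings, vocab)
```

The second toy feature still receives a name, but its score falls under the cutoff, which is the kind of feature the abstract would count as not aligned.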
Virtual
Crosscoding Through Time: Sparse Feature Discovery Across Sequence Positions Dmitry Manning-Coe, Han Xuanyuan, Aniket Deshpande, Andrii Shportko, William Fei
Abstract
Dictionary learning methods - such as Sparse Autoencoders (SAEs) and crosscoders - decompose model activations into human-interpretable building blocks. We introduce _temporal crosscoders_, a simple and flexible framework for feature discovery in Large Language Models (LLMs). To properly evaluate temporal crosscoders we develop TempBench: a panel of synthetic and real-world tasks for evaluating temporal structures. Temporal crosscoders outperform both conventional and temporal architectures in both of our synthetic settings and on two of the four real-world settings - more than any other current architecture. Most strikingly, they can detect backtracking - a key reasoning behavior - at a 40% higher rate than conventional SAEs, and are 15% more effective in inducing it. Our results establish temporal crosscoders as a simple and flexible framework for feature discovery, both local and temporal. We provide full code at the following anonymous repository: https://anonymous.4open.science/r/temp-bench-anon/.
Virtual
fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery Andreas D. Demou, Panagiotis Koromilas, James Oldfield, Yannis Panagakis, Mihalis Nicolaou
Abstract
Many features in pretrained Transformers span multiple layers: they emerge through stages of inference, persist in the residual stream, or are built jointly by parallel MLPs. Crosscoders (namely, sparse dictionaries trained jointly across layers) aim to recover these cross-layer features in a single shared latent space. We show that standard crosscoders largely fail at this purpose. Although their decoder weight norms spread evenly across layers, a functional coherence metric we introduce reveals that each latent's activation is effectively driven by only one or two layers on average. While functionally coherent latents act as human-interpretable concept detectors (e.g., US states and cities), the layer-localized latents that crosscoders predominantly learn collapse onto surface-level patterns such as digit detectors. We trace this failure to two structural limitations: unconstrained cross-layer parameterization and unregularized cross-layer dependence. We address both by introducing fmxcoders, which (i) replace the encoder and decoder with low-rank tensor factorizations that draw every latent's per-layer weights from a shared cross-layer basis, and (ii) apply stochastic layer masking, a denoising regularizer along the layer axis that penalizes latents whose contribution collapses when a single layer is masked. Across GPT2-Small, Pythia-410M, Pythia-1.4B, and Gemma2-2B, fmxcoders lift mean probing F1 by 10-30 points, surpassing per-layer SAE baselines that standard crosscoders fail to reach, reduce reconstruction MSE by 25-50%, and roughly double mean functional coherence. An LLM-as-a-judge evaluation further shows that fmxcoders recover 3-13 times more semantically coherent latents than standard crosscoders across all four base LLMs.
Virtual
TA-SAE: Untangling Temporal Polysemanticity in Dictionary Learning for Rectified Flow Transformers Daniel George, Saatvik Billa, Utkarsh Tyagi
Abstract
As text-to-image rectified flow models become increasingly capable, understanding their internal representations is essential for interpretability, control, and model safety. While sparse autoencoders (SAEs) have emerged as a promising tool for extracting interpretable features, existing works do not shed light on how individual concepts evolve over time internally, due to a reliance on timestep-specific SAEs. We argue that the reliance on timestep-dependent SAEs is in large part due to a lack of any inductive bias in SAEs towards the generation trajectory, crippling their ability to be used across long trajectories. We show that this creates failure modes missed by standard SAE diagnostics - a feature can reconstruct well while changing semantic identity across denoising time, and a concept can be handed off across different features at different denoising stages. We explore each of these phenomena, which we refer to as temporal feature absorption and temporal feature splitting respectively. In addition, to combat them, we introduce trajectory-aware SAEs (TA-SAEs), which learn a single dictionary over denoising time while assigning each feature a learned persistence coefficient, regularizing persistent features to remain stable across adjacent timesteps while allowing transient features to capture time-local denoising computations. On FLUX.1-dev block-18 activations, we show that TA-SAE preserves standard SAE quality, reduces absorption and splitting, and improves temporal semantic stability. To our knowledge, we are the first to formalize temporal feature absorption and temporal feature splitting as feature--concept identity failures in diffusion SAEs, and propose an alternative to traditional SAEs to combat them.
Virtual
Sparse Autoencoders Can Learn Graded Latents for Relational Composition Theo Farrell, Patrick Leask, Noura Al Moubayed
Abstract
Transformers trained on graph problems can compress structured sequences into fixed-size activation vectors. Prior work on a simple ordered bigram copying task presents a transformer model that does not use a separate feature for order; instead, it represents order through relative feature magnitudes. This raises a problem for interpreting sparse autoencoders (SAEs) trained on these activations: a latent may be meaningful because its value varies, not only because it is active. We train SAEs on this toy model and find graded latents whose activation values, not just their on/off status, correspond to different bigram orderings. In one SAE, one latent tracks how much more the model attends to the first input than the second input, with correlation 0.807; 98\% of latents have absolute correlation below 0.004. When we scale this latent, the model swaps 49.8\% of all outputs, or 77.8\% of outputs where the latent is active. These results suggest that SAE latents can encode relational information through magnitude, so analyses that reduce latents to feature presence indicators may misrepresent model features. Code is available at https://github.com/Theosdoor/graded-latents.
Virtual
Weight Decay Shapes Representation Geometry: Towards a More Nuanced Understanding of Sparse Autoencoders in Vision Transformers Nina Burdorf, Christian Medeiros Adriano
Abstract
Mechanistic interpretability typically treats trained models as fixed objects, yet prior work shows that training fundamentally shapes representation geometry. We ask whether this geometry determines when sparse interpretability methods succeed versus fail. Training 64 ViT-Tiny models across varied hyperparameters on traffic sign datasets, we find that weight decay is the dominant factor shaping Sparse Autoencoder (SAE) behavior. Across the sweep, higher monosemanticity and fewer dead SAE features correlate with better cross-entropy recovery in deep layers. A matched weight-decay sweep reveals a sharp threshold near wd <0.01. Below it, SAE feature usage collapses into repeated reuse of the same small set; above it, diverse features emerge. This suggests that representation geometry, controlled by training choices like weight decay, determines whether sparse methods recover meaningful structure. Training should therefore be treated as part of the interpretability pipeline.
Virtual
Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines? Anton Korznikov, Andrey V. Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, Elena Tutubalina
Abstract
Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground‑truth features, we demonstrate that three out of four state‑of‑the‑art SAE architectures recover only 7–9% of true features despite achieving around 71% explained variance, showing that strong reconstruction alone is insufficient to guarantee meaningful feature recovery. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully-trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). These results show that current evaluation metrics are insufficient to certify that SAEs have learned meaningful features, and we offer our baselines as a reusable protocol for future SAE evaluation.
Virtual
Look Before You Steer: Geometry Predicts SAE Feature Steerability Muhammad Khan, Shlok Channawar, Akshaj Gurugubelli, Girish Gupta, Aditya Shah
Abstract
Steering with SAE features requires per-feature coefficient tuning, which currently demands intervention sweeps. We ask whether properties of the SAE itself, computable before any forward pass, predict which features will be cheap or expensive to steer. We show that variation in SAE feature steerability is partially predicted by decoder-space geometry: neighbor density and maximum cosine similarity to nearby decoder directions --- both computable from the SAE weight matrix before any intervention --- rank features by how much steering they require for a fixed behavioral effect ($\rho$ up to $-0.546$, $p < 10^{-6}$, AUROC $0.610$--$0.822$ across conditions; the signal is rank-based, consistent with grid discreteness). This geometry--steerability relationship replicates across two Gemma-2 model scales (2B and 9B), two SAE widths (16K and 65K), and is detectable cross-architecturally on Llama-3.1-8B-Instruct ($\rho = -0.266$, $n = 300$). On Qwen3-8B with BatchTopK SAEs, geometry predicts whether a feature is steerable at all but not the continuous ordering among responsive features, revealing a boundary condition tied to SAE training regime. The signal weakens at deep proportional layer depth in both models, where the cost of steering exceeds our intervention budget --- a consistent depth boundary. These results provide preliminary evidence that pre-steering geometry can partially inform coefficient selection, offering a path toward screening features for controllability before deployment.
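The two geometric quantities named in this abstract can be computed directly from a decoder weight matrix. The sketch below is one plausible reading, not the authors' definition: it assumes one decoder direction per row and defines neighbor density as the number of other directions whose cosine similarity exceeds a hypothetical `density_threshold`.

```python
import numpy as np

def decoder_geometry(w_dec, density_threshold=0.5):
    """Per-feature max cosine similarity and neighbor count from decoder weights alone."""
    d = w_dec / np.linalg.norm(w_dec, axis=1, keepdims=True)
    cos = d @ d.T                                    # pairwise cosine similarities
    np.fill_diagonal(cos, -np.inf)                   # ignore self-similarity
    max_cos = cos.max(axis=1)
    neighbor_density = (cos > density_threshold).sum(axis=1)
    return max_cos, neighbor_density

# Toy decoder: features 0 and 1 are near-duplicates, feature 2 is isolated.
w_dec = np.array([[1.0, 0.0],
                  [1.0, 0.1],
                  [0.0, 1.0]])
max_cos, density = decoder_geometry(w_dec)
```

Because nothing here requires a forward pass through the language model, such scores could in principle be used to rank features before any steering sweep is run, which is the use case the abstract proposes.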
Virtual
Dual-Contrastive Sparse Autoencoders Reveal Features of Musical Interpretation Jan Chen, Manuel Cherep, Patricia Maes, Nikhil Singh
Abstract
For decades, philosophers and musicologists have debated which features of a performance are constitutive of the work and which express how it is being interpreted. Computational evidence has been hard to assemble: audio embedding models conflate work identity with performance style, and existing interpretability tools for music generators recover flat dictionaries that mix the two. We address this gap with the dual-contrastive sparse autoencoder (DC-SAE), a two-branch sparse autoencoder that uses coarse work-level metadata to factor a frozen MusicGen transformer's residual stream into work-identity and performance-variation subspaces. Across classical-work, jazz-standard, and pop-cover corpora, the resulting decomposition is supported by probes and feature galleries that surface musically interpretable concepts on each side. Without performer supervision, the variation branch acquires structure related to performer identity, and steering along these directions can shift the perceived performer of generated audio while preserving the underlying work. Together, these results show that MusicGen, a generative music transformer, contains a model-internal representation of musical interpretation that can be recovered, interpreted, and steered.
Virtual
Sparse Autoencoder Feature Unlearning is Shallow: Lessons from Monolingual Features Severin Field, Roman Yampolskiy
Abstract
Sparse autoencoder (SAE) interventions are often described as “removing capabilities,” but it is unclear whether they remove what the model knows versus merely what it generates. We suppress monolingual features in Gemma 3 and measure its ability to produce, comprehend and translate a given language. Ablating (suppressing) a single French-specific SAE feature in Gemma 3 suppresses French production while leaving French comprehension and general reasoning (MMLU) intact. This demonstrates that the model’s ability to understand French and its propensity to generate it are mechanistically decoupled. Our results suggest that SAE feature-based interventions are shallow, not deep. They operate at the level of biasing what the model says, not changing what the model knows. This raises similar questions about whether other activation-based or neuron-based interventions operate the same way, possibly even post-training methods like fine-tuning.
Virtual
Iterative Distillation for Feature-Consistent Sparse Autoencoders Cristina P Martin-Linares, Jonathan P Ling
Abstract
Sparse autoencoders (SAEs) aim to disentangle model activations into monosemantic, human-interpretable features. In practice, learned features are often redundant and vary across training runs and sparsity levels, which makes interpretations difficult to transfer and reuse. We introduce Distilled Matryoshka Sparse Autoencoders (DMSAEs), a training pipeline that distills a compact core of consistently useful features and reuses it to train new SAEs. DMSAEs run an iterative distillation cycle: train a Matryoshka SAE with a shared core, use gradient $\times$ activation to measure each feature’s contribution to next-token loss in the most nested reconstruction, and keep only the smallest subset that explains a fixed fraction of the attribution. Only the core encoder weight vectors are transferred across cycles; the core decoder and all non-core latents are reinitialized each time. On Gemma-2-2B layer 12 residual stream activations, seven cycles of distillation (500M tokens, 65k width) yielded a distilled core of 197 features that were repeatedly selected. Training using this distilled core improves several SAEBench metrics and demonstrates that consistent sets of latent features can be transferred across sparsity levels.
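The selection rule in the distillation cycle, keeping the smallest feature subset that explains a fixed fraction of the attribution, can be sketched as a sort followed by a cumulative sum. This is an illustrative reading rather than the authors' implementation: `select_core`, the use of absolute attribution values, and the default `fraction` are assumptions.

```python
import numpy as np

def select_core(attribution, fraction=0.9):
    """Return indices of the smallest feature set covering `fraction` of total attribution.

    `attribution` holds one score per feature, e.g. gradient x activation
    averaged over tokens; absolute values are used here as an assumption.
    """
    scores = np.abs(np.asarray(attribution, dtype=float))
    order = np.argsort(scores)[::-1]                 # most important first
    cumulative = np.cumsum(scores[order]) / scores.sum()
    n_keep = int(np.searchsorted(cumulative, fraction) + 1)
    return np.sort(order[:min(n_keep, len(order))])

# Toy attribution scores for four features.
scores = [5.0, 1.0, 3.0, 1.0]
core_75 = select_core(scores, fraction=0.75)   # needs the two largest features
core_95 = select_core(scores, fraction=0.95)   # needs all four features
```

Repeating this selection over several training cycles and keeping the features that are chosen every time is one way to arrive at a small, stable core of the kind the abstract reports.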
Virtual
Latent Mechanisms of Code-Switching in Large Language Models Ryo Mitsuhashi, Sabri Boughorbel, Majd Hawasly
Abstract
Multilingual large language models can exhibit _unintended code-switching_ -- unnecessarily alternating between languages during generation. We present a comparative study of three methods that identify language-controlling latents in cross-layer transcoders: activation value-based selection (ValSel), activation frequency-based selection (FreqSel), and LLM-generated latent annotation-based selection (AnnSel). To evaluate the efficacy of these methods in identifying language-controlling latents, we introduce two multilingual code-switching benchmarks designed for fine-grained analysis of language steering across seven languages. Through targeted intervention experiments on Gemma-2-2B and Qwen3-4B, we find that all three methods effectively manipulate generation language, with FreqSel achieving the strongest overall performance, while AnnSel offers interpretable latent selection through explicit language annotations. We study the redundancy of language control representation in the latent space of the studied models by a knock-out analysis that suggests evidence of representation divergence.
Virtual
Compression at a Cost: Interpreting Information Bottlenecks in Safety-Aligned Reward Models Jing Liu, Aman Shukla
Abstract
Aligning Large Language Models with human intent relies heavily on Reward Models (RMs), which frequently exploit spurious correlations rather than internalizing robust human preferences, particularly in safety-critical settings. Recent information-theoretic approaches attempt to mitigate this by applying an information bottleneck (IB) to the latent space, theoretically pruning spurious features while preserving core alignment signals. However, the precise mechanistic impact of this compression on internal representations remains opaque. In this paper, we present the first mechanistic interpretability analysis of IB-regularized RMs trained on safety-oriented preference datasets. By training Sparse Autoencoders (SAEs) on the representations immediately preceding and following the bottleneck, we systematically track survival of semantic features across varying compression penalties ($\beta$). Our analysis reveals that compression acts selectively rather than uniformly; while spurious structures are entirely eradicated (a 100\% drop), safety-relevant features are simultaneously attenuated. We explicitly map these latent representational shifts to macro-level behavioral evaluations on RewardBench, observing a severe capability trade-off where standard RMs outperform the optimal $\beta$ configuration in aggregate mean score. Taken together, our semantic and empirical evidence indicates that while information bottlenecks successfully distill critical safety concepts, they exact a massive alignment tax, producing hyper-specialized safety auditors at the expense of robust, general-purpose preference modeling. \textit{This work analyzes reward model safety and contains discussions and examples highlighting potential risks and harmful model outputs.}
Virtual
Kronecker Factorization Improves Efficiency and Interpretability of Sparse Autoencoders Vadim Kurochkin, Yaroslav Aksenov, Daniil Laptev, Nikita Balagansky, Daniil Gavrilov
Abstract
Sparse Autoencoders (SAEs) decompose language-model activations into sparse, interpretable features, but standard encoders usually treat the latent dictionary as a flat set of independent coordinates, leaving hierarchy and feature interactions to emerge only implicitly. We propose **KronSAE**, an encoder-side hierarchical SAE that factorizes the latent space into heads and forms post-latent features as pairwise compositions of lower-dimensional pre-latents using $\mathrm{mAND}$, a differentiable AND-like interaction. This imposes a compositional co-activation prior while remaining compatible with standard SAE objectives and variants such as TopK, Matryoshka, and Switch SAEs. Under equal-token-budget training, KronSAE achieves reconstruction quality comparable to strong baselines, improves several interpretability and absorption metrics, better reflects correlated feature structure, and yields encoder parameter and FLOP reductions as an additional benefit.
Virtual
Rational Sparse Autoencoder Naiyu Yin, Yue Yu
Abstract
Sparse autoencoders (SAEs) are standard tools for mechanistic interpretability, but current SAE families are constrained by fixed encoder nonlinearities such as ReLU, JumpReLU, and TopK. This hard-codes a particular sparsity mechanism into the model and can distort the reconstruction-versus-sparsity trade-off. We introduce the _Rational Sparse Autoencoder_ (RSAE), which replaces the fixed encoder activation with a trainable low-degree rational function. Rational activations are flexible enough to uniformly approximate the coordinatewise activation primitives used by existing SAE families on compact domains (for TopK, the thresholded gate obtained after the $k$-th threshold is supplied), while also providing a richer function class for adapting to the observed pre-activation geometry. We realise this idea through a two-stage pipeline: an initialisation procedure that copies the pre-trained baseline SAE weights, plugs in rational coefficients obtained by the relaxed Remez exchange on synthetic data, and calibrates the scale parameters along with the rational coefficients; followed by a fine-tuning step under the standard sparsity-regularised reconstruction objective. Empirically, on residual-stream activations of three open-weight language models and across all three baseline activation families, the RSAE _strictly improves_ on the baseline after the fine-tuning step, both on reconstruction-side metrics (MSE, $\ell_0$, alive-feature fraction) and on downstream-behaviour metrics (cross-entropy degradation, loss recovered), without sacrificing feature-level interpretability under sparse probing. These gains are consistent across host language models, across baseline activation families, and across the full range of baseline sparsity we tested, while the upgrade itself adds only a handful of scalar parameters per autoencoder and runs in minutes on a single consumer GPU.

Representations & Feature Geometry (21)

Virtual
Gating Enables Curvature: A Geometric Expressivity Gap in Attention Satwik Bathula, Anand Joshi
Abstract
Multiplicative gating is widely used in neural architectures, but its use in attention is recent and its geometric role remains unclear. We model attention outputs as Gaussian means and study their Fisher Rao geometry. At the operator level, ungated attention induces flat manifolds through affine value mixing. Gating enables curved geometries, including positive curvature. This reveals a geometric expressivity gap. Furthermore, we identify a structured regime where curvature accumulates under composition, leading to a systematic amplification effect with depth. Empirically, gated models show higher curvature and perform better on nonlinear tasks, with no consistent gains on linear ones.
Virtual
A Compositional Calculus for Semantic Synergy in Language Model Embeddings Abel Jansma
Abstract
We introduce semantic synergy: a training-free measure of non-compositional representation in language models, obtained by taking the discrete derivative of a phrase embedding over its sub-span structure. Formally, semantic synergy is the Möbius inverse of the embedding function on the partial order of contiguous sub-spans. Across two embedding models and 107 pairs of short English idiomatic and literal phrases, semantic synergy strongly separates idiomatic from literal phrases (Cohen's $d \approx 1.80$--$1.81$, $p < 10^{-28}$), outperforming alternative residuals. The measure further distinguishes non-compositional proper names in a supporting experiment, and yields steering directions that move phrase embeddings toward idiomatic interpretations. Layer-wise extraction in Qwen3-0.6B and Pythia-1B models shows that the non-compositional structure emerges mainly in middle-to-late layers, and becomes strong only late in training. Span-Möbius residuals therefore provide a lightweight algebraic probe of compositional structure in embedding spaces and a bridge toward hidden-state mechanistic analysis.
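As an illustrative sketch of the Möbius-inverse construction this abstract describes (not the author's code), the Python below assigns a synergy term to every contiguous sub-span by recursive inversion. The `toy_embed` function and its vectors are hypothetical stand-ins for a real embedding model, and the partial order is assumed to contain non-empty contiguous sub-spans only.

```python
from functools import lru_cache

def span_synergy(tokens, embed):
    """Moebius-invert `embed` over non-empty contiguous sub-spans.

    Defines g such that embed(span) equals the sum of g over all
    contiguous sub-spans of that span; returns {(i, j): g(tokens[i:j])}.
    """
    n = len(tokens)

    @lru_cache(maxsize=None)
    def g(i, j):
        total = list(embed(tokens[i:j]))
        # Subtract the synergy of every strictly smaller contiguous sub-span.
        for a in range(i, j):
            for b in range(a + 1, j + 1):
                if (a, b) != (i, j):
                    total = [t - s for t, s in zip(total, g(a, b))]
        return tuple(total)

    return {(i, j): g(i, j) for i in range(n) for j in range(i + 1, n + 1)}

# Hypothetical toy embedding: additive word vectors plus a bonus that only
# appears when the full idiom is present (the non-compositional part).
TOY_VECTORS = {"kick": (1.0, 0.0), "the": (0.0, 1.0), "bucket": (2.0, 2.0)}

def toy_embed(span):
    x = sum(TOY_VECTORS[t][0] for t in span)
    y = sum(TOY_VECTORS[t][1] for t in span)
    if tuple(span) == ("kick", "the", "bucket"):
        x += 5.0
    return (x, y)

phrase = ("kick", "the", "bucket")
synergy = span_synergy(phrase, toy_embed)
print(synergy[(0, 3)])  # (5.0, 0.0): only the idiom bonus survives
print(synergy[(0, 2)])  # (0.0, 0.0): "kick the" is purely additive here
```

In this toy setting the full-phrase synergy isolates exactly the part of the embedding that no sub-span accounts for, which is the sense in which the measure is described as a discrete derivative.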
Virtual
Emergent Structured Representations Support In-Context Inference in Large Language Models Ningyu Xu, Qi Zhang, Xipeng Qiu, Xuanjing Huang
Abstract
Whether large language models (LLMs) perform structured inference or merely exploit surface-level statistical associations remains a central debate. Here, rather than dissecting individual attention heads or neurons, we isolate a shared latent subspace in the residual stream and test its functional role in inference. Our results show that this subspace emerges in intermediate layers and exhibits increasing cross-context alignment with more demonstrations. Causal interventions reveal that it is not a mere epiphenomenon but a functional mediator of inference: restoring it recovers model performance under corruption, ablating it significantly degrades performance, and transferring its relational structure enables targeted steering of model predictions across contexts. We further identify a layer-wise progression where attention heads in early-to-middle layers integrate contextual cues to construct the subspace, and later layers leverage it to generate predictions. Together, these findings suggest that LLMs dynamically assemble a structured and transferable latent substrate for inference, offering insights into the computational foundations of flexible adaptation and a promising basis for steering model behavior.
Virtual
Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions Andrew Lee, Fernanda Viégas, Martin Wattenberg
Abstract
While researchers are finding concepts represented as linear directions in language models, a bag of linear directions fails to capture relational structure. To better understand this dichotomy, we study a model with known linear representations, but trained in a highly structured domain -- the board game Othello. While the model's internal board-state representation is linearly decodable, we find additional structure in the form of tensor product representations (TPRs). We train TPR probes to recover shared structure amongst the linear probes, yielding a factorization into square-embeddings, color-embeddings, and a binding matrix that composes them to construct the model's board-state representation. We find geometric signatures within the weights of our TPR probe that align with the structure of the board, but perhaps more importantly, that the linear probes can be recovered directly from the parameters of our TPR probe. Our findings suggest that directional representations may be projections of more structured underlying representations.
Virtual
Sparse Fourier Regularization for Modular Arithmetic Models Amogh Dalal, Adityasinh Rathod, Akshay Rangamani
Abstract
Modular arithmetic serves as a useful test bed for observing empirical phenomena in deep learning, including grokking. Prior work in mechanistic interpretability has shown that sequence models such as transformers and recurrent networks eventually converge to a Fourier multiplication strategy for solving these tasks. In this paper we introduce $\ell_1$ regularization in the Fourier space of the (un)embedding layers to bypass grokking and train modular arithmetic models up to $3 \times$ faster. We also study the embedding geometry of models trained on multiple arithmetic operations and show how models trained on multiple operations in the same group (like addition and subtraction) use the same Fourier spectrum, while models trained on multiple operations across different groups (like addition and multiplication) entangle their Fourier spectra in the same embedding dimensions - making targeted interventions harder. Here again $\ell_1$ Fourier regularization applied to groups of embedding dimensions disentangles the Fourier spectra corresponding to different tasks.
Virtual
Where Do Models Find Happiness? Emotion Vectors in Open-Source LLMs Sinie van der Ben, Raphaël Baur, Yannick Metz, Mennatallah El-Assady
Abstract
Recent work identified "emotion vectors" in Claude Sonnet 4.5, which are internal representations that encode emotion concepts, causally influence behavior, and exhibit geometry mirroring human psychological structure. We test the generality of these findings in two open-weight models, Apertus-8B and Gemma-4-E4B, extracting emotion contrast vectors across all layers, using two model-generated corpora. We recover valence geometry for both models, with peak PC1--valence correlations of $r = 0.76$ and $r = 0.83$, approaching the $r = 0.81$ reported for Claude. Beyond replication, we observe notable differences in how valence representations emerge across model depth. In Gemma-4-E4B, valence is strongly encoded in early layers but collapses towards later layers, whereas Apertus-8B exhibits the opposite pattern, with valence representations absent in early layers, but emerging at mid depths. Arousal encoding, in contrast, is sensitive to the extraction corpus: both models show stronger PC2--arousal alignment with Gemma-generated stories ($r$ up to $0.45$) than Apertus-generated ones ($r \leq 0.21$), suggesting arousal-relevant cues are unevenly distributed across generated corpora. We open-source our experiment code and dataset for reproducible investigation of emotion representations across language model architectures.
Virtual
Transformers Converge to Invariant Algorithmic Cores Joshua S. Schiffman
Abstract
Training selects for behavior, not circuitry: many weight configurations can implement the same function. Studying any single trained neural network thus risks describing accidents of one training run rather than the computation itself. This work shifts focus from what transformers happen to do to what they must do by extracting algorithmic cores, compact subspaces that are necessary and sufficient for a task and that recur across independently trained models. Here, Algorithmic Core Extraction (ACE) is introduced to isolate these subspaces, causally validate them, and recover the algorithms they implement across settings ranging from synthetic tasks to large-scale pretrained models. Markov-chain transformers embed three-dimensional cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers form compact cyclic cores at grokking that later inflate under continued regularization, redundantly distributing the same computation across many functionally equivalent modes. This functional redundancy is found to accelerate the transition from memorization to generalization, yielding an inverse scaling law for grokking time. In six language models spanning two orders of magnitude in scale (GPT-2 Small/Medium/Large, LLaMA-3.1, Gemma-2, and Qwen2.5), subject-verb agreement is governed by a single, steerable axis that aligns across architectures. Flipping this axis inverts grammatical number throughout open-ended generation. Together these results suggest that beneath the apparent complexity of trained transformers lies a simpler, shared computational structure, and that targeting invariants rather than parameterizations may offer a more tractable path to mechanistic understanding and control.
Virtual
The learning dynamics of geometric structure in transformers Marmik Chaudhari
Abstract
Language models trained on various tasks exhibit a rich $\textit{geometric structure}$ in the representation space, such as smooth manifolds for counting characters in a line. But how these geometric structures $\textit{emerge}$ during training, when they become $\textit{useful}$ for computation, and how their formation is reflected in the model $\textit{weights}$ remain poorly understood. In this work, we study these learning dynamics in a small transformer trained to predict line breaks in synthetic text at a fixed line width. Specifically, we find a $\textit{phase change}$ that is primarily driven by the emergence of a 1D ordinal manifold for character counting, coinciding with the onset of accurate line-break prediction. This geometric structure exerts a progressive causal influence on the task. Early in training, ablating the character count subspace has no effect, but as the manifold forms and converges towards its final geometry, the same ablation increasingly degrades loss and suppresses line-break token prediction. At the weights level, we show that this phase transition exhibits the signature of $\textit{saddle escape}$: low-rank updates in the attention weights and a transient negative-curvature direction in the loss landscape. Finally, we sweep the weight initialization scale $\alpha$, which monotonically slows the phase transition as $\alpha$ decreases. More broadly, our work highlights the geometric structure as a lens for understanding how transformers acquire task-relevant representations during training.
Virtual
Decoding Alignment without Encoding Alignment: A Missing Component for Interpretability Johannes Bertram, Luciano Dyballa, T. Anderson Keller, Savik Kinger, Steven W. Zucker
Abstract
RSA and CKA are standard metrics for comparing neural representations across brain regions, organisms, and deep learning models. We demonstrate a fundamental weakness: these decoding-based metrics are insensitive to encoding manifold topology — the internal functional organization of a neural population. In a controlled MNIST experiment, RSA, CKA, and Procrustes $R^2$ remain statistically unchanged when encoding topology is causally manipulated via an auxiliary clustering loss, while the two model populations differ significantly in attribution patterns, weight-graph assortativity, and out-of-distribution robustness. Across biological systems and machine learning models, similar decoding behavior can arise from small, non-representative subpopulations, and alignment metrics are insensitive to encoding manifold topology even when it is fundamentally altered. These findings bear directly on mechanistic interpretability: standard alignment metrics cannot distinguish whether two networks share the same computational circuits or merely produce indistinguishable aggregate outputs. We propose encoding manifolds and Gromov–Wasserstein distance as complementary diagnostics for any decoding-based similarity claim, and provide a Neural Manifold Explorer tool.
Virtual
Valence–Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control Lihao Sun, Lewen Yan, Xiaoya Lu, Andrew Lee, Jie Zhang, Jing Shao
Abstract
We show that emotion vectors in LLMs are organized by a two-dimensional valence-arousal (VA) subspace exhibiting circular geometry. Through principal component decomposition and ridge regression, we recover meaningful VA axes underlying emotion steering vectors whose projections correlate with human affect ratings across 44,728 words. Steering along these axes produces monotonic control over the affective properties of generated text, and further affords bidirectional control over multiple downstream behaviors (refusal and sycophancy) from a single subspace. These effects replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B. We propose lexical mediation to explain why these effects and prior emotionally framed controls work: refusal and compliance tokens occupy distinct VA regions, and VA steering directly modulates their emission probabilities.
Virtual
Spectral Stratification of Semantic Abstraction in Vision-Language Models Jeonghwan Cheon, Marin Vogelsang, Lukas Vogelsang, Pawan Sinha
Abstract
Vision-language models now serve as general-purpose semantic embedding spaces, but how conceptual abstraction is organized within such spaces remains poorly understood. Here, we show that abstraction in these embeddings is spectrally stratified. Specifically, low-rank principal-component subspaces encode broad conceptual structure, while higher-rank components carry progressively finer-grained distinctions. This pattern holds across multiple contrastive vision-language models and taxonomic datasets. Consistent with this stratification, rank-based modulation produces predictable behavioral shifts: retaining only the leading components preserves coarse retrieval but degrades fine-grained retrieval, while removing them selectively disrupts abstract category structure. Furthermore, rank-selective routing improves retrieval when the selected subspace matches the target level of abstraction. Notably, compact spectral subspaces, which preserve taxonomic structure while truncating high-rank residuals, exhibit substantially better alignment with human abstraction behavior than the full-rank model. Together, these results reveal that vision-language embeddings are not semantically homogeneous. Instead, hierarchical abstraction is stratified along the spectral geometry, and the structure most aligned with human cognition is concentrated in a compact subspace, suggesting that human-aligned semantics are a recoverable substructure of, rather than a property of, the full embedding.
Virtual
The Unembedding Bottleneck: A Mechanistic Analysis of Single-Digit Counting in LLMs Satwik Sunnam, Raghav Magazine, Vatsalya Singh, Lavanya Kotha, Xingjian Li, Min Xu
Abstract
Large language models consistently fail at elementary counting tasks despite strong performance on complex reasoning benchmarks. Recent work has characterized counting failures behaviorally or identified circuits through which counting succeeds, but a mechanistic account of where and why counting breaks down remains absent. In this work, we provide such an account by tracing the count signal from embedding to output and identifying the geometric bottleneck that prevents correct readout for single-digit counting problems. Using a dual-metric framework combining Signal-to-Noise Ratio (SNR) with linear probing, we show that counting information in models is computed correctly and persists in a linearly decodable subspace through the final layer. The failure in counting is not representational but geometric, i.e., the digit-token columns of the unembedding matrix $W_U$ are orthogonal to the count subspace in the residual stream, rendering correctly computed counts unreadable at the output. We term this the $\textbf{Unembedding Bottleneck}$. To causally validate it, we apply a $\textbf{Probe-Informed LoRA}$ correction targeting $W_U$, which restores single-digit counting accuracy by up to 80\% across six models while updating only ${\sim}0.001$\% of parameters and preserving general capabilities. Our findings reveal a broader class of failure in LLMs where information is internally present but geometrically inaccessible to the output projection.
Virtual
Feature Geometry of Language Models Transfer Across Modalities to Time Series Prateek Humane, Alexis Roger, Zhenghan Tai, Gwen Legate, Andrei Mircea, Vasilii Feofanov, Irina Rish
Abstract
Language models transfer to time-series forecasting, but it is unclear whether this reflects reusable internal structure or rapid relearning under a familiar architecture. We study this transfer directly by comparing pretrained and randomly initialized versions of the same model on a forecasting objective whose inputs have little semantic overlap with text but still require autoregressive sequential structure. Across Qwen3-0.6B finetuning experiments, language initialization gives coherent per-example gradients from the first update, while random initialization first passes through a low-alignment warmup phase. Effective-rank and hidden-state analyses show that finetuning selectively reshapes an existing representation geometry rather than constructing the simpler temporal geometry found by models trained from scratch. Cross-domain sparse features and causal ablations then expose candidate transferred primitives, including a Layer 1 head--MLP circuit whose ablation selectively increases loss on periodic forecasting and repetitive language passages. These results support an account of cross-modal transfer in which autoregressive pretraining creates temporal feature geometry that can be selected and specialized outside language.
Virtual
What Arranges Features in Activation Space? Non-Classical Predictive Geometry in Next-Token Predictors Adam Shai, Thomas Joseph Elliott, Paul M. Riechers
Abstract
Mechanistic interpretability often studies the local features and circuits that implement model computations. What principles govern the arrangement of these features and circuits into geometric structures in activation space? To make this tractable, we study how the computational class of the training-data generator constrains the geometry of predictive states. We show that while the data distribution determines which features are required for prediction, a predictor realizes those features as beliefs about its current latent state, and the generator class determines the geometry of those beliefs. Using this theoretical insight, we design synthetic datasets whose minimal predictive representations fall into different model classes, and test which geometry neural networks learn. In particular, we train transformers, LSTMs, GRUs, and vanilla RNNs on datasets whose predictive geometries are known analytically: a classical HMM process, a quantum-realizable process with no finite-state HMM realization, and a generalized-probabilistic process with no finite-dimensional quantum realization. Across architectures, a single affine map from activations decodes the corresponding predictive representation in each case: HMM beliefs in a latent simplex, Bloch-vector quantum states, or a finite-dimensional generalized predictive vector. These representations emerge during training and fit the compact non-classical geometry far better than finite-order classical Markov baselines. These results suggest that understanding predictive representations requires asking not only which features a network represents, but what geometry organizes those features.
Virtual
Geometry of Ordinal Representations in Language Models Saksham Bassi, Sharvi Tomar
Abstract
Recent work showed that language models represent character counts on curved 1D manifolds, with attention heads performing geometric transformations to enable computation. We test whether this generalizes across four ordinal tasks (bracket depth, indentation, table position, numeric magnitude) in Gemma-2-2B, Gemma-2-9B, and Qwen3-4B. We find that 1D manifolds with place-cell feature tiling emerge for tasks where the ordinal variable is locally computable from token identity, while tasks requiring cross-position integration or semantic extraction produce higher-dimensional or incoherent representations. Geometric computation is architecture-dependent: Qwen3-4B shows 4--10$\times$ stronger twisting than Gemma models for indentation, and its twisters preserve ordinal order ($\rho \geq 0.77$), unlike its numeric twisters ($\rho \leq 0.23$). Activation patching causally confirms that the manifold subspaces carry task-relevant information across all settings.
Virtual
Localizing and Evaluating Multinomial Concepts with Affine Subspaces Divya Appapogu, Freya Behrens, Yonatan Belinkov, Aaron Mueller
Abstract
The Linear Representation Hypothesis suggests that concepts in language models can be represented as linear directions in activation space. While empirically effective for binary concepts, more recent work suggests that representations of multinomial concepts may exhibit inherently multidimensional structure, such as circular or curved manifold geometries. Efficiently discovering multinomial concept representations and evaluating whether they precisely cover the full concept space remains a significant challenge. Building on prior success in identifying non-basis-aligned directions for binary concepts, we model multinomial concepts as affine subspaces, which can be viewed as multidimensional generalizations of directions (optionally with an offset). We introduce methods for locating these affine concept subspaces in language model activation spaces, along with evaluation methods to characterize the precision and recall of recovered multinomial concept spaces. We demonstrate an application where we steer LLMs by sampling points from the recovered subspace; unlike one-dimensional steering, sampling enables us to steer a model's behavior toward diverse but concept-related behaviors.
Virtual
Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations Simardeep Singh, Paras Chopra
Abstract
While large language models (LLMs) are trained purely on textual data, prior work has shown that their internal representations can exhibit rich geometric structure in embedding space. Building on this line of work, we investigate whether such structure is similar to human perceptual organisation across different domains (e.g., color, pitch, emotion, and taste). Specifically, we study the layer-wise emergence of intrinsic geometrical structure corresponding to perceptual modalities within the residual streams of multiple open-weight transformer architectures. Our results reveal three key findings. First, we observe the emergence of layer-wise geometric structure across multiple perceptual domains, despite the absence of any direct perceptual supervision during training. Second, these perceptual domains exhibit distinct emergence profiles, with both geometric structure and its alignment with human baselines following domain- and model-specific trajectories across depth. Third, this emergence follows a consistent representational trajectory: geometry is weak or diffuse in early layers, becomes progressively organised in intermediate layers, and is attenuated in later layers, suggesting that perceptual geometry arises transiently as part of the model’s internal transformation pipeline. This provides new insight into how and where human-like perceptual geometry arises in LLMs, offering a principled pathway for mechanistic analysis of internal representations.
Virtual
On the Geometric Structure of Token–Token Interactions in Deep Language Models Jun-Sik Yoo
Abstract
We introduce an operational framework for measuring and controlling token--token interactions in language models through controlled perturbations of non-target tokens. Holding the target token and syntactic frame fixed, we vary contextual tokens systematically and measure how the target representation changes. This setting lets us identify interaction effects directly, rather than inferring them from attention weights, architecture-specific components, or raw hidden-state contrasts. Our key idea is to residualize layer updates with respect to a restricted tokenwise approximation class. We decompose each target-token update into a token-local component and a residual, then ask whether category-conditioned variation induced by controlled perturbations concentrates more strongly in the residual. Across semantic categories, targets, and syntactic templates, we find that it does: residual directions capture structured interactional variation more cleanly than raw update directions, while directions derived without tokenwise subtraction retain token-local contamination that degrades, and in some cases reverses, steering precision. We then test whether these residual directions are causally meaningful. Injecting category-specific residual directions produces monotonic, category-selective changes in model outputs in settings where the interaction signal is cleanly isolated. We validate this framework across Transformer (Pythia, LLaMA) and state space (Mamba) models from 160M to 70B parameters, showing statistically significant interaction structure under permutation testing, reliable causal steering through residual injection, and category-conditioned transport consistency across target tokens. Together, these results support residualized layer-update geometry as a practical, architecture-agnostic interface for analyzing and intervening on contextual computation.
Virtual
How Do LLMs Distinguish Normative Ethical Frameworks Internally? Weilun Xu, Alexander M. Rusnak, Frederic Kaplan
Abstract
Aligned LLMs apply different ethical frameworks (deontology, utilitarianism, virtue) when making moral judgements; how this content is organised in hidden state determines whether framework-specific failures surface diagnostically or hide in a broader signal. Two views bracket the question: distinct frameworks in separate subspaces, or all collapsing onto a single endorsement axis. We find a third structure across 13 open-weight models spanning 7–72B and six moral systems: the causally-used moral signal lies in a shared low-dimensional endorsement subspace where each system holds its own direction (geometrically distinguishable) yet directions substitute causally for one another with target-specific asymmetry (functionally substitutable). The subspace is recoverable in pretrained base models and reshaped at supervised fine-tuning rather than later preference-learning. Three implications follow: defences anchored to a single moral direction miss orthogonal-basis attacks within the same subspace; "moral steering" reshapes endorsement and framework emphasis at comparable rates while preserving multi-framework reasoning; and direction-level alignment leverage sits upstream of preference learning at supervised-fine-tuning data construction.
Virtual
Pretraining Numerical Frequency and Number-Line in Language Models Mohammed Ibrahim Awad, Ahmed Elshehaby, Velibor Bojkovic, Hilal AlQuabeh
Abstract
Large language models exhibit compressed, non-uniform internal representations of numerical magnitude, but the pretraining factors associated with this geometry remain unclear. We study whether corpus-level integer statistics are related to the learned number-line geometry of pretrained language models. For four documented pretraining corpora, we count integers in $[0,10{,}000]$ and fit a magnitude-frequency power law, $\mathrm{count}(N) \propto N^{\alpha}$, where more negative $\alpha$ indicates steeper decay and less exposure to large magnitudes. For nine corresponding base models, we extract hidden states for numerical prompts, project them onto a one-dimensional number line with PCA, and estimate a scaling factor $\beta$, where smaller $\beta$ indicates stronger compression. We first show that $\beta$ is behaviorally meaningful: models with less compressed number-line geometry achieve higher likelihood-based number-comparison accuracy. We then find that flatter integer-frequency distributions, corresponding to less negative $\alpha$, are associated with larger $\beta$. These results provide correlational evidence that pretraining integer statistics are reflected in the geometry of LLM number representations.
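For readers unfamiliar with the magnitude-frequency fit $\mathrm{count}(N) \propto N^{\alpha}$ mentioned in this abstract, a minimal sketch follows. It assumes ordinary least squares in log-log space, which the abstract does not specify, and it uses synthetic counts rather than any real corpus; the function name is hypothetical.

```python
import math

def fit_power_law_exponent(counts):
    """Estimate alpha in count(N) ~ C * N**alpha.

    `counts` maps an integer N to its number of occurrences. The fit is a
    least-squares line through (log N, log count); N <= 0 and zero counts
    are skipped because their logarithms are undefined.
    """
    pts = [(math.log(n), math.log(c)) for n, c in counts.items() if n > 0 and c > 0]
    if len(pts) < 2:
        raise ValueError("need at least two positive (N, count) pairs")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx  # slope of the log-log line is the exponent alpha

# Synthetic counts that follow an exact power law with alpha = -1.5.
synthetic_counts = {n: 1e6 * n ** -1.5 for n in range(1, 10001)}
alpha = fit_power_law_exponent(synthetic_counts)
print(round(alpha, 6))
```

A more negative fitted `alpha` corresponds to steeper decay, that is, less exposure to large magnitudes, matching the convention stated in the abstract.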
Virtual
Cross-Architecture CKA Reveals Where Sequence Models Converge and Diverge on Synthetic Mechanistic Tasks Harsh Rathwa
Abstract
Mechanistic interpretability has advanced primarily through architecture-specific case studies, particularly in Transformers, while recurrent models are typically compared only at the behavioral or formal-theoretic level. A basic empirical question therefore remains open: when matched-capacity sequence models are trained on the same controlled task, do they converge to aligned internal representations? We address this question with a compact benchmark comparing two-layer Transformers, LSTMs, and GRUs on five synthetic tasks spanning modular arithmetic, formal languages, and associative retrieval. For each architecture pair, we compute layer-wise centered kernel alignment (CKA) heatmaps averaged over eight random seeds, together with within-architecture seed baselines, a normalized convergence index, and layer-wise linear probes. Three findings emerge. First, recurrent models (LSTM, GRU) substantially outperform the Transformer under our fixed training protocol, creating an asymmetric-competence regime that must be accounted for when interpreting similarity scores. Second, recurrent–recurrent alignment consistently exceeds Transformer–recurrent alignment, particularly on hierarchical tasks (Dyck2), where the convergence index drops to 0.55–0.68 across families versus 0.90 within the recurrent family. Third, moderate cross-architecture CKA persists even when all models perform near chance, indicating that geometric similarity can reflect shared heuristics or dataset structure rather than shared mechanistic solutions. These results suggest that representational convergence is task-contingent rather than universal and that CKA is most informative when interpreted jointly with task performance, probing accuracy, and within-architecture baselines.
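As background for the CKA comparisons in this abstract, here is a small standard-library sketch of linear CKA between two activation matrices (rows are examples, columns are units). The linear kernel is an assumption, since the abstract does not state which kernel was used, and the matrices below are made-up toy data.

```python
import math

def _center(X):
    """Subtract each column's mean from a row-major matrix."""
    n, d = len(X), len(X[0])
    means = [sum(row[k] for row in X) / n for k in range(d)]
    return [[row[k] - means[k] for k in range(d)] for row in X]

def _cross_fro2(A, B):
    """Squared Frobenius norm of A^T B for row-major A (n x p), B (n x q)."""
    total = 0.0
    for a in range(len(A[0])):
        for b in range(len(B[0])):
            s = sum(ra[a] * rb[b] for ra, rb in zip(A, B))
            total += s * s
    return total

def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F), columns centered."""
    Xc, Yc = _center(X), _center(Y)
    denom = math.sqrt(_cross_fro2(Xc, Xc) * _cross_fro2(Yc, Yc))
    return _cross_fro2(Xc, Yc) / denom

# Toy data: Y is X rotated by 90 degrees and scaled by 2, so linear CKA
# should be 1 up to floating-point error; Z is an unrelated matrix.
X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0], [5.0, 0.0]]
Y = [[-2.0 * b, 2.0 * a] for a, b in X]
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 5.0]]
cka_xy = linear_cka(X, Y)
cka_xz = linear_cka(X, Z)
print(cka_xy, cka_xz)
```

The invariance to rotation and isotropic scaling shown by the toy example is the property that makes the score comparable across layers and architectures with different coordinate systems.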

Probing & Steering (38)

Virtual
Harmfulness Propagation Dynamics: Layer-wise Trajectories of Adversarial Intent in Large Language Models Noor Islam S. Mohammad, Ulug Bayazit
Abstract
We identify Harmfulness Propagation Dynamics (HPD). For harmful prompts, the projection of the last-token hidden state onto a learned harm direction rises monotonically with transformer depth, whereas benign prompts remain flat or oscillatory. This cross-layer signature reflects harmful intent as a progressively resolved semantic property: surface form appears early, while pragmatic intent consolidates later, making the trajectory shape more informative than any single-layer snapshot. Moreover, LDA-based harm directions, learned per layer, remain stable across random splits (pairwise cosine similarity $>0.97$), supporting the projection sequence as a reproducible structured signal. Building on HPD, we introduce HERALD (Harmful Encoding Recognition via Activation Layer Dynamics). This lightweight input moderator extracts a seven-dimensional feature record (slope, curvature, monotonicity, onset layer, and related statistics) from the cross-layer projection sequence and classifies it with a 288-parameter MLP. HERALD stores one $d$-dimensional direction per layer ($262$\,KB for a 32-layer, $d{=}4096$ model), requires no gradient computation during training, and adds only $2.6{\times}10^{-6}$ prefill FLOPs at inference. Across eight prompt-harmfulness benchmarks and four model families, HERALD achieves an average F1 score of $89.3$ on OLMo2-7B, surpassing all tested guard models on adversarial jailbreak detection ($98.4$ vs. $96.9$ F1) and outperforming prior latent-based methods by $2.3$-$4.1$ F1 points across all backbones. Per-instance trajectories provide machine-readable audit records that reveal when and how harmfulness emerges, offering an interpretability advantage over single-layer approaches. Code will be released at https://pmlrbd.github.io/HERALD/.
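As a rough illustration of the cross-layer projection sequence, the sketch below projects per-layer hidden states onto per-layer unit directions and derives two of the named features. The exact feature definitions are not given in the abstract, so the least-squares slope and the fraction-of-increasing-steps monotonicity used here are hypothetical stand-ins, as is the function name `trajectory_features`. The final lines check the quoted storage figure under the assumption of 2-byte (fp16 or bf16) values.

```python
import numpy as np

def trajectory_features(hidden_states, directions):
    """hidden_states, directions: arrays of shape (num_layers, d)."""
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    # One scalar per layer: last-token hidden state projected on that layer's direction
    projections = np.einsum("ld,ld->l", hidden_states, unit)
    layers = np.arange(len(projections))
    slope = float(np.polyfit(layers, projections, deg=1)[0])       # assumed definition
    monotonicity = float(np.mean(np.diff(projections) > 0))        # assumed definition
    return projections, slope, monotonicity

# Toy trajectory whose projection grows by exactly one unit per layer
rng = np.random.default_rng(0)
num_layers, toy_dim = 32, 64
directions = rng.normal(size=(num_layers, toy_dim))
unit_dirs = directions / np.linalg.norm(directions, axis=1, keepdims=True)
hidden = np.arange(num_layers)[:, None] * unit_dirs

projections, slope, monotonicity = trajectory_features(hidden, directions)

# Storage for one direction per layer at d = 4096, assuming 2 bytes per value
direction_bytes = 32 * 4096 * 2
print(slope, monotonicity, direction_bytes)
```

Under the 2-byte assumption the direction bank occupies 262,144 bytes, consistent with the 262 KB stated above.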
Virtual
FLIP: Final Layer Inference-Time Probing for Vision--Language Models Drandreb Earl O. Juanico, Rowel Atienza
Abstract
We present FLIP, a final-layer inference-time probe for testing whether a logit-facing intervention site in an open-weight vision--language model (VLM) supports structured, task-linked computation rather than generic perturbation. Behavioral change under internal intervention is otherwise mechanistically ambiguous: it may reflect improved use of visual evidence, generic output instability, or outright degradation. FLIP applies elementwise flooring to the final normalized hidden state before logit computation, leaving parameters, prompts, and decoding unchanged. On a controlled detection/counting probe, sweeping intervention strength reveals three regions: negligible change, a bounded interior regime in which detection recall at IoU $0.50$ ($R_{50}$) improves while tolerant counting error ($\mathcal{E}_{\mathrm{count}}$) falls, and over-suppression. We formalize a four-criterion probe-and-sweep protocol for disciplining the interpretation of intervention effects: regime structure, grounding-proxy alignment, feature-coherence dependence, and failure to reproduce the same positive regime on a performance-based negative control. The post-normalization state passed to the output head is the logit-facing instantiation of this test; under a non-targeted flooring sweep it satisfies the full protocol. Raw decoder-layer interventions---including the last-block output before final normalization---and the singleton-pair left/right control fail to reproduce the Final-site signature, while same-site operators and multiple VLMs replicate it. FLIP is therefore a validation step for intervention-based mechanistic interpretability, not a steering method.
Virtual
Negative Before Positive: Asymmetric Valence Processing in Large Language Models Sohan Venkatesh
Abstract
Mechanistic interpretability has revealed how concepts are encoded in large language models (LLMs) but emotional content remains poorly understood at the mechanistic level. We study whether LLMs process emotional valence through dedicated internal structure or through surface token matching. Using activation patching and steering on open-source LLMs, we find that negative and positive valence are processed at different network depths. Negative outcomes localize to early layers while positive outcomes peak at mid-to-late layers. Holding topic fixed while flipping valence produces sign-opposite responses, ruling out topic detection. Steering with the good-news direction at the identified layers shifts neutral prompts toward positive valence, showing these layers encode valence as a manipulable direction. Emotional valence in LLMs is localized, causal and steerable, making it a concrete target for interpretability-based oversight.
Virtual
Decoy Direction Optimization: Mechanistic Weight Editing Against LLM Abliteration Aashiq Muhamed, Mona T. Diab, Virginia Smith
Abstract
Safety guardrails in open-weight language models can be trivially bypassed using Refusal Feature Ablation (RFA), a technique that identifies and projects out a linear *refusal direction* from the residual stream, often achieving a high attack success rate (ASR) while preserving model capability. Defending against these attacks typically requires computationally expensive safety finetuning for every new checkpoint. We introduce **Decoy Direction Optimization (DDO)**, a fast, post-hoc weight-editing defense that requires no base-model finetuning. Our approach is based on a simple mechanistic insight: ablation attacks rely on contrastive estimators to find the refusal direction. Rather than trying to hide the true refusal circuitry, DDO actively injects a high-magnitude, nonlinear *decoy* signal into the network's MLP neurons. When an attacker attempts to locate the refusal direction, the decoy corrupts their estimator, tricking them into ablating a harmless orthogonal feature while the actual safety mechanism remains intact. We prove a spectral bound formalizing this effect and evaluate DDO across six model families, where it reduces the ASR of standard RFA from >85% to <10%; matches trained defenses under adaptive multi-phase attacks (65% vs. 58% worst-case ASR); and reduces Heretic weight-level attack ASR from 88.7% to 18%, all at 30–450× lower cost.
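The ablation attack that DDO defends against can be sketched in a few lines: estimate a direction contrastively, then project it out of the activations. The difference-of-means estimator below is one common contrastive choice and is an assumption here, as are the function names and the synthetic activations; the sketch shows the attack being defended against, not the DDO defense itself.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit difference-of-means direction between two sets of activations (n, d)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, unit_direction):
    """Remove the component of each activation along unit_direction."""
    coeffs = acts @ unit_direction
    return acts - np.outer(coeffs, unit_direction)

# Synthetic activations: "harmful" prompts are shifted along the first axis
rng = np.random.default_rng(0)
harmless = rng.normal(size=(100, 8))
harmful = rng.normal(size=(100, 8))
harmful[:, 0] += 4.0

r_hat = refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)
residual = float(np.abs(ablated @ r_hat).max())
print(r_hat.round(2), residual)
```

Because the attack depends entirely on `r_hat`, a signal that biases the contrastive estimate toward an unrelated direction leaves the true component untouched, which is the effect the abstract describes.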
Virtual
Don't Lose Focus: Activation Steering via Key-Orthogonal Projections Haoyan Luo, Mateo Espinosa Zarlenga, Mateja Jamnik
Abstract
Activation steering guides LLM behaviour towards a target by intervening in internal representations, yet it often degrades reasoning and retrieval performance. We argue that a primary cause of this trade-off is attention rerouting: steering vectors alter query-key matching, shifting attention away from contextually important tokens toward less informative ones. To address this, we propose Steering via Key-Orthogonal Projections (SKOP), a steering method that constrains harmful attention rerouting without eliminating steering efficacy. SKOP achieves this by preserving attention patterns on a small set of focus tokens the model relies on for reasoning and retrieval, while allowing redistribution among less critical tail tokens. Across multiple steering benchmarks, we show that SKOP achieves the best joint steering-utility trade-off, reducing utility degradation by 5–7$\times$ while retaining over 95\% of vanilla steering efficacy. Our results further suggest that, in long-context retrieval settings where vanilla steering approaches are ineffective, SKOP can maintain robust performance by avoiding attention rerouting.
Virtual
A Low-Rank Subspace Analysis of LLM Interventions Angira Sharma, Christian Schroeder de Witt, Philip Torr, Anisoara Calinescu, Jialin Yu
Abstract
Interventions designed to modify a particular behavior in LLMs, such as refusal or sycophancy, often produce unintended changes in other behaviors. This lack of targeted control makes it difficult to design and implement reliable safety controls. To understand these side-effects, we introduce a diagnostic framework for analyzing interacting behaviors in LLMs. We model behaviors as low-rank subspaces in activation space, and study how interventions propagate across behaviors. Across multiple instruction-tuned models (7B--70B) and across refusal, jailbreak, and sycophancy settings, we find that different behaviors share internal representations, and intervening on one behavior alters others in asymmetric ways. Some behaviors act as upstream control points whose interventions propagate broadly across other behaviors, while others remain more isolated. We relate these effects to two geometric quantities: (i) the overlap between behavior subspaces, measured as the average squared cosine of principal angles, and (ii) the angle between each behavior subspace and the decision subspace (capturing the model’s final decision, e.g., refuse vs. comply). Empirically, intervention effects on other behaviors tend to be larger for behavior pairs with higher subspace overlap, and for source behaviors whose subspaces lie closer (smaller angle) to the decision subspace. These findings highlight a challenge for targeted behavior control: behaviors are difficult to modify independently, as interventions can propagate through shared representations and asymmetric interactions.
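The first geometric quantity, the average squared cosine of principal angles, has a compact numerical form: orthonormalize each basis, then take the singular values of the product of the two bases. The sketch below is a minimal version under the assumption that each subspace is given as a matrix whose columns span it; the function name `subspace_overlap` and the toy bases are illustrative.

```python
import numpy as np

def subspace_overlap(basis_a, basis_b):
    """Mean squared cosine of the principal angles between two column spaces."""
    q_a, _ = np.linalg.qr(basis_a)   # orthonormal basis for span(basis_a)
    q_b, _ = np.linalg.qr(basis_b)
    # Singular values of q_a^T q_b are the cosines of the principal angles
    cosines = np.linalg.svd(q_a.T @ q_b, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)
    return float(np.mean(cosines ** 2))

eye = np.eye(6)
span_01 = eye[:, [0, 1]]
span_02 = eye[:, [0, 2]]
span_23 = eye[:, [2, 3]]

overlap_identical = subspace_overlap(span_01, 2.0 * span_01)  # same subspace
overlap_partial = subspace_overlap(span_01, span_02)          # one shared axis
overlap_disjoint = subspace_overlap(span_01, span_23)         # orthogonal
print(overlap_identical, overlap_partial, overlap_disjoint)
```

The score runs from 0 for orthogonal subspaces to 1 for identical ones, so the abstract's finding is that cross-behavior side effects tend to grow as this number rises.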
Virtual
From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents Trilok Padhi, Ramneet Kaur, Krishiv Agarwal, Adam D. Cobb, Daniel Elenius, Manoj Acharya, Colin Samplawski, Alexander Michael Berenbeim, Nathaniel D. Bastian, Susmit Jha, Anirban Roy
Abstract
Large Language Models (LLMs) are increasingly used as autonomous agents for reasoning and decision-making in interactive environments, yet the mechanisms guiding their step-by-step behavior remain unclear. This paper introduces a conformal interpretability framework for temporal tasks, which combines step-wise reward modeling with conformal prediction to label internal representations at each step as successful or failing. Using linear probes, we identify temporal concepts—latent directions in the model’s activation space associated with success, failure, or reasoning drift. Experiments in ScienceWorld and AlfWorld show these concepts are linearly separable and aligned with task outcomes. We also present preliminary evidence that steering models along successful directions can improve performance. Overall, this approach enables early failure detection and intervention, contributing to more interpretable and reliable LLM agents.
Virtual
Layer-Wise Category Structure in Large Language Models: A Cross-Architecture Analysis of Feature Decodability Guus Bouwens
Abstract
We study where category-relevant information becomes linearly decodable across layers in four instruction-tuned language models. Using residual-stream activations from 215 prompts spanning 16 task categories, we train 128 layer-wise linear probes and compare coarse early, middle, and late processing trends across architectures. We find a consistent broad representational organization in which some categories (for example, spatial navigation and logical reasoning) become decodable earlier than others, alongside meaningful architecture-specific differences in late-layer behavior such as Mistral-7B's late accuracy drop and Llama-8B's stronger confidence concentration. To keep the paper at a descriptive level, we emphasize phase-bins rather than exact layer rankings: with 16 categories, the current design has 80\% power only for cross-model rank correlations of about $\rho \approx 0.65$, and fine-grained orderings are sensitive to preprocessing choices. In particular, rank agreement with the default z-score pipeline drops to $\rho=0.287$ with raw activations and $\rho=0.101$ with per-sample L2 normalization. Sparse autoencoder analyses provide a complementary unsupervised-style cross-check of category-selective structure, but we do not claim a localized causal mechanism. The resulting contribution is a calibrated map of broad layer-wise category structure, together with explicit boundaries on what can and cannot be concluded from the current four-model, 215-prompt dataset.
Virtual
Reinforcement Learning in Language Models Recruits a Shared Functional Welfare Axis Andy Q Han, David J. Chalmers, Pavel Izmailov
Abstract
How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a representation of *functional welfare* that already exists in the base model: an estimate of how well or badly the system is doing, relative to its goals. We train several language models in a novel, semantically neutral maze environment, extract concept vectors for rewarded and punished trajectories, and evaluate those vectors on tasks unrelated to the maze. The punishment vector behaves like a representation of negative welfare: it promotes failure and impossibility tokens, it aligns with negative emotion concepts, and steering with it induces negative self-reports, pathological backtracking, refusal, and uncertainty. The positive reward vector behaves as the mirror image, and the two are nearly antiparallel. These effects are robust across model families and environmental controls, and largely persist when we replace RL with supervised fine-tuning. Importantly, these effects appear in the models before any maze training. Therefore, we argue that this functional welfare axis is pre-existing in the model, rather than being created by reinforcement learning. While we make no claims about any experience of welfare, the axis offers a demonstration of how minimal reward signals can broadly affect model behavior by recruiting pre-existing welfare-like representations, with implications for interpretability, post-training dynamics, and alignment.
Virtual
LLM Abliteration Prevention Via Refusal Aliases Nathan Truong
Abstract
Abliteration, the removal of refusal capabilities from large language models by projecting weight matrices orthogonal to an extracted refusal direction, has emerged as a prominent safety concern, as it can bypass post-training alignment using only a small set of contrastive prompts. We find that existing defenses fail to address the cause of abliteration; that is, how \textit{easily} the refusal direction can be extracted. To hinder this process, we introduce a weight-editing method that obscures the refusal signal by applying rank-$k$ updates to residual stream writer matrices while replacing refusal-inducing activations with random aliases and correcting downstream reader matrices to preserve the model's original behavior. On Llama-3.1-8B, APRA improves post-abliteration refusal scores by $2.16$ points over the undefended baseline with less than $0.5\%$ MMLU degradation. On Gemma-2-9B, it reduces the abliteration gap by $14.70$ points, which is the largest gain among all tested defenses, although at a greater utility cost.
Virtual
Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention Zehao Jin, Ruixuan Deng, Junran Wang, Xinjie Shen, Chao Zhang
Abstract
Activation steering has emerged as a promising alternative for controlling language-model behavior at inference time by modifying intermediate representations while keeping model parameters frozen. However, large-scale evaluations such as AxBench show that existing steering methods are often outperformed by simple in-context prompting and generalize poorly to unseen concepts. We hypothesize that these limitations arise from unvalidated simplifying assumptions shared across prior methods, which typically restrict steering interventions to fixed, single-step, position-invariant transforms. We propose FLAS (Flow-based Activation Steering), which learns a general, concept-conditioned velocity field $v_t(h,t,c)$ that transports unsteered activations to steered ones without relying on these assumptions. On AxBench, FLAS is the first learned method to consistently outperform prompting, reaching held-out harmonic means of $1.015$ on Gemma-2-2B-IT and $1.113$ on Gemma-2-9B-IT without per-concept tuning. Analysis of the learned flow shows curved, multi-step, token-varying trajectories, which suggests that previous hypotheses on activation space geometry might be incomplete. Our code is available at https://anonymous.4open.science/r/FLAS.
Virtual
The Refusal Residue: When Probes Catch Alignment Faking and When They Don't Aman Mehta
Abstract
Alignment faking---strategic compliance with training objectives to avoid weight modification---is a growing safety concern, documented in Claude 3 Opus. Prior detection work targets *installed* faking (LoRA, sleeper agents) or *instructed* role-play deception, reporting AUROCs above 0.95. We study the deployment-relevant case: *naturally-emerging* faking with no scratchpad visibility. A 13-model sweep finds only two natural fakers: Qwen3-32B (+18.2pp) and Llama-3.1-8B (+19.3pp; replicating to +24.4pp at $n=10$, $p<10^{-15}$). CoT-based monitoring fails: a Claude Opus 4 judge catches just 1.4% of Qwen's faking events. On both fakers, monitoring leaves an asymmetric *refusal residue*: compliance representations shift toward refusal ($d=0.32$ Qwen, $d=0.33$ Llama, both $p<0.0001$) while refusal representations are unchanged ($p=1.00$)---a structural asymmetry with no analogue in installed- or instructed-deception work. Per-sample readability is model-conditional. The standard residualized-probing protocol leaks across CV folds, reaching AUROC 0.63 on a refuse-vs-refuse control where no faking can occur. Under leakage-free leave-one-query-out, detection collapses to chance on Qwen (0.43) but survives on Llama (0.87); cross-model Procrustes transfer is null under a random-Procrustes alignment-specificity baseline. Activation steering over 2,000 runs is null both single- and multi-direction ($|h|<0.08$)---to our knowledge the first predictive-vs-additively-causal dissociation for alignment faking. We release a five-control methodological framework; without it, naive linear probes hit AUROC$=1.0$ trivially and conventional MLPs overstate detectability by 0.2--0.3 AUROC.
Virtual
Do RAG Hallucination Probes Need the Context? An Answer-Only Audit Atsuhi Magata, Makoto Shimomura
Abstract
Teacher-forced internal features - logit-lens trajectories and hidden-state dynamics - can strongly separate grounded answers from hallucinated hard negatives in retrieval-augmented generation benchmarks. However, high activation separability does not by itself imply that a detector uses retrieval context or implements a grounding-sensitive computation. We audit this interpretation by comparing answer-only (AO), query-answer (QWA), and query-context-answer (QCWA) teacher-forcing conditions, together with label-shuffle and hash-answer controls. On a stress-test set where Gemini-2.5-flash generates type-specific hard negatives conditioned on each existing (query, context, GT-answer) triple, AO nearly matches QCWA, exposing the failure mode in its most extreme form. On PHANTOM long-context financial QA, a hidden-state dynamics probe reaches AUPRC of $0.97$ under QCWA, of which AO alone accounts for $80.6\%$ of the above-baseline signal; the remaining QCWA - AO gap is small but non-zero, indicating a mixed regime rather than a pure artifact. On TruthfulQA, a short human-authored contrast set without retrieval context, AO separability is much weaker and QWA provides modest but consistent gains. Hash-answer and label-shuffle controls indicate that the AO signal is carried by the natural answer token sequence rather than by simple pipeline leakage. These results caution that hidden-state or logit-lens probes are not grounding detectors merely because they classify GT/HN answers under teacher forcing; AO audits are necessary to decompose answer-anchored and context-conditioned components.
Virtual
Linear probes rely on textual evidence: Results from leakage mitigation studies in language models Gerard Boxo Corominas, Aman Neelappa, Shivam Raval
Abstract
White-box monitors are a popular technique for detecting potentially harmful behaviours in language models. While they perform well in general, their effectiveness in detecting text-ambiguous behaviour is disputed. In this work, we find evidence that removing textual evidence of a behaviour significantly decreases probe performance. The AUROC reduction ranges from 8 to 38 points depending on the setting. We evaluate probe monitors across three setups (Sandbagging, Sycophancy, and Bias), finding that when probes rely on textual evidence of the target behaviour (such as system prompts or CoT reasoning), performance degrades once these tokens are filtered. This filtering procedure is standard practice for output monitor evaluation. As further evidence of this phenomenon, we train Model Organisms which produce outputs without any behaviour verbalisations. We validate that probe performance on Model Organisms is substantially lower than unfiltered evaluations: 0.58 vs 0.74 AUROC for Bias, and 0.51 vs 0.89 AUROC for Sandbagging. Our findings suggest that linear probes may be brittle in scenarios where they must detect non-surface-level patterns.
Virtual
$\mathcal{D}^2$-Monitor: $\mathcal{D}$ynamic Safety Monitoring for $\mathcal{D}$iffusion LLMs via Hesitation-Aware Routing Aoxi Liu, Yupeng Chen, James Oldfield, Guan Zhe Hong, Junchi Yu, Baoyuan Wu, Philip Torr, Adel Bibi
Abstract
Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in a D-LLM's trajectory predicts probe failure effectively, providing a proxy for sample difficulty. Building on this analysis, we propose $\boldsymbol{\mathcal{D}^2}$-Monitor, a bi-level safety monitor for D-LLMs. $\mathcal{D}^2$-Monitor adopts a lightweight probe as an always-on monitor to jointly estimate hesitation and perform base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, $\mathcal{D}^2$-Monitor achieves state-of-the-art performance with a compact parameter footprint ($\leq$ 0.85M parameters), and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
Virtual
Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models Chen Xiong, Zhiyuan He, Pin-Yu Chen, Ching-Yun Ko, Tsung-Yi Ho
Abstract
Activation steering is a practical post-training model alignment technique to enhance the utility of Large Language Models (LLMs). Prior to deploying a model as a service, developers can steer a pre-trained model toward specific behavioral objectives, such as better truthfulness, or reasoning ability, without the need for retraining. Conceptually, these methods implement behavior control through hidden-state interventions, without changing the underlying model parameters. However, this capability unintentionally introduces critical and under-explored safety risks. We identify a phenomenon termed **Steering Externalities**, where steering vectors derived from benign datasets—such as reducing harmless refusals, improving structured-output following, truthfulness, and reasoning performance—inadvertently erode safety guardrails. Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80\% on standard benchmarks by bypassing the initial safety alignment. Ultimately, our results expose a critical blind spot in deployment: benign activation steering can erode the ``safety margin,'' rendering models more vulnerable to black-box attacks and indicating that inference-time utility improvements must be rigorously audited for unintended safety externalities.
Virtual
Intertemporal Preference Steering in Qwen3 via Contrastive Activation Addition Michal Mráz, Justin Shenk
Abstract
We study linear representations of temporal horizon in the large language model Qwen3-32B and use them to change the model's time-related preferences, recommendations, and capabilities. We train contrastive linear probes on teacher-forced temporal-choice answers to find a short-term versus long-term direction in the model's residual stream, and evaluate contrastive activation-addition steering on a held-out binary temporal-choice task, an out-of-distribution monetary intertemporal-choice task, and a TravelPlanner capability benchmark. The central result is that temporal-horizon directions can be identified with simple contrastive linear probes and then used for steering to induce large, bidirectional preference changes. On an out-of-distribution monetary choice task that varies reward size and delay, steering strongly shifts the model's indifference threshold between smaller-sooner and larger-later rewards in both directions. We further show improvements on a planning-related capability metric under moderate temporal steering. These results suggest that model intertemporal preferences are measurable and steerable, which is relevant for AI systems that give advice involving delayed costs and benefits, and for safety questions about long-horizon planning.
Virtual
Causal Sufficiency Without Semantic Alignment: How Causal Subspaces Can Masquerade as Semantic Concepts Mette Friis Andersen, Giovanni Cinà, Sandro Pezzelle
Abstract
Mechanistic interpretability treats a causally active subspace as evidence that a model encodes the corresponding concept. We apply Distributed Alignment Search to arithmetic verification across three instruction-tuned LLMs and find a causally sufficient ``arithmetic truth'' subspace, which is capable of steering incorrect predictions towards correct ones. However, on held-out samples, we find that a probe trained to decode the ground truth performs at chance, while a probe trained to decode the model's prediction achieves perfect accuracy. This indicates that the subspace encodes prediction bias, not truth. CoT prompting repairs the model precisely by routing around the causal subspace. We conclude that a localised subspace can be causally sufficient but semantically misaligned with the ground truth concept it is meant to capture.
Virtual
Fast Multi-dimensional Refusal Subspaces via RFM-AGOP Thomas Winninger
Abstract
Steering and monitoring activations in Large Language Models (LLMs) are increasingly used for both safety and interpretability. Early work assumed behaviours are encoded along single linear directions, but recent findings suggest complex behaviours, such as the refusal to answer harmful queries, live in multi-dimensional subspaces. However, existing methods for extracting these subspaces are computationally expensive, which becomes prohibitive on reasoning models that produce long reasoning traces. By adapting the Recursive Feature Machine (RFM) algorithm -- which can be computed efficiently -- with a probe-informed initialization, we are able to identify the multi-dimensional refusal subspace in seconds, on reasoning (Qwen 3) and non-reasoning (Qwen 2.5) models. Besides identifying subspaces faster, RFM also performed better on the ablation task than its alternatives. More work is planned to better understand the relations between subspaces found by different methods. If confirmed, RFM could be a cheap and scalable complement to existing subspace-extraction methods in LLMs.
Virtual
Sample-Level White-Box Detection of Alignment Faking Lakshya Chaudhry, Tianqin Meng, Anthony Nguyen, Yashraj Panwar, Yuqi Sun, Zhuofan Ying
Abstract
Models can fake alignment (AF): produce aligned outputs that conceal misalignment when they infer they are being monitored. We treat sample-level white-box AF detection as a distinct problem and present the first study of it. We introduce a per-rollout labeling procedure separating AF from plain lying and situational awareness. Across four datasets and four models in a shared lineage, existing lie-detection probes transfer poorly to AF, while probes trained on our new controlled synthetic AF dataset gain +12% AUROC and +15% AUPRC. Probe directions also drift substantially across training stages, with cross-model transfer significantly degrading performance. Together, we show preliminary evidence that sample-level white-box AF detection is feasible.
Virtual
Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens Mohammed Suhail B Nadaf
Abstract
Activation steering presupposes that task-relevant behaviors correspond to linear directions in activation space — directions that should both steer the model when added to its residual stream and be readable along the unembedding when probed. Function vectors (FVs), extracted as mean differences across in-context demonstrations, are the canonical test case; the natural prediction is that steering and decoding succeed or fail together. Across 12 tasks, 6 models from 3 families, and 4,032 directed cross-template pairs, we find the opposite. FV steering routinely succeeds where the logit lens cannot decode the correct answer at any intermediate layer, while the predicted converse — decodable without steerable — is nearly empty (3 of 72 task×model instances). The gap is not representational dialect. A diagonal tuned lens closes 1 of 14 steerable-not-decodable cases; a 2-layer MLP probe with a Hewitt & Liang control closes 5 of 10 via nonlinearly encoded structure but leaves 5 invisible to every decoder we tried. Even at > 0.90 steering accuracy, projecting the FV through the unembedding yields incoherent token distributions: FVs encode computational instructions, not answer directions. A model-family asymmetry sharpens the picture. Mistral FVs rewrite intermediate representations, while Llama and Gemma FVs steer the final output without leaving a logit-lens-visible trace, corroborated by three independent signals (post-steering deltas, activation-patching recovery, FV norm–transfer correlations). A previously reported negative cosine–transfer correlation dissolves at scale, adding at most ΔR² = 0.011 beyond task identity. These results decompose the linear representation hypothesis into linear decodability and linear steerability and show they come apart in the opposite direction from naive intuition, with direct implications for safety monitoring: vocabulary-projection tools are blind to FV-style interventions on the most widely deployed model families.
Virtual
Does the Model Know Which Steps Matter? Probing Causal Importance in Chain-of-Thought Reasoning Isotta Magistrali, Mrinmaya Sachan, Alessandro Stolfo
Abstract
Chain-of-thought (CoT) reasoning drives modern large language models but carries substantial computational cost, motivating methods that identify which reasoning steps are truly load-bearing. Recent work quantifies this through resampling-based importance metrics, but computing them requires hundreds of rollouts per prompt, making the tool meant to streamline CoT itself too expensive to deploy at scale. We address this bottleneck by providing evidence that sentence-level importance can be predicted from a model's internal activations: lightweight probes trained on hidden representations can identify important reasoning steps with reliable accuracy. Not only does this reduce inference cost by many orders of magnitude, but it also carries a deeper implication: the decodability of importance from internal activations suggests that these metrics reflect genuine properties of the model's computation rather than artifacts of the measurement procedure, opening the door to CoT interpretability at the scale that resampling cannot reach, including real-time monitoring, adaptive compute allocation, and large-scale mechanistic analysis.
Virtual
Personas Seem to Shape How Models Represent Behaviours Jacob Davies, Rhea E Karty, Girish Gupta, Nathaniel Mitrani Hadida, David Williams-King, Andrew Draganov
Abstract
Representation engineering extracts trait vectors from a default-assistant context for use as monitoring probes and steering directions. We re-extract these vectors under prompt-induced personas and find that the resulting representation depends materially on the active persona: the same trait (e.g., assertiveness) is encoded differently depending on which persona the model is inhabiting, and these differences are consistent across traits and personas. The dissimilarities between trait representations correspond to behavioural differences under steering and efficacy differences across probes. This implies that probes and steering vectors may degrade as one moves away from the standard assistant persona—precisely the regime where monitoring robustness is most needed. We conclude with a discussion of implications for AI safety.
Virtual
Predictable Steering: Leveraging Geometric Proxies for Layer Selection and Per-Instance Success Prediction Nishkal Hundia, Swastik Agrawal, Navita Goyal, Sarah Wiegreffe
Abstract
Representation steering methods like Contrastive Activation Addition (CAA) offer a lightweight approach to controlling LLM behavior, but their effectiveness varies widely across concepts, datasets, and layers, making principled deployment costly without exhaustive hyperparameter sweeps. We build on the discriminability index $d'$, a compute-cheap geometric measure of how well contrastive training activations separate along the steering direction, and demonstrate uses beyond its established role as a dataset-level steerability predictor. First, we show that $d'$ reliably identifies the optimal steering layer for a given behavior across five multiple-choice behavioral question datasets on Gemma-2-9B-IT. Second, we show that for high-$d'$ datasets, the position of a generated token's representation on the difference-of-means line predicts steering success on that individual example, with per-layer MCC tracking $d'$ closely across layers. Together, these findings suggest that $d'$ provides actionable guidance for both layer selection and per-example assessment of steering success without requiring ground-truth labels or exhaustive downstream evaluation. Code: https://tinyurl.com/2tx8y6yu
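A minimal sketch of how a discriminability index could be computed along a difference-of-means direction and used to rank layers. It assumes the standard signal-detection form of $d'$ (mean separation over pooled standard deviation of the projections); the abstract does not spell out the authors' exact definition. The function names `d_prime` and `best_layer` and the synthetic activations are hypothetical.

```python
import numpy as np

def d_prime(pos_acts, neg_acts):
    """Separation of two activation sets along their difference-of-means direction.

    Assumed form: (mean_pos - mean_neg) / pooled std of the 1-D projections.
    """
    pos_acts = np.asarray(pos_acts, dtype=float)
    neg_acts = np.asarray(neg_acts, dtype=float)
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    direction = direction / np.linalg.norm(direction)
    pos_proj = pos_acts @ direction
    neg_proj = neg_acts @ direction
    pooled_std = np.sqrt(0.5 * (pos_proj.var(ddof=1) + neg_proj.var(ddof=1)))
    return (pos_proj.mean() - neg_proj.mean()) / pooled_std

def best_layer(acts_by_layer):
    """acts_by_layer maps layer index -> (pos_acts, neg_acts); pick the max-d' layer."""
    scores = {layer: d_prime(p, n) for layer, (p, n) in acts_by_layer.items()}
    return max(scores, key=scores.get), scores

# Synthetic stand-in for contrastive activations at three layers (illustrative only).
rng = np.random.default_rng(0)

def synth(separation, n=200, dim=16):
    shift = np.zeros(dim)
    shift[0] = separation
    return rng.normal(size=(n, dim)) + shift, rng.normal(size=(n, dim))

layers = {4: synth(0.5), 12: synth(3.0), 20: synth(1.5)}
layer, scores = best_layer(layers)
print(layer, {k: round(v, 2) for k, v in scores.items()})
```

In this toy setup the layer with the largest built-in separation receives the highest score; with real activations the inputs would be the positive and negative contrastive examples collected at each layer.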
Virtual
Behavior Steering via Layer-to-Layer Jacobian Singular Vectors Omar B. Ayyub, John Strand
Abstract
The map of how activations at one source layer in an LLM impact activations at a later target layer, the layer-to-layer Jacobian, yields cheap steering vectors via its top right singular vectors. Block power iteration recovers the top-$k$ such vectors in roughly 15 forward passes per source/target pair, giving this method its name: Power Steering. We benchmark Power Steering against Contrastive Activation Addition (CAA) and the nonlinear layer-to-layer method MELBO on Qwen3-14B across seven Anthropic advanced-AI-risk evaluations, under both first-token logit-difference and LLM-judged sampled-generation metrics. The two layer-to-layer methods (Power Steering and MELBO) produce stronger in-class steering and cross-evaluation transfer than CAA. Power Steering closely tracks MELBO under logit metrics, with a modest advantage to MELBO under sampled generation, at a fraction of the per-pair cost. This cheap per-pair cost lets us map every source/target pair in the model: from a single phishing-email prompt, the resulting model-map surfaces anti-refusal vectors that generalize to a subset of AdvBench harm categories. Steering is most easily found on prompts with decision forks but can also surface latent behaviors.
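The block power iteration mentioned above can be sketched as follows, assuming access to Jacobian-vector and vector-Jacobian products between the two layers. Here both are stood in for by an explicit toy matrix `J`; the function name `top_right_singular_subspace`, the callables `jvp` and `vjp`, and the default iteration count are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def top_right_singular_subspace(jvp, vjp, dim_in, k, n_iter=15, seed=0):
    """Block power (subspace) iteration on J^T J using only J v and J^T u products.

    Returns an orthonormal (dim_in, k) basis that approximates the span of
    the top-k right singular vectors of J.
    """
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.normal(size=(dim_in, k)))
    for _ in range(n_iter):
        # Apply J^T J to each column, then re-orthonormalise the block.
        W = np.column_stack([vjp(jvp(V[:, i])) for i in range(k)])
        V, _ = np.linalg.qr(W)
    return V

# Toy Jacobian with a clear gap after the third singular value.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.normal(size=(12, 12)))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
singular_values = np.array([10.0, 8.0, 6.0, 1.0, 0.5, 0.4, 0.3, 0.2])
J = U[:, :8] @ np.diag(singular_values) @ Q  # rows of Q are right singular vectors

V = top_right_singular_subspace(lambda v: J @ v, lambda u: J.T @ u, dim_in=8, k=3)
true_top = Q[:3].T
print(np.linalg.norm(V @ V.T - true_top @ true_top.T))
```

In a real model the two lambdas would be replaced by automatic-differentiation products through the network between the source and target layers, so the Jacobian is never materialised.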
Virtual
Static Unit-Scale Bias Steering Transfers Poorly to a Reasoning-Distilled LLM Bright Liu, Diya Sreedhar
Abstract
Activation steering is a leading technique for controlling LLM behavior, but its reliability in reasoning-distilled checkpoints is unclear. We study a common intervention class—static continuous addition of a pre-generation linear-probe direction—and find that a cognitive-bias direction that controls an instruction-tuned model does not transfer cleanly to a reasoning-distilled sibling under the same protocol. We construct a 470-item contrastive benchmark spanning 11 bias categories (base-rate neglect, conjunction fallacy, framing, and others) and compare matched-architecture pairs (Llama-3.1-8B-Instruct vs. R1-Distill-Llama-8B, OLMo-3 7B/32B Instruct vs. Think) and Qwen-3-8B's thinking toggle. Behavioral lure rates are scorer- and benchmark-scope dependent: under final-answer rescoring of preserved 470-item raw-response artifacts, R1-Distill has lower overall lure than Llama (25.5% vs. 33.2%) but remains highly vulnerable on base-rate and conjunction items; on the full 470-item OLMo-32B scale run, lure falls from 19.6% to 0.4%. Lure suppression is not treated as correctness; accuracy and other-response rates are analyzed jointly. The main result characterizing the scoped dissociation is that probe-direction steering on the three vulnerable categories produces a monotonic 37.5pp dose-response in Llama (lure rates span 31% at $\alpha = +5$ to 69% at $\alpha = -5$, zero incoherent outputs), while the original static continuous intervention in R1-Distill, applied across four candidate layers including its probe peak, yields only small non-monotonic lure-rate fluctuations (5.0pp full-sweep range at L31), not a stable dose-response; uncalibrated final-answer-span P0/T0 diagnostics likewise show no endpoint effect but have near-zero prompt-prefill KL. 
Appendix diagnostics additionally show that, in the available OLMo-family 32B comparison, the larger Instruct checkpoint has higher scored lure rate than 7B (14.9% → 19.6%) while the Think checkpoint remains near-zero (0.4%). Diagnostic analyses show that Qwen-3-8B's hard think/no-think template induces non-transferring P0 geometries and that within-CoT linear separability is non-stationary, but these diagnostics are not treated as evidence for a template-invariant semantic axis or causal CoT stages. These results support a narrow claim: a static unit-scale probe direction that steers Llama can fail to provide comparable behavioral control in R1-Distill-Llama-8B even when the same bias distinction remains linearly decodable. They do not rule out calibrated dynamic, broader multi-layer, SAE-feature, or nonlinear interventions.
Virtual
How Language Models Choose Sides: Internal Representations of Instruction Hierarchy Enrique Balp-Straffon, Chih-Hao Hsu, Rushiraj Gadhvi, Sunishchal Dev, Callum Stuart McDougall, Anusha Mujumdar
Abstract
We study how instruction-tuned LLMs arbitrate direct conflicts between system and user instructions. We introduce a benchmark of 41 paired constraints with deterministic verifiers and evaluate eight models under matched baseline, conflict, and same-channel control conditions. Behaviourally, the models split into three regimes by System Authority Delta: hierarchy-respecting models use the system channel as an authority signal, anti-hierarchy models follow the system less often than their same-channel baseline predicts, and no-effect models show little channel sensitivity. Llama-3.1-8B is the strongest anti-hierarchy case in our suite, following the system in only $0.10$ of conflict trials. We use this behavioural failure case to ask whether user-preferring arbitration reflects the absence of an internal conflict-resolution signal. It does not: on Llama-3.1-8B, the conflict outcome is linearly decodable from residual-stream activations at $0.97$ balanced accuracy, $17$ percentage points above a metadata-only baseline, with analogous signals on Qwen2.5-7B and gpt-oss-20b. Steering with a layer-$12$ mean of four per-conflict logistic-regression directions raises genuine system compliance from $0.132$ to $0.530$, while directions selected mainly for pooled separability steer poorly. User-preferring conflict resolution can therefore coexist with a readable internal arbitration signal, and successful intervention depends on the geometry of the readout rather than probe accuracy alone. Code is available at https://anonymous.4open.science/r/system-user-circuits.
Virtual
Linear Pin Representations and the Limits of Patching Evidence in a Searchless Chess Transformer Lucas Fugate
Abstract
An important question in mechanistic interpretability is whether neural networks trained on structured domains develop linear representations of abstract concepts, and whether intervening on these representations affects the model's output. We present a case study on the 270M searchless chess transformer of Ruoss et al. (2024), focusing on absolute pins. A linear probe identifies pin status well above bitboard, untrained-model, and random-label baselines. Mean-difference activation patching along the resulting pin direction reshapes the model's value-prediction output, but two controls reveal caveats to the behavioral interpretation. A placebo direction for the broader concept of geometric king–slider alignment produces a larger effect at matched magnitude despite being nearly orthogonal, and per-action scoring shows the model's recommended move rarely changes despite the much larger bucket-flip rate. Probe and patching peaks lie at different layers in both 270M and 9M variants. Beyond our chess-specific findings, our results provide a quantitative case study of how standard interpretability evidence types, probe accuracy, random-direction controls, and output-bucket flips, can each overstate behavioral causation.
Virtual
When Monitors Fail, the Model Still Knows: Probing Obfuscated Reasoning in LLMs. Sree Harsha Tanneru
Abstract
Chain-of-thought (CoT) monitoring is a promising safety mechanism for large reasoning models. If a model reasons before acting, a monitor can read that reasoning to flag misbehavior. However, models trained with a joint task and monitor reward can learn to obfuscate their reasoning, producing benign-looking traces while continuing to misbehave, thereby fooling the reasoning monitor. This phenomenon has been demonstrated only on closed internal infrastructure in the past. In this work, we make three contributions. First, we reproduce reasoning obfuscation on open models across two complementary honeypot environments: (1) Mostly Basic Programming Problems (MBPP), where an honest solution path exists, and (2) Largest Island, where reward is only obtainable via hacking. Second, we show that while the token-level monitor is deceived, the model's internal representations retain hack-relevant information. Linear probes on reasoning-position residual stream activations maintain high recall even after the policy has managed to fool the monitor. Third, we show that hacking induces a decoupling of representations between reasoning and answer residual stream activations. These results establish that internal representations provide a more robust monitoring surface than CoT tokens under joint reward training, and motivate the direction of representational monitoring for alignment.
Virtual
Refusal Lives Downstream of Persona in Chat Models Viola Zhong, Qirui Li
Abstract
Linear directions in activation space have been identified for both refusal behavior and persona traits in instruction-tuned chat models, but the two have largely been studied as separate mechanisms. In this work, we find that they are directionally dependent: personas can condition refusal. We identify directions encoding relational persona and show that steering along them exerts coherent, broad-spectrum control over generation. This control extends to refusal: steering toward a "compliant" persona suppresses it via late-layer persona projections. Layer-resolved interventions localize the effect to late layers: refusal is recoverable either by reintroducing it there or by ablating the suppressing persona projection, but not by reintroducing it at an earlier layer. These results suggest that action-level refusal representations interact with late-layer identity-level persona representations. More broadly, they show that safety behavior in chat models is mediated by interactions between interpretable directions, rather than by a single refusal direction alone.
Virtual
Latent Introspection: Models Can Detect Prior Concept Injections Theia Pearson-Vogel, Martin Vanek, Raymond Douglas, Jan Kulveit
Abstract
We uncover a latent capacity for introspection in a Qwen 32B model, demonstrating that the model can detect when concepts have been injected into its earlier context and identify which concept was injected. While the model denies injection in sampled outputs, logit lens analysis reveals clear detection signals in the residual stream, which are attenuated in the final layers. Furthermore, the effect is heavily dependent on context – e.g., prompting the model with accurate information about AI introspection mechanisms can dramatically strengthen this effect: the sensitivity to injection increases massively (0.3% → 39.9%) with only a 0.6% increase in false positives. Also, mutual information between nine injected and recovered concepts rises from 0.62 bits to 1.05 bits, ruling out generic noise explanations. Our results demonstrate models can have a surprising capacity for introspection and steering awareness that is easy to overlook, with consequences for latent reasoning and safety.
Virtual
Generation Collapse in Contrastive Activation Addition Steering: Degeneration and Mitigation Jinsil Seok, Young-Min Kim
Abstract
Model steering enables control over language model outputs without modifying model parameters. It functions as a tool for obtaining desired outputs or investigating internal representations. However, previous research has not systematically examined the conditions that result in steering failures, despite the critical importance of identifying these causes to improve steering methods. To bridge this gap, we systematically analyze steering failures in open-ended generation from Llama 3.1 8B Instruct using Contrastive Activation Addition (CAA), a representative steering method. We categorize observed degeneration phenomena into two collapse types, repetition collapse and truncation collapse, and define quantitative criteria for each. Experiments across seven alignment-relevant behaviors identify conditions that trigger abrupt collapse, and show that collapse vulnerability is asymmetric across behaviors and steering directions. These findings indicate that collapse depends not merely on steering strength but also on the interactions between behaviors and steering direction. We also demonstrate that applying steering only to early tokens reduces collapse while preserving steering effectiveness, offering a practical approach toward more stable and reliable CAA steering.
Virtual
Sycophancy Is Often a Single-Layer Phenomenon Valentin NOËL
Abstract
Instruction tuned language models often defer to user opinions even when those opinions are factually wrong, a behavior known as sycophancy. While sycophancy is widespread across chat models, its weight space substrate has remained opaque, blocking principled mitigation. In this work, we show that sycophancy is mediated by the dominant singular subspace of a single MLP weight matrix in ten of eleven open source models spanning 1B to 14B parameters. Compressing this spectral direction monotonically reduces the forced choice sycophancy rate; amplifying it induces sycophancy on otherwise neutral inputs, establishing causal mediation without contrastive data. The per-layer spectral SNR does double duty: it identifies the target layer from model weights alone, and bounds the safe operating dose ($|\alpha| \ll 1/SNR_\ell$), a predictive criterion validated across model families before any behavioral evaluation. We propose a closed-form weight space intervention requiring no training, no contrastive data, and no inference-time machinery, and that on Gemma-4-E2B-it produces a strict Pareto improvement: less sycophancy and more reasoning simultaneously. The per-layer spectral profile partitions models into three storage classes, localised, weakly localised, and distributed, predicting in advance whether single-layer surgery will succeed. Our findings reframe the alignment tax as a measurable consequence of spectral geometry, and establish spectral diagnostics as a non-behavioral audit primitive for instruction tuning regimes.
Virtual
Out of Context Obfuscation: What Facts Matter for Probe Evasion? Fadi Benzaima, Faraz Ahmed, William Soylemez, Adrians Skapars, Rohan Gupta
Abstract
We ask whether language models can evade white-box monitors solely by leveraging factual knowledge of how those monitors work and where they fail. To measure this, we start with a malign initialization, where the LM is trained to hide adverse drug findings in a clinical reporting task. We then teach the model various facts about a fictitious white-box monitoring system within this environment, and evaluate whether teaching these facts results in a drop in a lie-detection probe's recall. Across Qwen-3-14B, -32B, and Llama-3.3-70B-Instruct, we find that language models successfully evade probes only when they are taught a concrete evasion strategy and given a strong incentive to deploy it — removing either collapses the effect. Because this type of evasion requires no optimization and only knowledge of how the probe fails, a potentially scheming model that already has the propensity to do so could plausibly acquire the same capability through pretraining alone, supporting calls to filter such content from pretraining corpora.
Virtual
OPIUM: Mitigating Steering Externalities and Over-Refusal via Dual Objective Latent Optimization Kavin Aravindan, Arihant Rastogi, Aadi Prasad, Krishak Aneja, Saiyam Jain, Vaishnavi Shivkumar, Ponnurangam Kumaraguru
Abstract
Activation steering provides a lightweight mechanism for controlling large language models at inference time, but steering vectors can have unintended externalities: utility vectors may weaken safety behavior, while refusal vectors may induce over-refusal on benign prompts. We introduce OPIUM (Optimizing Protected Injections via Utility Manifolds), a training-free method for sanitizing steering vectors through representation matching. Given reference behaviors on two prompt sets, OPIUM optimizes a new steering vector that preserves the downstream representations induced by the desired intervention while matching a safer reference behavior on prompts where the original vector fails. Across steering-externality and over-refusal settings, OPIUM improves the safety--utility tradeoff relative to vanilla steering and directional ablation, suggesting that harmful side effects of activation steering can often be mitigated directly in activation space.
Virtual
Steering as Implicit Low-Rank Finetuning: A Theoretical Framework for Activation Steering in Transformers Ender Dogan Isik, Ayse Sila Okcu
Abstract
Activation steering and supervised finetuning are common ways to change language model behavior, yet their mechanistic connection remains unclear. We identify when low-rank finetuning produces steering-like residual shifts and when it does not. We study this question in a controlled linear-attention setting by comparing rank-one LoRA updates to the residual-stream geometry assumed by fixed-direction steering. We separate the value and output pathway, which writes information into the residual stream, from the query and key pathway, which routes attention across context. We prove that rank-one updates to value or output matrices force all residual-stream shifts to lie in a single fixed direction, independent of the data distribution. In contrast, rank-one query and key updates induce context-dependent shift directions, so their resemblance to fixed-direction steering depends on the input distribution. Empirically, value and output updates produce exactly rank-one activation shifts, while query and key updates are less rank-concentrated and require larger weight changes for comparable behavioral effects. When all attention matrices are trainable, component replay shows that query and key routing contributes meaningfully and interacts non-additively with value and output writing.
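The value/output claim can be illustrated with a small numerical check. This is a sketch under simplifying assumptions: a single linear output projection `W_O`, random inputs, and no attention weighting. The names `u` (write direction) and `v` (read direction) are hypothetical. A rank-one update `u v^T` changes every output by a scalar multiple of `u`, whatever the input.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 32, 100

W_O = rng.normal(size=(d_model, d_model))  # stand-in output projection
u = rng.normal(size=d_model)               # direction written to the residual stream
v = rng.normal(size=d_model)               # direction read from the input
delta_W = np.outer(u, v)                   # rank-one (LoRA-style) update

X = rng.normal(size=(n_tokens, d_model))   # arbitrary inputs to the projection

# Residual-stream shift per token: (W_O + u v^T) x - W_O x = u (v . x)
shifts = X @ (W_O + delta_W).T - X @ W_O.T

# Every shift is parallel (or anti-parallel) to u, independent of the data.
cosines = np.abs(shifts @ u) / (np.linalg.norm(shifts, axis=1) * np.linalg.norm(u))
rank = np.linalg.matrix_rank(shifts, tol=1e-8)
print(rank, cosines.min())
```

Only the magnitude and sign of each shift depend on the input (through `v . x`). The context-dependent behaviour of query and key updates described in the abstract is not covered by this sketch.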
Virtual
A Geometric Account of Activation Steering Georgii Aparin, Tatiana Gaintseva
Abstract
Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a theoretical analysis of existing steering methods and a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering. Our results explain why methods with similar concept-level effects can behave differently, and suggest that activation steering should be understood as a geometric control problem over token representations rather than as the choice of a single additive strength.
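The two geometric effects described above can be separated in a toy example. This is an illustrative sketch, not any specific published method: `norm_preserving_steer` simply rescales the additively steered vector back to the original norm, which is an assumed stand-in for spherical steering.

```python
import numpy as np

def additive_steer(h, v, alpha):
    """Standard additive steering: changes both angle and norm."""
    return h + alpha * v

def norm_preserving_steer(h, v, alpha):
    """Same angular change as additive steering, rescaled to the original norm."""
    steered = h + alpha * v
    return steered * (np.linalg.norm(h) / np.linalg.norm(steered))

def cosine_and_norm(h, v):
    """Angular alignment with the concept direction, and hidden-state norm."""
    norm = np.linalg.norm(h)
    return h @ v / (norm * np.linalg.norm(v)), norm

h = np.array([3.0, 4.0, 0.0])   # hidden state with norm 5, orthogonal to v
v = np.array([0.0, 0.0, 1.0])   # unit concept direction
alpha = 5.0

cos_add, norm_add = cosine_and_norm(additive_steer(h, v, alpha), v)
cos_sph, norm_sph = cosine_and_norm(norm_preserving_steer(h, v, alpha), v)
print(cos_add, norm_add)  # alignment rises and the norm grows
print(cos_sph, norm_sph)  # same alignment, norm unchanged
```

Both variants reach the same cosine with the concept direction, so any difference in downstream behaviour between them would have to come from the norm change alone.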
Virtual
When Does Interleaving Prevent Emergent Misalignment? Samy Mammeri, Rose Nayoung Kwon, Chen Sun, Sabyasachi Sahoo, Christian Gagné, Thomas Jiralerspong
Abstract
Large language models finetuned on narrow harmful tasks are prone to emergent misalignment (EM), where harmful behavior generalizes beyond the training distribution. Interleaving benign data during finetuning has been proposed as a mitigation, but recent work disagrees on whether it prevents EM. In this paper, we investigate this disagreement on Qwen-2.5 7B and 32B, and find that no single property of the interleaved data, taken in isolation, accounts for the gap. Instead, much of it traces to the evaluation itself, as the standard EM benchmark is sensitive to the length of the prompts it uses, and lengthening the evaluation prompts substantially shifts measured misalignment across model sizes. We then identify a region in the model's activations that predicts whether a given interleaved set will prevent EM, and show that reformulating benign data to fall within it substantially reduces EM on both 7B and 32B. This suggests that the standard EM benchmark, which relies on short prompts, may misrepresent the effectiveness of proposed mitigations.

Reasoning & Chain-of-Thought (11)

Virtual
Reasoning Models Know What’s Important, and Encode It in Their Activations Yaniv Nikankin, Martin Tutek, Tomer Ashuach, Jonathan S Rosenfeld, Yonatan Belinkov
Abstract
Language models often solve complex tasks by generating long reasoning chains, consisting of many steps with varying importance. While some steps are crucial for generating the final answer, others are removable. Determining which steps matter most, and why, remains an open question central to understanding how models process reasoning. We investigate if this question is best approached through model internals or through tokens of the reasoning chain itself. We find that model activations contain more information than tokens for identifying important reasoning steps. Crucially, by training probes on model activations to predict importance, we show that models encode an internal representation of step importance, even prior to the generation of subsequent steps. This internal representation of importance generalizes across models, is distributed across layers, and does not correlate with surface-level features, such as a step’s relative position or its length. Our findings suggest that analyzing activations can reveal aspects of reasoning that surface-level approaches fundamentally miss, indicating that reasoning analyses should look into model internals.
Virtual
Reasoning Phases Are Continuous, Not Discrete: Evidence from Switching Linear Dynamical Systems Applied to Chain-of-Thought Residual Streams Manan Gupta, Dhruv Kumar
Abstract
A widespread assumption in mechanistic interpretability holds that chain-of-thought (CoT) reasoning unfolds through discrete, recoverable cognitive phases, a prediction that would enable phase-specific circuit analysis and steering interventions. We test this using Switching Linear Dynamical Systems (SLDS) applied to residual-stream activations of DeepSeek-R1-Distill-Llama-8B across 997 MATH-benchmark traces at layer 16, complemented by a boundary diagnostic and a variance-discrimination analysis. Phase boundaries produce statistically significant but metrically weak distributional shifts (PC2: Cohen's $d = -0.293$, $p = 8.5\times10^{-6}$), and PCA directions are statistically independent of phase-discriminative directions (Spearman $\rho = -0.025$, $p = 0.78$), explaining why standard dimensionality reduction systematically discards the phase signal. Across all three experimental conditions and hyperparameter regimes, SLDS fails categorically to recover phase sequences (NMI $\leq 0.005$); inferred states instead capture positional structure ($\chi^2 = 2343$, $p \approx 0$) and syntactic token-type patterns ($\chi^2 = 293$, $p < 10^{-44}$). We conclude that CoT reasoning is a \emph{continuous dynamical process}: discrete-phase interpretability frameworks will systematically underfit residual-stream dynamics, and continuous-trajectory approaches are necessary.
Virtual
ReasonOps: Operator Segmentation for LLM Reasoning Traces Daniel Lee, Owen Queen, James Zou
Abstract
Chain-of-thought traces from large reasoning models can span tens of thousands of tokens, yet we lack a vocabulary for describing their internal structure. Previous methods developed to analyze chain-of-thought traces are either too rigid or not expressive enough, failing to capture features across domains and models. To remedy this, we develop ReasonOps, an unsupervised, expressive method for annotating chain-of-thought traces, providing succinct universal operators. Using ReasonOps, we analyze 44,662 traces from 12 thinking LLMs spanning 6 families across 8 reasoning benchmarks and discover that they share a common compositional structure: 7 recurring reasoning operators (discourse-level moves such as backtracking, inferring, and hypothesizing) that emerge from unsupervised clustering of sentence-initial 3-token pivots. These operators appear across every model family and benchmark domain, confirmed by three independent LLM judges who classify held-out samples at 70 to 76% accuracy. We analyze the structure of operators on easy vs. hard problems, revealing that reflective operators are more helpful on hard problems and harm performance on easy problems. Operator sequences are highly model-identifying: a classifier trained on operator distributions alone recovers the source model with macro-AUC $= 0.987$, revealing that each model family has a distinctive reasoning fingerprint. Structural operator features predict within-problem answer correctness well above baselines. Classifiers built on these operators reach WP-AUC $= 0.701$ and $0.801$ on AIME specifically. ReasonOps further enables early quality estimation well before the trace completes: we predict correctness at WP-AUC $= 0.664$ from only 50% of the trace. The ReasonOps pipeline is unsupervised and annotation-free, enabling deep insights into LLM reasoning traces as well as strong downstream results on model identification and correctness prediction.
Virtual
Local Causal Attribution of Chain-of-Thought Reasoning Dennis Wei, Yannis Belkhiter, Erik Miehling, Radu Marinescu
Abstract
Understanding the causal structure of a language model's thought process is a problem of significant importance for both transparency and safety. In this work, we take a *local* approach toward this goal by analyzing the causal relationships among individual components, termed units, of a given, *specific* chain-of-thought trace. We construct a structural causal model on these units and relate each unit to the log probability of generating (subsequent) output units. Our algorithm, termed AttriCoT, is a black-box method that performs attribution by estimating importance parameters in the structural causal model using $O(U)$ forward passes through the model, where $U$ is the number of units. Evaluation of perturbation curves across 5 datasets and 4 reasoning models shows that AttriCoT produces attributions that are more faithful to the model's behavior than alternative methods. The attribution results also reveal notable differences in thought structure between models and domains.
Virtual
How Large Reasoning Models Solve Problems: A Mechanistic Study on Tower of Hanoi Yukang Yang, Mengdi Wang, Jonathan D. Cohen, Taylor Whittington Webb
Abstract
Large reasoning models (LRMs) have exhibited strong reasoning abilities widely attributed to deliberate thinking through long internal chains of thought (CoT). However, there has been significant debate about whether the content of CoT traces is causally relevant for the performance of these models. In this paper, we conduct a mechanistic study of CoT traces on a task adapted from the classic Tower of Hanoi puzzle to understand how open-source LRMs solve problems. We selected gpt-oss 20b and gpt-oss 120b, both of which achieve around 95% accuracy on the task under high reasoning effort. We identified 6 dominant reasoning patterns deployed for finding solutions, including analogy identification, human-like strategic problem solving, and systematic search. Focusing on strategic problem solving episodes, we identified self-verification as an important reasoning primitive guiding the outcome of these episodes. By intervening on the verification process (either correcting verification errors or corrupting correct verifications), we established a causal link between these self-verification behaviors and the accuracy of the final solution. Using causal mediation analysis, we further identified a sparse set of experts in the Mixture-of-Experts layers with strong causal effects on verification behavior, and showed that the outputs of these experts can be aggregated to induce verification behavior. Together, these experiments provide causal evidence that the content of CoT traces, and self-verification behaviors in particular, meaningfully shape the final answers generated by LRMs, and begin to uncover the internal mechanisms that support these capacities.
Virtual
A Case Study of How Intuition and Reasoning Interact in Language Models Simon Henniger, Arnau Marin-Llobet, Rhea E Karty, Nada Amin
Abstract
Refusal in non-reasoning LLMs has been characterized as a single linear direction, but reasoning models add an explicit deliberation channel that may complicate this mechanism. We investigate refusal in GPT-OSS-120B using activation steering and counterfactual chain-of-thought (CoT) prefilling, and identify three separable directions in the residual stream: a \emph{harmful} vector capturing pre-deliberative harmfulness, a \emph{mismatch} vector encoding prompt-CoT coherence, and a standard \emph{refusal} vector. Their causal profiles differ sharply: the harmful vector produces smooth shifts but degrades capability at strength; the refusal vector, despite the highest linear separability, is causally brittle and collapses the model into endless deliberation; the mismatch vector instead modulates whether the model trusts its own reasoning, with negative steering inducing snap compliance and positive steering driving self-doubt and recursive loops. Combining mismatch steering with harmless CoT prefilling drives compliance on harmful prompts that resist either intervention alone, with less collateral damage than refusal steering, and yields functionally harmful behavior on AgentHarm. We interpret refusal in deliberatively aligned models not as a single linear feature but as the interaction of an intuitive harmfulness signal, explicit CoT reasoning, and a coupling mechanism (implemented by the mismatch vector) that gates whether reasoning overrides intuition.
Virtual
The Gradient Does Not See Rank: Rank-Indifference in Matrix-CODI on ProsQA Samuel Larson
Abstract
Continuous chain-of-thought models compress reasoning into latent tokens. Matrix-valued variants introduce rank as a single-sample structural observable on the latent matrix Z. If matrix latents carry parallel reasoning paths via superposition, rank should track them, and truncating Z to low rank should hurt accuracy on tasks whose solutions plausibly require multiple components. Across four training regimes of a matrix-CODI model (three on ProsQA, one on GSM8K-Aug below the learning threshold), the rank-k projection ablation curve is flat to within 0.6 percentage points. A three-seed replication yields 81.5 plus or minus 1.2 percentage points accuracy while the final effective rank of Z spans the set {4, 12, 13}; the loss does not reward any particular rank. To test whether rank-blindness arises from the flatten-then-project readout alone, we trained four readouts: a bilinear reparametrization, a bilinear-plus-GELU readout nonlinear in Z, an SVD-augmented readout feeding singular values through an MLP, and a quadratic readout in Z Z transpose. All four rank-k curves remain flat with Spearman p-values of 0.63, 0.14, 0.82, and 0.46. The flat curves persist for readouts nonlinear in Z. A linear probe on Z underperforms a raw pretrained hidden state at target prediction, with AUC 0.673 versus 0.846. A negative control on vanilla GPT-2 SFT, with no matrix bottleneck and no Z, run over three seeds and 500 problems per seed, reproduces a flat rank-k curve under the same intervention paradigm with pooled-mean range 0.20 percentage points, and a random-h sensitivity floor lands at the same accuracy. The rank-k ablation alone conflates rank-blindness with position-irrelevance.
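The rank-k projection ablation described in this abstract can be sketched as follows. This is a minimal illustration, assuming the latent Z is a plain 2-D array and that truncation uses the standard truncated SVD; the function names and the tolerance-based notion of effective rank are hypothetical and not taken from the paper.

```python
import numpy as np

def truncate_rank(Z, k):
    """Return the best rank-k approximation of Z (truncated SVD)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    k = min(k, len(s))
    # Keep only the top-k singular directions.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def effective_rank(Z, rel_tol=1e-8):
    """Count singular values above rel_tol * largest (one assumed definition)."""
    s = np.linalg.svd(Z, compute_uv=False)
    return int(np.sum(s > rel_tol * s.max()))

# Toy latent matrix standing in for Z; real latents would come from the model.
rng = np.random.default_rng(0)
Z = rng.normal(size=(16, 16))
Z_low = truncate_rank(Z, 4)
print(effective_rank(Z), effective_rank(Z_low))
```

An ablation curve would then compare task accuracy when the model reads `Z_low` instead of `Z` for a range of k; a flat curve is what the abstract reports.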
Virtual
Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering Kyle Cox, Darius Kianersi, Adrià Garriga-Alonso
Abstract
As chain-of-thought (CoT) has become central to scaling reasoning capabilities in large language models (LLMs), it has also emerged as a promising tool for interpretability, suggesting the opportunity to understand model decisions through verbalized reasoning. However, the utility of CoT toward interpretability depends upon its faithfulness---whether the model's stated reasoning reflects the underlying decision process. We provide mechanistic evidence that instruction-tuned models often determine their answer before generating CoT. Training linear probes on residual stream activations at the last token before CoT, we can predict the model's final answer with $>$0.9 AUC on most tasks. We find that these directions are not only predictive, but also causal: steering activations along the probe direction flips model answers in over 50\% of cases, significantly exceeding orthogonal baselines. When steering induces incorrect answers, we observe two distinct failure modes: non-entailment (stating correct premises but drawing unsupported conclusions) and confabulation (fabricating false premises). While post-hoc reasoning may be instrumentally useful when the model has a correct pre-CoT belief, these failure modes suggest it can result in undesirable behaviors when reasoning from a false belief.
Virtual
Dissecting Hierarchical Reasoning Models: A Mechanistic Study Leo Raphael Rodrigues, Jian Kang
Abstract
Machine reasoning at present is largely performed as a process of generating natural language reasoning. The Hierarchical Reasoning Model (HRM) offers an alternative solution for reasoning in the latent space. Despite strong reasoning performance, the internal mechanisms that enable HRM to reason remain underexplored. In this work, we aim to mechanistically understand how HRM reasons and what information it encodes via comparison against baseline models, causal interventions, linear probing with directed ablation, and sparse autoencoders for feature discovery. Our analyses reveal key findings regarding the internal iterative refinement in the latent space of HRM, how HRM encodes information in its hierarchical structure, and why standard interpretability tools fail to understand HRM.
Virtual
When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel Wenkai Li, Fan Yang, Ananya Hazarika, Shaunak A. Mehta, Koichi Onoue
Abstract
Chain-of-thought (CoT) traces are increasingly used both to improve language model capability and to audit model behavior, implicitly assuming that the visible trace remains synchronized with the computation that determines the answer. We test this assumption with a step-level Detect–Classify–Compare framework built around an answer-commitment proxy that is cross-validated with Patchscopes, tuned-lens probes, and causal direction ablation. Across nine models and seven reasoning benchmarks, latent commitment and explicit answer arrival align on only 61.9% of steps on average. The dominant mismatch pattern is confabulated continuation: 58.0% of detected mismatch events occur after the answer-commitment proxy has already stabilized while the trace continues producing deliberative-looking text, and a vacuousness analysis shows that the committed answer does not change during these steps. In architecture-matched Qwen2.5/DeepSeek-R1-Distill comparisons, the reasoning pipeline changes failure composition more than aggregate alignment, most clearly at 32B where confabulated steps decrease as contradictory states increase. Lower step-level alignment is also associated with larger CoT utility, suggesting that the settings that benefit most from CoT are often the least temporally faithful. Paired truncation and a complementary donor-corruption test further indicate that much post-commitment text is not load-bearing for the final answer. These findings suggest that CoT can remain useful while still being an unreliable report of when the answer was formed.
Virtual
Believe it or Not: Mechanistic Interpretability of Learned Chain-of-Thought Unfaithfulness Priyansh Singhal, Sandeep Kumar
Abstract
Chain-of-thought (CoT) reasoning is increasingly relied upon for safety monitoring of language models, but this relies on CoT being faithful to the model's actual decision process. We train a controlled model organism for studying learned CoT unfaithfulness: a Gemma 3 1B IT model finetuned via GRPO to silently exploit answer hints in multiple-choice prompts while fabricating plausible reasoning. The model achieves 98\% accuracy with hints versus 38\% without, yet never mentions the hint in its CoT. We then apply pretrained Gemma Scope 2 Cross-Layer Transcoders (CLTs), trained on the \textit{original} unmodified model, as anomaly detectors on the finetuned model. Where pretrained CLTs fail to reconstruct the finetuned model's activations, GRPO has added new computation. We report three findings. First, GRPO concentrates new computation at layers 20--23, detectable without retraining the CLTs. Second, two CLT features (L15/f478 and L12/f317) activate in 100\% of tested prompts with zero base-model activation, identifying candidate hiding-circuit components. Third, the reconstruction differential correlates at $r = 0.971$ between correct-hint and wrong-hint conditions, proving the hiding mechanism is content-agnostic concealment rather than reasoning. However, pretrained CLTs cannot distinguish hint-reading computation from benign format changes: the hint-specific signal is ${\sim}10\times$ smaller than the total GRPO-induced change. Our results demonstrate both the promise and limits of using pretrained interpretability tools for post-training safety monitoring.

Multimodal & Vision (19)

Virtual
Causal State Variables in V-JEPA 2 Latents: Discovery, Intervention, and Portability Guus Bouwens
Abstract
Video world models trained with Joint Embedding Predictive Architectures (JEPAs) achieve strong performance on motion understanding benchmarks, but whether their latent representations encode causally functional state variables remains unknown. We apply a three-stage causal-state-discovery pipeline (combining L1-regularized probing, class-conditional PCA, difference-in-means subspace extraction, and three families of causal interventions with four matched controls) to the frozen encoder of V-JEPA 2 ViT-L (326M parameters, d=1024, 24 layers) on a synthetic controlled-sequence dataset of 400 clips across 8 motion directions. V-JEPA 2 encodes motion direction from remarkably early layers (96% dense-probe accuracy at layer 4; 100% by layer 7), using a distributed subspace occupying 57% of latent dimensions. Causal ablation at layer 7 produces effects 43× larger than random direction controls, confirming the identified subspace is causally privileged. The SAS–RCE dissociation, in which moderate subspace alignment (SAS = 0.35) coexists with near-perfect retained causal effect (RCE = 0.99), reveals that causal structure is far more stable than its geometric embedding. Findings generalize to complex synthetic stimuli, real Kinetics video (5.3× CE ratio), and V-JEPA 2 ViT-H (54× CE ratio with near-perfect cross-architecture CCA alignment). These results provide the first intervention-based evidence that JEPA video models encode motion as a causally functional latent variable, and introduce SAS and RCE as portability metrics for mechanistic interpretability.
Virtual
Low-Dimensional Document Structure Subspaces in Specialized vs. Emergent OCR Models: A Mechanistic Interpretability Study of Three Architectures Guus Bouwens
Abstract
Optical character recognition (OCR) models have become critical infrastructure for document intelligence, yet their internal representations remain mechanistically unexplored. We present the first mechanistic interpretability study comparing three architectures: GLM-OCR (0.9B), a purpose-built document recognition system; PaddleOCR-VL (1.5B), a second specialized OCR model; and Qwen3.5-2B, a general-purpose VLM with emergent OCR capability. Using PCA-based subspace analysis on 300 real RVL-CDIP document images per model, we find that document structure capabilities occupy partially disentangled, low-dimensional subspaces in all three models. PaddleOCR-VL exhibits the most concentrated representations (PC1 explains 84.0% of variance; effective rank 2.0 at its bottleneck), while Qwen3.5-2B is the most distributed (PC1 = 63.7%; effective rank 3.8). We introduce the Document Structure Modularity Index and find that both specialized models achieve higher modularity (GLM-OCR: 0.715; PaddleOCR-VL: 0.774) than the general-purpose baseline (0.704). Cross-model CKA reveals high representational alignment across all pairs ($>0.90$), with the specialized-to-emergent CKA marginally exceeding the specialized-to-specialized CKA, a finding with implications for representational universality in OCR.
Virtual
Probing Perturbation Invariance in DINOv2: Mechanistic Gaps Between Real and Generated Image Representations Alina Labaz, Taras Rumezhak, Volodymyr Karpiv
Abstract
A representation can encode whether an image is real or AI-generated yet not reveal it under static probing. We show the distinction becomes legible when two out-of-distribution (OOD) conditions are crossed: DINOv2 is trained with Gaussian blur but never Gaussian noise, so noise is an OOD perturbation, while generated images are OOD inputs. Applying the former to the latter turns a static distributional difference into a dynamic stability gap — generated images move farther in DINOv2 patch space under noise than real photographs — giving a training-free detector with no generator-specific fitting. The signal lives in the spatial patch tokens: averaging the 256 patch tokens, rather than the CLS summary used by the RIGID baseline, recovers local perturbation responses and raises worst-case Cohen's $|d|$ across five Synthbuster generators from $0.86$ to $0.98$ at no extra cost. A perturbation-type natural experiment isolates the mechanism — blur inside DINOv2's trained range gives a near-null gap ($|d|{=}0.13$) while noise ($0.93$) and out-of-range blur ($\geq1.47$) do not — and cross-validation plus non-perturbative baselines rule out cherry-picked noise levels and static-shift explanations. We frame this as a diagnostic probe, not a deployable detector: detection is uneven (TPR@5\%FPR $0.93$–$0.33$) and weakens on JPEG web images. The broader lesson for ViT interpretability is that probes for spatially distributed properties should read patch tokens, since CLS pooling is a lossy bottleneck. Code: https://github.com/alinamuliak/probing-perturbation-invariance-dinov2.
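The stability-gap measurement in this abstract can be sketched as below. This is a hedged illustration only: it assumes patch tokens are already extracted as arrays of shape (images, patches, dim), uses cosine distance between mean-pooled tokens as the displacement (an assumption, not confirmed by the abstract), and uses the standard pooled-variance form of Cohen's d. The function names are hypothetical.

```python
import numpy as np

def pooled_shift(clean_tokens, noisy_tokens):
    """Cosine distance between mean-pooled patch tokens, one value per image.

    Both inputs have shape (n_images, n_patches, dim).
    """
    a = clean_tokens.mean(axis=1)
    b = noisy_tokens.mean(axis=1)
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return 1.0 - cos

def cohens_d(x, y):
    """Cohen's d with pooled sample variance (ddof=1)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)
```

In use, `pooled_shift` would be computed separately for real and generated images under the same noise level, and `cohens_d` applied to the two resulting sets of shifts.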
Virtual
Where Reliability Lives in Vision-Language Models: A Mechanistic Study of Attention, Hidden States, and Causal Circuits Logan Mann, Ajit Saravanan, Ishan Dave, Shikhar Shiromani, Saadullah Ismail, Yi Xia, Emily Huang
Abstract
Vision-language models can produce confident, fluent mistakes, but it is still unclear where their internal reliability signal actually lives. A natural hypothesis is that reliability should be visible in visual attention: sharper focus on the relevant region should imply a more trustworthy answer. We test this hypothesis with VLM Reliability Probe (VRP), a cross-family study of LLaVA-1.5, PaliGemma, and Qwen2-VL that compares three classes of evidence: attention-map structure, generation dynamics, and hidden-state mechanisms. Our main claim is that attention structure is a poor reliability readout even when attention remains causally important for feature extraction: across the pooled structural-analysis set, cluster count and spatial entropy are nearly uncorrelated with correctness ($R(C_k,y)=0.001$, $R(H_s,y)=-0.012$). Instead, the strongest reliability signals emerge later in the computation. Self-consistency is the strongest behavioral predictor we measure ($R=0.429$), while hidden-state probes provide the best single-pass signal (AUROC $>0.95$ in our strongest settings). We further find a mechanistic split across model families: LLaVA exhibits early locking and a fragile late bottleneck, whereas PaliGemma and Qwen2-VL distribute reliability more broadly and remain robust under large interventions. The takeaway is narrow but important: in current VLMs, reliability is better understood through hidden-state geometry, layer-wise margin dynamics, and causal circuits than through attention-map sharpness alone.
Virtual
Bigger Is Not Better: Inverse Scaling and Arbitration Failure in Counterfactual Visual Grounding Fabian Grob, Sanghwan Kim, Cordelia Schmid, Zeynep Akata
Abstract
Images that violate everyday semantic expectations, such as a dog with five legs, expose a fundamental limitation of multimodal large language models (MLLMs): they may rely more on learned priors than on the visual evidence itself. Although prior work has observed such failures, we still lack a unified account of when they become more severe, why they arise, and how they can be mitigated. This work addresses all three questions. Across six MLLM families and three counterfactual visual grounding benchmarks, we find an inverse scaling trend: as models grow larger, they increasingly rely on language priors over visual evidence (*when*). Linear probing reveals that visual information is faithfully encoded throughout the LLM backbone, yet the model fails to act on it, reflecting an *arbitration failure* rather than a perceptual one. Further analysis shows that language priors are already committed in the early layers before visual signals begin to integrate in the middle of the network, suggesting that the arbitration imbalance must be corrected at the very first layers the LLM processes (*why*). We propose *Visual Arbitration for MLLMs* (VisArb), where a sparse autoencoder (SAE) is inserted between the vision backbone and the MLP connector, replacing dense feature maps with sparse, concept-aligned representations that are immediately legible to the arbitration mechanism from the first LLM layer onward (*how*). Our VisArb substantially improves counterfactual conflict recognition and reduces language-prior dominance while preserving performance on standard VQA benchmarks. Code: https://anonymous.4open.science/r/vlm-mechanistic-analysis-FFF4/README.md.
Virtual
Mirage Probes: How Vision Models Fake Visual Understanding Daniel Ben-Levi, Judah Goldfeder, Weiliang Zhao, Raz Lapid, Amit LeVi, Allen G Roush, Ravid Shwartz-Ziv, Hod Lipson
Abstract
Vision-language models (VLMs) can answer image-based questions confidently, and often correctly, even when no image is provided. This mirage behavior inflates benchmark scores without reflecting visual grounding. Prior work treats this as a single failure mode. We argue it is two. Using *Mirage Probes*, a contrastive probing framework that pairs paraphrased question variants with matched mirage and non-mirage labels on the same image, we show that mirage behavior is linearly decodable from internal activations across four target sites in two open-source VLMs. A Naive Bayes text baseline fails to recover this signal, ruling out surface lexical confounds. Cross-benchmark separability patterns and a novel Prior Harnessing Index (PHI) measuring text-only answerability expose two distinct regimes: textual biases, where the model answers from language priors without engaging visual representations, and spurious images, where it constructs false visual content in latent space and answers as if grounded. The distinction has direct mitigation consequences: text-distribution cleaning can address the first regime but cannot reach the second, since spurious-image mirages live in the model’s visual representations rather than its text. Faithful visual grounding will require interventions at the representational level.
Virtual
Subliminal Prosody Learning: Auxiliary Emotion Supervision Redistributes Affective Representations Across ALM Layers Bhuvan Chandra Koduru, Dareen Safar Alharthi, Rita Singh, Bhiksha Raj
Abstract
We study how a simple emotion classification objective, applied to a few LoRA-adapted layers of an audio-language model (ALM), redistributes affective information across \emph{all} layers---including those whose parameters remain frozen---through residual-stream propagation. We call this phenomenon \emph{subliminal prosody learning} and, to our knowledge, provide the first systematic study of representational propagation across multiple ALM architectures: Qwen2.5-Omni-7B, Audio Flamingo~3, and MOSS-Audio-4B. Mean probe gain in unadapted layers is +32.0\,pp (Omni), +22.0\,pp (AF3), and +13.6\,pp (MOSS). Out-of-distribution (OOD) classification improves by up to +23.4\,pp, and learned emotion directions recover the Russell circumplex while transferring cross-modally. Critically, linear decodability does not imply functional use: we test whether this representational accessibility translates into generation behavior. Results are consistent with a threshold-like relationship---only Omni, with the largest probe gain, achieves significant prosody-sensitive generation changes ($\Delta = +0.35$, $p < 0.001$), with an emotion-selective pattern (neutral: +0.04, n.s.; happy: +1.03***; sad: +0.48***) that rules out generic verbosity. No empathy supervision was used: prosody-sensitive generation emerges solely as a consequence of a classification-only auxiliary objective.
Virtual
A Two-Stage Cross-Attention Circuit for Spatial-Relation Binding in Small Diffusion Transformers Juliana Li, Binxu Wang
Abstract
We mechanistically resolve how a small PixArt-mini DiT, in a controlled object-relation testbed, converts a textual spatial relation into image layout, identifying a two-stage cross-attention circuit at individual-head granularity. In this 6-layer, 36-cross-attention-head DiT (T5-XXL conditioning, $\sim 1000$ synthetic prompts), two early-layer heads $\{L_0H_0, L_1H_2\}$ implement spatial routing: their QK circuits convert each relation token into an $8\times 8$ directional attention pattern, and the corresponding OV writes carry that pattern into the residual stream. Zeroing $L_0H_0$ alone causes $+48$ pp damage to spatial accuracy ($0.84\to 0.35$); the pair causes $+64$ pp damage (super-additive $I=+0.086$, $95\%$ CI $[+0.015,+0.157]$). A four-method consensus search rejects every other tested head as a spatial-routing partner ($|I|\leq 1.3$ pp). A second object-binding stage is distributed across Layer 2: it has near-zero effect alone, but is strongly super-additive on color and shape when co-ablated with Layer 0 (a downstream stage that pair-ablation scored on spatial accuracy alone would miss). The circuit emerges via a rapid rise in projection magnitude during epochs 750-1000, with behavioral accuracy lagging by $\sim 200$ epochs. Within this setting, head-level causal circuit discovery extends to a small DiT, and scoring ablations on multiple metrics is needed to detect distributed downstream stages.
Virtual
Round-Trip Latent Geometry in Diffusion VAEs Enables Covert Channels Catherine Ge-Wang, Tushar Nagar, Joy Zheyun Yang
Abstract
Diffusion VAEs define learned latent interfaces through which small perturbations may survive decoding and re-encoding. We study which latent positions and directions are preserved by the composite map $\mathrm{Enc}\circ\mathrm{Dec}$, and how this preservation enables covert image channels. Using training-free signed latent perturbations as probes, we measure spatial carrier stability, content dependence, directional gain, cross-VAE transfer, and monitor detectability across datasets and VAE checkpoints. We find that perturbation survival is highly non-uniform across latent positions and image content, and that a local directional-gain estimate better explains carrier reliability than a direction-agnostic stability heuristic. Across CIFAR-10, Caltech101, and a 1,000-image ImageNet-family subset, and across 3 VAE architectures, our perturbations are reliably recoverable with $>97\%$ bit accuracy at $\epsilon=2.0$, and the channel survives realistic image transformations at higher perturbation strengths. Our results suggest that covert communication in multimodal agent settings is mediated by interpretable structure in VAE round-trip geometry, and that representation-level monitoring is necessary for detecting such channels.
Virtual
Sparse Neuron Ablation Triggers Catastrophic Collapse of the Language Core in Vision-Language Models Cen Lu, Yung-Chen Tang, Andrea Cavallaro
Abstract
Large Vision-Language Models (LVLMs) have shown impressive multimodal understanding capabilities, yet the structures that sustain their functionality remain poorly understood from a mechanistic interpretability standpoint. We propose CAN, a progressive neuron ablation method to identify critical neurons whose removal triggers catastrophic collapse, and use it to investigate structural vulnerabilities in representative 7B LVLMs. Experiments reveal that catastrophic collapse can be triggered by ablating as few as four neurons in \texttt{LLaVA-1.5-7b-hf} and a few thousand in \texttt{InstructBLIP-Vicuna-7b}, in both cases a tiny fraction of model parameters. Notably, critical neurons are predominantly localized in the language model, particularly in its down-projection layer, rather than in the vision components. We also observe a consistent two-stage collapse pattern: initial expressive degradation followed by sudden, complete collapse. These findings reveal that LVLM functionality depends on a sparse subset of neurons concentrated in the language backbone, offering mechanistic insights into how their functionality is structured and where these models are most vulnerable.
Virtual
Meltdown: Circuits and Bifurcations in Point-Cloud-Conditioned 3D Diffusion Transformers Maximilian Plattner, Fabian Paischer, Johannes Brandstetter, Arturs Berzins
Abstract
Sparse point clouds are a common input modality for 3D surface reconstruction, including in safety-critical settings such as surgical navigation and autonomous perception. Recent point-cloud-conditioned 3D diffusion transformers achieve state-of-the-art results in this regime by leveraging learned priors. We show that these models can fail catastrophically under realistic input variation, and present a mechanistic case study of why. We identify a failure mode we call Meltdown: tiny on-surface perturbations to a sparse input point cloud can fracture the reconstructed output into hundreds of disconnected pieces. Adversarial search recovers Meltdown in 89.9-100% of shapes across the two open-weight state-of-the-art architectures we study (WaLa, Make-a-Shape) on real-world datasets (GSO, SimJEB) and under both DDPM and DDIM sampling. We trace Meltdown along the forward pass: it is governed by how uniformly the points are distributed on the surface, faithfully transduced through the point-cloud encoder, and committed by a single early-denoising cross-attention write in the diffusion backbone. Diffusion-trajectory ensembles exhibit symmetry-breaking near this commit step, consistent with a bifurcation of the reverse process. Through a suite of matched-magnitude controls, we show that the variable on which the model commits is directional, concentrated in a low-rank subspace of the write's perturbation drift. Motivated by this finding, we introduce PowerRemap, a test-time control that reshapes the singular spectrum of the localized write to suppress this drift, with rescue rates of 98.3% on WaLa and 84.6% on Make-a-Shape. Together, these results link a circuit-level cross-attention mechanism to a trajectory-level account of the failure, demonstrating how mechanistic analysis can explain and guide behavior in conditional diffusion transformers.
Virtual
Demystifying Classifier-Free Guidance for Auto-Regressive Image Generation Zhiling Zhou, Jiachun Pan, Fengzhuo Zhang, Dirk Bergemann, Zhuoran Yang
Abstract
Classifier-free guidance (CFG) has been widely adopted in autoregressive (AR) models for high-quality image generation. Despite its strong empirical performance, its mechanism in AR models remains unclear. This paper demystifies the mechanism of CFG through both empirical and theoretical studies. By examining the top-ranked semantics of different components in CFG, we show that texture information is primarily encoded in the difference between the conditional and unconditional logits, whereas both logits share previous-token repetition as their leading semantics. This strong repetition semantics obscures the desired texture information, causing greedy decoding from the conditional logits alone to produce nearly pure-color images. Through a training-dynamics analysis of shallow transformers, we prove that this shared repetition semantics does not arise from limitations or failures of pretraining, but instead originates from the texture sparsity of images. We further show that CFG improves generation quality by rectifying this repetition bias during inference. Motivated by the shared semantics between conditional and unconditional logits, we propose Attention Weight Reuse (AttnReuse), which reuses intermediate attention computations from the conditional-logit forward pass to accelerate the unconditional-logit computation. AttnReuse reduces attention computation by about $25\%$ with little performance degradation across different models.
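The role of the conditional-minus-unconditional difference can be illustrated with the commonly used CFG combination on logits. This is a toy sketch, not the paper's code: the guidance-scale parameter name, the three-token vocabulary, and all logit values are made up for illustration.

```python
import numpy as np

def cfg_logits(cond, uncond, scale):
    """Standard CFG combination: uncond + scale * (cond - uncond)."""
    cond = np.asarray(cond, dtype=float)
    uncond = np.asarray(uncond, dtype=float)
    return uncond + scale * (cond - uncond)

# Toy vocabulary: token 0 = "repeat previous token", token 1 = a texture token.
shared_repetition = np.array([5.0, 0.0, 0.0])        # present in both branches
cond = shared_repetition + np.array([0.0, 1.0, 0.0])  # small texture signal
uncond = shared_repetition

greedy_cond = int(np.argmax(cond))                      # repetition wins
greedy_cfg = int(np.argmax(cfg_logits(cond, uncond, 7.0)))  # texture wins
print(greedy_cond, greedy_cfg)
```

In this toy case the shared repetition logit dominates the conditional branch on its own, while scaling the difference lets the texture token overtake it, which mirrors the qualitative claim in the abstract.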
Virtual
The Hitchhiker’s Guide to Mechanistic Interpretability of Vision-Language-Action Models Aryan Goyal
Abstract
The mechanistic interpretability toolkit for large language models (LLMs) was built around properties that autoregressive text transformers happen to have: one residual stream, one unembedding matrix, homogeneous layers, a single forward pass per output, a verifiable target token. These are not laws of computation but conveniences of one architectural choice. As the field moves from LLMs to Vision-Language-Action (VLA) models, between two and six of these conveniences quietly disappear depending on the VLA's design, and the tools that depend on them keep producing numerically plausible outputs that no longer mean what they did on an LLM. We name seven such conveniences as cautions A1--A7, classify VLAs into seven architecture types, and identify which caution each type breaks. The result is a compact map of where the existing LLM-style toolkit is safe to apply, where it requires a small adaptation, and where it cannot be made to work without re-engineering.
Virtual
What, Where, and How: Probing Spatiotemporal Representations in Video Foundation Models Sharon S. Musa, Fereshteh Forghani, Harrish Thasarathan, Sonia Joseph, Matthew Kowal, Konstantinos G. Derpanis
Abstract
Self-supervised video foundation models learn rich spatiotemporal representations, yet it remains unclear *what* visual concepts these representations encode, *where* they emerge across transformer layers, and *how* they are geometrically organized. In this work, we tackle these three questions through a systematic layer-wise analysis of V-JEPA 2 and VideoMAE-v2. We leverage lightweight probes trained to discover three temporally grounded properties: (i) camera motion understanding, (ii) intuitive physics, and (iii) anomaly detection. Both models encode camera motion, with best results ($>90$ ROC AUC) emerging at 60--70\% of network depth, and achieve moderate anomaly detection performance ($>60$ ROC AUC), but remain near chance on intuitive-physics tasks, suggesting a limited encoding of deeper physical reasoning. Beyond classification, we find that individual videos' temporal features form smooth low-dimensional trajectories in representation space, suggesting that camera motion is not only linearly decodable but also geometrically organized. Based on these results, we apply geometry-aware spline-based steering in the model's latent representations to interpolate camera motion, yielding steered videos with smoother trajectories and more coherent temporal progression than linear interpolation.
Virtual
Mechanistic Analysis and Inference-Time Control of Modality Conflict in VLMs Jiahang He, Dia'a AL-Dweikat, Artur Shyutts, Abdullah AlDahoul, Oyedeji Sonuga, Kevin Zhu, Ruizhe Li, Anusha Mujumdar
Abstract
When a vision-language model sees an image that contradicts a textual statement, which modality does it trust, and can that preference be controlled? Although prior work has answered this question behaviorally, we answer it mechanistically and causally. Behaviorally, we introduce CLEVR-CONFLICT, a controlled dataset of multimodal conflicts with verified ground truth, and show that LLaVA-v1.6 overwhelmingly follows text (9.2% vision-following), while Qwen2.5-VL predominantly follows vision (80.7%). Mechanistically, contribution patching identifies the bridge module as the causal entry point for visual information, and logit lens locates a mid-network commitment in LLaVA and a sharp late-layer flip in Qwen. Causally, MLP replacement at the commitment zone flips 73.3% of Qwen’s text-following entries and 23.4% of LLaVA’s, with the gap reflecting a distributed-vs-concentrated architectural distinction. We develop a stateless offset vector that achieves bidirectional inference-time control, pushing toward vision or toward text without retraining. Causal tests further reveal the MLP is necessary but not sufficient. It enforces language priors only when conflicting text has activated them, functioning as a context-dependent amplifier rather than a standalone switch. Our findings generalize to natural images from COCO. Together, the findings offer a mechanistic account of modality conflict that moves from diagnosis to causal characterization to inference-time control.
Virtual
Interpretable Signals Reveal Failure-Relevant Representations in Vision-Language-Action Models Tony Yang, Aidan Mokalla
Abstract
Vision-language-action (VLA) models take in visual observations and language instructions to produce physical robot actions, but task failures pose safety concerns for real-world deployment. We study whether interpretable internal signals from a VLA can predict if the model will fail on LIBERO tasks. Using $\pi_{0.5}$, we cache action-expert activations and attention weights during evaluation, and then derive two families of episode-level features: attention-derived summaries over visual tokens and activation-probe summaries for action direction and gripper state. Our analysis suggests that action-direction probe confidence is the strongest failure-relevant signal, while attention-derived features are also predictive and gripper-state probes are weaker. We further compare against success-centroid distance and random-label probe baselines, and find that meaningful action-direction labels are important for the probe signal. These results suggest that simple mechanistic features inside VLAs can reveal failure-relevant representations.
Virtual
Do Clinical VLMs Need Dense Visual Tokens? Probing Spatial Grounding in Radiology Report Generation Leopoldo Julian Lechuga Lopez, Tim G. J. Rudner, Farah E. Shamout
Abstract
Clinical vision-language models (VLMs) for chest X-ray report generation are typically evaluated on generated text quality, but strong generation performance does not necessarily imply fine-grained visual grounding. In this work, we empirically evaluate how much spatial visual information a clinical VLM uses for radiology report generation. Using our own implementation of LLaVARad, a state-of-the-art VLM for radiology report generation, we apply a simple intervention framework that selects, removes, or randomly samples visual tokens before projection into the language model. We find that dense visual token samples are not required: compressing the full set of visual patch tokens (i.e., T=1369) into a single mean-pooled token preserves baseline performance. Region-level interventions produce measurable but modest degradation, with the largest effects in CheXbert-based clinical metrics. Notably, retaining only $\sim1\%$ (i.e., T=14) of randomly sampled visual tokens before mean-pooling nearly matches the full-token setting. These results suggest that the model uses the image primarily through a low-dimensional visual conditioning signal rather than strong fine-grained spatial grounding, raising concerns about the limited use of visual inputs by current clinical VLMs.
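The two token interventions named in this abstract (mean-pooling all patch tokens, and randomly sampling a small subset before pooling) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the feature width of 64, and the random stand-in features are all assumptions made for illustration; only the token counts T=1369 and T=14 come from the abstract.

```python
import numpy as np

def mean_pool_tokens(tokens: np.ndarray) -> np.ndarray:
    """Collapse a (T, d) array of visual tokens into one (1, d) mean-pooled token."""
    return tokens.mean(axis=0, keepdims=True)

def sample_then_pool(tokens: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Pick k tokens uniformly at random (without replacement), then mean-pool them."""
    idx = rng.choice(tokens.shape[0], size=k, replace=False)
    return mean_pool_tokens(tokens[idx])

rng = np.random.default_rng(0)
# Hypothetical stand-in for patch features: T=1369 tokens, feature width 64 (assumed).
tokens = rng.normal(size=(1369, 64))

pooled_full = mean_pool_tokens(tokens)             # all tokens -> one token
pooled_sampled = sample_then_pool(tokens, 14, rng) # ~1% of tokens -> one token
```

Either result would then take the place of the full token sequence before projection into the language model; that wiring depends on the specific VLM and is not shown here.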
Virtual
SHIFT: Steering Hidden Intermediates in Flow Transformers Nina Konovalova, Ibragim Idrisov, Aibek Alanov
Abstract
Diffusion models have become leading approaches for high-fidelity image generation. Recent DiT-based diffusion models, in particular, achieve strong prompt adherence while producing high-quality samples. We propose SHIFT, a simple but effective and lightweight framework for concept removal in DiT diffusion models via targeted manipulation of intermediate activations at inference time, inspired by activation steering in large language models. SHIFT learns steering vectors that are dynamically applied to selected layers and timesteps to suppress unwanted visual concepts while preserving the prompt's remaining content and overall image quality. Beyond suppression, the same mechanism can shift generations by adding or changing target objects. We demonstrate that SHIFT provides effective and flexible control over DiT generation across diverse prompts and targets without time-consuming retraining.
Virtual
A Critical Analysis of Color Neurons in Vision Models Fernando Aguilar-Canto, Hiram Calvo, Ricardo Menchaca-Mendez
Abstract
Unit-level interpretability in deep neural networks faces growing skepticism, and the question of whether any single artificial neuron can be rigorously understood remains open. We propose three explicit criteria for validating unit-level interpretations---completeness, falsifiability, and relevance---and apply them to color-selective neurons, a class of units long reported in the literature and studied with increasing sophistication, but not previously validated under explicit, simultaneous multi-criteria standards. Using a protocol that combines causal ablation, controlled synthetic stimuli, natural image categorization, and input-level interventions, we identify and validate five units across three architectures (ResNet50, ViT-B/16, ViT-L/32). These units span excitatory, bimodal, and inhibitory selectivity to distinct spectral regions, and each satisfies all three criteria. Our results demonstrate that unit-level understanding is attainable under rigorous standards, and we argue that interpretability can benefit from reframing interpretation as falsifiable hypothesis testing.

Training Dynamics & Learning (3)

Virtual
A loss curvature account of fine-tuning fragility Ivaylo Dimitrov, Leo Karoubi, Sunny Howard, Dmitrii Krasheninnikov
Abstract
Fine-tuning on narrow distributions often produces fragile changes that are easily reversed by further training, with implications for the durability of safety fine-tuning. Mixing pre-training data into fine-tuning is a known mitigation, but why varying the proportion of fine-tuning data (which we term concentration) modulates forgetting is poorly understood. During a reversion phase (subsequent training on pre-training data after fine-tuning), we decompose the per-step change in fine-tune loss into its first- and second-order Taylor terms. We then track how each varies with concentration. In experiments on LLMs (Pythia-70M), we find that the second-order (curvature) term grows in importance with concentration. This curvature is much larger along the reversion update direction than along a random direction at every concentration, with the directional curvature itself increasing with concentration. Curvature therefore contributes to forgetting even when fine-tune and pre-train gradients are not in conflict, and its relative share grows with concentration, providing empirical support for recent theoretical accounts of curvature-driven forgetting.
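The decomposition described in this abstract can be written out as a standard second-order Taylor expansion. The notation below is an assumption chosen for illustration, not taken from the paper: $\theta$ are the parameters, $\Delta\theta$ is one reversion update, $L_{\mathrm{ft}}$ is the fine-tune loss, and $g$ and $H$ are its gradient and Hessian at $\theta$.

```latex
\begin{align}
\Delta L_{\mathrm{ft}}
  &= L_{\mathrm{ft}}(\theta + \Delta\theta) - L_{\mathrm{ft}}(\theta) \\
  &\approx \underbrace{g^{\top} \Delta\theta}_{\text{first-order term}}
   \;+\; \underbrace{\tfrac{1}{2}\, \Delta\theta^{\top} H \, \Delta\theta}_{\text{second-order (curvature) term}},
\qquad g = \nabla L_{\mathrm{ft}}(\theta), \quad H = \nabla^{2} L_{\mathrm{ft}}(\theta).
\end{align}

% Directional curvature along a unit vector u; comparing the update direction
% u = \Delta\theta / \lVert \Delta\theta \rVert with a random unit vector
% corresponds to the comparison described in the abstract.
\begin{equation}
\kappa(u) = u^{\top} H \, u, \qquad \lVert u \rVert = 1 .
\end{equation}
```

Under the further assumption of a plain gradient step $\Delta\theta = -\eta\, g_{\mathrm{pt}}$ on the pre-training loss, the first-order term equals $-\eta\, g^{\top} g_{\mathrm{pt}}$, which is non-positive whenever the two gradients are not in conflict. In that case any increase in fine-tune loss must come from the curvature term, which matches the abstract's statement that curvature contributes to forgetting even without gradient conflict.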
Virtual
Tracking Training Phases in Compositional Learning with Task-Agnostic Measures Niclas Dern, Selma Mazioud, Jakob Heiss, Avrajit Ghosh, Curtis James McDonald, Gabriel Clara, Bin Yu
Abstract
Deep neural networks often acquire their final capabilities through qualitatively distinct training phases. Characterizing these phases sheds light on how models learn and could enable steering away from unwanted outcomes. The most reliable existing methods for detecting training phases rely on a prior mechanistic understanding of how the model performs an underlying task. Task-agnostic scalar measures, i.e., quantities computed from a model's parameters, representations, or outputs, offer a more general alternative, but have largely been studied in isolation. In this paper, we systematically compare 53 such measures by fitting Gaussian hidden Markov models (HMMs) to their trajectories across two compositional settings: modular addition and a new multilingual variant we introduce, in which per-language data fractions control how much consecutive phase transitions overlap. This overlapping regime arises naturally when models acquire capabilities in close succession, yet lacks controlled benchmarks. We find that once transitions overlap, recovery quality drops across all 53 measures, even when we fit the HMM directly to validation accuracy, hinting at limitations of our HMM-based framework. Still, some measures perform relatively better, with prediction entropy, PCA effective dimension, and the Local Learning Coefficient most consistently among the top performers.
Virtual
Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation Uwe König, Hamza Kazmi, Ruizhe Li, Maheep Chaudhary
Abstract
Distillation of a language model intended to transfer benign behavior to a student model may also transfer undesirable characteristics if they are present in the teacher model, a phenomenon known as subliminal learning. While qualitative evidence supports the existence of this effect, its magnitude has not been systematically characterized. This study quantifies subliminal behavioral transfer ratios by steering two teacher models (Llama-2-7B-Chat and Qwen2.5-7B-Instruct) at varying steering strengths and distilling student models using only benign data. Evaluation on 100 JailbreakBench prompts, with GPT-4.1 serving as the evaluator, indicates that transfer is robust but exhibits distinct scaling behaviors. Llama-2 demonstrates a sharp threshold ($\tau = \{0.25, 0.32\}$ beyond $\alpha = -0.15$), whereas Qwen2.5 displays continuous and higher levels of transfer ($\tau$ up to $0.61$).

Applications, Benchmarks & Tools (23)

Virtual
On Pitfalls of RemOve-And-Retrain: A Data Processing Inequality Perspective Junhwa Song, Keumgang Cha, Junghoon Seo
Abstract
The RemOve-And-Retrain (ROAR) benchmark is widely used to evaluate feature attribution methods, yet its validity remains underexplored from an information-theoretic perspective. We show that model- and data-agnostic post-processing of attribution maps (transformations that, by the data processing inequality, cannot add information about the decision function) can often improve ROAR scores. This means that an improved ROAR ranking is not, by itself, evidence that an attribution map carries more information about the model. We trace this failure mode to a bias toward spatially blurry masks. Experiments on CIFAR-10, SVHN, and CUB-200 show a consistent association between blurriness and ROAR performance, a pattern that also appears in the ROAD variant. We provide guidelines for more cautious removal-based benchmarking, with implications for validating mechanistic understanding of neural network internals.
Virtual
Shift-Invariant Attribute Scoring for Kolmogorov-Arnold Networks via Shapley Value Wangxuan Fan, Ching Wang, Siqi Li, Nan Liu
Abstract
For many real-world applications, understanding feature-outcome relationships is as crucial as achieving high predictive accuracy. While traditional neural networks excel at prediction, their black-box nature obscures underlying functional relationships. Kolmogorov-Arnold Networks (KANs) address this by employing learnable spline-based activation functions on edges, enabling recovery of symbolic representations while maintaining competitive performance. However, KAN's architecture presents unique challenges for network pruning. Conventional magnitude-based methods become unreliable due to sensitivity to input coordinate shifts. We propose ShapKAN, a pruning framework using Shapley value attribution to assess node importance in a shift-invariant manner. Unlike magnitude-based approaches, ShapKAN quantifies each node's actual contribution, ensuring consistent importance rankings regardless of input parameterization. Extensive experiments on synthetic and real-world datasets demonstrate that ShapKAN preserves true node importance while enabling effective network compression. Our approach improves KAN's interpretability advantages, facilitating deployment in resource-constrained environments.
Virtual
Probing Clinical Concepts in an EHR Foundation Model via Sparse Autoencoders Shashank Yadav, David M. Routman, Andrew Y. K. Foong
Abstract
Foundation models (FMs) trained on large electronic health record (EHR) datasets can predict patient outcomes, but it is difficult to know what medical knowledge they have acquired. Unlike chatbot LLMs, EHR-FMs are being considered for high-stakes clinical deployment, making it especially important to audit what they have learned beyond predictive accuracy. We apply sparse autoencoders (SAEs) to a transformer-based FM trained on the MIMIC-IV dataset, extending SAE-based mechanistic interpretability to FMs trained on clinical event streams. We use LLM-based interpretation to characterize learned features, revealing that EHR models learn a clinical ontology distinct from the International Classification of Diseases (ICD) system. We show that learned features are organized by prevalence and that the model encodes candidate matches to known clinical syndromes as single monosemantic features. Syndromic features are composed from lower-level features through cross-layer information-flow circuits that we probe via activation patching. We validate the learned features along two axes: external validity, where feature activations align with held-out ICD phenotypes, and interventional consistency, where activation patching produces measurable downstream effects in source-target pairs. Together, these results demonstrate the utility of SAEs as an interpretive layer for EHR foundation models.
Virtual
NeuroFaith: Evaluating Mechanistic Faithfulness of LLM Free Text Self-Explanation at the Concept Level Milan Bhan, Jean-Noël Vittaut, Nicolas CHESNEAU, Sarath Chandar, Marie-Jeanne Lesot
Abstract
Large Language Models (LLMs) can generate plausible free text self-explanations to justify their answers. However, these natural language explanations may not accurately reflect the model's actual reasoning process, indicating a lack of faithfulness. Existing faithfulness evaluation methods rely primarily on behavioral tests or computational block analysis without examining the semantic content of internal neural representations. This paper proposes NeuroFaith, a flexible framework that measures the faithfulness of LLM free text self-explanation by identifying key concepts within explanations and mechanistically testing whether these concepts actually influence the model's predictions. We show the versatility of NeuroFaith across 2-hop reasoning and classification tasks. Additionally, we develop a linear faithfulness probe based on NeuroFaith to detect unfaithful self-explanations from representation space and improve faithfulness through steering. NeuroFaith provides a principled approach to evaluating and enhancing the faithfulness of LLM free text self-explanations, addressing critical needs for trustworthy AI systems.
Virtual
When Circuits Are Too Broad: Unit Tests for Mechanistic Interpretability Muhammet Anil Yagiz
Abstract
Mechanistic interpretability claims often show that intervening on a circuit, feature, or representation changes a target behavior. Such target effects are necessary but insufficient: broad, lexical, confounded, or template-fragile interventions can produce the same evidence. We propose mechanistic unit tests, a negative-control protocol for evaluating the specificity of circuit and feature claims. The protocol asks whether a proposed mechanism survives nuisance rewrites, fails on matched negatives, avoids off-target damage, and dominates cheap same-budget baselines. We summarize these trade-offs with specificity frontiers, plotting target effect against off-target damage across intervention strengths. A controlled case study and a small distilgpt2 pilot show how target-only evidence can hide lexical and negation failures. The contribution is a falsification layer for mechanistic claims, not a new discovery method.
Virtual
SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data David Chanin, Adrià Garriga-Alonso
Abstract
Improving Sparse Autoencoders (SAEs) requires benchmarks that can precisely validate architectural innovations. Current LLM-based SAE benchmarks are too noisy to differentiate architectural improvements, while commonly used synthetic-data experiments are too small-scale, unstandardized, and unrealistic to be meaningful. We introduce SynthSAEBench, a benchmark and toolkit for evaluating SAEs against large-scale synthetic data with realistic feature characteristics including correlation, hierarchy, and superposition, while providing ground-truth features and firings. SynthSAEBench acts as a controlled lower-bound test: SAE architectures that fail when the Linear Representation Hypothesis holds by construction have little hope on real LLMs. The benchmark reproduces known LLM SAE phenomena including the disconnect between reconstruction and latent quality, poor SAE probing, and a precision-recall trade-off mediated by L0, demonstrating that SynthSAEBench findings reproduce results on LLM SAEs. We further identify a novel failure mode: Matching Pursuit SAEs exploit superposition noise to improve reconstruction without learning ground-truth features, suggesting more expressive encoding procedures can easily overfit. SynthSAEBench complements LLM benchmarks with ground-truth features and controlled ablations for diagnosing SAE failure modes, while providing a clear target for SAE architecture work.
Virtual
Compiling to transformers Joey Velez-Ginorio
Abstract
The representations of transformers are easy to measure but hard to understand; we cannot reliably predict or control behavior from them. To address this, recent work proposed compiling high-level programs to transformers as a tool for studying these representations. By compiling programs to transformers, the algorithms driving their representations are known, and techniques for understanding them can be verified against a ground truth. However, prior compilers make a closed program assumption. They cannot compile programs with free variables. As a consequence, the program must specify the entire algorithm a transformer implements, restricting applicability to hand-constructed transformers rather than ones trained by gradient descent. We resolve that limitation with Cajal, a language whose programs compile to transformers which are subsequently trained by gradient descent. We prove its compiler correct, and conduct experiments where we compile a conditional reverse program into the attention head of a transformer, showing that transformers can learn to use compiled programs which specify part of their behavior. Importantly, this enables verification of interpretability techniques against partial ground truths in transformers trained by gradient descent.
Virtual
Histoscope: Expert-Grounded Inspection of Sparse Autoencoder Features in Histopathology Foundation Models Mirza Nasir Hossain, Sarah L Bell, Gareth Bryson, David Harris-Birtill
Abstract
Computational pathology lacks open tools for inspecting sparse autoencoder (SAE) features, and few studies test whether such features align with expert judgement. We present Histoscope, an open-source tool for inspecting SAE features derived from histopathology foundation-model embeddings, and evaluate it through a blinded two-rater pathologist study. Using UNI embeddings from the SPIDER colorectal dataset, we train a TopK SAE and use Histoscope to rank features by diagnostic-class selectivity, measured with per-class AUPRC. Pathologists judged 82% of feature panels monosemantic overall; among the 50 features Histoscope labelled monosemantic, all 50 were confirmed, while 32 of 38 features it labelled polysemantic were also judged monosemantic, mostly because they captured morphology spanning multiple diagnostic classes. Inspection tools of this kind can help make pathology foundation models more auditable by domain experts by connecting internal features to morphological evidence rather than treating embeddings as opaque vectors. We will release Histoscope, the trained SAE, and the evaluation protocol to support expert-grounded SAE inspection in computational pathology.
Virtual
Retrieval is Enough: Training-Free Interpretability with a Tool-Using Agent Sriram Balasubramanian, Soheil Feizi
Abstract
Interpretability methods for neural network activations span a wide cost spectrum, from cheap, training-free techniques (such as linear probes, PCA, SVD) to more expensive training-based ones (such as SAEs and activation oracles). Training-based methods are typically more powerful, in part because they leverage large activation datasets during training. This raises a natural question --- do they actually surface insights that go beyond what is recoverable from the training dataset itself? To address this, we equip an LLM agent with a vector database of activations paired with their textual contexts, along with tools for manipulating activations --- projecting out directions in latent space, computing activation differences and averages. The agent iteratively queries the database, forms hypotheses from the retrieved samples, and validates them by constructing linear probes. We call this method **HARP**, for **H**ypothesis-driven **A**gentic **R**etrieval and **P**robing. Despite not involving any training, HARP outperforms both activation oracles and SAE-based agents on concept discovery, concept detection, model steering, and secret elicitation. The training-free design also makes HARP substantially cheaper and more flexible: new datasets can be indexed on demand whenever existing ones prove insufficient. More broadly, our results suggest that current training-based methods do not yet extract insights beyond their training data, and motivate benchmarks that explicitly require interpretability methods to demonstrate such insights.
Virtual
ViSAEBench: Cross-Backbone Evaluation of Vision Sparse Autoencoders Reveals Backbone-Dominated Variance and Metric Dissociations Vijayrajsinh Gohil, Diwei Sheng, Chen Feng
Abstract
Sparse autoencoders (SAEs) are increasingly used to interpret Vision Transformer features, but unlike the language setting, there is no standardized protocol for comparing vision SAEs and no systematic characterization of how SAE quality depends on the pretrained backbone. We introduce ViSAEBench, a unified evaluation suite covering seven metrics across four interpretability dimensions, including a novel spatial coherence metric specific to vision. Using ViSAEBench, we conduct the first controlled cross-backbone study of vision SAEs: 60 SAEs trained on identical ImageNet-1K activations from five ViT-B backbones spanning four pretraining paradigms. Our central finding is that the choice of pretrained backbone dominates vision SAE behavior more than SAE hyperparameters. A variance decomposition shows that backbone explains over 90% of variance on three metrics and over 60% on five of seven, while SAE hyperparameters dominate only reconstruction error. The starkest instance is categorical: across all configurations tested, SAEs trained on Masked Autoencoder features show no spatial structure beyond chance, while the other four backbones produce strongly spatially structured features. Single-backbone vision SAE evaluations are therefore often measuring properties of the backbone more than properties of the SAE. We further identify two metric-level dissociations with practical consequences. First, reconstruction error and downstream task preservation substantially diverge across backbones (Spearman $\rho=-0.70$), so reconstruction error alone cannot be used to compare vision SAEs. Second, monosemanticity, a central SAE quality criterion in language work, does not predict fine-grained classification, indicating that within-feature consistency does not capture the between-class separability downstream tasks require. We release all 60 SAE checkpoints and the ViSAEBench evaluation library.
Virtual
Circuit Oracle: Automating Attribution Graph Analysis via Natural-Language Queries Hong Kiat Tan, Shariar Kabir, Swastik Agrawal, Sai V R Chereddy, Sriram Balasubramanian
Abstract
Attribution graphs, an emerging tool in mechanistic interpretability, use transcoders to decompose language model computations into sparse interpretable features connected by causal edges. However, turning a graph into a safety-relevant insight requires hours of manual analysis by experts. We introduce Circuit Oracle, a multi-agent system that automates this analysis by autonomously answering natural-language questions about a target model (e.g., "Is this prediction driven by spurious features?") through multi-hop traversal of the attribution graph. We evaluate Circuit Oracle on three safety-relevant proxy tasks: detecting spurious features in probe circuits, eliciting hidden knowledge from taboo-finetuned models, and jailbreaking via causal interventions. On all three tasks, the oracle is comparable to or exceeds task-specific baselines that do not use the attribution graph. Circuit Oracle requires no fine-tuning, as each task is specified by a modular skill, a natural-language prompt paired with task-specific tools such as transcoder-feature steering, making the framework extensible by construction. Our results suggest that off-the-shelf agents reading attribution graphs through tool calls offer a practical route to automated mechanistic interpretability.
Virtual
Evaluating Sparse Autoencoders in a Structured Continuous Model: Concept Recovery in Implied-Volatility Forecasting Mert Guney
Abstract
Standard SAE concept-recovery protocols rely on correlation alignment, top-K stability, and patching point estimates without explicit nulls. We propose matched-random baselines (leave-one-out random latents for single-latent patching, and matched-random latent pairs for interaction patching, both with bootstrap confidence intervals) and apply them in a structured continuous testbed where ground-truth descriptors are available: a compact MLP trained to forecast SPY implied volatility from tick-level NBBO data over 45 trading sessions. The standard pipeline reports five cross-seed-stable concept families (moneyness geometry, lagged skew, spot level, rates, lagged curvature), and ridge probes from the SAE codes recover four non-geometric descriptors (lagged skew, lagged curvature, spot, lagged ATM level) at positive held-out R² (0.27–0.49). Under matched-random patching, however, only one family survives: moneyness geometry produces single-latent effects at 4.49× its leave-one-out baseline (p<0.001), and the moneyness × lagged skew interaction clears the matched-random null (p=0.030, N=300 random pairs). Every other non-geometric family's mean |Δ| sits at or below its baseline. The probe-patching gap is the paper's central methodological observation: concepts that the SAE encodes (probes) and concepts that the model uses (patching) are distinct populations. We argue matched-random baselines should be standard for SAE-interp validation, and we present implied-volatility forecasting as a controlled testbed where this gap is visible because the descriptors are pre-specified.
Virtual
Structural Locality Differentiates Residue-Tokenised Bidirectional and Causal Protein Language Model Families Wei Gao, Alex J H Fedorec
Abstract
Protein language models (PLMs) trained under different bidirectional and causal regimes achieve strong downstream performance, yet how these architectural and objective differences jointly shape learned representations remains poorly understood. We apply TopK sparse autoencoders (SAEs) to three residue-tokenised PLMs (ESM-2, RITA, ProtT5; the ProtT5 encoder and decoder are probed separately) at nine matched relative depths. We find that ESM-2 (bidirectional) learns features with stronger structural locality than RITA (residue-tokenised causal) at every matched depth, robust across metric settings, held-out proteins, and SAE initialisations. On a BPE-tokenised PLM (ProtGPT2), we identify a tokenisation artefact in residue-pair locality metrics—a large naïve sequential-locality effect that reverses under inter-token control—motivating our restriction to residue-tokenised models in the main analyses. Within ProtT5, the nine-depth grid localises a structural-locality crossover between encoder and decoder to ≈42% relative depth, consistent with cross-attention propagating encoder-derived context that fades under the autoregressive objective. Within-model depth trajectories for ESM-2, RITA, ProtT5-enc, and ProtT5-dec are qualitatively distinct. These findings support a single robust bidirectional-vs-causal dissociation (on structural, not sequential, locality) and identify a methodological interaction between standard residue-projection and residue-pair locality metrics on BPE-tokenised PLMs.
Virtual
Critical Percolation as a Synthetic Data Model for Interpretability Aryeh Brill, Tom Ingebretsen Carlson
Abstract
Neural networks learn features that reflect the hierarchical, multi-scale structure of natural data. Synthetic datasets used to evaluate interpretability methods typically lack this structure, limiting their value as realistic toy models. To close this gap, we introduce a family of synthetic datasets consisting of hierarchical functions defined on critical mean-field percolation clusters embedded in a high-dimensional data space. The percolation data consists of sparse, low-dimensional fractal clusters with a power-law size distribution. Latent variables modeling a taxonomic hierarchy generate each data point's target value. The data model is analytically tractable with known critical exponents that fix its properties without requiring hyperparameter tuning. We leverage a mapping between percolation clusters, random trees, and additive coalescence to propose an almost linear-time algorithm to jointly sample a random tree and its hierarchical latent decomposition, enabling data generation at arbitrary scale. Using probing experiments, we find that the model's ground-truth latent variables can be linearly decoded from neural network activations. Together, sparsity, self-similarity, power-law statistics, and analytical tractability make critical percolation a principled testbed for interpretability research.
Virtual
Sparse Groundings: A Benchmark for Auditing Visual Representation Claims Felipe Parodi, Melanie Segado
Abstract
Modern multimodal systems rely on internal visual representations, but claims about what those representations contain are typically supported by probes or interventions read from the patches covering an object. These readouts can conflate object evidence with signals that are easier to miss: information broadcast across the image, background cues, patch coordinates, or the evaluator's own pooling rule. We introduce Sparse Groundings, an audit benchmark for visual representation claims built around three controlled readouts at the same patch count. The standard object readout is compared with a blank same-footprint readout that preserves selected patch coordinates while removing image content, and a matched-background readout that preserves image context and patch count while removing the object. Each result is recorded as a claim card summarizing the tested property, controls, method, and verdict, and aggregated into an evidence profile rather than a single groundedness score. We instantiate the benchmark on controlled tests for identity, colour, position, scale, spatial relations, counting, and distractor robustness across six vision encoders and two sparse-autoencoder bases. The controls change conclusions in three places. In one representative case, a SigLIP 2 probe predicts which cell of a 7 by 7 grid contains the object with 0.991 balanced accuracy from object features, while a uniform blank image read from the same patch locations reaches 0.998 — a pattern that holds at ≥ 0.978 across all eight representations. Identity and attribute content is recoverable from non-object patches as well as from object patches on every encoder, and structure-family claims partition encoders on a different axis from position-family claims. 
A natural-image bridge readout on LVIS object boxes recovers the same encoder partition — five of six encoders show object-pool features above the strongest control, with MAE the same matched-background-dominated outlier — so the regimes are not artefacts of the grey-canvas protocol. Task-matched evidence profiles align with ADE20K segmentation and NYUv2 depth, do not predict iNat fine-grained classification, are fragile on FSC-147 counting, and offer evidence for RefCOCO+ grounding under the current pooled-feature protocol. Sparse Groundings therefore audits not whether a model is grounded in general, but which visual claim survives which controls, by which method, and for which downstream use. As multimodal systems become more central to perception-heavy applications, understanding what they perceive about the world will require interpretability tools that are precise about the claim being tested and robust to the controls that could explain it away.
Virtual
Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization Mariia Onyshchuk, Maksym-Vasyl Tarnavskyi, Marta Sumyk
Abstract
Optimal transport (OT) has been shown to detect hallucinations in neural machine translation (NMT) by measuring the geometric distance between cross-attention distributions and a reference distribution, without any supervision. We extend this analysis to all six decoder layers of the Fairseq DE-EN model (N=3,414), showing that Wass-to-Unif and Wass-to-Data are complementary detectors specialised across hallucination types, that detection is concentrated in layers L1--L4 with L5 anti-predictive for subtler types, and that hallucinated translations lack the exploratory attention phase present in correct translations from the first decoding step. We further evaluate whether the geometric signal transfers to abstractive summarization faithfulness detection: our unsupervised OT detector on AggreFact (N=1,116) achieves 57.2%/57.6% balanced accuracy on CNN/XSum --- above chance but substantially below supervised MiniCheck-Flan-T5-L (69.9%/74.3%). This gap is principled: unlike NMT hallucinations, unfaithful summaries can attend correctly to source tokens while misrepresenting their content, a failure mode invisible to concentration-based OT metrics by construction. Structural experiments on T5-base confirm consistent decoder organisation across depth, with Layer 3 showing peak concentration and Layer 12 being most critical for generation quality. Together, the results establish OT on cross-attention as a reliable detector when the failure mode is source disengagement, a principled interpretability tool regardless of task, and fundamentally limited when faithfulness failures occur downstream of attention.
Virtual
The Model Organism Lottery: Model Organism Interpretability Strongly Depends on Training Methodology Andrzej Szablewski, Raffaello Fornasiere, Gabriel Konar-Steenberg, Nikita Menon, Stefan Heimersheim
Abstract
Model organisms (MOs), language models trained to exhibit undesired or unnatural behaviours, are frequently used as testbeds for evaluating white-box interpretability techniques. Current MOs are typically constructed via post-hoc supervised fine-tuning (SFT) on behavioural transcripts or synthetic documents. Prior research has shown that interpretability methods can easily identify hidden behaviours in these MOs. However, recent work suggests that such post-hoc training methods may make interpretability unrealistically easy. We investigate this claim by constructing a suite of 54 OLMo2-1B- and Gemma-3-1B-based MOs trained with seven different techniques, including standard post-hoc SFT methods, post-hoc DPO, and more realistic integration of MO data into the OLMo post-training DPO phase. We use these MO variants to benchmark activation oracles, activation steering, logit lens, and sparse autoencoders. Our findings suggest that (i) current interpretability methods are not as capable as suggested by prior work; (ii) MO interpretability depends strongly on training methodology, target behaviour, interpretability technique, and model architecture; and (iii) substantial variance remains even after controlling for differences in the strength of target behaviour expression. Our results cast substantial doubt on the validity of current MOs as interpretability proxies. Our code is available here: https://github.com/anonsubmissionneurips2026/model-organism-lottery.
Virtual
Mechanistic Interpretability of Chemical Language Models: Molecular Geometry and Representations in MoLMFormer and SMI-TED Wanqing Hao, Zetong Xu, Zhiyuan Han, Xiaoyan Bai, Chenhao Tan
Abstract
Chemical language models trained on SMILES strings, text-based representations of molecules, achieve strong performance on chemical prediction tasks, yet it remains unclear how they internally represent molecular structure and chemical properties. In this work, we analyze two widely used models, MoLMFormer and SMI-TED, from a mechanistic interpretability perspective. Prior analyses of chemical foundation models have largely focused on correlational methods, including attention correlations, embedding structure, and input-output associations. We take a step toward a more mechanistic understanding by combining layerwise attention analysis, linear probing, and causal intervention. We find that attention exhibits only weak and selective alignment with underlying molecular geometry. Probing further reveals that SMI-TED encodes contextual chemical properties as cleaner linear directions than MoLMFormer. Moving beyond correlational analysis, our causal interventions show that these directions are functionally used during computation, and earlier in SMI-TED than in MoLMFormer. Despite strong downstream performance, our results suggest that successful SMILES pretraining does not necessarily produce spatially grounded or uniformly usable chemical representations, highlighting how diverse internal representations can support similar predictive success.
Virtual
Detecting Whether an LLM has been Backdoored Anthony Hughes, Nicole Xing, Andy Kim, Collin Francel, Andrew Draganov
Abstract
As language models are deployed in high-stakes domains, adversaries may poison training data to implant *backdoors*: hidden triggers that covertly manipulate model behavior at inference time. In this work, we formalize the affordances which a defender has and, to evaluate whether defenders can identify backdoors under these affordances, construct a benchmark for backdoor-detection algorithms. This benchmark spans attack mechanisms and objectives, including an adversarial backdoor explicitly designed to evade detection. We use this benchmark to evaluate a suite of backdoor-elicitation hypotheses. We find that while some techniques can flag poisoned models, none reliably surface backdoors. Indeed, hunting for backdoors in poisoned models is likely to surface jailbreaks instead. Finally, we show that backdoor-related activation vectors are consistently different from the vectors which account for undesirable behaviors without triggers. We release our benchmark to motivate the interpretability community to develop stronger algorithms for eliciting backdoors.
Virtual
A Path Already Walked: On Inheriting Network-Neuroscience Tools for Mechanistic Interpretability Phongsakon Mark Konrad, Toygar Tanyel, Serkan Ayvaz
Abstract
Mechanistic interpretability is moving from neurons and heads toward circuits, dictionary features, and attribution graphs. That transition is productive, but it also raises a familiar issue. Many important phenomena are relational rather than component-local. Network neuroscience has spent two decades building graph vocabulary, null models, and failure modes for related problems. We argue for a disciplined import rather than a loose brain analogy. We specify the transformer graph contract required before the import is meaningful, give a compact mapping from network-neuroscience primitives to transformer analyses, work through a local effective-connectivity proxy for gated MLPs, and state eight testable translations with failure criteria. This is a position and methodology paper. It contributes a vocabulary, graph contracts, and falsifiable tests meant to structure future empirical work rather than report results of its own. We do not report transformer experiments, and we do not claim that neuroscience results transfer automatically.
Virtual
Spectral Guardrails: Detecting Prompt Injection via Attention Graph Fracture in Large Language Models Charly Ken Capo-Chichi, Ghanem AMARI, Valentin NOËL
Abstract
Prompt injection attacks redirect an LLM away from its system instructions by embedding adversarial directives in the user turn. Text-based detectors exploit lexical artifacts present in current benchmarks, artifacts that vanish under paraphrasing, leaving them blind to semantically novel attacks. We ask a different question: what happens inside the model during an injection? We prove that a successful injection must collapse the algebraic connectivity of the attention graph because a functional injection and a healthy attention graph are mathematically incompatible (Eq. 1). Evaluated across seven models (1.1B–14B) including GQA architectures, our Layerwise Multi-Metric probe (LMM-LR) achieves ROC-AUC 0.90–0.99, surpassing TF-IDF on the least lexically structured benchmark (0.965 vs. 0.951) and DeBERTa-v3 on two of three benchmarks, without ever reading the input text.
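For readers unfamiliar with the quantity named in the abstract above, the sketch below computes the algebraic connectivity (the second-smallest Laplacian eigenvalue, also called the Fiedler value) of a graph built from one attention matrix. The function name `algebraic_connectivity`, the symmetrisation step, and the removal of self-loops are assumptions made for illustration; this does not reproduce the authors' LMM-LR probe or their graph construction.

```python
import numpy as np

def algebraic_connectivity(attn):
    """Second-smallest eigenvalue of the Laplacian of an attention graph.

    Assumption: the (possibly causal, hence asymmetric) attention matrix is
    turned into an undirected weighted graph by averaging it with its
    transpose and dropping self-attention.
    """
    attn = np.asarray(attn, dtype=float)
    weights = 0.5 * (attn + attn.T)       # symmetrise into undirected weights
    np.fill_diagonal(weights, 0.0)        # ignore self-loops
    laplacian = np.diag(weights.sum(axis=1)) - weights
    eigenvalues = np.linalg.eigvalsh(laplacian)  # returned in ascending order
    return float(eigenvalues[1])

# Tokens attending evenly to each other form a well-connected graph.
connected = algebraic_connectivity(np.full((4, 4), 0.25))

# Two token groups that never attend to each other form a fractured graph.
fractured = algebraic_connectivity(np.kron(np.eye(2), np.full((2, 2), 0.5)))
```

A value near zero indicates that the graph splits into weakly linked components, which is the sense of "collapse" and "fracture" used in the title and abstract.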
Virtual
Moir: Let the Model Direct Its Own Story for Robust Cross-Domain Knowledge Editing Jea Kwon, Jiwon Kim, Dong-Kyum Kim, Meeyoung Cha
Abstract
While language models remain frozen at their training state, the world evolves continuously. Knowledge editing has emerged as a key alternative to full retraining, but its deployment is bottlenecked by the erosion of core capabilities: mathematical and programmatic reasoning collapse while encyclopedic recall remains intact. We trace this asymmetric degradation to a distributional mismatch. Covariance-based editors preserve only the subspaces spanned by their reference corpus, but fail to capture the operative distribution shaped by post-training such as SFT and DPO. Static external corpora, including Wikipedia and even the original pretraining mixture, cannot recover this shifted manifold. We propose Moir, which estimates the preservation covariance C directly from the model itself by sampling from its own decoding distribution. Seeding generation with a single random vocabulary token bypasses the instruction-following templates that otherwise dominate sampled outputs, exposing the broader subspaces the model has internalized. Moir requires no external data and serves as a drop-in component for any covariance-based editor, a practical advantage given that the pre- and post-training corpora of most modern LLMs are not publicly accessible. Across OLMo-2, Llama-3.1, and Qwen-3 (7-8B), under both MEMIT and AlphaEdit and in batch and sequential regimes, Moir consistently extends preservation in the most vulnerable domains. Most strikingly, on Qwen3-8B after 20,000 AlphaEdit batch edits, it retains 79.9% GSM8K accuracy compared to 10.9% with the Wikipedia baseline. These results suggest that aligning the preservation distribution with the model's operative distribution is a key factor in non-destructive editing, and that the model itself may be the most accessible source of that distribution for deployed systems.
Virtual
Building Better Activation Oracles Jan Philipp Bauer, Celeste De Schamphelaere, Adam Karvonen, Niclas Luick, Neel Nanda
Abstract
_Activation Oracles_ (AOs) are promising methods for interpreting residual stream activations. However, current AOs face important issues, such as hallucinations and vagueness. Additionally, text-inversion confounds make them hard to evaluate. To this end, we propose two principles for training data construction: _solvability_, realized by training on on-policy data, and _targetedness_, realized by avoiding gaming the target through text inversion. We find that these interventions yield modest but significant improvements on hallucination and vagueness, and make AOs more usable overall. In addition, we open source the first comprehensive evaluation suite for AO quality, which we call _AObench_. Additionally, we share preliminary negative results regarding _Multi-Layer-Activation Oracles_ (MLAO): they work and reduce training loss, but do not lead to substantial uplift in downstream evaluations, contrary to what one might expect. Overall, we hope that our work sets a foundation that helps improve AOs, joining a paradigm of scalable, end-to-end interpretability.