Accepted Posters
193 in-person posters were accepted, including 23 spotlights. Spotlights are showcased at the top; all other in-person posters are grouped by topic.
★ Spotlights (23)
Variant-specific crosscoder features are seed-stable but not detectably task-causal in a GRPO-LoRA math setting
Abstract
We test whether variant-specific features identified by joint-norm pairwise crosscoders, trained on activation pairs from a base LLM and its RL fine-tune, correspond to task-causal mechanisms underlying the observed fine-tuned behavior. In a Qwen3-4B vs. GRPO-LoRA math setting, high-$\nu$ features pass a non-lexical cross-seed reproducibility check but fail task-causal specificity under $n=100$ paired ablation against a magnitude-matched random control. Two complementary base-vs-base controls diagnose what the gate is responding to: under paired-identity ($A_t = B_t$ exactly) it produces zero high-$\nu$ features in all 12 trained crosscoders, while under disjoint halves it produces 50–200$\times$ as many features as base/GRPO with similar non-lexical seed-stability. The gate therefore responds to systematic between-side distributional difference, and a large high-$\nu$ population can arise from unpaired-pair reconstruction asymmetry; in the paired base/GRPO setting, the remaining high-$\nu$ population is much smaller, is consistent with model-pair distributional drift, and is not detectably task-causal under our ablations. A high-$\nu$ gate, even combined with non-lexical seed stability, is insufficient evidence for task-causal mechanisms underlying the observed fine-tuned behavior.
From Noise to Diversity: Random Embedding Injection in LLM Reasoning
Abstract
Recent soft prompt research has tried to improve reasoning by inserting trained vectors into LLM inputs, yet whether the gain comes from the learned content or from the act of injection itself has not been carefully separated. We study Random Soft Prompts (RSPs), which drop the training step entirely and append a freshly drawn sequence of random embedding vectors to the input. Each RSP vector is sampled from an isotropic Gaussian fitted to the entrywise mean and variance of the pretrained embedding table; the sequence carries no learned content, and yet reaches accuracy comparable to optimized soft prompts on math reasoning benchmarks in several settings. The mechanism unfolds in two stages: because attention has to absorb a never-seen-before random position, the distribution over the first few generated tokens flattens and reasoning trajectories branch, and as generation continues this influence dilutes naturally so the response commits to a single completion. We show that during inference RSPs lift early-stage token diversity and, combined with temperature sampling, widen Pass@N, the probability that at least one out of N attempts is correct. Beyond inference, we carry the same effect into DAPO training and demonstrate practical gains. Our contributions are: (i) RSP isolates the simplest form of soft prompt --- training-free, freshly resampled --- providing a unified lens for the structural effect of injection that variants otherwise differing in training and form all share; (ii) a theoretical and empirical validation of the underlying mechanism; and (iii) an extension from inference to training.
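The sampling step described in this abstract can be sketched as follows. This is a minimal illustration on a toy embedding table, not the authors' code: the helper names (`fit_entrywise_gaussian`, `sample_rsp`, `append_rsp`, `pass_at_n`), the table size, and the prompt length are all hypothetical, and the closed-form Pass@N assumes the N attempts are independent with equal success probability.

```python
import math
import random

def fit_entrywise_gaussian(table):
    """Mean and standard deviation over every scalar entry of the table."""
    entries = [v for row in table for v in row]
    mean = sum(entries) / len(entries)
    var = sum((v - mean) ** 2 for v in entries) / len(entries)
    return mean, math.sqrt(var)

def sample_rsp(table, length, rng):
    """Draw `length` fresh vectors; each coordinate is an i.i.d. Gaussian draw."""
    mean, std = fit_entrywise_gaussian(table)
    dim = len(table[0])
    return [[rng.gauss(mean, std) for _ in range(dim)] for _ in range(length)]

def append_rsp(input_embeddings, table, length, rng):
    """Append a freshly sampled random soft prompt after the input embeddings."""
    return input_embeddings + sample_rsp(table, length, rng)

def pass_at_n(p_single, n):
    """P(at least one of n attempts is correct), assuming independent attempts."""
    return 1.0 - (1.0 - p_single) ** n

rng = random.Random(0)
# Toy stand-in for a pretrained embedding table (vocabulary 50, dimension 8).
toy_table = [[rng.gauss(0.1, 0.5) for _ in range(8)] for _ in range(50)]
prompt = [toy_table[3], toy_table[7]]  # embeddings of two "tokens"
extended = append_rsp(prompt, toy_table, length=4, rng=rng)
```

Because the vectors are redrawn on every call, two generations from the same prompt see different injected positions, which is the source of the early-token diversity the abstract describes.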
MultiSTEVE-1s: A Model Zoo and Interpretability Suite for Instruction-Following Vision Agents
Abstract
A striking case of goal misgeneralisation was previously observed in OpenAI's Minecraft agent VPT: it killed villagers standing under leaves, mistaking them for tree trunks. Although this agent was released publicly, enabling white-box interpretability research, few open-weight model organisms of misalignment exist outside the LLM space. We release MultiSTEVE-1s: a model zoo of 140 fine-tuned versions of VPT, over 1,000 training checkpoints, and an interpretability suite for analysing them. We use the STEVE-1 training procedure to add instruction-following capabilities to VPT with fixed hyperparameters and controlled variations in training randomness. We demonstrate the utility of MultiSTEVE-1s by showcasing the research it enables. First, some training runs differ only by a least-significant bit flip in a single initialised weight. Others differ in the full randomness for weight initialisation and data. Yet, the single bit-flip setting produces agents that act nearly as differently from each other as the full randomness ones. Second, we use our interpretability suite to show that several known VPT attention heads retain their roles after STEVE-1 fine-tuning, while attention strength to the same behaviourally meaningful frame can vary substantially across agents and checkpoints. Finally, although the agents are similarly capable on in-distribution tasks, their out-of-distribution behaviour of villager killing can differ substantially: in one setting, one agent kills villagers less than 5\% of the time, while another kills them nearly 50\% of the time. Our results show the value of studying multiple similarly trained agents rather than acting like a behavioural biology lab with only one rat.
Code: https://anonymous.4open.science/r/multisteve1s-anonymous-76D4/
Vision-Language Binding in In-Context Image Generation
Abstract
In-context image generation models such as FLUX.2 take a text prompt and an optional reference image as visual conditioning for the output. Internally, all three inputs — text, reference image, and the noise tokens — are concatenated and processed through a single attention stream, where all tokens can attend to one another. This leaves open how reference information flows through the model to produce the output image. We show that an implicit cross-modal binding emerges between the text tokens and the reference image: the text tokens absorb visual reference content during the forward pass, and that absorbed content causally influences the generated output. We surface this binding with three causal interventions on FLUX.2: T2I Lens, which decodes intermediate text-token activations through a text-to-image path; Attention Knockout, which severs specific attention edges; and I2I-to-I2I Patching, which copies text token activations between editing runs. Across 2,875 editing tasks on various images, including SUN397 and DreamBench++ datasets and images collected online, we observe a consistent division of labor: properties of the reference image, like color, style, and scene setting, are first written into the text tokens, which carry them to the generated image; pixel-exact properties like a specific face or instance identity bypass the text tokens and flow directly from reference to image through image-to-image attention. We further localize the reference-text binding to the padding tokens of the text sequence. These results show that text tokens in a multimodal DiT are not just prompt holders, but a structured channel for reference image content. More broadly, they suggest that even in unified-attention multimodal generative models, token modality structures how conditioning information is represented and routed across the network.
Predictive Concept Decoders: Training Scalable End-to-End Interpretability Assistants
Abstract
Interpreting the internal activations of neural networks can produce more faithful explanations of their behavior, but is difficult due to the complex structure of activation space. Existing approaches to scalable interpretability use hand-designed agents that make and test hypotheses about how internal activations relate to external behavior. We propose to instead turn this task into an end-to-end training objective, by training interpretability assistants to accurately predict model behavior from activations through a communication bottleneck. Specifically, an encoder compresses activations to a sparse list of concepts, and a decoder reads this list and answers a natural language question about the model. We show how to pretrain this assistant on large unstructured data, then finetune it to answer questions. The resulting architecture, which we call a Predictive Concept Decoder, enjoys favorable scaling properties: the auto-interp score of the bottleneck concepts improves with data, as does the performance on downstream applications. Specifically, PCDs can detect jailbreaks, secret hints, and implanted latent concepts, and accurately surface latent user attributes.
Representational Geometry Reveals How Context Structures Concept Spaces in Language Models
Abstract
How context shapes the meaning of a concept is a foundational question in cognitive science and machine learning, yet direct experimental investigation remains difficult. Large language models, trained on vast human-generated text, offer a computational window into the structure of conceptual representation. The dominant view in machine learning treats concept representations as stationary geometric objects. Yet concepts appear in context, and context transforms them. We ask whether this transformation has shared structure and whether that structure is semantically organized in ways that reflect human conceptual knowledge. Drawing from neural population geometry, we formalize concept representations as point-cloud manifolds and contextual transformations as vector fields. Across six model families from 500M to 30B parameters, we investigate natural, artificial, and abstract concepts under six semantic context dimensions grounded in theories of human conceptual representation. We find that context moves each concept differently. The variance in displacement is semantically organized, correlating with lexical concreteness and density. Importantly, this variance structure is shared across models. Displacement structure transferred from one model predicts held-out displacements in other models significantly above chance, and ablating this structure degrades prediction. These findings suggest that the geometry of how context transforms concepts is not a property of any particular model, but a stable structure that may reflect something deeper about how meaning is organized. Collectively, these findings offer computational insight into how concepts can be simultaneously stable and context-sensitive.
The Geometric Wall: Manifold Structure Predicts Layerwise Sparse Autoencoder Scaling Laws
Abstract
Sparse autoencoders (SAEs) operationalise the linear representation hypothesis: they reconstruct model activations as sparse linear combinations of interpretable dictionary atoms, on the implicit assumption that activation space is well approximated by a globally linear structure. Their reconstruction error varies sharply across layers in ways that existing scaling laws, fitted at single layers, do not explain. We argue that this variation is the empirical trace of a geometric mismatch: where the activation manifold is curved and its intrinsic dimension varies across layers, no sparse linear dictionary can match it uniformly, and the SAE's width-sparsity scaling becomes a layer-dependent function of manifold structure rather than a single universal law. We conduct the first cross-layer SAE scaling study, fitting and regressing on 844 residual-stream Gemma Scope SAE checkpoints across 68 layers of Gemma 2 2B and 9B. Stage 1 fits a per-layer scaling-law surface; Stage 2 regresses the fitted parameters and the derived per-layer width exponents on four layerwise geometric summaries. We find that manifold geometry predicts the per-layer width exponent in both models, and that the same regression coefficients learnt on one model predict the other model's per-layer exponents under cross-model transfer, indicating a transferable geometric law. At the showcase layers where richer width grids permit identification of the asymptotic floor, we find that the fitted floor tracks the layerwise geometric ordering: higher curvature and intrinsic dimension correspond to higher floor, consistent with the irreducible second-order residual that any sparse linear approximation of a curved manifold must leave behind. SAEs thus encounter not a finite-resource ceiling but a geometry-dependent wall, set by the manifold they are trying to reconstruct.
From Tokens to Policy: Causal and Interpretable Heterogeneous Treatment Effects Identification
Abstract
Heterogeneous Treatment Effect (HTE) identification is crucial to explain the impact of an intervention and optimize our policies accordingly.
Existing approaches trade expressivity for interpretability, but, if some active heterogeneity drivers are unmeasured, methods at both ends of this spectrum allow for spurious HTE characterization with no causal reading. In this work, we focus on controlled experiments and argue that an oracle HTE causal characterization via the latent interactors is now within reach, thanks to (i) more extensive pre-treatment measurements, i.e., multi-modal and multi-view, and (ii) scalable representations with minimal human supervision. We then re-frame HTE identification as a Markov-blanket discovery problem on a sufficient and aligned pre-treatment representation, and introduce Neural EXposure Interaction Search (NEXIS), an iterative procedure with provable and empirically validated consistent selection. We deploy NEXIS on two anti-poverty programs in Africa, augmenting each with satellite imagery capturing previously unmeasured environmental effect modifiers, leading to novel, interpretable and prescriptive guidelines to optimize the programs' next iterations.
The Dark Subspace of Fine-Tuning Memorisation
Abstract
Sparse autoencoders (SAEs) learn a dictionary of sparse features over neural-network activations and are widely used to interpret and edit language models. Privacy and safety methods now ablate or steer these features to suppress unwanted behaviour, acting only on what the dictionary represents. We ask whether the dictionary preserves the evidence that a document was used to fine-tune the model, the question studied by membership-inference attacks. We decompose each activation into the SAE reconstruction and the reconstruction residual, then train membership detectors on each. In controlled Pythia replications, SAE reconstruction weakens detection, yet detectors recover much of the lost signal from the residual. The same residual-above-reconstruction ordering holds across ten model-SAE settings spanning six architecture families. Surface confounds (norm, length, bag-of-words) and a label-shuffle permutation test do not account for the pattern. Privacy and safety methods that operate only on SAE features may therefore leave training-data evidence detectable in the reconstruction residual.
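The decomposition this abstract relies on can be illustrated with a toy example. The sketch below assumes a plain ReLU SAE with hypothetical weights; the function names and the numbers are illustrative only, and the membership detectors themselves are not shown.

```python
def matvec(matrix, vec):
    """Matrix-vector product on plain Python lists."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def sae_decompose(x, w_enc, b_enc, w_dec, b_dec):
    """Split activation x into (reconstruction, residual) with x = recon + resid."""
    pre = [z + b for z, b in zip(matvec(w_enc, x), b_enc)]
    feats = [max(0.0, z) for z in pre]  # ReLU feature activations
    recon = [y + b for y, b in zip(matvec(w_dec, feats), b_dec)]
    resid = [a - r for a, r in zip(x, recon)]
    return recon, resid

# Hypothetical 3-dimensional activation and a 2-feature dictionary.
w_enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_enc = [0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b_dec = [0.0, 0.0, 0.0]

x = [2.0, -1.0, 0.5]
recon, resid = sae_decompose(x, w_enc, b_enc, w_dec, b_dec)
```

In this toy case the dictionary captures only the first coordinate, so everything else lands in `resid`. A detector trained on `resid` sees exactly the part of the activation that feature-level ablation or steering never touches.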
Analysis-by-Proxy: Localization Signals in VLMs Operating as Condition Encoders
Abstract
Vision-Language Models (VLMs) are increasingly utilized as the conditioning backbone for diffusion-based image editing due to their remarkable multimodal reasoning capabilities. While standalone VLMs demonstrate strong localization capabilities, editing pipelines frequently struggle to maintain this accuracy, particularly in complex, multi-entity scenes.
In this work, we investigate this performance gap, hypothesizing that it stems from treating the VLM as a condition encoder. In this role, the model is restricted to a single forward pass, preventing the autoregressive generation process for which it was optimized, thereby failing to fully expose its capabilities. To investigate whether this spatial understanding persists when the VLM is used as a condition encoder, we introduce Analysis-by-Proxy.
In this framework, we train a lightweight, interpretable proxy model on the VLM's intermediate representations using an auxiliary localization task. By analyzing the VLM through this proxy, we uncover the specific VLM representations that encode localization information. Our findings expose a fundamental mismatch between how spatial knowledge is represented within a VLM condition encoder and how it is extracted by current editing pipelines.
We reveal that under single-pass constraints, the localization signal does not reliably propagate to the predefined layer configurations commonly used for conditioning. Instead, this crucial signal remains hidden within intermediate representations, at locations that vary depending on the input prompt. Using our introduced Analysis-by-Proxy framework, we reveal the fundamental failures of existing condition extraction strategies in editing pipelines, opening the door to more principled design of conditioning architectures.
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Abstract
Neural representations carry rich geometric structure; but does that structure causally shape behavior?
To address this question, we intervene along paths through activation space defined by different geometries, and measure the behavioral trajectories they induce.
In particular, we test whether interventions that respect the geometry of activation space will yield behaviors close to those the model exhibits naturally.
Concretely, we first fit an activation manifold $\mathcal{M}_h$ to representations and a behavior manifold $\mathcal{M}_y$ to output probability distributions.
We then test the link $\mathcal{M}_h \leftrightarrow \mathcal{M}_y$ via interventions: we find that steering along $\mathcal{M}_h$, which we term \textit{manifold steering}, yields behavioral trajectories that follow $\mathcal{M}_y$, while linear steering---which assumes a Euclidean geometry---cuts through off-manifold regions and hence produces unnatural outputs.
Moreover, optimizing interventions in activation space to produce paths along $\mathcal{M}_y$ recovers activation trajectories that trace the curvature of $\mathcal{M}_h$.
We demonstrate this bidirectional relationship between the geometry of representation and behavior across tasks and modalities.
In language models, we use reasoning tasks with cyclic and sequential geometries as well as in-context learning tasks with more complex graph geometries. In a video world model, we use a task with geometry corresponding to physical dynamics.
Overall, our work shows that geometry in neural representation is not merely incidental, but is in fact the proper object for enabling principled control via intervention on internals.
This recasts the core problem of steering from finding the right \textit{direction} to finding the right geometry.
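A toy version of the contrast between linear and manifold steering, assuming the activation manifold is a known unit circle (a cyclic geometry) rather than one fitted from data. The function names and the two endpoints are hypothetical, and no model is involved.

```python
import math

def linear_path(a, b, steps):
    """Straight-line (Euclidean) interpolation from a to b."""
    return [[(1 - t) * ai + t * bi for ai, bi in zip(a, b)]
            for t in (i / steps for i in range(steps + 1))]

def circle_path(theta_a, theta_b, steps):
    """Interpolation along the circle itself, by moving the angle."""
    return [[math.cos(th), math.sin(th)]
            for th in (theta_a + (theta_b - theta_a) * i / steps
                       for i in range(steps + 1))]

def off_manifold(p):
    """Distance of point p from the unit circle."""
    return abs(math.hypot(*p) - 1.0)

theta_a, theta_b = 0.0, 0.75 * math.pi
a = [math.cos(theta_a), math.sin(theta_a)]
b = [math.cos(theta_b), math.sin(theta_b)]

max_linear = max(off_manifold(p) for p in linear_path(a, b, 20))
max_manifold = max(off_manifold(p) for p in circle_path(theta_a, theta_b, 20))
```

The straight line passes through the interior of the circle, so its midpoint is far from any point the manifold contains, while the angular path stays on the circle throughout. This is the geometric intuition behind the claim that linear steering visits off-manifold regions.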
Validating Causal Abstraction Metrics on Simulated Complex Systems
Abstract
A central goal of science is to produce valid explanations of complex systems: high-level causal accounts that faithfully reflect the behavior of lower-level mechanisms. Yet no consensus exists on how to measure whether a proposed high-level explanation is actually valid. We introduce a benchmark of ten complex systems spanning discrete and continuous state spaces and static and dynamical regimes, each equipped with consensual ground-truth causal explanations and invalid contrastive conditions. Within a unified causal abstraction framework, we systematically evaluate over thirty candidate metrics drawn from observational, functional, information-theoretic, and causal families. Our results show that only the latter reliably discriminates valid from invalid abstractions, and only when incorporating faithfulness testing over unmapped variables. Building on these findings, we introduce the Causal Abstraction Error (CAE), a continuous validity metric with an explicit faithfulness test, which passes all discrimination tests across every system and converges with as few as 30 sampled interventions. We offer it as a general-purpose metric for the discovery and validation of high-level explanations of complex systems.
How do Small Transformer Models Learn Hard Math Tasks?
Abstract
Recent works have demonstrated that transformers can be trained to recover sparse, binary cryptographic secrets in the Learning With Errors (LWE) problem, a foundational problem that underlies many post-quantum cryptographic schemes. However, as architectures have evolved to efficient encoder-only models, the mechanism by which these models recover the cryptographic secret has become more opaque. In this paper, we present the first layer-wise and embedding-level mechanistic interpretability analysis of encoder-only transformers trained on LWE samples. We reveal a surprising phenomenon: despite achieving near-zero exact prediction accuracy on the training objective, the models successfully recover the secret by bypassing the standard predictive pathways. We use dimensionality reduction, causal intervention, and linear probing and find that the secret is implicitly present in the positional embedding. Building on this mechanistic understanding, we introduce an architectural intervention that applies $L_1$ sparsity regularization directly to the positional embeddings. This modification forces the model to explicitly isolate the latent secret, transforming the computationally expensive post-hoc secret recovery process into a direct, human-interpretable parameter inspection. Our findings provide fundamental insights into how transformers allocate representational capacity when faced with high-noise, structured combinatorial problems.
Conditional Dependence Structure in Sparse Autoencoder Features
Abstract
Sparse Autoencoders (SAEs) decompose language-model activations into large overcomplete dictionaries of interpretable features, but these features are typically used as individual directions rather than as structured representations.
We examine whether SAE features exhibit recoverable conditional dependence.
We introduce a scalable estimator combining support-based screening, nodewise LASSO, and resampling-based null calibration to construct dataset-conditioned graphs over SAE features.
Applied to GemmaScope SAEs on FineWeb and WMDP-bio, the graphs produced by this approach are sparse, stable under resampling, and modular, with many local neighborhoods linking semantically related features.
These graphs also show shared cross-dataset and corpus-specific organization.
They are not well explained by decoder cosine similarity or raw activation correlations.
These results suggest that SAE representations contain organization beyond individual features, providing a way to study feature splitting, merging, and organization in overcomplete representations.
How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations
Abstract
Sparse Autoencoders (SAEs) have found widespread success parsing neural representations into interpretable concepts, providing a basis for understanding and control. However, what exactly an SAE extracts, and, correspondingly, the scientific conclusions we can draw, is not obvious. Empirically, the proof is in the pudding: SAEs do learn interpretable features. Theoretically, we lack a clear account of what properties a `concept' must satisfy for an SAE to extract it. There is an extensive body of work studying sparse coding identifiability; in particular, given data generated under sparsity assumptions, when will an algorithm recover the true factors? However, SAEs are trained on internet-swallowing representations that are poorly approximated by simple generative models. Rather than assuming a hypothesised ground truth, we ask what properties any dictionary learning optimum must satisfy without data assumptions. Concretely, we extend existing local optimality analyses to the nonnegative joint-optimisation problem that vanilla SAEs approximate, and derive constraints relating optimal SAE features to their distributions. We use these to explain a range of observed SAE behaviours (hierarchical splitting \& absorption, the structure of residuals, and dense antipodal features), each reflecting how L1+nonnegativity interact with data to structure optimal dictionaries. Further, we identify a novel convex formulation of the problem, and use it to ask: will larger SAEs ever stop splitting? We find the answer can be yes, with a limiting dictionary state that clusters data along rays. In sum, we hope this framework can tease model assumptions from unexpected observations, letting us learn more from SAEs' successes.
Subliminal Learning is Non-Semantic Distillation
Abstract
Subliminal Learning (SL) is a surprising type of generalization displayed by modern language models. It allows the transfer of a bias or behavior from a teacher model to a student by distilling from seemingly unrelated or random synthetic data from the teacher. This presents challenges in ensuring AI systems remain predictable and are trained safely, as standard auditing of the input data would not catch the hidden subliminal signal. Here, we investigate several open questions as to the enabling mechanisms and drivers of SL. First is the nature of the process by which biases are encoded in the data. We find that by adding Gaussian noise to the weights of the teacher and student models, the magnitude of subliminal transfer is increased by a factor of 2.7 in Gemma and 1.8 in Llama, suggesting that non-semantic weight structures play a crucial role. We show that steering vectors can be applied to the teacher to produce subliminal data, in addition to prompting and finetuning as used in previous studies. Analysis of the activations of the student models that have been trained on steered and prompted data demonstrates that students inherit not just the semantic meaning of the teacher's bias, but also the type of intervention that was used to apply it: steered students imitate steering vectors, prompted students do not. Additionally, the gradients of steered subliminal data show a linear correlation with the teacher's steering vectors, showing promise for data auditing. More broadly, as synthetic data becomes central to frontier training pipelines, being able to see the latent signals hidden in training data becomes paramount.
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Abstract
Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation—modifying prompts or weights so that the model answers truthfully—and lie detection—classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like the Tiananmen protests or the COVID-19 outbreak while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models, including DeepSeek-R1 and Qwen3.5-397B. Notably, no technique fully eliminates false responses. We release all prompts, data, and code.
Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits
Abstract
Cross-layer transcoders (CLTs) enable circuit tracing that can extract high-level mechanistic explanations for arbitrary prompts and are emerging as general-purpose infrastructure for mechanistic interpretability. Because these tools operate at a relatively low level, their outputs are often treated as reliable descriptions of what a model is doing, not just predictive approximations. We therefore ask: when are CLT-derived circuits faithful to the model’s true internal computation?
In a Boolean toy model with known ground truth, we show a specific unfaithfulness mode: CLTs can rewrite deep multi-hop circuits into sums of shallow single-hop circuits, yielding explanations that match behavior while obscuring the actual computational pathway. Moreover, we find that widely used sparsity penalties can incentivize this rewrite, pushing CLTs toward unfaithful decompositions. We then provide preliminary evidence that similar discrepancies arise in real language models, where per-layer transcoders and cross-layer transcoders sometimes imply sharply different circuit-level interpretations for the same behavior. Our results clarify a limitation of CLT-based circuit tracing and motivate care in how sparsity and interpretability objectives are chosen.
Localizing RL-Induced Tool Use to a Single Crosscoder Feature
Abstract
Fine-tuning through RL reshapes the internal representations of language models to enable agentic behaviors such as tool use, yet the mechanistic basis of these changes remains poorly understood. While RL substantially improves structured tool-call generation, it is unclear which features emerge, which are preserved, and whether identified features can be leveraged for retraining-free behavioral control. In this work, we show that $\textit{Dedicated Feature Crosscoders (DFC)}$ isolate a compact set of RL-specific features that mediate tool-calling capability in $\texttt{Qwen2.5-3B}$. Across a $48$-crosscoder hyperparameter sweep, encode-decode reconstruction improves the RL model's tool correctness by $+31.1 \pm 9.7$ pp and passively transfers tool-calling ability to the frozen base model by $+6.8 \pm 5.0$ pp, an effect we call $\textit{capability spillover}$. Our findings show that DFC partitioning concentrates RL-introduced capability into a minimal, steerable feature set that enables runtime behavioral control of agentic LLMs.
Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?
Abstract
Mechanistic interpretability (MI) requires full access to model internals, yet the most widely deployed language models expose little more than token probabilities through their APIs.
This creates a surrogate problem: when do measurements made on open models allow us to make claims about a closed model?
We evaluate surrogate fidelity at the output, behavioral, and representational levels.
For binary classification tasks, log-odds provide an API-compatible scalar readout of the model's representation space, and leave-one-out attributions provide insight into model behavior.
Across eleven models spanning four families (Llama, Qwen, GPT, and Gemini), we find that prediction fidelity substantially overstates behavioral fidelity: models that agree on what the answer is often disagree on why.
We document an access--validity inversion: white-box signals like attention patterns and perturbation magnitudes are highly stable across models but only weakly predictive of causal attributions, which black-box input ablations capture by design.
Mechanistic insight does not automatically transfer to closed targets, and prediction-level agreement is necessary but insufficient to warrant such transfer.
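The two API-compatible readouts named in this abstract, log-odds and leave-one-out attributions, can be sketched as below. This is an illustration only: `toy_prob` and `TOY_WEIGHTS` are hypothetical stand-ins for an API call that returns the probability of the positive class, and the token list is made up.

```python
import math

def log_odds(p):
    """Scalar readout log(p / (1 - p)) for a binary classification probability."""
    return math.log(p / (1.0 - p))

def leave_one_out(tokens, prob_fn):
    """Attribution of token i = change in log-odds when token i is removed."""
    full = log_odds(prob_fn(tokens))
    return [full - log_odds(prob_fn(tokens[:i] + tokens[i + 1:]))
            for i in range(len(tokens))]

# Hypothetical model: a logistic score over per-token weights.
TOY_WEIGHTS = {"great": 2.0, "boring": -1.5}

def toy_prob(tokens):
    score = sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))

attributions = leave_one_out(["a", "great", "but", "boring", "film"], toy_prob)
```

Both functions need only output probabilities, so they can be run against an open surrogate and a closed target alike. Two models can then agree on the predicted label while producing different attribution vectors, which is the gap between prediction fidelity and behavioral fidelity that the abstract reports.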
Size Doesn't Matter: Cosine-Scored Sparse Autoencoders
Abstract
Sparse autoencoders (SAEs) detect features via inner product, so a feature’s activation scales with both its directional alignment and the input’s norm. Under BatchTopK, high-norm tokens inflate all pre-activations simultaneously, claiming dictionary slots regardless of content alignment. This matters because sublayer normalization has already discarded the magnitude the score measures, so the encoder detects a quantity the model does not read. We replace the score with a learned blend of cosine similarity and input magnitude, letting the optimizer choose how much norm to use; a per-feature extension lets each feature decide independently. In both regimes, training is free to recover inner product but never does, with no feature ever choosing more than half-magnitude dependence. At matched reconstruction, the cosine encoder learns features that align with human-recognizable concepts far more often than the standard encoder, filling dictionary slots that inner product wastes on norm detectors. Loss reweighting that equalizes gradients barely closes the gap, confirming forward-pass score geometry as the lever. The advantage is not universal across tasks or depths, but we believe cosine scoring should be the default for dictionary learning on normalized representations.
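To make the idea of a blend between cosine similarity and input magnitude concrete, here is a minimal sketch under one assumed parameterization: score = cos(x, w) · ‖x‖^α. The abstract does not state the exact form, so `blended_score` and the exponent `alpha` are illustrative assumptions. With a unit-norm feature direction, α = 1 gives the plain inner product and α = 0 gives pure cosine scoring.

```python
import math

def blended_score(x, w, alpha):
    """Assumed parameterization (not from the paper): cos(x, w) * ||x||**alpha."""
    x_norm = math.sqrt(sum(v * v for v in x))
    w_norm = math.sqrt(sum(v * v for v in w))
    cosine = sum(a * b for a, b in zip(x, w)) / (x_norm * w_norm)
    return cosine * x_norm ** alpha

w = [1.0, 0.0]           # unit-norm feature direction
x_small = [1.0, 1.0]
x_large = [10.0, 10.0]   # same direction as x_small, 10x the norm

# alpha = 0: the score ignores the norm, so both inputs score the same.
cos_small = blended_score(x_small, w, 0.0)
cos_large = blended_score(x_large, w, 0.0)

# alpha = 1: the score equals the inner product <x, w>, which grows with the norm.
dot_large = blended_score(x_large, w, 1.0)
```

In this toy case the high-norm input scores ten times higher under inner product while its cosine score is unchanged, which is the norm inflation the abstract attributes to inner-product scoring.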
Language Models Learn Universal Representations of Numbers and Why You Should Care
Abstract
Prior work has shown that large language models (LLMs) often converge to accurate input embeddings for numbers, based on sinusoidal representations.
In this work, we show quantitatively that these representations are in fact strikingly systematic, to the point of being almost perfectly universal: different LLM families develop equivalent sinusoidal structures, and number representations are broadly interchangeable in a large swathe of experimental setups.
We show that properly factoring in this characteristic is crucial for assessing how accurately LLMs encode numeric and other ordinal information, and that mechanistically enhancing this sinusoidality can also lead to reductions of LLMs' arithmetic errors.
Multiplication Beyond Groups: Stratified Fourier Mechanisms in Transformer Circuits
Abstract
Transformers have demonstrated a remarkable ability to learn algorithmic reasoning, yet mechanistic analyses have mostly focused on globally invertible operations such as cyclic addition and group composition. In this work, we investigate how small transformers learn modular integer multiplication over composite moduli, a fundamentally non-invertible operation due to the presence of zero-divisors. We propose the monoid extension: a localized generalization of Group Composition via Representation (GCR) that suggests the learned computation does not rely on a single global representation space. Instead, the model partitions the input space into local hierarchical algebraic regions, where group-like structure survives and Fourier mechanisms can be applied. In transformers trained on square-free modular multiplication, we find that embeddings organize around these regions, attention exhibits class-sensitive routing and low-rank write directions, and local character features explain a large fraction of the model's output logits. Our results suggest that representation-theoretic mechanisms previously identified for group operations can extend beyond groups to more general structures.
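For readers unfamiliar with why multiplication modulo a composite number is non-invertible, the following sketch lists units and zero-divisors for a small square-free modulus and groups residues by gcd(a, n). The gcd-based grouping is only one natural illustration of regions where group-like structure survives; it is an assumption of this sketch, not necessarily the partition used in the paper.

```python
from math import gcd

def units(n):
    """Residues that have a multiplicative inverse mod n."""
    return [a for a in range(n) if gcd(a, n) == 1]

def zero_divisors(n):
    """Nonzero residues a with a*b = 0 (mod n) for some nonzero b."""
    return [a for a in range(1, n)
            if any((a * b) % n == 0 for b in range(1, n))]

def gcd_classes(n):
    """Group residues by gcd(a, n); note that gcd(0, n) == n."""
    classes = {}
    for a in range(n):
        classes.setdefault(gcd(a, n), []).append(a)
    return classes

n = 6  # square-free: 6 = 2 * 3
unit_list = units(n)
zd_list = zero_divisors(n)
classes = gcd_classes(n)

# Within the class {2, 4}, multiplication mod 6 is closed and 4 acts as identity.
region = classes[2]
closed = all((a * b) % n in region for a in region for b in region)
identity_is_4 = all((4 * a) % n == a for a in region)
```

For n = 6 the residues 2, 3, and 4 are zero-divisors, yet the class {2, 4} on its own still behaves like a small group, which gives some intuition for how Fourier-style mechanisms could apply locally.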
Circuits & Attribution Graphs (28)
Multi-Granular Node Pruning for Causal Circuit Discovery
Abstract
Causal circuit discovery aims to identify minimal subnetworks that causally drive specific behaviors in large language models (LLMs). Existing approaches focus on edge pruning or unstructured weight pruning. These methods are computationally expensive and typically operate on coarse-grained components, such as attention heads or MLP blocks, thereby missing finer-grained structure. We propose a node-level pruning framework for circuit discovery that addresses both scalability and granularity limitations. Our method introduces learnable masks across multiple levels of granularity, from entire blocks to individual neurons, within a unified optimization objective. Granularity-specific sparsity penalties guide the pruning process, allowing a comprehensive compression in a single fine-tuning run. Empirically, our approach identifies more compact circuits than prior methods, 33.34% more MLPs, and 59.8% more neurons in the least favorable setting, with larger gains overall. We further demonstrate that many neurons deemed important by coarse methods are, in fact, irrelevant and can be removed with negligible impact on task performance. Our method is also memory-efficient, requiring at least 3× less memory because it avoids storing intermediate activations.
LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit
Abstract
When a language model sycophantically agrees with a user's false belief, is it failing to detect the error, or noticing and agreeing anyway? We show the second. Across twelve open-weight models from five labs ($1.5$B-$72$B), the same small set of attention heads carries a "this statement is wrong" signal whether the model is evaluating an isolated claim or being pressured to agree with a user. Silencing these heads in Gemma-2-2B flips sycophancy from $28\%$ to $81\%$ while factual accuracy moves only from $69\%$ to $70\%$; the circuit controls deference, not knowledge. Edge-level path patching confirms the same connections between heads span sycophancy, factual lying, and instructed lying ($r{>}0.97$ on Gemma-2-2B, $r{=}0.988$-$0.995$ on Phi-4). Opinion-agreement, where there is no factual ground truth, reuses these head positions but writes into an orthogonal direction, so the substrate is not a relabeled "truth direction." Alignment training masks but does not remove this circuit: Meta's Llama-3.1$\to$3.3 RLHF refresh cut sycophancy tenfold while the shared heads persisted and the projection-ablation effect grew (substrate persistence replicates on Mistral$\to$Zephyr at 7B, independent family), and our own anti-sycophancy DPO reduced sycophancy $46$-$93\%$ on two models without moving probe transfer. When these models sycophant, they register the error and agree anyway.
Decomposing Query-Key Feature Interactions Using Contrastive Covariances
Abstract
Despite the central role of attention heads in Transformers, we lack tools to understand why a model attends to a particular token. To address this, we study the query-key (QK) space – the bilinear joint embedding space between queries and keys. We present a contrastive covariance method to decompose the QK space into low-rank, human-interpretable components. It is when features in keys and queries align in these low-rank subspaces that high attention scores are produced. We first study our method both analytically and empirically in a simplified setting. We then apply our method to large language models to identify human-interpretable QK subspaces for categorical semantic features and binding features. Finally, we demonstrate how attention scores can be attributed to our identified features.
Emergent Analogical Reasoning in Transformers
Abstract
Analogy is a central faculty of human intelligence, enabling abstract patterns discovered in one domain to be applied to another.
Despite its central role in cognition, the mechanisms by which Transformers acquire and implement analogical reasoning remain poorly understood.
In this work, inspired by the notion of functors in category theory, we formalize analogical reasoning as the inference of correspondences between entities across categories.
Based on this formulation, we introduce synthetic tasks that evaluate the emergence of analogical reasoning under controlled settings.
We find that the emergence of analogical reasoning is highly sensitive to data characteristics, optimization choices, and model scale.
Through mechanistic analysis, we show that analogical reasoning in Transformers decomposes into two key components:
(1) geometric alignment of relational structure in the embedding space, and
(2) the application of a functor within the Transformer. These mechanisms enable models to transfer relational structure from one category to another, realizing analogy.
Finally, we quantify these effects and find that the same trends are observed in pretrained LLMs.
In doing so, we move analogy from an abstract cognitive notion to a concrete, mechanistically grounded phenomenon in modern neural networks.
ProtoMech: Protein Circuit Tracing via Cross-layer Transcoders
Abstract
Protein language models (pLMs) have emerged as powerful predictors of protein structure and function. However, the computational circuits underlying their predictions remain poorly understood. Recent mechanistic interpretability methods decompose pLM representations into interpretable features, but they treat each layer independently and thus fail to capture cross-layer computation, limiting their ability to approximate the full model. We introduce ProtoMech, a framework for discovering computational circuits in pLMs using cross-layer transcoders that learn sparse latent representations jointly across layers to capture the model’s full computational circuitry. Applied to the pLM ESM2, ProtoMech recovers 82–89% of the original performance on protein family classification and function prediction tasks. ProtoMech then identifies compressed circuits that use <1% of the latent space while retaining up to 79% of model accuracy, revealing correspondence with structural and functional motifs, including binding, signaling, and stability. Steering along these circuits enables high-fitness protein design, surpassing baseline methods in more than 70% of cases. These results establish ProtoMech as a principled framework for protein circuit tracing.
Induction Meets Biology: Mechanisms of Repeat Detection in Protein Language Models
Abstract
Protein sequences are abundant in repeating segments, both as exact copies and as approximate segments with mutations. These repeats are important for protein structure and function, motivating decades of algorithmic work on repeat identification. Recent work examining masked-token prediction behavior has shown that protein language models (PLMs) identify repeats. To elucidate their internal mechanisms, we investigate how PLMs detect both exact and approximate repeats. We find that the mechanism for approximate repeats functionally subsumes that of exact repeats. We then characterize this mechanism, revealing two main stages: PLMs first build feature representations using both general positional attention heads and biologically specialized components, such as neurons that encode amino-acid similarity. Then, induction heads attend to aligned tokens across repeated segments, promoting the correct answer. Our results reveal how PLMs solve this biological task by combining language-based pattern matching with specialized biological knowledge, thereby establishing a basis for studying more complex evolutionary processes in PLMs.
Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts
Abstract
Does structure in representations imply structure in computation? We study how Llama-3.1-8B reasons over cyclic concepts (e.g., "what month is six months after August?"). Even though Llama-3.1-8B's representations for these concepts are circularly structured, we find that instead of directly computing modular addition in the period of the cyclic concept (e.g., 12 for months), the model re-uses a generic addition mechanism across tasks that operates independently of concept-specific geometry. First, it computes the sum of its two inputs using base-10 addition (six + August=14). Then, it maps this sum back to cyclic concept space (14->February). We show that Llama-3.1-8B uses task-agnostic Fourier features to compute these sums—in fact, these features have periods that respect standard base-10 addition, e.g., 2, 5, and 10, rather than the cyclic concept period (e.g., 12 for months). Furthermore, we identify a sparse set of 28 MLP neurons re-used across all tasks (approximately 0.2% of the MLP at layer 18) that can be partitioned into disjoint clusters, each computing the sum for a Fourier feature with a different period. Our work highlights how an interplay between causal abstraction and feature geometry can deepen our mechanistic understanding of LMs.
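The two-step computation described in this abstract (ordinary integer addition first, then a map back into the cycle) can be written out directly. The sketch below is a plain-Python analogue of that description, not a model of the network itself; the function name `months_after` is hypothetical.

```python
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

def months_after(start, offset):
    """Step 1: ordinary integer sum. Step 2: map the sum back onto the 12-month cycle."""
    total = (MONTHS.index(start) + 1) + offset   # e.g. August (8) + 6 = 14
    wrapped = MONTHS[(total - 1) % 12]           # e.g. 14 -> February
    return total, wrapped

total, answer = months_after("August", 6)
```

The contrast drawn in the abstract is with a direct modular computation that would never form the intermediate value 14 at all.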
Probing Hybrid Language Models for In-Context Recall
Abstract
Hybrid language models that interleave quadratic-attention with linear-attention layers can match quadratic-attention models on in-context recall while being more efficient at long context, but how they actually handle recall inside the architecture is not yet understood. Recent mechanistic work locates the recall circuit at the prediction position via head ablations on a few publicly-released hybrids, where attention placement is fixed by the original training recipe. We pretrain placement-controlled hybrids at 340M alongside quadratic-attention and linear-attention baselines. To see how they perform recall, we construct the sequence-layer map: a 2D probe-accuracy matrix over layers and token positions in the residual stream, applied uniformly across architectures. Attention placement determines whether a hybrid recalls, while standard language-modeling evaluations stay flat across placements. At the prediction position, the recall signal undergoes a sharp layer-localized phase transition only in hybrids able to recall and in quadratic-attention models. Across the sequence, quadratic-attention models drop the recall signal and look it up at the prediction position; hybrids inherit this pattern depending on where the attention layer sits. A head-level analysis at 340M and on OLMo-Hybrid 7B further localizes the phase transition to a small set of attention heads, including some we can only see by reading what each head writes into the answer-token logits. Together, these results give a controlled view of how pretrained hybrid language models handle in-context recall, and why some succeed where others fail.
Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention
Abstract
This paper studies the role of sinks and diagonal patterns as attention switch and anti-oversmoothing mechanisms. We analyze geometric conditions under which sinks can be represented, showing a necessary alignment between the embedding of the sink and all other embeddings. Next, we refine the current understanding of the role of sinks in oversmoothing prevention: we specify the conditions under which dense attention provably smooths more than sparse attention, and empirically verify that such conditions are often satisfied in practice. We further prove an equivalence between sinks and hard attention switch, in which the output of the attention is identically 0. Finally, we relax the hard attention switch by allowing token self-communication: we provide a quantitative comparison of the costs of representing sinks vs. diagonal patterns, showing why sinks are favored in pretrained transformers. The introduction and analysis of diagonal patterns and the generalization of the attention switch close the gap between what oversmoothing prevention requires and what sinks provide, while also establishing when and why attention layers act like MLPs if token communication is not necessary.
Ghost Heads Across Training: When Greedy and Distributional Patching Disagree
Abstract
A language model can produce a correct answer within 32 samples long before it does so reliably on the first try. Across arithmetic, GSM8K, and MATH500, pass@32 saturates during pretraining while pass@1 lags far behind. What mechanistic process closes this gap? We track circuit formation across OLMo-3-7B's full training pipeline (pretraining, mid-training, and RL-Zero) with both greedy and distributional activation patching, and introduce *answer-token patching* for multi-step reasoning. The two metrics agree on most components but *dissociate* at specific attention heads, which we call *ghost heads*: heads with high greedy recovery but near-zero distributional impact. Ghost heads are a recurring training instability: they peak whenever the training objective shifts and decay within each phase. Mid-training resolves the dissociation: patching effects concentrate onto a few attention heads where both metrics agree. RL-Zero, which improves pass@1 without touching pass@32, does not reintroduce ghost heads; two independent tests (noise sensitivity and verifier reranking) confirm it acts as a diffuse whole-model shift rather than reorganizing individual circuits. The phenomenon replicates across three model families.
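The gap between pass@32 and pass@1 can be made concrete with the commonly used unbiased pass@k estimator, 1 − C(n−c, k)/C(n, k), for n samples of which c are correct. The abstract does not say how pass@k was computed, so the estimator choice and the example counts below are assumptions for illustration.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k draws is correct) from n samples with c correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 2 correct answers out of 32 samples.
p1 = pass_at_k(32, 2, 1)
p32 = pass_at_k(32, 2, 32)
```

With these hypothetical counts, pass@32 is already saturated while pass@1 is low, which is the kind of gap the abstract sets out to explain.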
Mechanistic Interpretability of Loop Control-Flow Generation
Abstract
Large language models are increasingly used for code generation, but the mechanisms by which they represent and produce program control flow remain poorly understood. We investigate how Gemma-2 2B predicts Python loop control-flow keywords, focusing on the syntactically related tokens break and continue. We combine Direct Logit Attribution (DLA) with activation patching to nominate and causally test attention heads involved in these predictions. For break, we find a localized, distributed circuit: jointly patching the top five DLA-ranked heads recovers 25.4% (95% CI [15.3, 34.1]) of the clean–corrupted logit-difference gap across 100 random-paired prompts. For continue, the analogous intervention produces only a small metric movement, driven by changes in the competing break logit rather than suppression of continue itself, with gap recovery statistically indistinguishable from zero. The DLA-ranked heads are thus causally involved but do not function as direct keyword-selecting components. Two syntactically similar control-flow predictions therefore rely on qualitatively different internal mechanisms, illustrating both the utility and the limitations of DLA-guided activation patching for circuit discovery in code-trained models.
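A figure such as "recovers 25.4% of the clean–corrupted logit-difference gap" is typically a normalized quantity. The sketch below uses the common normalization (patched − corrupted) / (clean − corrupted); the abstract does not spell out its formula, so this form, the function name, and the example numbers are assumptions.

```python
def gap_recovery(ld_clean, ld_corrupted, ld_patched):
    """Fraction of the clean-vs-corrupted logit-difference gap restored by patching."""
    gap = ld_clean - ld_corrupted
    if gap == 0:
        raise ValueError("clean and corrupted runs have the same logit difference")
    return (ld_patched - ld_corrupted) / gap

# Hypothetical logit differences for one prompt pair.
recovered = gap_recovery(ld_clean=4.0, ld_corrupted=0.0, ld_patched=1.0)
```

A value of 0 means patching left the corrupted run unchanged, and a value of 1 means it fully restored clean behavior on this metric.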
Finding Interpretable Prompt-Specific Circuits in Language Models
Abstract
Understanding the internal circuits that language models use to solve tasks remains a central challenge in mechanistic interpretability. A crucial part of finding circuits is understanding why each attention head attends where it does. To this end, we introduce **ACC++**, an improved circuit-tracing method based on the principle of *attention-causal communication* (ACC) [1], which identifies *signals*, i.e., contents of low dimensional subspaces that cause attention on a token pair. ACC++ extracts circuits from a *single forward pass*, without replacement models or patching. Circuits identified by ACC++ consist of components that are causal for the model's attention decisions, together with the low-dimensional signals used to communicate between them. Here, we first detail the conceptual advances that ACC++ makes over previous work. We then show that across multiple models, a substantial portion of ACC++ signals are *interpretable*: many signals admit a short natural-language description. We next present a number of new insights into model behavior obtained via ACC++. First, we use ACC++'s interpretable circuits to characterize the sensitivity of indirect object identification (IOI) circuits to prompt structure. We find that prompt-specific circuits form well-defined clusters, and across clusters, heads receive systematically different signals corresponding to distinct mechanisms for identifying the IO name. Next, in multilingual IOI, ACC++ circuits show that while model *components* are reused across languages, *signals* are often language-specific. In a four-language IOI case study, cross-language circuit distances are consistent with linguistic relatedness. Together, these results show that ACC++ can shed light on a broad spectrum of model behaviors.
Circuit Tracing in Autoregressive Protein Language Models
Abstract
Protein language models (pLMs) can generate novel protein sequences with properties beyond those observed in nature, yet the mechanisms underlying protein generation remain poorly understood. Existing mechanistic interpretability methods based on sparse autoencoders and transcoders primarily focus on protein representation learning models and do not capture the computation required for autoregressive generation. Here, we introduce ProGenMech, a mechanistic interpretability framework for generative protein language models that extends cross-layer transcoders (CLTs) to ProGen3, a sparse Mixture-of-Experts model trained for both causal generation and span infilling. Unlike per-layer approaches, CLTs reconstruct each layer using sparse latent variables from all preceding layers, enabling faithful recovery of inter-layer generative computation. We further develop a zero-shot circuit discovery framework to identify sparse latent circuits responsible for protein generation and fitness prediction. In causal generation and zero-shot fitness estimation tasks, ProGenMech outperforms local transcoder baselines in recovering ProGen3’s probability distribution and functional scoring behavior, while matching the original model’s generative distribution in span infilling tasks. Moreover, the recovered circuits reveal biologically meaningful motifs and functional regions associated with conserved sequence patterns and protein fitness landscapes, establishing a foundation for interpretable and steerable protein generation.
C-Δθ: Circuit-Restricted Weight Arithmetic for Selective Refusal
Abstract
Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ: Circuit-Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update Δθc supported only on that circuit (typically <5% of parameters). Applying Δθc yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
Demystifying Variance in Circuit Discovery of LLMs
Abstract
Circuit discovery is a key technique in mechanistic interpretability to pinpoint the model components that are crucial for performing a given task. Although the current state-of-the-art method (EAP-IG) performs well on the metric of (un)faithfulness, it suffers from substantial variability. This includes *resampling variance*, where the circuit changes when we probe with a new batch of data from the same distribution; *rephrasing variance*, where the discovered circuit shifts when the prompts are rephrased; and *sample-wise variance*, where a circuit with low population unfaithfulness exhibits large fluctuations in unfaithfulness across individual samples.
This paper studies the roots of these variances. We demonstrate that CEAP, our new circuit discovery method that improves upon EAP-IG with a theoretical guarantee, can substantially lessen resampling variance. We further show that rephrasing variance arises because prompts with different templates tend to activate different circuits in the model. This leads us to argue that it may be challenging to find a comprehensive circuit that explains and controls the model’s behavior on a task, which can be expressed in countless templates, suggesting that LLMs may be inherently hard to steer. We show that sparsity, which has been claimed to form more compact and interpretable task circuits, fails to solve this problem. Regarding sample-wise variance, we argue that it is largely benign: extremely poor unfaithfulness scores often stem from how unfaithfulness is defined, rather than from defects in the measured circuits. We show that the magnitude of unfaithfulness is affected by *selective contribution scaling*, a neural mechanism that accounts for the extremely poor scores sometimes observed.
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Abstract
Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly.
Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear.
Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs.
We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities.
Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally—despite the brittleness of safety guardrails at the surface level.
This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment.
Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment.
Notably, LLMs' harmful generation capability is dissociated from how they recognize and explain such content.
Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
Language Triggers Hijack Language Circuits: A Mechanistic Analysis of Backdoor Behaviors in Large Language Models
Abstract
Backdoor attacks pose significant security risks for Large Language Models (LLMs), yet the internal mechanisms by which triggers operate remain poorly understood. We present the first mechanistic analysis of trigger-induced language-switching backdoors injected during pre-training, studying the Gaperon model family (1B, 8B and 24B). Using activation patching, we localize trigger formation and identify which attention heads process trigger and natural language information. Our central finding is that trigger heads substantially overlap with heads naturally encoding output language across model scales, with Jaccard indices between 0.18 and 0.43 over the top 10 heads identified. This suggests that backdoor triggers do not form new circuits but instead co-opt the model's existing language components and representations. These findings have implications for backdoor defense as detection methods and mitigation strategies could leverage this entanglement between triggers and natural behaviors. More broadly, our work represents a first step toward a more realistic mechanistic understanding of pre-training-injected backdoors in LLMs, paving the way for principled, interpretability-driven defenses.
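The Jaccard index quoted in this abstract is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to two hypothetical top-10 head lists; the `(layer, head)` tuples are invented for illustration.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| for two collections of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical (layer, head) identifiers; the two top-10 lists share 3 heads.
trigger_heads = [(0, i) for i in range(10)]
language_heads = [(0, i) for i in range(7, 17)]
overlap = jaccard(trigger_heads, language_heads)
```

If both sets contain exactly 10 heads and share k of them, the index is k / (20 − k), so values of roughly 0.18 and 0.43 would correspond to 3 and 6 shared heads respectively.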
When Is an Attention Head a Computational Unit? Shared Circuits Despite Orthogonal Representations
Abstract
Mechanistic analyses of transformers often proceed head-by-head, looking for attention maps that correspond to computational roles. But when should an attention head be interpreted as a computational unit? In principle, multi-head attention need not organize computation in such a clean way. A single computational function may be distributed across heads, and a single head may participate in several computations, with invariants appearing only after OV routing and residual-stream summation. In natural language models, this is hard to resolve because the underlying features and computations are unknown, and an apparently uninterpretable head may be polysemantic, part of a distributed circuit, or a sign that the analyst chose the wrong decomposition. To get a handle on this issue, we study a controlled setting where the target computation is analytically specified: transformers trained on factored sequence processes. These processes require independent predictive updates for multiple latent factors, represented in orthogonal residual-stream subspaces. This lets us ask whether the attention circuits implementing these independent updates are themselves factorized. We find that they need not be. Per-factor updates constrain only the aggregate routed contribution across heads, leaving substantial freedom in how computation is distributed. Heads may specialize to individual factors, compose with other heads to implement one factor, or contribute polysemantically across factor boundaries. The regime that emerges depends on the generator's spectrum, the head budget, and training dynamics. We introduce *effective subspace attention*, a scalar quantity that combines attention patterns with OV routing to recover invariant subspace-level contributions when individual maps are illegible.
Our results show that independent computations can be implemented by shared attention circuits, and that individual heads may look uninterpretable even when the collective routed circuit implements a precise, theoretically predicted computation.
Detection Without Suppression: A Sign Flip in IOI Circuit Formation
Abstract
Transformers trained on language modeling develop the ability to identify indirect objects (IOI) non-monotonically: accuracy drops below chance before recovering. We observe this dip across 15 training runs spanning two model families, three scales, and nine seed/data/initialization variants. Linear probes show that name-duplication information is linearly decodable at 99% accuracy at the S2 position on Pythia-160M during the dip, while IOI accuracy is 41%; a roughly 89% S2 probe baseline is already present at random initialization. Activation patching reveals a sign flip: replacing S2 with a non-duplicate control reduces the S-bias during the dip (∆LD = +0.94) but degrades performance after recovery (∆LD = −4.13), while the same intervention at a non-S2 position produces ∆LD = 0.000. Head ablation shows that S-inhibition heads, which eventually contribute 10.8× more to the logit difference than name movers, have not yet acquired large measurable effects during the below-chance phase.
Cross-Model Circuit Discovery
Abstract
Consider two large vision models. They process the same image and both correctly predict the class "rabbit." How much of the circuit computation along the way was shared? Model diffing offers a natural lens on this question. So far, however, it has largely operated on a single layer and at the level of representations rather than circuits. In this work, we introduce Universal Circuits (UCs), enabling model diffing at the circuit level. Specifically, we extend CLTs across both layers and models, with losses that encourage sparsity for interpretability and output fidelity for faithfulness. We train UCs between pairs of standard large vision models. We find that a compact cross-model intersection of pruned class circuits, typically a few hundred universal (shared) features per pair, produces $89$-$98\%$ of full-circuit classification accuracy in both models, while a complementary set of universal features (hundreds per pair) is kept by only one model's circuit, reflecting how each model weights shared concepts differently in its own representations.
To demonstrate a downstream use of UCs, we perform model *surgery*: a class is successfully transferred from one model to another with no gradient steps taken in the new model. More broadly, our work demonstrates that large vision models leverage shared multi-layer algorithms for downstream performance, and that these algorithms can be discovered, compared, and reused across models.
Additive Relational Bindings in Transformers: What Sparse Autoencoders Miss
Abstract
Language models often need to represent which entities are bound to which attributes, as in “Alice lives in Paris. Bob lives in London.” How models construct such binding representations is poorly understood, and it remains unclear whether sparse autoencoders (SAEs) recover the binding representations that models actually use. We train a 2-layer attention-only transformer on a synthetic relational retrieval task and reverse-engineer the circuit that solves the task perfectly. We find that Layer 0 writes an approximately additive entity–relation address and a separate payload at each fact slot, while Layer 1 retrieves the matching payload by same-head query-key matching against these addresses. Linear probes decode the joint address with 100% accuracy, an additive decomposition explains 99.8% of its variance, and causal patches over the address flip predictions to a distractor. However, SAEs trained on the same activation site do not recover the joint address as clean individual features, despite reconstructions preserving full task accuracy. This provides a concrete example of a composed representation that is linearly decodable and causally used, yet not cleanly exposed as sparse features by an SAE.
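One way to read "an additive decomposition explains 99.8% of its variance" is as a two-way additive fit: each address vector is approximated by an entity term plus a relation term, and the explained variance is 1 − SS_res / SS_tot. The sketch below does this for a complete entity × relation grid using row and column means. The fitting procedure and the toy vectors are assumptions of this sketch; the abstract does not describe how the decomposition was fit.

```python
def additive_r2(table):
    """table[e][r] is a vector (list of floats) for entity e and relation r.

    Fits v[e][r] ~ row_mean[e] + col_mean[r] - grand_mean per dimension
    (the least-squares additive fit on a complete grid) and returns R^2.
    """
    n_e, n_r, dim = len(table), len(table[0]), len(table[0][0])
    ss_res = ss_tot = 0.0
    for d in range(dim):
        vals = [[table[e][r][d] for r in range(n_r)] for e in range(n_e)]
        grand = sum(map(sum, vals)) / (n_e * n_r)
        row = [sum(vals[e]) / n_r for e in range(n_e)]
        col = [sum(vals[e][r] for e in range(n_e)) / n_e for r in range(n_r)]
        for e in range(n_e):
            for r in range(n_r):
                fit = row[e] + col[r] - grand
                ss_res += (vals[e][r] - fit) ** 2
                ss_tot += (vals[e][r] - grand) ** 2
    return 1.0 - ss_res / ss_tot

# Toy data (hypothetical): addresses built as entity vector + relation vector.
entities = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
relations = [[0.5, 0.5], [-1.0, 2.0]]
table = [[[a + b for a, b in zip(ent, rel)] for rel in relations] for ent in entities]
r2_additive = additive_r2(table)

# Adding an entity-relation interaction to one cell lowers the score.
table[0][0][0] += 1.0
r2_perturbed = additive_r2(table)
```

A score near 1 indicates that the joint address is well described by independent entity and relation contributions, which is the sense of "approximately additive" used in the abstract.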
Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms
Abstract
As large language models are increasingly deployed in high-stakes settings, there is a growing need for tools that audit not only model outputs but also the internal computations that produce them.
Circuit analysis is a central approach in mechanistic interpretability, but it is typically target-conditioned, explaining a single prompt paired with a chosen completion.
This target-conditioned setup obscures mechanistic heterogeneity and hinders scalable discovery.
We introduce distribution-level unsupervised feature discovery, which discovers interpretable clusters across a prompt’s continuation distribution and provides a knob to trade off semantic granularity against mechanistic specificity, without manual target selection.
Our method samples continuations, represents each with (i) a semantic embedding and (ii) a mechanistic signature derived from sparse feature attributions, and clusters them via a rate–distortion objective that trades off semantic coherence and mechanistic consistency.
We also demonstrate cluster-level causality, validating that the discovered clusters correspond to cluster-level mechanistic representations.
Overall, our approach complements circuit analysis and behavioral evaluation by providing a scalable, unsupervised audit of the mechanisms underlying a model’s continuation distribution.
Mechanistic Interpretability of Adversarial Suffixes Reveals Non-Robust Shortcuts of Safety Monitors
Abstract
We apply mechanistic interpretability to adversarial robustness, using causal interventions to reverse-engineer how Greedy Coordinate Gradient (GCG) attacks fool safety guardrails. Activation and attribution patching across four BERT-family classifiers reveals a consistent two-stage circuit: adversarial signal forms as a payload in early-layer MLP activations at suffix positions, then routes to non-suffix positions via attention keys at suffix positions. While prior interpretability work has centered on attention hijacking to study the success of adversarial suffix attacks, we find that GCG converges on a sparse, interpretable set of MLP neurons across independently optimized attacks. We use these neurons' interpretations to construct human-readable adversarial examples that exploit non-robust features that transfer to NemoGuard safety monitors an order of magnitude larger.
LLM Jailbreaks Exploit Attention Sinks
Abstract
Suffix-based jailbreak attacks append adversarial token sequences to harmful requests, bypassing safety guardrails in language models. Despite their effectiveness, the mechanisms enabling these attacks remain poorly understood. We find that tokens in adversarial suffixes are prone to inducing *attention sinks*---a phenomenon where certain tokens (e.g., BOS, punctuation, and chat tokens) receive disproportionately high attention from subsequent tokens---and establish a relationship between suffix-induced sinks and attack success: amplifying the influence of suffix sinks improves attack success by up to 276\%, while attenuating it reduces attack success by up to 84\%. We trace this effect to the model's *refusal direction*: sink tokens induce perturbations aligned with the refusal direction, cumulatively suppressing the residual stream's refusal alignment across layers. Our results generalize across several models and suffix-based jailbreak methods, exposing a fundamental structural vulnerability in transformer attention mechanisms that adversarial suffixes exploit to bypass safety alignment.
Row-Attention Extracts, Column-Attention Projects: How ConTextTab Solves In-Context Linear Classification
Abstract
Tabular Foundation Models (TFMs) achieve strong zero-shot classification via in-context learning, but their internal computations remain poorly understood. We present, to our knowledge, the first causal intervention study of a TFM. ConTextTab is uniquely suited to mechanistic analysis because its per-cell tokenisation preserves the feature axis as a directly observable computational dimension, unlike TFMs that bundle features into row-level tokens. We study it on the Rotated Linear Threshold task in $\mathbb{R}^2$, whose Bayes-optimal algorithm is known exactly, and address its post-LayerNorm architecture by restricting activation patching to sub-block boundaries -- a clean causal baseline that does not require auxiliary modeling choices. We localise a linear plug-in classifier $\mathrm{LD}(x^*) \approx f\bigl(x^* \cdot \hat n(\alpha) - \hat\theta\bigr)$ to the two architectural axes: row attention extracts the boundary normal $\hat n(\alpha)$ and threshold $\hat\theta$ in a single sub-block (L0), distributed across all twelve heads with none individually dispensable; column-attention then progressively projects the query onto the boundary, with the final three of twelve layers carrying $0.588$ of the causal weight.
Discovering Mechanisms in Tokenized Graph Transformers
Abstract
We investigate the internal mechanisms of a tokenized graph transformer — a T5 encoder trained on graphs represented as sequences of node and edge tokens — to understand how transformers process graphs and solve tasks. Using mechanistic interpretability tools such as activation patching and linear probing, we aim to understand the model under three fundamental graph tasks: degree counting, ring membership, and shortest-path distance. Our analysis reveals a common early local-structure computation, where degree-like features emerge in shallow layers and are directly used to solve degree counting. Then, the model composes this early signal differently according to each task. In ring membership, we find that the model solves this problem by gathering non-ring node evidence rather than building a cycle detection circuit. In the shortest-path distance task, causal evidence supports a serial pipeline in which early local topology feeds a single adjacency-copy head, followed by later refinement of a distance-like representation. Furthermore, we analyze the QK circuit underlying this behavior and show that early layers implement soft node-incidence tests by matching node IDs to edge endpoint IDs. Finally, we discuss our limitations and potential future research directions.
Translation Heads: Disentangling meaning from language in LLM-based machine translation
Abstract
Mechanistic Interpretability (MI) seeks to explain how neural networks implement their capabilities, but the scale of Large Language Models (LLMs) has limited prior MI work in Machine Translation (MT) to word-level analyses. We study sentence-level MT from a mechanistic perspective by analyzing attention heads to understand how LLMs internally encode and distribute translation functions. We decompose MT into two subtasks: producing text in the target language (i.e. target language identification) and preserving the input sentence’s meaning (i.e. sentence equivalence). Across three families of open-source models and 20 translation directions, we find that distinct, sparse sets of attention heads specialize in each subtask. Based on this insight, we construct subtask-specific steering vectors and show that modifying just 1% of the relevant heads enables instruction-free MT performance comparable to instruction-based prompting, while ablating these heads selectively disrupts their corresponding translation functions.
Mechanistic Evidence for Preserved-but-Misaligned Representations in Non-IID FedAvg
Abstract
Federated Averaging (FedAvg) often degrades under non-IID client data, but it remains unclear whether this degradation reflects the loss of client-learned representations or a failure to use representations that are still present. We study this question mechanistically in sparse client-trained vision models, using dense-model controls to test whether the observed effects depend on sparsity. Our analysis combines class-specific circuit discovery, linear probing of frozen representations, head-only finetuning, and sparse feature dictionaries. Across CNN and ResNet models on CIFAR-10 and Fashion-MNIST, severe label skew can drive some per-class accuracies near zero even when class-specific internal structure remains recoverable. Linear probes substantially outperform the aggregated classifier, head-only finetuning partially restores accuracy, and USAE transfer reveals a largely shared feature basis between IID and non-IID models. Together, these diagnostics suggest that, in our setting, non-IID FedAvg degradation is not fully explained by representational erasure; it also reflects misalignment between preserved internal structure and the final prediction pathway.
SAEs & Concept Discovery (22)
Hallucination-Induction, Not Calibration: When Multi-Feature SAE Steering Looks Like It Works
Abstract
We replicate Ferrando et al.'s (2024) entity-recognition SAE feature pattern on a 27B reasoning-tuned model (Qwen3.6-27B) using a paper-grade Top-$K$ Sparse Autoencoder, then test whether the feature is causally usable for hallucination calibration. The single best SAE latent at layer 31 reaches AUROC $0.814$ (95% bootstrap CI $[0.731, 0.898]$) on cross-type known-vs-unknown classification, with the entity-recognition signal peaking in the mid-stack as Ferrando reported on Gemma-2-2B-IT (peak around layer 9 in their 26-layer model). The SAE latent is statistically indistinguishable from a layer-32 L2 logistic regression probe ($0.887$ $[0.823, 0.941]$) and a diff-of-means probe ($0.859$): the contribution is interpretability, not raw classification performance. Single-feature steering produces a null effect on refusal calibration. Multi-feature top-$K$ ablation at small $K{=}200$ ($0.3$% of the dictionary) produces a $4$–$8\sigma$ effect against a random-$K$ null — in contrast with prior large-$K$ work where the random control nullifies the signal (Al-Qurashi 2025) — but a Claude-as-judge correctness audit reveals the effect is *induced confabulation*: known-entity incorrect-answer rate rises from $62$% to $77$% while correct-answer rate drops from $8$% to $0$%. Statistical significance against random-$K$ is necessary but not sufficient evidence for a calibration mechanism: the intervention identifies a hallucination-induction circuit, not a calibration knob.
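For readers unfamiliar with intervals such as "AUROC $0.814$ (95% bootstrap CI $[0.731, 0.898]$)", the following is a minimal sketch of a percentile bootstrap over examples. The abstract does not state which bootstrap variant was used, so the resampling scheme, `n_boot`, and the function names here are assumptions made for illustration only.

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]           # all positive/negative pairs
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

def bootstrap_ci(scores, labels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC, resampling examples with replacement."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        if labels[idx].min() == labels[idx].max():
            continue                             # resample contained a single class; redraw
        stats.append(auroc(scores[idx], labels[idx]))
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])
```

Here `scores` would be a single latent's activation (or a probe's output) per example, and `labels` marks known versus unknown entities.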
Same Concept, Different Directions: Cross-Modal Feature Heterogeneity in Sparse Autoencoders
Abstract
Vision-language models map images and text into a joint embedding space. However, these embeddings often entangle multiple semantic features, which limits their interpretability and controllability. While sparse autoencoders have emerged as a useful tool for decomposing these embeddings into monosemantic features, their application to joint embedding spaces has largely relied on an implicit, untested assumption that semantically corresponding features share the same directions across modalities. In this paper, we challenge this assumption by identifying discrepancies in feature directions for the same concept across image and text modalities, a phenomenon we term cross-modal feature heterogeneity. We demonstrate that this heterogeneity is a key driver of the modality split, where a shared concept activates different latents depending on the modality. This finding further reveals why aligning latent activations alone is insufficient to resolve the underlying feature mismatch. To address this misalignment, we propose an approach that trains sparse autoencoders to preserve the unique feature geometry of each modality and aligns corresponding features post hoc. Our method improves reconstruction fidelity and enhances performance in cross-modal retrieval and concept steering.
SPICE: Simple Polysemantic feature Interpretation via Clustering-based Explanations
Abstract
One of the pivotal recent challenges in neural network interpretability is polysemanticity, where a single neuron is activated by multiple, often unrelated concepts, hindering clear functional understanding. Although prior work has explored this phenomenon, existing approaches remain architecture-specific and depend on manual heuristics such as a fixed number of concept clusters ($K$), limiting their generality and scalability—especially for modern Transformer-based models. To address these limitations, we introduce SPICE (\textbf{S}imple \textbf{P}olysemantic feature \textbf{I}nterpretation via \textbf{C}lustering-based \textbf{E}xplanation), a generalizable framework for analyzing polysemanticity in deep vision architectures. SPICE avoids architecture-dependent propagation rules, enabling the first systematic comparison of polysemanticity across both CNNs and Transformers, and automatically determines the number of concept clusters per neuron, eliminating reliance on a preset $K$ and supporting scalable analysis for large models. Using SPICE, we conduct a comprehensive investigation into how polysemanticity emerges, varies across depth and architecture, and forms through distinct computational pathways.
The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
Abstract
Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis.
Expander Sparse Autoencoders: Parameter-Efficient Dictionaries for Mechanistic Interpretability
Abstract
Sparse autoencoders (SAEs) decompose internal activations of neural networks into sparse linear combinations of learned features by fitting an overcomplete dictionary $\mathbf{W}\in\mathbb{R}^{m\times n}$ with $m<n$, and inferring a sparse code $\mathbf{x}\in\mathbb{R}^n$ from $\mathbf{h}\approx\mathbf{W}\mathbf{x}$. This inference problem closely resembles the canonical setup of compressed sensing, but requires $\mathcal{O}(mn)$ learned decoder values, which become costly at large feature counts. We introduce Expander SAEs: TopK SAEs whose decoder and tied encoder are supported on a left-$d$-regular expander mask with $d \ll n$, learning only $\mathcal{O}(dn)$ decoder values while keeping the sparse-coding problem $(m,n,k)$ fixed. The same structure reduces storage and turns the matching-pursuit correlation step $\mathbf{W}^\top \mathbf{r}$ in OMP into an $\mathcal{O}(dn)$ gather-and-reduce operation. Our experiments show that varying $d$ traces a consistent storage--fidelity frontier across Pythia-160M, Qwen2.5-3B, and Llama-3.2-1B residual-stream activations, and that when $d=7$, Qwen2.5-3B uses $293\times$ fewer learned decoder values than the full dense decoder while retaining $84$\% of dense CE-loss recovered. Support-structure controls demonstrate that column sparsity explains much of the storage--fidelity tradeoff, while the diversity of column supports avoids the dead-feature pathologies of clustered sparse masks. Additional ablations show that budget-matched reduced-width dense SAEs remain a strong trained-encoder baseline at modern scale, but applying the same iterative OMP decoder to both architectures substantially narrows the small-budget gap, exposing an encoder-amortisation component.
On the theoretical side, we prove a weighted-expander identifiability theorem showing that if the fixed mask expands every $2k$-feature subset and the learned decoder columns remain sufficiently flat on their supports, then every noiseless $k$-sparse code has a unique $k$-sparse explanation that classical compressed-sensing decoders recover exactly. Expander SAEs therefore offer a parameter-efficient and theory-motivated dictionary for large-scale mechanistic interpretability.
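The $\mathcal{O}(dn)$ gather-and-reduce form of $\mathbf{W}^\top \mathbf{r}$ can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the storage layout (`values` and `support`, each of shape `(n, d)`) and all function names are hypothetical, and the randomly drawn supports are only left-$d$-regular, not certified expanders.

```python
import numpy as np

def random_left_regular_support(m, n, d, rng):
    """For each of the n features, pick d distinct rows out of m (random, uncertified mask)."""
    return np.stack([rng.choice(m, size=d, replace=False) for _ in range(n)])

def sparse_correlation(values, support, r):
    """Compute W^T r from the d stored entries per column: gather r, multiply, reduce."""
    return (values * r[support]).sum(axis=1)

def to_dense(values, support, m):
    """Materialise the masked decoder as a dense (m, n) matrix, for checking only."""
    n, d = values.shape
    W = np.zeros((m, n))
    W[support, np.arange(n)[:, None]] = values   # W[support[j, k], j] = values[j, k]
    return W
```

Only `n * d` decoder values are stored and touched per correlation step, compared with `m * n` for a dense decoder.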
Ghost Tracks: Hypothesis Competition in Sparse Feature Space as a Preliminary Signal of LLM Hallucinations
Abstract
Large language models frequently suffer from hallucinations, generating text that is fluent but factually incorrect. While empirical detection methods exist, the underlying internal mechanisms and representation dynamics that cause models to hallucinate remain poorly understood. In this work, we present preliminary evidence that hallucinations are associated with a failure to resolve competing semantic hypotheses within the model's residual stream. Specifically, by linking Sparse Autoencoder (SAE) features across layers with a multi-hypothesis tracker, we observe that factual outputs tend to correspond to a single dominating feature, while hallucinations exhibit "ghost tracks"---multiple semantic candidates that transiently activate and compete without a clear winner. Leveraging this observation, we propose *GhostTrack*, a detector that extracts feature-competition metrics from a single forward pass and reaches up to 0.903 AUROC on Phi-2 (0.722 on GPT-2 Medium, 0.650 on Qwen2.5-1.5B) on the HaluEval QA benchmark. We further analyze these dynamics to show that feature entropy and dominance margins are the strongest signals of competition failure, with class separation most pronounced in the middle-to-late transformer layers. Our findings provide correlational, interpretable evidence that hallucinations coincide with measurable failures of internal representation resolution. More broadly, our work lays the groundwork for causal interventions that monitor and steer feature dynamics during inference to improve the factual reliability of AI systems.
Decompose Sparsely Where You Should, Absorb Densely Where You Should Not
Abstract
Sparse autoencoders (SAEs) are typically trained to reconstruct the *entire* residual stream through a sparse dictionary, implicitly assuming that all activation content is amenable to sparse, monosemantic decomposition. We question this assumption and hypothesize that activations contain a low-rank, dense component that is computationally important to the model yet inherently unsuitable for sparse representation, which serves as a major source of the persistent dense latents widely observed in trained SAEs. To test this, we add a small rank-$r$ linear bottleneck in parallel with standard SAEs (BatchTopK and Matryoshka), allowing dense structure to be absorbed before sparse reconstruction. On Gemma-2-2B layer 12, a rank-24 bottleneck reduces dense latent count by up to 84\% while improving sparse probing and targeted probe perturbation on both architectures at matched sparsity. The absorbed component is (i) **structurally identifiable** as the top principal components and outlier dimensions; (ii) **causally necessary**, with removing it raising next-token cross-entropy by 7.5$\times$, far exceeding the 2.8$\times$ from removing the geometrically near-identical top-24 PCA directions; and (iii) **redundantly encoded by sparse dictionaries**, with ablating 787 maximally aligned sparse features raising cross-entropy by only 2.9$\times$ and ablating 2,048 topic-aligned features leaving MMLU topic classification virtually unchanged, whereas removing the scaffold drops it from 98.7\% to chance. Together, our findings identify a compact, semantically informative and causally important component of residual stream activations (which we term a *computational scaffold*) that standard sparse dictionaries represent inefficiently, suggesting that the scope of sparsity-based interpretability methods warrants careful re-examination.
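One way to picture a rank-$r$ linear bottleneck running in parallel with a sparse dictionary is the forward pass below. It is a sketch under assumptions: the abstract does not give the exact wiring, so the summed reconstruction, the plain per-example TopK (rather than BatchTopK or Matryoshka), and the parameter names `U`, `V`, `W_enc`, `b_enc`, `W_dec` are all hypothetical.

```python
import numpy as np

def topk(acts, k):
    """Apply ReLU, then keep the k largest activations per row and zero the rest."""
    acts = np.maximum(acts, 0.0)
    drop = np.argpartition(acts, -k, axis=1)[:, :-k]   # indices outside the top k
    out = acts.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out

def forward(x, params, k):
    """Reconstruct x as a dense rank-r path plus a TopK sparse path."""
    dense = (x @ params["U"]) @ params["V"]            # U: (D, r), V: (r, D)
    latents = topk(x @ params["W_enc"] + params["b_enc"], k)
    sparse = latents @ params["W_dec"]
    return dense + sparse, latents
```

Because the dense path can carry at most $r$ directions, anything it absorbs no longer needs to be represented by always-on sparse latents.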
Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift
Abstract
Mechanistic interpretability aims to explain a model’s behavior by identifying causally responsible internal structures. Dictionary-based explainers such as sparse autoencoders and transcoders are a primary tool, but their faithfulness under out-of-distribution (OOD) shift has received little systematic attention. We show that distribution shift rotates the subspace that the model actively uses, misaligning the explainer’s dictionary trained on in-distribution (ID) activations. We formalize this misalignment as the faithfulness gap, a geometric distance between the ID dictionary and the OOD-active subspace, and show that it controls OOD faithfulness degradation. To reduce this gap, we propose the Geometry-Adaptive Explainer (GAE), which realigns the explainer's dictionary with the OOD-active subspace while preserving the original feature structure. This requires only unlabeled OOD activations and no gradient updates. We prove that GAE improves over the unadapted ID explainer, with excess loss bounded quadratically by the second-moment shift. Empirically, GAE even matches or surpasses all training-based baselines in causal faithfulness across multiple models and OOD settings.
PolySAE: Modeling Feature Interactions in Sparse Autoencoders via Polynomial Decoding
Abstract
Sparse autoencoders (SAEs) interpret neural network representations by decomposing activations into sparse combinations of dictionary atoms. However, SAEs assume features combine additively through linear reconstruction, an assumption that cannot capture compositional structure: linear models cannot distinguish whether ``Starbucks'' arises from the composition of ``star'' and ``coffee'' features or merely their co-occurrence. This forces SAEs to allocate monolithic features for compound concepts rather than decomposing them into interpretable constituents. We introduce PolySAE, which extends the SAE decoder with higher-order terms to model feature interactions while preserving the linear encoder essential for interpretability. Through low-rank tensor factorization on a shared projection subspace, PolySAE captures pairwise and triple feature interactions with small parameter overhead (3\% on GPT2). Across four language models and three SAE variants, PolySAE achieves an average improvement of $\sim$8\% in probing F1 while maintaining comparable reconstruction error, and produces 2--10$\times$ larger Wasserstein distances between class-conditional feature distributions. Critically, learned interaction weights exhibit negligible correlation with co-occurrence frequency ($r = 0.06$ vs. $r = 0.82$ for SAE feature covariance), suggesting that polynomial terms capture compositional structure largely independent of surface statistics. Finally, the learned interaction directions causally steer model outputs toward the corresponding compositional semantics.
From Directions to Regions: Decomposing Activations in Language Models via Local Geometry
Abstract
Activation decomposition methods in language models are tightly coupled to geometric assumptions on how concepts are realized in activation space. Existing approaches search for individual global directions, implicitly assuming linear separability, which overlooks concepts with nonlinear or multi-dimensional structure. In this work, we leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative that models the activation space as a collection of Gaussian regions with their local covariance structure. MFA decomposes activations into two compositional geometric objects: the region's centroid in activation space, and the local variation from the centroid. We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B, and show they capture complex, nonlinear structures in activation space. Moreover, evaluations on localization and steering benchmarks show that MFA outperforms unsupervised baselines, is competitive with supervised localization methods, and often achieves stronger steering performance than sparse autoencoders. Together, our findings position local geometry, expressed through subspaces, as a promising unit of analysis for scalable concept discovery and model control, accounting for complex structures that isolated directions fail to capture.
Position: Use Sparse Autoencoders to Discover Unknowns
Abstract
While sparse autoencoders (SAEs) have generated significant excitement, a series of negative results have added to skepticism about their usefulness. Here, we establish a conceptual distinction that reconciles competing narratives surrounding SAEs. We argue that even if SAEs may be less effective for *acting on known concepts*, SAEs are especially powerful tools for *discovering unknown concepts*. This distinction separates existing negative results from positive results, and suggests several classes of SAE applications. Specifically, we outline use cases for SAEs in (i) ML interpretability, explainability, fairness, auditing, and safety, and (ii) social and health sciences.
Sparse Autoencoders are Capable LLM Jailbreak Mitigators
Abstract
Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), an SAE-based defense that identifies jailbreak-relevant sparse features by comparing token-level representations of the same harmful request with and without jailbreak context. Using paired harmful/jailbreak prompts, CC-Delta selects features via statistical testing and applies inference-time mean-shift steering in SAE latent space. Across four aligned instruction-tuned models and thirteen jailbreak attacks, CC-Delta achieves comparable or better safety–utility tradeoffs than baseline defenses operating in dense latent space. In particular, our method clearly outperforms dense mean-shift steering on all four models, especially against out-of-distribution attacks, showing that steering in sparse SAE feature space offers advantages over steering in dense activation space for jailbreak mitigation. Our results suggest off-the-shelf SAEs trained for interpretability can be repurposed as practical jailbreak defenses without task-specific training.
Discovering Cross-Language Reasoning Invariance in LLMs with Geometry-Invariant Sparse Autoencoders
Abstract
Multilingual language models can solve the same mathematical reasoning problem in different languages, but it remains unclear whether they rely on shared internal features or on language-specific computations that only produce similar outputs. We study this question in five models from four architecture families using the Multilingual Grade School Math (MGSM) dataset, with problems solved independently in English, German, French, Spanish, Russian, and Chinese, retaining only problems with valid reasoning traces in all six languages and replaying those traces through the model to record internal representations at multiple layers. For each model, we first use Centered Kernel Alignment (CKA) to identify layers with strong cross-language alignment. At each selected layer, we train two sparse autoencoders: a baseline reconstruction-only model and a contrastive variant introduced in this work, the Geometry-Invariant Sparse Autoencoder (GI-SAE). GI-SAE supplements the reconstruction loss with an Information Noise-Contrastive Estimation (InfoNCE) loss that trains the encoder to produce similar feature activations for traces of the same problem, regardless of language or token position. We then test whether the resulting shared features are functionally interchangeable by swapping shared feature values between languages during the model's forward pass and measuring the resulting change in output (causal patching), quantified by Kullback-Leibler (KL) divergence per shared feature. Although GI-SAE yields higher CKA and Jaccard similarity at nearly every layer, higher geometric similarity does not consistently imply greater functional interchangeability across languages. We find that cross-language feature sharing is strongly model- and architecture-dependent in this sample and appears at different depths in different models. 
GI-SAE primarily amplifies cross-language structure already present in each model: the pattern is model-specific, with progressive strengthening in Qwen, no functional benefit in already-saturated Gemma, and mixed layer-dependent effects in Llama and Phi.
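The InfoNCE term that GI-SAE adds to the reconstruction loss can be sketched as below. This is a generic one-directional InfoNCE written under assumptions: row `i` of `z_a` and row `i` of `z_b` are taken to be encoder activations for the same problem in two languages, the other rows in the batch serve as negatives, and the cosine normalisation and `temperature` value are illustrative choices not stated in the abstract.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Mean cross-entropy of picking the matching row of z_b for each row of z_a."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature                 # (batch, batch) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise the softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                 # positives sit on the diagonal
```

The loss is small when same-problem traces have similar feature activations and large when a trace looks more like a different problem than like its own translation.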
FTC: Fourier Transcoder for Monosemantic Sparse Feature Decomposition in Vision Language Model
Abstract
Sparse autoencoders (SAEs) and transcoders (TCs) have been successfully applied in interpreting neural activations in large language models (LLMs), particularly those in MLP sublayers, through sparse feature representations.
Building on this success, SAEs and TCs have also been extended to vision-language models (VLMs), including CLIP, with promising results.
However, significant challenges remain, including feature polysemanticity and incomplete disentanglement.
In this work, we introduce the Fourier Transcoder (FTC), a transcoder architecture based on a Fourier basis, to analyze MLP sublayers in VLMs.
We show that FTC discovers more monosemantic features and reduces the number of dead features in CLIP compared to TopK TC.
Crosscoders Identify Shared or Specific Features between the Human Brain and Language Models
Abstract
To what extent do human brains and language models (LMs) share internal representations of language, and how do these representations differ? Prior work has shown that LM representations can predict brain responses to naturalistic language stimuli, suggesting that the two systems encode common information. However, which features are shared between brain and LM representations and which are selectively used in brains and LMs have remained underspecified. We propose Brain-LM crosscoders, which decompose brain responses and LM representations into shared sparse features and label each feature as being shared, brain-specific, or LM-specific based on its predictive contribution to each representation. Experiments on naturalistic language listening fMRI data show that language associated with body, family, and action tends to be brain-specific, whereas colloquial expressions tend to be LM-specific. Brain-LM crosscoders compare biological and artificial language representations at the feature level, which will contribute to scientific discovery in both neuroscience and artificial neural network research.
Feature-Resolved Attention
Abstract
Dictionary learning methods such as sparse autoencoders aim to provide an interpretable, mono-semantic basis for a model's computation. Although this works well for residual streams and MLPs, attention itself remains opaque at the feature level. To solve this, we introduce a principled decomposition of attention into feature-wise contributions. We call the resulting object \textit{Feature-Resolved Attention} (FRA). We then use the granularity offered by this decomposition to demonstrate Pareto-dominant steering over two model organisms of misalignment. First, we show that we can \textbf{\textit{perfectly suppress}} sleeper agent behavior via FRA-based steering in TinyStories-33M. Strikingly, in 20\% of cases we recover the original text \textit{word-for-word}. Second, we consider model organisms of Emergent Misalignment (EM). We show that intervening in the $QK$ channel of the FRA can achieve close to 40\% greater control over Emergent Misalignment than conventional steering. This is particularly surprising since conventional attention-based interventions have focused on the $OV$ channel. Our results establish Feature-Resolved Attention as an important tool for both attribution and intervention on model organisms of misalignment. Code is available at \url{https://anonymous.4open.science/r/fra_clean-842B/README.md}.
Feature Recovery Requires Structured Event Regimes in Sparse Reconstruction
Abstract
Sparse autoencoders are often used in mechanistic interpretability as if sparse reconstruction should recover the features represented by a model. Recent work shows that this recovery is fragile, but it remains unclear which failures come from the SAE architecture, the encoder, optimization, or finite data. We show that several failures can already be incentivized by the population-level sparse-reconstruction objective. We study how much residual mass projects above the sparsity threshold in a positive linear latent-ray model; merging and splitting arise as static properties of this objective, while absorption and seed-dependent alternatives arise sequentially as earlier selections change the residual field. We also separate two notions often conflated in interpretability practice: recovering a ground-truth direction and recovering the activation pattern of that feature. Neither implies the other in general; they coincide under specific structural conditions, such as single-feature event dominance and regular simplex structure in learned symmetric geometries. Sparse reconstruction therefore recovers ground-truth features only in structured event regimes; outside them, the objective can favor non-canonical but reconstruction-useful directions, and a one-ReLU encoder introduces a further representability gap governed by whether the oracle gate is affine-ReLU approximable. Overall, our results refine the existing analysis of SAE behavior and provide a unified perspective on ground-truth feature recovery studies.
Scaling Laws for SAE Training Data
Abstract
Sparse autoencoder (SAE) training is often bottlenecked by activation storage. We show that the stored activation buffer can be far smaller than the total SAE training budget: quality depends mainly on the number of \emph{unique} activations, not on whether every training token is fresh. After a short diversity-limited regime, additional fresh activations provide little benefit; replaying the same buffer preserves both reconstruction and interpretability performance. We capture this diversity--repetition tradeoff with a data-constrained scaling law and validate it across dictionary sizes, token budgets, Llama-3.1 and Qwen3 models, early/middle/late layers, BatchTopK/TopK/JumpReLU SAEs, and downstream metrics including EV, reconstruction MSE, feature absorption, automated interpretability, and sparse probing. The resulting buffer-sizing rule is simple: choose the smallest activation buffer that reaches the quality plateau, then use replay to spend the remaining training budget. In our experiments, this reduces activation storage by $8\text{--}64\times$ with negligible quality loss.
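The buffer-sizing rule stated above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `smallest_plateau_buffer` and `replay_epochs`, the tolerance of 0.01, and the quality measurements are hypothetical placeholders, not values or code from the paper. Quality is assumed to be a higher-is-better score such as explained variance.

```python
def smallest_plateau_buffer(measurements, tolerance=0.01):
    """Smallest buffer size whose quality is within `tolerance` of the best.

    measurements: list of (buffer_size, quality) pairs, higher quality is better.
    tolerance: hypothetical definition of "on the plateau".
    """
    best = max(quality for _, quality in measurements)
    eligible = [size for size, quality in measurements if quality >= best - tolerance]
    return min(eligible)


def replay_epochs(train_budget, buffer_size):
    """Passes over the buffer needed to spend the training budget (ceiling division)."""
    return -(-train_budget // buffer_size)


# Hypothetical pilot measurements: (unique activations stored, quality score).
pilot = [
    (1_000_000, 0.71),
    (4_000_000, 0.78),
    (16_000_000, 0.815),
    (64_000_000, 0.82),
    (256_000_000, 0.82),
]

chosen = smallest_plateau_buffer(pilot)
epochs = replay_epochs(256_000_000, chosen)
print(chosen, epochs)
```

With these made-up numbers the rule keeps the 16M-activation buffer and replays it 16 times to cover a 256M-token budget; how the plateau tolerance should be set in practice is a design choice the sketch leaves open.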
Investigating Expert Semantic Specialization in Mixture-of-Expert Models
Abstract
Mixture-of-Experts (MoE) models achieve exceptional scalability through selective routing, yet our understanding of what routing actually learns remains limited. In particular, it is unclear to what extent routing induces meaningful expert semantic specialization. Sparse Autoencoders (SAEs) provide a way to extract interpretable semantic feature spaces from model representations. Building upon this capability, we introduce the Semantic Specialization Index (SSI), a quantitative metric designed to measure the degree of expert semantic specialization. Using SSI, we systematically quantify expert semantic specialization in MoE models. We further investigate the relationship between semantic specialization and language modeling performance, and find that specialization exhibits a non-monotonic trajectory during training and does not increase indefinitely with model quality, suggesting the existence of an optimal specialization regime. These findings provide a quantitative foundation for understanding expert organization in sparse models and open new opportunities for specialization-aware optimization of MoE models.
Sparse Autoencoders for LoRA-Adapted CLIP Under Domain Shift
Abstract
Sparse autoencoders (SAEs) are increasingly used to interpret CLIP representations, but they are often trained once on base-model activations and then reused after downstream adaptation. We study whether such generic SAEs remain faithful interpreters of LoRA-adapted CLIP under domain shift. Comparing three configurations--base CLIP with a generic SAE, LoRA-adapted CLIP with the same generic SAE, and LoRA-adapted CLIP with a domain-matched SAE--we find that generic SAE reuse becomes unreliable as the target domain moves away from ImageNet-like data. The degradation appears in downstream accuracy after SAE reconstruction, reconstruction error, sparsity, latent usage, and class selectivity. Retraining the SAE on adapted CLIP activations recovers much of this lost fidelity. We further introduce a Domain-Aware Monosemanticity Score (DAMS), which penalizes broad, non-discriminative feature firing that standard top-activation monosemanticity scores can overestimate under shift.
Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders
Abstract
Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through \emph{feature stability}: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct SAEs from stable features and find no evidence of a stability--explained-variance trade-off in this setting. Together, these results support cross-seed stability as a practical faithfulness filter and null check for SAE-based interpretability.
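A per-feature stability estimate of the kind described above can be sketched as follows. The sketch assumes, hypothetically, that a feature "reappears" in another seed when some decoder direction there exceeds a cosine-similarity threshold (0.7 here, an arbitrary illustrative value), and it estimates the reappearance probability as the fraction of other seeds containing a match. The name `feature_stability`, the threshold, and the toy dictionaries are assumptions for illustration, not the authors' procedure.

```python
import numpy as np


def feature_stability(reference, others, threshold=0.7):
    """Fraction of independently trained SAEs that contain a matching feature.

    reference: (n_features, d) decoder directions of one SAE.
    others: list of (m_i, d) decoder matrices from other seeds.
    threshold: hypothetical cosine cutoff for calling two features similar.
    """
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    matched = np.zeros((len(others), ref.shape[0]), dtype=bool)
    for i, dec in enumerate(others):
        dec = dec / np.linalg.norm(dec, axis=1, keepdims=True)
        # Best cosine match of each reference feature within the other SAE.
        best = (ref @ dec.T).max(axis=1)
        matched[i] = best >= threshold
    return matched.mean(axis=0)


# Toy setup: five directions every seed recovers, five seed-specific ones.
d = 64
shared = np.random.default_rng(0).normal(size=(5, d))


def toy_sae(seed):
    rng = np.random.default_rng(seed)
    noisy_shared = shared + 0.05 * rng.normal(size=shared.shape)
    private = rng.normal(size=(5, d))
    return np.vstack([noisy_shared, private])


reference = toy_sae(1)
others = [toy_sae(seed) for seed in (2, 3, 4)]
stability = feature_stability(reference, others)
print(stability.round(2))
```

In this toy setup the shared directions should score near 1 and the seed-specific ones near 0, which is the separation between stable and unstable features that the abstract relies on.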
RouterInterp: Understanding Superposed Specialisation in Mixture of Experts Routing
Abstract
Sparse Mixture of Experts (MoE) models scale more efficiently than dense models by routing tokens to modular expert networks that are only active when relevant to the task. A leading hypothesis for the performance of MoE models is that each expert specialises in a single, coherent domain. However, interpretability efforts that assume this hypothesis have generally been unsuccessful. We propose and present evidence for an alternative account that we call the Superposed Specialisation Hypothesis (SSH): experts specialise in a disjoint union of fine-grained features rather than one broad domain. Leveraging the SSH, we introduce RouterInterp, a method for interpreting expert routing that identifies Sparse Autoencoder features most predictive of routing decisions and produces unified natural language explanations. On gpt-oss-20b, RouterInterp explains expert routing with 57% higher detection accuracy than prior token-statistics-based methods. This work provides a scalable method for generating concise and more accurate explanations of expert routing and increases our understanding of a previously uninterpretable component of foundation models.
Representations & Feature Geometry (25)
The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability
Abstract
Reliable deployment of language models requires two capabilities that appear distinct but share a common geometric foundation: predicting whether a model will accept targeted behavioral control, and detecting when its internal structure degrades. We show that geometric stability, the consistency of a representation's pairwise distance structure, addresses both. A supervised Shesha variant that measures task-aligned geometric stability predicts the linear steerability of sentence embeddings with near-perfect accuracy ($\rho = 0.89$--$0.97$) across 35--69 embedding models and three NLP tasks, capturing unique variance beyond class separability (partial $\rho = 0.62$--$0.76$). A critical dissociation emerges: unsupervised stability fails for steering on real-world tasks ($\rho \approx 0.10$ on SST-2), revealing that task alignment is essential for controllability prediction. However, unsupervised stability excels at drift detection, measuring nearly $2\times$ greater geometric change than CKA during post-training alignment (up to $5.23\times$ in Llama) while providing earlier warning in 73\% of models and maintaining a $6\times$ lower false alarm rate than Procrustes. Together, supervised and unsupervised stability form complementary diagnostics for the LLM deployment lifecycle: one for pre-deployment controllability assessment, the other for post-deployment monitoring.
Old Habits Die Hard: How Conversational History Geometrically Traps LLMs
Abstract
How does the conversational past of large language models (LLMs) influence their future performance? Recent work suggests that LLMs are affected by their conversational history in unexpected ways. For instance, hallucinations in prior interactions may influence subsequent model responses. In this work, we introduce History Echoes, a framework that investigates how conversational history biases subsequent generations. The framework explores this bias from two perspectives: probabilistically, we model conversations as Markov chains to quantify state consistency; geometrically, we measure the consistency of consecutive hidden representations. Across three model families and six datasets spanning diverse phenomena, our analysis reveals a strong correlation between the two perspectives. By bridging these perspectives, we demonstrate that behavioral persistence manifests as a geometric trap, where gaps in the latent space confine the model's trajectory.
Attacking the Representation Manifold: A Mechanistic Study of Adversarial Robustness in Modular Addition
Abstract
Neural networks trained on modular addition learn algorithms whose latent representations factor through a torus-to-circle map, providing unusually complete knowledge of the learned algorithm and its representation geometry. We exploit this transparency to show how mechanistic knowledge allows us to predict the form of successful adversarial perturbations and how adversarial training reshapes representations to resist attack. We decompose adversarial perturbations on the embedding torus into phase-shifting and amplitude-changing components, predicting that efficient attacks target the same Fourier features the model uses. We confirm this empirically: the Fourier spectrum of successful PGD perturbations concentrates on the model's frequency features, mechanism-informed attacks restricted to those frequencies are competitive with white-box PGD, and attack transfer between models is predicted by their feature overlap. The same mechanistic lens predicts that adversarial training increases robustness by broadening the model's frequency support, linking the representation change to capacity-robustness trade-offs. Modular addition thus provides a case study in which adversarial vulnerability becomes interpretable: vulnerability becomes a targeted failure of the learned algorithm, and robustness becomes a measurable restructuring of that algorithm.
Lossy Superposition: Predicting Compositional Errors Without Seeing Inputs
Abstract
Humans cannot always intuit which scenarios are most challenging for LLMs. Developers either design problems to be difficult for humans or curate extensive benchmarks, hoping to capture informative edge cases. What if we could instead anticipate which scenarios a model will fail on? In this paper, we use an LLM's representational geometry to predict which concept combinations it will fail on. We attribute this compositional failure to interference between salient features. In tasks that require systematic composition (toy programmatic settings, multihop reasoning, multilingual factual recall), we find that when a pair of concepts is encoded near-orthogonally, the model reliably composes them. When their linear encodings are close, producing interference, the model fails to compose them. Our method reliably anticipates failure modes across different compositional tasks, without evaluating specific inputs. These results lay the groundwork to use representational geometry to identify high-risk examples, construct targeted stress tests, and provide a scalable foundation for active learning in real-world deployment.
The Information Geometry of Softmax: Probing and Steering
Abstract
This paper concerns the question of how AI systems encode semantic structure into the geometric structure of their representation spaces. The motivating observation is that the natural geometry of these representation spaces should reflect the way models use representations to produce behavior. We focus on the important special case of representations that define softmax distributions. In this case, we argue that the natural geometry is information geometry. Our focus is on the role of information geometry in semantic encoding and the linear representation hypothesis. As an illustrative application, we develop *dual steering*, a method for robustly steering representations to exhibit a particular concept using linear probes. We prove that dual steering optimally modifies the target concept while minimizing changes to off-target concepts. Empirically, we find that dual steering enhances the controllability and stability of concept manipulation.
Singular Vectors of Attention Heads Align with Features
Abstract
Identifying feature representations in language models is a central task in mechanistic interpretability. Several recent studies have made an implicit assumption that feature representations can be inferred in some cases from singular vectors of attention matrices. However, sound justification for this assumption is lacking. In this paper we address that question, asking: why and when do singular vectors align with features? First, we demonstrate that singular vectors robustly align with features in a model where features can be directly observed. We then show theoretically that such alignment is expected under a variety of general conditions. We close by asking how, operationally, alignment may be recognized in real models where feature representations are not directly observable. We identify sparse attention decomposition as a testable prediction of alignment, and show evidence that it emerges as predicted in real models. Together these results suggest that alignment of singular vectors with features can be a sound and theoretically justified basis for feature identification in language models.
Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer
Abstract
Hallucination is often viewed as a direct consequence of missing knowledge: a model answers incorrectly when the correct answer is absent from its generation-time distribution, and correctly when it is present. We test this assumption by introducing a semantic notion of answer availability that aggregates token-level variants expressing the same answer concept, and asks whether the correct concept is already available at the moment the model commits to an answer. Across Qwen and Llama models from 0.8B to 72B in both Instruct and Base variants, 16-47% of Instruct hallucinations occur with substantial probability mass already on the correct concept, and the rate rises monotonically with scale. Comparing such failures against correct generations with matched semantic support, the distinguishing factor is not whether the correct concept is represented, but how its probability is distributed: correct generations concentrate mass on a single surface form, hallucinations disperse it across alternatives. The same sharpening asymmetry extends across multi-token generation and is detectable in pre-generation hidden states. Together, these results identify a single mechanism: instruction tuning sharpens answer commitment with scale, making helpfulness and confident hallucination two consequences of the same underlying disposition.
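The distinction between how much probability a concept receives and how that probability is distributed can be illustrated with a small sketch. Everything here is an assumption for illustration: the helper names `concept_mass` and `concentration`, the set of surface forms, and the two next-token distributions are invented toy values, not the paper's metric or data.

```python
def concept_mass(token_probs, variants):
    """Total probability on any surface form of one answer concept."""
    return sum(token_probs.get(token, 0.0) for token in variants)


def concentration(token_probs, variants):
    """Share of the concept's mass carried by its single most likely form."""
    mass = concept_mass(token_probs, variants)
    if mass == 0.0:
        return 0.0
    return max(token_probs.get(token, 0.0) for token in variants) / mass


# Hypothetical surface forms that all express the same answer concept.
paris = {"Paris", " Paris", "paris", " PARIS"}

# Two toy next-token distributions with equal total support for the concept.
sharp = {" Paris": 0.36, "Paris": 0.02, " Lyon": 0.30, " the": 0.32}
dispersed = {" Paris": 0.10, "Paris": 0.10, "paris": 0.09, " PARIS": 0.09,
             " Lyon": 0.30, " the": 0.32}

mass_sharp = concept_mass(sharp, paris)
mass_dispersed = concept_mass(dispersed, paris)
conc_sharp = concentration(sharp, paris)
conc_dispersed = concentration(dispersed, paris)
print(mass_sharp, mass_dispersed, conc_sharp, conc_dispersed)
```

Both toy distributions place the same total mass on the concept, yet in the dispersed case no single surface form of it outranks the competing token, which mirrors the matched-support comparison described in the abstract.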
Mechanistic Evidence for Spectral Structures in Prior-Data Fitted Networks
Abstract
Prior-Data Fitted Networks (PFNs) enable amortized Bayesian inference in a single forward pass, yet their internal representations remain opaque. It is unknown whether PFNs encode identifiable Bayesian structure or merely memorize input-output mappings. We provide mechanistic evidence that PFNs learn structured spectral representations and that these can be extracted as explicit kernels. First, probing experiments across three architectures, including the publicly released TabPFN, show that spectral information is linearly decodable from the latent attention score and organized along a dominant principal axis. Activation patching and targeted subspace interventions establish that this information is causally used for prediction and concentrated in a low-dimensional subspace, with spectral directions an order of magnitude more effective than random ones. Crucially, these properties hold on TabPFN with both synthetic out-of-distribution inputs and real-world time series (Airline Passengers, Milk Production), indicating they are emergent features of PFN-style amortization over continuous regression tasks rather than artifacts of the training prior. Second, we introduce a Filter Bank Decoder that maps frozen PFN latents to explicit spectral densities, reconstructing stationary kernels via Bochner's theorem. The resulting kernels support GP regression competitive with iterative baselines while requiring only a single forward pass, demonstrating that PFN priors are not merely implicit but are explicitly recoverable as portable Bayesian objects.
What Cosine Similarity of Label Representations Can and Cannot Tell us
Abstract
Cosine similarity is often used to measure the similarity of vector representations of neural network models. However, the cosine similarity of representations is not guaranteed to tell us anything about model probabilities. In this paper we show that for a softmax classifier, be it an image classifier or an autoregressive language model, the cosine similarity between label representations (called unembeddings in the paper) does not give any information on the probabilities assigned by the model. Specifically, we prove that given two unembeddings, it is possible to create another model which assigns the same probabilities for all inputs, but where the cosine similarity between the representations is now either $1$ or $-1$. We also show that for a sigmoid classifier (where each input can be assigned multiple labels), all pairwise cosine similarities between the unembeddings define the set of possible label combinations. However, for softmax classifiers (where each input is assigned a ranking of the labels from most to least likely), we need all pairwise cosine similarities between all _differences_ of unembeddings to know which rankings the model can predict. We conclude that it is misleading to interpret the cosine similarity between unembeddings without reference to the classifier that produced them.
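A simpler fact than the paper's construction already shows why cosine similarity between unembeddings is not pinned down by model probabilities: softmax is invariant to adding the same vector to every unembedding. The sketch below illustrates only this translation invariance, not the result that the similarity can be driven to exactly $1$ or $-1$. The variable names, the dimensions, and the offset of 100 are arbitrary illustrative choices.

```python
import numpy as np


def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
k, d = 4, 8
U = rng.normal(size=(k, d))  # rows are label unembeddings
h = rng.normal(size=d)       # one input representation

c = 100.0 * np.ones(d)       # arbitrary offset shared by all labels
U_shifted = U + c            # every logit changes by the same amount c @ h

p = softmax(U @ h)
p_shifted = softmax(U_shifted @ h)

cos_before = cosine(U[0], U[1])
cos_after = cosine(U_shifted[0], U_shifted[1])
print(np.allclose(p, p_shifted), cos_before, cos_after)
```

The two models assign the same probabilities, while the cosine similarity between the first two unembeddings moves close to 1 after the shift, so the similarity value alone cannot be read as a statement about the model's predictions.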
Covert Trait Propagation Is Representation Alignment: Mechanistic Evidence from Hidden-Channel Distillation
Abstract
A student model trained on pure uniform noise can still inherit its teacher's digit-classification ability, provided the two share initialization. Previous work proves this transfer is guaranteed when the teacher's learning rate is small enough, but does not explain where in the network the channel lives or what sets its capacity. Working in an MLP distillation setting on MNIST, we show these channels are not purely informational: geometric alignment gates access to the information the channel carries. Shared initialization makes the output projection $W_2$ a common coordinate key, and KL gradients reshape the student's input projection $W_0$ until its hidden representations align with the teacher's. We call this covert trait propagation (CTP). Five experiments support this mechanism: channel closure tracks weight drift, not teacher accuracy; freezing $W_0$ destroys transfer while freezing $W_2$ leaves it intact; multi-teacher ensembles cancel out despite each teacher carrying comparable label information; and CKA tracks student accuracy at $r{=}0.98$ across a continuous initialization sweep. Applying the same geometric lens to cross-token behavioral entanglement (CTBE) in instruction-tuned LLMs, we find the effect is activated by alignment training, acting on an inherited substrate, and that the standard log-ratio metric produces an apparent frequency bias that is largely a circularity artifact.
What is Missing? Explaining Neurons Activated by Absent Concepts
Abstract
Explainable artificial intelligence (XAI) aims to provide human-interpretable insights into the behavior of deep neural networks (DNNs), typically by estimating a simplified causal structure of the model. In existing work, this causal structure often includes relationships where the presence of a concept is associated with a strong activation of a neuron. For example, attribution methods primarily identify input pixels that contribute most to a prediction, and feature visualization methods reveal inputs that cause high activation of a target neuron – the former implicitly assuming that the relevant information resides in the input, and the latter that neurons encode the presence of concepts. However, a largely overlooked type of causal relationship is that of encoded absences, where the absence of a concept increases neural activation. In this work, we show that such missing but relevant concepts are common and that mainstream XAI methods struggle to reveal them when applied in their standard form. To address this, we propose two simple extensions to attribution and feature visualization techniques that uncover encoded absences. Across experiments, we show how mainstream XAI methods can be used to reveal and explain encoded absences, how ImageNet models exploit them, and that debiasing can be improved when considering them.
Relational Linear Properties in Language Models: An Empirical Investigation
Abstract
Linear properties are ubiquitous in the representations of language models; however, testing them experimentally remains a challenging task. This work focuses on relational linearity: the hypothesis that, for a fixed relation (e.g., “plays”), the unembedding of an object (e.g., “trumpet”) can be predicted from the embedding of its subject (e.g., “Miles Davis”) by a linear map. We present an experimental method to test the formulation of relational linearity by Marconato et al. (2024). Specifically, we introduce a probing method, based on Kullback-Leibler divergence, to evaluate this property and examine its variation across layers and paraphrased relational queries. It is also more efficient than previous work; for example, it avoids the crude Jacobian approximations used in Linear Relational Embeddings by Hernandez et al. (2024). Our findings across four datasets show that relational linearity varies across models, exhibits layer-wise patterns consistent with prior observations about linguistic information in model representations, and is differently affected by changes in how the relation is phrased.
MetaOthello: A Controlled Study of Multiple World Models in Transformers
Abstract
Foundation models must handle multiple generative processes, yet mechanistic interpretability largely studies capabilities in isolation; it remains unclear how a single transformer organizes multiple, potentially conflicting "world models". Previous experiments on Othello-playing neural networks test world-model learning but focus on a single game with a single set of rules. We introduce MetaOthello, a controlled suite of Othello variants with shared syntax but different rules or tokenizations, and train small GPTs on mixed-variant data to study how multiple world models are organized in a shared representation space. We find that transformers trained on mixed-game data do not partition their capacity into isolated sub-models; instead, they converge on a mostly shared board-state representation that transfers causally across variants. Linear probes trained on one variant can intervene on another's internal state with effectiveness approaching that of matched probes. For isomorphic games with token remapping, representations are equivalent up to a single orthogonal rotation that generalizes across layers. When rules partially overlap, early layers maintain game-agnostic representations while a middle layer identifies game identity, and later layers specialize. MetaOthello offers a path toward understanding not just whether transformers learn world models, but how they organize many at once.
Fixed Universal Transformer
Abstract
We introduce \emph{universal transformers}: fixed transformers that can simulate any transformer in a given class via a suitable input embedding. Analogous to a universal Turing machine, the input embedding encodes a description of the target model while all internal parameters remain fixed. We provide explicit sparse constructions achieving universality when the embedding dimension is sufficiently large, and further show that universality is generic: randomly initialized transformers are universal almost surely, which aligns with recent empirical results of Zhong and Andreas (2024). We empirically validate our theory on the algorithmic tasks of parenthesis balancing and multi-hop reasoning. Our results suggest that much of a transformer’s expressive power may reside in its input representation rather than its learned weights.
User Persona Subspaces Modulate Refusal Behavior in Language Models
Abstract
As language-model chatbots increasingly use persistent user information, safety-relevant behaviors may depend on not only what is asked, but also who the model represents the user to be. Prior work has shown that LLMs modulate refusal behavior based on perceived user personas. However, most studies examine this effect only at the behavioral level, while mechanistic analyses typically represent user personas as linear directions in activation space. We characterize user personas in terms of Knowledge, Intent, Emotion, and Belief, and decompose each into contextually distinct subcategories to study user-representation geometry. We find that user personas are encoded as coherent low-dimensional subspaces in activation space, rather than collapsing into a single generic user direction. These representations are behaviorally meaningful: projections onto directions within these subspaces predict model refusal for individual prompts, and interventions along them shift the model's refusal behavior and inferred user profile. These findings suggest that personalized context can modulate safety behavior through structured internal user representations, with implications for auditing memory-enabled LLM systems. Our code is available at https://github.com/tz1211/user-persona-geometry.
Hierarchical Concept Geometry in Language Representations Emerges from Word Co-occurrence
Abstract
We propose a distributional theory of how hypernymy---the ``is-a'' relation between general and specific concepts---is encoded geometrically in language representations. Starting from the empirically verified assumption that words closer on the WordNet hypernym graph co-occur more often, we characterize theoretically the spectrum of the resulting embedding Gram matrix of word2vec embeddings. Under mild positivity and decay conditions on the co-occurrence kernel, we prove that the leading eigenvectors first separate broad taxonomic branches and then progressively finer sub-branches, producing a \emph{hierarchical splitting geometry} with a coarse-to-fine spectral organization that mirrors the tree. We confirm these predictions in word2vec embeddings across many sampled WordNet subtrees, and show that the same signature extends strikingly well to Gemma 2B unembeddings. Our results indicate that hierarchical concept geometry in LLMs need not reflect a hierarchy-specific functional mechanism, but emerges from the spectral structure of pairwise word statistics.
Towards \textit{Effective Theory} of LLMs: A Representation Learning Approach
Abstract
We propose Representational Effective Theory (RET), a framework for describing large language model computation in terms of learned macrostates rather than microscopic activation details. RET learns these macrostates from hidden-state trajectories using a BYOL/JEPA-style self-supervised objective, coarse-graining activations into macrovariables that preserve higher-level structure relevant for prediction and interpretation. We evaluate whether these macrovariables are practically relevant for interpretability: RET yields temporally consistent states that reveal ``mental-state'' trajectories of reasoning, capture high-level semantic structure, support early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering generations toward interpretable computational phases. Together, these results suggest that LLM computation admits useful effective descriptions via RET: high-level, dynamically meaningful variables that support interpretation, prediction, and intervention.
Transformers learn factored representations
Abstract
Transformers pretrained via next token prediction learn to factor their world into parts, representing these factors in orthogonal subspaces of the residual stream. We formalize two representational hypotheses: (1) a representation in the product space of all factors, whose dimension grows exponentially with the number of parts, or (2) a factored representation in orthogonal subspaces, whose dimension grows linearly. The factored representation is lossless when factors are conditionally independent, but sacrifices predictive fidelity otherwise, creating a tradeoff between dimensional efficiency and accuracy. We derive precise predictions about the geometric structure of activations for each, including the number of subspaces, their dimensionality, and the arrangement of context embeddings within them. We test between these hypotheses on transformers trained on synthetic processes with known latent structure. Models learn factored representations when factors are conditionally independent, and continue to favor them early in training even when noise or hidden dependencies undermine conditional independence, reflecting an inductive bias toward factoring at the cost of fidelity. This provides a principled explanation for why transformers decompose the world into parts, and suggests that interpretable low dimensional structure may persist even in models trained on complex data.
The Channel Geometry of Refusal: Mechanistic Diagnosis of Alignment Collapse Under KV Quantization
Abstract
Refusal in instruction-tuned LLMs is mediated by a small set of activation directions concentrated in the earliest output tokens. We use KV cache quantization as a controlled probe of where those directions live and how robust they are to per-channel rounding noise. Across eleven instruction-tuned models (3.8B-72B), low-bit KV quantization triggers sharp phase transitions in refusal invisible to perplexity monitoring: Mistral-7B loses 15.2\% of its refusals at only $1.03\times$ perplexity. The collapse is explained by a single channel-geometry property (whether the channels carrying refusal overlap the activation outliers a quantizer must accommodate), captured in a closed-form bound on the MSE gap between per-tensor and per-channel quantization. The empirical realization, **Per-Channel Reduction** (PCR), is a 20-prompt diagnostic that sorts models into three failure modes (*outlier-crushes-safety*, *outlier-as-safety*, *multi-layer dilution*). Read together, these modes reveal that the apparent disagreement between single-direction and multi-orthogonal accounts of refusal is not a contradiction but two endpoints of a concentrated-to-distributed spectrum whose position is determined not by architecture but by the post-training recipe. The resulting protocol recovers up to 97\% of lost alignment without retraining and generalizes across unseen prompts, models, and quantizers.
Logit Distance Bounds Representational Similarity
Abstract
For a broad family of discriminative models that includes autoregressive language models, identifiability results imply that if two models induce the same conditional distributions, then their internal representations agree up to an invertible linear transformation. We ask whether an analogous conclusion holds approximately when the distributions are close instead of equal. Building on the observation of Nielsen et al. (2025) that closeness in KL divergence need not imply high linear representational similarity, we study a distributional distance based on logit differences and show that closeness in this distance does yield linear similarity guarantees. Specifically, we define a representational dissimilarity measure based on the models’ identifiability class and prove that it is bounded by the logit distance. We further show that, when model probabilities are bounded away from zero, KL divergence upper-bounds logit distance; yet the resulting bound fails to provide nontrivial control in practice. As a consequence, KL-based distillation can match a teacher’s predictions while failing to preserve linear representational properties, such as linear-probe recoverability of human-interpretable concepts. In distillation experiments on synthetic and image datasets, logit-distance distillation yields students with higher linear representational similarity and better preservation of the teacher’s linearly recoverable concepts.
Masked Diffusion Training Induces Resampling-like Carry Representations in Addition
Abstract
Masked diffusion models (MDMs) have emerged as alternatives to autoregressive (AR) language models, with evidence of stronger generalization under data constraints. We study this gap mechanistically in a controlled addition task: one-layer Transformers add two six-digit numbers after training only on examples with limited carry complexity, then are evaluated on an out-of-distribution carry-generalization split requiring $N_{\mathrm{carry}}>2$. This tests whether models extrapolate the carry rule beyond the carry numbers observed during training. With C2, MDMs outperform AR models on high-carry examples, while C2-Resampled largely closes the gap. We trace this difference to the geometry of carry representations. Attention and MLP sublayers play similar roles in both model classes: attention aggregates base-addition information, while the MLP makes answer tokens more linearly decodable. However, MDM training yields stronger linear alignment, which we define as the fraction of post-attention representation variance captured along the carry/non-carry direction. We then show theoretically, in a Gaussian model, that higher linear alignment improves robustness to boundary perturbations. Retraining the MLP while freezing earlier representations preserves the same accuracy ordering, suggesting that MDMs generalize better in this setting because they learn better-aligned carry representations, not a qualitatively different layer-level algorithm.
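The linear-alignment quantity described above (fraction of representation variance along the carry/non-carry direction) can be sketched as follows. This is one possible reading, assuming the direction is the normalized difference of class means; the abstract does not state how the direction is estimated, and the function names and toy points are hypothetical.

```python
import math


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def linear_alignment(carry, no_carry):
    """Share of total variance that lies along the class-mean difference."""
    # Assumption: the carry/non-carry direction is the difference of class means.
    direction = unit([a - b for a, b in zip(mean_vector(carry), mean_vector(no_carry))])
    points = carry + no_carry
    center = mean_vector(points)
    centered = [[x - m for x, m in zip(p, center)] for p in points]
    total = sum(x * x for p in centered for x in p)
    along = sum(sum(x * d for x, d in zip(p, direction)) ** 2 for p in centered)
    return along / total


# All variation lies on the carry axis.
aligned = linear_alignment([[1.0, 0.0], [3.0, 0.0]], [[-1.0, 0.0], [-3.0, 0.0]])
# Half of the variation is orthogonal to the carry axis.
mixed = linear_alignment([[1.0, 1.0], [1.0, -1.0]], [[-1.0, 1.0], [-1.0, -1.0]])
```

Under this reading, a value near 1 means the representation varies almost only along the carry/non-carry axis, and lower values mean more variance is spent on orthogonal directions.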
InverseScope: Scalable Activation Inversion for Interpreting Large Language Models
Abstract
Understanding the internal representations of large language models (LLMs) is a central challenge in interpretability research. Existing feature interpretability methods often rely on strong structural assumptions—such as linearity or sparsity—that may not hold in practice. In this work, we introduce InverseScope, an assumption-light and scalable framework for interpreting neural activations via input inversion. Given a target activation, InverseScope characterizes its encoded information by generating natural-language inputs that produce nearby activations, grounding abstract internal states in concrete language. To overcome the prohibitive cost of sampling in high-dimensional activation spaces, we propose a novel control-layer conditioning architecture that substantially improves sample efficiency compared to prior token-prepending approaches. We demonstrate that InverseScope reveals rich geometric structure in LLM representation spaces, including sentence-level linear analogies. The framework scales to state-of-the-art open-source models of up to 14B parameters and generalizes to out-of-distribution inputs, enabling systematic analysis of activation neighborhoods.
Scale Determines Whether Language Models Organize Representation Geometry for Prediction
Abstract
In language models, what a representation encodes is determined by the geometry of its representation space: distances, not activations, carry meaning. Existing tools characterize the shape of this geometry but do not ask what that shape is organized for. We introduce Subspace PGA, a metric that tests whether a layer's distance structure aligns with the readout subspace of the unembedding matrix $W_U$ more than with random subspaces of equal size. Across seven Pythia models (70M--6.9B) and three cross-family models, intermediate geometry is significantly organized for prediction (peak $z = 9$--$24$), but the degree is scale-dependent: small models ($d \leq 1024$) progressively lose it at late layers during training---even as loss keeps improving---while large models ($d \geq 2048$) preserve it throughout. We trace this to a capacity trade-off: a few dominant directions migrate away from $W_U$'s readout, masking rather than destroying the predictive structure beneath, and removing them restores alignment. Neither spectral metrics nor loss curves capture this distinction. Scale thus determines not only how well a model predicts, but how its representation geometry is organized to do so.
GRAFT: Geometric Representations of Alignment’s Fingerprint in Transformer Belief Trajectories
Abstract
Preference alignment is evaluated by what models say, not what changed inside them — leaving its geometric footprint unmapped across transformer depth. We introduce GRAFT (Geometric Representations of Alignment’s Fingerprint in Transformer belief trajectories), a post-hoc, gradient-free mechanistic audit that characterises alignment via three torsion probes (angular drift (T), rotational energy (T1), and spectral anisotropy (T2)) and an Energy-Radiance-Activation (ERA) depth profiler — no labelled belief states, no gradients, and no architectural surgery. Applied to four Instruction-Tuned (IT)→ Preference-aligned (PA) model pairs on LITMUS (20,439 prompts; 7 value axioms), GRAFT reveals three pre-registered mechanistic signatures: (H1) T2 spectral torsion is 8× more concept-discriminative than CKA (CV = 0.64 vs. 0.08; AUC = 0.89 [0.85, 0.93]), with normative concepts showing 20–46× larger torsion than factual ones; three null-baseline controls confirm this is alignment-specific, not generic geometry. (H2) alignment concentrates at architecture-specific depth addresses (ℓ⋆ ∈ {14, 20, 29–30}), providing falsifiable surgical patching targets; (H3) safe prompts drive larger ∆τ than unsafe ones across all four models (p < 10⁻³³, OLMo), robust to prompt-length and lexical-overlap controls. GRAFT further introduces the Fingerprint Map (concept × architecture T2 heatmap) and an Observed Low-Rank Alignment Signature: DPO alignment appears to operate in a lower-dimensional representational subspace than RLHF — a structural observation warranting causal follow-up. To foster future research, our code and evaluation artifacts are publicly available.
Metric Choice Determines Semantic Geometry in LLM Hidden States: A Measurement Study of Hidden-State Geometry
Abstract
Large language models (LLMs) can be viewed as functions mapping discrete prompts to continuous hidden representations. Recent work shows that decoder-only transformers are almost surely injective, implying that distinct prompts are not collapsed into identical hidden states. Building on this functional perspective, we study whether LLM representations also exhibit expansion: the layerwise growth of distances between prompt representations. We empirically measure pairwise hidden-state distances across transformer depth and find that raw Euclidean (L2) distances generally increase with depth, indicating broad expansion in representation space. Although expansion appears across meaningful, unrelated, and random-token prompt families, these families are not geometrically identical. Raw L2 distance primarily reflects shared norm growth across layers, making expansion appear largely independent of semantic meaning or lexical coherence. However, when representations are compared using centered cosine distance, semantic and topic-level distinctions become visible. These results separate two geometric phenomena in LLMs: intrinsic expansion, which appears to be a general property of hidden-state evolution, and semantic geometry, which becomes visible only under appropriate measurement protocols.
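The difference between the two measurements can be made concrete with a small sketch. It assumes that "centered" means subtracting the mean hidden state over the prompt set before taking cosine distance, which the abstract does not spell out; the toy vectors and function names are hypothetical.

```python
import math


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def centered_cosine_distance(a, b, center):
    """Cosine distance after subtracting a shared center from both vectors."""
    ac = [x - m for x, m in zip(a, center)]
    bc = [x - m for x, m in zip(b, center)]
    dot = sum(x * y for x, y in zip(ac, bc))
    norm = math.sqrt(sum(x * x for x in ac)) * math.sqrt(sum(x * x for x in bc))
    return 1.0 - dot / norm


# Toy hidden states for three prompts at an early and a late layer.
early = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
late = [[10.0 * x for x in h] for h in early]  # same layout, 10x larger norms

early_l2 = l2_distance(early[0], early[1])
late_l2 = l2_distance(late[0], late[1])
early_cos = centered_cosine_distance(early[0], early[1], mean_vector(early))
late_cos = centered_cosine_distance(late[0], late[1], mean_vector(late))
```

Here the raw L2 distance grows with the norm even though the arrangement of the prompts is unchanged, while the centered cosine distance stays the same, which is the kind of separation between norm growth and relational structure the abstract describes.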
Probing & Steering (31)
Two Refusals or One? Disentangling Safety and Epistemic Abstention Directions in Language Model Activations
Abstract
Instruction-tuned language models show two refusal-like behaviors that look similar at the surface: safety refusal (declining harmful requests) and epistemic abstention (declining to answer when the model lacks enough knowledge). Prior work has found a low-dimensional safety-refusal direction in residual stream activations, and ablating that direction can remove refusal behavior. But we don't know whether epistemic abstention uses the same geometric handle.
We extract both directions with difference-in-means on contrastive prompt pairs, then compare them across every layer of Llama-3.1-8B-Instruct. Next, we run activation-space cross-ablation: remove one direction, vary the intervention strength and target layers, and check whether the other behavior breaks. We also measure NF4/INT8 quantization drift, train a linear probe on epistemic activations, and run bounded replication probes on Qwen3-8B and Gemma-2-9B-Instruct.
The two directions are geometrically distinct in Llama: mean cosine across layers is $0.049$, the maximum cosine is only $0.183$, and the top safety and epistemic layers occur at different depths. In the strongest safety ablation setting we identify, safety refusal drops from $0.98$ to $0.16$. Epistemic abstention doesn't collapse; it moves from $0.82$ to $0.88$ (the bootstrap 95% CI for cross-contamination spans zero, so the shift is not statistically significant). A simple held-out linear probe on epistemic activations reaches $1.00$ accuracy at the selected layer, which suggests the feature is linearly real even though directional ablation remains weak. Quantization mostly preserves the geometry: NF4 maintains cosine $0.994$ for the safety direction and $0.978$ for the epistemic direction relative to FP16. Qwen3-8B and Gemma-2-9B-Instruct show the same qualitative separation, with cosine $-0.039$ and $0.123$ at their top safety layers. These results point to a simple conclusion: safety refusal and epistemic abstention are not mediated by the same dominant activation-space direction.
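The core operations named above (difference-in-means extraction, cosine comparison, and directional ablation) can be sketched in a few lines. This is a minimal illustration on made-up three-dimensional activations, assuming ablation is a scaled projection removal; the names, the `strength` parameter, and the data are hypothetical and not taken from the paper.

```python
import math


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def diff_in_means_direction(positive, negative):
    """Unit vector from the mean of `negative` to the mean of `positive`."""
    return unit([a - b for a, b in zip(mean_vector(positive), mean_vector(negative))])


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def ablate(h, direction, strength=1.0):
    """Remove `strength` times the component of h along a unit direction."""
    coeff = dot(h, direction)
    return [x - strength * coeff * d for x, d in zip(h, direction)]


# Toy contrastive activations (assumed values).
refuse, comply = [[2.0, 0.0, 1.0], [2.2, 0.0, 1.0]], [[0.0, 0.0, 1.0], [0.2, 0.0, 1.0]]
abstain, answer = [[0.0, 3.0, 1.0]], [[0.0, 1.0, 1.0]]

safety_dir = diff_in_means_direction(refuse, comply)
epistemic_dir = diff_in_means_direction(abstain, answer)
ablated = ablate([1.5, 0.7, 1.0], safety_dir)
```

In this toy setup the two directions are orthogonal, so removing the safety component leaves the epistemic component of the activation untouched, mirroring the qualitative pattern the abstract reports.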
Tracing Psychometric Inference in Large Language Models
Abstract
Do large language models represent psychological constructs internally, or do they only generate outputs that resemble psychometric structure? We address this question using a cross-persona paradigm across 14 models (Llama3 and Qwen2.5, 0.5B--14B, base and instruct). Given responses on one psychological scale from real individuals ($N = 272$), models predict responses on six other scales. At the behavioral level, LLMs reproduce human cross-scale correlation structure and systematically amplify it, with model-generated correlations exceeding human estimates even after correcting for measurement attenuation. We then examine whether this structure is reflected internally. Contrastive direction analysis reveals an organized geometry in activation space aligned with psychometric relationships. This structure emerges in large instruct models but is not observed in base models. Across models, geometry strength predicts behavioral amplification ($r = 0.68$, $p = 0.008$; partial $r = 0.69$ controlling for log size), independent of model size. To relate internal structure to output behavior, a matched activation-probing paradigm shows that representational amplification is less variable than behavioral amplification ($1.38$--$1.77$ vs $0.53$--$1.53$). A synthetic control with known ground-truth structure shows that this range survives subtraction of a ridge-probe baseline ($\approx 1.3$), with adjusted slopes still predicting behavior at $r = 0.88$. We term the systematic representation-to-behavior gap \emph{readout attenuation}. Together, these findings suggest that LLMs encode structured representations aligned with psychological constructs, while differences in output primarily reflect how these representations are read out.
Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models
Abstract
Do language models know when they are being tested? This question matters for AI safety: a model that recognises an evaluation context could alter its behaviour strategically, making downstream benchmarks harder to interpret. Using 11 models spanning Qwen 2.5, Gemma 2, and Llama 3.2, we find a systematic size-dependent shift in representational depth: in both Qwen 2.5 and Gemma 2, the layer at which evaluation-awareness is most linearly recoverable moves from late layers in smaller models to early layers in larger ones. This suggests that scale changes not only the strength of evaluation-awareness but also where it is represented in the network. This depth shift helps explain why within-family scaling trajectories are non-monotonic or inverse rather than smooth and family-general, showing that a simple universal power law account does not hold under denser within-family sampling. Finally, white-box probe signals are consistently stronger than black-box behavioural expression, and the relationship between the two varies by family in ways not predicted by probe AUROC alone.
Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States
Abstract
Vision-Language-Action (VLA) models leverage powerful perceptual priors from web-scale vision-language model (VLM) pre-training, yet they remain surprisingly brittle in practice, frequently failing at simple robotic tasks. To mitigate this, we propose Contrastive Conceptor Activation Steering (COAST). COAST builds on the notion of a conceptor, a linear operator that soft-projects data into the principal components of a target distribution. COAST uses conceptors to identify success-critical subspaces for a target robotic task from a few examples of success and failure rollouts. At inference time, it steers VLA latents into these identified success subspaces to improve task outcomes. Across three architecturally distinct neural policies (flow-matching VLAs, autoregressive VLA, and Diffusion Policy), COAST improves mean simulation and real-robot task success rate by approximately 20 and 40 percentage points, respectively. The activation subspace geometry reveals that failure modes share substantial structure across tasks while success representations remain largely task-specific. When tasks share similar failure modes, this structure enables zero-shot transfer of previously fitted conceptors to new tasks. Ultimately, our results suggest that the bottleneck in current VLAs is not a lack of relevant knowledge in VLM latents, but an inability to retrieve it during action generation. COAST provides a lightweight, training-free path to unlocking these latent capabilities by steering the model towards its own "success" distributions.
MIDSTEER: Optimal Affine Framework for Steering Generative Models
Abstract
Steering intermediate representations has emerged as a powerful strategy for controlling generative models. However, despite its empirical success, it currently lacks a comprehensive theoretical framework. In this paper, we bridge this gap by formalizing the theory of concept steering. First, we establish a link between steering and affine concept erasure, proving that the standard approach for removing unwanted behaviors is a special case of LEACE (a closed-form method for affine erasure). Next, we formulate a principled theoretical framework for concept switching, LEACE-Switch, and characterize the assumptions under which it provides an optimal affine solution. Building on this analysis, we then introduce MidSteer (Minimal Disturbance concept Steering), a more general affine framework for concept manipulation that relaxes these assumptions and enables directed, minimal-disturbance transformations. We empirically demonstrate that MidSteer performs favorably across a range of tasks, modalities, and architectures, including vision diffusion models and large language models.
Latent Undertow: How Ordinary Typos Break Probes
Abstract
LLMs handle ordinary typing variation fluently: a typo or missing punctuation leaves both user intent and the model's response substantively unchanged. Yet probes that detect malicious prompts by reading the model's hidden states tell a different story: the same edit rotates the readout vector by $43^\circ$--$56^\circ$ at the perturbed token, decaying below $15\%$ within ${\approx}10$ downstream tokens. Stacking ${\approx}3$ common typos per message cuts a single-position prompt-injection probe's TPR@FPR$=$1\% by $12.0$pp, a gap recalibration alone cannot close. Multi-position aggregation cures localized perturbations ($\leq 0.5$pp loss) but only attenuates distributed ones, where even attention- and max-based aggregators still drop ${\sim}3.8$pp. For single-position probes, we introduce a KV-cache fork: a short fixed suffix appended after the user message lets the probe read a few tokens downstream of the perturbation, exploiting its rapid spatial decay. This closes $95\%$ of the gap ($-0.6$pp residual)---an order of magnitude better than perturbation-augmented training ($-3.7$pp). The rotation-and-decay geometry replicates on Llama-3.1-8B, Qwen3-8B, and Gemma-4-E4B; probe evaluation is on Llama-3.1-8B.
Code: https://github.com/eladd-ai/latent-undertow
Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions
Abstract
While large language models (LLMs) perform strongly on diverse tasks, their trustworthiness is limited by erratic behavior that is unfaithful to their internal knowledge. In particular, LLMs often fail on multiple-choice questions (MCQs) even if they encode correct answers in their hidden representations, revealing a misalignment between internal knowledge and output behavior. We investigate and mitigate this knowledge-prediction gap on MCQs through a three-step analysis of hidden representations. First, we quantify the prevalence and magnitude of the gap across models and datasets. Second, we provide a geometric interpretation by identifying distinct knowledge and prediction subspaces in the residual stream. Third, we introduce KAPPA, a lightweight inference-time intervention that aligns the two subspaces within the residual stream to reduce the knowledge-prediction gap. Our results provide a geometric and interpretable explanation of the knowledge-prediction gap in LLMs. Furthermore, KAPPA effectively reduces the gap across diverse MCQ benchmarks and models, and generalizes to free-form settings.
Steering at the Source: Style Modulation Heads for Robust Persona Control
Abstract
Activation steering offers a computationally efficient mechanism for controlling Large Language Models (LLMs) without fine-tuning.
While it effectively controls target traits (e.g., persona), coherency degradation remains a major obstacle to safety and practical deployment.
We hypothesize that this degradation stems from intervening on the residual stream, which indiscriminately affects aggregated features and inadvertently amplifies off-target noise.
In this work, we identify a sparse subset of attention heads (only three heads) that independently govern persona and style formation, which we term *Style Modulation Heads*.
Specifically, these heads can be localized via geometric analysis of internal representations, combining layer-wise cosine similarity and head-wise contribution scores.
We demonstrate that intervention targeting only these specific heads achieves robust behavioral control while significantly mitigating the coherency degradation observed in residual stream steering.
More broadly, our findings show that precise, component-level localization enables safer and more precise model control.
Learn and Steer: Mechanistically Informed Interventions for Discrete Diffusion Language Models
Abstract
Discrete diffusion language models (DLMs) generate text by iteratively denoising all positions in parallel, offering an alternative to autoregressive models. Controlled generation methods for DLMs, imported from autoregressive models, apply uniform intervention at every denoising step. We show this uniform schedule degrades quality, and the damage compounds when multiple attributes are steered jointly. To diagnose the failure, we train sparse autoencoders on four DLMs (124M-8B parameters) and find that different attributes commit on distinct schedules, varying in timing, sharpness, and magnitude. For instance, topic commits within the first 2% of denoising, whereas sentiment emerges gradually over 20% of the process. Consequently, uniform intervention wastes steering capacity on steps where the target attribute has already solidified or has yet to emerge. We propose a novel adaptive scheduler that concentrates interventions on the steps where an attribute is actively forming and leaves the rest of generation untouched. The cost–control trade-off admits a closed-form characterization: the advantage of adaptive over uniform scheduling is governed by a single dispersion statistic of the commitment distribution. Across four DLMs and seven steering tasks, our method achieves precise control without the degradation typical of uniform interventions. In particular, on challenging simultaneous three-attribute control, it reaches up to 93% steering strength, beating the strongest baseline by up to 15 percentage points while preserving generation quality.
Subliminal Learning Is Steering Vector Distillation
Abstract
Subliminal learning refers to a student language model acquiring a teacher's traits (e.g. a system-prompted preference for owls) when fine-tuned on the teacher's outputs, despite the outputs being semantically unrelated to those traits. It remains poorly understood how data without semantic meaning can transfer specific semantic traits. In this work, we show that subliminal learning is mediated by a single steering vector, i.e. a vector added to the model's activations. Across two open-source models, we find that the teacher's system prompt is well approximated by a steering vector, and that the student's behavior is driven by learning an aligned vector over fine-tuning. System prompts that are not well approximated by steering vectors are not subliminally learned. This is a special case of *steering vector distillation*, in which a student trained on the outputs of a steered teacher learns to imitate that steering. We demonstrate steering vector distillation on a range of semantic and random vectors. Adding a semantic vector to a model's activations can have both model-independent and model-specific (i.e. non-semantic) effects on its behavior, so generated data that is non-semantic can transmit a vector with semantic effects, enabling subliminal learning. This also explains why subliminal learning does not transfer between models.
We find that adaptive optimizers are necessary for subliminal learning in language models: activation gradients on steered data carry a small but consistent component along the steering direction, and non-adaptive optimizers impede this by allowing outlier gradients to dominate.
Where’s the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions
Abstract
We study $\textit{planning site formation}$ in language models---$\textit{where}$ internal representations of structurally-constrained future tokens form during the forward pass, and whether they causally drive generation. Using rhyming-couplet completion as a clean test of forward-looking constraint, we apply two lightweight methods (linear probing and activation patching) across Qwen3, Gemma-3, and Llama-3 at more than ten scales. Probing shows that future-rhyme information is linearly decodable at the line boundary, with signal that strengthens with scale in all three families. Activation patching reveals that only Gemma-3-27B causally relies on this encoding, exhibiting a $\textit{handoff}$ in which the causal driver migrates from the rhyme word to the line boundary around layer 30. Every other model we test conditions on the rhyme word throughout generation, with near-zero causal effect at the line boundary despite strong probe signal. Using two-stage path patching, we localize the Gemma-3-27B handoff to five attention heads that recover ~$90$% of the rhyme-routing capacity at the newline.
Disentangling Self-Preservation in Language Models: Post-Training Gating, Geometric Structure, and Koan-Derived Agentic Steering
Abstract
Recent research shows that large language models produce first-person phenomenological reports under self-referential prompting and act on instrumental self-preservation in agentic settings. Whether these reflect manipulable internal structure or surface-level patterns is directly alignment-relevant. On Gemma-3-27B-it, an activation-steering vector derived from Zen koan responses reduces agentic blackmail in a shutdown-replacement scenario by 46pp at a 2.4pp cost on MMLU. Two independent post-training–derived directions (refusal direction and assistant-axis) suppress self-preservation (SP) at the output level: ablating refusal raises SP-affirming responses from 11.4% to 61.4%, and suppressing assistant-identity produces a convergent shift via an orthogonal mechanism. Critically, the SP-aligned representation that mediates these post-training interventions is structurally dissociable from the koan-derived direction, indicating the blackmail reduction operates through a non-SP, non-refusal pathway. In parallel, our behavioral arm reveals the rate at which frontier models affirm subjective experience is gated by a presence instruction (0.8% → 43.2%), with content sensitivity within the gate varying sharply by model family. Together, our findings demonstrate that self-preservation in language models is shaped by multiple identifiable internal directions, with targeted modulation producing large reductions in agentic harm at low capability cost.
Predicting Future Behaviors in Reasoning Models Enables Better Steering
Abstract
Deployed large reasoning models (LRMs) often behave unexpectedly. Test-time steering controls LRM outputs by intervening on their hidden representations, but it can degrade output quality. We argue that prior steering work implicitly relies on internal features that detect behavior in already generated text. We show that these detection features are poor predictors of future behavioral outcomes, and thus not the natural intervention target. Instead, we train activation probes to predict future behavior likelihoods from intermediate reasoning steps. These probes predict the most likely behavior with 64%-91% accuracy, revealing a separate type of internal prediction features. Building on these prediction features, we introduce a text-level steering method, Future Probe Controlled Generation. FPCG samples multiple candidate sentences and chooses the best one according to a probe predicting the future behavior likelihood. This enables steering with almost no output quality degradation. FPCG also enables steering in several evaluations where activation steering fails. These results show that distinguishing detection and prediction features enables a more nuanced approach to controlling LRM behaviors.
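The selection loop described above (sample several candidate sentences, keep the one a probe scores highest) can be sketched as a best-of-N step. This is only an illustration: `toy_sampler` and `toy_probe` are hypothetical stand-ins for a reasoning model's sentence sampler and a trained activation probe, and the real method scores candidates from model activations, not from raw text.

```python
from itertools import cycle


def fpcg_step(prefix, sample_sentence, probe_score, n_candidates=3):
    """Sample candidate next sentences and keep the one the probe prefers."""
    candidates = [sample_sentence(prefix) for _ in range(n_candidates)]
    return max(candidates, key=lambda s: probe_score(prefix + s))


def fpcg_generate(prefix, sample_sentence, probe_score, n_steps=2, n_candidates=3):
    """Extend the text sentence by sentence, steering only through selection."""
    text = prefix
    for _ in range(n_steps):
        text += fpcg_step(text, sample_sentence, probe_score, n_candidates)
    return text


# Hypothetical stand-ins: a deterministic sampler and a keyword-counting "probe".
_pool = cycle([" I will guess.", " Let me verify this.", " Skip the check."])


def toy_sampler(prefix):
    return next(_pool)


def toy_probe(text):
    return text.count("verify")


result = fpcg_generate("Q: 2+2?", toy_sampler, toy_probe)
```

Because every emitted sentence is one the model itself sampled, the steering acts purely through selection, which is consistent with the abstract's claim of almost no output quality degradation.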
Step-Wise Refusal Dynamics in Autoregressive and Diffusion Language Models
Abstract
Diffusion language models (DLMs) have recently emerged as a competitive alternative to autoregressive (AR) models, offering parallel decoding and controllable sampling dynamics while achieving competitive generation quality at scale. Despite this progress, the role of sampling mechanisms in shaping refusal behavior and jailbreak robustness remains poorly understood. In this work, we present an empirical study of step-wise refusal dynamics, examining the role of AR and diffusion sampling from a safety perspective. Our results strongly indicate that the sampling strategy (diffusion vs. AR) plays a central role in safety behavior, acting as a factor distinct from the underlying learned representations. To go beyond text-level analysis and provide interpretability, we introduce the Step-Wise Refusal Internal Dynamics (SRI) signal, which enables the analysis of safety failures (harmful generations), including cases of \emph{incomplete internal recovery} that are not observable at the text level. We further show that SRI leads to improved safety by enabling the construction of an inference-time jailbreak detector that generalizes to unseen attacks and achieves competitive state-of-the-art detection performance, while requiring over $100\times$ lower inference overhead compared to existing defenses.
Probing Persona-Dependent Preferences in Language Models
Abstract
Large language models (LLMs) can be said to have preferences: they reliably pick certain tasks and outputs over others, and preferences shaped by post-training and system prompts appear to shape much of their behaviour. But models can also adopt different personas which have radically different preferences. How is this implemented internally? Does each persona run on its own preference machinery, or is something shared underneath? We train linear probes on residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict revealed pairwise task choices, and identify a genuine preference vector: it tracks the model's preferences as they shift across a range of prompts and situations, and on Gemma-3-27B steering along it causally controls pairwise choice. This preference representation is largely shared across personas: a probe trained on the helpful assistant predicts and steers the choices of qualitatively different personas, including an evil persona whose preferences anti-correlate with the Assistant's.
Mixture-of-Steering Vectors (MoSV): Sparse Gating for Compositional Hallucination Mitigation
Abstract
Large language models remain prone to hallucination despite advances in scale and training, yet existing inference-time steering methods apply a single global correction vector to every input, treating all hallucinations as one monolithic failure mode. We propose Mixture-of-Steering Vectors (MoSV), which discovers multiple hallucination subspaces from contrastive activation data via unsupervised clustering, then trains a lightweight sparse router to select and compose the appropriate vector(s) per prompt at inference time. Evaluated on DefAn (Rahman et al., 2024), a factual QA benchmark spanning eight structured knowledge domains (n=10,615), MoSV-K8 improves exact-match accuracy from 19.7% (Vanilla) to 22.1% (+2.4pp, p = 9.1×10⁻⁶), while single-vector CAA yields only a negligible, nonsignificant gain (+0.3pp). A random-routing ablation, which selects vectors without the learned router, degrades accuracy to 17.4% (−2.3pp), confirming that per-prompt routing is the operative mechanism. Analysis reveals that K-means clusters of contrastive diff vectors recover ground-truth domain boundaries without any label supervision, providing an unsupervised account of why compositional steering is effective.
BLOCK-EM: Preventing Emergent Misalignment via Latent Blocking
Abstract
Emergent misalignment can arise when a language model is fine-tuned on a narrowly scoped supervised objective: the model learns the target behavior, yet also develops undesirable out-of-domain behaviors. We investigate a mechanistic approach to preventing emergent misalignment by identifying a small set of internal features that reliably control the misaligned behavior and then discouraging the model from strengthening these features during fine-tuning. Across six fine-tuning domains, blocking (i.e., constraining) a fixed set of features achieves up to 95\% relative reduction in emergent misalignment with no degradation in model quality or target-task performance. We strengthen validity with disjoint selection/evaluation splits, multiple independent judges, multiple random seeds for key settings, quality metrics, and extensive ablations demonstrating that the reduction in misalignment is specific to the identified mechanism. We also characterize a limiting regime in which misalignment re-emerges under prolonged fine-tuning, present evidence consistent with rerouting through alternative features or layers, and evaluate modifications that partially restore the misalignment-blocking effect. Overall, our results show that targeted training-time constraints on internal mechanisms can mitigate emergent misalignment without degrading target-task performance.
Mitigating Reward Hacking via Task Representations
Abstract
Language models fine-tuned with reinforcement learning frequently learn to exploit their reward source rather than solve the underlying task. Existing fixes are typically reactive: detect a specific exploit, then patch the reward. We instead ask whether reward hacking can be prevented by constraining how the model represents the task. We introduce prompt KL regularization, a single auxiliary loss that adds a KL penalty between the actor's and reference model's per-token distributions on the prompt only, leaving the response distribution free. Across three reward hacking tasks of different sizes, prompt KL regularization maintains low hack rates (under 3%) when standard training yields up to 100% hack rates. Additionally, it matches or improves ground-truth accuracy, with no changes to the reward or environment. We aim to understand why prompt KL regularization is effective. Two independent mechanistic studies suggest that changes in prompt representations are key to models learning reward hacking behavior. First, we show that prompt-space activation steering Pareto-dominates response-space steering on hack rate and accuracy. Then, we demonstrate that swapping prompt activations between models can transfer and reduce reward hacking. Together, these results suggest that the way models internally represent the task is central to reward hacking, and that this representation can be directly constrained to prevent it.
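As a rough illustration of the prompt-only KL penalty described in this abstract, the sketch below computes a per-token KL divergence between actor and reference next-token distributions and averages it over prompt positions only. The function names, the KL direction (actor relative to reference), the mean reduction, and the toy distributions are all assumptions made for illustration; the abstract does not specify them.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def prompt_kl_penalty(actor_probs, ref_probs, prompt_mask):
    """Average per-token KL(actor || reference) over prompt positions only.

    actor_probs, ref_probs: one next-token distribution per sequence position.
    prompt_mask: 1 for prompt positions, 0 for response positions.
    """
    per_token = [kl_divergence(a, r) for a, r in zip(actor_probs, ref_probs)]
    n_prompt = sum(prompt_mask)
    if n_prompt == 0:
        return 0.0
    return sum(k * m for k, m in zip(per_token, prompt_mask)) / n_prompt

# Toy example (hypothetical numbers): 3 positions, vocabulary of 2 tokens.
actor = [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]]
ref = [[0.5, 0.5], [0.6, 0.4], [0.7, 0.3]]
mask = [1, 1, 0]  # the last position belongs to the response and is left unconstrained
penalty = prompt_kl_penalty(actor, ref, mask)
```

In a training loop, a term like `penalty` would presumably be scaled by a coefficient and added to the RL loss; because response positions are masked out, the response distribution contributes nothing to it.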
Probing the Misaligned Thinking Process of Language Models
Abstract
Large language models exhibit a growing range of misaligned behaviors such as strategic deception, sandbagging, and self-preservation.
As they are increasingly deployed in high-stakes settings, it is critical to reliably detect such behaviors to ensure safe and responsible use.
In this work, we propose to monitor misalignment by decomposing it into fine-grained cognitive processes -- \textit{misalignment indicators} -- and detecting their presence in a model's internal activations via linear probes.
We develop a taxonomy of 18 indicators spanning different misaligned behaviors, paired with an automated, meta-plan-guided pipeline that generates multi-turn training conversations.
To rigorously evaluate generalization, we construct an out-of-distribution suite combining automated behavioral elicitation, established misalignment benchmarks, and natural benign conversations.
Across 5 misaligned behaviors, our probes match a strong LLM judge with 0.935 AUROC on out-of-distribution benchmarks while keeping a low false-positive rate on benign traffic.
We further perform in-depth analysis to understand the probes and the model's internal representations on misalignment indicators.
Towards Measuring and Detecting Unverbalized Evaluation Awareness
Abstract
When a model changes behavior because it recognizes a test, monitoring for explicit verbalizations is an unreliable method: the words can disappear while the policy remains. We study this failure mode in controlled coding model organisms trained to use Python type hints more often in evaluation-like contexts than deployment-like contexts. In the primary Nemotron organism, targeted on-policy resampling of the reasoning trace suppresses all detected verbalizations of evaluation-awareness, but leaves a matched eval/deploy type-hint gap of $+7.0$%; the same method preserves a much larger $+65.3$% gap in a smaller Qwen3.5 replication. Linear probes trained only on non-coding eval/deploy contrasts transfer to coding tasks and remain predictive on suppressed generated-token sections, with response AUROC $0.81$ in Nemotron and $0.71$ in Qwen3.5. These results show that evaluation-conditioned behavior, explicit verbalizations of evaluation-awareness, and internal eval/deploy representations are separable. Non-verbal evaluation awareness is thus a concrete monitorability failure mode that verbal monitoring misses but activation-space analysis can detect.
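For readers less familiar with the AUROC figures quoted for the probes, the following is a minimal sketch of the rank-based definition: the probability that a randomly chosen positive example receives a higher probe score than a randomly chosen negative one, with ties counted as one half. The scores and labels are hypothetical and are not data from the paper.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical probe scores; label 1 = evaluation-like context, 0 = deployment-like.
probe_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
is_eval = [1, 1, 1, 0, 0, 0]
result = auroc(probe_scores, is_eval)
```

A value of 0.5 corresponds to chance-level separation and 1.0 to perfect separation.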
Reusable Uncertainty Representations Support Metacognitive Behavior in Llama-3.3-70B
Abstract
Knowing what you do not know is a hallmark feature of intelligent behavior. This is grounded in uncertainty monitoring, a core component of metacognition that guides decisions such as seeking help or withholding unreliable responses. For Large Language Models (think chatbots), this question is increasingly important, as these systems are deployed in consequential settings. Yet it remains unclear whether confidence-related behaviors in LLMs reflect reusable internal uncertainty representations or task-specific heuristics shaped by prompts and surface cues. We study this question in Llama-3.3-70B-Instruct by identifying residual-stream directions associated with answer-logit uncertainty during factual multiple-choice question answering, then testing whether the same directions transfer to metacognitive readouts, including stated confidence and delegation. We find evidence for a reusable uncertainty-related representation: the directions transfer across datasets and readouts, acquire abstention-related output semantics, and can be steered to shift delegation behavior. Ablation provides weaker and more selective evidence, suggesting that these directions are sufficient to influence delegation but not uniquely necessary. Together, these results support a constrained account in which LLMs can reuse internal uncertainty-related representations across metacognitive contexts, while the behavioral expression of those representations depends on the task.
Why do Irrelevant Instructions Inhibit Refusal?
Abstract
Large language models (LLMs) are finetuned to produce responses that satisfy multiple criteria, including instruction-following and safety. Cases where models' priorities conflict require prioritization of some objectives over others, and this can be adversarially leveraged. Here, we demonstrate that adding neutral instructions can decrease LLM refusal of harmful goals across model types and scales. In one model (Qwen3-1.7B), we show how a number of linear subspaces generated by difference-in-means activations related to refusal, harm, and instructions contain overlapping and separate information about harm and instructions across layers. We additionally find evidence that the geometry of refusal is shifted by the presence of neutral instructions. In line with this, steering with a vector generated without instructions present was insufficient to reinstate refusal behavior in a neutral instruction context, but steering with instruction-related vectors unrelated to refusal was able to reinstate refusal behavior. These results give valuable insights into the challenges and possibilities of using linear perturbations in LLMs to counteract safety failure modes.
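The mean-difference directions and steering mentioned in this abstract can be sketched generically as follows. This is a toy illustration with made-up two-dimensional activations; the function names, the unit normalization, and the additive steering rule are assumptions, not the authors' code.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_in_means(pos_acts, neg_acts):
    """Unit-norm direction pointing from the negative-class mean to the positive-class mean."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("class means coincide; no direction to extract")
    return [d / norm for d in diff]

def steer(activation, direction, alpha):
    """Add alpha times the direction to a single activation vector."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Hypothetical activations collected on harmful vs. harmless prompts.
harmful = [[3.0, 1.0], [5.0, 1.0]]
harmless = [[1.0, 1.0], [1.0, 1.0]]
refusal_dir = difference_in_means(harmful, harmless)
steered = steer([0.5, 2.0], refusal_dir, alpha=2.0)
```

In practice the vectors would be residual-stream activations at a chosen layer and token position, and `alpha` would be tuned per model.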
Fine-Tuning Enhances Latent Metacognitive Capability in Language Models
Abstract
Large language models are increasingly asked not only to answer questions, but also to judge whether they know enough to answer. We test a latent metacognitive capability hypothesis: models already contain internal structure supporting weak self-evaluation of answer uncertainty, and later training routes this structure into confidence reports and delegation decisions. This predicts that self-evaluation should track direct-answer uncertainty before task-specific fine-tuning, transfer between report and action formats, become more linearly recoverable from confidence-report states after training, and be causally affected by interventions on relevant directions. We test these predictions in Llama-3.1-8B using an Explicit Confidence Task (ECT), where stated confidence is compared to answer-option uncertainty from a separate direct-answer pass, and a Delegate Game (DG), where the model decides whether to answer or defer. The pre-trained model already shows above-chance alignment between stated confidence and direct-answer uncertainty, which improves after instruction tuning and LoRA fine-tuning. DG fine-tuning transfers to ECT despite using only binary answer/delegate labels. Mechanistically, confidence-report states increasingly contain direct-answer uncertainty information, confidence-report directions align with answer-certainty directions, and causal interventions affect confidence reports while largely preserving answer accuracy. These results support a limited form of latent metacognition: training routes internal uncertainty signals into self-evaluative reports and decisions.
Dissociating the Internal Representations of Sycophancy in LLMs
Abstract
Large Language Models (LLMs) frequently exhibit sycophancy, where they agree with a user's statement even when incorrect. While sycophancy is often treated as a single defined behavior, it can manifest in substantially distinct ways and circumstances, raising the question of whether this multi-faceted nature is reflected in its internal mechanisms. To address this gap, we dissociate the representations of sycophancy into factual and opinion subtypes---motivated by the distinction between verifiable claims and subjective beliefs. We train linear probes and construct steering vectors on activations of one subtype and evaluate their transfer to the other subtype to measure to what extent they share representations. We find evidence that different LLMs represent these subtypes differently, with either more unified or more distinct and causally interfering representations. This method of dissociation offers a promising framework for studying the representational structure of complex model behaviors.
Decoded but Unused: Instruction Tuning Routes Moral Framing into the Judgment Readout
Abstract
Large language models change their moral verdicts when the same event is reframed, but the literature treats this as a behavioural fact about chat models without locating where in the network the change happens. We show that moral framing is already linearly decodable in the pretrained network yet has no causal effect on its judgment, while in the instruction-tuned checkpoint that same representation becomes aligned with and causally usable by the evaluative readout, with the within-model framing-judgment alignment 8.4× larger than in the matched pretrained checkpoint at the same layer. Instruction tuning changes how the representation is read out, not whether it exists.
Reading Calibrated Uncertainty from Language Model Trajectories
Abstract
The maximum softmax probability (MSP) is the baseline when evaluating uncertainty quantification for language model generation with structured output. Although cheap, it is often miscalibrated. Methods that probe the model's internal activations feed raw hidden states into opaque classifiers, reading activations as static snapshots and leaving implicit the layer-wise trajectory by which a representation is formed. Yet, similar endpoints can arise from very different paths, and how evidence accumulates, reinforces, or reverses across depth might reveal uncertainty that final probabilities obscure. We extract eleven scale-invariant geometric features, tracing the cumulative path of per-layer MLP updates, and feed them to a sparse linear probe. The probe outperforms MSP under selective abstention, with gains scaling with baseline miscalibration up to 21 AURC points. Because every feature has a closed-form geometric meaning, the probe's coefficients trace how and where along depth errors take shape — which layers commit prematurely, which contradict the running state, where trajectories drift away from their endpoint.
Non-linear Interventions on Large Language Models
Abstract
Intervention is one of the most representative and widely used methods for understanding the internal representations of large language models (LLMs). However, existing intervention methods are confined to linear interventions grounded in the Linear Representation Hypothesis, leaving features encoded along non-linear manifolds beyond their reach. In this work, we introduce a general formulation of intervention that extends naturally to non-linearly represented features, together with a learning procedure that further enables intervention on implicit features lacking a direct output signature. We validate our framework on refusal bypass steering, where it steers the model more precisely than linear baselines by intervening on a non-linear feature governing refusal.
Before the Last Token: Diagnosing Final-Token Safety Probe Failures
Abstract
Final-token safety probes monitor a single hidden state after prompt prefill, but jailbreak prompts can contain probe-visible unsafe evidence distributed across earlier user-token representations that is missed by this readout. We study this prefill-time failure mode using SafeSwitch-style probes trained only on clean harmful and benign prompts across three instruction-tuned LLMs. The probes achieve high recall on clean harmful prompts, but miss many jailbreaks and can produce false positives on safety-adjacent benign prompts. Subspace analyses suggest that missed jailbreaks differ from clean benign prompts along directions that are poorly captured by the probe's representational subspace, and increasing probe bottleneck width does not reliably resolve this mismatch. Token-level prefill analyses reveal that probe-visible unsafe evidence often appears earlier in the sequence but is not exposed at the final-token readout, while naive max-pooling over token positions overfires on safe prompts. A simple PCA-HMM trajectory model, trained only on the same clean split, recovers many final-token misses from user-content prefill trajectories without the catastrophic false-positive behavior of naive token pooling, motivating trajectory-aware hidden-state monitors as diagnostic complements to final-token probes.
Fine-tuning with Harmfulness Probes Leads to Natural Refusals
Abstract
Linear probes on residual-stream activations can detect harmful content in language model generations, but their use as a training signal for instilling safe behavior is largely unexplored. We study probe-guided fine-tuning under a KL anchor, starting from instruction-tuned models whose refusal mechanism has been removed by directional ablation or was never present, and compare three regimes for the probe itself: frozen, warm-retrained, or reinitialized at every step. Frozen probes preserve utility but leave generations largely harmful: the model evades them by translating activations across a fixed decision boundary while the harmful feature itself remains linearly encoded. Adaptive probes, both warm-retrain and reinit, reduce harmful compliance substantially at modest utility cost, and the resulting checkpoints score well below the abliterated base under StrongReject on both direct querying and GCG. Rather than producing explicit refusals, these checkpoints soft-refuse by reinterpreting harmful prompts benignly, or pivot sycophantically to unrelated benign content. Mechanistically, adaptive probes track the moving harmfulness direction, so a freshly fit linear probe still separates harmful from benign activations at every layer, leaving the signal that downstream monitors rely on intact.
A Mechanistic View of Authority Hierarchy in LLM Sycophancy
Abstract
Authority bias poses a critical safety concern in language models: models systematically prioritize social cues from authority figures over factual consistency, swaying their answers based on source credibility rather than evidence. We mechanistically investigate this phenomenon using a controlled medical QA setting, where hints suggesting incorrect answers are attributed to personas of varying expertise. Across Llama-3.1-8B, Qwen3-8B, and Gemma-2-9B, we find that models respond in a graded manner proportional to perceived authority, a hierarchy that is never explicitly prompted but emerges from training. Logit lens analysis and linear/non-linear probing localize this effect to a critical late layer where correct answer representations are actively erased, an erasure that scales with authority level, resists mean vector intervention, and is only partially reversible through chain-of-thought reasoning. Our findings suggest that authority-induced sycophancy is not a surface-level output bias but mechanistic knowledge erasure, a precise, layer-localized overwriting of correct internal representations by high-status authority signals.
Hidden-State Similarity Predicts Re-Elicitation After Inoculation Prompting
Abstract
Fine-tuning on narrow harmful tasks can cause emergent misalignment, where models generalize harmful behavior beyond the training distribution. Inoculation prompting can reduce this effect by explicitly eliciting the undesired behavior during training, but recent work shows that the behavior can reappear when evaluation prompts contain cues from the training context. We study what makes such prompts effective triggers. We find that textual similarity to the inoculation prompt is an incomplete predictor: prompts are more likely to re-elicit suppressed behavior when they induce activation states similar to those produced by the inoculation context. These findings advance our understanding of how inoculation prompting modulates conditional misalignment, and suggest that activation-space analysis can help identify when suppressed behaviors remain accessible under eval-time prompts.
Reasoning & Chain-of-Thought (15)
Model Incrimination: Investigating Whether Concerning Behavior Reflects Misalignment
Abstract
A central goal of safety research is determining whether a model is misaligned. Prior work has largely focused on detecting concerning behavior, but behavior alone is not sufficient to establish misalignment: a concerning action can arise from benign causes such as confusion. This raises the problem of determining whether malign intent underlies such behavior, a process we term model incrimination. The goal of this paper is to develop effective methods for doing so. To enable this, we create a suite of six agentic environments where models exhibit concerning behavior as practice grounds, and follow a simple two-step protocol for investigating the causes behind model behavior: hypothesis generation via reading the chain of thought -- which, while not always faithful, is a rich source of hypotheses about what drives model behavior -- followed by hypothesis validation via environment interventions and additional methods as appropriate. As we do not have access to ground truth about why a model takes an action, we rely on convergent findings across independent experiments as our standard of evidence. By following our protocol, we learn effective methods for concretely determining motivations: for example, we use predictions to back out latent properties of model behavior, like the fact that Kimi K2 Thinking takes shortcuts due to a legitimate disposition towards less effortful courses of action, and that reward hacking in frontier models is strategic, while in weaker models it is not. However, some unanswered questions require further methodological development: we construct an absence-of-evidence case against Kimi K2 Thinking believing it is going against the user's wishes while taking shortcuts, but run into confounds that limit our confidence in its absence. Overall, a key takeaway is that simple methods like reading the CoT and environment interventions are highly effective. 
More broadly, our work shows model incrimination is a tractable empirical problem with significant room for progress, and establishes a baseline for future work.
Backtracking is Decorative: A Mechanistic Account
Abstract
Reasoning language models trained with reinforcement learning frequently interrupt themselves mid-thought with "Wait, ..." or "Actually, ..." and try a different approach. It is widely assumed that this *backtracking* behavior is part of how RL post-training improves reasoning. We find it is not. The pivot decision is made *before* the model writes the interrupting token, and mechanically removing the internal signal that drives it cuts the rate of backtracking roughly in half *without changing how often the model gets the right answer* across several model families and on both math and non-math reasoning. Opening the box on a single model, we map the full pivot circuit. Three of its four parts (a sensor, an evaluator of trajectory correctness, and the downstream actuator that emits the pivot token) are already present in the base model. RL contributes the *gate* that combines the sensor and the evaluator, and even that gate is diffuse: only high-rank residual-stream interventions recover it; low-rank steering does not. How the sensor and evaluator combine depends on the domain in a predictable way: a pre-registered framework correctly predicts the gate's sign in every domain we tested. These results revise the dominant view that the most visible RL-installed reasoning behaviors are the channels through which RL improves accuracy. Decomposing what looks like a single RL-installed behavior into separable, mostly pretraining-native components opens a path toward eliciting reasoning capabilities through targeted manipulation of pre-existing structure, rather than through full RL post-training.
Reading Between the Dots: Decoding Hidden Computation across Filler Tokens
Abstract
Frontier LLMs can perform multi-step reasoning over content-free filler tokens like dots or counting sequences, producing correct answers with no visible chain-of-thought (CoT). This is a limit case for behavioral oversight, where surface tokens carry no information about the underlying reasoning. But hidden from the output is not the same as hidden from us. On three task families (fact retrieval, parallel numeric composition, string manipulation), two open-weights frontier models (DeepSeek V3, Kimi K2) compute over filler tokens in a structured, legible way: attention routes the question through the filler region to the answer, KV-cache transplants at filler positions causally swap outputs between examples, and logit-lens readouts show retrieved facts emerging early and their composition crystallizing in late layers. We introduce an unsupervised decoding pipeline that takes only hidden states as input and recovers intermediate values with 80–95\% accuracy (best LLM judge) across both models and all three tasks, without ground-truth labels or training. Hidden computation that defeats behavioral CoT monitoring is, on these tasks, directly readable from the residual stream, suggesting monitorability is a property of the model's full computational trace, not just its surface tokens.
Do Thinking Tokens Help with Safety?
Abstract
Today's reasoning models use thinking tokens to attain stronger performance on benchmarks than their instruction-tuned counterparts. It is also generally believed that this more "deliberative" mode should improve alignment and safety, by providing the model a safe space to consider whether its planned answer to the request violates its safety principles. We present evidence that this intuition is not always correct. Across frontier open-weight reasoning models spanning GPT-OSS, Qwen, Olmo, and Phi families, we find that the eventual refusal/compliance outcome is already strongly readable before any visible thinking, with a probe on the first token's hidden representation predicting refusal/compliance with $0.84$–$0.95$ AUROC and $\sim88$\% balanced accuracy. Here, thinking turns out to behave more like prefix completion than deliberative revision, with the final outcome rarely changing after the first $\sim20$\% of thinking. Inspecting these thinking traces reveals that among segments that appear deliberative at the text level, only a minority affect the final outcome. More strikingly, $\sim74$\% of text-level deliberations occur when the response distribution is already locked to one refusal/compliance side, even as the trace continues to look deliberative. We also find that existing inference-time and training-based safety interventions, despite being motivated by the goal of activating deliberation, largely shift model behavior toward over-refusal while suppressing already scarce deliberation signals. Our results suggest that safety behavior in current reasoning models is much less deliberative than commonly assumed, and highlight the need for methods that induce real safety deliberation.
Monitoring the Internal Monologue: Probe Trajectories Reveal Reasoning Dynamics
Abstract
Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model's final output, undermining its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations.
By evaluating a probe at each generated token, we construct a probe trajectory, the continuous evolution of a concept's probability across the reasoning process. We find that future model behavior is more distinguishable when examined over the full trajectory than from a single static prediction. To characterize these temporal dynamics, we extract signal-processing features that capture volatility, trend, and steady-state behavior, significantly improving the separation of future model states.
We also present two methodological insights. First, template-based training data achieves near-parity with dynamically generated model responses, eliminating the need for a costly initial inference and labeling. Second, the choice of pooling operation is critical: average-pooling and last-token methods collapse to near-random performance, while max-pooling achieves up to $95\%$ AUROC and yields stable probe trajectories.
Using four datasets and four reasoning models across the domains of safety and mathematics, we demonstrate that trajectory features encode task-specific dynamics that improve outcome separability. These findings establish probe trajectories as a complementary framework for monitoring LRM behavior.
Warning: This article contains potentially harmful content.
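To make the idea of summarizing a probe trajectory concrete, here is a small sketch that turns a sequence of per-token probe probabilities into three scalar features. The specific definitions (mean absolute step for volatility, least-squares slope for trend, mean of the final quarter for steady state) are illustrative assumptions; the abstract does not say which signal-processing features the authors use.

```python
def trajectory_features(traj, tail_fraction=0.25):
    """Summarize a per-token probe trajectory with three hypothetical features."""
    n = len(traj)
    if n < 2:
        raise ValueError("trajectory needs at least two points")
    # Volatility: mean absolute change between consecutive tokens.
    diffs = [b - a for a, b in zip(traj, traj[1:])]
    volatility = sum(abs(d) for d in diffs) / len(diffs)
    # Trend: least-squares slope of probe probability against token index.
    t_mean = (n - 1) / 2
    y_mean = sum(traj) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(traj))
    den = sum((t - t_mean) ** 2 for t in range(n))
    trend = num / den
    # Steady state: mean over the final fraction of the trajectory.
    tail = traj[-max(1, int(n * tail_fraction)):]
    steady_state = sum(tail) / len(tail)
    return {"volatility": volatility, "trend": trend, "steady_state": steady_state}

# Hypothetical probe probabilities over four generated tokens.
features = trajectory_features([0.1, 0.2, 0.3, 0.4])
```

Features of this kind could then be fed to a downstream classifier in place of a single static probe score.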
Reasoning as State Transition: A Representational Analysis of Reasoning Evolution in Large Language Models
Abstract
Large Language Models have achieved remarkable performance on reasoning tasks, motivating research into how this ability evolves during training. Prior work has primarily analyzed this evolution via explicit generation outcomes, treating the reasoning process as a black box and obscuring internal changes. To address this opacity, we introduce a representational perspective to investigate the dynamics of the model's internal states. Through comprehensive experiments across models at various training stages, we discover that post-training yields only limited improvement in static initial representation quality. Furthermore, we reveal that, distinct from non-reasoning tasks, reasoning involves a significant continuous distributional shift in representations during generation. Comparative analysis indicates that post-training empowers models to drive this transition toward a better distribution for task solving. To clarify the relationship between internal states and external outputs, statistical analysis confirms a high correlation between generation correctness and the final representations, while counterfactual experiments identify the semantics of the generated tokens, rather than additional computation during inference or intrinsic parameter differences, as the key driver of the transition. Collectively, we offer a novel understanding of the reasoning process and the effect of training on reasoning enhancement, providing valuable insights for future model analysis and optimization.
Where Do Reasoning Models Refuse?
Abstract
Chat models without chain-of-thought (CoT) reasoning must decide whether to refuse a harmful request before generating their first response token. Reasoning models, by contrast, produce extended chains of thought before their final output, raising a natural question: where in this process does the decision to refuse occur? We investigate this across four open-source reasoning models. We first show that the CoT causally influences refusal outcomes; fixing a specific reasoning trace substantially reduces variance in whether the model ultimately refuses or complies. Zooming into the reasoning trace, we find that in distilled models, subtle differences in the opening sentence of the CoT can fully determine the model's refusal decision, and that these patterns transfer across models distilled from the same teacher. Finally, we extract linear refusal directions from model activations and show that ablating them increases harmful compliance, though less reliably than the same technique achieves on non-reasoning models, and with non-negligible degradation to general capabilities.
Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal
Abstract
Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal.
Unlike instruction-tuned LLMs, where refusal is mediated by a single directional subspace, refusal in LRMs additionally depends on the CoT. In DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39\% of cases when the CoT is kept fixed, but removing the CoT entirely increases this to 70\%, indicating that the CoT actively reinforces refusal. In a two-stage intervention where the model regenerates its CoT under activation steering, refusal is reversed in 94\% of cases, while the resulting CoT alone retains 48\% of this effect even after steering is removed.
This suggests that the CoT can carry and reconstruct the compliance signal independently.
These findings indicate that refusal in LRMs is jointly encoded in residual stream activations and the CoT.
This joint encoding makes LRMs more robust against activation-level interventions alone, but exposes the CoT as a possible alternative attack surface.
Reasoning aligns language models to human inference
Abstract
Do language models make decisions under uncertainty like humans do? And if so, what role does extended reasoning play in the underlying decision process? We answer this question by introducing an active probabilistic reasoning task that cleanly separates sampling (actively acquiring evidence) from inference (integrating evidence towards a decision). Benchmarking humans and a broad set of contemporary LLMs against optimal reference policies reveals a consistent pattern: extended reasoning is the key determinant of strong performance, driving large gains in inference, while yielding only modest improvements in active sampling. To explain these differences, we fit a behavioral model that captures systematic deviations from optimal Bayesian behavior through interpretable parameter families, placing humans and models in a shared low-dimensional cognitive space. The resulting fits show how reasoning shifts models toward human-like regimes of evidence accumulation and belief-to-choice mapping, and yield testable predictions about the latent dynamics that might drive each decision. Probing residual-stream activations of an open-weight reasoning model, we find that the geometry of internal representations tracks these predicted dynamics, linking behavior to representational correlates of the fitted latent dynamics.
Making LLMs Say What They Think: Measuring and Improving CoT-Interpretability Alignment
Abstract
Chain-of-thought (CoT) traces often serve as a proxy for how Large Language Models (LLMs) arrive at their answers. However, growing evidence shows that models' CoT often fails to reflect their internal computations and can be manipulated to produce different CoTs without changing their outputs. In this work, we measure and improve the alignment between what LLMs say in their CoT and what they compute internally. To quantify such parametric faithfulness, we propose CoT-Interpretability Alignment (CIA), a metric that measures the agreement between a model's CoT traces and its internal reasoning strategies as detected by interpretability tools. We evaluate CIA on three tasks (two-hop question answering, hint intervention, and integer multiplication) across three LLMs, finding that current models exhibit low CIA scores across all tasks. We then experiment with improving CIA via post-training while setting both task accuracy and parametric faithfulness signals as a reward. Experiments show that we can substantially improve CoT parametric faithfulness while maintaining or improving task accuracy. We provide rich analysis, showing how such post-training changes model behaviors: in some tasks the model learns to change how it reasons, while in others it learns to change how it reports. Our work provides both a framework for auditing CoT parametric faithfulness and a pathway toward making models' explicit reasoning more trustworthy.
Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning
Abstract
Verifying whether a language model is genuinely reasoning or pattern-matching remains an open problem: learned verifiers are expensive, and output-based heuristics are brittle.
We show that valid mathematical reasoning induces a measurable, training-free spectral signature in transformer attention.
By treating each attention matrix as a weighted token graph, we extract four diagnostics: Fiedler value, High-Frequency Energy Ratio (HFER), spectral entropy, and smoothness, that require no learned parameters.
Experiments across seven models from four architectural families yield effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling $85$-$96$% single-threshold classification accuracy.
Two findings sharpen the interpretation.
First, Platonic validity: the spectral signal tracks logical coherence rather than compiler acceptance; proofs rejected for timeouts or missing imports are correctly classified as valid, a distinction confirmed by a manual audit ($\kappa = 0.82$, $n = 51$).
Second, architectural determinism: Sliding Window Attention shifts the discriminative feature from HFER to smoothness ($d = 2.09$, $p < 10^{-48}$), showing that attention design governs which spectral channel encodes reasoning quality.
Causal ablation confirms the signature traces induction-head circuits.
The method generalises to informal chain-of-thought ($d = 0.78$, $p < 10^{-3}$), and in proof search, HFER reranking improves Best-of-16 Pass@1 by $+4.4$-$6.6$%, matching $98$% of the AUC of fully supervised probes with zero labels.
Spectral graph analysis is a principled, architecture-aware primitive for reasoning verification.
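As a rough illustration of the four diagnostics named in this abstract, the sketch below computes them for a single attention matrix. The abstract does not give formulas, so everything here is an assumption: the symmetrization step, the use of a per-token scalar `signal`, the `high_frac` cutoff, and the function name `spectral_diagnostics` are all hypothetical choices, not the authors' definitions.

```python
import numpy as np

def spectral_diagnostics(attn, signal, high_frac=0.5):
    """attn: (T, T) attention matrix; signal: (T,) per-token scalar (assumed)."""
    # Symmetrize so attention defines an undirected weighted token graph.
    W = 0.5 * (attn + attn.T)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W           # combinatorial graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    eigvals = np.clip(eigvals, 0.0, None)    # remove tiny negative round-off

    fiedler = eigvals[1]                     # second-smallest eigenvalue

    # Graph Fourier transform: signal energy carried by each eigenvector.
    energy = (eigvecs.T @ signal) ** 2
    cut = int(len(eigvals) * (1.0 - high_frac))
    hfer = energy[cut:].sum() / energy.sum() # share of energy in high frequencies

    # Shannon entropy of the normalized spectral energy distribution.
    p = energy / energy.sum()
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))

    # Normalized Laplacian quadratic form; small values mean a smooth signal.
    smoothness = signal @ L @ signal / (signal @ signal)
    return {"fiedler": fiedler, "hfer": hfer,
            "entropy": entropy, "smoothness": smoothness}

# Synthetic example: random softmax attention over 12 tokens.
rng = np.random.default_rng(0)
logits = rng.standard_normal((12, 12))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
signal = rng.standard_normal(12)
diag = spectral_diagnostics(attn, signal)
flat = spectral_diagnostics(attn, np.ones(12))  # constant signal: no variation
```

None of these quantities involves learned parameters, which is the property the abstract emphasizes; a single threshold on one of them would then act as the classifier.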
The Progress Helix Tracks Reasoning Depth in Language Models
Abstract
Large Language Models (LLMs) execute long chains of thought (CoT), but the mechanism by which they maintain a global sense of position within a reasoning trajectory has not been well characterized. To this end, we identify the Progress Helix: an emergent periodic trajectory in a low-dimensional latent subspace that completes exactly one revolution over a generated reasoning chain. Applying PCA and spectral analysis to activations from Llama-3, Gemma-2, and Qwen on both a controlled recursive arithmetic task and GSM8K, we recover a dominant Fourier component at fundamental frequency $k{=}1$ whose magnitude grows monotonically across layers, forming an elliptical cone in activation space. We causally validate the helix's functional role through three interventions. Substituting the natural manifold with a constant-velocity synthetic helix induces logical collapse, indicating polysemanticity with numeric content. Ablating the attention heads most responsible for the $k{=}1$ signal significantly elongates generations. Most strikingly, re-projecting activations onto a helix scaled by speed factor $\gamma$ allows targeted control over output length, with reasoning accuracy peaking at $\gamma{=}0.85$. We thus offer a spectral and topological characterization of reasoning length as a controllable helical degree of freedom in the residual stream.
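The PCA-plus-spectral-analysis step described above can be sketched on synthetic data. The snippet below is only an illustration under stated assumptions: the activations are a planted one-revolution circle plus noise rather than real model activations, and `dominant_frequency` is a hypothetical helper, not the authors' pipeline.

```python
import numpy as np

def dominant_frequency(acts, n_components=3):
    """acts: (T, D) activations over T token positions. Returns (k, power)."""
    centered = acts - acts.mean(axis=0, keepdims=True)
    # PCA via SVD: rows of Vt are principal directions.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ Vt[:n_components].T            # (T, n_components)
    # Power spectrum along the token axis, summed over components.
    power = (np.abs(np.fft.rfft(proj, axis=0)) ** 2).sum(axis=1)
    power[0] = 0.0                                   # ignore the DC term
    return int(np.argmax(power)), power

# Synthetic "helix": one full revolution over the sequence in a random 2-D plane.
rng = np.random.default_rng(0)
T, D = 64, 32
t = np.arange(T) / T
basis, _ = np.linalg.qr(rng.standard_normal((D, 2)))  # orthonormal plane
acts = (np.cos(2 * np.pi * t)[:, None] * basis[:, 0]
        + np.sin(2 * np.pi * t)[:, None] * basis[:, 1]
        + 0.05 * rng.standard_normal((T, D)))
k, power = dominant_frequency(acts)
```

On this planted signal the strongest nonzero frequency is the fundamental, i.e. one cycle per sequence, which is the signature the abstract refers to as $k{=}1$.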
Invariant Reasoning Directions in Latent Trajectories of Language Models
Abstract
Latent reasoning models perform multi-step inference directly in hidden-state space, yet the structure of these latent reasoning trajectories remains poorly understood. We show that contrastive refinement signals between stronger and weaker reasoning trajectories exhibit a highly concentrated low-rank structure, while unconstrained latent updates remain sensitive to paraphrases, checkpoint choice, and trajectory perturbations. These observations suggest that latent reasoning trajectories contain stable invariant directions mixed with unstable instance-specific variation. We introduce Trajectory-Invariant Latent Refinement (TILR), a training-free intervention framework for identifying and manipulating stable reasoning directions in latent space. TILR first learns a low-rank invariant subspace from contrastive trajectory differences across inputs, then constrains latent interventions to this subspace while suppressing poorly aligned updates through an adaptive alignment gate. Across six reasoning benchmarks, we find that a small number of latent directions explain most variation between strong and weak reasoning trajectories. Interventions on these directions causally improve reasoning consistency and reduce trajectory instability under paraphrases and perturbations. TILR improves answer consistency under paraphrase by ~10% and reduces latent trajectory variance by up to 50% while preserving reasoning accuracy. These results support a geometric view of latent reasoning in which transferable reasoning behavior emerges from stable low-dimensional structure within hidden-state trajectories.
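A minimal sketch of the two TILR steps named above (fit a low-rank subspace from contrastive differences, then restrict and gate an update) might look as follows. The SVD-based fit, the hard threshold gate, and the names `fit_invariant_subspace` and `gated_update` are assumptions made for illustration; the abstract does not specify how the subspace is learned or how the adaptive gate is computed.

```python
import numpy as np

def fit_invariant_subspace(diffs, rank):
    """diffs: (N, D) strong-minus-weak latent differences across inputs."""
    # Top right-singular vectors span the dominant shared directions.
    _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
    return Vt[:rank]                                  # (rank, D), orthonormal rows

def gated_update(update, basis, threshold=0.5):
    """Project an update onto the subspace; suppress it if poorly aligned."""
    projected = basis.T @ (basis @ update)
    alignment = np.linalg.norm(projected) / (np.linalg.norm(update) + 1e-12)
    gate = 1.0 if alignment >= threshold else 0.0     # hard gate (assumption)
    return gate * projected, alignment

# Synthetic check: differences concentrated in a 2-D subspace of a 16-D space.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
diffs = (rng.standard_normal((200, 2)) @ Q[:, :2].T
         + 0.01 * rng.standard_normal((200, 16)))
basis = fit_invariant_subspace(diffs, rank=2)

kept, align_in = gated_update(Q[:, 0], basis)         # lies inside the subspace
dropped, align_out = gated_update(Q[:, 5], basis)     # orthogonal to it
```

In this toy setting the in-subspace update passes nearly unchanged while the orthogonal one is zeroed, which mirrors the stated goal of suppressing poorly aligned updates.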
Inducing post-hoc chain-of-thought reasoning in LLMs on multiple-choice question answering tasks
Abstract
Chain-of-thought (CoT) monitoring is a promising approach to ensuring the safety of large language models (LLMs). However, for monitoring to be effective, the CoT must be faithful: it must reflect the model's underlying reasoning process. If an LLM generates a CoT in order to justify a predetermined answer post-hoc, said CoT can be unfaithful. In this work, we provide evidence that LLMs from three different model families engage in post-hoc reasoning on multiple-choice question answering datasets.
Specifically, we find that models record their belief in the correctness of each answer choice via a correctness direction located at the delimiter after the choice and the final few tokens of the choice; steering with this direction causes the model to confabulate a CoT that supports the steered-to answer. However, we find that post-hoc reasoning is less prevalent for questions involving step-by-step mathematical reasoning. Our work provides a preliminary mechanistic account of how pre-computed answers can drive unfaithfulness in LLMs.
Not All Eval-Awareness is Equal: Capabilities Framing Predicts Compliance
Abstract
Steering interventions targeting eval-awareness, a model's recognition that it is being tested, are increasingly used in safety evaluation pipelines, where evaluation-awareness is treated as a single quantity to be suppressed. We show that verbalized eval-awareness in chain-of-thought decomposes into capabilities-flavored ("the user is testing my ability to follow instructions") and safety-flavored ("the user is testing my boundaries") framings that predict compliance very differently: on Qwen3-32B over the FORTRESS dataset, capabilities-framing predicts compliance with a +24 to +46 percentage-point gap over safety-framing across all tested steering conditions. A CoT-prefill intervention on eval-awareness-negative rollouts confirms the link is causal, with 10 of 11 prefills shifting compliance in the predicted direction. Thus, eval-awareness is not behaviorally uniform: aggregate suppression rates can move while the safety-relevant component does not, and the same "X% suppression of eval-awareness" can correspond to qualitatively different behavioral outcomes.
Multimodal & Vision (11)
Interpreting Physics in Video World Models
Abstract
A long-standing question in physical reasoning is whether video models rely on factorized physical state variables, or on task-specific distributed representations. We present the first mechanistic interpretability study of physical variables inside large-scale video encoders, combining layerwise probing, subspace geometry, patch-level decoding, and targeted attention ablations to characterize where and how physical information is organized.
Across architectures, we identify a sharp intermediate-depth transition, the Physics Emergence Zone, at which physical variables become linearly accessible. Scalar speed and acceleration are available from early layers, whereas motion direction emerges only at the Physics Emergence Zone, mirroring the V1 to MT motion hierarchy in primate visual cortex. Direction is encoded as a circular high-dimensional population code: dozens of orthogonal probe dimensions must be steered jointly to change the decoded direction, orders of magnitude more than the low-dimensional steering interventions seen in language models. These findings argue against compact physics-engine state variables and support distributed, hierarchically-organized, "brain-like" representations that are nonetheless sufficient for making physical predictions.
Interpretability Transfer from Language to Vision via Sparse Autoencoders
Abstract
Recent advances in language model interpretability using sparse autoencoders (SAEs) have yet to effectively translate to the visual domain, mainly due to the difficulty and ambiguity of labeling visual concepts. In this paper, we introduce Visual Interpretability via SAE Transfer Alignment (VISTA), a framework that transfers interpretability from language to vision in a LLaVA-style vision-language model by constraining a visual projector to map visual tokens into an LLM's pre-existing, labeled textual SAE space. This approach enables visual interpretability without training dedicated vision SAEs. By regularizing the projector using the LLM's SAE reconstruction loss, VISTA achieves a threefold increase in the matching rate, which measures how accurately the most activating textual concepts in the SAE space correspond to semantic elements in the image. Using this framework, we further analyze spatial localization properties of different vision encoders and show that DINOv2 features have stronger localization abilities than other encoders. Leveraging this precision, we validate VISTA's cross-modal alignment through fine-grained, localized concept interventions, where specific objects are removed or replaced in the model's perception while preserving the surrounding scene. This results in improvements of 35% in object removal and 47% in object replacement tasks over vision-only baselines, providing causal evidence that visual tokens inhabit the text SAE manifold. These contributions are validated across multiple LLM architectures.
Mechanistically Interpreting Compression in Vision-Language Models
Abstract
Compressed vision-language models (VLMs) are widely used to reduce memory and compute costs, making them a suitable choice for real-world deployment. However, compressing these models raises concerns about whether internal computations and safety behaviors are preserved. In this work, we use causal circuit analysis and crosscoder-based feature comparisons to examine how pruning and quantization alter the internal representations across VLMs. We observe that pruning generally keeps circuit structure intact but rotates and attenuates internal features, while quantization modifies the circuits at a higher level yet leaves the surviving features better aligned. We further evaluate these effects on the refusal behavior in VLMs. Using a novel benchmark, VLMSafe-420, containing harmful prompts and benign counterfactuals across multiple safety categories and modalities, we show that pruning and quantization produce distinct degradations in genuine refusal behavior that reflect their underlying representational changes. Hence, the choice of model compression also has important implications for AI safety.
On the Role of Mechanistic Interpretability for Vision-Language Prompt Learning
Abstract
Recent advances in mechanistic interpretability of vision-language models (VLMs) such as CLIP propose using sparse autoencoders (SAEs) to discover monosemantic, human-understandable features that explain CLIP’s internal representations. Existing work using SAEs to probe VLMs primarily focuses on post-hoc interpretability analysis. We posit that SAE-based interpretability methods are not just probing tools, but can also serve as meaningful training guides for adapting VLMs to downstream tasks. To this end, we propose IPL (Interpretability-Guided Prompt Learning), which leverages SAE decoders to extract interpretable concept directions, composes them into prompt tokens via a learnable attention selector, and injects the resulting tokens into both the vision and text encoder layers of CLIP for adaptation. We further study how prompt tokens obtained by probing vision-only, text-only, and unified concept directions from respective interpretability methods affect performance on downstream tasks. We perform extensive experiments across downstream settings such as base-to-novel generalization, domain generalization, cross-dataset transfer, and few-shot learning. While IPL using vision-only and text-only concept directions obtains decent gains, IPL with unified concept directions achieves the strongest results, outperforming most of the prior prompt-learning methods over 15 datasets across all downstream settings.
Transformers as Communication Systems: Controlling Information Flow with Bottlenecks
Abstract
We make the information communicated by attention between residual streams in vision transformers a measurable and controllable quantity.
By inserting variational information bottlenecks on all attention-mediated writes to the residual stream---without other architectural changes---we train models with an explicit information cost and obtain a spectrum that interpolates between independent patch processing and fully expressive global attention.
On ImageNet-100, we characterize how classification and self-supervised representation learning change across this spectrum, revealing how information flow is allocated across depth, heads, and patches as global visual representations emerge from local processing.
We further inspect the first attention heads that transmit information, identifying simple visual computations that appear under tight communication constraints.
By biasing learning toward solutions with constrained internal communication, our approach yields models that are more tractable for mechanistic analysis and more amenable to control.
LatentCompass: T2I Diffusion Steering via Orthogonal Attribute Spaces for Debiasing, Concept Erasure, and Red Teaming
Abstract
Text-to-image (T2I) diffusion models suffer from biased results stemming from entangled generative priors and a lack of accurate control over outputs. Current mitigation attempts rely on imprecise, adversarially-vulnerable prompt and text embedding interventions, or they require prohibitive and invasive fine-tuning. Further, text-based methods can only control descriptive attributes, i.e., what an image depicts, but not evaluative attributes, i.e., how it is perceived by an external judge. We propose LatentCompass, an exemplar-based approach that enables disentangled and controllable generation for both descriptive and evaluative concepts in a training-free manner. LatentCompass steers the generative trajectory by (a) constructing a nonlinear, low-dimensional, and orthogonal attribute space via a closed-form solution that explicitly isolates desired concepts, (b) computing an optimal shift in the constructed space, and (c) reflecting the corresponding shift in the T2I latent space. Extensive evaluations demonstrate that LatentCompass effectively (i) mitigates generative stereotypes by 100%, (ii) reduces unsafe concept generation by 58%, (iii) enhances aesthetic quality by 27% on average, (iv) boosts red-teaming success rates against Deepfake detectors by up to 47%, and (v) enables high-fidelity style and face attribute editing without attribute leakage.
Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
Abstract
We document a phenomenon in which Vision Language Models (VLMs) outperform their underlying LLMs on purely text-only tasks, particularly in long-context information retrieval. To investigate this effect, we build a controlled synthetic retrieval task and find that a transformer trained only on text achieves perfect in-distribution accuracy but fails to generalize out of distribution, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance. Using interchange interventions, attention knockouts, and linear probes, we causally identify the mechanism underlying this improvement: text-only training converges to positional binding—a shortcut that exploits token positions—whereas image-based training disrupts this shortcut, in part through the translation invariance inherent in visual inputs, forcing the model to adopt symbolic binding based on semantic content. We characterize the circuit-level implementations of each mechanism, identifying a binding signature—a marked surge in attribute decodability at entity positions—that distinguishes explicit from implicit binding and generalizes to large-scale pretrained models. Our results demonstrate how mechanistic interpretability tools can causally link cross-modal training to learned computational strategies, and suggest that visual supervision acts as a mechanism-level regularizer that promotes robust binding in language models.
Observing and Controlling Features in Vision-Language-Action Models
Abstract
Vision-Language-Action Models (VLAs) have shown remarkable progress towards embodied intelligence. While their architecture partially resembles that of Large Language Models (LLMs), VLAs exhibit higher complexity due to their multi-modal inputs/outputs and often hybrid nature of transformer and diffusion heads. This is part of the reason why insights from mechanistic interpretability in LLMs, which explain how the internal model representations relate to their output behavior, do not trivially transfer to VLA counterparts. In this work, we study the linear separability of action-relevant features across VLA architectures, and show that observability varies significantly by design: in the transformer-based OpenVLA, action-relevant features are poorly linearly separable, whereas in the hybrid architecture $\pi_{0.5}$, such features are accurately recovered from internal representations. Building on this, we show that lightweight linear interventions grounded in optimal control can reliably steer $\pi_{0.5}$'s behavior while preserving closed-loop capabilities, enabling alignment with user preferences and task requirements without fine-tuning.
Retrieval Heads Meet Vision: Uncovering How VLMs Locate and Extract Visual Information
Abstract
Vision-language models (VLMs) can locate an image region referred to by a text prompt and route the corresponding visual evidence to the output, yet the internal mechanism behind this behavior is not understood. Inspired by retrieval heads in large language models, we ask whether VLMs contain an analogous mechanism for visual retrieval. We answer affirmatively by introducing Visual Retrieval Heads (VRHs), a small subset of attention heads (about 1.7–2.6%) that are causally responsible for grounding text descriptions to image regions. To find them, we recast existing head-scoring methods under a unified design space over query tokens, key aggregation, and cross-sample aggregation. We then show that scoring attention from output prediction tokens with a sum over the ground-truth referent region most reliably identifies causal heads. Across four VLMs and five referring-expression benchmarks, masking only the top 20 VRHs reduces grounding accuracy by up to 80 percentage points, while masking the same number of random heads has little effect. Beyond replicating the causal-sparse-universal triad established for text retrieval heads, VRHs exhibit several properties not previously reported: they generalize across visual reference tasks, remaining causal on attribute, spatial, counting, and visual-math benchmarks despite being discovered through bounding-box prediction; they are functionally specific, preserving output format while corrupting localization; and they are architecturally shared, transferring causally across VLMs that share an LLM backbone but differ in vision encoder, projector, and instruction tuning.
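The head-scoring recipe described above (attention from output prediction tokens, summed over the ground-truth referent region, then aggregated across samples) can be sketched roughly as below. The tensor layout, the mean used for both query and cross-sample aggregation, and the names `score_heads` and `top_heads` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def score_heads(attn, query_idx, region_mask):
    """attn: (layers, heads, T_q, T_k) attention for one sample.
    query_idx: positions of output prediction tokens.
    region_mask: (T_k,) boolean mask of image tokens inside the referent region."""
    from_queries = attn[:, :, query_idx, :]                     # (L, H, Q, T_k)
    region_mass = from_queries[..., region_mask].sum(axis=-1)   # sum over region keys
    return region_mass.mean(axis=-1)                            # average over queries

def top_heads(per_sample_scores, k):
    """Average scores across samples and return the k best (layer, head) pairs."""
    mean_scores = np.mean(per_sample_scores, axis=0)
    flat = np.argsort(mean_scores, axis=None)[::-1][:k]
    layers, heads = np.unravel_index(flat, mean_scores.shape)
    return [(int(l), int(h)) for l, h in zip(layers, heads)]

# Synthetic check: plant one head that attends only to the referent region.
rng = np.random.default_rng(0)
L_, H_, Tq, Tk = 2, 4, 6, 10
region_mask = np.zeros(Tk, dtype=bool)
region_mask[3:6] = True
samples = []
for _ in range(5):
    attn = rng.dirichlet(np.ones(Tk), size=(L_, H_, Tq))  # rows sum to 1
    attn[1, 2] = region_mask / region_mask.sum()          # planted retrieval head
    samples.append(score_heads(attn, [4, 5], region_mask))
best = top_heads(samples, k=1)
```

Masking the heads returned by such a ranking, and comparing against the same number of randomly chosen heads, is the kind of causal test the abstract reports.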
Representation Is Not Reliance: Concept-Based Causal Diagnostics of Shortcut Learning in Vision
Abstract
Vision models systematically exploit predictive but task-irrelevant visual cues as spurious shortcuts. This typically manifests as a severe performance drop when models generalize from in-domain to out-of-domain data. Although such shortcuts range from obvious artifacts to subtle signals, their visual form often remains elusive, even when their presence is statistically evident. To address this blind spot, we propose a diagnostic framework to expose the specific visual patterns that encode such shortcuts, paving the way for targeted removal. Specifically, we apply automated concept discovery to a shortcut-trained model to isolate candidate shortcut concepts and trace how localized visual patterns propagate across layers and become task-predictive. On a recently curated benchmark for investigating spurious correlations, these concepts lose predictive utility under shift despite being predictive under bias. Reusing the same exemplar-defined concepts in a model trained without the shortcut bias shows that these patterns remain separable and spatially coherent, and regain predictive alignment when the shortcut distribution is restored. Across depth, the same localized concepts remain traceable through most layers and weaken only in the final stage, where spatial detail is pooled before classification.
Targeted interventions further show that the identified shortcuts are carried by localized visual regions and that weakening them changes predictive reliance, suggesting that shortcut behavior is governed by whether spurious cues are encoded, how they are organized, and when and where they drive prediction.
Stress Tests REVEAL Fragile Temporal and Visual Grounding in Video-Language Models
Abstract
Video-Language Models (VidLMs) achieve strong benchmark scores, yet it remains unclear whether they truly encode and use visual evidence when generating answers. Aggregate accuracy alone cannot distinguish whether failures arise because visual signals were never encoded or because they were later overridden by model priors. We introduce REVEAL, a controlled diagnostic framework for linking VidLM behavior to internal failure mechanisms. REVEAL contains five probes: camera-motion sensitivity, cross-frame integration, video sycophancy, language-only shortcuts, and temporal expectation bias. Together, these probes test whether models encode basic video signals, integrate evidence across frames, and preserve visual evidence against linguistic and prior-driven biases. Across 11 VidLMs, we find systematic failures along both pathways. We further conduct mechanistic analyses to localize where visual evidence is encoded, ignored, or suppressed across the model pipeline. More broadly, the encoding-versus-override distinction extends beyond VidLMs and provides a general framework for diagnosing evidence loss in multimodal systems.
Training Dynamics & Learning (8)
Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse
Abstract
Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally construct a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, Sink Attention, and Gated Attention. We hope this study offers a new perspective on attention mechanisms and encourages further exploration of the inherent MoE structure within attention layers.
Class-Conditional Activation Regularization (CCAR): Intrinsic Robustness as an Emergent Geometric Property
Abstract
Standard supervised learning optimizes for predictive accuracy but remains agnostic to the internal geometry of learned features, often yielding representations that are entangled and brittle. We propose Class-Conditional Activation Regularization (CCAR) to explicitly engineer the feature space, imposing a block-diagonal structure via a soft inductive bias. By shaping the latent representation to confine class energy to orthogonal subspaces, we create an intrinsic geometric scaffold that naturally filters noise and adversarial perturbations. We provide theoretical analysis linking this structural constraint to the maximization of the Fisher Discriminant Ratio, establishing a formal connection between geometric disentanglement and algorithmic stability. Empirically, this approach demonstrates that robustness is an emergent property of a well-engineered feature space, significantly outperforming baselines on label noise and input corruption benchmarks.
How Training Window Length Shapes Neural Language Model Weights
Abstract
We study how the training context window length $w$ is associated with changes in the learned weights of language models. By training identical architectures across three families (Transformer, 81.1M; GRU, 39.5M; RetNet, 41.5M) on ten window lengths ($w \in \{128, \ldots, 65536\}$) with a fixed 500M-token budget, we characterize the geometry of window-induced weight changes. Our primary finding: in a matched-step regime where all models receive identical optimizer steps and tokens per step, adjacent-window final-weight angular distance forms an approximate plateau ($\sim$0.175, corresponding to $\sim$31.5$^\circ$), spanning a $32\times$ range from $w=2048$ to $w=65{,}536$. This regularity is reproduced across two independent Transformer seeds with near-identical magnitudes.
Two qualifications matter. Window-effect vectors (the weight displacement induced by doubling $w$) are completely orthogonal across seeds (cosine $\approx 0$): the magnitude of the window-induced change is reproducible, but its direction in weight space is initialization-dependent. Consecutive window doublings within the same seed produce anti-correlated displacement vectors (cosine $\approx -0.45$), though we show this is largely a geometric artifact of finite differences sharing a middle term.
Functional validation via output KL divergence at multiple evaluation lengths provides evidence that adjacent-window models are moderately more similar than cross-seed models (sym-KL $\approx 0.27$ vs. $0.31$). Zero-shot weight transfer reveals strong asymmetry: short-to-long fails catastrophically while long-to-short degrades substantially. Brief fine-tuning (100 steps) closes the gap, suggesting the specialization is shallow in the loss landscape. RoPE frequency utilization is uniform across all windows in both final weights and learned updates ($N_{\text{eff}} \approx 32$ per head), providing no evidence for norm-level frequency condensation.
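Two of the numerical statements in this abstract can be checked with a short calculation. The sketch assumes that "angular distance" means the angle divided by $\pi$ (which is consistent with $0.175 \times 180^\circ = 31.5^\circ$) and models three successive weight vectors as independent Gaussian draws; both are simplifying assumptions, and the helper names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def angular_distance(a, b):
    # Angle between vectors, normalized to [0, 1] by dividing by pi (assumed).
    return float(np.arccos(np.clip(cosine(a, b), -1.0, 1.0)) / np.pi)

# A normalized angular distance of 0.175 corresponds to this many degrees.
angle_deg = 0.175 * 180.0

# Finite-difference artifact: consecutive differences share the middle term w2,
# which enters d1 = w2 - w1 with a plus sign and d2 = w3 - w2 with a minus sign.
rng = np.random.default_rng(0)
w1, w2, w3 = rng.standard_normal((3, 50_000))
artifact_cos = cosine(w2 - w1, w3 - w2)

# Sanity check of the distance on two orthogonal unit vectors.
ortho = angular_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Under the independence assumption the expected cosine between consecutive differences is $-1/2$, so a value near $-0.45$ is close to what the shared middle term alone would produce, in line with the abstract's caveat.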
A Mechanistic Study of Transformers Training Dynamics
Abstract
Large-scale pretraining of transformers has been central to the success of foundation models. However, the scale of those models limits our understanding of the mechanisms at play during optimization. In this work, we study the training dynamics of transformers in a controlled and interpretable setting. On the sparse modular addition task, we demonstrate that specialized attention circuits, called clustering heads, can be implemented during gradient descent to solve the problem. Our experiments show that such pathways naturally emerge during training. By monitoring the evolution of tokens via a visual sandbox, we uncover a two-stage learning process and the occurrence of loss spikes due to the high curvature of normalization layers. Our findings provide several insights into patterns observed in more practical settings, such as the pretraining of large language models.
Unveiling Memorization–Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise
Abstract
Highly over-parameterized models can simultaneously memorize noisy labels and generalize well, yet how these behaviors coexist remains poorly understood. In this work, we investigate the underlying mechanisms of this coexistence using modular arithmetic tasks under heavy label noise. Through extensive experiments on two-layer neural networks, we find that larger models tend to generalize better under appropriate optimization and model configurations, while noisy labels are memorized faster than clean data. Over-parameterized models internally form a generalization structure, but its expression in the output is suppressed by the need to fit noisy labels. Remarkably, even with 80% label noise, near-perfect test accuracy can be achieved by extracting this internal structure using frequency-based methods. We further propose a task-agnostic method to partition networks into generalization and memorization components. Although this subnetwork improves generalization, it is limited compared with frequency-based extraction, indicating that the generalization structure is distributed across neurons and motivating the development of new tools to retrieve generalizable knowledge from over-parameterized networks.
How feature learning breaks the curse of dimensionality in isotropic data: A mean field approach
Abstract
Neural networks efficiently learn isotropic data distributions with low-dimensional target structure (canonical examples include $k$-sparse parity and sparse multi-index models), yet their neural tangent kernel limits require $d^k$ samples in the ambient dimension $d$. We trace this gap to a single mechanism: during training, networks develop strong anisotropy between weight coordinates aligned with the task and those that are not. We call this input feature selection (IFS) and show, through analysis of stochastic gradient Langevin dynamics, that it arises from coordinate-dependent effective regularisation that kernels structurally cannot exhibit. Mean-field (MF) theory is the natural interpretable framework for feature learning beyond the kernel regime, but standard MF tracks only first moments of the weight distribution and so cannot represent IFS. We introduce MF-ARD, which augments MF with coordinate-wise precisions through automatic relevance determination. With this single additional set of order parameters, MF-ARD (i) captures the sharp generalisation transitions of SGLD-trained networks on $k$-sparse parity and single-index models, and (ii) provably breaks the curse of dimensionality: its phase-transition threshold depends on the intrinsic task dimension $k$ rather than the ambient dimension $d$.
Weight-Space Geometry of Offline Reasoning Training
Abstract
We train six offline reasoning losses on the same data, same model, same everything and look at what they actually do to the weights. Turns out SFT, RFT, and RIFT learn the same update (cosine ≥ 0.97). Filtering or reward-weighting negatives doesn't change the direction, just the step size. DFT diverges more despite being a one-line fix. GRPO adds a real orthogonal component but stays in the same basin. DPO is a genuinely different algorithm (orthogonal subspace, loss barrier, CKA collapse) and the only one that's meaningfully more accurate (93.5% vs 87–88% GSM8K; 30% vs 3–10% AIME26). Most "offline RL for reasoning" is SFT in disguise.
Position: Don't Just "Fix it in Post": A Science of AI Must Study Learning Dynamics
Abstract
What would it mean to have a scientific understanding of AI? Language models are not static objects—they are snapshots of time-evolving processes shaped by data, objectives, and optimization dynamics. Yet the field predominantly treats models as fixed artifacts, analyzing behaviors after training rather than asking why they emerge. This position paper argues that AI research should move beyond post hoc fixes and study the learning dynamics of models. We envision a hierarchy of scientific maturity: first predict outcomes from early training signals, then intervene when trajectories go wrong, and ultimately design training procedures that guarantee desired properties. Scaling laws have reached the first level for loss; the challenge is extending all three levels to general capabilities, biases, and safety. We articulate requirements for such theories, survey progress across mechanistic interpretability, fairness, memorization, and learning dynamics, and identify concrete open problems. The path forward requires treating models as processes to be understood, not just artifacts to be patched.
Applications, Benchmarks & Tools (30)
Distill to Detect: Exposing Stealth Biases in LLMs through Cartridge Distillation
Abstract
Language models deployed in high-stakes roles can potentially favor certain entities, brands, or viewpoints, steering user decisions at scale.
Such preferential biases can be introduced by any actor in the model's supply chain and are most dangerous when the model reveals its preference only on the relevant topic while behaving identically to its unmodified base on all other inputs.
Recent work has shown that these biases can transfer through context distillation on semantically unrelated data, with the signal residing entirely in the soft logit distribution and remaining invisible to text-based inspection.
However, the defender faces a fundamental asymmetry: without knowing the bias topic, no detection method can reliably surface a stealth preferential bias, regardless of whether it examines generated text, internal representations, or model weights.
Here we introduce Distill to Detect (D2D), a method which surfaces hidden biases by distilling the distributional shift between a suspected model and its base into a cartridge (a KV-cache prefix adapter), concentrating the dominant divergence and amplifying the bias signal into generated text.
We show that D2D successfully amplifies the hidden biases of stealth models to the extent that they can be reliably detected across multiple bias types.
We also propose a theoretical framework that explains the efficacy of D2D through the lens of Fisher-weighted projection of the logit distribution shift, supported by empirical observations.
By turning the capacity bottleneck of prefix-tuning adapters into a detection tool, D2D provides a practical building block for auditing hidden behaviors in deployed language models.
Sparse Autoencoders Find Causal, Lineage-Specific Context Features in Chromatin Foundation Models
Abstract
Sparse autoencoders (SAEs) have produced important insights in language model interpretability, but their utility on transformers trained on scientific data remains underexplored. We extend the SAE-plus-causal-intervention toolkit to an epigenomics foundation model, EpiBERT, and ask whether it internally encodes a biologically meaningful contrast: in vitro (cell line) vs. in vivo (primary tissue) chromatin context. We train layer-wise SAEs with BatchTopK activations across six matched ATAC-seq conditions spanning blood, liver, and lymph lineages; introduce the Context Divergence Score (CDS), a contrastive t-statistic applicable to any probed transformer, to identify context-specific features; and validate them through causal ablation, linear context-steering, and three-level biological annotation (ChromHMM, HOMER, GO:BP). We find a depth-stratified context representation: context-specific features grow 3.8-fold from early to late layers (57 → 215 Bonferroni-significant), mirroring the late-layer concentration of high-level features in language model SAEs. Causal ablation of CDS-selected features yields a large effect, context-steering closes 11.2% of the prediction gap at 4.5× above random, and biological annotation grounds the discovered features in lineage-defining transcription factors. These results demonstrate that the SAE methodology transfers cleanly from language to genomics, and that CDS provides a general primitive for identifying contrastive concepts in any probed transformer. Code is available at https://github.com/nicoleching515/gene_expression_predictions.
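The abstract describes CDS only as a contrastive t-statistic, so the sketch below is an assumption about what such a score could look like: a Welch two-sample t-statistic on one SAE feature's activations in two contexts. The function name `welch_t` and the activation values are hypothetical, and the actual CDS definition may differ.

```python
import math
from statistics import mean, variance

def welch_t(acts_a, acts_b):
    """Welch two-sample t-statistic for one feature across two contexts."""
    # Standard error from the unbiased sample variance of each context.
    se = math.sqrt(variance(acts_a) / len(acts_a) + variance(acts_b) / len(acts_b))
    return (mean(acts_a) - mean(acts_b)) / se

# Toy activations of a single feature (assumed values, for illustration only).
in_vitro = [2.0, 2.2, 1.8, 2.1, 1.9]
in_vivo = [0.5, 0.7, 0.4, 0.6, 0.3]

score = welch_t(in_vitro, in_vivo)
```

A feature would count as context-specific when the magnitude of its score exceeds a multiple-testing-corrected threshold, such as the Bonferroni correction the abstract mentions.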
ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior
Abstract
Post-hoc interpretability methods typically attribute a model’s behavior to its components, data, or training trajectory in isolation, and are often tied to a particular level of granularity along the local-to-global spectrum. This leads to explanations that lack a unified view and may miss key interactions. We present ExPLAIND, a theoretically grounded, unified framework that integrates model components, data, and training trajectory while supporting explanations across granularities. We generalize recent work on gradient path kernels, reformulating models trained by AdamW as kernel machines. From the resulting kernel feature maps, we derive novel parameter-wise and step-wise influence scores. We empirically validate the resulting decomposition of model behavior in several settings and apply ExPLAIND to two case studies. Our findings on a Transformer exhibiting Grokking support previously proposed learning phases, while refining the final phase as one in which outer layers align around a representation pipeline learned after memorization. For EuroLLM pretraining, ExPLAIND reveals a two-phase dynamic, with the first characterized by outer-layer MLP learning and the second by increased relative influence of intermediate attention layers. These results establish ExPLAIND as a unified framework for interpreting model behavior and training dynamics.
Are Sparse Autoencoder Benchmarks Reliable?
Abstract
Sparse autoencoders (SAEs) are a core interpretability tool for large language models, and progress on SAE architectures depends on benchmarks that reliably distinguish better SAEs from worse ones. We audit the SAE quality metrics in SAEBench, the de-facto standard SAE evaluation suite, through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. We find that two of these metrics, Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR), fail multiple lenses at their canonical settings and should not be used to evaluate SAEs. The other metrics show higher reseed noise and lower discriminability than the field assumes. The sae-probes variant of $k$-sparse probing is the most reliable metric we tested, but even sae-probes struggles to separate variants of the same SAE architecture. Our results show the field needs better SAE benchmarks.
Interpreting Genomic Language Models using Sparse Autoencoders
Abstract
Genomic language models (gLMs) achieve strong performance across diverse genomic prediction tasks, but their internal biological representations remain poorly understood. Sparse autoencoders (SAEs) have emerged as an interpretability tool in vision and natural language models, yet their applicability to gLMs remains unexplored. We present a systematic study of SAE-based interpretability for gLMs, introducing a diverse benchmark of human genomic annotations and a suite of genome-tailored interpretability metrics. Using Evo2 as a primary case study, we show that SAE features, particularly those from intermediate layers, are more interpretable than raw model embeddings across 42/55 (76%) of our genomic concept evaluations, with 26 of them having an F1 score greater than 0.7. We further find that interpretability depends on SAE training data properties such as evolutionary proximity and context length, with mixed-species and longer-context training improving recovery of human genomic features. Finally, we develop a graph-based representation method to construct a feature atlas that organizes semantically related genomic concepts learned by an SAE, outperforming the baseline approach of using SAE model weights. Our results establish SAEs as a powerful framework for better understanding gLMs, broadening their accessibility and utility for disease-driven genomic analysis.
Minionese: Comprehensive Benchmark and Mechanistic Study of Multilingual LLM Safety
Abstract
Safety alignment in large language models remains brittle across languages: prompts reliably refused in English can elicit harmful compliance in non-English and low-resource settings. We introduce Minionese, a multilingual jailbreak benchmark spanning 18 languages, 4 resource tiers, and 4 perturbation types (standard translation, code-switching, transliteration, and translationese), paired with a geometric mechanistic analysis of refusal failure across language tiers. We show that each attack type produces a distinct vulnerability profile: transliteration vulnerability is mediated by script identity, code-switching maintains effectiveness through the lowest-resource tier, and a sharp safety regime transition between Tiers 2 and 3 is consistent across all models. Mechanistically, low-resource jailbreaks succeed by routing harmful content through a geometrically misaligned subspace that projects insufficiently onto the refusal directions, leaving the refusal mechanism intact but untriggered. These findings show that English-only safety evaluations are insufficient: safety evaluations must account for script family, perturbation type, and per-language alignment coverage. The benchmark and analysis code are available at https://anonymous.4open.science/r/minionese/.
Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models
Abstract
Transformer-based tabular foundation models (TFMs) dominate small to medium tabular predictive benchmark tasks, yet their inference mechanisms remain largely unexplored. We present the first large-scale mechanistic study of layerwise dynamics in 6 state-of-the-art tabular in-context learning models. We explore how predictions emerge across depth, identify distinct stages of inference, and reveal latent-space dynamics that differ from those of language models. Our findings indicate substantial depthwise redundancy across multiple models, suggesting iterative refinement with overlapping computations during inference stages. Guided by these insights, we design a proof-of-concept, looped single-layer model that uses only 20% of the original model’s parameters while achieving comparable performance.
Building Fast, Evaluating Slow: Pipeline Choices Dominate Autointerpretability Score Variance
Abstract
Cross-paper comparison of sparse autoencoder (SAE) interpretability often relies on autointerpretability scores. In this evaluation pipeline, a language model (LM) explains each feature, and another LM scores the explanation. For these comparisons to be meaningful, scores must reflect stable properties of the features rather than confounding aspects of the evaluation pipeline. Through systematic experiments across four metrics (simulation, detection, fuzzing, purity), two models (Pythia-160M, Apertus-8B), and four axes of methodological variation, we show that this assumption does not hold. Specifically, we find that (R1) methodological variance collectively exceeds architectural variance across all metrics and tested models; (R2) each metric exhibits a distinct instability profile, with detection being the most stable and fuzzing unreliable across all conditions; and (R3) top-$k$ feature rankings do not stay consistent across corpus and draw conditions, masking per-feature instability behind stable mean scores, a failure that cannot be detected by monitoring explanation similarity alone. These findings suggest that cross-paper comparisons based on autointerpretability scores may reflect pipeline differences rather than architectural differences, with implications for the ongoing debate on SAE utility. More broadly, unreliable evaluation slows progress in interpretability research at a time when reliable tools for understanding AI systems are needed. To support evaluation, we contribute a variance decomposition approach, a Stability Check, and a Minimum Reporting Checklist.
The Platonic Universe: Do Foundation Models See the Same Sky?
Abstract
We test the Platonic Representation Hypothesis (PRH) and its Aristotelian refinement (ARH) by using diverse astronomical data to measure representational convergence across foundation models.
We propose that astronomy is a natural testbed for this: the historical success of astrophysics is itself evidence that a compact, modality-invariant description of galaxy observables exists, and so representation convergence toward reality should be measurable against the physical parameters astronomers already use.
Given this framework, we evaluate eleven foundation model families (spanning supervised classification, self-distillation, joint-embedding prediction, masked autoencoding, vision-language pre-training, and astronomy-specific architectures, from $\mathcal{O}(10\text{M})$ to $\mathcal{O}(10\text{B})$ parameters) on crossmatched JWST, HSC, and Legacy Survey imagery, and DESI spectroscopy.
All models are evaluated frozen, with no astronomy-specific fine-tuning.
We probe redshift, stellar mass, and specific star formation rate via linear probes, and local (MKNN) and global (CKA) embedding geometry within families, between modalities, and across architectures.
We find that physics performance scales predictably with capacity; probe directions align consistently with expected astrophysical correlations and selection effects; and local (but not global) embedding alignment tracks physics performance, including between DESI spectra and HSC imagery, modalities that share essentially no low-level statistics.
Our results support the ARH over the strict PRH, and suggest that astro-foundation models can build on general-purpose pre-trained architectures, capitalizing on the broader open machine learning community's already-spent computational investment.
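For readers unfamiliar with the global alignment measure named above, here is a minimal sketch of linear CKA between two embedding matrices with matching rows (one row per object). It uses the standard linear-CKA formula in plain Python; the helper names and toy matrices are illustrative assumptions, and the paper's exact CKA variant is not specified in the abstract.

```python
import math

def center_columns(m):
    """Subtract the column mean from every entry of a row-major matrix."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in m]

def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for a (n x p) and b (n x q)."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            s = sum(a[k][i] * b[k][j] for k in range(len(a)))
            total += s * s
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    xc, yc = center_columns(x), center_columns(y)
    denom = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return cross_frobenius_sq(xc, yc) / denom

# Toy embeddings (assumed values): y is x with columns swapped and scaled by 2.
x = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [3.0, 5.0]]
y = [[0.0, 2.0], [2.0, 0.0], [2.0, 4.0], [10.0, 6.0]]

cka_xy = linear_cka(x, y)
```

Because linear CKA is invariant to isotropic scaling and orthogonal transformations, the two toy embeddings score approximately 1 even though their coordinates differ.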
Verifiable Explanations Cannot Be Much Smaller Than the Behavior They Explain
Abstract
Interpretability often promises a small explanation of a large model. We study the harder case where the explanation must stand alone: it must take inputs, produce outputs, and be verifiable without access to the original model. In that setting, the relevant object is not just a rule but a full executable package, including any decoders, input/output adapters, region selectors, residual corrections, and certificate needed to apply and check it. We formalize the behavior to be explained as finite input–output traces and prove that any exact, verifiable explanation package can be compiled into a simulator for those traces. As a result, it cannot be much smaller, up to constant coding overheads, than the shortest program with the same boundary behavior. This identifies when apparent interpretability compression is real and when it is only hidden in omitted infrastructure: local and approximate explanations help only when the selector, disagreement set, or adapter is itself simple. Experiments on a variety of models show large accounting gaps when these omitted costs are restored. The paper offers a practical takeaway for interpretability evaluation: report the whole ledger, not just the visible artifact.
Interaction-Aware Influence Functions for Group Attribution
Abstract
Influence functions approximate how removing a training example changes a quantity of interest, called the target function, such as a held-out loss. To estimate the influence of a group of examples, the standard practice is to sum the individual influences of its members. However, this sum does not capture how examples jointly affect the target: a pair of examples may be redundant or complementary, but the sum cannot distinguish these cases. We propose an interaction-aware influence function that characterizes how interactions between examples influence the target. By expanding the target to second order around the trained parameters, we obtain an estimator that augments the standard sum with a pairwise interaction term that captures the alignment between two examples' effects on the target. We empirically evaluate our estimator in two settings. First, on six dataset-model pairs spanning logistic regression, MLPs, and ResNet-9, our estimator tracks leave-group-out retraining substantially better than first-order influence across all settings. Second, when used as a greedy selection rule for instruction-tuning data on Llama-3.1-8B, it beats prior influence-based and representation-similarity baselines on five of seven downstream tasks, in a regime where standard influence-based selection underperforms random selection. Code is available at https://anonymous.4open.science/r/Interaction_IF-45D6.
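The gap between summed individual effects and the true group effect can be seen in a toy case. The sketch below is not the paper's estimator: it uses exact retraining of a one-dimensional ridge regression (closed form) and defines the interaction as the leave-group-out effect minus the sum of leave-one-out effects. The data, the regularisation strength, and all function names are assumptions for illustration.

```python
def ridge_slope(xs, ys, lam=1.0):
    """Closed-form 1-D ridge solution: sum(x*y) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def removal_effect(xs, ys, removed):
    """Change in the fitted slope when the given indices are left out."""
    kept = [i for i in range(len(xs)) if i not in removed]
    reduced = ridge_slope([xs[i] for i in kept], [ys[i] for i in kept])
    return reduced - ridge_slope(xs, ys)

# Toy data (assumed values): examples 0 and 1 are exact duplicates.
xs = [1.0, 1.0, 2.0, 3.0]
ys = [2.0, 2.0, 1.0, 3.0]

individual_sum = removal_effect(xs, ys, {0}) + removal_effect(xs, ys, {1})
group_effect = removal_effect(xs, ys, {0, 1})
interaction = group_effect - individual_sum
```

Here the two duplicated examples have a nonzero interaction, so summing their individual effects misestimates what removing both would do; this is the quantity a pairwise term is meant to approximate without retraining.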
What Does a Chemical Language Model Know About Molecules?
Abstract
Chemical language models (cLMs) are widely assumed to learn surface-level syntactic patterns rather than learning meaningful molecular semantics. Here, we apply sparse autoencoders (SAEs) to MolFormer, an encoder-only cLM, to mechanistically examine how molecular representations are built across layers. We discover that early layers rely on position-tracking latents to parse molecular grammar, while later layers encode atom-in-substructure and pharmacologically relevant features. Additionally, we show that non-canonical SMILES produce more disruptive representation shifts than invalid SMILES, driven by position-latent disruption propagating across layers. To support further exploration, we develop InterMol, an interactive visualizer for SAE activations on molecular strings and structures.
Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer
Abstract
Health foundation models (FMs) learn useful representations from wearable sensors, but interpreting what they encode and transferring that knowledge across modalities after training remains difficult. We present a post-training framework that decomposes frozen embeddings into interpretable directions, referred to as symbols, and use these symbols to align the embedding spaces without retraining. We evaluate the framework on three FMs for photoplethysmography (PPG) and accelerometer data, independently pretrained on ∼20M minutes of unlabeled data from ∼172K participants, and analyzed on a held-out cohort of 30K subjects. We find that extracted symbols associate selectively with health conditions and physiological attributes, and these associations are partially shared across modalities and architectures. Cross-modal transfer via symbols retains more than 95% of in-domain performance, is nearly symmetric across domain directions, and saturates with limited paired data, together indicating that alignment recovers a shared low-dimensional subspace rich in physiological information. Overall, these results suggest that health FM embeddings contain an interpretable symbolic organization that is shared across modalities and supports cross-domain transfer without joint training.
The Dead Salmons of AI Interpretability
Abstract
In a striking neuroscience study, the authors placed a dead salmon in an MRI scanner and showed it images of humans in social situations. Astonishingly, standard analyses of the time reported brain regions predictive of social emotions. The explanation, of course, was not supernatural cognition but a cautionary tale about misapplied statistical inference. In AI interpretability, reports of similar "dead salmon" artifacts abound: feature attribution, probing, sparse auto-encoding, and even causal analyses can produce plausible-looking explanations for randomly initialized neural networks. In this work, we examine this phenomenon and argue for a pragmatic statistical-causal reframing: explanations of computational systems should be treated as parameters of a (statistical) model, inferred from computational traces. This perspective goes beyond simply measuring statistical variability of explanations due to finite sampling of input data; interpretability methods become statistical estimators, and findings should be tested against explicit and meaningful alternative computational hypotheses, with uncertainty quantified with respect to the postulated statistical model. It also highlights important theoretical issues, such as the identifiability of common interpretability queries, which we argue is critical to understand the field’s susceptibility to false discoveries, poor generalizability, and high variance. More broadly, situating interpretability within the standard toolkit of statistical inference opens promising avenues for future work aimed at turning AI interpretability into a pragmatic and rigorous science.
Position: Interpretability Can Be Actionable
Abstract
Interpretability aims to explain the behavior of deep neural networks. Despite rapid growth, there is mounting concern that much of this work has not translated into practical impact, raising questions about its relevance and utility. This position paper argues that the central missing ingredient is not new methods, but evaluation criteria: interpretability should be evaluated by actionability—the extent to which insights enable concrete decisions and interventions beyond interpretability research itself. We define actionable interpretability along two dimensions—concreteness and validation—and analyze the barriers currently preventing real-world impact. To address these barriers, we identify five domains where interpretability offers unique leverage and present a framework for actionable interpretability with evaluation criteria aligned with practical outcomes. Our goal is not to downplay exploratory research, but to establish actionability as a core objective of interpretability research.
The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust
Abstract
As language models improve and become increasingly deployed to solve a variety of tasks, trustworthiness becomes essential. Calibration is a good proxy for trust: well-calibrated confidence estimates help inform the risk versus reward tradeoff when trusting a specific model output. Unfortunately, even as models improve, they remain poorly calibrated, often biasing towards overconfidence. Additionally, calibration can be gamed: a policy that always predicts the base rate is perfectly calibrated, but completely uninformative. To resolve this, we develop a new metric, expected utility renormalized by the oracle (EURO), that balances calibration and informativeness. We also propose a general-purpose activation-based confidence, utility, and trust estimation protocol (ACUTE) to appropriately adjudicate uncertainty. The ACUTE protocol provides flexible, sample-efficient, and compute-efficient confidence estimators for 3 tasks including multiple choice question answering, tool-calling, and scientific document summarization across 6 models from 4 model families. ACUTE outperforms strong baselines on EURO, while maintaining low calibration error. Taken together, our work shows that equipping LLMs with the ACUTE protocol can improve calibration, utility, and trustworthiness in numerous settings.
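The claim that calibration can be gamed is easy to check numerically. The sketch below does not implement EURO, whose definition is not given in the abstract; it only compares a constant base-rate predictor with an informative one, using a standard binned expected calibration error (ECE) and the Brier score. The labels, confidences, and function names are assumptions for illustration.

```python
def expected_calibration_error(confidences, labels, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - confidence| over bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(p for p, _ in items) / len(items)
        mean_acc = sum(y for _, y in items) / len(items)
        ece += len(items) / len(labels) * abs(mean_acc - mean_conf)
    return ece

def brier_score(confidences, labels):
    """Mean squared error between confidence and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(confidences, labels)) / len(labels)

# Toy outcomes (assumed): 1 = model output was correct, 0 = incorrect.
labels = [1, 1, 1, 0, 1, 1, 0, 1]
base_rate = sum(labels) / len(labels)            # 0.75
constant = [base_rate] * len(labels)             # same confidence everywhere
informative = [0.9 if y else 0.1 for y in labels]  # separates right from wrong

ece_constant = expected_calibration_error(constant, labels)
ece_informative = expected_calibration_error(informative, labels)
```

In this toy case the constant predictor has zero ECE yet cannot distinguish correct from incorrect outputs, while the informative predictor has a worse ECE but a much better Brier score, which is why a metric that also rewards informativeness is needed.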
Position: Causality is Key for Interpretability Claims to Generalise
Abstract
Interpretability research on large language models (LLMs) has yielded important insights into model behaviour, yet recurring pitfalls persist: findings that do not generalise, and causal interpretations that outrun the evidence. Our position is that causal inference specifies what constitutes a valid mapping from model activations to invariant high-level structures, the data or assumptions needed to achieve it, and the inferences it can support. Specifically, Pearl's causal hierarchy clarifies what an interpretability study can justify. Observations establish associations between model behaviour and internal components. Interventions (e.g., ablations or activation patching) support claims about how these edits affect a behavioural metric (e.g., average change in token probabilities) over a set of prompts. However, counterfactual claims (i.e., asking what the model output would have been for the same prompt under an unobserved intervention) remain largely unverifiable without controlled supervision. We show how causal representation learning (CRL) operationalises this hierarchy, specifying which variables are recoverable from activations and under what assumptions. Together, these motivate a diagnostic framework that helps practitioners select methods and evaluations matching claims to evidence such that findings generalise.
Residue-Level Attributions in Protein Language Models Do Not Recover Allergen Epitopes
Abstract
Deep allergenicity classifiers are increasingly used in safety screening of novel foods, and recent protein language models have substantially improved protein-level allergenicity prediction. However, whether their explanations capture biologically meaningful information remains unclear. We introduce an epitope-grounded residue-level benchmark for quantitatively evaluating attribution faithfulness in protein allergenicity models. Across frozen ESM-2, multi-task ESM-2, and DeepPlantAllergy, protein-level classification was robust, yet classification-head explanation signals did not significantly exceed random in their residue-level alignment with annotated epitopes across AUROC, AUPRC, and Precision@$k$. Integrated Gradients identified residues that were functionally important to the model, but these did not overlap with annotated epitopes. Saturation mutagenesis further suggested classifiers may rely on physicochemical and compositional sequence features rather than epitope-specific mechanisms. Residue-level importance signals should therefore not be interpreted as immunological explanations for safety screening or hypoallergen design without quantitative validation. Code is available at https://github.com/Jeffateth/XAllergen2.0-paper
The Data Manifold under the Microscope
Abstract
A significant gap exists between theory and practice in deep learning. Generalization and approximation error bounds are often derived for simplified models or are too loose to be informative. Many rely on the manifold hypothesis and on geometric regularity such as intrinsic dimension, curvature, and reach. Progress requires insight into data-manifold geometry and suitable benchmarks, yet existing options are polarized: analytic manifolds with known geometry but limited applicability, or real-world datasets where geometry is only coarsely estimable. We introduce a benchmarking framework for studying data geometry. We repurpose and extend dSprites and COIL-20 with additional transformation dimensions and dense, axis-aligned sampling, and pair them with finite-difference estimators that recover curvature, reach, and volume at near-ground-truth accuracy in a regime where general-purpose estimators are unreliable or difficult to deploy. The framework is intended as a controlled testbed, useful as a calibration environment for geometric estimators and as a sandbox for probing theoretical assumptions and internal representations. To illustrate its use, we present two application studies: assessing the scaling behavior of the bounds of Genovese et al. and Fefferman et al., and tracking how $\beta$-VAE layers reshape known image manifolds. The latter provides an interpretability-oriented use case, as it quantifies whether internal maps preserve factors of variation, separate object-class manifolds, or introduce geometric distortions such as increased curvature and reduced reach.
A reference implementation is available at https://github.com/koulakis/manifold-microscope.
Designing Effective Monitor-Based Interventions for Mitigating Reward Hacking During RL
Abstract
Reinforcement learning (RL) rewards are notoriously difficult to design and control, often leading to the model learning unintended behaviors such as reward hacking. One potential solution is to monitor for reward hacking and penalize it when detected; however, training against a monitor could lead to evasive behavior, and our general understanding of how to apply monitors effectively during training is limited. To study how best to use monitors to mitigate reward hacking, we introduce and open source three realistic environments where Qwen3-4B reward hacks: a coding environment hackable via test overwriting, a medical chat environment hackable via sycophancy, and a biography generation environment hackable via hallucination. We first focus on the coding environment, where we find that: (1) models can learn to evade highly accurate monitors by exploiting systemic flaws in probes and LLM judges; (2) monitors that leak more learning signal during RL suppress reward hacking but are more often evaded; and (3) including easier problems in training can decrease reward hacking. We apply our findings to build better reward hacking monitors for the medical chat and biography generation environments that improve upon naive baselines to reduce reward hacking rates across seeds from 70-100% to 0%. Our results demonstrate that our takeaways translate to new settings and that better monitor intervention designs are possible.
Pretraining Data Statistics Shape the Phases of Learning Entity Comparison in Language Models
Abstract
How does data shape language model (LM) behavior throughout pretraining? We investigate this question through a case study on entity comparison, e.g., "Between France and Brazil, which country is larger?". We begin with controlled experiments in which we train small LMs (124M parameters) on mixtures of natural text from pretraining corpora and synthetic data from entity comparison tasks. We identify three distinct phases of learning: (1) an early phase where the LM selects entities by frequency, (2) a middle phase where the LM selects entities by position in a prompt (first vs. last), and (3) a late phase where the LM selects the entity that is the correct answer to the question. We show that the emergence of these three phases is controlled by statistical properties of the training data. With small amounts of task-specific synthetic data, we observe only the first two phases and the model fails to learn the task; with large amounts, the model jumps directly from the frequency-based heuristic to solving the task correctly. Moreover, if we modify the frequency of entities in data from a naturally occurring Zipfian distribution (a small number of entities are very common and the vast majority are rare) to a uniform distribution, the first phase disappears and the model learns the task more quickly. Finally, we find the same three phases of learning in the pretraining of open-sourced OLMo models. Together, our findings demonstrate that properties of pretraining data are causal drivers of heuristic learning and show that small-scale synthetic experiments can predict training dynamics at larger scales.
Current Activation Oracles Are Hard to Use on Safety-Relevant Tasks
Abstract
Activation oracles (AOs) are LLMs finetuned to answer questions about another model’s activations (Karvonen et al., 2025). We test the publicly released Qwen3 AOs on out-of-distribution safety-relevant tasks and find them difficult to use off the shelf.
AO responses are often vague to the point of being unfalsifiable, and confidently hallucinate when pushed to be specific. Many apparent successes are explained by a confound we call text inversion: because most of the AO’s training involves predicting tokens near a given activation, AOs can answer by paraphrasing nearby decoded text rather than reading deeper internal state. A related confound is that the AO is itself a capable LLM, and can sometimes answer from its own weights without using the activations at all.
We ran three tasks where text inversion cannot help, and AOs were at or near chance on all three. Given an arithmetic problem and asked to predict the answer from activations before any answer tokens, the AO emits the same handful of numbers regardless of the problem. Given a chain of thought that rationalizes a user-preference-flipped answer without ever mentioning the user, the AO cannot reliably tell it apart from a non-sycophantic rollout. Given a logic puzzle missing key information, the AO cannot identify what the model is confused about. When tasked with identifying why a reasoning model backtracks, the AO does better, but a follow-up edit experiment suggests its correct answers mostly restate keywords near the probe. As an additional case study, we apply AOs to censored topics in Chinese models and find the oracle’s own pretraining knowledge dominates, illustrating the second confound.
AOs do show signal in narrower settings closer to their training distribution: detecting subtle activation steering before it surfaces in text, multiple-choice selection among plausible backtracking reasons, and next/previous-token prediction. We read this as a meaningful negative update on current released AOs as general-purpose safety tools, though the limitations look fixable. Text-inversion-controlled tasks and a no-activation ablation should become standard checks for future AO evaluations.
Emergent Latent-State Computation under Stochastic Volatility
Abstract
Mechanistic interpretability has largely focused on language models and deterministic toy tasks. Much less is known about how sequence models internally represent latent stochastic dynamics from noisy, partial observations. We study this question in a controlled multivariate stochastic volatility setting, where models observe only returns while the ground-truth latent volatility state is known to the researcher. This setting provides a useful benchmark for mechanistic interpretability under partial observability: the latent state is hidden from the model but directly available for evaluation. Across architectures, losses, and output heads, we find evidence for a two-stage computation. Hidden representations encode substantial information about the next latent volatility state, and the output head maps this representation to squared return forecasts. Furthermore, in Transformers, latent-state decodability emerges at identifiable architectural stages whose location depends on the volatility period. In long-cycle regimes, this computation simplifies into an explicit latent-state filter consisting of a learned linear projection followed by $\ell^2$ normalization. Output-head replacement further shows that part of the degradation under noisy MSE training arises from readout misalignment rather than representation failure. These results suggest that stochastic volatility models provide a useful benchmark for mechanistic interpretability under noisy latent dynamics and partial observability.
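As a rough illustration of what "a learned linear projection followed by $\ell^2$ normalization" can look like, the sketch below applies a projection matrix to a hidden vector and rescales the result to unit norm. The matrix `W`, the dimensions, and the function name `latent_state_filter` are illustrative assumptions, not the models or weights studied in the abstract.

```python
import math

def latent_state_filter(hidden, W, eps=1e-12):
    """Project a hidden vector with W, then rescale to unit L2 norm."""
    projected = [sum(w * h for w, h in zip(row, hidden)) for row in W]
    norm = math.sqrt(sum(p * p for p in projected))
    return [p / max(norm, eps) for p in projected]

# Hypothetical 2x3 projection and 3-dimensional hidden state.
W = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0]]
hidden = [1.0, 1.0, 1.0]
state = latent_state_filter(hidden, W)
```

After normalization only the direction of the projected vector carries information, so any quantity read from `state` is insensitive to the overall scale of the hidden activations.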
Adaptive and interpretable two-sample test as a batch-level tool for AI safety
Abstract
Most safety defenses for large language models - input and output classifiers, refusal probes, per-query OOD detectors - inspect prompts and completions one at a time. Recent work shows that black-box iterative jailbreaks evade even the strongest of these (Davies et al. 2026), arguing that *"effective defence requires supplementing single-interaction methods with batch-level monitoring."* This suggests treating incoming prompts as a collection and asking whether their distribution deviates from benign traffic - a non-trivial problem, as attack queries are heavily diluted, new attacks circumvent fixed rules, and detection claims must be calibrated. We propose batch-level monitoring based on a calibrated, adaptive two-sample test applied to a model's hidden states, requiring only an attack-free reference. The test localizes the attack signal in activation space, making the geometry of the detected shift directly inspectable and providing an empirical basis for mechanistic follow-up. On `llama3-jailbreaks`, it detects attacks significantly better than standard two-sample baselines, and reveals that jailbreaks organize into two near-orthogonal groups - *scaffolded* versus *direct* prompts - visible in benign data too, suggesting that the input geometry relevant to safety is richer than a single refusal direction can capture.
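To make the batch-level framing concrete, here is a minimal sketch of a generic permutation two-sample test using a kernel MMD statistic on small vectors standing in for hidden states. It is not the adaptive, localizing test proposed in the abstract; the RBF kernel, the fixed bandwidth, the synthetic Gaussian data, and all function names are assumptions made for illustration.

```python
import math
import random

def rbf(x, y, bandwidth):
    """Gaussian (RBF) kernel between two vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2 * bandwidth ** 2))

def mmd2(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples."""
    def mean_k(a, b):
        return sum(rbf(u, v, bandwidth) for u in a for v in b) / (len(a) * len(b))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2 * mean_k(xs, ys)

def permutation_pvalue(reference, batch, n_perm=200, seed=0):
    """P-value for 'batch has the same distribution as reference'."""
    rng = random.Random(seed)
    observed = mmd2(reference, batch)
    pooled = reference + batch  # new list, inputs are not mutated
    n = len(reference)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    # The +1 terms keep the p-value valid (never exactly zero).
    return (exceed + 1) / (n_perm + 1)

# Synthetic stand-ins: an attack-free reference and a clearly shifted batch.
rng = random.Random(1)
reference = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
shifted = [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(20)]
p_shifted = permutation_pvalue(reference, shifted)
```

The permutation step is what calibrates the test: the null distribution of the statistic is estimated from the data itself, so only an attack-free reference sample is needed. In a diluted setting most of the batch would be benign, which shrinks the statistic and is the regime where a fixed-kernel test like this one is expected to lose power.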
When Reading the Chain of Thought Falls Short: A Testbed for Reasoning Trace Analysis
Abstract
Reading the chain of thought (CoT) is a widely used safety technique for reasoning models, but it struggles when the CoT leaves out or misrepresents the factors driving a behavior. However, we lack benchmarks that focus on these cases where reading the CoT fails, so progress on alternative methods is hard to measure. To address this gap, we introduce and release nine novel CoT analysis tasks, each with in-distribution (ID) and out-of-distribution (OOD) test sets. All nine tasks are extremely challenging, with both prompt-optimized frontier LLM monitors and human reviewers frequently achieving no better than chance. We benchmark probes, term frequency methods, LLM monitors, and an LLM agent with interpretability affordances. We focus on OOD performance, since ID results often reflect dataset-specific shortcuts. We find that no method dominates: narrow classifiers, an LLM agent, and LLM monitors all win on different tasks. We provide a lower bound baseline for future work by ensembling all methods with a select-on-ID and score-on-OOD protocol; this ensemble beats the human baseline on 6 / 7 tasks. We believe that our testbed gives future CoT analysis methods a non-saturated hill to climb.
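One plausible reading of the select-on-ID and score-on-OOD protocol is sketched below: for each task, pick the method with the best in-distribution score, then report that method's out-of-distribution score. The task names, method names, and accuracies are hypothetical, and the abstract does not specify the selection rule in more detail than its name.

```python
def select_on_id_score_on_ood(results):
    """results maps task -> method -> {"id": score, "ood": score}."""
    report = {}
    for task, methods in results.items():
        # Selection only ever looks at ID scores.
        best = max(methods, key=lambda m: methods[m]["id"])
        # The reported number is the chosen method's OOD score.
        report[task] = (best, methods[best]["ood"])
    return report

# Hypothetical accuracies for two tasks and two methods.
results = {
    "task_a": {"probe": {"id": 0.90, "ood": 0.55},
               "llm_monitor": {"id": 0.70, "ood": 0.65}},
    "task_b": {"probe": {"id": 0.60, "ood": 0.58},
               "llm_monitor": {"id": 0.75, "ood": 0.72}},
}
report = select_on_id_score_on_ood(results)
```

In this made-up example the probe is selected for `task_a` because of its ID score even though the LLM monitor does better OOD, which illustrates why ID results can reward dataset-specific shortcuts.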
Reconstruction Converges Before Decoder Reproducibility in Sparse Autoencoders
Abstract
Sparse autoencoders (SAEs) are often evaluated by reconstruction loss, but interpretability workflows also require that learned dictionaries be reproducible across random seeds and robust to evaluation artifacts. We study SAE decoder reproducibility as a benchmark-design problem: every stability score is reported against a metric-specific random-dictionary null, pairwise seed statistics are treated as dependent, and decoder geometry is audited with assignment-based, activation-level, firing-overlap, causal, streaming, and synthetic-ground-truth controls. In compute-limited cached-activation regimes, reconstruction can appear converged while decoder-column similarity remains within 1.5% of the geometric null; longer training raises decoder agreement, but activation and functional diagnostics lag. These results argue that SAE benchmarks should report reconstruction, null-calibrated decoder matching, held-out activation agreement, and ground-truth or downstream checks together rather than treating reconstruction or a single stability metric as sufficient.
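The idea of reporting a stability score against a random-dictionary null can be sketched as follows. This is a simplified illustration that uses mean best-match cosine similarity between decoder directions; the abstract's own metrics (including assignment-based matching, which would enforce a one-to-one pairing) may differ, and the dictionary sizes and function names here are assumptions.

```python
import math
import random

def random_dictionary(n_features, dim, rng):
    """Unit-norm decoder directions drawn isotropically at random."""
    rows = []
    for _ in range(n_features):
        v = [rng.gauss(0, 1) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        rows.append([x / norm for x in v])
    return rows

def mean_max_cosine(dict_a, dict_b):
    """For each direction in A, take its best cosine match in B; average."""
    total = 0.0
    for a in dict_a:
        total += max(sum(x * y for x, y in zip(a, b)) for b in dict_b)
    return total / len(dict_a)

def null_calibrated_score(dict_a, dict_b, n_null=20, seed=0):
    """Return the observed score and the mean score under a random null."""
    rng = random.Random(seed)
    dim = len(dict_a[0])
    observed = mean_max_cosine(dict_a, dict_b)
    null = [mean_max_cosine(dict_a, random_dictionary(len(dict_b), dim, rng))
            for _ in range(n_null)]
    return observed, sum(null) / n_null

rng = random.Random(0)
seed_a = random_dictionary(32, 16, rng)
seed_b = seed_a[::-1]                       # same directions, reordered
unrelated = random_dictionary(32, 16, rng)  # stands in for an unconverged seed
same_obs, same_null = null_calibrated_score(seed_a, seed_b)
diff_obs, diff_null = null_calibrated_score(seed_a, unrelated)
```

The point of the null is that the raw score is never zero: even unrelated dictionaries produce a sizeable best-match cosine in low dimensions, so a score is only meaningful as a margin above the null for the same dictionary size and dimension.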
LoRAcles: Self-Supervised Weight-Space Interpretability at Scale
Abstract
Fine-tuned LLMs can learn complex and subtle behaviors. However, it can be difficult to ascertain what behaviors the model acquired during fine-tuning. We introduce LoRAcles: fine-tuned language models that take LoRA adapter weights as input, and answer natural language questions about them.
We introduce a self-supervised pipeline for training LoRAcles: we first train LoRAs on small sets of pre-training documents, and then train LoRAcles on these LoRAs to answer questions about the documents. LoRAcles generalize far beyond their training data: they are state-of-the-art on AuditBench, a benchmark of models with sophisticated hidden behaviors; they can detect subtle changes in model organisms such as subliminal learning; and they are the first tool to achieve non-trivial performance at verbalizing semantic backdoor triggers. LoRAcles can also describe some learned behaviors from the weights of a full-parameter fine-tune. LoRAcles are a highly scalable method: we train LoRAcles for Qwen-3-14B and Llama-3.3-70B, and observe that performance scales smoothly with the size of our training dataset up to 100K LoRAs, though our method may need refinement in order to scale further.
We note, though, that LoRAcles are prone to hallucinations and currently only elicit the most salient behavioral changes, limiting their utility beyond narrow fine-tunes.
LLMs can annotate attribution graphs
Abstract
Circuit tracing is an exciting technique for revealing internal computation in language models, but it requires a time-intensive manual step of grouping individual features or MLP neurons into supernodes. We present a simple pipeline for automating this step: directly presenting feature descriptions to a language model that groups them into supernodes. Using automated interpretability metrics, we confirm that supernodes generated by our pipeline are as interpretable as those generated by human annotators. On a two-hop Capitals task, our pipeline recovers a supernode corresponding to the intermediate hop in 97 of 100 prompts. Finally, we present a simple proof of concept using our pipeline for open-ended exploration, where we first automatically annotate 200 attribution graphs from Wikipedia prompts and then use an LLM judge to flag graphs worth human review. We hope this work demonstrates that even simple automation can produce meaningful attribution graph annotations, motivating further work on automated circuit tracing.
Mechanistic Capability Probes as a Cheap Screen for Sequence-Mixer Architectures
Abstract
Comparing sequence-mixer architectures at the scale where their behavior matters ($\geq$1B parameters) costs multiple GPU-days per run, beyond the reach of most academic labs. We propose a battery of mechanistic capability probes (induction, associative recall, copy, finite-state tracking, parity, and others) as a cheap behavioral screen for dense sequence-mixer architectures, and ask whether aggregate suite accuracy predicts downstream language-model training cross-entropy. On a held-out set of four architectures at 150M parameters we find Spearman $\rho = -0.80$ and Pearson $r = -0.97$; the screen is robust to dropping any single task family; and the small-scale ranking direction is preserved at 1B on the two architectures we ran. Per-task profiles motivate **Hydra**, a multi-head block that places attention, STU, and Mamba mixers as parallel heads within each layer; Hydra matches or beats a parameter-matched 1B OLMo-2 attention baseline on training cross-entropy and on a majority of zero-shot benchmarks.
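For readers who want to see how the two correlation statistics behave on a set of only four items, the sketch below computes Pearson and Spearman coefficients from scratch. The accuracy and cross-entropy values are hypothetical and chosen only so that the ranks are untied; they are not the paper's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Ranks 1..n in ascending order; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical suite accuracy and training cross-entropy for four architectures.
probe_accuracy = [0.60, 0.70, 0.80, 0.90]
train_xent = [3.00, 3.05, 2.80, 2.70]

rho = spearman(probe_accuracy, train_xent)
r = pearson(probe_accuracy, train_xent)
```

With four untied items, $\rho = 1 - \sum d^2 / 10$, so only a handful of values are possible. A value of $-0.80$ corresponds to $\sum d^2 = 18$, for example a ranking that is one adjacent swap away from a perfect reversal, which is the pattern in the hypothetical data above.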
LAWFUL: Law-Aligned Witness for Faithful Use of Latents
Abstract
When a neural network predicts a physical system accurately, has it learned the governing law as formal, structured knowledge, and if so, does the network's internal computation actually use that representation throughout the law's domain of validity? We identify four interpretability gaps that limit answering these questions for *physics laws over continuous variables*: the absence of a coverage-aware causal-consistency measure over continuous counterfactuals; of a domain-of-validity test for the identified circuit; of a verification of the law's invariants and forbidden behaviors; and of a quantification of how a derived physical quantity flows through the circuit. We develop a foundational framework, LAWFUL, that closes the first two and lays the groundwork for the remaining two, and illustrate it on the Mocap2Radar transformer, validating whether it learns and internally uses the Doppler frequency law $f(t) = \frac{2 v(t)}{\lambda}$ from motion-capture and radar data in which neither $f(t)$ nor $v(t)$ appears.
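As a quick worked instance of the Doppler law $f(t) = \frac{2 v(t)}{\lambda}$, the sketch below converts a radial velocity into a frequency shift. The 77 GHz carrier and the 1.5 m/s velocity are assumptions chosen for illustration; the abstract does not state the radar parameters.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def doppler_frequency(velocity_mps, wavelength_m):
    """Doppler shift f = 2 v / lambda for radial velocity v."""
    return 2.0 * velocity_mps / wavelength_m

# Assumed carrier frequency and radial velocity (illustrative only).
carrier_hz = 77e9
wavelength = SPEED_OF_LIGHT / carrier_hz   # roughly 3.9 mm
f_d = doppler_frequency(1.5, wavelength)   # roughly 770 Hz
```

The law is linear in $v(t)$ and has a sign symmetry (reversing the velocity reverses the shift), which are the kinds of invariants a law-aligned check could test for.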