{"total":24,"items":[{"citing_arxiv_id":"2605.22719","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification","primary_cat":"cs.LG","submitted_at":"2026-05-21T16:55:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"An empirical audit identifies a strong SAE feature correlate for GPT-2 small failures on 'keys' prompts in the IOI task, performs ablation and baseline controls showing it is not causal, and presents the audit pipeline as the primary contribution.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22488","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Represented Is Not Computed: A Causal Test of Candidate Algorithmic Intermediates in a Transformer","primary_cat":"cs.LG","submitted_at":"2026-05-21T13:43:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Transformer represents but does not causally transmit staged algorithmic intermediates for base-digit extraction, diverging from probe predictions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21303","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Circuit Evidence to Mechanistic Theory: An Inductive Logic Approach","primary_cat":"cs.LG","submitted_at":"2026-05-20T15:33:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces Causal Functional Signatures grounded in causal evidence and ILP-learned architectural signatures to enable explicit, comparable, and portable mechanistic claims across model scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19908","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Where Does Authorship Signal Emerge in Encoder-Based Language Models?","primary_cat":"cs.CL","submitted_at":"2026-05-19T14:37:51+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18646","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Language-Switching Triggers Take a Latent Detour Through Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-18T16:53:54+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13156","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Dual-Pathway Circuits of Object Hallucination in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-13T08:20:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Our work is complementary: we characterize the internal circuits that produce hallucination, providing a mechanistic foundation for future mitigation. Activation PatchingActivation patching [ 18] has become a standard tool for circuit discovery in language models, with applications to indirect object identification [ 32], automated subgraph search [5], and edge-level information flow [7, 10]. Recent work extends these techniques to VLMs: Neo et al. [20] applied logit lens and activation patching to visual information processing; Li et al. [13] proposed cross-modal causal tracing with an inference-time intervention; and Rudman et al. [25] identified attention heads responsible for prompt-induced hallucination in three VLMs. These studies"},{"citing_arxiv_id":"2605.12809","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces","primary_cat":"cs.LG","submitted_at":"2026-05-12T23:01:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12770","ref_index":61,"ref_count":6,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WriteSAE: Sparse Autoencoders for Recurrent State","primary_cat":"cs.LG","submitted_at":"2026-05-12T21:32:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.","context_count":2,"top_context_role":"method","top_context_polarity":"use_method","context_text":"(n=57); Mann-Whitney p=0.239. Pearson r=0.19 between cosine and per-atom win rate (p=0.08); Spearman ρ=0.06 (p=0.60). Cosine to the native write tracks dictionary geometry; it does not predict substitution success at firing-level resolution. Table 13:Atom-beats-ablate at 89.80% across 87 alive atoms (95% CI [88.1,91.3] ).Win rates flat near ∼90% across cosine bins except[0.05,0.20)where10atoms straddle the threshold. Cosine binn atoms nfirings atom<ablate % cos<0.00 26 481 91.5 0.00≤cos<0.05 68 1,069 88.0 0.05≤cos<0.20 10 122 81.1 0.20≤cos<0.30 33 480 92.9 cos≥0.30 18 274 92.7 All atoms (firing-level)155 2,426 89.85 ≥5firings (per-atom mean)87 2,42689.80 20 82% 85% 88% 91% 94% atom-vs-ablate win rate (%) Z = +0.59 Figure 13:L9 H4 lies within the bulk of the per-head distribution.Win rate across all 15 L9 heads with firings (mean 89.29%±2.63%). Red star marks L9 H4 at90.84%. Table 14:Per-head rank-1 vs rank-2 at Qwen3.5-0.8B L9.Rank-2 lowers mean validation MSE by 3.1% and wins on 11/15 heads with both ranks trained, but the all-head substitution gives downstream perplexity 20.360 at rank-2 vs 20.347 at rank-1. HeadMSE r1 MSEr2 MSEr2 /MSEr1 nrecords atom<ablate % H05.05×10 −6 4.92×10 −6 0.974 1,000 91.90 H13.04×10 −6 3.09×10 −6 1.017 515 89.13 H26.76×10 −6 6.53×10 −6 0.966 177 85.31 H33.00×10 −6 2.96×10 −6 0.985 392 88.27 H42.17×10 −5 2.17×10 −5 1.000 1,277 90.84 H51.01×10 −6 9.81×10 −7 0.972 1,000 93.20 H65.14×10 −6 5.10×10 −6 0.994 427 88.52 H71.38×10 −5 1.32×10 −5 0.950 200 92.50 H81.60×10 −5 1.57×10 −5 0.984 601 87.35 H96.84×10 −6 6.51×10 −6 0.952 314 89.81 H101.39×10 −5 1.21×10 −5 0.876 23 82.61 H115.21×10 −7 5.28×10 −7 1.014 253 90.12 H122.55×10 −6 2.99×10 −6 -0- H13 -5.36×10 −6 -1,213 89.37 H141.99×10 −6 1.80×10 −6 0.905 697 90.10 H156.71×10 −7 6.77×10 −7 1.009 800 90.38 Mean (15heads)6.80×10 −6 6.59×10 −6 0.969-89.29±2.63 F.3 Per-head rank-1 vs rank-2 reconstruction at L9 The rank-2 reduction does not propagate to substitution: rank-2 perplexity exc"},{"citing_arxiv_id":"2605.11746","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel","primary_cat":"cs.AI","submitted_at":"2026-05-12T08:24:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"theoretical foundation for mechanistic interpretability.Journal of Machine Learning Research, 26(83):1-64, 2025. [15] Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, and Mor Geva. Patchscopes: A unifying framework for inspecting hidden representations of language models. InProceedings of the 41st International Conference on Machine Learning, 2024. [16] Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. Localizing model behavior with path patching.arXiv preprint arXiv:2304.05969, 2023. [17] Google DeepMind. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025. [18] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al."},{"citing_arxiv_id":"2605.11206","ref_index":112,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Instructions Shape Production of Language, not Processing","primary_cat":"cs.CL","submitted_at":"2026-05-11T20:21:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09239","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs","primary_cat":"cs.CL","submitted_at":"2026-05-10T00:45:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08295","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification","primary_cat":"cs.LG","submitted_at":"2026-05-08T10:20:39+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"In-context learning binds model outputs to the demonstrated label tokens as an exhaustive vocabulary, overriding semantic plausibility and causing fixation even with homogeneous or nonsense labels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06480","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Patch-Effect Graph Kernels for LLM Interpretability","primary_cat":"cs.AI","submitted_at":"2026-05-07T16:03:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Patch-effect graphs built from causal mediation, partial correlation, and co-influence, when analyzed with graph kernels, preserve task-discriminative signals from activation patching that outperform global shape descriptors and raw baselines on GPT-2 Small.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06076","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training","primary_cat":"cs.CL","submitted_at":"2026-05-07T11:59:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05715","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes","primary_cat":"cs.AI","submitted_at":"2026-05-07T05:58:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05076","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"High-Dimensional Statistics: Reflections on Progress and Open Problems","primary_cat":"math.ST","submitted_at":"2026-05-06T16:11:09+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03052","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"How Language Models Process Negation","primary_cat":"cs.CL","submitted_at":"2026-05-04T18:17:05+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17761","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks","primary_cat":"cs.AI","submitted_at":"2026-04-20T03:24:11+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"attribution target and backpropagate it to layerl. Given the large cardinality of{R(l) i }, this results in an excessive number of backward passes. To address this challenge, we leverage a batching trick that reuses the batch dimension to pack multiple attribution targets into a single backward pass, following recent work on attribution graph construction with sparse features in transcoders [27]. This approach exploits GPU vectorization to efficiently recover relevance propagation between layers. Details are provided in Appendix A, with empirical efficiency gains reported in Appendix B. To further improve graph interpretability, we prune attribution targets with relevance below a fixed threshold and remove edges with negligible relevance during graph construction."},{"citing_arxiv_id":"2604.13694","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Weight Patching: Toward Source-Level Mechanistic Localization in LLMs","primary_cat":"cs.AI","submitted_at":"2026-04-15T10:21:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a hierarchy of source, routing, and execution components.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"attention heads, MLPs, neurons, residual-stream updates, and information paths as meaningful computational units, supported by causal abstraction and causal scrubbing as mechanism-level evaluation frameworks [2], [45]-[52]. A central interventionist tool in this literature is activation patching [3], later extended to path patching, attribution patching, and circuit-discovery methods [4]-[6], [23]; re- lated causal tracing work has further shown that internal interventions can identify behavior-relevant modules and even support subsequent parameter editing [10]. However, patching results are known to depend on corruption design, readout choice, and interpretation protocol [8], [9], and activation-space importance does not by itself establish that"},{"citing_arxiv_id":"2604.11962","ref_index":4,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts","primary_cat":"cs.LG","submitted_at":"2026-04-13T18:54:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03764","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Automated Attention Pattern Discovery at Scale in Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-04-04T15:32:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.14004","ref_index":88,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-01-20T14:23:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.15255","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How to use and interpret activation patching","primary_cat":"cs.LG","submitted_at":"2024-04-23T17:42:29+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.16042","ref_index":84,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Best Practices of Activation Patching in Language Models: Metrics and Methods","primary_cat":"cs.LG","submitted_at":"2023-09-27T21:53:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Varying evaluation metrics and corruption methods in activation patching produces different localization and circuit discovery outcomes in language models, leading to recommendations for preferred practices.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}