{"total":13,"items":[{"citing_arxiv_id":"2606.06267","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery","primary_cat":"cs.CL","submitted_at":"2026-06-04T15:10:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Structurally distinct circuits for literal sequence copying across token frequency bands implement the same computation, shown by broad transfer of band-specific edges, a shared core recovering 99% performance, and interchangeable representations via causal interventions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00831","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Subliminal Learning is a LoRA Artifact","primary_cat":"cs.AI","submitted_at":"2026-05-30T18:05:35+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Subliminal learning is a LoRA artifact that disappears with full finetuning, depends on context tokens like system prompts, and localizes to overlapping finetuning-evaluation tokens.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29358","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet","primary_cat":"cs.AI","submitted_at":"2026-05-28T04:57:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and 
modalities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.25225","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability","primary_cat":"cs.LG","submitted_at":"2026-05-24T19:26:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Transformer Field Theory frames the residual stream as a field, models patching as source insertion, and uses first-order sensitivities plus Green functions to predict and describe responses, with empirical tests on GPT-2 autoregressive models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24577","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Polymorphism Is Rotation: Operational Mechanistic Interpretability from a Two-Layer Transformer to Pythia-70m","primary_cat":"cs.LG","submitted_at":"2026-05-23T13:37:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Transformers trained from different random seeds exhibit residual-stream polymorphism that is exactly a uniform random rotation, which a Procrustes alignment removes to transfer SAEs and steering vectors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12770","ref_index":12,"ref_count":4,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WriteSAE: Sparse Autoencoders for Recurrent State","primary_cat":"cs.LG","submitted_at":"2026-05-12T21:32:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, 
allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.","context_count":2,"top_context_role":"method","top_context_polarity":"use_method","context_text":"(n=57); Mann-Whitney p=0.239. Pearson r=0.19 between cosine and per-atom win rate (p=0.08); Spearman ρ=0.06 (p=0.60). Cosine to the native write tracks dictionary geometry; it does not predict substitution success at firing-level resolution. Table 13:Atom-beats-ablate at 89.80% across 87 alive atoms (95% CI [88.1,91.3] ).Win rates flat near ∼90% across cosine bins except[0.05,0.20)where10atoms straddle the threshold. Cosine binn atoms nfirings atom<ablate % cos<0.00 26 481 91.5 0.00≤cos<0.05 68 1,069 88.0 0.05≤cos<0.20 10 122 81.1 0.20≤cos<0.30 33 480 92.9 cos≥0.30 18 274 92.7 All atoms (firing-level)155 2,426 89.85 ≥5firings (per-atom mean)87 2,42689.80 20 82% 85% 88% 91% 94% atom-vs-ablate win rate (%) Z = +0.59 Figure 13:L9 H4 lies within the bulk of the per-head distribution.Win rate across all 15 L9 heads with firings (mean 89.29%±2.63%). Red star marks L9 H4 at90.84%. Table 14:Per-head rank-1 vs rank-2 at Qwen3.5-0.8B L9.Rank-2 lowers mean validation MSE by 3.1% and wins on 11/15 heads with both ranks trained, but the all-head substitution gives downstream perplexity 20.360 at rank-2 vs 20.347 at rank-1. 
HeadMSE r1 MSEr2 MSEr2 /MSEr1 nrecords atom<ablate % H05.05×10 −6 4.92×10 −6 0.974 1,000 91.90 H13.04×10 −6 3.09×10 −6 1.017 515 89.13 H26.76×10 −6 6.53×10 −6 0.966 177 85.31 H33.00×10 −6 2.96×10 −6 0.985 392 88.27 H42.17×10 −5 2.17×10 −5 1.000 1,277 90.84 H51.01×10 −6 9.81×10 −7 0.972 1,000 93.20 H65.14×10 −6 5.10×10 −6 0.994 427 88.52 H71.38×10 −5 1.32×10 −5 0.950 200 92.50 H81.60×10 −5 1.57×10 −5 0.984 601 87.35 H96.84×10 −6 6.51×10 −6 0.952 314 89.81 H101.39×10 −5 1.21×10 −5 0.876 23 82.61 H115.21×10 −7 5.28×10 −7 1.014 253 90.12 H122.55×10 −6 2.99×10 −6 -0- H13 -5.36×10 −6 -1,213 89.37 H141.99×10 −6 1.80×10 −6 0.905 697 90.10 H156.71×10 −7 6.77×10 −7 1.009 800 90.38 Mean (15heads)6.80×10 −6 6.59×10 −6 0.969-89.29±2.63 F.3 Per-head rank-1 vs rank-2 reconstruction at L9 The rank-2 reduction does not propagate to substitution: rank-2 perplexity exc"},{"citing_arxiv_id":"2605.12207","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Not How Many, But Which: Parameter Placement in Low-Rank Adaptation","primary_cat":"cs.LG","submitted_at":"2026-05-12T14:46:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"parameterization, we use standard BA with a standard optimizer and simply mask which elements of B receive gradients. Second, our selection is informed: we use gradient statistics from the training distribution to choose elements, rather than relying on a fixed or random basis. 
Importance scoring and parameter selection. Using gradient or curvature information to identify important parameters has a long history, from Optimal Brain Damage [47] and Optimal Brain Surgeon [31] to recent methods like Wanda [74] and SparseGPT [21]. Movement pruning [68] selects parameters during fine-tuning based on gradient-weight products. Other methods (SNIP, GRASP, SynFlow) [48, 81, 75] addresses issues like layer collapse. The lottery ticket hypothesis [20] demonstrates that sparse trainable subnetworks exist within dense networks."},{"citing_arxiv_id":"2605.09314","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How LLMs Are Persuaded: A Few Attention Heads, Rerouted","primary_cat":"cs.AI","submitted_at":"2026-05-10T04:15:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08740","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure","primary_cat":"cs.LG","submitted_at":"2026-05-09T07:05:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Causal dimensionality kappa of transformer layers grows sub-linearly with SAE width, remains invariant to model scale, and stays constant across depth while attribution thresholds drop 
sharply.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19052","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cell-Based Representation of Relational Binding in Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-21T03:58:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13694","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Weight Patching: Toward Source-Level Mechanistic Localization in LLMs","primary_cat":"cs.AI","submitted_at":"2026-04-15T10:21:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a hierarchy of source, routing, and execution components.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"mation not as a replacement for exact intervention, but as a scalable surrogate for full-model screening and neuron-scale localization. This design is conceptually analogous to attribution patching in activation space, which approximates exact patching by locally linearizing the restoration utility around a reference point and using a gradient-difference inner product as a fast screening score [23], [24]. 
Here, we apply the same approximation logic in parameter space, while retaining exact Weight Patching as the primary exact interventional test. For a componentc, exact Weight Patching replaces the base-model sliceΘ (c) base byΘ (c) sft = Θ (c) base + ∆θ(c), where ∆θ=θ sft−θbase. Under a first-order Taylor expansion of the anchor utility aroundM base, the resulting change in anchor"},{"citing_arxiv_id":"2604.08016","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs","primary_cat":"cs.AI","submitted_at":"2026-04-09T09:16:00+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.19647","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models","primary_cat":"cs.LG","submitted_at":"2024-03-28T17:56:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization by ablating irrelevant features.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}