pith. machine review for the scientific record.

arxiv: 2605.03058 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.AI

Recognition: 3 Lean theorem links

Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:50 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI

keywords: rule extraction · mechanistic interpretability · agonist neurons · LLM circuits · ablation · explainable AI · hierarchical localization

The pith

MechaRule localizes sparse agonist neurons to ground symbolic rules in LLM internal mechanisms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MechaRule, a pipeline that extracts rules from large language models by anchoring them to specific neurons. It identifies agonist neurons whose deactivation disrupts particular behaviors, such as arithmetic reasoning or compliance with jailbreak prompts. The method exploits the observation that these neuron effects tend to be monotone and saturating, allowing efficient search through hierarchical ablation rather than exhaustive testing. It also shows that data splits aligned with the model's actual rule-following behavior improve the accuracy of neuron identification compared to unfaithful splits. Experiments indicate that the localized neurons account for most of the high-impact neurons found by brute force, and that suppressing them substantially impairs the targeted behaviors.

Core claim

MechaRule grounds rule extraction in LLM circuits by efficiently localizing sparse neurons, called agonists, whose activation neutralization disrupts rule-related behaviors. The claim rests on agonist effects being approximately monotone and saturating within a fixed baseline/flip regime, which licenses adaptive group testing, and on aligned data splits providing more reliable verification than unfaithful ones.

What carries the argument

Agonist neurons: sparse sets whose neutralization disrupts rule-related behaviors, localized via contrastive hierarchical ablation framed as adaptive group testing with a regime-conditional strength predicate and confidence-guided pruning, yielding Theta(k log(N/k) + k) interventions over N candidates.
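To make the search pattern concrete, the following is a minimal sketch of hierarchical ablation framed as adaptive group testing, assuming the paper's monotone-overtopping abstraction holds. The `group_flips` oracle, the bisection order, and all names are illustrative stand-ins for one batched ablation-plus-verification run, not the authors' implementation.

```python
# Minimal sketch: contrastive hierarchical ablation as adaptive group testing.
# Assumes a monotone, saturating group effect: if any agonist is in `group`,
# ablating the whole group flips behavior on the contrastive examples.

def localize_agonists(candidates, group_flips):
    """Recursively bisect candidate neurons, pruning any group whose joint
    ablation produces no behavioral flip. `group_flips(group) -> bool`
    stands in for one batched ablation + verification run."""
    found = []
    stack = [list(candidates)]
    while stack:
        group = stack.pop()
        if not group_flips(group):   # conservative pruning: no agonist inside,
            continue                 # so the whole group is discarded in one call
        if len(group) == 1:
            found.append(group[0])   # isolated a single agonist neuron
            continue
        mid = len(group) // 2
        stack.append(group[:mid])    # recurse on both halves
        stack.append(group[mid:])
    return found

# Toy oracle: pretend neurons 3 and 17 are the agonists.
agonists = {3, 17}
calls = 0
def toy_oracle(group):
    global calls
    calls += 1
    return any(n in agonists for n in group)

print(localize_agonists(range(32), toy_oracle), "oracle calls:", calls)
```

Under the monotone assumption, any group containing no agonist tests negative and is pruned in a single call, which is where the k log(N/k) savings over N single-neuron ablations come from.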

Load-bearing premise

Sparse agonist effects are approximately monotone and saturating within a fixed baseline/flip regime, and agonists emerge more reliably when ablations are verified through data splits aligned with close-to-faithful rule behavior.
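One way to picture why group effects would be monotone and saturating is the regime-conditional union-of-flips model the review quotes later in the Lean-links section: each neuron flips a fixed set of examples and a group flips their union. A toy sketch with invented flip sets:

```python
# Union-of-flips toy model: each neuron flips a fixed set of examples, and a
# group's flip set is the union. Union coverage is monotone (adding neurons
# never shrinks it) and saturating (overlapping neurons add diminishing new
# flips), matching the overtopping picture. The flip sets below are made up.

flip_sets = {
    "n1": {0, 1, 2, 3, 4, 5},   # dominant neuron: overtops the others
    "n2": {2, 3, 4},            # fully overlapped by n1
    "n3": {5, 6},               # small marginal contribution
}

def group_flip_count(group):
    covered = set()
    for n in group:
        covered |= flip_sets[n]
    return len(covered)

print(group_flip_count(["n1"]))              # 6
print(group_flip_count(["n1", "n2"]))        # 6 -> overtopped, no gain
print(group_flip_count(["n1", "n2", "n3"]))  # 7 -> saturating marginal gain
```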

What would settle it

A brute-force search that identifies many high-effect agonists not recalled by MechaRule, or an experiment where suppressing the localized agonists fails to reduce arithmetic accuracy or jailbreak success, would falsify the central claims.
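The suppression half of that test can be run with hooks in TransformerLens [38]. A minimal sketch, assuming the localized agonists arrive as (layer, neuron) pairs in the MLP and that zero-ablation is the intended neutralization; the model name, pairs, and prompt are placeholders, and the paper's exact neutralization scheme may differ:

```python
# Sketch: zero-ablate candidate agonist neurons with TransformerLens [38]
# and compare behavior before/after. The (layer, neuron) pairs are
# hypothetical; the paper's actual neutralization and evaluation may differ.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # stand-in for Qwen2/GPT-J
agonists = [(5, 123), (7, 881)]                    # hypothetical (layer, neuron)

def make_ablation_hook(neuron_idx):
    def hook(value, hook):                 # value: [batch, pos, d_mlp]
        value[:, :, neuron_idx] = 0.0      # neutralize the MLP neuron
        return value
    return hook

fwd_hooks = [
    (utils.get_act_name("post", layer), make_ablation_hook(neuron))
    for layer, neuron in agonists
]

prompt = "12 + 7 ="
baseline = model.generate(prompt, max_new_tokens=4)
with model.hooks(fwd_hooks=fwd_hooks):
    ablated = model.generate(prompt, max_new_tokens=4)
print(baseline, "|", ablated)  # a real accuracy drop would support the claim
```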

Figures

Figures reproduced from arXiv: 2605.03058 by Francesco Sovrano, Gabriele Dominici, Marc Langheinrich.

Figure 1: Problem overview. We seek a singleton neuron … (view at source ↗)
Figure 2: Pipeline overview: RuleSHAP extracts behavioral splitter rules from … (view at source ↗)
Figure 3: Arithmetic/Qwen2: layerwise flip coverage by high … (view at source ↗)
Figure 4: Conceptual picture: dominance (overtopping) and overlap explain why group effects are neither purely additive nor … (view at source ↗)
Figure 5: Runtime or two-pass intervention via high-MCC neuron-anchored rules. A neuron-anchored rule … (view at source ↗)
Figure 6: Sensitivity of localized-neuron counts to the CHA effect threshold … (view at source ↗)
Figure 7: Representative ECDF overlays for arithmetic under rule split + spectral coverage. (view at source ↗)
Figure 8: Agonist signatures for arithmetic on Qwen2-7B-Instruct under rule split + spectral coverage. (view at source ↗)
Figure 9: Agonist signatures for arithmetic on GPT-J-6B under rule split + spectral coverage. (view at source ↗)
Figure 10: Agonist signatures for BoN jailbreaking on Qwen2-7B-Instruct under rule split + spectral coverage. (view at source ↗)
original abstract

A key goal of explainable AI (XAI) is to express the decision logic of large language models (LLMs) in symbolic form and link it to internal mechanisms. Global rule-extraction methods typically learn symbolic surrogates without grounding rules in model circuitry, while mechanistic interpretability can connect behaviors to neuron sets but often depends on hand-crafted hypotheses and expensive neuron-level interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by efficiently localizing sparse neurons called agonists, whose activation neutralization disrupts rule-related behaviors. MechaRule rests on two empirical observations. First, within a fixed baseline/flip regime, sparse agonist effects can be approximately monotone and saturating: a few dominant neuron activations can overtop weaker ones at coarse scales, while overlapping neurons flip many of the same examples. This motivates viewing localization as adaptive group testing driven by a regime-conditional strength predicate with confidence-guided conservative pruning, yielding Theta(k log(N/k) + k) interventions over N candidates when k << N neurons are agonists under the monotone-overtopping abstraction. Second, agonists emerge more reliably when ablations are verified through data splits aligned with close-to-faithful rule behavior; spectral splits remain a useful rule-free fallback, while unfaithful splits degrade localization. Empirically, overtopping appears mainly in learned, task-aligned regimes: on arithmetic and jailbreak tasks across Qwen2 and GPT-J, MechaRule recalls 96.8% of high-effect brute-force agonists in completed comparisons, and suppressing localized agonists reduces arithmetic accuracy and jailbreak success by up to 71.1% and 8.8%, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MechaRule, a pipeline for neuron-anchored rule extraction in LLMs using contrastive hierarchical ablation to localize sparse agonist neurons. It rests on two empirical observations: within a fixed baseline/flip regime, sparse agonist effects are approximately monotone and saturating (motivating adaptive group testing with Theta(k log(N/k) + k) interventions), and agonists emerge more reliably with data splits aligned to close-to-faithful rule behavior. Experiments on arithmetic and jailbreak tasks across Qwen2 and GPT-J report 96.8% recall of high-effect brute-force agonists and performance reductions of up to 71.1% (arithmetic accuracy) and 8.8% (jailbreak success) after suppression.

Significance. If the results and underlying observations hold, the work could meaningfully advance XAI by efficiently grounding symbolic rules in LLM circuitry, reducing reliance on hand-crafted hypotheses while achieving practical localization efficiency. The reported recall and causal impact metrics indicate potential utility for task-aligned behaviors, though the empirical foundation limits immediate theoretical impact.

major comments (2)
  1. [Abstract] Abstract (first observation): the monotone-overtopping abstraction (dominant neurons overtop weaker ones with overlapping flips) is load-bearing for both the Theta(k log(N/k) + k) complexity bound and the claim of reliable localization, yet the manuscript provides no marginal-effect curves, interaction matrices, or direct comparison to additive/synergistic alternatives to confirm saturation and dominance within the baseline/flip regime.
  2. [Abstract] Abstract (empirical claims): the 96.8% recall of brute-force agonists and the 71.1%/8.8% performance reductions are presented without baselines, statistical tests, error bars, or full experimental controls, making it impossible to assess whether the results support the central claim that MechaRule reliably grounds rules in circuitry.
minor comments (2)
  1. [Abstract] The second observation on data splits (spectral vs. faithful) is stated clearly but would benefit from explicit pseudocode or a small illustrative example showing how splits are constructed and verified; one hedged sketch of a possible construction follows this list.
  2. Notation for 'agonists' and the 'regime-conditional strength predicate' is introduced without a dedicated definitions subsection, which could confuse readers unfamiliar with the contrastive ablation setup.
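For illustration only, here is one plausible reading of a rule-free spectral split, clustering examples by their hidden-activation geometry via PCA [32]; every name and the exact construction are assumptions of this sketch, not the paper's procedure:

```python
# Hedged sketch of a rule-free "spectral" data split: project hidden
# activations onto the top principal axis (PCA [32]) and split examples by
# the sign of the projection. One plausible construction, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 64))   # stand-in for per-example hidden activations

acts = acts - acts.mean(axis=0)                      # center
_, _, vt = np.linalg.svd(acts, full_matrices=False)  # principal axes
scores = acts @ vt[0]                                # top-component projection

split_a = np.where(scores >= 0)[0]  # one side of the spectral split
split_b = np.where(scores < 0)[0]   # the other side
print(len(split_a), len(split_b))
```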

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and commit to revisions that directly strengthen the substantiation of the monotone-overtopping abstraction and the presentation of empirical results.

point-by-point responses
  1. Referee: [Abstract] Abstract (first observation): the monotone-overtopping abstraction (dominant neurons overtop weaker ones with overlapping flips) is load-bearing for both the Theta(k log(N/k) + k) complexity bound and the claim of reliable localization, yet the manuscript provides no marginal-effect curves, interaction matrices, or direct comparison to additive/synergistic alternatives to confirm saturation and dominance within the baseline/flip regime.

    Authors: We agree that the monotone-overtopping abstraction is central to both the complexity analysis and the localization claims. The manuscript grounds this in empirical observations from arithmetic and jailbreak tasks, but we acknowledge the absence of explicit marginal-effect curves, interaction matrices, and model comparisons. In the revised manuscript we will add these visualizations and analyses, including direct contrasts against additive and synergistic alternatives, to confirm saturation and dominance in the baseline/flip regime. revision: yes

  2. Referee: [Abstract] Abstract (empirical claims): the 96.8% recall of brute-force agonists and the 71.1%/8.8% performance reductions are presented without baselines, statistical tests, error bars, or full experimental controls, making it impossible to assess whether the results support the central claim that MechaRule reliably grounds rules in circuitry.

    Authors: The 96.8% recall and performance reductions are derived from completed brute-force comparisons reported in the experiments section. We agree that the abstract and results would be strengthened by explicit baselines, statistical tests, error bars, and additional controls. We will revise the abstract and results to incorporate these elements, including error bars on all reported metrics and expanded control experiments, to allow rigorous evaluation of the grounding claims. revision: yes
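For a sense of what such error bars could look like: the bibliography already includes the Clopper-Pearson [9] and Wilson [54] binomial intervals, which suit flip-rate and recall proportions. A minimal sketch of the Wilson interval (textbook formula; the counts are hypothetical):

```python
# Wilson score interval [54] for a binomial proportion, e.g. a flip rate or
# the 96.8% recall figure. Standard formula; not the authors' code.
from math import sqrt

def wilson_interval(successes, trials, z=1.96):  # z = 1.96 ~ 95% confidence
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

# e.g. recalling 61 of 63 high-effect agonists (~96.8%; counts hypothetical)
print(wilson_interval(61, 63))
```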

Circularity Check

0 steps flagged

No significant circularity; derivation rests on independent empirical observations and interventions

full rationale

The paper presents two empirical observations about agonist effects (monotone/saturating overtopping within fixed regimes, and better localization via aligned data splits) as starting points, then derives an adaptive group-testing procedure with Theta(k log(N/k) + k) complexity under that abstraction. It validates the approach via separate brute-force comparisons (96.8% recall) and causal suppression experiments (performance drops) on held-out tasks and models. No equations or claims reduce by construction to fitted parameters, self-definitions, or self-citations; the observations are stated as external empirical findings rather than outputs of the method itself. The efficiency bound follows directly from standard group-testing analysis once the abstraction is granted, and the recall metric compares against an independent brute-force baseline rather than a fitted surrogate.
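For reference, the counting argument behind that standard analysis (cf. [2, 14]), stated here as a sketch rather than reproduced from the paper: each intervention returns one bit, so any adaptive scheme needs enough tests to distinguish all possible agonist sets.

```latex
% Standard adaptive group-testing counting bound (cf. [2, 14]); a sketch,
% not the paper's derivation. T binary-outcome tests distinguish at most
% 2^T hypotheses, which must cover the \binom{N}{k} candidate agonist sets:
\[
  2^{T} \ge \binom{N}{k}
  \quad\Longrightarrow\quad
  T \ge \log_2 \binom{N}{k} \ge k \log_2\!\frac{N}{k},
\]
% and bisection-style adaptive schemes attain O(k log(N/k) + k) tests,
% matching the Theta(k log(N/k) + k) rate quoted above.
```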

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Central claims rest on two stated empirical observations about neuron effects and data-split reliability rather than on derived principles; the ledger lists no free parameters, and the one invented entity ("agonists") has no independent evidence.

axioms (2)
  • domain assumption Sparse agonist effects can be approximately monotone and saturating within a fixed baseline/flip regime
    First empirical observation motivating adaptive group testing and pruning.
  • domain assumption Agonists emerge more reliably when ablations are verified through data splits aligned with close-to-faithful rule behavior
    Second empirical observation; spectral splits noted as fallback.
invented entities (1)
  • agonists (no independent evidence)
    purpose: Sparse neurons whose activation neutralization disrupts rule-related behaviors
    Core concept introduced to ground rule extraction in circuitry

pith-pipeline@v0.9.0 · 5605 in / 1567 out tokens · 94422 ms · 2026-05-08T18:50:46.110002+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative (J-cost / cosh structure absent here) · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "we model slice-wise flip rates as potentially monotone and saturating set functions; a concrete abstraction is a regime-conditional union-of-flips model ... compatible with coverage-type submodularity"

  • reality_from_one_distinction (domain orthogonal to RS forcing chain) · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "We evaluate MechaRule on open-weight causal LLMs ... Qwen2-7B-Instruct, Qwen2-1.5B-Instruct, and gpt-j-6B ... main tasks are arithmetic, ... and Best-of-N jailbreaking"

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 28 canonical work pages · 1 internal anchor

  [1] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. 1993. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. ACM Press, 207–216. doi:10.1145/170035.170072
  [2] Matthew Aldridge, Oliver Johnson, and Jonathan Scarlett. 2019. Group testing: an information theory perspective. Foundations and Trends in Communications and Information Theory 15, 3–4 (2019), 196–392. doi:10.1561/0100000099
  [3] Mamdouh Alenezi and Mohammed Akour. 2025. AI-driven innovations in software engineering: a review of current practices and future directions. Applied Sciences 15, 3 (2025), 1344.
  [4] Elaine Angelino, Nicholas Larus-Stone, Daniel Alabi, Margo I. Seltzer, and Cynthia Rudin. 2017. Learning certifiably optimal rule lists. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 35–44. doi:10.1145/3097983.3098047
  [5] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. arXiv:1607.06450 [stat.ML]. https://arxiv.org/abs/1607.06450
  [6] Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. 2023. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
  [7] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–794.
  [8] Davide Chicco and Giuseppe Jurman. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21, 1 (2020), 6.
  [9] Charles J. Clopper and Egon S. Pearson. 1934. The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial. Biometrika 26, 4 (1934), 404–413. doi:10.1093/biomet/26.4.404
  [10] William W. Cohen. 1995. Fast Effective Rule Induction. In Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann, 115–123.
  [11] Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. 2023. Towards Automated Circuit Discovery for Mechanistic Interpretability. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023).
  [12, 13] Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge Neurons in Pretrained Transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 8493–8502. doi:10.18653/v1/2022.acl-long.581
  [14] Dingzhu Du and Frank Hwang. 2000. Combinatorial Group Testing and Its Applications. World Scientific.
  [15] Jerome H. Friedman and Bogdan E. Popescu. 2008. Predictive Learning via Rule Ensembles. The Annals of Applied Statistics (2008), 916–954.
  [16] Satoru Fujishige. 2005. Submodular Functions and Optimization (2nd ed.). Annals of Discrete Mathematics, Vol. 58. Elsevier. doi:10.1016/S0167-5060(05)80001-9
  [17] Teofilo F. Gonzalez. 1985. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38 (1985), 293–306.
  [18] Jian Gu, Aldeida Aleti, Chunyang Chen, and Hongyu Zhang. 2023. Neuron Patching: Semantic-based Neuron-level Language Model Repair for Code Generation. arXiv:2312.05356 [cs.CL]
  [19] Michael Hanna, Sandro Pezzelle, and Yonatan Belinkov. 2024. Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms. In ICML 2024 Workshop on Mechanistic Interpretability.
  [20] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs). arXiv:1606.08415 [cs.LG]. https://arxiv.org/abs/1606.08415
  [21] Julia Herbinger, Susanne Dandl, Fiona K. Ewald, Sofia Loibl, and Giuseppe Casalicchio. 2023. Leveraging model-based trees as interpretable surrogate models for model distillation. In European Conference on Artificial Intelligence. Springer, 232–249.
  [22] Zeyu Huang, Yikang Shen, Xiaofeng Zhang, Jie Zhou, Wenge Rong, and Zhang Xiong. 2023. Transformer-Patcher: One Mistake Worth One Neuron. In ICLR.
  [23] John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma. 2024. Best-of-N jailbreaking. arXiv preprint arXiv:2412.03556 (2024).
  [24] Frank K. Hwang. 1972. A Method for Detecting All Defective Members in a Population by Group Testing. J. Amer. Statist. Assoc. 67, 339 (1972), 605–608. doi:10.1080/01621459.1972.10481257
  [25] Houcheng Jiang, Junfeng Fang, Tianyu Zhang, Baolong Bi, An Zhang, Ruipeng Wang, Tao Liang, and Xiang Wang. 2025. Neuron-Level Sequential Editing for Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 16678–16702.
  [26, 27] Benjamin Letham, Cynthia Rudin, Tyler H. McCormick, and David Madigan. 2015. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics 9, 3 (2015), 1350–1371. doi:10.1214/15-AOAS848
  [28] Tianhe Lin, Jian Xie, Siyu Yuan, and Deqing Yang. 2025. Implicit reasoning in transformers is reasoning through shortcuts. arXiv preprint arXiv:2503.07604 (2025).
  [29] Bing Liu, Wynne Hsu, and Yiming Ma. 1998. Integrating Classification and Association Rule Mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. AAAI Press, 80–86.
  [30] Jinzhe Liu, Junshu Sun, Shufan Shen, Chenxue Yang, and Shuhui Wang. 2025. Edit Less, Achieve More: Dynamic Sparse Neuron Masking for Lifelong Knowledge Editing in LLMs. arXiv:2510.22139 [cs.CL]
  [31] Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems 30 (2017).
  [32] Andrzej Maćkiewicz and Waldemar Ratajczak. 1993. Principal components analysis (PCA). Computers & Geosciences 19, 3 (1993), 303–342.
  [33] R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 3428–3448.
  [34] Sara El Mekkaoui, Loubna Benabbou, and Abdelaziz Berrado. 2023. Rule-Extraction Methods from Feedforward Neural Networks: A Systematic Literature Review. arXiv preprint arXiv:2312.12878 (2023).
  [35] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and Editing Factual Associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 35. 17359–17372.
  [36] Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2023. Mass-Editing Memory in a Transformer. In Proc. of ICLR.
  [37] Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. 2022. Fast Model Editing at Scale. In Proc. of ICLR.
  [38] Neel Nanda and Joseph Bloom. 2022. TransformerLens. https://github.com/TransformerLensOrg/TransformerLens
  [39] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. 1978. An Analysis of Approximations for Maximizing Submodular Set Functions—I. Mathematical Programming 14, 1 (1978), 265–294. doi:10.1007/BF01588971
  [40] Yaniv Nikankin, Anja Reusch, Aaron Mueller, and Yonatan Belinkov. 2024. Arithmetic without algorithms: Language models solve math with a bag of heuristics. arXiv preprint arXiv:2410.21272 (2024).
  [41] nostalgebraist. 2020. Interpreting GPT: the Logit Lens. LessWrong. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
  [42] Haowen Pan, Yixin Cao, Xiaozhi Wang, Xun Yang, and Meng Wang. 2024. Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers. In Findings of the Association for Computational Linguistics: ACL 2024. 1012–1037.
  [43] Haowen Pan, Xiaozhi Wang, Yixin Cao, Zenglin Shi, Xun Yang, Juanzi Li, and Meng Wang. 2025. Precise Localization of Memories: A Fine-Grained Neuron-Level Knowledge Editing Technique for LLMs. In International Conference on Learning Representations (ICLR).
  [44] Rafael Poyiadzi, Xavier Renard, Thibault Laugel, Raul Santos-Rodriguez, and Marcin Detyniecki. 2021. Understanding surrogate explanations: the interplay between complexity, fidelity and coverage. arXiv preprint arXiv:2107.04309 (2021).
  [45] Mengchao Ren. 2024. Advancements and applications of large language models in natural language processing: A comprehensive review. Applied and Computational Engineering 97 (2024), 55–63.
  [46] Rudy Setiono and Huan Liu. 1995. Understanding neural networks via rule extraction. In IJCAI, Vol. 1. 480–485.
  [47] Francesco Sovrano. 2025. Can Global XAI Methods Reveal Injected Bias in LLMs? SHAP vs Rule Extraction vs RuleSHAP. arXiv preprint arXiv:2505.11189 (2025).
  [48] Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, et al. 2025. A survey of reasoning with foundation models: Concepts, methodologies, and outlook. Comput. Surveys 57, 11 (2025), 1–43.
  [49] Aaquib Syed, Can Rager, and Arthur Conmy. 2024. Attribution Patching Outperforms Automated Circuit Discovery. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics.
  [50] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS).
  [51] Huanqian Wang, Yang Yue, Rui Lu, Jingxin Shi, Andrew Zhao, Shenzhi Wang, Shiji Song, and Gao Huang. 2025. Model Surgery: Modulating LLM's Behavior via Simple Parameter Editing. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 6337–6357.
  [52] Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Ziniu Li, Yidi Wang, Shu Yan, Chengxing Jia, Xu-Hui Liu, Xinwei Chen, Jiacheng Xu, et al. 2025. A survey on large language models for mathematical reasoning. Comput. Surveys (2025).
  [53] Weixuan Wang, Jingyuan Yang, and Wei Peng. 2025. Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors. In Proc. of ICLR.
  [54] Edwin B. Wilson. 1927. Probable Inference, the Law of Succession, and Statistical Inference. J. Amer. Statist. Assoc. 22, 158 (1927), 209–212. doi:10.2307/2276774
  [55] An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, et al. 2025. Qwen2.5-1M technical report. arXiv preprint arXiv:2501.15383 (2025).
  [56] Zeping Yu and Sophia Ananiadou. 2025. Understanding and Mitigating Gender Bias in LLMs via Interpretable Neuron Editing. arXiv:2501.14457 [cs.CL]
  [57] Mateo Espinosa Zarlenga, Zohreh Shams, and Mateja Jamnik. 2021. Efficient decompositional rule extraction for deep neural networks. arXiv preprint arXiv:2111.12628 (2021).
  [58] Jusheng Zhang, Ningyuan Liu, Yijia Fan, Zihao Huang, Qinglin Zeng, Kaitong Cai, Jian Wang, and Keze Wang. 2025. LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction. arXiv:2512.18623 [cs.CL]
  [59] Tianyu Zhang, Junfeng Fang, Houcheng Jiang, Baolong Bi, Xiang Wang, and Xiangnan He. 2025. Explainable and Efficient Editing for Large Language Models. In Proceedings of the ACM on Web Conference 2025 (WWW '25). 1963–1976. doi:10.1145/3696410.3714835
  [60] Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, and Mengnan Du. 2024. Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology 15, 2 (2024), 1–38.
  [61] Wei Zhou, Wei Wei, Guibang Cao, and Fei Wang. 2025. Editing Memories Through Few Targeted Neurons. Proceedings of the AAAI Conference on Artificial Intelligence 39, 19 (2025), 20360–20368. doi:10.1609/aaai.v39i19.30250
  [62] Jan Ruben Zilke, Eneldo Loza Mencía, and Frederik Janssen. 2016. DeepRED: Rule extraction from deep neural networks. In International Conference on Discovery Science. Springer, 457–473.