pith. machine review for the scientific record.

arxiv: 2605.03058 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.AI

Recognition: 3 Lean theorem links

Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:50 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI

keywords: rule extraction · mechanistic interpretability · agonist neurons · LLM circuits · ablation · explainable AI · hierarchical localization

The pith

MechaRule localizes sparse agonist neurons to ground symbolic rules in LLM internal mechanisms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MechaRule, a pipeline that extracts rules from large language models by anchoring them to specific neurons. It identifies agonist neurons whose deactivation disrupts particular behaviors, such as arithmetic reasoning or compliance with jailbreak prompts. The method exploits the observation that these neuron effects tend to be monotone and saturating, allowing efficient search through hierarchical ablation rather than exhaustive testing. It also shows that data splits aligned with the model's actual rule-following behavior improve the accuracy of neuron identification compared to unfaithful splits. Experiments indicate that the localized neurons account for most of the high-impact neurons found by brute force, and that suppressing them substantially impairs the targeted behaviors.

Core claim

MechaRule grounds rule extraction in LLM circuits by efficiently localizing sparse neurons, called agonists, whose activation neutralization disrupts rule-related behaviors. The claim rests on agonist effects being approximately monotone and saturating within a fixed baseline/flip regime, which licenses adaptive group testing, and on aligned data splits providing more reliable verification than unfaithful ones.

What carries the argument

Agonist neurons: sparse sets whose neutralization disrupts rule-related behaviors, localized via contrastive hierarchical ablation framed as adaptive group testing with a regime-conditional strength predicate and confidence-guided pruning, yielding Theta(k log(N/k) + k) interventions over N candidates.
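To make the search pattern concrete, the following is a minimal sketch of hierarchical ablation framed as adaptive group testing, assuming the paper's monotone-overtopping abstraction holds. The `group_flips` oracle, the bisection order, and all names are illustrative stand-ins for one batched ablation-plus-verification run, not the authors' implementation.

```python
# Minimal sketch: contrastive hierarchical ablation as adaptive group testing.
# Assumes a monotone, saturating group effect: if any agonist is in `group`,
# ablating the whole group flips behavior on the contrastive examples.

def localize_agonists(candidates, group_flips):
    """Recursively bisect candidate neurons, pruning any group whose joint
    ablation produces no behavioral flip. `group_flips(group) -> bool`
    stands in for one batched ablation + verification run."""
    found = []
    stack = [list(candidates)]
    while stack:
        group = stack.pop()
        if not group_flips(group):   # conservative pruning: no agonist inside,
            continue                 # so the whole group is discarded in one call
        if len(group) == 1:
            found.append(group[0])   # isolated a single agonist neuron
            continue
        mid = len(group) // 2
        stack.append(group[:mid])    # recurse on both halves
        stack.append(group[mid:])
    return found

# Toy oracle: pretend neurons 3 and 17 are the agonists.
agonists = {3, 17}
calls = 0
def toy_oracle(group):
    global calls
    calls += 1
    return any(n in agonists for n in group)

print(localize_agonists(range(32), toy_oracle), "oracle calls:", calls)
```

Under the monotone assumption, any group containing no agonist tests negative and is pruned in a single call, which is where the k log(N/k) savings over N single-neuron ablations come from.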

Load-bearing premise

Sparse agonist effects are approximately monotone and saturating within a fixed baseline/flip regime, and agonists emerge more reliably when ablations are verified through data splits aligned with close-to-faithful rule behavior.
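One way to picture why group effects would be monotone and saturating is the regime-conditional union-of-flips model the review quotes later in the Lean-links section: each neuron flips a fixed set of examples and a group flips their union. A toy sketch with invented flip sets:

```python
# Union-of-flips toy model: each neuron flips a fixed set of examples, and a
# group's flip set is the union. Union coverage is monotone (adding neurons
# never shrinks it) and saturating (overlapping neurons add diminishing new
# flips), matching the overtopping picture. The flip sets below are made up.

flip_sets = {
    "n1": {0, 1, 2, 3, 4, 5},   # dominant neuron: overtops the others
    "n2": {2, 3, 4},            # fully overlapped by n1
    "n3": {5, 6},               # small marginal contribution
}

def group_flip_count(group):
    covered = set()
    for n in group:
        covered |= flip_sets[n]
    return len(covered)

print(group_flip_count(["n1"]))              # 6
print(group_flip_count(["n1", "n2"]))        # 6 -> overtopped, no gain
print(group_flip_count(["n1", "n2", "n3"]))  # 7 -> saturating marginal gain
```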

What would settle it

A brute-force search that identifies many high-effect agonists not recalled by MechaRule, or an experiment where suppressing the localized agonists fails to reduce arithmetic accuracy or jailbreak success, would falsify the central claims.
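The suppression half of that test can be run with hooks in TransformerLens [38]. A minimal sketch, assuming the localized agonists arrive as (layer, neuron) pairs in the MLP and that zero-ablation is the intended neutralization; the model name, pairs, and prompt are placeholders, and the paper's exact neutralization scheme may differ:

```python
# Sketch: zero-ablate candidate agonist neurons with TransformerLens [38]
# and compare behavior before/after. The (layer, neuron) pairs are
# hypothetical; the paper's actual neutralization and evaluation may differ.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # stand-in for Qwen2/GPT-J
agonists = [(5, 123), (7, 881)]                    # hypothetical (layer, neuron)

def make_ablation_hook(neuron_idx):
    def hook(value, hook):                 # value: [batch, pos, d_mlp]
        value[:, :, neuron_idx] = 0.0      # neutralize the MLP neuron
        return value
    return hook

fwd_hooks = [
    (utils.get_act_name("post", layer), make_ablation_hook(neuron))
    for layer, neuron in agonists
]

prompt = "12 + 7 ="
baseline = model.generate(prompt, max_new_tokens=4)
with model.hooks(fwd_hooks=fwd_hooks):
    ablated = model.generate(prompt, max_new_tokens=4)
print(baseline, "|", ablated)  # a real accuracy drop would support the claim
```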

Figures

Figures reproduced from arXiv: 2605.03058 by Francesco Sovrano, Gabriele Dominici, Marc Langheinrich.

Figure 1: Problem overview. We seek a singleton neuron … (view at source ↗)
Figure 2: Pipeline overview: RuleSHAP extracts behavioral splitter rules from … (view at source ↗)
Figure 3: Arithmetic/Qwen2: layerwise flip coverage by high … (view at source ↗)
Figure 4: Conceptual picture: dominance (overtopping) and overlap explain why group effects are neither purely additive nor … (view at source ↗)
Figure 5: Runtime or two-pass intervention via high-MCC neuron-anchored rules. A neuron-anchored rule … (view at source ↗)
Figure 6: Sensitivity of localized-neuron counts to the CHA effect threshold … (view at source ↗)
Figure 7: Representative ECDF overlays for arithmetic under rule split + spectral coverage. (view at source ↗)
Figure 8: Agonist signatures for arithmetic on Qwen2-7B-Instruct under rule split + spectral coverage. (view at source ↗)
Figure 9: Agonist signatures for arithmetic on GPT-J-6B under rule split + spectral coverage. (view at source ↗)
Figure 10: Agonist signatures for BoN jailbreaking on Qwen2-7B-Instruct under rule split + spectral coverage. (view at source ↗)
original abstract

A key goal of explainable AI (XAI) is to express the decision logic of large language models (LLMs) in symbolic form and link it to internal mechanisms. Global rule-extraction methods typically learn symbolic surrogates without grounding rules in model circuitry, while mechanistic interpretability can connect behaviors to neuron sets but often depends on hand-crafted hypotheses and expensive neuron-level interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by efficiently localizing sparse neurons called agonists, whose activation neutralization disrupts rule-related behaviors. MechaRule rests on two empirical observations. First, within a fixed baseline/flip regime, sparse agonist effects can be approximately monotone and saturating: a few dominant neuron activations can overtop weaker ones at coarse scales, while overlapping neurons flip many of the same examples. This motivates viewing localization as adaptive group testing driven by a regime-conditional strength predicate with confidence-guided conservative pruning, yielding Theta(k log(N/k) + k) interventions over N candidates when k << N neurons are agonists under the monotone-overtopping abstraction. Second, agonists emerge more reliably when ablations are verified through data splits aligned with close-to-faithful rule behavior; spectral splits remain a useful rule-free fallback, while unfaithful splits degrade localization. Empirically, overtopping appears mainly in learned, task-aligned regimes: on arithmetic and jailbreak tasks across Qwen2 and GPT-J, MechaRule recalls 96.8% of high-effect brute-force agonists in completed comparisons, and suppressing localized agonists reduces arithmetic accuracy and jailbreak success by up to 71.1% and 8.8%, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MechaRule, a pipeline for neuron-anchored rule extraction in LLMs using contrastive hierarchical ablation to localize sparse agonist neurons. It rests on two empirical observations: within a fixed baseline/flip regime, sparse agonist effects are approximately monotone and saturating (motivating adaptive group testing with Theta(k log(N/k) + k) interventions), and agonists emerge more reliably with data splits aligned to close-to-faithful rule behavior. Experiments on arithmetic and jailbreak tasks across Qwen2 and GPT-J report 96.8% recall of high-effect brute-force agonists and performance reductions of up to 71.1% (arithmetic accuracy) and 8.8% (jailbreak success) after suppression.

Significance. If the results and underlying observations hold, the work could meaningfully advance XAI by efficiently grounding symbolic rules in LLM circuitry, reducing reliance on hand-crafted hypotheses while achieving practical localization efficiency. The reported recall and causal impact metrics indicate potential utility for task-aligned behaviors, though the empirical foundation limits immediate theoretical impact.

major comments (2)
  1. [Abstract] Abstract (first observation): the monotone-overtopping abstraction (dominant neurons overtop weaker ones with overlapping flips) is load-bearing for both the Theta(k log(N/k) + k) complexity bound and the claim of reliable localization, yet the manuscript provides no marginal-effect curves, interaction matrices, or direct comparison to additive/synergistic alternatives to confirm saturation and dominance within the baseline/flip regime.
  2. [Abstract] Abstract (empirical claims): the 96.8% recall of brute-force agonists and the 71.1%/8.8% performance reductions are presented without baselines, statistical tests, error bars, or full experimental controls, making it impossible to assess whether the results support the central claim that MechaRule reliably grounds rules in circuitry.
minor comments (2)
  1. [Abstract] The second observation on data splits (spectral vs. faithful) is stated clearly but would benefit from explicit pseudocode or a small illustrative example showing how splits are constructed and verified; one hedged sketch of a possible construction follows this list.
  2. Notation for 'agonists' and the 'regime-conditional strength predicate' is introduced without a dedicated definitions subsection, which could confuse readers unfamiliar with the contrastive ablation setup.
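For illustration only, here is one plausible reading of a rule-free spectral split, clustering examples by their hidden-activation geometry via PCA [32]; every name and the exact construction are assumptions of this sketch, not the paper's procedure:

```python
# Hedged sketch of a rule-free "spectral" data split: project hidden
# activations onto the top principal axis (PCA [32]) and split examples by
# the sign of the projection. One plausible construction, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 64))   # stand-in for per-example hidden activations

acts = acts - acts.mean(axis=0)                      # center
_, _, vt = np.linalg.svd(acts, full_matrices=False)  # principal axes
scores = acts @ vt[0]                                # top-component projection

split_a = np.where(scores >= 0)[0]  # one side of the spectral split
split_b = np.where(scores < 0)[0]   # the other side
print(len(split_a), len(split_b))
```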

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and commit to revisions that directly strengthen the substantiation of the monotone-overtopping abstraction and the presentation of empirical results.

point-by-point responses
  1. Referee: [Abstract] Abstract (first observation): the monotone-overtopping abstraction (dominant neurons overtop weaker ones with overlapping flips) is load-bearing for both the Theta(k log(N/k) + k) complexity bound and the claim of reliable localization, yet the manuscript provides no marginal-effect curves, interaction matrices, or direct comparison to additive/synergistic alternatives to confirm saturation and dominance within the baseline/flip regime.

    Authors: We agree that the monotone-overtopping abstraction is central to both the complexity analysis and the localization claims. The manuscript grounds this in empirical observations from arithmetic and jailbreak tasks, but we acknowledge the absence of explicit marginal-effect curves, interaction matrices, and model comparisons. In the revised manuscript we will add these visualizations and analyses, including direct contrasts against additive and synergistic alternatives, to confirm saturation and dominance in the baseline/flip regime. revision: yes

  2. Referee: [Abstract] Abstract (empirical claims): the 96.8% recall of brute-force agonists and the 71.1%/8.8% performance reductions are presented without baselines, statistical tests, error bars, or full experimental controls, making it impossible to assess whether the results support the central claim that MechaRule reliably grounds rules in circuitry.

    Authors: The 96.8% recall and performance reductions are derived from completed brute-force comparisons reported in the experiments section. We agree that the abstract and results would be strengthened by explicit baselines, statistical tests, error bars, and additional controls. We will revise the abstract and results to incorporate these elements, including error bars on all reported metrics and expanded control experiments, to allow rigorous evaluation of the grounding claims. revision: yes
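For a sense of what such error bars could look like: the bibliography already includes the Clopper-Pearson [9] and Wilson [54] binomial intervals, which suit flip-rate and recall proportions. A minimal sketch of the Wilson interval (textbook formula; the counts are hypothetical):

```python
# Wilson score interval [54] for a binomial proportion, e.g. a flip rate or
# the 96.8% recall figure. Standard formula; not the authors' code.
from math import sqrt

def wilson_interval(successes, trials, z=1.96):  # z = 1.96 ~ 95% confidence
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center - half, center + half

# e.g. recalling 61 of 63 high-effect agonists (~96.8%; counts hypothetical)
print(wilson_interval(61, 63))
```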

Circularity Check

0 steps flagged

No significant circularity; derivation rests on independent empirical observations and interventions

full rationale

The paper presents two empirical observations about agonist effects (monotone/saturating overtopping within fixed regimes, and better localization via aligned data splits) as starting points, then derives an adaptive group-testing procedure with Theta(k log(N/k) + k) complexity under that abstraction. It validates the approach via separate brute-force comparisons (96.8% recall) and causal suppression experiments (performance drops) on held-out tasks and models. No equations or claims reduce by construction to fitted parameters, self-definitions, or self-citations; the observations are stated as external empirical findings rather than outputs of the method itself. The efficiency bound follows directly from standard group-testing analysis once the abstraction is granted, and the recall metric compares against an independent brute-force baseline rather than a fitted surrogate.
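For reference, the counting argument behind that standard analysis (cf. [2, 14]), stated here as a sketch rather than reproduced from the paper: each intervention returns one bit, so any adaptive scheme needs enough tests to distinguish all possible agonist sets.

```latex
% Standard adaptive group-testing counting bound (cf. [2, 14]); a sketch,
% not the paper's derivation. T binary-outcome tests distinguish at most
% 2^T hypotheses, which must cover the \binom{N}{k} candidate agonist sets:
\[
  2^{T} \ge \binom{N}{k}
  \quad\Longrightarrow\quad
  T \ge \log_2 \binom{N}{k} \ge k \log_2\!\frac{N}{k},
\]
% and bisection-style adaptive schemes attain O(k log(N/k) + k) tests,
% matching the Theta(k log(N/k) + k) rate quoted above.
```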

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Central claims rest on two stated empirical observations about neuron effects and data-split reliability rather than on derived principles; the ledger lists no free parameters, and the one invented entity ("agonists") has no independent evidence.

axioms (2)
  • domain assumption Sparse agonist effects can be approximately monotone and saturating within a fixed baseline/flip regime
    First empirical observation motivating adaptive group testing and pruning.
  • domain assumption Agonists emerge more reliably when ablations are verified through data splits aligned with close-to-faithful rule behavior
    Second empirical observation; spectral splits noted as fallback.
invented entities (1)
  • agonists (no independent evidence)
    purpose: Sparse neurons whose activation neutralization disrupts rule-related behaviors
    Core concept introduced to ground rule extraction in circuitry

pith-pipeline@v0.9.0 · 5605 in / 1567 out tokens · 94422 ms · 2026-05-08T18:50:46.110002+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative (J-cost / cosh structure absent here) · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "we model slice-wise flip rates as potentially monotone and saturating set functions; a concrete abstraction is a regime-conditional union-of-flips model ... compatible with coverage-type submodularity"

  • reality_from_one_distinction (domain orthogonal to RS forcing chain) · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "We evaluate MechaRule on open-weight causal LLMs ... Qwen2-7B-Instruct, Qwen2-1.5B-Instruct, and gpt-j-6B ... main tasks are arithmetic, ... and Best-of-N jailbreaking"

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 28 canonical work pages · 1 internal anchor

  [1] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. 1993. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. ACM Press, 207–216. doi:10.1145/170035.170072
  [2] Matthew Aldridge, Oliver Johnson, and Jonathan Scarlett. 2019. Group testing: an information theory perspective. Foundations and Trends in Communications and Information Theory 15, 3–4 (2019), 196–392. doi:10.1561/0100000099
  [3] Mamdouh Alenezi and Mohammed Akour. 2025. AI-driven innovations in software engineering: a review of current practices and future directions. Applied Sciences 15, 3 (2025), 1344.
  [4] Elaine Angelino, Nicholas Larus-Stone, Daniel Alabi, Margo I. Seltzer, and Cynthia Rudin. 2017. Learning certifiably optimal rule lists. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 35–44. doi:10.1145/3097983.3098047
  [5] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer Normalization. arXiv:1607.06450 [stat.ML]. https://arxiv.org/abs/1607.06450
  [6] Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. 2023. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
  [7] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–794.
  [8] Davide Chicco and Giuseppe Jurman. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21, 1 (2020), 6.
  [9] Charles J. Clopper and Egon S. Pearson. 1934. The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial. Biometrika 26, 4 (1934), 404–413. doi:10.1093/biomet/26.4.404
  [10] William W. Cohen. 1995. Fast Effective Rule Induction. In Proceedings of the 12th International Conference on Machine Learning. Morgan Kaufmann, 115–123.
  [11] Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. 2023. Towards Automated Circuit Discovery for Mechanistic Interpretability. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023).
  [12, 13] Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge Neurons in Pretrained Transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 8493–8502. doi:10.18653/v1/2022.acl-long.581
  [14] Dingzhu Du and Frank Hwang. 2000. Combinatorial Group Testing and Its Applications. World Scientific.
  [15] Jerome H. Friedman and Bogdan E. Popescu. 2008. Predictive Learning via Rule Ensembles. The Annals of Applied Statistics (2008), 916–954.
  [16] Satoru Fujishige. 2005. Submodular Functions and Optimization (2nd ed.). Annals of Discrete Mathematics, Vol. 58. Elsevier. doi:10.1016/S0167-5060(05)80001-9
  [17] Teofilo F. Gonzalez. 1985. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38 (1985), 293–306.
  [18] Jian Gu, Aldeida Aleti, Chunyang Chen, and Hongyu Zhang. 2023. Neuron Patching: Semantic-based Neuron-level Language Model Repair for Code Generation. arXiv:2312.05356 [cs.CL]
  [19] Michael Hanna, Sandro Pezzelle, and Yonatan Belinkov. 2024. Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms. In ICML 2024 Workshop on Mechanistic Interpretability.
  [20] Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs). arXiv:1606.08415 [cs.LG]. https://arxiv.org/abs/1606.08415
  [21] Julia Herbinger, Susanne Dandl, Fiona K. Ewald, Sofia Loibl, and Giuseppe Casalicchio. 2023. Leveraging model-based trees as interpretable surrogate models for model distillation. In European Conference on Artificial Intelligence. Springer, 232–249.
  [22] Zeyu Huang, Yikang Shen, Xiaofeng Zhang, Jie Zhou, Wenge Rong, and Zhang Xiong. 2023. Transformer-Patcher: One Mistake Worth One Neuron. In ICLR.
  [23] John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, and Mrinank Sharma. 2024. Best-of-N jailbreaking. arXiv preprint arXiv:2412.03556 (2024).
  [24] Frank K. Hwang. 1972. A Method for Detecting All Defective Members in a Population by Group Testing. J. Amer. Statist. Assoc. 67, 339 (1972), 605–608. doi:10.1080/01621459.1972.10481257
  [25] Houcheng Jiang, Junfeng Fang, Tianyu Zhang, Baolong Bi, An Zhang, Ruipeng Wang, Tao Liang, and Xiang Wang. 2025. Neuron-Level Sequential Editing for Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 16678–16702.
  [26, 27] Benjamin Letham, Cynthia Rudin, Tyler H. McCormick, and David Madigan. 2015. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. The Annals of Applied Statistics 9, 3 (2015), 1350–1371. doi:10.1214/15-AOAS848
  [28] Tianhe Lin, Jian Xie, Siyu Yuan, and Deqing Yang. 2025. Implicit reasoning in transformers is reasoning through shortcuts. arXiv preprint arXiv:2503.07604 (2025).
  [29] Bing Liu, Wynne Hsu, and Yiming Ma. 1998. Integrating Classification and Association Rule Mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. AAAI Press, 80–86.
  [30] Jinzhe Liu, Junshu Sun, Shufan Shen, Chenxue Yang, and Shuhui Wang. 2025. Edit Less, Achieve More: Dynamic Sparse Neuron Masking for Lifelong Knowledge Editing in LLMs. arXiv:2510.22139 [cs.CL]
  [31] Scott M. Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems 30 (2017).
  [32] Andrzej Maćkiewicz and Waldemar Ratajczak. 1993. Principal components analysis (PCA). Computers & Geosciences 19, 3 (1993), 303–342.
  [33] R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 3428–3448.
  [34] Sara El Mekkaoui, Loubna Benabbou, and Abdelaziz Berrado. 2023. Rule-Extraction Methods from Feedforward Neural Networks: A Systematic Literature Review. arXiv preprint arXiv:2312.12878 (2023).
  [35] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and Editing Factual Associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 35. 17359–17372.
  [36] Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2023. Mass-Editing Memory in a Transformer. In Proc. of ICLR.
  [37] Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. 2022. Fast Model Editing at Scale. In Proc. of ICLR.
  [38] Neel Nanda and Joseph Bloom. 2022. TransformerLens. https://github.com/TransformerLensOrg/TransformerLens
  [39] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher. 1978. An Analysis of Approximations for Maximizing Submodular Set Functions—I. Mathematical Programming 14, 1 (1978), 265–294. doi:10.1007/BF01588971
  [40] Yaniv Nikankin, Anja Reusch, Aaron Mueller, and Yonatan Belinkov. 2024. Arithmetic without algorithms: Language models solve math with a bag of heuristics. arXiv preprint arXiv:2410.21272 (2024).
  [41] nostalgebraist. 2020. Interpreting GPT: the Logit Lens. LessWrong. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
  [42] Haowen Pan, Yixin Cao, Xiaozhi Wang, Xun Yang, and Meng Wang. 2024. Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers. In Findings of the Association for Computational Linguistics: ACL 2024. 1012–1037.
  [43] Haowen Pan, Xiaozhi Wang, Yixin Cao, Zenglin Shi, Xun Yang, Juanzi Li, and Meng Wang. 2025. Precise Localization of Memories: A Fine-Grained Neuron-Level Knowledge Editing Technique for LLMs. In International Conference on Learning Representations (ICLR).
  [44] Rafael Poyiadzi, Xavier Renard, Thibault Laugel, Raul Santos-Rodriguez, and Marcin Detyniecki. 2021. Understanding surrogate explanations: the interplay between complexity, fidelity and coverage. arXiv preprint arXiv:2107.04309 (2021).
  [45] Mengchao Ren. 2024. Advancements and applications of large language models in natural language processing: A comprehensive review. Applied and Computational Engineering 97 (2024), 55–63.
  [46] Rudy Setiono and Huan Liu. 1995. Understanding neural networks via rule extraction. In IJCAI, Vol. 1. 480–485.
  [47] Francesco Sovrano. 2025. Can Global XAI Methods Reveal Injected Bias in LLMs? SHAP vs Rule Extraction vs RuleSHAP. arXiv preprint arXiv:2505.11189 (2025).
  [48] Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, et al. 2025. A survey of reasoning with foundation models: Concepts, methodologies, and outlook. Comput. Surveys 57, 11 (2025), 1–43.
  [49] Aaquib Syed, Can Rager, and Arthur Conmy. 2024. Attribution Patching Outperforms Automated Circuit Discovery. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics.
  [50] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS).
  [51] Huanqian Wang, Yang Yue, Rui Lu, Jingxin Shi, Andrew Zhao, Shenzhi Wang, Shiji Song, and Gao Huang. 2025. Model Surgery: Modulating LLM's Behavior via Simple Parameter Editing. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 6337–6357.
  [52] Peng-Yuan Wang, Tian-Shuo Liu, Chenyang Wang, Ziniu Li, Yidi Wang, Shu Yan, Chengxing Jia, Xu-Hui Liu, Xinwei Chen, Jiacheng Xu, et al. 2025. A survey on large language models for mathematical reasoning. Comput. Surveys (2025).
  [53] Weixuan Wang, Jingyuan Yang, and Wei Peng. 2025. Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors. In Proc. of ICLR.
  [54] Edwin B. Wilson. 1927. Probable Inference, the Law of Succession, and Statistical Inference. J. Amer. Statist. Assoc. 22, 158 (1927), 209–212. doi:10.2307/2276774
  [55] An Yang, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoyan Huang, Jiandong Jiang, Jianhong Tu, Jianwei Zhang, Jingren Zhou, et al. 2025. Qwen2.5-1M technical report. arXiv preprint arXiv:2501.15383 (2025).
  [56] Zeping Yu and Sophia Ananiadou. 2025. Understanding and Mitigating Gender Bias in LLMs via Interpretable Neuron Editing. arXiv:2501.14457 [cs.CL]
  [57] Mateo Espinosa Zarlenga, Zohreh Shams, and Mateja Jamnik. 2021. Efficient decompositional rule extraction for deep neural networks. arXiv preprint arXiv:2111.12628 (2021).
  [58] Jusheng Zhang, Ningyuan Liu, Yijia Fan, Zihao Huang, Qinglin Zeng, Kaitong Cai, Jian Wang, and Keze Wang. 2025. LLM-CAS: Dynamic Neuron Perturbation for Real-Time Hallucination Correction. arXiv:2512.18623 [cs.CL]
  [59] Tianyu Zhang, Junfeng Fang, Houcheng Jiang, Baolong Bi, Xiang Wang, and Xiangnan He. 2025. Explainable and Efficient Editing for Large Language Models. In Proceedings of the ACM on Web Conference 2025 (WWW '25). 1963–1976. doi:10.1145/3696410.3714835
  [60] Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, and Mengnan Du. 2024. Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology 15, 2 (2024), 1–38.
  [61] Wei Zhou, Wei Wei, Guibang Cao, and Fei Wang. 2025. Editing Memories Through Few Targeted Neurons. Proceedings of the AAAI Conference on Artificial Intelligence 39, 19 (2025), 20360–20368. doi:10.1609/aaai.v39i19.30250
  [62] Jan Ruben Zilke, Eneldo Loza Mencía, and Frederik Janssen. 2016. DeepRED: Rule extraction from deep neural networks. In International Conference on Discovery Science. Springer, 457–473.