
arxiv: 2605.21303 · v1 · pith:XKXXSS42new · submitted 2026-05-20 · 💻 cs.LG · cs.AI · cs.LO

From Circuit Evidence to Mechanistic Theory: An Inductive Logic Approach

Pith reviewed 2026-05-21 05:57 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.LO
keywords mechanistic interpretability · neural circuits · inductive logic programming · causal signatures · model comparison · theory construction · circuit transfer

The pith

Pairing causal evidence with inductive logic turns isolated neural circuits into comparable and portable mechanistic theories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats each discovered circuit not as a standalone experiment but as input to a formal theory-building process. It defines a Causal Functional Signature that records what the circuit actually does with tokens according to causal interventions, and an architectural signature that encodes the circuit's structure as a logic program learned from scale-invariant predicates. If these signatures work as intended, findings from one model or task become directly comparable to those from another through logical subsumption, and insights transfer across scales instead of remaining trapped in individual papers. The approach is demonstrated by showing that attention circuits copy via one strategy while MLP circuits bind via another, and that the learned signatures separate structures more cleanly than standard graph methods.

Core claim

Each circuit is characterised at two levels: a Causal Functional Signature (CFS) that grounds component behaviour in causal attribution evidence and token role profiles, and an architectural signature τ_arch learned by inductive logic programming from scale-invariant structural predicates. Together these form a formal coherence layer that makes mechanistic claims explicit, comparable via θ-subsumption, and portable across model scales. CFS reveals qualitatively distinct computational strategies across task types, including attention-mediated copying versus MLP-mediated binding, while ILP signatures achieve substantially better structural separation than graph kernel and feature-vector baselines.

What carries the argument

The formal coherence layer consisting of the Causal Functional Signature (CFS), which encodes causal attribution evidence and token role profiles, and the architectural signature τ_arch learned by inductive logic programming from scale-invariant structural predicates.

If this is right

  • Mechanistic claims become explicitly comparable across experiments by checking whether one signature θ-subsumes another.
  • Architectural signatures support principled transfer of findings from smaller models to larger models and across architecture families.
  • Distinct computational strategies, such as attention-mediated copying versus MLP-mediated binding, become identifiable as separate classes.
  • Structural separation of circuits improves over graph kernel and feature-vector baselines while remaining interpretable as logic programs.
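
To make the first bullet concrete, here is a minimal sketch of a θ-subsumption check over clause bodies. It is not the paper's implementation: the literal encoding, the convention that uppercase terms are variables, and the predicate names `comp` and `feeds` are all assumptions made for illustration. A clause C θ-subsumes a clause D when some substitution θ maps every literal of C onto a literal of D.

```python
from typing import Dict, Optional, Sequence, Tuple

# A literal is (predicate, arg1, arg2, ...). Hypothetical encoding, not from the paper.
Literal = Tuple[str, ...]
Subst = Dict[str, str]


def is_var(term: str) -> bool:
    """Assumed convention: terms starting with an uppercase letter are variables."""
    return term[:1].isupper()


def match(lit_c: Literal, lit_d: Literal, theta: Subst) -> Optional[Subst]:
    """Try to extend theta so that lit_c under theta equals lit_d; None on failure."""
    if lit_c[0] != lit_d[0] or len(lit_c) != len(lit_d):
        return None
    theta = dict(theta)  # copy so that backtracking leaves the caller's bindings intact
    for a, b in zip(lit_c[1:], lit_d[1:]):
        if is_var(a):
            # Bind a to b, or fail if a is already bound to a different term.
            if theta.setdefault(a, b) != b:
                return None
        elif a != b:
            return None
    return theta


def theta_subsumes(c: Sequence[Literal], d: Sequence[Literal],
                   theta: Optional[Subst] = None) -> bool:
    """True if some substitution maps every literal of c onto a literal of d.

    Terms of d are treated as fixed (never rebound), as in standard theta-subsumption.
    """
    theta = {} if theta is None else theta
    if not c:
        return True
    first, rest = c[0], c[1:]
    for lit_d in d:
        extended = match(first, lit_d, theta)
        if extended is not None and theta_subsumes(rest, d, extended):
            return True
    return False


# Hypothetical signatures: "some attention component feeds something" (general)
# versus a concrete circuit description (specific).
general = [("comp", "C", "X", "attn"), ("feeds", "X", "Y")]
specific = [("comp", "c1", "h3", "attn"), ("comp", "c1", "m5", "mlp"), ("feeds", "h3", "m5")]

print(theta_subsumes(general, specific))
print(theta_subsumes(specific, general))
```

In this toy case the general clause subsumes the specific one but not the reverse, which is the asymmetric "at least as general as" relation the comparison relies on. The backtracking search is exponential in the worst case (deciding θ-subsumption is NP-complete), which is tolerable for the short clauses sketched here.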

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A growing collection of signatures could function as a shared reference set against which new circuit discoveries are automatically matched or classified.
  • The same logic-programming abstraction might be applied to other forms of mechanistic evidence, such as activation patterns or ablation results, to build broader theories.
  • If signatures prove portable, they could be used to predict which circuit types are likely to emerge in a new architecture before any experiments are run.

Load-bearing premise

That the scale-invariant structural predicates supplied to inductive logic programming are sufficient to capture the computationally relevant architectural features of circuits, and that the resulting signatures support meaningful transfer and separation beyond what graph kernels already achieve.
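
As an illustration of what "scale-invariant" could mean in practice, the sketch below computes a component-type ratio for a circuit. The paper's figure text mentions a predicate of the form comp_ratio(C, attn, R) with a threshold R > 0.63, but its exact definition is not given on this page, so the definition here (share of circuit components of a given type) and the component naming scheme are assumptions.

```python
from typing import Iterable, Tuple

# (component name, component type), e.g. ("L3.H2", "attn"). Hypothetical encoding.
Component = Tuple[str, str]


def comp_ratio(circuit: Iterable[Component], kind: str) -> float:
    """Assumed definition: fraction of the circuit's components whose type is `kind`."""
    types = [ctype for _, ctype in circuit]
    if not types:
        return 0.0
    return types.count(kind) / len(types)


# Two made-up circuits of very different size but the same attention/MLP mix.
small = [("L0.H1", "attn"), ("L1.H0", "attn"), ("L1.MLP", "mlp")]
large = ([(f"L{i}.H0", "attn") for i in range(20)]
         + [(f"L{i}.MLP", "mlp") for i in range(10)])

print(comp_ratio(small, "attn"), comp_ratio(large, "attn"))
print(comp_ratio(small, "attn") > 0.63)  # threshold value taken from the figure text
```

Because the ratio depends only on the mix of component types and not on absolute counts, both circuits satisfy the same thresholded predicate. Whether such predicates also capture the computationally relevant structure is exactly what the premise above leaves open.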

What would settle it

The claim would fail if circuits that human experts judge to implement the same function received signatures that do not subsume one another under θ-subsumption, or if signatures learned on small models could not classify, or transfer to, circuits found in larger models of the same family on the same task.

Figures

Figures reproduced from arXiv: 2605.21303 by Andre Freitas, Danilo S. Carvalho, Nura Aljaafari.

Figure 1: Overview for inductive circuit theory construction. A discovered circuit C receives two complementary descriptions: a Causal Functional Signature σ capturing what the circuit computes via causal attribution evidence, and an architectural signature τ_arch capturing how it is structurally realised as a scale-invariant Horn clause learned by ILP_arch. They form a reusable logical record L(C) = ⟨σ, τ_arch, S_ILP, …
Figure 2: DLA magnitude per component type (Pythia-1B). Mean |δv| for attention heads (x-axis) vs. MLP layers (y-axis). IOI falls below the diagonal (attn-dominant, |δv|_attn ≈ 2.0); Location, Time, and Path fall above (MLP-dominant). GT is MLP-dominant with high absolute magnitude (|δv|_MLP ≈ 2.0), consistent with MLP-mediated numerical comparison. [Spilled table fragment: task predicates in τ_arch; the IOI row begins comp_ratio(C, attn, R), R > 0.63 …]
Figure 3: ILP signature distance (Pythia-1B). IOI is distant from all circuits (d ≥ 0.32). Roles and GT cluster tightly (d ≤ 0.13). [Bar-chart labels spilled from the adjacent plot, "Structural Baseline Comparison": IOI separation (×) for ILP signature, WL kernel, and feature vector is 3.9×, 1.3×, 1.2× on Pythia-1B and 4.2×, 1.5×, 1.1× on LLaMA-1B.]
Figure 4: IOI separation: ILP vs. structural baselines. Grouped by model. ILP achieves 3.9–4.2×; WL kernel 1.3–1.5×. [Spilled body text: … roles because its clause uses the same predicate vocabulary despite implementing a different computation, suggesting that GT circuits are realised through structural patterns closer to binding than to attention-copying. The threshold δ_ILP = 0.30 cleanly separates IOI from all other circuits; the same …]
Figure 5: Pipeline for inductive circuit theory construction. A discovered circuit C is encoded in a Formal Circuit Representation (FCR) across four layers (provenance, structure, causal behaviour, and learned signatures). A Causal Functional Signature (CFS, σ) is derived from causal attribution evidence (RQ1). ILP_arch learns an architectural signature τ_arch over scale-invariant predicates. Validated triples ⟨σ, τar…
Figure 6: Full ILP signature distance (Pythia-1B, 10 tasks). IOI and GT retain high separation from the role cluster. Within roles, Source and Time form a tight sub-cluster, while Goal is the most distant role. The expanded matrix confirms that the 5-task results (…
Figure 7: ILP signature distance (Pythia-14M, 10 tasks). Structural homogeneity dominates: most pairs cluster at d ≤ 0.06, with many identical (d = 0.00). Goal is the sole outlier (d = 0.28–0.36), driven by its degenerate size-only clause (…
Figure 8: ILP signature distance (LLaMA-3.2-1B, 10 tasks). IOI is strongly separated from all others (d = 0.34–0.45). GT is intermediate (d = 0.05–0.11 from roles). Goal is distinctive (d = 0.07–0.17) owing to its unique scaffold_event_entity motif. Location, Time, Source, and Topic share identical signatures (d = 0.00), collapsing into one structural class. [Spilled body text: …ficiary marginal-contribution patterns seen across all thr…]
Figure 9: Within-task vs. between-task structural variance (Pythia-1B). Each bar pair shows the mean WL kernel distance among splits of the same task (blue, within) vs. to other tasks (red, between). IOI has the largest gap (1.8×), confirming its prompt-invariant circuit topology. Most semantic roles show ratios near 1.0, indicating that prompt subsets produce circuits as structurally different from each other as fr…
Figure 10: δ_ILP sensitivity (Pythia-1B, 10 tasks). F1 peaks at δ_ILP = 0.10 (F1 = 0.81) with precision 0.72 and recall 0.93. The curve is stable across 0.10–0.15; beyond 0.15, all same-family pairs are accepted (recall = 1.0) but precision drops as cross-family pairs also fall below threshold. [Spilled body text: …istry is the same (Pythia-1B). What differs is the source ∆T: IOI's sign reversal (−0.91 in 14M vs. 5.24 in LLaMA) and Time'…]
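
As a quick consistency check on the operating point reported in the Figure 10 caption, F1 is the harmonic mean of precision P and recall R. Using the caption's P = 0.72 and R = 0.93 (the only inputs; nothing else is assumed):

$$
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.72 \times 0.93}{0.72 + 0.93} = \frac{1.3392}{1.65} \approx 0.81
$$

The three reported numbers are therefore mutually consistent.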
Original abstract

Mechanistic interpretability produces circuit-level causal analyses of neural network behaviour, but discovered circuits often remain isolated experimental artefacts: there is no shared formal representation for what circuits compute, how they relate, or when two findings provide evidence for the same mechanism. This work provides a formal infrastructure for cumulative mechanistic science by treating circuit interpretation as inductive theory construction. Each circuit is characterised at two levels: a Causal Functional Signature (CFS), which grounds component behaviour in causal attribution evidence and token role profiles, and an architectural signature $\tau_{\mathrm{arch}}$, learned by inductive logic programming (ILP) from scale-invariant structural predicates. Together, these constitute a formal coherence layer that makes mechanistic claims explicit, comparable via $\theta$-subsumption, and portable across model scales. CFS reveals qualitatively distinct computational strategies across task types, including attention-mediated copying versus MLP-mediated binding. ILP signatures achieve substantially better structural separation than graph kernel and feature-vector baselines, and support principled transfer across model scales and architecture families.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a formal infrastructure for cumulative mechanistic interpretability by characterizing each circuit via a Causal Functional Signature (CFS) grounded in causal attribution evidence and token role profiles, together with an architectural signature τ_arch learned by inductive logic programming (ILP) from scale-invariant structural predicates. These elements form a coherence layer enabling explicit mechanistic claims that are comparable via θ-subsumption and portable across model scales and architecture families. The abstract reports that CFS reveals qualitatively distinct strategies (e.g., attention-mediated copying versus MLP-mediated binding) and that ILP signatures achieve substantially better structural separation than graph-kernel and feature-vector baselines.

Significance. If the central claims are substantiated with quantitative evidence, the work would supply a shared formal representation that moves mechanistic interpretability from isolated circuit discoveries toward cumulative theory construction. The introduction of ILP-derived signatures offers a principled route to structural comparison and cross-scale transfer that, if shown to exceed existing graph-based methods, could become a standard tool for relating findings across models.

major comments (3)
  1. Abstract: the claim that ILP signatures achieve 'substantially better structural separation' than graph kernel and feature-vector baselines is presented without any quantitative metrics, error bars, dataset details, or statistical tests. This absence is load-bearing for the central claim of improved separation and principled transfer.
  2. Abstract / method description: no explicit checks or ablations are reported demonstrating that the supplied scale-invariant structural predicates remain informative and align with causal roles in the CFS when circuit depth or width changes by orders of magnitude. Without such evidence the portability claim risks reducing to recovery of graph topology already achievable by kernels.
  3. Abstract: the qualitative distinction between attention-mediated copying and MLP-mediated binding via CFS is asserted but lacks detail on how the signatures are computed, validated, or compared to alternative functional representations, leaving the coherence-layer claim under-supported.
minor comments (2)
  1. The notation τ_arch and the precise definition of θ-subsumption would benefit from a short illustrative example early in the manuscript to aid readability.
  2. Clarify whether the ILP predicates are chosen a priori or derived from the circuit data, and state any assumptions about their completeness for capturing computationally relevant features.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important areas where the presentation of quantitative evidence and methodological details can be strengthened to better support the central claims. We address each major comment below and have revised the manuscript accordingly.

  1. Referee: Abstract: the claim that ILP signatures achieve 'substantially better structural separation' than graph kernel and feature-vector baselines is presented without any quantitative metrics, error bars, dataset details, or statistical tests. This absence is load-bearing for the central claim of improved separation and principled transfer.

    Authors: We agree that the abstract would be strengthened by including key quantitative metrics, error bars, dataset details, and statistical tests. The experimental results in the main text (Section 5) contain these details, including separation performance across baselines with statistical comparisons. We have revised the abstract to incorporate a concise summary of the quantitative findings and dataset information. revision: yes

  2. Referee: Abstract / method description: no explicit checks or ablations are reported demonstrating that the supplied scale-invariant structural predicates remain informative and align with causal roles in the CFS when circuit depth or width changes by orders of magnitude. Without such evidence the portability claim risks reducing to recovery of graph topology already achievable by kernels.

    Authors: We agree that explicit ablations across large changes in depth and width would provide stronger support for the scale-invariance and alignment claims. The predicates are constructed to be independent of absolute model size (using relative and arity-invariant relations), and the manuscript demonstrates transfer on models of different scales. We will add a dedicated ablation subsection with experiments varying model size by orders of magnitude to show that predicate informativeness and alignment with CFS causal roles are preserved. revision: yes

  3. Referee: Abstract: the qualitative distinction between attention-mediated copying and MLP-mediated binding via CFS is asserted but lacks detail on how the signatures are computed, validated, or compared to alternative functional representations, leaving the coherence-layer claim under-supported.

    Authors: We agree that additional detail on computation and validation would better support the coherence-layer claim in the abstract. The full method section explains CFS construction via causal attribution evidence combined with token role profiles, with validation against known circuits. We have revised the abstract to briefly describe the signature computation process and how distinctions are validated. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained


The paper constructs CFS from causal attribution evidence and token role profiles, then learns τ_arch via ILP applied to explicitly supplied scale-invariant structural predicates. Claims of explicitness, θ-subsumption comparability, and cross-scale portability follow directly from these definitional choices and standard ILP properties rather than from any derived prediction or fitted parameter that reduces to the inputs by construction. The reported superior separation versus graph-kernel baselines is presented as an empirical outcome of applying the method, not a tautological result. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the abstract or described chain. The overall infrastructure is therefore an independent formal proposal whose central claims do not collapse to re-labeling of the supplied data or predicates.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The framework rests on the domain assumption that circuits possess identifiable causal and structural signatures that can be extracted and compared logically; two new representational entities are introduced without external falsification handles.

axioms (2)
  • domain assumption: Circuits admit characterisation by causal attribution evidence together with token role profiles.
    Invoked in the definition of the Causal Functional Signature.
  • domain assumption: Scale-invariant structural predicates exist that ILP can learn to produce transferable architectural signatures.
    Central premise enabling portability across model scales.
invented entities (2)
  • Causal Functional Signature (CFS) · no independent evidence
    purpose: Grounds component behaviour in causal evidence and token roles.
    New representational construct introduced to formalise circuit function.
  • architectural signature τ_arch · no independent evidence
    purpose: Provides a learned, comparable structural description via ILP.
    New representational construct for architectural comparison.

pith-pipeline@v0.9.0 · 5708 in / 1269 out tokens · 36389 ms · 2026-05-21T05:57:57.035981+00:00 · methodology


