Represented Is Not Computed: A Causal Test of Candidate Algorithmic Intermediates in a Transformer

Ishita Darade; Sushrut Thorat

arxiv: 2605.22488 · v1 · pith:OKACXIORnew · submitted 2026-05-21 · 💻 cs.LG

Pith reviewed 2026-05-22 06:36 UTC · model grok-4.3

classification 💻 cs.LG

keywords transformer interpretability · linear probes · causal interventions · arithmetic computation · algorithmic intermediates · base conversion · circuit analysis

The pith

A Transformer represents algorithmic intermediates for base-digit extraction without transmitting them through the identified localized causal route.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines a Transformer trained to extract the coefficient of B^D in the base-B expansion of N, a task with an explicit closed-form solution, floor(N/B^D) mod B, built from intermediates such as floor division and modulus. Linear probes decode these intermediates from activations, which makes a staged internal computation plausible at first glance. Causal interventions on the localized route carrying D information to the output positions show that behavior depends on early D-selective communication that is independent of N and B, and sparse circuit analysis finds mostly separate routes for each input that combine late. The core result is that the model represents the intermediates needed for the closed-form solution but does not route them through the examined causal path to generate answers.
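As a short worked derivation of why the closed form returns the requested coefficient: the digit symbols $a_i$ below are notation introduced here for illustration and do not come from the paper.

```latex
\begin{aligned}
N &= \sum_{i \ge 0} a_i B^i, \qquad 0 \le a_i < B, \\
\left\lfloor \frac{N}{B^D} \right\rfloor
  &= \sum_{i \ge D} a_i B^{\,i-D}
  \quad \text{since } \sum_{i < D} a_i B^i \le B^D - 1 < B^D, \\
\left\lfloor \frac{N}{B^D} \right\rfloor \bmod B
  &= a_D
  \quad \text{since every term with } i > D \text{ is a multiple of } B.
\end{aligned}
```

Under this reading, the candidate intermediates are $B^D$ and $\lfloor N/B^D \rfloor$, and the final modulus picks out $a_D$.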

Core claim

The model represents the intermediates that make the closed-form solution plausible, but the identified localized causal route does not transmit them to the output stream. This case shows that probe-based conclusions can diverge sharply from causal observations, even when explicit algorithmic hypotheses are available.

What carries the argument

Localized causal route from the D-input stream to output positions, tested for independence from N and B, together with sparse circuit search that identifies separate input routes combining late.

If this is right

Probe evidence for staged computation can be misleading without causal verification.
Models may encode intermediates for a plausible algorithm without using them in the main information flow.
Sparse circuit analysis can reveal late combination of separate N, B, and D routes instead of sequential processing.
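The first two consequences above can be illustrated with a toy example. The sketch below is not the paper's model or method: it is a hypothetical three-unit hidden layer, built here only to show that a quantity can be perfectly decodable by a linear probe while the output ignores it. All names (`hidden`, `W_OUT`, `readout`) are invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def hidden(a, b):
    # Toy hidden layer (assumption): two pass-through units plus one unit
    # that explicitly encodes the product a*b.
    return np.stack([a, b, a * b], axis=1)


# Toy readout (assumption): the output weight on the a*b unit is zero,
# so the product is represented but never used.
W_OUT = np.array([1.0, -1.0, 0.0])


def readout(h):
    return h @ W_OUT


a = rng.normal(size=500)
b = rng.normal(size=500)
h = hidden(a, b)
target = a * b

# Linear probe: least-squares fit from hidden activations (plus bias) to a*b.
X = np.column_stack([h, np.ones(len(h))])
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
pred = X @ coef
probe_r2 = 1.0 - np.sum((target - pred) ** 2) / np.sum((target - target.mean()) ** 2)

# Causal test: overwrite the a*b unit with values taken from other inputs
# and measure how much the output moves.
h_patched = h.copy()
h_patched[:, 2] = h[rng.permutation(len(h)), 2]
max_output_change = np.max(np.abs(readout(h_patched) - readout(h)))

print(f"probe R^2 = {probe_r2:.3f}")
print(f"max output change after patching = {max_output_change:.3f}")
```

In this toy, the probe fits the product almost exactly, yet patching the unit that carries it leaves the output unchanged. That is the pattern a probe-only analysis would misread as evidence of use.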

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

High task accuracy can arise from computational strategies that differ from the most obvious closed-form decomposition.
Interpretability methods should combine probes with targeted causal tests to avoid overinterpreting representations.
Similar gaps between representation and causal use may occur in other domains requiring integration of structured inputs.

Load-bearing premise

Any actual staged arithmetic computation using the probed intermediates would necessarily appear in the localized route from the D-input stream to the output positions that the causal tests examined.

What would settle it

An observation that interventions on the probed intermediates along the D-to-output route alter outputs in a manner matching the closed-form floor(N/B^D) mod B, or that circuit search recovers a staged rather than late-combining path.
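One way to make this criterion concrete is to write down what each hypothesis predicts when activity from a source prompt is patched into a base prompt. The sketch below is an assumption-laden illustration, not the authors' procedure: `staged_prediction` assumes the patched route carries floor(N/B^D) from the source and that the final modulus still uses the base prompt's B, while `d_only_prediction` assumes the route carries only D. Both function names are hypothetical.

```python
def closed_form_digit(n: int, b: int, d: int) -> int:
    """Coefficient of b**d in the base-b expansion of n."""
    return (n // b ** d) % b


def staged_prediction(base, source):
    """Expected output if the route transmitted floor(N/B^D) from `source`.

    Assumption: the final `mod B` step still uses the base prompt's B.
    """
    n_s, b_s, d_s = source
    _, b, _ = base
    return (n_s // b_s ** d_s) % b


def d_only_prediction(base, source):
    """Expected output if the route transmitted only D from `source`."""
    n, b, _ = base
    _, _, d_s = source
    return (n // b ** d_s) % b


# Prompts are (N, B, D) triples chosen so the three predictions differ.
base = (77, 3, 2)
source = (45, 7, 1)

print("unpatched answer:", closed_form_digit(*base))         # 2
print("staged hypothesis:", staged_prediction(base, source))  # 0
print("D-only hypothesis:", d_only_prediction(base, source))  # 1
```

Prompt pairs on which the predictions disagree, as here, are the informative ones. An intervention result matching the staged prediction would count against the paper's reading, and one matching the D-only prediction would be consistent with it.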

Figures

Figures reproduced from arXiv: 2605.22488 by Ishita Darade, Sushrut Thorat.

**Figure 2.** Closed-form quantities are decodable, making a staged algorithmic … (caption truncated in source)

**Figure 3.** The behaviorally effective information carried by the … (caption truncated in source)

**Figure 4.** Sparse circuit search finds factorized routes that converge at the … (caption truncated in source)

Original abstract

Structured prompts require integrating components according to task-relevant relations. How a network implements this integration is often hard to judge in language or vision, where those relations are rarely specified precisely enough to define a candidate internal algorithm. Arithmetic offers a cleaner setting. We study a Transformer trained on base-digit extraction: given $N$, $B$, and $D$, it must report the coefficient of $B^D$ in the base-$B$ expansion of $N$. The closed-form solution, $\lfloor N/B^D \rfloor \bmod B$, provides explicit candidate algorithmic intermediates. Across three seeds, the model reaches 99.83% exact-answer accuracy on held-out number-base intersections, establishing reliable task competence. Linear probes decode the intermediates, making staged arithmetic computation plausible. Causal tests then separate representation from use: within the localized route from the stream with $D$ as input to the output positions, behavior depends on early $D$-selective communication, independent of $N$ and $B$. Relatedly, a sparse circuit search finds mostly separate $N$, $B$, and $D$ routes that combine late rather than the staged route suggested by the probes. Thus, the model represents the intermediates that make the closed-form solution plausible, but the identified localized causal route does not transmit them to the output stream. This case shows that probe-based conclusions can diverge sharply from causal observations, even when explicit algorithmic hypotheses are available.
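For readers who want the task in executable form, here is a minimal sketch of the closed-form solution from the abstract, returning the candidate intermediates alongside the answer. The function names and the dictionary keys are choices made for this illustration. The paper's own data pipeline and input encoding are not described in this summary.

```python
def digit_closed_form(n: int, b: int, d: int) -> dict:
    """Return the closed-form intermediates and the final digit."""
    power = b ** d         # B^D
    quotient = n // power  # floor(N / B^D)
    digit = quotient % b   # floor(N / B^D) mod B
    return {"power": power, "quotient": quotient, "digit": digit}


def digit_by_expansion(n: int, b: int, d: int) -> int:
    """Reference implementation: build the base-b expansion, read position d."""
    digits = []  # least-significant digit first
    while n > 0:
        digits.append(n % b)
        n //= b
    return digits[d] if d < len(digits) else 0


print(digit_closed_form(1234, 10, 2))
# {'power': 100, 'quotient': 12, 'digit': 2}

# The two routes agree on a small grid of inputs.
for n in range(200):
    for b in range(2, 8):
        for d in range(5):
            assert digit_closed_form(n, b, d)["digit"] == digit_by_expansion(n, b, d)
```

The `power` and `quotient` values are the kind of quantities a linear probe could be trained to decode. Whether the network actually passes them forward is the separate, causal question the paper tests.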

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Probes recover the arithmetic intermediates but causal tests on the localized D-route show independent processing instead.

The main thing to know is that this paper gives a controlled case where linear probes decode the intermediates from the closed-form solution for base-digit extraction, yet the causal interventions and circuit search point to separate N, B, and D routes that combine late rather than staged use of those intermediates along the D-to-output path. The model hits 99.83% accuracy across seeds on held-out data, so task competence is not in doubt. Probes make the staged computation look plausible, but the interventions find that behavior in the D stream depends only on early D-selective communication and stays independent of N and B. The sparse circuit work backs this up by recovering mostly separate streams instead of the probe-suggested route. This is a direct empirical comparison in a setting with an explicit algorithm, which is the useful part. It shows why we cannot treat probe results as evidence of computation without causal checks.

The soft spot is the scope of the causal tests. They focus on the localized D-route, but if the probed values are actually transmitted through the late merge of the separate N/B/D paths, then interventions confined to the D stream would miss that transmission. The abstract also skips quantitative details on intervention effect sizes and circuit hyperparameters, which makes it harder to gauge how strongly the results rule out transmission.

This is for mechanistic interpretability researchers who want a concrete example of probes and causal methods diverging. It deserves peer review because the setup is clean and the cautionary result is worth having in the literature, even if the authors should check the late-combination alternative to tighten the claim.

Referee Report

2 major / 2 minor

Summary. The paper trains a Transformer on the base-digit extraction task (given N, B, and D, output the coefficient of B^D in the base-B expansion of N, i.e. floor(N/B^D) mod B). The model reaches 99.83% accuracy across three seeds on held-out data. Linear probes decode the candidate algorithmic intermediates, but causal interventions within the localized route from the D-input stream to the output positions show dependence only on early D-selective communication, independent of N and B. A sparse circuit search instead identifies mostly separate N, B, and D routes that combine late. The authors conclude that the model represents the intermediates but the identified causal route does not transmit them, demonstrating divergence between probe and causal evidence for algorithmic implementation.

Significance. If the central findings hold, the work is significant for mechanistic interpretability: it supplies a clean arithmetic setting with an explicit closed-form solution and candidate intermediates, then shows that high probe accuracy does not entail use of those intermediates in the examined causal pathway. The multi-seed high accuracy and the contrast between probe and causal results provide a concrete cautionary example. Strengths include the precise task definition that makes staged-computation hypotheses falsifiable and the empirical reproducibility implied by the reported accuracy across seeds.

major comments (2)

[§4.2] §4.2 (Causal Interventions): The interventions are confined to the localized route from the D-input stream to output positions and report independence from N and B. However, the sparse circuit search (§4.3) identifies mostly separate N/B/D routes that combine late. If the probed intermediates (floor(N/B^D) mod B) are transmitted along the N or B paths or at late merge points, D-route-only interventions would not detect their use, leaving the claim that the model does not compute using the probed intermediates dependent on the untested assumption that any staged computation must appear inside the examined D-route.
[Abstract] Abstract and §3: The abstract states 99.83% accuracy and clear divergence but supplies no quantitative details on intervention effect sizes, probe accuracies, or circuit-search hyperparameters. Without these numbers it is difficult to judge whether the causal separation is strong enough to rule out transmission of the intermediates at the reported precision.

minor comments (2)

[Figure 3] Figure 3 and associated text: the circuit diagrams would benefit from explicit labeling of the late-merge points to clarify how the separate N/B/D routes interact with the D-selective early communication.
[Notation] Notation: the closed-form solution is written inconsistently as floor(N/B^D) mod B versus floor(N/B^D) mod B in different sections; standardize the expression and reference it to the same equation number.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help sharpen the scope of our causal claims and strengthen the quantitative reporting. We address each major point below and indicate revisions where we will update the manuscript.

Referee: [§4.2] §4.2 (Causal Interventions): The interventions are confined to the localized route from the D-input stream to output positions and report independence from N and B. However, the sparse circuit search (§4.3) identifies mostly separate N/B/D routes that combine late. If the probed intermediates (floor(N/B^D) mod B) are transmitted along the N or B paths or at late merge points, D-route-only interventions would not detect their use, leaving the claim that the model does not compute using the probed intermediates dependent on the untested assumption that any staged computation must appear inside the examined D-route.

Authors: We agree that the interventions target the D-input-to-output route and that the circuit search reveals largely separate N, B, and D pathways merging late. Our rationale for focusing on the D-route is that any transmission of the D-dependent intermediate floor(N/B^D) mod B must ultimately incorporate D information to affect the output; the observed early D-selective communication that is independent of N and B values therefore indicates that this intermediate is not being passed along the examined causal path. The late-merge finding from the circuit search is consistent with modular rather than staged processing. We acknowledge that this does not exhaustively rule out computation at unexamined late merge points. In revision we will expand §4.2 to explicitly state the scope of the interventions, articulate why the D-route is the critical test for D-modulated intermediates, and add a limitations paragraph noting that future work could target late-merge activations.

revision: partial

Referee: [Abstract] Abstract and §3: The abstract states 99.83% accuracy and clear divergence but supplies no quantitative details on intervention effect sizes, probe accuracies, or circuit-search hyperparameters. Without these numbers it is difficult to judge whether the causal separation is strong enough to rule out transmission of the intermediates at the reported precision.

Authors: We accept that the abstract and §3 would be clearer with explicit quantitative values. In the revised version we will report the probe accuracies achieved on the candidate intermediates, the magnitude of accuracy changes under the causal interventions, and the key hyperparameters (e.g., sparsity level, number of circuits) used in the sparse circuit search. These additions will be placed in both the abstract and the main text of §3 without changing the central claims.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical study with no derivations or fitted predictions

The paper reports an empirical investigation: training a Transformer on base-digit extraction, decoding candidate intermediates via linear probes, and running causal interventions on localized routes. No mathematical derivation chain exists that reduces a claimed result to inputs by construction, no parameters are fitted then renamed as predictions, and no self-citations serve as load-bearing uniqueness theorems. All findings (99.83% accuracy, probe decodability, route independence from N/B) are direct experimental observations on held-out data across seeds. The separation of representation from use is an empirical outcome, not a self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is experimental and relies on standard assumptions of transformer training and linear probe faithfulness; no explicit free parameters, axioms, or invented entities are introduced beyond the task definition.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and embed_strictMono
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Linear probes decode the intermediates... Causal tests then separate representation from use: within the localized route from the stream with D as input to the output positions, behavior depends on early D-selective communication, independent of N and B.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Sparse circuit search finds mostly separate N, B, and D routes that combine late rather than the staged route suggested by the probes.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 4 internal anchors

[1] Alain, G., Bengio, Y.: Understanding intermediate layers using linear classifier probes. In: International Conference on Learning Representations Workshop (2017)
[2] Belinkov, Y.: Probing classifiers: promises, shortcomings, and advances. Computational Linguistics 48(1), 207–219 (2022). https://doi.org/10.1162/coli_a_00422
[3] Conmy, A., Mavor-Parker, A., Lynch, A., Heimersheim, S., Garriga-Alonso, A.: Towards automated circuit discovery for mechanistic interpretability. In: Advances in Neural Information Processing Systems (2023)
[4] Elhage, N., Nanda, N., Olsson, C., et al.: A mathematical framework for Transformer circuits. Transformer Circuits Thread (2021). https://transformer-circuits.pub/2021/framework/index.html
[5] Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., et al.: Toy models of superposition. arXiv:2209.10652 (2022)
[6] Elazar, Y., Ravfogel, S., Jacovi, A., et al.: Amnesic probing: behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics 9, 160–175 (2021). https://doi.org/10.1162/tacl_a_00359
[7] Geiger, A., Lu, H., Icard, T., Potts, C.: Causal abstractions of neural networks. In: Advances in Neural Information Processing Systems (2021)
[8] Geiger, A., Wu, Z., Lu, H., et al.: Inducing causal structure for interpretable neural networks. In: International Conference on Machine Learning (2022)
[9] Geiger, A., Wu, Z., Potts, C., et al.: Finding alignments between interpretable causal variables and distributed neural representations. In: Proceedings of the Third Conference on Causal Learning and Reasoning, pp. 160–187 (2024)
[10] Goldowsky-Dill, N., MacLeod, C., Sato, L., Arora, A.: Localizing model behavior with path patching. arXiv:2304.05969 (2023)
[11] Hewitt, J., Liang, P.: Designing and interpreting probes with control tasks. In: EMNLP-IJCNLP, pp. 2733–2743 (2019). https://doi.org/10.18653/v1/D19-1275
[12] Jazayeri, M., Afraz, A.: Navigating the neural space in search of the neural code. Neuron 93(5), 1003–1014 (2017). https://doi.org/10.1016/j.neuron.2017.02.019
[13] Kriegeskorte, N., Mur, M., Bandettini, P.: Representational similarity analysis – connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience 2, 4 (2008). https://doi.org/10.3389/neuro.06.004.2008
[14] Marr, D., Poggio, T.: From understanding computation to understanding neural circuitry. Artificial Intelligence Laboratory Memo 357, Massachusetts Institute of Technology (1976)
[15] Marr, D.: Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman, San Francisco (1982)
[16] Meng, K., Bau, D., Andonian, A., Belinkov, Y.: Locating and editing factual associations in GPT. In: Advances in Neural Information Processing Systems (2022)
[17] Nanda, N., Chan, L., Lieberum, T., Smith, J., Steinhardt, J.: Progress measures for grokking via mechanistic interpretability. In: International Conference on Learning Representations (2023)
[18] Nikankin, Y., Reusch, A., Mueller, A., Belinkov, Y.: Arithmetic without algorithms: language models solve math with a bag of heuristics. arXiv:2410.21272 (2024)
[19] Norman, K.A., Polyn, S.M., Detre, G.J., et al.: Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends in Cognitive Sciences 10(9), 424–430 (2006). https://doi.org/10.1016/j.tics.2006.07.005
[20] Olsson, C., Elhage, N., Nanda, N., Joseph, N., et al.: In-context learning and induction heads. Transformer Circuits Thread (2022). https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
[21] Quirke, P., Neo, C., Barez, F.: Arithmetic in Transformers explained. arXiv:2402.02619 (2024)
[22] Stolfo, A., Belinkov, Y., Sachan, M.: A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 7035–7052 (2023). https://doi.org/10.18653/v1/2023.emnlp-main.435
[23] Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems (2017)
[24] Ventura, L.A., Bosch, V., Kietzmann, T.C., Thorat, S.: A minimal task reveals emergent path integration and object-location binding in a predictive sequence model. arXiv:2602.03490 (2026)
[25] Vig, J., Gehrmann, S., Belinkov, Y., et al.: Investigating gender bias in language models using causal mediation analysis. In: Advances in Neural Information Processing Systems, vol. 33, pp. 12388–12401 (2020)
[26] Wang, K., Variengien, A., Conmy, A., et al.: Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In: International Conference on Learning Representations (2023)
[27] Weichwald, S., Meyer, T., Özdenizci, O., et al.: Causal interpretation rules for encoding and decoding models in neuroimaging. NeuroImage 110, 48–59 (2015). https://doi.org/10.1016/j.neuroimage.2015.01.036
[28] Zhang, F., Nanda, N.: Towards best practices of activation patching in language models: metrics and methods. In: International Conference on Learning Representations (2024)
