
arxiv: 2605.22488 · v1 · pith:OKACXIORnew · submitted 2026-05-21 · 💻 cs.LG

Represented Is Not Computed: A Causal Test of Candidate Algorithmic Intermediates in a Transformer

Pith reviewed 2026-05-22 06:36 UTC · model grok-4.3

classification 💻 cs.LG
keywords transformer interpretability · linear probes · causal interventions · arithmetic computation · algorithmic intermediates · base conversion · circuit analysis

The pith

A Transformer represents algorithmic intermediates for base-digit extraction without transmitting them through its causal pathway.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines a Transformer trained to extract the coefficient of B^D in the base-B expansion of N, a task with an explicit closed-form solution using intermediates like floor division and modulus. Linear probes decode these intermediates from activations, which makes a staged internal computation plausible at first glance. Causal interventions on the localized route carrying D information to the output positions show behavior independent of N and B, and sparse circuit analysis finds mostly separate routes for each input that combine late. The core result is that the model represents the intermediates needed for the closed-form solution but does not route them through the examined causal path to generate answers.
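
As a concrete reference point for the task, the closed-form target can be sketched in a few lines. This is a minimal illustration, not the paper's code; the function names `base_digit` and `digits_little_endian` and the example values are hypothetical.

```python
def base_digit(n: int, b: int, d: int) -> int:
    """Coefficient of b**d in the base-b expansion of n: floor(n / b**d) mod b."""
    if n < 0 or b < 2 or d < 0:
        raise ValueError("expects n >= 0, b >= 2, d >= 0")
    quotient = n // b**d   # candidate intermediate: floor division
    return quotient % b    # candidate intermediate: modulus


def digits_little_endian(n: int, b: int) -> list[int]:
    """Base-b digits of n, least significant first (independent cross-check)."""
    digits = []
    while n > 0:
        n, r = divmod(n, b)
        digits.append(r)
    return digits or [0]


# 2026 = 5*343 + 6*49 + 2*7 + 3, so its base-7 digits are 5 6 2 3.
print(base_digit(2026, 7, 2))         # 6
print(digits_little_endian(2026, 7))  # [3, 2, 6, 5]
```

The cross-check by repeated `divmod` gives the same digit at index `d`, which is the sense in which the floor-division and modulus quantities are natural candidate intermediates.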

Core claim

The model represents the intermediates that make the closed-form solution plausible, but the identified localized causal route does not transmit them to the output stream. This case shows that probe-based conclusions can diverge sharply from causal observations, even when explicit algorithmic hypotheses are available.

What carries the argument

Localized causal route from the D-input stream to output positions, tested for independence from N and B, together with sparse circuit search that identifies separate input routes combining late.

If this is right

  • Probe evidence for staged computation can be misleading without causal verification.
  • Models may encode intermediates for a plausible algorithm without using them in the main information flow.
  • Sparse circuit analysis can reveal late combination of separate N, B, and D routes instead of sequential processing.
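
The gap between "decodable" and "used" in the points above can be illustrated with a deliberately artificial toy. The sketch below is not the paper's Transformer or its intervention protocol: `toy_hidden`, `toy_readout`, and the two-slot "activation" are hypothetical constructions in which the intermediate is stored in one slot while the readout only consults the other.

```python
import random


def toy_hidden(n, b, d):
    """Toy 'activations': slot 0 stores q = floor(n / b**d); slot 1 stores the answer."""
    q = n // b**d
    return [float(q), float(q % b)]


def toy_readout(hidden):
    """The readout only looks at slot 1, so slot 0 is represented but unused."""
    return int(hidden[1])


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + c (a one-feature linear probe)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


random.seed(0)
inputs = [(random.randrange(1000), random.randrange(2, 11), random.randrange(3))
          for _ in range(200)]
hiddens = [toy_hidden(*x) for x in inputs]

# Probe: the intermediate is linearly decodable from slot 0.
targets = [n // b**d for n, b, d in inputs]
a, c = fit_line([h[0] for h in hiddens], targets)
probe_error = max(abs(a * h[0] + c - t) for h, t in zip(hiddens, targets))

# Intervention: overwrite slot 0 with the value from a different input.
changed = 0
for h, donor in zip(hiddens, reversed(hiddens)):
    patched = [donor[0], h[1]]
    changed += toy_readout(patched) != toy_readout(h)

print(probe_error < 1e-6, changed)
```

In this toy the probe succeeds while the intervention changes no outputs, which is the pattern a probe alone cannot distinguish from genuine staged use.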

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • High task accuracy can arise from computational strategies that differ from the most obvious closed-form decomposition.
  • Interpretability methods should combine probes with targeted causal tests to avoid overinterpreting representations.
  • Similar gaps between representation and causal use may occur in other domains requiring integration of structured inputs.

Load-bearing premise

Any actual staged arithmetic computation using the probed intermediates would necessarily appear in the localized route from the D-input stream to the output positions that the causal tests examined.

What would settle it

An observation that interventions on the probed intermediates along the D-to-output route alter outputs in a manner matching the closed-form floor(N/B^D) mod B, or that circuit search recovers a staged rather than late-combining path.
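
The first criterion can be phrased as a counterfactual prediction. The sketch below assumes an interchange-style intervention in which the quotient computed on a source input is patched into a run on a base input; `staged_prediction`, `unchanged_prediction`, and the example triples are hypothetical and do not reproduce the paper's procedure.

```python
def quotient(n: int, b: int, d: int) -> int:
    """Staged intermediate floor(n / b**d)."""
    return n // b**d


def staged_prediction(base, source):
    """Output a staged model would give if the quotient from `source`
    were patched into a run on `base` (the modulus still uses base's B)."""
    _, b_base, _ = base
    return quotient(*source) % b_base


def unchanged_prediction(base, source):
    """Output if the patched quantity is not on the causal path."""
    n, b, d = base
    return quotient(n, b, d) % b


base = (2026, 7, 2)    # clean answer: 41 % 7 = 6
source = (1234, 5, 1)  # quotient: 1234 // 5 = 246
print(staged_prediction(base, source))     # 246 % 7 = 1
print(unchanged_prediction(base, source))  # 6
```

Only base and source pairs for which the two predictions differ are informative for telling the hypotheses apart.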

Figures

Figures reproduced from arXiv: 2605.22488 by Ishita Darade, Sushrut Thorat.

Figure 1: Transformers solve base-digit extraction almost perfectly on held …
Figure 2: Closed-form quantities are decodable, making a staged algorithmic …
Figure 3: The behaviorally effective information carried by the …
Figure 4: Sparse circuit search finds factorized routes that converge at the …
Original abstract

Structured prompts require integrating components according to task-relevant relations. How a network implements this integration is often hard to judge in language or vision, where those relations are rarely specified precisely enough to define a candidate internal algorithm. Arithmetic offers a cleaner setting. We study a Transformer trained on base-digit extraction: given $N$, $B$, and $D$, it must report the coefficient of $B^D$ in the base-$B$ expansion of $N$. The closed-form solution, $\lfloor N/B^D \rfloor \bmod B$, provides explicit candidate algorithmic intermediates. Across three seeds, the model reaches 99.83% exact-answer accuracy on held-out number-base intersections, establishing reliable task competence. Linear probes decode the intermediates, making staged arithmetic computation plausible. Causal tests then separate representation from use: within the localized route from the stream with $D$ as input to the output positions, behavior depends on early $D$-selective communication, independent of $N$ and $B$. Relatedly, a sparse circuit search finds mostly separate $N$, $B$, and $D$ routes that combine late rather than the staged route suggested by the probes. Thus, the model represents the intermediates that make the closed-form solution plausible, but the identified localized causal route does not transmit them to the output stream. This case shows that probe-based conclusions can diverge sharply from causal observations, even when explicit algorithmic hypotheses are available.
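
For reference, the closed form quoted in the abstract follows from the positional expansion of $N$. The short derivation below is a standard identity; the digit symbols $c_k$ are notation introduced here for illustration and do not come from the paper.

```latex
\begin{align*}
N &= \sum_{k \ge 0} c_k B^k, \qquad 0 \le c_k < B, \\
\left\lfloor \frac{N}{B^D} \right\rfloor &= \sum_{k \ge D} c_k B^{k-D}
  \quad \text{since } \sum_{k < D} c_k B^k \le B^D - 1 < B^D, \\
\left\lfloor \frac{N}{B^D} \right\rfloor \bmod B &= c_D
  \quad \text{since every term with } k > D \text{ is a multiple of } B.
\end{align*}
```

The two lines after the expansion correspond to the two candidate intermediates: the floor division discards the digits below position $D$, and the modulus discards those above it.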

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper trains a Transformer on the base-digit extraction task: given N, B, and D, output the coefficient of B^D in the base-B expansion of N, i.e. floor(N/B^D) mod B. The model reaches 99.83% accuracy across three seeds on held-out data. Linear probes decode the candidate algorithmic intermediates, but causal interventions within the localized route from the D-input stream to the output positions show that behavior depends only on early D-selective communication, independent of N and B. A sparse circuit search instead identifies mostly separate N, B, and D routes that combine late. The authors conclude that the model represents the intermediates but that the identified causal route does not transmit them, demonstrating a divergence between probe and causal evidence for algorithmic implementation.

Significance. If the central findings hold, the work is significant for mechanistic interpretability: it supplies a clean arithmetic setting with an explicit closed-form solution and candidate intermediates, then shows that high probe accuracy does not entail use of those intermediates in the examined causal pathway. The multi-seed high accuracy and the contrast between probe and causal results provide a concrete cautionary example. Strengths include the precise task definition that makes staged-computation hypotheses falsifiable and the empirical reproducibility implied by the reported accuracy across seeds.

major comments (2)
  1. [§4.2] §4.2 (Causal Interventions): The interventions are confined to the localized route from the D-input stream to output positions and report independence from N and B. However, the sparse circuit search (§4.3) identifies mostly separate N/B/D routes that combine late. If the probed intermediates (floor(N/B^D) mod B) are transmitted along the N or B paths or at late merge points, D-route-only interventions would not detect their use, leaving the claim that the model does not compute using the probed intermediates dependent on the untested assumption that any staged computation must appear inside the examined D-route.
  2. [Abstract] Abstract and §3: The abstract states 99.83% accuracy and clear divergence but supplies no quantitative details on intervention effect sizes, probe accuracies, or circuit-search hyperparameters. Without these numbers it is difficult to judge whether the causal separation is strong enough to rule out transmission of the intermediates at the reported precision.
minor comments (2)
  1. [Figure 4] Figure 4 and associated text: the circuit diagrams would benefit from explicit labeling of the late-merge points to clarify how the separate N/B/D routes interact with the D-selective early communication.
  2. [Notation] Notation: the closed-form solution is written inconsistently as floor(N/B^D) mod B versus floor(N/B^D) mod B in different sections; standardize the expression and reference it to the same equation number.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help sharpen the scope of our causal claims and strengthen the quantitative reporting. We address each major point below and indicate revisions where we will update the manuscript.

  1. Referee: [§4.2] §4.2 (Causal Interventions): The interventions are confined to the localized route from the D-input stream to output positions and report independence from N and B. However, the sparse circuit search (§4.3) identifies mostly separate N/B/D routes that combine late. If the probed intermediates (floor(N/B^D) mod B) are transmitted along the N or B paths or at late merge points, D-route-only interventions would not detect their use, leaving the claim that the model does not compute using the probed intermediates dependent on the untested assumption that any staged computation must appear inside the examined D-route.

    Authors: We agree that the interventions target the D-input-to-output route and that the circuit search reveals largely separate N, B, and D pathways merging late. Our rationale for focusing on the D-route is that any transmission of the D-dependent intermediate floor(N/B^D) mod B must ultimately incorporate D information to affect the output; the observed early D-selective communication that is independent of N and B values therefore indicates that this intermediate is not being passed along the examined causal path. The late-merge finding from the circuit search is consistent with modular rather than staged processing. We acknowledge that this does not exhaustively rule out computation at unexamined late merge points. In revision we will expand §4.2 to explicitly state the scope of the interventions, articulate why the D-route is the critical test for D-modulated intermediates, and add a limitations paragraph noting that future work could target late-merge activations. revision: partial

  2. Referee: [Abstract] Abstract and §3: The abstract states 99.83% accuracy and clear divergence but supplies no quantitative details on intervention effect sizes, probe accuracies, or circuit-search hyperparameters. Without these numbers it is difficult to judge whether the causal separation is strong enough to rule out transmission of the intermediates at the reported precision.

    Authors: We accept that the abstract and §3 would be clearer with explicit quantitative values. In the revised version we will report the probe accuracies achieved on the candidate intermediates, the magnitude of accuracy changes under the causal interventions, and the key hyperparameters (e.g., sparsity level, number of circuits) used in the sparse circuit search. These additions will be placed in both the abstract and the main text of §3 without changing the central claims. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical study with no derivations or fitted predictions


The paper reports an empirical investigation: training a Transformer on base-digit extraction, decoding candidate intermediates via linear probes, and running causal interventions on localized routes. No mathematical derivation chain exists that reduces a claimed result to inputs by construction, no parameters are fitted then renamed as predictions, and no self-citations serve as load-bearing uniqueness theorems. All findings (99.83% accuracy, probe decodability, route independence from N/B) are direct experimental observations on held-out data across seeds. The separation of representation from use is an empirical outcome, not a self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is experimental and relies on standard assumptions of transformer training and linear probe faithfulness; no explicit free parameters, axioms, or invented entities are introduced beyond the task definition.



