Represented Is Not Computed: A Causal Test of Candidate Algorithmic Intermediates in a Transformer
Pith reviewed 2026-05-22 06:36 UTC · model grok-4.3
The pith
A Transformer represents algorithmic intermediates for base-digit extraction without transmitting them through its causal pathway.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The model represents the intermediates that make the closed-form solution plausible, but the identified localized causal route does not transmit them to the output stream. This case shows that probe-based conclusions can diverge sharply from causal observations, even when explicit algorithmic hypotheses are available.
What carries the argument
The argument rests on two pieces of evidence: causal tests on the localized route from the D-input stream to the output positions, which check whether behavior along that route is independent of N and B, and a sparse circuit search that finds mostly separate N, B, and D routes combining late.
If this is right
- Probe evidence for staged computation can be misleading without causal verification.
- Models may encode intermediates for a plausible algorithm without using them in the main information flow.
- Sparse circuit analysis can reveal late combination of separate N, B, and D routes instead of sequential processing.
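The gap between decodability and causal use can be illustrated with a deliberately contrived toy. The sketch below is not the paper's model or method: `toy_hidden`, `toy_readout`, and `fit_linear_probe` are hypothetical names, and the "hidden state" is hand-built so that one feature linearly encodes floor(N/B^D) while the readout never reads it. Under those assumptions, a linear probe recovers the intermediate almost perfectly, yet swapping that feature between prompts leaves every output unchanged.

```python
import random


def closed_form(n, b, d):
    """Digit at position d of n in base b: floor(n / b**d) mod b."""
    return (n // b ** d) % b


def toy_hidden(n, b, d):
    """Hypothetical hidden state (an assumption, not the paper's model)."""
    return {
        "answer": float(closed_form(n, b, d)),   # the only feature the readout uses
        "quotient": 2.0 * (n // b ** d) + 1.0,   # linearly encodes floor(n / b**d), never read
    }


def toy_readout(hidden):
    """Output depends on the 'answer' feature alone."""
    return int(round(hidden["answer"]))


def fit_linear_probe(xs, ys):
    """One-dimensional least squares; returns (slope, intercept, r_squared)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


rng = random.Random(0)
prompts = [(rng.randrange(1000), rng.randrange(2, 10), rng.randrange(0, 4))
           for _ in range(200)]
hiddens = [toy_hidden(*p) for p in prompts]

# Probe: decode floor(n / b**d) from the 'quotient' feature.
targets = [float(n // b ** d) for n, b, d in prompts]
_, _, r_squared = fit_linear_probe([h["quotient"] for h in hiddens], targets)

# Intervention: replace one feature with the value from a different prompt.
donors = hiddens[1:] + hiddens[:1]
changed_by_quotient_swap = sum(
    toy_readout({**h, "quotient": donor["quotient"]}) != toy_readout(h)
    for h, donor in zip(hiddens, donors)
)
changed_by_answer_swap = sum(
    toy_readout({**h, "answer": donor["answer"]}) != toy_readout(h)
    for h, donor in zip(hiddens, donors)
)

print(f"probe R^2 for the quotient feature: {r_squared:.6f}")
print(f"outputs changed by swapping 'quotient': {changed_by_quotient_swap}")
print(f"outputs changed by swapping 'answer':   {changed_by_answer_swap}")
```

The quotient swap changes no outputs by construction, while the answer swap changes many. A probe result on its own cannot distinguish the two features; only the intervention does.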
Where Pith is reading between the lines
- High task accuracy can arise from computational strategies that differ from the most obvious closed-form decomposition.
- Interpretability methods should combine probes with targeted causal tests to avoid overinterpreting representations.
- Similar gaps between representation and causal use may occur in other domains requiring integration of structured inputs.
Load-bearing premise
Any actual staged arithmetic computation using the probed intermediates would necessarily appear in the localized route from the D-input stream to the output positions that the causal tests examined.
What would settle it
An observation that interventions on the probed intermediates along the D-to-output route alter outputs in a manner matching the closed-form floor(N/B^D) mod B, or that circuit search recovers a staged rather than late-combining path.
Original abstract
Structured prompts require integrating components according to task-relevant relations. How a network implements this integration is often hard to judge in language or vision, where those relations are rarely specified precisely enough to define a candidate internal algorithm. Arithmetic offers a cleaner setting. We study a Transformer trained on base-digit extraction: given $N$, $B$, and $D$, it must report the coefficient of $B^D$ in the base-$B$ expansion of $N$. The closed-form solution, $\lfloor N/B^D \rfloor \bmod B$, provides explicit candidate algorithmic intermediates. Across three seeds, the model reaches 99.83% exact-answer accuracy on held-out number-base intersections, establishing reliable task competence. Linear probes decode the intermediates, making staged arithmetic computation plausible. Causal tests then separate representation from use: within the localized route from the stream with $D$ as input to the output positions, behavior depends on early $D$-selective communication, independent of $N$ and $B$. Relatedly, a sparse circuit search finds mostly separate $N$, $B$, and $D$ routes that combine late rather than the staged route suggested by the probes. Thus, the model represents the intermediates that make the closed-form solution plausible, but the identified localized causal route does not transmit them to the output stream. This case shows that probe-based conclusions can diverge sharply from causal observations, even when explicit algorithmic hypotheses are available.
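For concreteness, the closed form and the staged intermediates it suggests can be written out as a short sketch. The function and key names (`digit_staged`, `power`, `quotient`, `digit`) are illustrative choices, not taken from the paper, and the repeated-division reference is included only as an independent check on the formula.

```python
def digit_staged(n, b, d):
    """Compute floor(n / b**d) mod b, exposing each candidate intermediate."""
    if n < 0 or b < 2 or d < 0:
        raise ValueError("expected n >= 0, b >= 2, d >= 0")
    power = b ** d          # B^D
    quotient = n // power   # floor(N / B^D)
    digit = quotient % b    # coefficient of B^D in the base-B expansion of N
    return {"power": power, "quotient": quotient, "digit": digit}


def digits_by_division(n, b):
    """Base-b digits of n, least significant first, by repeated division."""
    digits = []
    while n > 0:
        n, remainder = divmod(n, b)
        digits.append(remainder)
    return digits or [0]


def digit_reference(n, b, d):
    """Digit at position d read off the full expansion (0 beyond the top digit)."""
    digits = digits_by_division(n, b)
    return digits[d] if d < len(digits) else 0


print(digit_staged(1234, 10, 2))  # {'power': 100, 'quotient': 12, 'digit': 2}
```

For N = 1234, B = 10, D = 2, the stages are B^D = 100, floor(1234/100) = 12, and 12 mod 10 = 2, which is the hundreds digit.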
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper trains a Transformer on base-digit extraction: given N, B, and D, output the coefficient of B^D in the base-B expansion of N, i.e. floor(N/B^D) mod B. The model reaches 99.83% exact-answer accuracy across three seeds on held-out number-base intersections. Linear probes decode the candidate algorithmic intermediates, but causal interventions within the localized route from the D-input stream to the output positions show dependence only on early D-selective communication that is independent of N and B. A sparse circuit search identifies mostly separate N, B, and D routes that combine late, rather than the staged route the probes suggest. The authors conclude that the model represents the intermediates but that the identified causal route does not transmit them, demonstrating a divergence between probe evidence and causal evidence for algorithmic implementation.
Significance. If the central findings hold, the work is significant for mechanistic interpretability: it supplies a clean arithmetic setting with an explicit closed-form solution and candidate intermediates, then shows that high probe accuracy does not entail use of those intermediates in the examined causal pathway. The multi-seed high accuracy and the contrast between probe and causal results provide a concrete cautionary example. Strengths include the precise task definition that makes staged-computation hypotheses falsifiable and the empirical reproducibility implied by the reported accuracy across seeds.
major comments (2)
- §4.2 (Causal Interventions): The interventions are confined to the localized route from the D-input stream to the output positions, and they report independence from N and B. However, the sparse circuit search (§4.3) identifies mostly separate N, B, and D routes that combine late. If the probed intermediates (floor(N/B^D) mod B) are transmitted along the N or B paths, or at late merge points, D-route-only interventions would not detect their use. The claim that the model does not compute with the probed intermediates therefore depends on the untested assumption that any staged computation must appear inside the examined D-route.
- Abstract and §3: The abstract states 99.83% accuracy and a clear divergence but supplies no quantitative details on intervention effect sizes, probe accuracies, or circuit-search hyperparameters. Without these numbers, it is difficult to judge whether the causal separation is strong enough to rule out transmission of the intermediates at the reported precision.
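The first major comment describes a logical possibility that a small toy can make concrete. The sketch below is an assumption-laden illustration, not a description of the paper's model: `route_n`, `route_b`, `route_d`, `late_merge`, and `run` are hypothetical names, and the three routes are hand-built to carry only their own input, with the staged arithmetic placed entirely at the merge. In such a system, patching the D-route behaves exactly as a D-only, N- and B-independent channel would, even though the staged intermediates are computed and used downstream.

```python
def route_n(n):
    return {"N": n}


def route_b(b):
    return {"B": b}


def route_d(d):
    return {"D": d}  # carries D only, nothing about N or B


def late_merge(state_n, state_b, state_d):
    """Staged arithmetic happens here, after the separate routes combine."""
    power = state_b["B"] ** state_d["D"]   # B^D
    quotient = state_n["N"] // power       # floor(N / B^D)
    return quotient % state_b["B"]         # digit


def run(n, b, d, patch_d_from=None):
    """Run the toy; optionally replace the D-route state with a donor prompt's."""
    state_n, state_b, state_d = route_n(n), route_b(b), route_d(d)
    if patch_d_from is not None:
        _, _, donor_d = patch_d_from
        state_d = route_d(donor_d)
    return late_merge(state_n, state_b, state_d)


base = (1234, 10, 2)
donor_a = (987, 7, 0)   # D = 0, with one choice of N and B
donor_b = (5, 3, 0)     # D = 0, with a different N and B

print(run(*base))                        # 2
print(run(*base, patch_d_from=donor_a))  # 4
print(run(*base, patch_d_from=donor_b))  # 4
```

Both donors share D = 0, so patching from either yields the units digit of 1234 regardless of the donor's N and B. An intervention confined to the D-route therefore cannot, in this toy, tell a staged late merge apart from an unstaged one; an intervention at the merge would be needed.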
minor comments (2)
- Figure 3 and associated text: the circuit diagrams would benefit from explicit labeling of the late-merge points to clarify how the separate N/B/D routes interact with the D-selective early communication.
- Notation: the closed-form solution is written inconsistently as floor(N/B^D) mod B versus floor(N/B^D) mod B in different sections; standardize the expression and reference it to the same equation number.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help sharpen the scope of our causal claims and strengthen the quantitative reporting. We address each major point below and indicate revisions where we will update the manuscript.
- Referee: §4.2 (Causal Interventions): The interventions are confined to the localized route from the D-input stream to the output positions, and they report independence from N and B. However, the sparse circuit search (§4.3) identifies mostly separate N, B, and D routes that combine late. If the probed intermediates (floor(N/B^D) mod B) are transmitted along the N or B paths, or at late merge points, D-route-only interventions would not detect their use. The claim that the model does not compute with the probed intermediates therefore depends on the untested assumption that any staged computation must appear inside the examined D-route.
  Authors: We agree that the interventions target the D-input-to-output route and that the circuit search reveals largely separate N, B, and D pathways merging late. Our rationale for focusing on the D-route is that any transmission of the D-dependent intermediate floor(N/B^D) mod B must ultimately incorporate D information to affect the output. The observed early D-selective communication, which is independent of the N and B values, therefore indicates that this intermediate is not being passed along the examined causal path. The late-merge finding from the circuit search is consistent with modular rather than staged processing. We acknowledge that this does not exhaustively rule out computation at unexamined late merge points. In revision we will expand §4.2 to state the scope of the interventions explicitly, articulate why the D-route is the critical test for D-modulated intermediates, and add a limitations paragraph noting that future work could target late-merge activations.
  Revision: partial
- Referee: Abstract and §3: The abstract states 99.83% accuracy and a clear divergence but supplies no quantitative details on intervention effect sizes, probe accuracies, or circuit-search hyperparameters. Without these numbers, it is difficult to judge whether the causal separation is strong enough to rule out transmission of the intermediates at the reported precision.
  Authors: We accept that the abstract and §3 would be clearer with explicit quantitative values. In the revised version we will report the probe accuracies achieved on the candidate intermediates, the magnitude of accuracy changes under the causal interventions, and the key hyperparameters (e.g., sparsity level, number of circuits) used in the sparse circuit search. These additions will be placed in both the abstract and the main text of §3 without changing the central claims.
  Revision: yes
Circularity Check
No circularity: empirical study with no derivations or fitted predictions
The paper reports an empirical investigation: training a Transformer on base-digit extraction, decoding candidate intermediates via linear probes, and running causal interventions on localized routes. No mathematical derivation chain exists that reduces a claimed result to inputs by construction, no parameters are fitted then renamed as predictions, and no self-citations serve as load-bearing uniqueness theorems. All findings (99.83% accuracy, probe decodability, route independence from N/B) are direct experimental observations on held-out data across seeds. The separation of representation from use is an empirical outcome, not a self-referential reduction.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean (LogicNat recovery and embed_strictMono)
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Paper passage: Linear probes decode the intermediates... Causal tests then separate representation from use: within the localized route from the stream with D as input to the output positions, behavior depends on early D-selective communication, independent of N and B.
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel)
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Paper passage: Sparse circuit search finds mostly separate N, B, and D routes that combine late rather than the staged route suggested by the probes.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.