Causal Evidence of Stack Representations in Modeling Counter Languages Using Transformers

Nishit Singh

arxiv: 2606.03398 · v1 · pith:7XJSKORHnew · submitted 2026-06-02 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-28 10:20 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords transformers · stack representations · counter languages · causal analysis · linear probes · formal languages · ablations · representational interventions

The pith

Ablating the principal stack-depth direction from transformer hidden states collapses accuracy on counter languages to near zero.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether stack representations that transformers learn while modeling counter languages are merely present or are actually required for the model's success. Linear probes are trained to recover stack depth from hidden states at each step, and the main direction identified by the probe is then removed from the model's activations. After this targeted removal, sequential prediction accuracy falls to near zero on the tasks, indicating that the stack information is not incidental but is actively used by the model to perform the computation.

Core claim

Transformers trained on next-token prediction over counter languages learn representations consistent with an underlying stack structure; the principal direction recovered by a linear probe on stack depth is causally necessary for performance because ablating it causes sequential accuracy to collapse to near 0%.
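To make "stack depth" concrete, the sketch below computes the per-token depth labels a probe would be trained to predict. It assumes a Dyck-1 style language over `(` and `)`; the languages actually used in the paper are not named in this text, and `stack_depths` is a hypothetical helper, not the author's code.

```python
def stack_depths(tokens, open_tok="(", close_tok=")"):
    """Return the counter value (stack depth) after reading each token.

    Assumption: a single-bracket counter language (Dyck-1 style).
    """
    depth = 0
    depths = []
    for tok in tokens:
        if tok == open_tok:
            depth += 1
        elif tok == close_tok:
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced string: close before open")
        else:
            raise ValueError(f"unexpected token: {tok!r}")
        depths.append(depth)
    return depths


print(stack_depths("(()())"))  # [1, 2, 1, 2, 1, 0]
```

Because the depth is a single integer per position, one counter is enough to track it, which is what makes a linear probe on depth a natural test.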

What carries the argument

Principal direction extracted from a linear probe that predicts stack depth from the model's hidden states at each token.
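A minimal sketch of the probe-then-ablate idea, under stated assumptions: the hidden states here are synthetic (depth written linearly along one axis plus noise), the probe is an ordinary least-squares regressor, and the "principal direction" is taken to be the normalized probe weight vector. The paper's actual probe type and extraction step may differ; `fit_linear_probe` and `ablate_direction` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden states (assumption): depth is written
# linearly along one hidden axis, plus small isotropic noise.
n_tokens, d_model = 500, 16
depth = rng.integers(0, 6, size=n_tokens).astype(float)
true_axis = rng.normal(size=d_model)
true_axis /= np.linalg.norm(true_axis)
hidden = np.outer(depth, true_axis) + rng.normal(scale=0.1, size=(n_tokens, d_model))


def fit_linear_probe(H, y):
    """Least-squares probe y ~ H @ w + b; returns (w, b)."""
    X = np.hstack([H, np.ones((H.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[:-1], coef[-1]


def ablate_direction(H, direction):
    """Remove the component of each row of H along `direction`."""
    u = direction / np.linalg.norm(direction)
    return H - np.outer(H @ u, u)


w, b = fit_linear_probe(hidden, depth)
ablated = ablate_direction(hidden, w)

# The original probe can no longer read depth from the ablated states.
mse_before = np.mean((hidden @ w + b - depth) ** 2)
mse_after = np.mean((ablated @ w + b - depth) ** 2)
print(f"probe MSE before ablation: {mse_before:.4f}")
print(f"probe MSE after ablation:  {mse_after:.4f}")
```

In the paper's setting the ablated states would be fed back through the remaining layers and the model's own accuracy measured; the probe error printed here only confirms that the projection removed what the probe was reading.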

If this is right

The model actively uses the recovered stack representation to carry out its sequential computations on counter languages.
Representational similarity alone does not establish that a feature is used; causal ablation is required to demonstrate necessity.
Targeted removal of the direction disrupts performance specifically on the stack-dependent aspects of the task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same probe-and-ablate method could be applied to other formal languages to test which internal features are required rather than merely correlated.
If similar causal directions exist for hierarchical structure in natural language modeling, interventions on those directions might selectively impair syntactic processing.
This result suggests that linear directions in activation space can serve as levers for testing mechanistic hypotheses in sequence models.

Load-bearing premise

The principal direction recovered from the linear probe on stack depth is the specific representational component the model relies on for its computations, and ablating it selectively removes stack information without broadly disrupting other necessary internal features.

What would settle it

Ablating the identified principal direction from the hidden states and still observing high sequential accuracy on the counter language tasks would show the direction is not causally necessary.

Figures

Figures reproduced from arXiv: 2606.03398 by Nishit Singh.

**Figure 2.** Sequence accuracy as a function of ablation strength.

**Figure 3.** Position accuracy as a function of ablation strength.

**Figure 4.** Sequence accuracy as a function of ablation strength.

**Figure 5.** Position accuracy as a function of ablation strength.

**Figure 6.** Sequence accuracy as a function of ablation strength.

**Figure 7.** Position accuracy as a function of ablation strength.

**Figure 8.** Sequence accuracy as a function of ablation strength.

Original abstract

Formal languages have proven to be effective conduits to understand the inner mechanisms of transformers. Past work has shown that transformers trained on next token prediction over counter languages learn representations consistent with an underlying stack structure. Beyond representational analysis, this paper investigates the causal role of these representations. Linear probes are trained to predict the stack depth at each token from the model's hidden states, and a principal representation direction is extracted from the probe. Ablation of this direction from the model causes sequential accuracy to collapse to near 0%, providing strong empirical evidence that the stack representation is not just learned, but is causally necessary for model performance.
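One common way to write a direction ablation with a tunable strength is given below. This is an assumed parameterization, not one stated in the abstract: $w$ is the probe weight vector, $h$ a hidden state, and $\alpha$ a hypothetical strength knob with $\alpha = 0$ leaving $h$ unchanged and $\alpha = 1$ projecting the direction out entirely.

```latex
\hat{d} = \frac{w}{\lVert w \rVert}, \qquad
h' = h - \alpha\,(\hat{d}^{\top} h)\,\hat{d}, \qquad
\hat{d}^{\top} h' = (1 - \alpha)\,\hat{d}^{\top} h .
```

The last identity follows from $\hat{d}^{\top}\hat{d} = 1$ and shows that, under this form, the component along the probe direction shrinks linearly in $\alpha$.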

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Ablation of the probe-derived direction tanks accuracy to near zero, but the abstract leaves open whether that direction cleanly isolates stack mechanics or just some correlated signal.

The main point is that this work takes the existing finding that transformers trained on counter languages develop representations consistent with stacks and adds a causal test: a linear probe recovers a principal direction for stack depth, and ablating it drops sequential accuracy to near zero. That moves the claim from "the model has something stack-like" to "that something is necessary for the behavior."

The strength is the intervention itself. Prior representational analyses are correlational; showing that removing the direction breaks performance supplies a different kind of evidence and could be useful for interpretability work on sequential memory.

The soft spot is specificity, the same issue the load-bearing premise raises. The abstract gives no sign of control ablations (random directions, probes trained on token identity or position, or other plausible features) that would show whether performance collapses only for the stack direction or for many directions. Without those, the result shows that some depth-correlated direction matters, but does not yet establish that the model is routing its computation through a dedicated stack representation. If the full paper has those controls, the claim strengthens; if not, the causal story is weaker than presented.

This is for people already working on mechanistic interpretability of transformers on formal languages. A reader who wants to see causal methods applied to these models will find the setup worth examining, even if the current evidence is preliminary.

I would send it to peer review. The core idea is worth a full methods check and the addition of controls if they are missing.

Referee Report

2 major / 1 minor

Summary. This paper examines the causal role of stack representations in transformers modeling counter languages. It uses linear probes on model hidden states to predict stack depth, extracts a principal direction from the probe weights, and ablates this direction, resulting in a collapse of sequential accuracy to near zero. The author concludes that this provides strong empirical evidence for the causal necessity of the stack representation in the model's performance.

Significance. Should the ablation be demonstrated to specifically target stack-related computations without broad disruption (e.g., via appropriate controls), the work would significantly advance the understanding of how transformers implement stack-like mechanisms for formal language tasks. It builds on prior representational studies by adding causal interventions, which is a strength of the approach.

major comments (2)

[Ablation experiments (methods/results)] The interpretation that the ablated direction corresponds to the model's stack representation requires evidence that ablating other directions (such as random vectors or directions from probes on unrelated features like position or token identity) does not produce similar performance collapse. Without such controls, the result shows only that some depth-correlated direction is necessary, not specifically the stack representation.
[Probe and direction extraction (methods)] It is unclear whether the principal direction recovered from the linear probe is the specific subspace used by the transformer for its stack-tracking computations, or if it is a linear combination of entangled signals. This assumption is load-bearing for the causal necessity claim but lacks supporting checks in the described procedure.
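The control requested in the first major comment can be sketched as a comparison protocol: ablate the probe direction, ablate several random directions, and compare the accuracy drop. Everything below is a toy stand-in under loud assumptions. `toy_accuracy` replaces running the real model, and the toy reads depth off `probe_dir` by construction, so the numbers illustrate the procedure only and say nothing about the paper's result.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 32


def unit(v):
    return v / np.linalg.norm(v)


def ablate(H, direction):
    """Project every row of H onto the orthogonal complement of `direction`."""
    u = unit(direction)
    return H - np.outer(H @ u, u)


# Toy setup (assumption): hidden states carry depth along `probe_dir`.
probe_dir = unit(rng.normal(size=d_model))  # hypothetical probe direction
depth = rng.integers(0, 8, size=1000)
H = np.outer(depth, probe_dir) + rng.normal(scale=0.05, size=(1000, d_model))


def toy_accuracy(states):
    """Placeholder for evaluating the real model on ablated activations."""
    return float(np.mean(np.rint(states @ probe_dir) == depth))


baseline = toy_accuracy(H)
probe_acc = toy_accuracy(ablate(H, probe_dir))
random_accs = [toy_accuracy(ablate(H, rng.normal(size=d_model))) for _ in range(20)]

print(f"baseline:          {baseline:.3f}")
print(f"probe direction:   {probe_acc:.3f}")
print(f"random directions: {np.mean(random_accs):.3f} (mean of 20)")
```

A specificity claim would be supported if the real model behaves like the last two lines differ: a large drop for the probe direction and little or none for random or unrelated-feature directions.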

minor comments (1)

[Abstract] The abstract could more precisely define 'sequential accuracy' and reference the exact evaluation metric or task setup used for the collapse observation.
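The figure captions distinguish sequence accuracy from position accuracy, and one common reading of the two metrics is sketched here. The definitions are assumptions, since the abstract does not spell them out: position accuracy counts correct non-padding tokens, while sequence accuracy counts a sequence only if every non-padding token is correct.

```python
import numpy as np


def position_accuracy(pred, target, mask):
    """Fraction of non-padding positions predicted correctly."""
    correct = (pred == target) & mask
    return correct.sum() / mask.sum()


def sequence_accuracy(pred, target, mask):
    """Fraction of sequences whose non-padding positions are all correct."""
    all_correct = ((pred == target) | ~mask).all(axis=1)
    return all_correct.mean()


pred = np.array([[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]])
target = np.array([[1, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 1]])
mask = np.array([[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0]], dtype=bool)

print(position_accuracy(pred, target, mask))  # 8 of 9 positions correct
print(sequence_accuracy(pred, target, mask))  # 2 of 3 sequences fully correct
```

Under these definitions a single wrong token zeroes out a whole sequence, so sequence accuracy can fall to near 0% while position accuracy stays well above it.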

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the strength of the causal claims in our work. We address each major comment below.

Referee: [Ablation experiments (methods/results)] The interpretation that the ablated direction corresponds to the model's stack representation requires evidence that ablating other directions (such as random vectors or directions from probes on unrelated features like position or token identity) does not produce similar performance collapse. Without such controls, the result shows only that some depth-correlated direction is necessary, not specifically the stack representation.

Authors: We agree that control ablations are required to establish specificity. The manuscript currently shows necessity of a depth-correlated direction but does not rule out that other directions could produce similar effects. In revision we will add experiments ablating random vectors and directions from probes on position and token identity, reporting that these do not induce comparable performance collapse. revision: yes

Referee: [Probe and direction extraction (methods)] It is unclear whether the principal direction recovered from the linear probe is the specific subspace used by the transformer for its stack-tracking computations, or if it is a linear combination of entangled signals. This assumption is load-bearing for the causal necessity claim but lacks supporting checks in the described procedure.

Authors: The direction is recovered as the weight vector of the linear probe trained on stack depth, which by construction aligns with the feature of interest. While linear methods can entangle signals, the subsequent ablation supplies causal evidence of necessity. We will add an explicit discussion of this methodological assumption and any feasible post-hoc checks using the existing probe weights. revision: partial
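One post-hoc check that needs only probe weights is the cosine similarity between the depth probe direction and a probe direction for another feature, such as position. The sketch below is an illustration on synthetic states, with depth and position deliberately written along orthogonal axes; it is an assumed form of such a check, not a procedure described in the paper, and `fit_probe` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 24


def unit(v):
    return v / np.linalg.norm(v)


def fit_probe(H, y):
    """Least-squares probe weights (intercept fitted, then dropped)."""
    X = np.hstack([H, np.ones((len(H), 1))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[:-1]


# Synthetic states (assumption): two independent features on orthogonal axes.
depth_axis = unit(rng.normal(size=d))
pos_axis = rng.normal(size=d)
pos_axis = unit(pos_axis - (pos_axis @ depth_axis) * depth_axis)

depth = rng.integers(0, 6, size=n).astype(float)
position = rng.integers(0, 20, size=n).astype(float)
H = (np.outer(depth, depth_axis) + np.outer(position, pos_axis)
     + rng.normal(scale=0.1, size=(n, d)))

w_depth = fit_probe(H, depth)
w_pos = fit_probe(H, position)
cosine = float(unit(w_depth) @ unit(w_pos))
print(f"cosine(depth probe, position probe) = {cosine:.3f}")
```

A cosine near zero is what disentangled features would look like; on real hidden states a large value would suggest the depth direction overlaps with position, which is the entanglement the referee is asking about.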

Circularity Check

0 steps flagged

No circularity: purely empirical intervention study

The paper's central claim rests on an experimental pipeline—training transformers on counter-language next-token prediction, fitting linear probes to recover a principal direction correlated with stack depth, and ablating that direction to observe accuracy collapse. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The ablation result is an independent empirical outcome, not a quantity forced by construction from the probe weights or any prior self-referential result. This matches the default expectation of a self-contained empirical paper with no load-bearing reductions to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.
