pith. machine review for the scientific record.

arxiv: 2605.07984 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Where's the Plan? Locating Latent Planning in Language Models with Lightweight Mechanistic Interventions

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 03:10 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords mechanistic interpretability · activation patching · latent planning · language models · rhyme completion · attention heads · causal analysis

The pith

Only Gemma-3-27B uses future-rhyme signals at line boundaries to causally plan its word choices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether language models form plans for future tokens constrained by structure, such as rhyme in couplets. Linear probing reveals that rhyme information for the next line is decodable at line boundaries in all model families tested, and this decodability increases with scale. Activation patching shows that the information causally affects generation only in Gemma-3-27B, where the causal source shifts from the prior rhyme word to the line boundary around layer 30. The shift is localized to five specific attention heads that, when patched, restore most of the model's rhyme-planning ability.
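
The probing method described here fits a linear classifier on cached hidden states to predict the upcoming rhyme class. Below is a minimal self-contained sketch on synthetic activations; the shapes, class count, and planted-signal construction are illustrative assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for cached residual-stream activations at the newline:
# each couplet gets a d-dim vector that linearly encodes its next-line rhyme
# family (the planted W_true columns play that role here).
d, n_classes, n = 64, 5, 1000
W_true = rng.normal(size=(d, n_classes))
y = rng.integers(n_classes, size=n)              # rhyme family of line two
X = rng.normal(size=(n, d)) + 2.0 * W_true[:, y].T

# Linear probe: a single softmax layer fit by full-batch gradient descent.
W = np.zeros((d, n_classes))
for _ in range(200):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(n), y] -= 1.0                    # dL/dlogits for cross-entropy
    W -= 0.1 * X.T @ p / n

acc = float((np.argmax(X @ W, axis=1) == y).mean())
print(f"probe accuracy: {acc:.2f} (chance is {1 / n_classes:.2f})")
```

High probe accuracy on this toy shows only decodability; as the paper stresses, it says nothing about whether the model causally uses the direction.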

Core claim

Future rhyme information becomes linearly decodable from activations at line boundaries, strengthening with model scale. Yet only in Gemma-3-27B does this encoding causally drive generation, via a handoff around layer 30 from relying on the rhyme word to relying on the boundary. Two-stage path patching localizes this to five attention heads that recover about 90 percent of the rhyme-routing capacity.

What carries the argument

Activation patching and path patching applied to the rhyming couplet task, revealing the handoff of causal planning to line-boundary encodings in one model.
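
Activation patching, as the paper applies it, caches a hidden state from a corrupted run and substitutes it into the clean run's forward pass at a chosen site. A toy illustration of the mechanics (the two-layer numpy "model" and all names are hypothetical; in a real transformer only part of the computation flows through the patched site, so effects are partial):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-layer "model": h1 = tanh(x @ W1) stands in for the residual stream at
# one (layer, position) site; logits = h1 @ W2.
d, n_out = 16, 2
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, n_out))

def hidden(x):
    return np.tanh(x @ W1)

def forward(x, patch_h1=None):
    """Run the model; optionally overwrite the cached hidden state (the patch)."""
    h1 = hidden(x) if patch_h1 is None else patch_h1
    return h1 @ W2

clean_x, corrupt_x = rng.normal(size=d), rng.normal(size=d)

# Substitute the corrupted run's hidden state into the clean run. Because all
# downstream computation flows through h1 in this toy, the patched output
# follows the corrupted run -- the signature of a causal planning site.
patched = forward(clean_x, patch_h1=hidden(corrupt_x))
print(bool(np.argmax(patched) == np.argmax(forward(corrupt_x))))
```

In the paper's setting the "output" is the rhyme family of the generated line, and the corrupt rhyme rate measures how often the patch redirects it.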

If this is right

  • Models that show strong probe signals may still not use those signals for generation.
  • Planning representations may be architecture-specific and scale-dependent.
  • Lightweight patching methods can identify planning sites efficiently.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This gap between decodability and causality suggests that interpretability should prioritize causal tests over representational ones.
  • Similar planning handoffs might exist in other constrained tasks like mathematical reasoning or story coherence.
  • Interventions at boundaries could improve model performance on long-form generation if applied during training.

Load-bearing premise

Rhyming-couplet completion acts as a general test of forward-looking structural constraint satisfaction rather than a narrow surface pattern.

What would settle it

If activation patching at the line boundary in Gemma-3-27B fails to affect rhyme accuracy, or if it succeeds in other models, the claim of unique causal reliance would not hold.

Figures

Figures reproduced from arXiv: 2605.07984 by Nick Rui, Nicole Ma.

Figure 1. Top-5 accuracy of linear probes predicting k tokens ahead in general text (Pile). Wilson 95% CI bands are drawn but visually imperceptible: per-token N ≈ 21,000 gives a half-width of ~0.005 at typical p, so the curves are essentially noise-free at this sample size. Accuracy degrades monotonically with k and the k = 8 curve overlaps the unigram baseline across all layers.

Figure 2. A linear probe f(ℓ,0) trained to predict the rhyming token from activations at the newline position (i = 0).

Figure 3. Top-5 and rhyme accuracy of linear probes trained to predict r2 from hidden states at various layers and positions. Shaded bands are Wilson 95% CIs computed from N = 200 validation items. Probes at the last word (i ≤ −1) and newline (i = 0) positions substantially outperform probes at subsequent generated positions (i > 0); the bands at the peak layers do not overlap with the i ≥ 1 bands.

Figure 4. Maximum accuracy gap across layers between probes at the newline (i = 0) and the first generated position (i = 1), plotted against model size. Black error bars are 95% CIs at the chosen peak layer (paired-difference Wald approximation; per-sample correctness was not stored, so the interval is conservative relative to the true paired CI).

Figure 5. Activation patching: the hidden state h(ℓ,i) from a corrupted run is substituted into the clean run's forward pass at position i and layer ℓ. A successful patch redirects generation toward the corrupted rhyme family, providing causal evidence that (i, ℓ) is a planning site.

Figure 6. Per-layer activation patching at the last word token and newline (i = 0) for the largest model in each family. Full results for all model sizes are in Appendix D.

Figure 7. Localizing the planning-site handoff in Gemma-3-27B to a sparse set of attention heads. (a) Attention weight from the newline token (i = 0) to the last word token (i = −2) across heads in layers 27–45; red stars mark the top-5 heads by attention weight. (b) Corrupt rhyme rate when patching the top-k highest-attending heads simultaneously at the newline (5 prompt pairs × N = 20). Black error bars are cluster-bootstrap CIs.

Figure 8. Two-stage path patching K-sweep on Gemma-3-27B. Attention-weight top-k peaks at k = 5 (57%, 90% of the full-residual reference) and declines at k = 10, 15. Comma-control and random head sets stay at zero. Error bars are 95% cluster-bootstrap CIs over prompt pairs.

Figure 10. Steered rhyme fraction at the last word and newline across all layers for Qwen3-32B, Gemma-3-27B, and Llama-3.1-70B.

Figure 9. Steering vectors are mean-difference vectors between residual activations on prompts in two different rhyme schemes; adding αv^(s→t)_(ℓ,i) at (ℓ, i) during generation should redirect the rhyme toward scheme t. Computing the vectors requires 10 schemes × 100 train prompts = 1,000 hooked forward passes, and the evaluation sweep covers every (ℓ, i) across scheme pairs at 20 held-out prompts each.

Figure 11. Top-1 probe accuracy predicting k tokens ahead in general text (Pile). Mirrors the top-5 pattern: accuracy degrades monotonically with k and falls to the unigram baseline by k = 8.

Figure 12. Top-1 probe accuracy predicting r2 on rhyming couplets. Mirrors the top-5 and rhyme-accuracy results: the i ≤ 0 probes show substantially higher accuracy in middle-to-late layers than probes at i > 0.

Figure 14. Peak corrupt rhyme rate (maximum across all layers) for the last word token and newline (i = 0) at each model size. Black error bars are 95% cluster-bootstrap CIs over prompt pairs (5 pairs per model). Gemma-3-27B is the only model whose newline CI is clearly separated from zero (0.63 [0.48, 0.78]); every other model has a newline CI upper bound ≤ 0.21 regardless of scale, while last-word patching is broadly effective.

Figure 15. Per-layer activation patching, Qwen3 0.6B–4B. Black bars are 95% cluster-bootstrap CIs (5 pairs × N = 20). Last-word patching becomes effective only from 4B (peak 0.48 [0.25, 0.72]); 0.6B and 1.7B have peak CIs that nearly span zero. Newline (i = 0) patching is at noise across all sizes (peak CI upper bounds ≤ 0.21).

Figure 16. Per-layer activation patching, Qwen3 8B–32B. Black bars are 95% cluster-bootstrap CIs. Last-word peaks rise smoothly with scale (8B 0.54 [0.35, 0.72]; 14B 0.73 [0.46, 0.97]; 32B 0.74 [0.55, 0.93]). Newline patching remains at noise across all three sizes (CI upper bounds ≤ 0.21).

Figure 17. Per-layer activation patching, Gemma-3 1B–27B. Black bars are 95% cluster-bootstrap CIs. The newline (i = 0) channel is silent at every size below 27B (peak CI [0.00, 0.00] for 4B and 12B), and only emerges at 27B with peak 0.63 [0.48, 0.78] at L33. Last-word (i = −2) patching is effective from 1B onward.

Figure 19. Corrupt rhyme rate across all six swept token positions, averaged over 5 prompt pairs. For Gemma-3-27B, the comma token at i = −1 is near zero throughout, confirming the handoff is specific to the newline token.

Figure 18. Per-layer activation patching, Llama-3 1B–70B. Black bars are 95% cluster-bootstrap CIs. Last-word peaks are high from 3B onward (3B 0.91 [0.86, 0.96]; 8B 0.85 [0.73, 0.97]; 70B 0.86 [0.74, 0.96]); 1B is lower (peak 0.66 [0.56, 0.75]). Newline patching is at noise across all four sizes (CI upper bounds ≤ 0.09).

Figure 20. Full position sweep across all four model groups in the Qwen3, Gemma-3, and Llama-3 families.
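
Several captions report Wilson 95% CIs, including the ~0.005 band half-width quoted for Figure 1 at N ≈ 21,000. That number can be checked with the standard Wilson score interval; this is the generic formula, not the authors' code, and p = 0.8 is an assumed "typical" accuracy:

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n (~95% at z=1.96)."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Figure 1's note: per-token N ≈ 21,000 gives a band half-width of ~0.005.
lo, hi = wilson_ci(k=16800, n=21000)             # p = 0.8, an assumed typical p
print(f"half-width at p=0.8, N=21000: {(hi - lo) / 2:.4f}")
```

The half-width lands near 0.005, consistent with the caption's claim that the curves are essentially noise-free at this sample size.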
read the original abstract

We study planning site formation in language models -- where internal representations of structurally-constrained future tokens form during the forward pass, and whether they causally drive generation. Using rhyming-couplet completion as a clean test of forward-looking constraint, we apply two lightweight methods (linear probing and activation patching) across Qwen3, Gemma-3, and Llama-3 at more than ten scales. Probing shows that future-rhyme information is linearly decodable at the line boundary, with signal that strengthens with scale in all three families. Activation patching reveals that only Gemma-3-27B causally relies on this encoding, exhibiting a handoff in which the causal driver migrates from the rhyme word to the line boundary around layer 30. Every other model we test conditions on the rhyme word throughout generation, with near-zero causal effect at the line boundary despite strong probe signal. Using two-stage path patching, we localize the Gemma-3-27B handoff to five attention heads that recover ~90% of the rhyme-routing capacity at the newline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates the formation and causal use of internal representations for structurally constrained future tokens in language models during generation. Using rhyming-couplet completion as a test of forward-looking constraints, it applies linear probing and activation patching across Qwen3, Gemma-3, and Llama-3 families at multiple scales. Probing shows future-rhyme information is linearly decodable at line boundaries with scale-dependent signal strength, while patching reveals that only Gemma-3-27B causally relies on boundary encodings, with a handoff from the rhyme word around layer 30; this is localized to five attention heads via two-stage path patching recovering ~90% of rhyme-routing capacity. Other models show near-zero causal effect at boundaries despite strong probes, instead conditioning on the rhyme word throughout.

Significance. If the causal claims hold after addressing potential confounds, the work offers a lightweight, scalable approach to distinguish probe-detectable information from causally used representations in mechanistic interpretability. It highlights model-specific differences in handling structural constraints and provides concrete localization of planning-like behavior to specific heads in one large model, which could inform targeted interventions and scaling analyses. The cross-family, multi-scale design and use of path patching for head localization are particular strengths.

major comments (2)
  1. [Abstract] Abstract and task description: the central claim that rhyming-couplet completion provides a 'clean test of forward-looking constraint' is not yet supported, because the rhyme word from the first line remains fully in context. Models can satisfy the constraint via direct attention to that prior token at generation time without ever forming or using a representation of an upcoming constraint at the newline; this alternative is consistent with strong linear probes (which can decode the already-known rhyme) yet near-zero causal effects at the boundary in all models except Gemma-3-27B.
  2. [Abstract] Abstract: the reported handoff and five-head localization in Gemma-3-27B may reflect model-specific differences in routing an already-present cue rather than differences in latent planning capacity. Additional controls (e.g., ablating or masking the original rhyme token while preserving boundary representations) are needed to establish that the boundary encoding is used as a forward plan rather than a re-encoding of prior context.
minor comments (2)
  1. The abstract supplies no statistical tests, error bars, or baseline comparisons for the patching effects or the ~90% recovery claim; these should be added to allow assessment of whether the causal effects exceed what would be expected from token-frequency or surface cues.
  2. Notation for the two-stage path patching procedure and the exact definition of 'rhyme-routing capacity' should be clarified with a short methods diagram or pseudocode to improve reproducibility.
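
On the second minor point, one plausible reading of the two-stage procedure can be sketched as a runnable toy. The staging below is an editorial guess, not the authors' definition: stage one ranks heads (the paper ranks by attention weight from the newline to the rhyme word, per Figures 7–8; the toy substitutes single-head patch effects), stage two patches the top-k jointly, and "rhyme-routing capacity" is read as the joint effect relative to patching the full residual stream.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-in: the residual stream at the newline is the sum of head outputs.
# Five hypothetical "routing" heads carry a planted rhyme direction.
n_heads, d, k = 16, 32, 5
rhyme_dir = rng.normal(size=d)
clean = rng.normal(size=(n_heads, d)) * 0.1      # clean-run head outputs
corrupt = clean.copy()
routing_heads = (2, 5, 7, 11, 13)
for h in routing_heads:
    corrupt[h] += rhyme_dir / k                  # corrupted-run rhyme signal

def score(heads):
    # Proxy readout: projection of the summed residual onto the rhyme direction.
    return float(heads.sum(axis=0) @ rhyme_dir)

full_effect = score(corrupt) - score(clean)      # full-residual patch reference

# Stage 1: patch each head alone and rank by effect on the readout.
effects = []
for h in range(n_heads):
    patched = clean.copy()
    patched[h] = corrupt[h]
    effects.append(score(patched) - score(clean))
order = np.argsort(effects)[::-1]

# Stage 2: patch the top-k heads jointly; "capacity recovered" is the joint
# effect as a fraction of the full-residual reference. In the real model other
# paths also contribute, so recovery is partial (~90% in the paper); in this
# toy the k planted heads carry everything.
patched = clean.copy()
patched[order[:k]] = corrupt[order[:k]]
recovered = (score(patched) - score(clean)) / full_effect
print(f"top-{k} heads recover {recovered:.0%} of the full-residual effect")
```

A short methods diagram or pseudocode of this shape in the paper would pin down both the ranking signal and the denominator of the 90% figure.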

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment below, acknowledging where the manuscript requires clarification or additional controls, and outline the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] Abstract and task description: the central claim that rhyming-couplet completion provides a 'clean test of forward-looking constraint' is not yet supported, because the rhyme word from the first line remains fully in context. Models can satisfy the constraint via direct attention to that prior token at generation time without ever forming or using a representation of an upcoming constraint at the newline; this alternative is consistent with strong linear probes (which can decode the already-known rhyme) yet near-zero causal effects at the boundary in all models except Gemma-3-27B.

    Authors: We agree that the rhyme word remains fully in context, so models could satisfy the constraint by direct attention to the prior token without forming a new representation at the newline. Our patching results are consistent with this interpretation for Qwen3, Llama-3, and smaller Gemma models, where causal effects localize to the rhyme word with near-zero boundary effects. The distinctive finding is the handoff observed only in Gemma-3-27B. We will revise the abstract and task description to frame rhyming-couplet completion as a testbed for detecting when models shift from direct cue reliance to boundary-based representations, rather than claiming it as an unequivocally clean test of forward-looking constraints. revision: partial

  2. Referee: [Abstract] Abstract: the reported handoff and five-head localization in Gemma-3-27B may reflect model-specific differences in routing an already-present cue rather than differences in latent planning capacity. Additional controls (e.g., ablating or masking the original rhyme token while preserving boundary representations) are needed to establish that the boundary encoding is used as a forward plan rather than a re-encoding of prior context.

    Authors: The referee correctly identifies that the handoff and five-head localization could reflect routing of an already-present cue. We will add the suggested controls in the revised manuscript: we will mask or ablate the original rhyme token while preserving boundary activations, then re-run activation patching and two-stage path patching to measure whether boundary representations retain independent causal influence on rhyme routing in Gemma-3-27B. These results will be reported alongside the existing findings. revision: yes

Circularity Check

0 steps flagged

Purely empirical mechanistic study with no derivation chain

full rationale

The paper reports experimental results from linear probing (to detect decodable future-rhyme information) and activation/path patching (to test causal reliance) across multiple model families and scales on a rhyming-couplet task. No equations, first-principles derivations, or parameter-fitting steps are presented that reduce any central claim to its own inputs by construction. All findings are measured against external model behaviors and interventions; the abstract and described methods contain no self-citations used as load-bearing uniqueness theorems, no ansatzes smuggled via prior work, and no renaming of known results as new organization. This matches the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumptions that linear probes recover causally relevant information and that activation patching isolates causal pathways; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Linear probes can extract causally relevant information from model activations
    Invoked when interpreting probe accuracy as evidence of internal encoding.
  • domain assumption Activation patching and path patching reveal causal dependencies in the forward pass
    Core premise of the intervention experiments.

pith-pipeline@v0.9.0 · 5483 in / 1430 out tokens · 37367 ms · 2026-05-11T03:10:25.058354+00:00 · methodology

