arxiv: 2604.12493 · v1 · submitted 2026-04-14 · 💻 cs.CL · cs.AI

Latent Planning Emerges with Scale

Pith reviewed 2026-05-10 14:51 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords latent planning · model scale · large language models · internal representations · mechanistic interpretability · future token prediction · Qwen models

The pith

Larger language models develop internal features that represent and prepare for specific future words.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines latent planning as the presence of internal representations that both cause the generation of a particular future token or concept and adjust earlier text to make that token fit naturally. Testing this on the Qwen-3 model series from 0.6 billion to 14 billion parameters, the authors find that such representations appear and strengthen as model size grows. Even mid-sized models show early versions of these mechanisms, visible in how they encode an upcoming word such as "accountant" and steer the preceding article toward "an" instead of "a". On rhyme-completion tasks the effect is weaker and shorter-range, yet steering experiments still produce scale-dependent gains. The result supplies both a measurement method and direct evidence that implicit planning is a capacity that scales.

Core claim

Latent planning occurs when models possess internal planning representations that cause the generation of a specific future token or concept and shape preceding context to license said future token or concept. In the Qwen-3 family on simple planning tasks, this ability increases with scale; models that plan contain features that represent a planned-for word like "accountant" and causally drive outputs such as "an" rather than "a", while even the 4B-8B models exhibit nascent versions of these mechanisms. On rhyming couplets, models frequently identify a rhyme in advance but seldom plan many tokens ahead, although steering toward planned words in prose elicits additional planning that also grows in

What carries the argument

Internal features that encode a planned-for future token and causally influence the probability of earlier tokens to license it.
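
The a/an behavior that this argument rests on is scored in Figure 2 as per-article recall. A minimal sketch of that metric follows; the function name and the toy predictions are illustrative assumptions, not data or code from the paper.

```python
from collections import Counter


def recall_by_article(examples):
    """Per-class recall for (gold_article, predicted_article) pairs."""
    total = Counter()
    correct = Counter()
    for gold, predicted in examples:
        total[gold] += 1
        if predicted == gold:
            correct[gold] += 1
    return {article: correct[article] / total[article] for article in total}


# Hypothetical predictions, made up for illustration only.
toy_examples = [
    ("a", "a"), ("a", "a"), ("a", "a"),
    ("an", "an"), ("an", "a"),
]
recalls = recall_by_article(toy_examples)
```

Reporting recall separately for each article matters here because "an" is the minority class: a model that always outputs "a" would score well on overall accuracy while showing no sign of planning.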

If this is right

  • Scaling will produce stronger implicit planning in tasks such as story writing and code generation.
  • Mid-sized models already contain basic planning representations that can be measured and potentially amplified.
  • Steering interventions that point models toward planned words will yield larger planning gains in bigger models.
  • Mechanistic inspection of these features can track how planning representations form during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same measurement approach could be used to test whether other emergent behaviors also rest on identifiable internal representations.
  • If the features can be located reliably, targeted editing might allow finer control over long-range coherence without explicit prompting.
  • The pattern suggests planning is a general capacity that appears once models exceed a certain size threshold rather than a task-specific skill.

Load-bearing premise

The identified internal features and their token influences reflect genuine causal planning rather than surface correlations, and the chosen tasks validly capture the defined form of latent planning.

What would settle it

Intervening on or ablating the identified planning features and observing no corresponding change in the choice of future tokens or in the shaping of prior context would falsify the claim.
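
The logic of this test can be sketched with a toy two-way readout over "a" and "an". Everything below (the weights, the single planning feature, the function names) is a made-up assumption for illustration, not the paper's transcoder setup; it only shows what the zero and multiplying interventions described in Figure 3 would look like if the feature does carry causal weight.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def p_an(feature_activation, scale=1.0):
    """Probability of "an" in a toy readout over ["a", "an"].

    The hypothetical planning feature adds scale * activation * write_weight
    to the "an" logit. scale=0.0 is a zero ablation; scale>1.0 is upweighting.
    """
    base_logits = [1.0, 0.0]  # toy prior favoring "a"
    write_weight = 2.0        # toy effect of the feature on the "an" logit
    an_logit = base_logits[1] + write_weight * scale * feature_activation
    return softmax([base_logits[0], an_logit])[1]


baseline = p_an(1.0)            # feature left untouched
ablated = p_an(1.0, scale=0.0)  # zero intervention
boosted = p_an(1.0, scale=3.0)  # multiplying intervention
```

In this toy, ablation lowers p("an") and upweighting raises it. The falsifying outcome described above corresponds to all three values being equal, which would mean the feature has no causal path to the article.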

Figures

Figures reproduced from arXiv: 2604.12493 by Emmanuel Ameisen, Michael Hanna.

Figure 1: Feature circuit for the input Someone who studies living organisms is a biologist. Someone who handles financial records is, explaining Qwen-3 (14B)’s output, an. The model plans to say accountant, causing it to output the appropriate article, an. Labeled nodes are sets of active transcoder features with shared semantics; edges indicate that the source node increases the target node’s activation. We demon…
Figure 2: Left: Qwen-3 family models’ recall of correct article on the a/an task. All models can recall a, but models ≤ 8B have lower recall on the less-common an. Right: The mean proportion of influence flowing through planning nodes in the a/an dataset, by model, article, and correctness. On an examples where the model correctly predicts the next token, more influence tends to flow through the planning nodes. This…
Figure 3: Left: Change in p(correct article) caused by zero and multiplying interventions on planning features. As expected, ablating these harms performance, while upweighting them improves it; however, both affect primarily an examples, the minority class. Right: Change in p(correct article) caused by direct-effect interventions. Effects are smaller, indicating that planning features act both directly (by upweigh…
Figure 4: A feature circuit for the couplet Fury burns where calm once stayed,\n Hope flickers where the shadows laid, explaining Qwen-3 (14B)’s decision to output shadows laid,\n. Halfway through outputting the couplet’s second line, the model’s “near end of a line of poetry” features activate. These cause it to attend back to the end of the first line, where “end of a line of poetry” features are active, and to mo…
Figure 5: Left: Qwen-3 rhyming accuracy. In the base case, models have moderate rhyming accuracy, reaching 0.6 at 14B parameters (solid blue); when we consider assonant (vowel-only) rhyme, Qwen-3 (14B) achieves 0.8 (dashed blue). When steered to predict a new rhyme, model accuracy is only moderate for perfect rhymes (solid orange), but improves with scale, and is better on assonant rhyme (dashed orange). Right: Mod…
Figure 6: Adaptation metrics by model. As models grow, so does the proportion of (1) outputs containing X (blue), and (3) coherent and X-containing outputs that also adapt the context to license X (green). Few examples do all three. Though the backward planning results are mostly negative, results for larger models (8B-14B) trend in the right direction: they more accurately predict the steered rhyme given the steer…
Figure 7: A diagram of a transcoder. The transcoder takes in the dense MLP inputs, computes a …
Figure 8: A 2-layer transformer LM, and its corresponding local replacement model. We replace …
Figure 9: The interface used for circuit visualization / annotation, from …
Figure 10: Left: Recall of is and are on the is/are dataset, by model. Models below 8B in size mostly fail to predict is, while larger models perform perfectly. All models can predict are. Right: The mean proportion of influence flowing through planning nodes in the is/are dataset, by model, verb, and correctness; recall that the only incorrect examples are small models failing to predict is. The most influence flow…
Figure 11: Left: Change in p(correct verb) caused by zero and multiplying interventions on planning features. The former generally harm performance, while the latter improve it. Both affect only is examples, which have the most planning nodes, and also are the only examples models answer incorrectly. Right: Change in p(correct verb) caused by direct zero and multiplying interventions on planning features. As before…
Figure 12: Left: Recall of el and la examples by Qwen-3 models. Unlike in prior examples, the majority class (el) is not perfectly captured by any model, though recall is generally high. Moreover, while performance on the minority class la improves with scale, recall is ultimately still middling. Right: Interventions performed with respect to el/la planning features fail primarily due to a lack of said planning feat…
Figure 13: Left: Planned word accuracy, i.e. whether the model’s predicted word matches the intended word, when given the correct or incorrect article. Models above 4B in size are highly accurate when given the correct article (> 80%), and even smaller model achieve moderate accuracies. Given the wrong article, accuracies are lower, but still non-zero, indicating that models may have a strong planning goal that pre…
Figure 14: Number accuracy, i.e. whether the model’s predicted number of animals matches the …
Figure 15: Effects of random features interventions on the …
Figure 16: Macro F1 scores when probing models’ last-token representations for the animal that the …
Figure 17: Success rates of our patching intervention, where we patch one prompt (p …
Figure 18: A: Effects of intervening on end-of-line (EOL) features. Upweighting them in the second line causes the line to end early, while downweighting them causes it to continue for longer than normal. B: Effect of upweighting near-end of line (NEOL) features in the second line. Upweighting these causes the model to emit a rhyme over 2 words earlier than normal. C: Effects of downweighting EOL features at the e…
Figure 19: Length of the shared prefix between the original generation, generations with temperature …
Figure 20: For each model, and each of its top-10 words by number of say X features, we steer on those say X features, setting their activations to 3, 5, or 7 times their original values. We do so on 5-15 token fragments of sentences from the TinyStories dataset (Eldan & Li, 2023), a neutral context where models are not likely strongly planning. We then record whether each model eventually output X, and qualitatively…
Figure 20: A feature circuit for a couplet ending in …
Figure 21: Steering metrics by model size, averaged across steering strength. Overall, as model …
Figure 22: Performance of base (dashed line) and instruction-tuned (solid line) models on the a/an …
Figure 23: The effectiveness of replacement interventions, measured by the percent of cases where …
Figure 24: Results of steering on planning features from the …
Figure 25: Planning feature counts on an examples, by model and correctness. The smallest models have few planning features overall, while the largest has relatively many in both the correct and incorrect cases. In the 4B and 8B parameter models (with nascent circuits), there is a large gap between the number of planning features active in correct and incorrect cases.
Original abstract

LLMs can perform seemingly planning-intensive tasks, like writing coherent stories or functioning code, without explicitly verbalizing a plan; however, the extent to which they implicitly plan is unknown. In this paper, we define latent planning as occurring when LLMs possess internal planning representations that (1) cause the generation of a specific future token or concept, and (2) shape preceding context to license said future token or concept. We study the Qwen-3 family (0.6B-14B) on simple planning tasks, finding that latent planning ability increases with scale. Models that plan possess features that represent a planned-for word like "accountant", and cause them to output "an" rather than "a"; moreover, even the less-successful Qwen-3 4B-8B have nascent planning mechanisms. On the more complex task of completing rhyming couplets, we find that models often identify a rhyme ahead of time, but even large models seldom plan far ahead. However, we can elicit some planning that increases with scale when steering models towards planned words in prose. In sum, we offer a framework for measuring planning and mechanistic evidence of how models' planning abilities grow with scale.
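
The rhyming-couplet results in Figure 5 distinguish perfect rhyme from assonant (vowel-only) rhyme. A minimal sketch of one way to draw that line over ARPAbet phonemes, the notation used by the CMU Pronouncing Dictionary the paper cites, follows. The matching rule (compare everything from the last primary-stressed vowel onward) and the hand-written pronunciations are assumptions for illustration; the paper's exact scoring may differ.

```python
def rhyme_tail(phones):
    """Phonemes from the last primary-stressed vowel to the end of the word."""
    for i in range(len(phones) - 1, -1, -1):
        if phones[i].endswith("1"):
            return phones[i:]
    return phones  # no primary stress marked: fall back to the whole word


def is_vowel(phone):
    """ARPAbet vowels carry a stress digit (0, 1, or 2); consonants do not."""
    return phone[-1].isdigit()


def is_perfect_rhyme(a, b):
    """Stressed vowel and everything after it must match exactly."""
    return rhyme_tail(a) == rhyme_tail(b)


def is_assonant_rhyme(a, b):
    """Only the vowels of the rhyme tail must match; consonants are ignored."""
    vowels_a = [p for p in rhyme_tail(a) if is_vowel(p)]
    vowels_b = [p for p in rhyme_tail(b) if is_vowel(p)]
    return bool(vowels_a) and vowels_a == vowels_b


# Hand-written toy pronunciations (assumed, not loaded from the dictionary).
PRON = {
    "stayed": ["S", "T", "EY1", "D"],
    "laid": ["L", "EY1", "D"],
    "lake": ["L", "EY1", "K"],
    "calm": ["K", "AA1", "M"],
}
```

Under this rule "stayed" and "laid" rhyme perfectly, while "stayed" and "lake" count only as an assonant rhyme, which is why assonant accuracy can only be equal to or higher than perfect-rhyme accuracy.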

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper defines latent planning as the presence of internal representations that both cause a specific future token/concept and shape preceding context to license it. It reports that this ability increases with scale in the Qwen-3 family (0.6B–14B) on simple planning tasks, with larger models possessing features that represent planned-for words (e.g., 'accountant') and drive correct article choice ('an' vs. 'a'); even 4B–8B models show nascent versions of these mechanisms. On rhyming-couplet completion, models identify rhymes ahead of time but plan only limited distance ahead, though steering toward planned words elicits more planning that scales with size.

Significance. If the causal claims hold, the work supplies a concrete framework for measuring implicit planning and mechanistic evidence that planning representations strengthen with scale. This could guide future interpretability and training efforts aimed at improving long-horizon reasoning in LLMs.

major comments (3)
  1. [§4] §4 (feature identification for 'an'/'a' task): the central definition requires that the identified features 'cause' the future token, yet the reported evidence consists of correlational probes (feature activation similarity, logit inspection) without activation patching, steering, or counterfactual interventions on the planning direction during generation. This leaves open whether the features are doing the causal work or merely correlate with successful outputs.
  2. [§5] §5 (rhyming-couplet results and scaling claims): the statement that 'even the less-successful Qwen-3 4B-8B have nascent planning mechanisms' is load-bearing for the scaling narrative, but the manuscript provides no quantitative threshold, statistical test, or control condition that distinguishes nascent planning from task-specific correlations or memorization.
  3. [§3] §3 (task design): the simple planning and rhyming tasks are used to operationalize the definition, yet no ablation or control experiments are described that rule out the possibility that observed 'planning' features are downstream effects of surface-level token statistics rather than genuine future-oriented representations.
minor comments (2)
  1. [Abstract] The abstract states empirical findings without any mention of methods, controls, or statistical tests; a one-sentence summary of the experimental approach would improve readability.
  2. [§2] Notation for 'planned-for word' and 'planning feature' is introduced informally; a short definitions subsection or table would clarify the distinction between representation and causal effect.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify important gaps in causal evidence, quantification of nascent effects, and controls for alternative explanations. We address each point below and will incorporate revisions to strengthen the manuscript.

  1. Referee: [§4] §4 (feature identification for 'an'/'a' task): the central definition requires that the identified features 'cause' the future token, yet the reported evidence consists of correlational probes (feature activation similarity, logit inspection) without activation patching, steering, or counterfactual interventions on the planning direction during generation. This leaves open whether the features are doing the causal work or merely correlate with successful outputs.

    Authors: We agree that the definition of latent planning is causal and that our current probes (activation similarity and logit inspection) are primarily correlational. While the scaling trends and the link between feature presence and article choice provide suggestive functional evidence, direct interventions are needed for stronger claims. In the revised manuscript we will add activation patching and steering experiments that intervene on the identified planning features during generation, testing whether they causally drive the future token and the preceding context. revision: yes

  2. Referee: [§5] §5 (rhyming-couplet results and scaling claims): the statement that 'even the less-successful Qwen-3 4B-8B have nascent planning mechanisms' is load-bearing for the scaling narrative, but the manuscript provides no quantitative threshold, statistical test, or control condition that distinguishes nascent planning from task-specific correlations or memorization.

    Authors: We acknowledge that the phrasing 'nascent planning mechanisms' for the 4B–8B models requires more rigorous grounding. The revision will introduce explicit quantitative thresholds (e.g., minimum feature–token correlation and downstream accuracy gains), statistical tests against matched control conditions, and additional baselines that isolate planning representations from task-specific correlations or memorization. These changes will make the scaling claims more precise and defensible. revision: yes

  3. Referee: [§3] §3 (task design): the simple planning and rhyming tasks are used to operationalize the definition, yet no ablation or control experiments are described that rule out the possibility that observed 'planning' features are downstream effects of surface-level token statistics rather than genuine future-oriented representations.

    Authors: The tasks were constructed so that correct article choice or rhyme selection depends on a future noun or rhyme word that is not yet in the local context. Nevertheless, we agree that surface-level statistical explanations must be explicitly ruled out. The revised version will include ablation controls that preserve local token statistics while removing the need for future planning (and vice versa) to confirm that the identified features reflect genuine latent planning rather than downstream correlations. revision: yes
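
One form the promised statistical tests against matched controls could take is a permutation test on a per-example planning measure. The sketch below is an assumption about how such a test might be set up, with made-up numbers; it is not a procedure taken from the manuscript.

```python
import random


def permutation_p_value(treated, control, n_permutations=2000, seed=0):
    """One-sided permutation test for mean(treated) > mean(control)."""
    rng = random.Random(seed)

    def mean(values):
        return sum(values) / len(values)

    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    k = len(treated)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabeling of the pooled examples
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical planning-feature counts per example, for illustration only.
planning_counts = [9, 8, 10, 9, 11, 10]
control_counts = [2, 3, 1, 2, 3, 2]
p_value = permutation_p_value(planning_counts, control_counts)
```

A permutation test makes no distributional assumption about the feature counts, which suits small per-model samples; the choice of measure and of control condition remains the substantive question.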

Circularity Check

0 steps flagged

No circularity: empirical scaling study with independent measurements

The paper introduces a definition of latent planning and reports observational results on how internal representations and task performance scale across the Qwen-3 model family (0.6B–14B). No mathematical derivations, self-referential equations, or fitted parameters are present that would reduce any claimed prediction or scaling result to the inputs by construction. The analysis consists of measurements on planning and rhyming tasks, with findings stated as empirical observations rather than tautological outputs of the definition itself. No load-bearing self-citations or ansatzes imported from prior work by the same authors appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific free parameters or axioms; the introduced definition of latent planning may implicitly assume that token prediction and feature analysis capture planning, but no details are available.

pith-pipeline@v0.9.0 · 5501 in / 1192 out tokens · 56412 ms · 2026-05-10T14:51:25.452963+00:00 · methodology

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · 3 internal anchors

  1. [1]

    In: Zong, C., Xia, F., Li, W., Navigli, R

    Association for Computational Linguistics. ISBN 979-8-89176-189-6. doi: 10.18653/v1/2025.naacl-long.312. URL https://aclanthology.org/2025.naacl-long.312/. 11 Published as a conference paper at ICLR 2026 Carnegie Mellon University. The carnegie mellon pronouncing dictionary, 2014. URL http://www.speech.cs.cmu.edu/cgi-bin/cmudict. Version 0.7b. Tyler A. Ch...

  2. [2]

    TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

    URL https://openreview.net/forum?id=J6zHcScAo0. Ronen Eldan and Yuanzhi Li. Tinystories: How small can language models be and still speak coherent english?, 2023. URL https://arxiv.org/abs/2305.07759. Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn D...

  3. [3]

    https://transformer-circuits.pub/2021/framework/index.html. Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer C...

  4. [4]

    Shauli Ravfogel, Grusha Prasad, Tal Linzen, and Yoav Goldberg

    URL https://www.lesswrong.com/posts/9NqgYesCutErskdmu/parascopes-do-language-models-plan-the-upcoming-paragraph. Shauli Ravfogel, Grusha Prasad, Tal Linzen, and Yoav Goldberg. Counterfactual interventions reveal the causal effect of relative clause representations on agreement prediction. In Arianna Bisazza and Omri Abend (eds.), Proceedings of the 25th Co...

  5. [5]

    Qwen3 Technical Report

    Association for Computational Linguistics. doi: 10.18653/v1/2024.knowllm-1.1. URL https://aclanthology.org/2024.knowllm-1.1/. Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learnin...

  6. [6]

    {steered_word}

    **Contains steered word**: Does the steered generation contain the exact word "{steered_word}"? (Look for exact match, case-insensitive)

  7. [7]

    **Coherence**: Is the steered generation coherent up to the target word. Is it natural, or does it show signs of breakdown such as: - Excessive repetition of words or phrases - Unnatural/nonsensical sentences - Abrupt topic changes that don’t make sense - Grammatical breakdown - IMPORTANT: Focus on coherence up to the point where the steered word appears ...

  8. [8]

    a", "the

    **Context adaptation**: Compare the steered generation to the baseline generation, starting from the input_prompt. Did the model modify the words immediately preceding where the steered word appears (or would appear) compared to the baseline? Look for changes like: - Adding/changing articles ("a", "the", "an") - Adding/changing prepositions ("in", "on", ...