
arxiv: 2511.05963 · v2 · pith:EREDXFXXnew · submitted 2025-11-08 · 💻 cs.LG

Next-Latent Prediction Transformers Learn Compact World Models

Pith reviewed 2026-05-25 07:18 UTC · model grok-4.3

classification 💻 cs.LG
keywords transformers · world models · belief states · latent prediction · next-token prediction · compact representations · inductive bias · self-supervised learning

The pith

Adding a next-latent prediction loss makes transformer internal states converge to belief states that compress history for future prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes NextLat, an auxiliary objective that trains a transformer to predict its next latent representation from the current latent and the next token, in addition to standard next-token prediction. Without this term, transformers have no built-in pressure to maintain compact, consistent internal states across time steps. The authors prove that the joint objective forces the latents to converge toward belief states, the minimal information about past observations needed to predict future ones. This change is claimed to produce more coherent internal world models while preserving the original architecture, parallel training, and inference procedure. Readers would care because such representations are said to improve accuracy on tasks that require planning or reasoning over sequences.

Core claim

NextLat extends standard next-token training with self-supervised predictions in the latent space, training the model to predict its next latent state given the next token. The paper shows theoretically that the resulting latents converge to belief states, which are compressed summaries of history sufficient for predicting future observations. This auxiliary objective injects a recurrent inductive bias into transformers, encouraging formation of compact internal world models with coherent transition dynamics that standard next-token prediction does not guarantee.
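To make "belief state" concrete, here is a minimal sketch of the textbook case rather than anything taken from the paper: in a hidden Markov model, the posterior over hidden states is a belief state, and it can be updated from the previous belief and the next observation alone. That is the same recursive shape NextLat asks of transformer latents. The two-state model, its probabilities, and the function names are illustrative assumptions.

```python
# Toy HMM (hypothetical numbers): 2 hidden states, 2 observation symbols.
TRANSITION = [[0.9, 0.1],   # P(next state | current state = 0)
              [0.2, 0.8]]   # P(next state | current state = 1)
EMISSION = [[0.7, 0.3],     # P(observation | state = 0)
            [0.1, 0.9]]     # P(observation | state = 1)
PRIOR = [0.5, 0.5]          # belief before the first transition


def update_belief(belief, observation):
    """One Bayes-filter step: new belief from (old belief, next observation)."""
    n = len(belief)
    # Predict: push the belief through the transition matrix.
    predicted = [sum(belief[i] * TRANSITION[i][j] for i in range(n)) for j in range(n)]
    # Correct: weight by the likelihood of the observation, then normalize.
    unnormalized = [predicted[j] * EMISSION[j][observation] for j in range(n)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def belief_from_history(observations):
    """Fold the whole history into a fixed-size belief vector."""
    belief = PRIOR
    for obs in observations:
        belief = update_belief(belief, obs)
    return belief


def predict_next_observation(belief):
    """P(next observation | belief); needs the belief only, not the history."""
    n = len(belief)
    predicted = [sum(belief[i] * TRANSITION[i][j] for i in range(n)) for j in range(n)]
    return [sum(predicted[j] * EMISSION[j][o] for j in range(n)) for o in range(2)]


history = [0, 0, 1, 1, 1, 0]
belief = belief_from_history(history)
next_obs_probs = predict_next_observation(belief)
```

The belief vector has a fixed size regardless of how long `history` is, which is the sense in which a belief state compresses the past.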

What carries the argument

The next-latent prediction auxiliary loss, jointly optimized with the primary next-token loss, which drives convergence of latents to belief states.
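
As a reading aid, the joint objective can be written schematically as below. This page only states that training combines next-token cross-entropy with a weighted auxiliary term, and that a transition model $p_\psi$ predicts $\hat{h}_{t+1}$ from the current latent and the next token. The weight $\lambda$, the discrepancy $d$, and the exact summation ranges are assumptions introduced here for illustration, not the paper's definitions.

```latex
\mathcal{L}(\theta, \psi)
  = \underbrace{-\sum_{t} \log p_\theta\!\left(x_{t+1} \mid h_t\right)}_{\text{next-token loss}}
  \;+\; \lambda \,
    \underbrace{\sum_{t} d\!\left(\hat{h}_{t+1},\, h_{t+1}\right)}_{\text{next-latent loss}},
\qquad
\hat{h}_{t+1} = p_\psi\!\left(h_t,\, x_{t+1}\right)
```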

If this is right

  • The learned latents become more compressed representations of history.
  • Downstream accuracy improves on world modeling, reasoning, planning, and language modeling tasks.
  • Variable-length self-speculative decoding becomes possible, accelerating inference by up to 3.3x in language modeling.
  • The transformer architecture, parallel training efficiency, and inference procedure remain unchanged.
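
This page does not describe how NextLat drafts or verifies tokens, so the following is only a generic greedy draft-and-verify loop, sketched to show why the number of accepted tokens per step can vary. The toy predictors `draft_next` and `verify_next`, and the parameter `max_draft`, are hypothetical stand-ins for a cheap drafting mechanism and the full model.

```python
def speculative_step(prefix, draft_next, verify_next, max_draft):
    """Draft up to max_draft tokens cheaply, keep the longest verified prefix.

    draft_next / verify_next map a token list to the next token (greedy).
    Returns the tokens appended in this step (always at least one).
    """
    # 1) Draft: roll the cheap predictor forward.
    drafted = []
    for _ in range(max_draft):
        drafted.append(draft_next(prefix + drafted))
    # 2) Verify: compare each drafted token with the verifier's greedy choice.
    #    (A real transformer would score all drafted positions in one parallel pass.)
    accepted = []
    for token in drafted:
        target = verify_next(prefix + accepted)
        if token != target:
            accepted.append(target)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
    # 3) Every drafted token matched; add the verifier's own next token.
    accepted.append(verify_next(prefix + accepted))
    return accepted


def generate(prompt, draft_next, verify_next, max_draft, num_tokens):
    """Repeat speculative steps until num_tokens new tokens exist."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        tokens += speculative_step(tokens, draft_next, verify_next, max_draft)
    return tokens[: len(prompt) + num_tokens]


# Toy predictors (hypothetical): the verifier counts modulo 5; the draft
# predictor agrees except right after a 3.
def verify_next(tokens):
    return (tokens[-1] + 1) % 5


def draft_next(tokens):
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 5


output = generate([0], draft_next, verify_next, max_draft=4, num_tokens=10)
```

With greedy verification the output matches what the verifier alone would produce; the speed-up in a real system comes from checking several drafted tokens in one forward pass.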

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same auxiliary objective might be added to other non-recurrent sequence models to induce similar compression.
  • Improved lookahead planning may stem directly from the enforced consistency of the learned transition dynamics.
  • Smaller models trained with NextLat could match the effective capacity of larger models trained only on next-token loss.

Load-bearing premise

Jointly optimizing the auxiliary latent-prediction loss with the main next-token loss will produce convergence to belief states without the auxiliary term dominating or destabilizing training.

What would settle it

Train a transformer under the NextLat objective and check whether its latent representations fail to compress history or produce inconsistent next-state predictions across different sequences of the same underlying process.
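
A minimal sketch of such a check on a toy process, assuming the ground-truth state is known. The parity process and the two encoders are hypothetical stand-ins for a trained model's latents; real latents are continuous vectors, so a practical version would compare distances or train probes rather than test exact equality.

```python
from collections import defaultdict
from itertools import product


def true_state(history):
    """Ground-truth state of the toy process: parity of the bits seen so far."""
    return sum(history) % 2


def distinct_latents_per_state(encode, max_len):
    """Count how many different latents the encoder assigns to each true state."""
    latents = defaultdict(set)
    for length in range(1, max_len + 1):
        for history in product([0, 1], repeat=length):
            latents[true_state(history)].add(encode(history))
    return {state: len(values) for state, values in latents.items()}


# Hypothetical stand-ins for a trained model's latent h_t.
def compressed_encoder(history):
    return sum(history) % 2   # keeps only what the future depends on


def uncompressed_encoder(history):
    return sum(history)       # also remembers how many ones were seen


compressed = distinct_latents_per_state(compressed_encoder, max_len=6)
uncompressed = distinct_latents_per_state(uncompressed_encoder, max_len=6)
```

An encoder that behaves like a belief state assigns one latent per true state, while the uncompressed encoder spreads each state across several latents even though both are sufficient for prediction.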

Figures

Figures reproduced from arXiv: 2511.05963 by Akshay Krishnamurthy, Alex Lamb, Edward S. Hu, Jayden Teoh, John Langford, Kwangjun Ahn, Manan Tomar, Pratyusha Sharma, Riashat Islam, Tim Pearce.

Figure 1: Reconstructed maps from sequences generated by transformers trained on Manhattan taxi rides […]
Figure 2: Illustration of different predictive mechanisms at time step $t = 3$. Other methods supervise only the token-level emissions, leaving intermediate latent representations implicit. In contrast, NextLat explicitly learns latent dynamics that predict the hidden state $\hat{h}_{t+1}$ from $(h_t, x_{t+1})$. Token-level supervision is then applied to $\hat{h}_{t+1}$. Therefore, accurate multi-token predictions (beyond the next token) em…
Figure 3: Reconstructed maps from transformers trained on Manhattan taxi rides using different objectives.
Figure 4: Performance on Countdown. Best result is bolded, and second best is underlined. [Plot: validity (%) for Eq. 1, Eq. 2, Eq. 3; series GPT, BST, MTP, JTP, NextLat]
Figure 5: Validity of equations (i.e., LHS = RHS) generated on Countdown. All models in this plot use $d = 1$. Setup. Following Gandhi et al. [2024], we generate 500k training problems with target numbers ranging from 10 to 100 and reserve 10% of the targets for out-of-distribution evaluation. During both training and testing, we insert eight ‘pause tokens’ [Goyal et al., 2023] after the target number, allowing models…
Figure 7: Illustration of a $G_{5,5}$ Path-Star graph [Bachmann and Nagarajan, 2024].
Figure 8: Cross-entropy loss difference relative to GPT, obtained from linear probes trained on frozen hidden states
Figure 9: Illustration of the latent transition model $p_\psi$. We parameterize the latent transition model $p_\psi$ with a three-layer MLP using GELU [Hendrycks and Gimpel, 2023] activations. The latent transition model takes as input the layer-normalized [Ba et al., 2016] concatenation of the current hidden state $h_t$ and next-token embedding $X_{t+1}$, and outputs a delta update applied via residual connection: $\hat{h}_{t+1} = p_\psi(h_t, X_{t+1}$…
Figure 10: Full plot version of […]
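
The Figure 9 caption describes the latent transition model in enough detail to sketch its forward pass. The NumPy version below is an illustration under stated assumptions, not the authors' implementation: the layer widths, the random initialization, the absence of learned LayerNorm scale and shift, and the choice to apply GELU only between layers are all guesses made here.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def init_transition_params(hidden_dim, embed_dim, width, rng):
    """Three linear layers (assumed sizes): concat -> width -> width -> hidden."""
    sizes = [(hidden_dim + embed_dim, width), (width, width), (width, hidden_dim)]
    return [(rng.normal(0.0, fan_in**-0.5, (fan_in, fan_out)), np.zeros(fan_out))
            for fan_in, fan_out in sizes]


def predict_next_latent(params, h_t, next_token_embedding):
    """h_hat_{t+1} = h_t + MLP(LayerNorm([h_t ; emb(x_{t+1})]))."""
    x = layer_norm(np.concatenate([h_t, next_token_embedding], axis=-1))
    for index, (weight, bias) in enumerate(params):
        x = x @ weight + bias
        if index < len(params) - 1:
            x = gelu(x)  # assumption: GELU between layers, none after the output layer
    return h_t + x  # residual connection: the MLP output is a delta update


rng = np.random.default_rng(0)
params = init_transition_params(hidden_dim=8, embed_dim=4, width=16, rng=rng)
h_t = rng.normal(size=(2, 5, 8))        # (batch, time, hidden)
emb_next = rng.normal(size=(2, 5, 4))   # embedding of x_{t+1} at each position
h_hat_next = predict_next_latent(params, h_t, emb_next)
```

Because the function acts on the last axis only, it can be applied to every position of a sequence at once, which is consistent with the claim that parallel training is preserved.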
Original abstract

Transformers replace recurrence with a memory that grows with sequence length and self-attention that enables ad-hoc lookups over past tokens. Consequently, they lack an inherent incentive to compress history into compact latent states with consistent transition rules. This often leads to learning solutions that generalize poorly. We introduce Next-Latent Prediction (NextLat), which extends standard next-token training with self-supervised predictions in the latent space. Specifically, NextLat trains a transformer to learn latent representations that are predictive of its next latent state given the next token. Theoretically, we show that these latents provably converge towards belief states, compressed information about the history necessary to predict the future. This simple auxiliary objective injects a recurrent inductive bias into transformers while leaving their architecture, parallel training efficiency, and inference unchanged. NextLat effectively encourages transformers to form compact internal world models with coherent belief states and transition dynamics -- crucial properties not guaranteed by standard next-token prediction alone. Empirically, across benchmarks in world modeling, reasoning, planning, and language modeling, NextLat demonstrates significant gains over standard next-token prediction and other baselines in downstream accuracy, representation compression, and lookahead planning. Furthermore, NextLat enables variable-length self-speculative decoding, accelerating inference by up to 3.3x in language modeling. NextLat offers a simple yet effective paradigm for learning compact, predictive representations in transformers that generalize better. Our code is available at https://github.com/microsoft/NextLat.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Next-Latent Prediction (NextLat), which augments standard next-token cross-entropy training of transformers with an auxiliary self-supervised loss that predicts the next latent representation given the next token. It claims that the resulting latents provably converge to belief states (compressed sufficient statistics of history for future prediction), inject a recurrent inductive bias without altering architecture or inference, and yield empirical gains in world modeling, reasoning, planning, language modeling, representation compression, and up to 3.3x faster self-speculative decoding.

Significance. If the convergence result holds under the joint objective actually used in training and the reported gains are robust to controls, the method would offer a lightweight way to encourage compact, predictive internal world models in transformers while preserving their parallel training advantages. The combination of a theoretical fixed-point argument with downstream improvements in lookahead planning and variable-length decoding would be a notable contribution to representation learning for sequential decision-making tasks.

major comments (2)
  1. [Abstract / theoretical analysis section] The theoretical claim that latents converge to belief states is derived only for the auxiliary latent-prediction objective (abstract and the paragraph beginning 'Theoretically, we show...'). No argument is supplied that the stationary points of the combined loss (next-token cross-entropy plus weighted auxiliary term) remain belief states; the next-token term could in principle select representations that are locally predictive of the immediate token but violate the belief-state fixed-point condition. This gap is load-bearing for the central 'provably converge' claim.
  2. [Experiments section] The empirical sections report gains on world-modeling, reasoning, and planning benchmarks, but the manuscript does not include an ablation that isolates whether the auxiliary loss alone (without the next-token term) already produces the claimed belief-state convergence, nor does it verify that the learned latents satisfy the belief-state property on the actual trained models.
minor comments (2)
  1. [Abstract] The abstract states 'Our code is available at https://github.com/microsoft/NextLat' but does not specify the commit or tag corresponding to the reported experiments.
  2. [Theoretical analysis] Notation for the latent transition and belief-state definitions should be introduced with explicit equations rather than prose descriptions to allow direct comparison with the derived fixed-point condition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the constructive feedback on our manuscript. We address the two major comments below and plan to make revisions to strengthen the theoretical claims and experimental validation.

Point-by-point responses
  1. Referee: [Abstract / theoretical analysis section] The theoretical claim that latents converge to belief states is derived only for the auxiliary latent-prediction objective (abstract and the paragraph beginning 'Theoretically, we show...'). No argument is supplied that the stationary points of the combined loss (next-token cross-entropy plus weighted auxiliary term) remain belief states; the next-token term could in principle select representations that are locally predictive of the immediate token but violate the belief-state fixed-point condition. This gap is load-bearing for the central 'provably converge' claim.

    Authors: The referee correctly identifies that our theoretical analysis derives convergence to belief states specifically for the auxiliary latent-prediction objective. We did not provide a proof that the stationary points of the joint loss are belief states. In the revised manuscript, we will update the abstract and the theoretical section to state accurately that the auxiliary objective has belief states as fixed points, and that the combined training is intended to encourage this property while maintaining next-token prediction performance. We will also add a discussion of why the next-token loss is not expected to violate the fixed-point condition under suitable choices of the auxiliary weight. revision: yes

  2. Referee: [Experiments section] The empirical sections report gains on world-modeling, reasoning, and planning benchmarks, but the manuscript does not include an ablation that isolates whether the auxiliary loss alone (without the next-token term) already produces the claimed belief-state convergence, nor does it verify that the learned latents satisfy the belief-state property on the actual trained models.

    Authors: We agree that the manuscript would benefit from an ablation isolating the auxiliary loss and direct verification of the belief-state property on the trained models. In the revision, we will add such an ablation study where possible (noting that training solely with the auxiliary loss may require adjustments for stability) and include metrics or checks to verify that the learned latents act as sufficient statistics for future predictions. This will help confirm the empirical realization of the theoretical property. revision: yes

Circularity Check

0 steps flagged

No circularity; theoretical claim presented as independent proof

full rationale

The paper states a theoretical result that latents converge to belief states under the auxiliary latent-prediction objective. No quoted equations or self-citations reduce this claim by construction to fitted inputs, renamed empirical patterns, or load-bearing prior work by the same authors. The joint next-token + auxiliary loss is acknowledged as the actual training objective, but the provided text frames the convergence as a separate proof rather than a statistical consequence of the fit itself. This is the normal case of a self-contained derivation; the skeptic concern is a potential applicability gap, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5821 in / 995 out tokens · 54763 ms · 2026-05-25T07:18:57.466192+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Theoretically, we show that these latents provably converge towards belief states, compressed information about the history necessary to predict the future. ... For these consistency objectives to be satisfied, $h_t$ must converge to a belief state

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Optimizing for next-token consistency ... and transition consistency ... ensures existence of measurable maps ... $h_t$ must jointly optimize toward a belief state

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling

    cs.LG 2026-04 unverdicted novelty 7.0

    Applying STP at consecutive semantic reasoning steps achieves 168x more accurate multi-step latent prediction on ProcessBench than frozen baselines, with trajectories forming smooth curves best captured by non-linear ...

  2. Improving Sampling for Masked Diffusion Models via Information Gain

    cs.CL 2026-02 unverdicted novelty 7.0

    Info-Gain Sampler improves MDM decoding by using bidirectional information gain to reduce cumulative uncertainty, outperforming greedy samplers on reasoning accuracy and creative writing tasks.

  3. Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework

    cs.LG 2026-04 unverdicted novelty 6.0

    ST-PT turns transformers into explicit factor graphs for time series, enabling structural injection of symbolic priors, per-sample conditional generation, and principled latent autoregressive forecasting via MFVI iterations.

  4. The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning

    cs.LG 2026-04 unverdicted novelty 6.0

    LLMs discover latent planning strategies up to five steps during training and execute them up to eight steps at test time, with larger models reaching seven under few-shot prompting, revealing a dissociation between d...

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 4 Pith papers · 11 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur

    URL https://arxiv.org/abs/2503.21801. Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. Exploring length generalization in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Pro...

  3. [3]

    Layer Normalization

    URL https://arxiv.org/abs/1607.06450. Gregor Bachmann and Vaishnavh Nagarajan. The pitfalls of next-token prediction. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Mach...

  4. [4]

    Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred Warmuth

    URL https://arxiv.org/abs/2304.12210. Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred Warmuth. Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension. In Proceedings of the eighteenth annual ACM symposium on Theory of computing, pages 273–282,

  5. [5]

    [Online; accessed 12-October-2025]

    URL https://en.wikipedia.org/wiki/Countdown_(game_show). [Online; accessed 12-October-2025]. Kenneth James Williams Craik. The nature of explanation, volume

  6. [6]

    Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D

    URL https://proceedings.neurips.cc/paper_files/paper/1993/file/e0ec453e28e061cc58ac43f91dc2f3f0-Paper.pdf. Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. Faith an...

  7. [7]

    TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

    URL https://proceedings.neurips.cc/paper_files/paper/2024/file/75b0edb869e2cd509d64d0e8ff446bc1-Paper-Conference.pdf. Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English? arXiv preprint arXiv:2305.07759,

  8. [8]

    Bruce A Francis and Walter Murray Wonham

    URL https://arxiv.org/abs/2510.17558. Bruce A Francis and Walter Murray Wonham. The internal model principle of control theory. Automatica, 12(5):457–465,

  9. [9]

    Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D Goodman

    URL https://arxiv.org/abs/2403.08540. Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D Goodman. Stream of search (sos): Learning to search in language. arXiv preprint arXiv:2404.03683,

  10. [10]

    Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226,

    Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226,

  11. [11]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    URL https://arxiv.org/abs/2312.00752. Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces,

  12. [12]

    Efficiently Modeling Long Sequences with Structured State Spaces

    URL https://arxiv.org/abs/2111.00396. Wes Gurnee and Max Tegmark. Language models represent space and time. In The Twelfth International Conference on Learning Representations,

  13. [13]

    World Models

    URL https://openreview.net/forum?id=jE8xbmvFin. David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3),

  14. [14]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603,

  15. [15]

    Lillicrap, Mohammad Norouzi, and Jimmy Ba

    Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7,

  16. [16]

    doi: 10.1038/s41586-025-08744-2

    ISSN 1476-4687. doi: 10.1038/s41586-025-08744-2. URL https://doi.org/10.1038/s41586-025-08744-2. Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control,

  17. [17]

    Temporal difference learning for model predictive control

    URL https://arxiv.org/abs/2203.04955. Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs),

  18. [18]

    Gaussian Error Linear Units (GELUs)

    URL https://arxiv.org/abs/1606.08415. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,

  19. [19]

    Huang, P., Liu, S., Liu, Z., Yan, Y., Wang, S., Chen, Z., and Xiao, T

    URL https://arxiv.org/abs/2410.23506. Hai Huang, Yann LeCun, and Randall Balestriero. LLM-JEPA: Large language models meet joint embedding predictive architectures. arXiv preprint arXiv:2509.14252,

  20. [20]

    Jamba: A Hybrid Transformer-Mamba Language Model

    URL https://arxiv.org/abs/2403.19887. Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437,

  21. [21]

    Ilya Loshchilov and Frank Hutter

    URL https://arxiv.org/abs/2203.01205. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization,

  22. [22]

    Decoupled Weight Decay Regularization

    URL https://arxiv.org/abs/1711.05101. Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, and Michael Gastpar. Attention with Markov: A curious case of single-layer transformers. In The Thirteenth International Conference on Learning Representations,

  23. [23]

    URL https://aclanthology.org/2022.tacl-1.49/

    doi: 10.1162/tacl_a_00493. URL https://aclanthology.org/2022.tacl-1.49/. William Merrill, Jackson Petty, and Ashish Sabharwal. The illusion of state in state-space models,

  24. [24]

    R Chris Miall and Daniel M Wolpert

    URL https://arxiv.org/abs/2404.08819. R Chris Miall and Daniel M Wolpert. Forward models for physiological motor control. Neural Networks, 9(8):1265–1279,

  25. [25]

    Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction. arXiv preprint arXiv:2504.15266,

    Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, and Aditi Raghunathan. Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction. arXiv preprint arXiv:2504.15266,

  26. [26]

    Roma Patel and Ellie Pavlick

    URL https://arxiv.org/abs/2402.04248. Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. In International Conference on Learning Representations,

  27. [27]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    URL https://arxiv.org/abs/2201.02177. Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling,

  28. [28]

    URL https://arxiv.org/abs/2007.05929. Charlotte Striebel. Sufficient statistics in the optimum control of stochastic systems. Journal of Mathematical Analysis and Applications, 12(3):576–592,

  29. [29]

    Keyon Vafa, Justin Y Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan

    URL https://arxiv.org/abs/2409.05816. Keyon Vafa, Justin Y Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan. Evaluating the world model implicit in a generative model. Advances in Neural Information Processing Systems, 37:26941–26975,

  30. [30]

    Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim

    URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. In Kevin Duh, H...

  31. [31]

    doi: 10.18653/v1/2024.naacl-long.102

    Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.102. URL https://aclanthology.org/2024.naacl-long.102/. Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Ha...

  32. [32]

    Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong

    URL https://proceedings.neurips.cc/paper_files/paper/2023/file/271db9922b8d1f4dd7aaef84ed5ac703-Paper-Conference.pdf. Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Beyond autoregression: Discrete diffusion for complex reasoning and planning. In The Thirteenth International Conference on Learning Representations,

  33. [33]

    Learning invariant representations for reinforcement learning without reconstruction. arXiv preprint arXiv:2006.10742,

    Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. arXiv preprint arXiv:2006.10742,

  34. [34]

    A Belief States in Sequence Modeling. Recent work has introduced variants of sequence modeling architectures based on the principle of learning belief states, i.e., BST and JTP

    URL https://arxiv.org/abs/2306.10125. A Belief States in Sequence Modeling. Recent work has introduced variants of sequence modeling architectures based on the principle of learning belief states, i.e., BST and JTP. We review these methods here. Let $\theta$ denote the parameters of a transformer-based model. Let $h_{s:t}$ denote the hidden states produced by the tra...

  35. [35]

    short-horizon belief states

    suggest that their method learns “short-horizon belief states”, they do not formally define the conditions under which this occurs. To understand how JTP learns belief states, we start by defining a $k$-observable system. Definition A.1 ($k$-observability for sequences). A system is $k$-observable if for any two sequences $H = X_{1:t}$ and $H' = X_{1:j}$ that induce the same jo...

  36. [36]

    A formal proof proceeds by backward induction on t

    ensures the existence of measurable maps $p_\theta$ and $p_\psi$ that allow recursive decoding of future tokens: $h_t \xrightarrow[\text{decode token}]{p_\theta} X_{t+1} \xrightarrow[\text{update state}]{p_\psi} h_{t+1} \xrightarrow[\text{decode token}]{p_\theta} X_{t+2} \xrightarrow[\text{update state}]{p_\psi} h_{t+2} \cdots \xrightarrow{p_\theta} X_T$. A formal proof proceeds by backward induction on $t$. For the base case $t = T-1$, the claim follow...

  37. [37]

    grokking

    # (B,1,1)
    next_tokens = batch
    next_states = hidden_states
    current_states = hidden_states
    loss_next_h = 0
    loss_kl = 0

    # Recursive multi-step predictions
    for _ in range(multi_step_horizon):
        # Shift hidden states back by 1 using dummy initial state, similar to RNNs
        current_states = torch.cat([initial_hidden, current_states[:, :-1]], dim=...

  38. [38]

    Singular values smaller than 1e−12 are discarded, and the effective rank is then computed following Roy and Vetterli [2007]

    through the model to obtain the hidden state matrix. Singular values smaller than 1e−12 are discarded, and the effective rank is then computed following Roy and Vetterli [2007]. For GPT and NextLat, we use the final-layer hidden states. For JTP, we extract the hidden states immediately before the self-attention module in the Fetch head (see Equations 4–5 ...

  39. [39]

    Each problem consists of four input numbers and a solution sequence comprising three equations, consistent with prior work [Gandhi et al., 2024, Ye et al., 2025]

    for the Countdown training and evaluation setup. Each problem consists of four input numbers and a solution sequence comprising three equations, consistent with prior work [Gandhi et al., 2024, Ye et al., 2025]. A training example is formatted as 14,83,88,91,23|83−14 = 69,91−88 = 3,69/3 = 23 where the first four numbers are the inputs, the fifth is the ta...

  40. [40]

    This difference accounts for the performance gap observed in the BST and JTP baselines in Figure

    which generate a fresh set of graphs every batch, we adopt the original, more challenging setup of Bachmann and Nagarajan [2024], which uses a fixed sample size of 200k and node values sampled from $N = 100$. This difference accounts for the performance gap observed in the BST and JTP baselines in Figure

  41. [41]

    Because the task’s sample space grows exponentially with graph size, identifying the correct algorithm that generalizes across all graph instances is highly nontrivial

    The Path-Star experiment is designed to expose the myopic behavior of teacher-forced next-token prediction, which can encourage models to exploit superficial regularities, an effect referred to as the Clever Hans cheat [Bachmann and Nagarajan, 2024]. Because the task’s sample space grows exponentially with graph size, identifying the correct algorithm tha...
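
Entry 38 above describes the representation-compression metric: singular values below 1e−12 are discarded and the effective rank is computed following Roy and Vetterli [2007]. A minimal NumPy sketch of that definition (the exponential of the Shannon entropy of the normalized singular values) is given below. The function name, the toy matrices, and the assumption that rows are positions and columns are hidden dimensions are illustrative choices, not details from the paper.

```python
import numpy as np


def effective_rank(hidden_states, threshold=1e-12):
    """exp(entropy) of the normalized singular-value distribution."""
    singular_values = np.linalg.svd(hidden_states, compute_uv=False)
    # Discard numerically negligible singular values before normalizing.
    singular_values = singular_values[singular_values > threshold]
    p = singular_values / singular_values.sum()
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))


# Four equal singular values: the effective rank equals the true rank.
identity_rank = effective_rank(np.eye(4))
# A rank-one matrix: all variation lies along a single direction.
rank_one = effective_rank(np.outer(np.arange(1.0, 6.0), np.ones(3)))
```

A lower effective rank for the same task indicates that the hidden states occupy fewer directions, which is how the page's compression claim would show up in this metric.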