Next-Latent Prediction Transformers Learn Compact World Models
Pith reviewed 2026-05-25 07:18 UTC · model grok-4.3
The pith
Adding a next-latent prediction loss makes transformer internal states converge to belief states that compress history for future prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NextLat extends standard next-token training with self-supervised predictions in the latent space, training the model to predict its next latent state given the next token. The paper shows theoretically that the resulting latents converge to belief states, which are compressed summaries of history sufficient for predicting future observations. This auxiliary objective injects a recurrent inductive bias into transformers, encouraging formation of compact internal world models with coherent transition dynamics that standard next-token prediction does not guarantee.
What carries the argument
The next-latent prediction auxiliary loss, jointly optimized with the primary next-token loss, which drives convergence of latents to belief states.
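One way to write the joint objective down, as a sketch rather than the paper's exact formulation: let h_t be the latent at position t, p_θ the next-token head, and p_ψ a latent transition map that predicts h_{t+1} from h_t and the next token x_{t+1}. The squared-error form of the latent term and the single weight λ are assumptions made here for illustration; the paper may use a different distance, stop-gradients, or additional terms.

```latex
\mathcal{L}(\theta, \psi)
  = \underbrace{-\sum_{t} \log p_\theta\!\left(x_{t+1} \mid h_t\right)}_{\text{next-token loss}}
  \;+\; \lambda \,
    \underbrace{\sum_{t} \left\lVert p_\psi\!\left(h_t, x_{t+1}\right) - h_{t+1} \right\rVert_2^2}_{\text{next-latent loss}}
```

Read this way, the referee's concern further down is about whether minimizers of the whole sum, not just of the second term, are belief states.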
If this is right
- The learned latents become more compressed representations of history.
- Downstream accuracy improves on world modeling, reasoning, planning, and language modeling tasks.
- Variable-length self-speculative decoding becomes possible, accelerating inference up to 3.3x.
- The transformer architecture, parallel training efficiency, and inference procedure remain unchanged.
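The speculative-decoding consequence can be illustrated with a generic greedy draft-and-verify loop. This is a minimal sketch, not the paper's procedure: `draft_next` stands in for a cheap drafter (in NextLat's case presumably a rollout of the learned latent transition), `target_next` stands in for the full model, and both names and their context-to-token interface are hypothetical. The number of tokens accepted per round varies with how long the draft stays correct.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: maps a token context to a greedy next token.
NextToken = Callable[[Sequence[int]], int]


def accept_draft(draft: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Keep draft tokens until the first disagreement, then take the verifier's token."""
    accepted: List[int] = []
    for drafted, checked in zip(draft, verified):
        accepted.append(checked)  # equals `drafted` whenever the two agree
        if drafted != checked:
            break
    return accepted


def speculative_decode(prefix: Sequence[int], target_next: NextToken,
                       draft_next: NextToken, max_new: int,
                       max_draft: int = 4) -> List[int]:
    """Greedy decoding that accepts a variable number of drafted tokens per round."""
    out = list(prefix)
    goal = len(out) + max_new
    while len(out) < goal:
        # 1) Draft up to `max_draft` tokens with the cheap predictor.
        draft: List[int] = []
        for _ in range(min(max_draft, goal - len(out))):
            draft.append(draft_next(out + draft))
        # 2) Verify every drafted position with the full model. A transformer
        #    would do this in one parallel pass; here it is a plain loop.
        verified = [target_next(out + draft[:i]) for i in range(len(draft))]
        # 3) Accept the agreeing prefix plus one corrected token, if any.
        out.extend(accept_draft(draft, verified))
    return out


def toy_target(context: Sequence[int]) -> int:
    """Toy stand-in for the full model."""
    return (3 * context[-1] + 1) % 7


def toy_draft(context: Sequence[int]) -> int:
    """Toy drafter that is wrong whenever the last token is 4."""
    return 0 if context[-1] == 4 else toy_target(context)
```

Because every accepted token is either confirmed or supplied by `target_next`, the output matches plain greedy decoding with the full model; the drafter only changes how many full-model rounds are needed.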
Where Pith is reading between the lines
- The same auxiliary objective might be added to other non-recurrent sequence models to induce similar compression.
- Improved lookahead planning may stem directly from the enforced consistency of the learned transition dynamics.
- Smaller models trained with NextLat could match the effective capacity of larger models trained only on next-token loss.
Load-bearing premise
Jointly optimizing the auxiliary latent-prediction loss with the main next-token loss will produce convergence to belief states without the auxiliary term dominating or destabilizing training.
What would settle it
Train a transformer under the NextLat objective and check whether its latent representations fail to compress history or produce inconsistent next-state predictions across different sequences of the same underlying process.
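One hedged way to operationalize the consistency half of this check, assuming a synthetic task where the ground-truth state of the underlying process is known at every position: group the latents by true state and measure how far they scatter around each state's centroid. The function name and the choice of squared Euclidean distance are illustrative assumptions, not something the paper specifies.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def within_state_spread(latents: Sequence[Sequence[float]],
                        states: Sequence[Hashable]) -> float:
    """Mean squared distance of each latent to the centroid of its true state.

    Values near zero mean that different histories reaching the same
    underlying state are mapped to (nearly) the same latent.
    """
    groups: Dict[Hashable, List[Sequence[float]]] = defaultdict(list)
    for latent, state in zip(latents, states):
        groups[state].append(latent)

    total, count = 0.0, 0
    for vectors in groups.values():
        dim = len(vectors[0])
        centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        for v in vectors:
            total += sum((a - b) ** 2 for a, b in zip(v, centroid))
            count += 1
    return total / count
```

A large spread relative to the distance between state centroids would count as evidence against the compression claim; a small spread alone does not prove sufficiency for future prediction.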
Original abstract
Transformers replace recurrence with a memory that grows with sequence length and self-attention that enables ad-hoc lookups over past tokens. Consequently, they lack an inherent incentive to compress history into compact latent states with consistent transition rules. This often leads to learning solutions that generalize poorly. We introduce Next-Latent Prediction (NextLat), which extends standard next-token training with self-supervised predictions in the latent space. Specifically, NextLat trains a transformer to learn latent representations that are predictive of its next latent state given the next token. Theoretically, we show that these latents provably converge towards belief states, compressed information about the history necessary to predict the future. This simple auxiliary objective injects a recurrent inductive bias into transformers while leaving their architecture, parallel training efficiency, and inference unchanged. NextLat effectively encourages transformers to form compact internal world models with coherent belief states and transition dynamics -- crucial properties not guaranteed by standard next-token prediction alone. Empirically, across benchmarks in world modeling, reasoning, planning, and language modeling, NextLat demonstrates significant gains over standard next-token prediction and other baselines in downstream accuracy, representation compression, and lookahead planning. Furthermore, NextLat enables variable-length self-speculative decoding, accelerating inference by up to 3.3x in language modeling. NextLat offers a simple yet effective paradigm for learning compact, predictive representations in transformers that generalize better. Our code is available at https://github.com/microsoft/NextLat.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Next-Latent Prediction (NextLat), which augments standard next-token cross-entropy training of transformers with an auxiliary self-supervised loss that predicts the next latent representation given the next token. It claims that the resulting latents provably converge to belief states (compressed sufficient statistics of history for future prediction), inject a recurrent inductive bias without altering architecture or inference, and yield empirical gains in world modeling, reasoning, planning, language modeling, representation compression, and up to 3.3x faster self-speculative decoding.
Significance. If the convergence result holds under the joint objective actually used in training and the reported gains are robust to controls, the method would offer a lightweight way to encourage compact, predictive internal world models in transformers while preserving their parallel training advantages. The combination of a theoretical fixed-point argument with downstream improvements in lookahead planning and variable-length decoding would be a notable contribution to representation learning for sequential decision-making tasks.
major comments (2)
- [Abstract / theoretical analysis section] The theoretical claim that latents converge to belief states is derived only for the auxiliary latent-prediction objective (abstract and the paragraph beginning 'Theoretically, we show...'). No argument is supplied that the stationary points of the combined loss (next-token cross-entropy plus weighted auxiliary term) remain belief states; the next-token term could in principle select representations that are locally predictive of the immediate token but violate the belief-state fixed-point condition. This gap is load-bearing for the central 'provably converge' claim.
- [Experiments section] The empirical sections report gains on world-modeling, reasoning, and planning benchmarks, but the manuscript does not include an ablation that isolates whether the auxiliary loss alone (without the next-token term) already produces the claimed belief-state convergence, nor does it verify that the learned latents satisfy the belief-state property on the actual trained models.
minor comments (2)
- [Abstract] The abstract states 'Our code is available at https://github.com/microsoft/NextLat' but does not specify the commit or tag corresponding to the reported experiments.
- [Theoretical analysis] Notation for the latent transition and belief-state definitions should be introduced with explicit equations rather than prose descriptions to allow direct comparison with the derived fixed-point condition.
Simulated Author's Rebuttal
We are grateful to the referee for the constructive feedback on our manuscript. We address the two major comments below and plan to make revisions to strengthen the theoretical claims and experimental validation.
-
Referee: [Abstract / theoretical analysis section] The theoretical claim that latents converge to belief states is derived only for the auxiliary latent-prediction objective (abstract and the paragraph beginning 'Theoretically, we show...'). No argument is supplied that the stationary points of the combined loss (next-token cross-entropy plus weighted auxiliary term) remain belief states; the next-token term could in principle select representations that are locally predictive of the immediate token but violate the belief-state fixed-point condition. This gap is load-bearing for the central 'provably converge' claim.
Authors: The referee correctly identifies that our theoretical analysis derives the convergence to belief states specifically for the auxiliary latent-prediction objective. We did not provide a proof that the stationary points of the joint loss are belief states. In the revised manuscript, we will revise the abstract and the theoretical section to accurately reflect that the auxiliary objective has belief states as fixed points, and that the combined training is intended to encourage this property while maintaining next-token prediction performance. We will also add a discussion on why the next-token loss is not expected to violate the fixed-point condition under suitable hyperparameter choices for the auxiliary weight. revision: yes
-
Referee: [Experiments section] The empirical sections report gains on world-modeling, reasoning, and planning benchmarks, but the manuscript does not include an ablation that isolates whether the auxiliary loss alone (without the next-token term) already produces the claimed belief-state convergence, nor does it verify that the learned latents satisfy the belief-state property on the actual trained models.
Authors: We agree that the manuscript would benefit from an ablation isolating the auxiliary loss and direct verification of the belief-state property on the trained models. In the revision, we will add such an ablation study where possible (noting that training solely with the auxiliary loss may require adjustments for stability) and include metrics or checks to verify that the learned latents act as sufficient statistics for future predictions. This will help confirm the empirical realization of the theoretical property. revision: yes
Circularity Check
No circularity; theoretical claim presented as independent proof
The paper states a theoretical result that latents converge to belief states under the auxiliary latent-prediction objective. No quoted equations or self-citations reduce this claim by construction to fitted inputs, renamed empirical patterns, or load-bearing prior work by the same authors. The joint next-token + auxiliary loss is acknowledged as the actual training objective, but the provided text frames the convergence as a separate proof rather than a statistical consequence of the fit itself. This is the normal case of a self-contained derivation; the skeptic concern is a potential applicability gap, not a circular reduction.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Theoretically, we show that these latents provably converge towards belief states, compressed information about the history necessary to predict the future. ... For these consistency objectives to be satisfied, h_t must converge to a belief state
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Optimizing for next-token consistency ... and transition consistency ... ensures existence of measurable maps ... h_t must jointly optimize toward a belief state
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
-
Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling
Applying STP at consecutive semantic reasoning steps achieves 168x more accurate multi-step latent prediction on ProcessBench than frozen baselines, with trajectories forming smooth curves best captured by non-linear ...
-
Improving Sampling for Masked Diffusion Models via Information Gain
Info-Gain Sampler improves MDM decoding by using bidirectional information gain to reduce cumulative uncertainty, outperforming greedy samplers on reasoning accuracy and creative writing tasks.
-
Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework
ST-PT turns transformers into explicit factor graphs for time series, enabling structural injection of symbolic priors, per-sample conditional generation, and principled latent autoregressive forecasting via MFVI iterations.
-
The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning
LLMs discover latent planning strategies up to five steps during training and execute them up to eight steps at test time, with larger models reaching seven under few-shot prompting, revealing a dissociation between d...
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,
-
[2]
URL https://arxiv.org/abs/2503.21801. Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. Exploring length generalization in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Pro...
-
[3]
URL https://arxiv.org/abs/1607.06450. Gregor Bachmann and Vaishnavh Nagarajan. The pitfalls of next-token prediction. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Mach...
-
[4]
Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred Warmuth
URL https://arxiv.org/abs/2304.12210. Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred Warmuth. Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension. In Proceedings of the eighteenth annual ACM symposium on Theory of computing, pages 273–282,
-
[5]
[Online; accessed 12-October-2025]
URL https://en.wikipedia.org/wiki/Countdown_(game_show). [Online; accessed 12-October-2025]. Kenneth James Williams Craik. The nature of explanation, volume
-
[6]
URL https://proceedings.neurips.cc/paper_files/paper/1993/file/e0ec453e28e061cc58ac43f91dc2f3f0-Paper.pdf. Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. Faith an...
-
[7]
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
URL https://proceedings.neurips.cc/paper_files/paper/2024/file/75b0edb869e2cd509d64d0e8ff446bc1-Paper-Conference.pdf. Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English? arXiv preprint arXiv:2305.07759,
-
[8]
Bruce A Francis and Walter Murray Wonham
URL https://arxiv.org/abs/2510.17558. Bruce A Francis and Walter Murray Wonham. The internal model principle of control theory. Automatica, 12(5):457–465,
-
[9]
URL https://arxiv.org/abs/2403.08540. Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D Goodman. Stream of search (SoS): Learning to search in language. arXiv preprint arXiv:2404.03683,
-
[10]
Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226,
Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226,
-
[11]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
URL https://arxiv.org/abs/2312.00752. Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces,
-
[12]
Efficiently Modeling Long Sequences with Structured State Spaces
URL https://arxiv.org/abs/2111.00396. Wes Gurnee and Max Tegmark. Language models represent space and time. In The Twelfth International Conference on Learning Representations,
-
[13]
URL https://openreview.net/forum?id=jE8xbmvFin. David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3),
-
[14]
Dream to Control: Learning Behaviors by Latent Imagination
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603,
-
[15]
Lillicrap, Mohammad Norouzi, and Jimmy Ba
Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7,
-
[16]
doi: 10.1038/s41586-025-08744-2
ISSN 1476-4687. doi: 10.1038/s41586-025-08744-2. URL https://doi.org/10.1038/s41586-025-08744-2. Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control,
-
[17]
Temporal difference learning for model predictive control
URL https://arxiv.org/abs/2203.04955. Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs),
-
[18]
Gaussian Error Linear Units (GELUs)
URL https://arxiv.org/abs/1606.08415. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,
-
[19]
Huang, P., Liu, S., Liu, Z., Yan, Y., Wang, S., Chen, Z., and Xiao, T
URL https://arxiv.org/abs/2410.23506. Hai Huang, Yann LeCun, and Randall Balestriero. LLM-JEPA: Large language models meet joint embedding predictive architectures. arXiv preprint arXiv:2509.14252,
-
[20]
Jamba: A Hybrid Transformer-Mamba Language Model
URL https://arxiv.org/abs/2403.19887. Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437,
-
[21]
Ilya Loshchilov and Frank Hutter
URL https://arxiv.org/abs/2203.01205. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization,
-
[22]
Decoupled Weight Decay Regularization
URL https://arxiv.org/abs/1711.05101. Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, and Michael Gastpar. Attention with Markov: A curious case of single-layer transformers. In The Thirteenth International Conference on Learning Representations,
-
[23]
URL https://aclanthology.org/2022.tacl-1.49/
doi: 10.1162/tacl_a_00493. URL https://aclanthology.org/2022.tacl-1.49/. William Merrill, Jackson Petty, and Ashish Sabharwal. The illusion of state in state-space models,
-
[24]
R Chris Miall and Daniel M Wolpert
URL https://arxiv.org/abs/2404.08819. R Chris Miall and Daniel M Wolpert. Forward models for physiological motor control. Neural Networks, 9(8):1265–1279,
-
[25]
Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, and Aditi Raghunathan. Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction. arXiv preprint arXiv:2504.15266,
-
[26]
URL https://arxiv.org/abs/2402.04248. Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. In International Conference on Learning Representations,
-
[27]
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
URL https://arxiv.org/abs/2201.02177. Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling,
-
[28]
URL https://arxiv.org/abs/2007.05929. Charlotte Striebel. Sufficient statistics in the optimum control of stochastic systems. Journal of Mathematical Analysis and Applications, 12(3):576–592,
-
[29]
Keyon Vafa, Justin Y Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan
URL https://arxiv.org/abs/2409.05816. Keyon Vafa, Justin Y Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan. Evaluating the world model implicit in a generative model. Advances in Neural Information Processing Systems, 37:26941–26975,
-
[30]
URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. In Kevin Duh, H...
-
[31]
doi: 10.18653/v1/2024.naacl-long.102
Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.102. URL https://aclanthology.org/2024.naacl-long.102/. Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Ha...
-
[32]
Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong
URL https://proceedings.neurips.cc/paper_files/paper/2023/file/271db9922b8d1f4dd7aaef84ed5ac703-Paper-Conference.pdf. Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Beyond autoregression: Discrete diffusion for complex reasoning and planning. In The Thirteenth International Conference on Learning Representations,
-
[33]
Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. arXiv preprint arXiv:2006.10742,
-
[34]
URL https://arxiv.org/abs/2306.10125. A. Belief States in Sequence Modeling. Recent work has introduced variants of sequence modeling architectures based on the principle of learning belief states, i.e., BST and JTP. We review these methods here. Let θ denote the parameters of a transformer-based model. Let h_{s:t} denote the hidden states produced by the tra...
-
[35]
suggest that their method learns “short-horizon belief states”, they do not formally define the conditions under which this occurs. To understand how JTP learns belief states, we start by defining a k-observable system. Definition A.1 (k-observability for sequences). A system is k-observable if for any two sequences H = X_{1:t} and H′ = X_{1:j} that induce the same jo...
-
[36]
A formal proof proceeds by backward induction on t
ensures the existence of measurable maps p_θ and p_ψ that allow recursive decoding of future tokens: $h_t \xrightarrow[\text{decode token}]{p_\theta} X_{t+1} \xrightarrow[\text{update state}]{p_\psi} h_{t+1} \xrightarrow[\text{decode token}]{p_\theta} X_{t+2} \xrightarrow[\text{update state}]{p_\psi} h_{t+2} \cdots \xrightarrow{p_\theta} X_T$. A formal proof proceeds by backward induction on t. For the base case t = T−1, the claim follow...
-
[37]
# (B,1,1)
next_tokens = batch
next_states = hidden_states
current_states = hidden_states
loss_next_h = 0
loss_kl = 0

# Recursive multi-step predictions
for _ in range(multi_step_horizon):
    # Shift hidden states back by 1 using dummy initial state, similar to RNNs
    current_states = torch.cat([initial_hidden, current_states[:, :-1]], dim=...
-
[38]
through the model to obtain the hidden state matrix. Singular values smaller than 1e−12 are discarded, and the effective rank is then computed following Roy and Vetterli [2007]. For GPT and NextLat, we use the final-layer hidden states. For JTP, we extract the hidden states immediately before the self-attention module in the Fetch head (see Equations 4–5 ...
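The effective-rank computation described in this excerpt can be sketched as follows. Roy and Vetterli define effective rank as the exponential of the Shannon entropy of the singular values normalized to sum to one. This sketch assumes the singular values have already been computed elsewhere (for example with an SVD routine) and takes them as a plain list; the function name and the `tol` parameter are illustrative, with the default matching the 1e−12 cutoff quoted above.

```python
import math
from typing import Sequence


def effective_rank(singular_values: Sequence[float], tol: float = 1e-12) -> float:
    """Effective rank: exp of the entropy of the normalized singular values."""
    # Discard singular values below the cutoff, as the excerpt describes.
    kept = [s for s in singular_values if s >= tol]
    if not kept:
        raise ValueError("no singular values at or above the tolerance")
    total = sum(kept)
    probs = [s / total for s in kept]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)
```

A matrix with four equal singular values has effective rank 4, while one dominated by a single direction has effective rank close to 1, which is why a lower value is read as a more compressed representation.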
-
[39]
for the Countdown training and evaluation setup. Each problem consists of four input numbers and a solution sequence comprising three equations, consistent with prior work [Gandhi et al., 2024, Ye et al., 2025]. A training example is formatted as 14,83,88,91,23|83−14 = 69,91−88 = 3,69/3 = 23 where the first four numbers are the inputs, the fifth is the ta...
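To make the example format concrete, here is a small checker for a Countdown string of this shape. It is a sketch under assumptions: the fifth number is taken to be the target (as the example suggests), every operand must come from the inputs or from an earlier result, division must be exact, and the last equation must produce the target. The function name is hypothetical and the paper's own evaluation code may apply different rules.

```python
import re

# One equation such as "83-14 = 69"; operands and result are non-negative integers.
EQUATION = re.compile(r"\s*(\d+)\s*([-+*/])\s*(\d+)\s*=\s*(\d+)\s*")


def check_countdown(example: str) -> bool:
    """Return True if the solution part is valid arithmetic that reaches the target."""
    # Normalize the Unicode minus sign used in the paper's rendering.
    problem, solution = example.replace("\u2212", "-").split("|")
    *inputs, target = [int(x) for x in problem.split(",")]
    available = list(inputs)
    result = None
    for equation in solution.split(","):
        match = EQUATION.fullmatch(equation)
        if match is None:
            return False
        a, op, b, claimed = int(match[1]), match[2], int(match[3]), int(match[4])
        # Each operand must be an unused input or an earlier intermediate result.
        for operand in (a, b):
            if operand not in available:
                return False
            available.remove(operand)
        if op == "/":
            if b == 0 or a % b != 0:
                return False
            value = a // b
        else:
            value = {"+": a + b, "-": a - b, "*": a * b}[op]
        if value != claimed:
            return False
        available.append(value)
        result = value
    return result == target
```

Applied to the training example above, the three equations consume 83 and 14, then 91 and 88, then the two intermediate results 69 and 3, ending on the target 23.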
-
[40]
This difference accounts for the performance gap observed in the BST and JTP baselines in Figure
which generate a fresh set of graphs every batch, we adopt the original, more challenging setup of Bachmann and Nagarajan [2024], which uses a fixed sample size of 200k and node values sampled from N = 100. This difference accounts for the performance gap observed in the BST and JTP baselines in Figure
-
[41]
The Path-Star experiment is designed to expose the myopic behavior of teacher-forced next-token prediction, which can encourage models to exploit superficial regularities, an effect referred to as the Clever Hans cheat [Bachmann and Nagarajan, 2024]. Because the task’s sample space grows exponentially with graph size, identifying the correct algorithm tha...