Next-Latent Prediction Transformers Learn Compact World Models

Akshay Krishnamurthy; Alex Lamb; Edward S. Hu; Jayden Teoh; John Langford; Kwangjun Ahn; Manan Tomar; Pratyusha Sharma; Riashat Islam; Tim Pearce

arxiv: 2511.05963 · v2 · pith:EREDXFXXnew · submitted 2025-11-08 · 💻 cs.LG


Pith reviewed 2026-05-25 07:18 UTC · model grok-4.3

classification 💻 cs.LG

keywords: transformers, world models, belief states, latent prediction, next-token prediction, compact representations, inductive bias, self-supervised learning


The pith

Adding a next-latent prediction loss makes transformer internal states converge to belief states that compress history for future prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes NextLat, an auxiliary objective that trains a transformer to predict its next latent representation from the current latent and the next token, in addition to standard next-token prediction. Without this term, transformers have no built-in pressure to maintain compact, consistent internal states across time steps. The authors prove that the joint objective forces the latents to converge toward belief states, the minimal information about past observations needed to predict future ones. This change is claimed to produce more coherent internal world models while preserving the original architecture, parallel training, and inference procedure. Readers would care because such representations are said to improve accuracy on tasks that require planning or reasoning over sequences.

Core claim

NextLat extends standard next-token training with self-supervised predictions in the latent space, training the model to predict its next latent state given the next token. The paper shows theoretically that the resulting latents converge to belief states, which are compressed summaries of history sufficient for predicting future observations. This auxiliary objective injects a recurrent inductive bias into transformers, encouraging formation of compact internal world models with coherent transition dynamics that standard next-token prediction does not guarantee.

What carries the argument

The next-latent prediction auxiliary loss, jointly optimized with the primary next-token loss, which drives convergence of latents to belief states.
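
The objects named above can be written down compactly. The block below is only a notational sketch: the sufficiency condition is the standard definition of a belief state, while the discrepancy $d$, the weight $\lambda$, and the exact form of the auxiliary term are assumptions introduced here for illustration and may differ from the paper's definitions.

```latex
% Belief state: a summary b_t of the history that is sufficient for the future.
b_t = f(x_{1:t}) \quad \text{such that} \quad
P(x_{t+1:T} \mid x_{1:t}) = P(x_{t+1:T} \mid b_t)

% Latent transition: predict the next hidden state from the current one
% and the next token.
\hat{h}_{t+1} = p_\psi(h_t, x_{t+1})

% Joint objective (assumed form; \lambda and d are placeholders).
\mathcal{L} = \mathcal{L}_{\text{next-token}} + \lambda \, \mathcal{L}_{\text{next-latent}},
\qquad
\mathcal{L}_{\text{next-latent}} = \sum_{t} d\big(\hat{h}_{t+1},\, h_{t+1}\big)
```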

If this is right

The learned latents become more compressed representations of history.
Downstream accuracy improves on world modeling, reasoning, planning, and language modeling tasks.
Variable-length self-speculative decoding becomes possible, accelerating inference up to 3.3x.
The transformer architecture, parallel training efficiency, and inference procedure remain unchanged.
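
The speculative-decoding consequence can be illustrated with a toy greedy draft-and-verify loop. This is a sketch under stated assumptions, not the paper's procedure: the page does not say how the draft length is chosen, so the confidence-threshold stopping rule, the `draft_step` callback, and both function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def draft_tokens(state, draft_step: Callable, min_confidence: float = 0.9,
                 max_len: int = 8) -> List[int]:
    """Draft a variable number of tokens, stopping when confidence drops.

    `draft_step(state)` is a hypothetical callback that returns
    (token, confidence, next_state), for example by rolling a latent
    transition model forward and decoding from the predicted state.
    """
    tokens = []
    for _ in range(max_len):
        token, confidence, state = draft_step(state)
        if confidence < min_confidence:
            break
        tokens.append(token)
    return tokens


def accept_draft(draft: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Greedy verification: keep the agreeing prefix, then one verifier token.

    `verified[i]` is the full model's greedy token at draft position i.
    If it has one more entry than `draft`, that entry is a bonus token
    used when every drafted token is accepted.
    """
    accepted = []
    for d, v in zip(draft, verified):
        accepted.append(v)
        if d != v:  # first disagreement: take the verifier's token and stop
            return accepted
    if len(verified) > len(draft):
        accepted.append(verified[len(draft)])
    return accepted
```

Each verification pass then yields between one and `len(draft) + 1` tokens, which is where any speedup would come from when drafts are usually accepted.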

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same auxiliary objective might be added to other non-recurrent sequence models to induce similar compression.
Improved lookahead planning may stem directly from the enforced consistency of the learned transition dynamics.
Smaller models trained with NextLat could match the effective capacity of larger models trained only on next-token loss.

Load-bearing premise

Jointly optimizing the auxiliary latent-prediction loss with the main next-token loss will produce convergence to belief states without the auxiliary term dominating or destabilizing training.

What would settle it

Train a transformer under the NextLat objective and check whether its latent representations fail to compress history or produce inconsistent next-state predictions across different sequences of the same underlying process.
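
One way to make the "inconsistent next-state predictions" half of this test concrete is to measure how far a transition function's prediction lands from the hidden state the model actually produces one step later. The sketch below is an assumed metric, not the paper's evaluation protocol; `transition_consistency_error` and its arguments are hypothetical names, and vectors are plain Python lists.

```python
def transition_consistency_error(hidden, next_inputs, transition):
    """Mean squared gap between transition(h_t, x_{t+1}) and the observed h_{t+1}.

    hidden:      list of hidden-state vectors h_1..h_T for one sequence
    next_inputs: list of input vectors, next_inputs[t] pairs with hidden[t]
                 and represents the token that leads to hidden[t + 1]
    transition:  callable (h_t, x_next) -> predicted next hidden state
    """
    total, count = 0.0, 0
    for t in range(len(hidden) - 1):
        predicted = transition(hidden[t], next_inputs[t])
        for p, h in zip(predicted, hidden[t + 1]):
            total += (p - h) ** 2
            count += 1
    return total / count if count else 0.0
```

A value near zero across many sequences of the same process would be consistent with coherent transition dynamics; a large value would count against the claim.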

Figures

Figures reproduced from arXiv: 2511.05963 by Akshay Krishnamurthy, Alex Lamb, Edward S. Hu, Jayden Teoh, John Langford, Kwangjun Ahn, Manan Tomar, Pratyusha Sharma, Riashat Islam, Tim Pearce.

**Figure 2.** Illustration of different predictive mechanisms at time step $t = 3$. Other methods supervise only the token-level emissions, leaving intermediate latent representations implicit. In contrast, NextLat explicitly learns latent dynamics that predict the hidden state $\hat{h}_{t+1}$ from $(h_t, x_{t+1})$. Token-level supervision is then applied to $\hat{h}_{t+1}$. Therefore, accurate multi-token predictions (beyond the next token) em…

**Figure 3.** Reconstructed maps from transformers trained on Manhattan taxi rides using different objectives.

**Figure 4.** Performance on Countdown. Best result is bolded, and second best is underlined. [Adjacent plot: y-axis Validity (%), 0 to 100; x-axis Eq. 1, Eq. 2, Eq. 3; series GPT, BST, MTP, JTP, NextLat.]

**Figure 5.** Validity of equations (i.e., LHS = RHS) generated on Countdown. All models in this plot use $d = 1$.

Setup. Following Gandhi et al. [2024], we generate 500k training problems with target numbers ranging from 10 to 100 and reserve 10% of the targets for out-of-distribution evaluation. During both training and testing, we insert eight ‘pause tokens’ [Goyal et al., 2023] after the target number, allowing models…

**Figure 7.** Illustration of a $G_{5,5}$ Path-Star graph [Bachmann and Nagarajan, 2024].

**Figure 8.** Cross-entropy loss difference relative to GPT, obtained from linear probes trained on frozen hidden states.

**Figure 9.** Illustration of the latent transition model $p_\psi$. We parameterize the latent transition model $p_\psi$ with a three-layer MLP using GELU [Hendrycks and Gimpel, 2023] activations. The latent transition model takes as input the layer-normalized [Ba et al., 2016] concatenation of the current hidden state $h_t$ and next-token embedding $X_{t+1}$, and outputs a delta update applied via residual connection: $\hat{h}_{t+1} = p_\psi(h_t, X_{t+1}$…

**Figure 10.** Full plot version of
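
The Figure 9 caption describes the latent transition model closely enough to sketch its forward pass. The version below is a dependency-free illustration, assuming a layer norm without learned scale or shift, GELU after the first two of three linear layers, and small random initial weights; the class name `LatentTransition`, the width `d_mlp`, and the initialization are hypothetical choices, not details taken from the paper.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(n_in, n_out, rng, scale=0.02):
    weight = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


class LatentTransition:
    """Three-layer MLP that predicts a delta to the current hidden state."""

    def __init__(self, d_hidden, d_embed, d_mlp, seed=0):
        rng = random.Random(seed)
        self.layers = [
            init_linear(d_hidden + d_embed, d_mlp, rng),
            init_linear(d_mlp, d_mlp, rng),
            init_linear(d_mlp, d_hidden, rng),
        ]

    def __call__(self, h_t, x_next):
        # Layer-normalize the concatenation of hidden state and token embedding.
        z = layer_norm(list(h_t) + list(x_next))
        for i, (weight, bias) in enumerate(self.layers):
            z = linear(z, weight, bias)
            if i < len(self.layers) - 1:  # no activation after the last layer
                z = [gelu(v) for v in z]
        # Residual connection: the MLP output is a delta on h_t.
        return [h + dz for h, dz in zip(h_t, z)]
```

Because of the residual form, a transition model whose last layer outputs zeros predicts that the hidden state does not change, which gives a simple sanity check.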

Original abstract

Transformers replace recurrence with a memory that grows with sequence length and self-attention that enables ad-hoc lookups over past tokens. Consequently, they lack an inherent incentive to compress history into compact latent states with consistent transition rules. This often leads to learning solutions that generalize poorly. We introduce Next-Latent Prediction (NextLat), which extends standard next-token training with self-supervised predictions in the latent space. Specifically, NextLat trains a transformer to learn latent representations that are predictive of its next latent state given the next token. Theoretically, we show that these latents provably converge towards belief states, compressed information about the history necessary to predict the future. This simple auxiliary objective injects a recurrent inductive bias into transformers while leaving their architecture, parallel training efficiency, and inference unchanged. NextLat effectively encourages transformers to form compact internal world models with coherent belief states and transition dynamics -- crucial properties not guaranteed by standard next-token prediction alone. Empirically, across benchmarks in world modeling, reasoning, planning, and language modeling, NextLat demonstrates significant gains over standard next-token prediction and other baselines in downstream accuracy, representation compression, and lookahead planning. Furthermore, NextLat enables variable-length self-speculative decoding, accelerating inference by up to 3.3x in language modeling. NextLat offers a simple yet effective paradigm for learning compact, predictive representations in transformers that generalize better. Our code is available at https://github.com/microsoft/NextLat.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NextLat adds a simple latent-prediction auxiliary loss that may help compression, but the convergence claim is stated only for the auxiliary term, which leaves a gap for the joint objective actually used in training.


The main new piece is NextLat: an auxiliary loss that has the model predict its own next latent representation given the next token. They argue this pushes the hidden states toward belief states (the minimal sufficient statistic for future prediction) and show some downstream gains in world modeling and planning, plus an inference speedup of up to 3.3x via variable-length self-speculative decoding. The change is cheap (no architecture or inference change), and the code is public, which is useful.

Referee Report

2 major / 2 minor

Summary. The paper proposes Next-Latent Prediction (NextLat), which augments standard next-token cross-entropy training of transformers with an auxiliary self-supervised loss that predicts the next latent representation given the next token. It claims that the resulting latents provably converge to belief states (compressed sufficient statistics of history for future prediction), inject a recurrent inductive bias without altering architecture or inference, and yield empirical gains in world modeling, reasoning, planning, language modeling, representation compression, and up to 3.3x faster self-speculative decoding.

Significance. If the convergence result holds under the joint objective actually used in training and the reported gains are robust to controls, the method would offer a lightweight way to encourage compact, predictive internal world models in transformers while preserving their parallel training advantages. The combination of a theoretical fixed-point argument with downstream improvements in lookahead planning and variable-length decoding would be a notable contribution to representation learning for sequential decision-making tasks.

major comments (2)

[Abstract / theoretical analysis section] The theoretical claim that latents converge to belief states is derived only for the auxiliary latent-prediction objective (abstract and the paragraph beginning 'Theoretically, we show...'). No argument is supplied that the stationary points of the combined loss (next-token cross-entropy plus weighted auxiliary term) remain belief states; the next-token term could in principle select representations that are locally predictive of the immediate token but violate the belief-state fixed-point condition. This gap is load-bearing for the central 'provably converge' claim.
[Experiments section] The empirical sections report gains on world-modeling, reasoning, and planning benchmarks, but the manuscript does not include an ablation that isolates whether the auxiliary loss alone (without the next-token term) already produces the claimed belief-state convergence, nor does it verify that the learned latents satisfy the belief-state property on the actual trained models.

minor comments (2)

[Abstract] The abstract states 'Our code is available at https://github.com/microsoft/NextLat' but does not specify the commit or tag corresponding to the reported experiments.
[Theoretical analysis] Notation for the latent transition and belief-state definitions should be introduced with explicit equations rather than prose descriptions to allow direct comparison with the derived fixed-point condition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the constructive feedback on our manuscript. We address the two major comments below and plan to make revisions to strengthen the theoretical claims and experimental validation.


Referee: [Abstract / theoretical analysis section] The theoretical claim that latents converge to belief states is derived only for the auxiliary latent-prediction objective (abstract and the paragraph beginning 'Theoretically, we show...'). No argument is supplied that the stationary points of the combined loss (next-token cross-entropy plus weighted auxiliary term) remain belief states; the next-token term could in principle select representations that are locally predictive of the immediate token but violate the belief-state fixed-point condition. This gap is load-bearing for the central 'provably converge' claim.

Authors: The referee correctly identifies that our theoretical analysis derives convergence to belief states specifically for the auxiliary latent-prediction objective. We did not prove that the stationary points of the joint loss are belief states. In the revision, we will update the abstract and the theoretical section to state that the auxiliary objective has belief states as fixed points, and that the combined training is intended to encourage this property while maintaining next-token prediction performance. We will also add a discussion of why the next-token loss is not expected to violate the fixed-point condition under suitable choices of the auxiliary weight.

revision: yes

Referee: [Experiments section] The empirical sections report gains on world-modeling, reasoning, and planning benchmarks, but the manuscript does not include an ablation that isolates whether the auxiliary loss alone (without the next-token term) already produces the claimed belief-state convergence, nor does it verify that the learned latents satisfy the belief-state property on the actual trained models.

Authors: We agree that the manuscript would benefit from an ablation isolating the auxiliary loss and direct verification of the belief-state property on the trained models. In the revision, we will add such an ablation study where possible (noting that training solely with the auxiliary loss may require adjustments for stability) and include metrics or checks to verify that the learned latents act as sufficient statistics for future predictions. This will help confirm the empirical realization of the theoretical property.

revision: yes

Circularity Check

0 steps flagged

No circularity; theoretical claim presented as independent proof


The paper states a theoretical result that latents converge to belief states under the auxiliary latent-prediction objective. No quoted equations or self-citations reduce this claim by construction to fitted inputs, renamed empirical patterns, or load-bearing prior work by the same authors. The joint next-token + auxiliary loss is acknowledged as the actual training objective, but the provided text frames the convergence as a separate proof rather than a statistical consequence of the fit itself. This is the normal case of a self-contained derivation; the skeptic concern is a potential applicability gap, not a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5821 in / 995 out tokens · 54763 ms · 2026-05-25T07:18:57.466192+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Passage: "Theoretically, we show that these latents provably converge towards belief states, compressed information about the history necessary to predict the future. ... For these consistency objectives to be satisfied, $h_t$ must converge to a belief state"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Passage: "Optimizing for next-token consistency ... and transition consistency ... ensures existence of measurable maps ... $h_t$ must jointly optimize toward a belief state"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Semantic Step Prediction: Multi-Step Latent Forecasting in LLM Reasoning Trajectories via Step Sampling
cs.LG 2026-04 unverdicted novelty 7.0

Applying STP at consecutive semantic reasoning steps achieves 168x more accurate multi-step latent prediction on ProcessBench than frozen baselines, with trajectories forming smooth curves best captured by non-linear ...

Improving Sampling for Masked Diffusion Models via Information Gain
cs.CL 2026-02 unverdicted novelty 7.0

Info-Gain Sampler improves MDM decoding by using bidirectional information gain to reduce cumulative uncertainty, outperforming greedy samplers on reasoning accuracy and creative writing tasks.

Exploring the Potential of Probabilistic Transformer for Time Series Modeling: A Report on the ST-PT Framework
cs.LG 2026-04 unverdicted novelty 6.0

ST-PT turns transformers into explicit factor graphs for time series, enabling structural injection of symbolic priors, per-sample conditional generation, and principled latent autoregressive forecasting via MFVI iterations.

The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning
cs.LG 2026-04 unverdicted novelty 6.0

LLMs discover latent planning strategies up to five steps during training and execute them up to eight steps at test time, with larger models reaching seven under few-shot prompting, revealing a dissociation between d...

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 4 Pith papers · 11 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

[2]

Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur

URL https://arxiv.org/abs/2503.21801. Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. Exploring length generalization in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Pro...

[3]

Layer Normalization

URL https://arxiv.org/abs/1607.06450. Gregor Bachmann and Vaishnavh Nagarajan. The pitfalls of next-token prediction. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Mach...

[4]

Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred Warmuth

URL https://arxiv.org/abs/2304.12210. Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred Warmuth. Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension. In Proceedings of the eighteenth annual ACM symposium on Theory of computing, pages 273–282,

[5]

[Online; accessed 12-October-2025]

URL https://en.wikipedia.org/wiki/Countdown_(game_show). [Online; accessed 12-October-2025]. Kenneth James Williams Craik. The nature of explanation, volume

[6]

Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D

URL https://proceedings.neurips.cc/paper_files/paper/1993/file/e0ec453e28e061cc58ac43f91dc2f3f0-Paper.pdf. Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. Faith an...

[7]

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

URL https://proceedings.neurips.cc/paper_files/paper/2024/file/75b0edb869e2cd509d64d0e8ff446bc1-Paper-Conference.pdf. Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English? arXiv preprint arXiv:2305.07759,

[8]

Bruce A Francis and Walter Murray Wonham

URL https://arxiv.org/abs/2510.17558. Bruce A Francis and Walter Murray Wonham. The internal model principle of control theory. Automatica, 12(5):457–465,

[9]

Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D Goodman

URL https://arxiv.org/abs/2403.08540. Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D Goodman. Stream of search (SoS): Learning to search in language. arXiv preprint arXiv:2404.03683,

[10]

Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226,

Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226,

[11]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

URL https://arxiv.org/abs/2312.00752. Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces,

[12]

Efficiently Modeling Long Sequences with Structured State Spaces

URL https://arxiv.org/abs/2111.00396. Wes Gurnee and Max Tegmark. Language models represent space and time. In The Twelfth International Conference on Learning Representations,

[13]

World Models

URL https://openreview.net/forum?id=jE8xbmvFin. David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3),

[14]

Dream to Control: Learning Behaviors by Latent Imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603,

[15]

Lillicrap, Mohammad Norouzi, and Jimmy Ba

Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7,

[16]

doi: 10.1038/s41586-025-08744-2

ISSN 1476-4687. doi: 10.1038/s41586-025-08744-2. URL https://doi.org/10.1038/s41586-025-08744-2. Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control,

[17]

Temporal difference learning for model predictive control

URL https://arxiv.org/abs/2203.04955. Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs),

[18]

Gaussian Error Linear Units (GELUs)

URL https://arxiv.org/abs/1606.08415. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,

[19]

Huang, P., Liu, S., Liu, Z., Yan, Y., Wang, S., Chen, Z., and Xiao, T

URL https://arxiv.org/abs/2410.23506. Hai Huang, Yann LeCun, and Randall Balestriero. LLM-JEPA: Large language models meet joint embedding predictive architectures. arXiv preprint arXiv:2509.14252,

[20]

Jamba: A Hybrid Transformer-Mamba Language Model

URL https://arxiv.org/abs/2403.19887. Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437,

[21]

Ilya Loshchilov and Frank Hutter

URL https://arxiv.org/abs/2203.01205. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization,

[22]

Decoupled Weight Decay Regularization

URL https://arxiv.org/abs/1711.05101. Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, and Michael Gastpar. Attention with Markov: A curious case of single-layer transformers. In The Thirteenth International Conference on Learning Representations,

[23]

URL https://aclanthology.org/2022.tacl-1.49/

doi: 10.1162/tacl_a_00493. URL https://aclanthology.org/2022.tacl-1.49/. William Merrill, Jackson Petty, and Ashish Sabharwal. The illusion of state in state-space models,

[24]

R Chris Miall and Daniel M Wolpert

URL https://arxiv.org/abs/2404.08819. R Chris Miall and Daniel M Wolpert. Forward models for physiological motor control. Neural Networks, 9(8):1265–1279,

[25]

Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction. arXiv preprint arXiv:2504.15266,

Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, and Aditi Raghunathan. Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction. arXiv preprint arXiv:2504.15266,

[26]

Roma Patel and Ellie Pavlick

URL https://arxiv.org/abs/2402.04248. Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. In International Conference on Learning Representations,

[27]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

URL https://arxiv.org/abs/2201.02177. Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling,

[28]

URL https://arxiv.org/abs/2007.05929. Charlotte Striebel. Sufficient statistics in the optimum control of stochastic systems. Journal of Mathematical Analysis and Applications, 12(3):576–592,

[29]

Keyon Vafa, Justin Y Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan

URL https://arxiv.org/abs/2409.05816. Keyon Vafa, Justin Y Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan. Evaluating the world model implicit in a generative model. Advances in Neural Information Processing Systems, 37:26941–26975,

[30]

Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim

URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. In Kevin Duh, H...

[31]

doi: 10.18653/v1/2024.naacl-long.102

Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.102. URL https://aclanthology.org/2024.naacl-long.102/. Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Ha...

[32]

Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong

URL https://proceedings.neurips.cc/paper_files/paper/2023/file/271db9922b8d1f4dd7aaef84ed5ac703-Paper-Conference.pdf. Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Beyond autoregression: Discrete diffusion for complex reasoning and planning. In The Thirteenth International Conference on Learning Representations,

[33]

Learning invariant representations for reinforcement learning without reconstruction. arXiv preprint arXiv:2006.10742,

Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. arXiv preprint arXiv:2006.10742,

[34]

A Belief States in Sequence Modeling. Recent work has introduced variants of sequence modeling architectures based on the principle of learning belief states, i.e., BST and JTP

URL https://arxiv.org/abs/2306.10125. A Belief States in Sequence Modeling. Recent work has introduced variants of sequence modeling architectures based on the principle of learning belief states, i.e., BST and JTP. We review these methods here. Let $\theta$ denote the parameters of a transformer-based model. Let $h_{s:t}$ denote the hidden states produced by the tra...

[35]

short-horizon belief states

suggest that their method learns “short-horizon belief states”, they do not formally define the conditions under which this occurs. To understand how JTP learns belief states, we start by defining a $k$-observable system. Definition A.1 ($k$-observability for sequences). A system is $k$-observable if for any two sequences $H = X_{1:t}$ and $H' = X_{1:j}$ that induce the same jo...

[36]

A formal proof proceeds by backward induction on t

ensures the existence of measurable maps $p_\theta$ and $p_\psi$ that allow recursive decoding of future tokens: $h_t \xrightarrow[\text{decode token}]{p_\theta} X_{t+1} \xrightarrow[\text{update state}]{p_\psi} h_{t+1} \xrightarrow[\text{decode token}]{p_\theta} X_{t+2} \xrightarrow[\text{update state}]{p_\psi} h_{t+2} \cdots \xrightarrow{p_\theta} X_T$. A formal proof proceeds by backward induction on $t$. For the base case $t = T-1$, the claim follow...

[37]

grokking

    # (B,1,1)
    next_tokens = batch
    next_states = hidden_states
    current_states = hidden_states
    loss_next_h = 0
    loss_kl = 0

    # Recursive multi-step predictions
    for _ in range(multi_step_horizon):
        # Shift hidden states back by 1 using dummy initial state, similar to RNNs
        current_states = torch.cat([initial_hidden, current_states[:, :-1]], dim=...

[38]

Singular values smaller than 1e−12 are discarded, and the effective rank is then computed following Roy and Vetterli [2007]

through the model to obtain the hidden state matrix. Singular values smaller than 1e−12 are discarded, and the effective rank is then computed following Roy and Vetterli [2007]. For GPT and NextLat, we use the final-layer hidden states. For JTP, we extract the hidden states immediately before the self-attention module in the Fetch head (see Equations 4–5 ...

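
The effective-rank computation described in entry [38] can be sketched as follows. This assumes the Roy and Vetterli definition, the exponential of the Shannon entropy of the normalized singular values, and takes the singular values as given (for example from an SVD of the hidden state matrix); the function name and the `tol` parameter name are illustrative.

```python
import math


def effective_rank(singular_values, tol=1e-12):
    """Effective rank: exp of the entropy of the normalized singular values.

    Singular values smaller than `tol` are discarded first, as in the text.
    """
    kept = [s for s in singular_values if s >= tol]
    if not kept:
        return 0.0
    total = sum(kept)
    # Shannon entropy (natural log) of the distribution p_i = s_i / sum(s).
    entropy = -sum((s / total) * math.log(s / total) for s in kept)
    return math.exp(entropy)
```

A matrix with n equal singular values has effective rank n, while one dominated by a single direction has effective rank close to 1, so lower values indicate more compressed representations.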
[39]

Each problem consists of four input numbers and a solution sequence comprising three equations, consistent with prior work [Gandhi et al., 2024, Ye et al., 2025]

for the Countdown training and evaluation setup. Each problem consists of four input numbers and a solution sequence comprising three equations, consistent with prior work [Gandhi et al., 2024, Ye et al., 2025]. A training example is formatted as 14,83,88,91,23|83−14 = 69,91−88 = 3,69/3 = 23 where the first four numbers are the inputs, the fifth is the ta...
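A small checker makes the example format concrete. The sketch below parses a string of the form shown above and verifies that every equation is arithmetically correct and that the last result equals the final number before the `|`. The rules it enforces (each number used at most once, exact integer division) are assumptions about the task, and `check_countdown` is a hypothetical helper, not the paper's evaluation code.

```python
import re

# One equation of the form "a op b = c" with non-negative integers.
EQUATION = re.compile(r"^\s*(\d+)\s*([+\-*/])\s*(\d+)\s*=\s*(\d+)\s*$")


def check_countdown(example: str) -> bool:
    """Return True if the solution steps are valid and reach the target."""
    example = example.replace("\u2212", "-")  # normalize Unicode minus signs
    head, _, body = example.partition("|")
    numbers = [int(n) for n in head.split(",")]
    pool, target = numbers[:-1], numbers[-1]
    result = None
    for equation in body.split(","):
        match = EQUATION.match(equation)
        if match is None:
            return False
        a, op, b, c = int(match[1]), match[2], int(match[3]), int(match[4])
        # Both operands must still be available (assumed rule).
        for operand in (a, b):
            if operand not in pool:
                return False
            pool.remove(operand)
        if op == "+":
            value = a + b
        elif op == "-":
            value = a - b
        elif op == "*":
            value = a * b
        else:
            if b == 0 or a % b != 0:
                return False
            value = a // b
        if value != c:
            return False
        pool.append(c)  # intermediate results can be reused later
        result = c
    return result == target
```

For instance, `check_countdown("14,83,88,91,23|83-14 = 69,91-88 = 3,69/3 = 23")` accepts the training example quoted above.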

[40]

which generate a fresh set of graphs every batch, we adopt the original, more challenging setup of Bachmann and Nagarajan [2024], which uses a fixed sample size of 200k and node values sampled from $N = 100$. This difference accounts for the performance gap observed in the BST and JTP baselines in Figure

[41]

The Path-Star experiment is designed to expose the myopic behavior of teacher-forced next-token prediction, which can encourage models to exploit superficial regularities, an effect referred to as the Clever Hans cheat [Bachmann and Nagarajan, 2024]. Because the task’s sample space grows exponentially with graph size, identifying the correct algorithm tha...
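To illustrate why teacher forcing invites this shortcut, the sketch below builds a toy star graph and counts, for each step of the ground-truth path, how many outgoing edges the previous ground-truth node has. The graph construction, the function names, and the parameter values are assumptions for illustration rather than the benchmark's actual generator.

```python
import random
from typing import Dict, List, Tuple


def make_path_star(
    num_arms: int, arm_len: int, num_labels: int, rng: random.Random
) -> Tuple[int, List[List[int]], List[Tuple[int, int]]]:
    """Sample a star graph: one center node with num_arms disjoint paths."""
    labels = rng.sample(range(num_labels), 1 + num_arms * arm_len)
    center, rest = labels[0], labels[1:]
    arms = [rest[i * arm_len:(i + 1) * arm_len] for i in range(num_arms)]
    edges: List[Tuple[int, int]] = []
    for arm in arms:
        prev = center
        for node in arm:
            edges.append((prev, node))  # edge pointing away from the center
            prev = node
    rng.shuffle(edges)
    return center, arms, edges


def candidates_per_step(edges: List[Tuple[int, int]], path: List[int]) -> List[int]:
    """Number of possible next nodes at each step, given the true prefix."""
    children: Dict[int, List[int]] = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
    return [len(children[prev]) for prev in path[:-1]]


center, arms, edges = make_path_star(5, 4, 100, random.Random(0))
path = [center] + arms[2]
counts = candidates_per_step(edges, path)
```

Only the first step requires choosing among the arms. Every later step has a single candidate once the previous ground-truth node is given, so a model can fit those tokens by edge lookup without learning to plan.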


[1]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

[2]

URL https://arxiv.org/abs/2503.21801.

Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. Exploring length generalization in large language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Pro...

[3]

Layer Normalization. URL https://arxiv.org/abs/1607.06450.

Gregor Bachmann and Vaishnavh Nagarajan. The pitfalls of next-token prediction. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Mach...

[4]

URL https://arxiv.org/abs/2304.12210.

Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred Warmuth. Classifying learnable geometric concepts with the Vapnik-Chervonenkis dimension. In Proceedings of the eighteenth annual ACM symposium on Theory of computing, pages 273–282,

[5]

URL https://en.wikipedia.org/wiki/Countdown_(game_show). [Online; accessed 12-October-2025].

Kenneth James Williams Craik. The nature of explanation, volume

[6]

URL https://proceedings.neurips.cc/paper_files/paper/1993/file/e0ec453e28e061cc58ac43f91dc2f3f0-Paper.pdf.

Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Sean Welleck, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. Faith an...

[7]

URL https://proceedings.neurips.cc/paper_files/paper/2024/file/75b0edb869e2cd509d64d0e8ff446bc1-Paper-Conference.pdf.

Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English? arXiv preprint arXiv:2305.07759,

[8]

URL https://arxiv.org/abs/2510.17558.

Bruce A Francis and Walter Murray Wonham. The internal model principle of control theory. Automatica, 12(5): 457–465,

[9]

URL https://arxiv.org/abs/2403.08540.

Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, and Noah D Goodman. Stream of search (SoS): Learning to search in language. arXiv preprint arXiv:2404.03683,

[10]

Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. arXiv preprint arXiv:2310.02226,

[11]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces. URL https://arxiv.org/abs/2312.00752.

Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces,

[12]

Efficiently Modeling Long Sequences with Structured State Spaces. URL https://arxiv.org/abs/2111.00396.

Wes Gurnee and Max Tegmark. Language models represent space and time. In The Twelfth International Conference on Learning Representations,

[13]

URL https://openreview.net/forum?id=jE8xbmvFin.

David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2(3),

[14]

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603,

[15]

Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7,

[16]

ISSN 1476-4687. doi: 10.1038/s41586-025-08744-2. URL https://doi.org/10.1038/s41586-025-08744-2.

Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control,

[17]

Temporal difference learning for model predictive control. URL https://arxiv.org/abs/2203.04955.

Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs),

[18]

Gaussian Error Linear Units (GELUs). URL https://arxiv.org/abs/1606.08415.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,

[19]

Huang, P., Liu, S., Liu, Z., Yan, Y., Wang, S., Chen, Z., and Xiao, T

URL https://arxiv.org/abs/2410.23506.

Hai Huang, Yann LeCun, and Randall Balestriero. LLM-JEPA: Large language models meet joint embedding predictive architectures. arXiv preprint arXiv:2509.14252,

[20]

Jamba: A Hybrid Transformer-Mamba Language Model. URL https://arxiv.org/abs/2403.19887.

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437,

[21]

URL https://arxiv.org/abs/2203.01205.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization,

[22]

Decoupled Weight Decay Regularization. URL https://arxiv.org/abs/1711.05101.

Ashok Vardhan Makkuva, Marco Bondaschi, Adway Girish, Alliot Nagle, Martin Jaggi, Hyeji Kim, and Michael Gastpar. Attention with Markov: A curious case of single-layer transformers. In The Thirteenth International Conference on Learning Representations,

[23]

doi: 10.1162/tacl_a_00493. URL https://aclanthology.org/2022.tacl-1.49/.

William Merrill, Jackson Petty, and Ashish Sabharwal. The illusion of state in state-space models,

[24]

URL https://arxiv.org/abs/2404.08819.

R Chris Miall and Daniel M Wolpert. Forward models for physiological motor control. Neural networks, 9(8): 1265–1279,

[25]

Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, and Aditi Raghunathan. Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction. arXiv preprint arXiv:2504.15266,

[26]

URL https://arxiv.org/abs/2402.04248.

Roma Patel and Ellie Pavlick. Mapping language models to grounded conceptual spaces. In International conference on learning representations,

[27]

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. URL https://arxiv.org/abs/2201.02177.

Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling,

[28]

URL https://arxiv.org/abs/2007.05929.

Charlotte Striebel. Sufficient statistics in the optimum control of stochastic systems. Journal of Mathematical Analysis and Applications, 12(3):576–592,

[29]

URL https://arxiv.org/abs/2409.05816.

Keyon Vafa, Justin Y Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan. Evaluating the world model implicit in a generative model. Advances in Neural Information Processing Systems, 37:26941–26975,

[30]

URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks. In Kevin Duh, H...

[31]

Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.102. URL https://aclanthology.org/2024.naacl-long.102/.

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Ha...

[32]

URL https://proceedings.neurips.cc/paper_files/paper/2023/file/271db9922b8d1f4dd7aaef84ed5ac703-Paper-Conference.pdf.

Jiacheng Ye, Jiahui Gao, Shansan Gong, Lin Zheng, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Beyond autoregression: Discrete diffusion for complex reasoning and planning. In The Thirteenth International Conference on Learning Representations,

[33]

Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, and Sergey Levine. Learning invariant representations for reinforcement learning without reconstruction. arXiv preprint arXiv:2006.10742,
