World-Model Collapse as a Phase Transition

Xinyuan Song; Zekun Cai

arxiv: 2606.31399 · v1 · pith:N6MQDJPFnew · submitted 2026-06-30 · 💻 cs.AI

Pith reviewed 2026-07-01 05:16 UTC · model grok-4.3

classification 💻 cs.AI

keywords world models · phase transitions · language agents · long-horizon planning · state fidelity · agent collapse · deterministic tasks · implicit models

The pith

Long-horizon language agents undergo abrupt world-model collapse at critical parameter boundaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether implicit world models inside language agents exhibit phase-transition behavior analogous to physical systems. In a deterministic task family that supplies exact gold states at every step, small shifts in state cardinality, horizon length, or mutation rate leave performance stable until a narrow critical band, after which fidelity drops sharply. Per-step analysis shows that accurate representation of the world fails before the agent produces invalid actions, so the agent acts from an internally corrupted model rather than simply choosing poorly. A broad grid search maps a solved plateau, a thin transition zone, and a collapsed regime; stronger models move the boundary but retain the qualitative jump. The result frames world-model fidelity as a distinct, measurable limit on long-horizon competence.

Core claim

Near critical boundaries in state load, dependency density, horizon, and mutation rate, a small parameter change triggers sudden loss of world-state fidelity, so the agent operates from a corrupted internal model of the environment rather than merely selecting a bad action. Stronger models translate the location of the boundary without eliminating the transition itself.

What carries the argument

Grid search over a deterministic task family with exact per-step gold states, yielding a phase diagram of solved plateau, narrow transition band, and collapse floor, together with per-step traces that separate world-state fidelity from action validity.
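
As a rough illustration of how such a phase diagram could be assembled, the sketch below sweeps a two-axis slice of the grid and labels each cell by its success rate. Everything in it is an assumption made for illustration: the `run_episode` callback, the 0.9 and 0.1 thresholds, the 10 trials per cell, and the toy agent are hypothetical and are not the paper's task family or scoring rule.

```python
from itertools import product


def classify_cell(success_rate, solved_min=0.9, collapsed_max=0.1):
    """Label one grid cell by its mean success rate (thresholds are illustrative)."""
    if success_rate >= solved_min:
        return "solved"
    if success_rate <= collapsed_max:
        return "collapsed"
    return "transition"


def phase_diagram(run_episode, cardinalities, horizons, trials=10):
    """Sweep a 2-D slice of the grid and label every (cardinality, horizon) cell."""
    diagram = {}
    for n_states, horizon in product(cardinalities, horizons):
        wins = sum(bool(run_episode(n_states, horizon, seed)) for seed in range(trials))
        diagram[(n_states, horizon)] = classify_cell(wins / trials)
    return diagram


def toy_episode(n_states, horizon, seed):
    """Hypothetical stand-in for an agent run: succeeds while the load stays small."""
    load = n_states * horizon
    if load <= 40:
        return True
    if load >= 60:
        return False
    return seed % 2 == 0  # mixed outcomes inside the narrow band


diagram = phase_diagram(toy_episode, cardinalities=[4, 5, 6], horizons=[5, 10, 15])
for cell in sorted(diagram):
    print(cell, diagram[cell])
```

In a real sweep `run_episode` would call the agent and compare its final state with the gold state; the toy function only exists so the labelling logic can be run end to end.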

If this is right

World-state fidelity degrades before action validity inside the transition regime.
Stronger models shift the critical boundary but preserve the qualitative phase transition.
World-model collapse constitutes a measurable bottleneck separate from action selection errors.
The transition appears across observation modes and mutation rates within the tested task family.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Scaling model size alone may move but not remove the collapse boundary for long-horizon tasks.
The same critical-boundary logic could appear in non-deterministic or partially observed settings.
Explicit monitoring of internal state representations might detect impending collapse before behavior visibly degrades.
Training objectives that penalize state divergence could push the transition boundary outward.

Load-bearing premise

The deterministic task family with exact per-step gold states measures implicit world-model fidelity without confounding effects from the different observation modes or mutation rates.

What would settle it

Per-step traces that keep world-state fidelity high across the transition band, or that show action validity failing before state fidelity, would falsify the claimed mechanism.
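
The ordering test described here can be sketched in a few lines, assuming each episode yields a per-step list of the agent's believed states, the matching gold states, and a per-step action-validity flag. How the paper elicits the agent's believed state is not described in the abstract, so the exact-match comparison and all names below are hypothetical.

```python
def first_failure(flags):
    """Return the index of the first False in a per-step list, or None if all pass."""
    for step, ok in enumerate(flags):
        if not ok:
            return step
    return None


def failure_order(predicted_states, gold_states, action_valid):
    """Report whether state fidelity or action validity breaks first in one episode."""
    # Assumption: fidelity at a step means exact equality with the gold state.
    state_ok = [p == g for p, g in zip(predicted_states, gold_states)]
    t_state = first_failure(state_ok)
    t_action = first_failure(action_valid)
    if t_state is None and t_action is None:
        return "no_failure"
    if t_action is None or (t_state is not None and t_state < t_action):
        return "state_first"
    if t_state is None or t_action < t_state:
        return "action_first"
    return "simultaneous"


# Hypothetical four-step trace: the believed state diverges at step 2,
# while the first invalid action only appears at step 3.
gold = ["A", "B", "C", "D"]
believed = ["A", "B", "X", "X"]
valid = [True, True, True, False]
print(failure_order(believed, gold, valid))
```

Under the paper's claimed mechanism, episodes in the transition band would mostly come out as `state_first`; a preponderance of `action_first` episodes would be the falsifying pattern.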

Figures

Figures reproduced from arXiv: 2606.31399 by Xinyuan Song, Zekun Cai.

**Figure 2.** Three-call agent loop used in every episode.

**Figure 4.** One-dimensional cross-sections of the confirmatory grid.

**Figure 5.** Cross-model comparison on the StatefulPuzzle grid.

**Figure 6.** Single-axis ablations around a transition-zone backdrop.

**Figure 7.** Critical-point scans for claude-haiku-4-5 on StatefulPuzzle.

Original abstract

Water looks unchanged as it warms, then at a critical point it boils. We ask whether long-horizon language agents show an analogous transition in their implicit world models. In some parameter settings, changing state load by a small amount, or adding a single step of horizon, leaves behavior nearly unchanged; near a critical boundary, the same small change causes a sudden world collapse. We study this effect in a deterministic task family with exact per-step gold state. A large grid search over state cardinality, dependency density, horizon, branching, observation mode, and mutation rate reveals a phase diagram: a solved plateau, a narrow transition band, and a collapse floor. Per-step traces show the mechanism: world-state fidelity fails before action validity, so the agent is not merely choosing a bad action; it is acting from a corrupted world. Stronger models translate the critical boundary but do not remove the qualitative transition. These results make world-model collapse a measurable bottleneck for long-horizon agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper frames world-model collapse in agents as a phase transition from a grid search, but gold-state access likely confounds the fidelity measurement.

The main thing to know is that this paper runs a large grid search over state cardinality, dependency density, horizon, branching, observation mode, and mutation rate in a deterministic task family, and reports a phase diagram with a solved plateau, narrow transition band, and collapse floor. They also claim that per-step traces show world-state fidelity failing before action validity, and that stronger models shift the critical boundary without eliminating the transition.

What is new is the explicit phase-transition framing applied to implicit world models in language agents, plus the ordering of failure modes. The setup does a reasonable job of making the collapse look measurable rather than gradual, and the distinction between acting from a corrupted world versus just picking a bad action is a useful observation.

The soft spot is the measurement itself. The task supplies exact per-step gold states for the fidelity metric while the agent operates without them. If the extraction procedure uses those gold states in a way that varies with observation mode or mutation rate, the reported ordering of fidelity collapse before action failure is no longer guaranteed to reflect only the agent's implicit model. The abstract gives no equations, no error bars, and no description of how fidelity is computed from the traces, so the central claim rests on an unverified assumption.

This is for researchers working on long-horizon language agents and reliability bottlenecks. A reader who wants to see whether phase-transition language adds anything beyond existing robustness work will get some value from the diagram, but anyone needing reproducible methods or falsifiable predictions will find the current version thin.

It deserves a serious referee to check the fidelity extraction procedure and whether the confound can be ruled out. I would send it to review rather than desk reject, with the expectation that the measurement details will need substantial clarification.

Referee Report

2 major / 1 minor

Summary. The paper claims that long-horizon language agents exhibit a phase-transition-like collapse in their implicit world models. In a deterministic task family with per-step gold states, a grid search over state cardinality, dependency density, horizon, branching, observation mode, and mutation rate produces a phase diagram consisting of a solved plateau, a narrow transition band, and a collapse floor. Per-step traces indicate that world-state fidelity degrades before action validity, so the agent acts from a corrupted world model rather than merely selecting a bad action; stronger models shift the critical boundary but preserve the qualitative transition.

Significance. If the central measurement of implicit fidelity is valid, the work supplies a controlled, falsifiable characterization of world-model collapse as a measurable bottleneck for long-horizon agents, analogous to physical phase transitions. The grid-search methodology and emphasis on per-step traces provide a reproducible experimental scaffold that could guide future scaling studies.

major comments (2)

[grid search and per-step traces description] The fidelity metric is extracted using exact per-step gold states that are supplied to the experimenter but unavailable to the agent. Because this access is held constant while observation mode and mutation rate are varied, any mode-dependent ease of reconstruction from the gold state could produce an ordering of failures that does not reflect the agent's implicit world model. This directly undercuts the claim that 'world-state fidelity fails before action validity' as an intrinsic property of the model rather than an artifact of the measurement procedure.
[grid search description] The abstract states that a 'large grid search' reveals a phase diagram with a 'narrow transition band,' yet supplies neither the number of trials per cell, error bars on the transition location, nor any statistical test distinguishing the band from sampling noise. Without these quantities the reported sharpness of the transition cannot be evaluated and the distinction between plateau, band, and floor remains qualitative.
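
One conventional way to supply the missing per-cell uncertainty is a binomial confidence interval on each cell's success rate. The sketch below uses the Wilson score interval; the choice of interval, the 95% level, and the trial counts in the usage lines are illustrative assumptions, not quantities reported in the abstract.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical cells with 10 trials each: one on the plateau, one in the band.
plateau_low, plateau_high = wilson_interval(10, 10)
band_low, band_high = wilson_interval(5, 10)
print(round(plateau_low, 3), round(plateau_high, 3))
print(round(band_low, 3), round(band_high, 3))
```

With only 10 trials, a cell observed at 5/10 carries an interval of roughly 0.24 to 0.76, which shows why the trial count matters when judging how narrow the transition band really is.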

minor comments (1)

[Abstract] The opening physical analogy ('Water looks unchanged as it warms, then at a critical point it boils') is evocative but should be accompanied by a brief statement of which features of the phase transition are intended to map onto the agent setting.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review of our manuscript. We address the major comments point by point below.

Referee: [grid search and per-step traces description] The fidelity metric is extracted using exact per-step gold states that are supplied to the experimenter but unavailable to the agent. Because this access is held constant while observation mode and mutation rate are varied, any mode-dependent ease of reconstruction from the gold state could produce an ordering of failures that does not reflect the agent's implicit world model. This directly undercuts the claim that 'world-state fidelity fails before action validity' as an intrinsic property of the model rather than an artifact of the measurement procedure.

Authors: The gold states represent the objective reality of the deterministic task and are used solely by the experimenter to compute the fidelity metric; they are never provided to the agent. This setup allows us to directly measure how well the agent's implicit world model aligns with the true state, which is the core of the claim. The observation mode variations affect what the agent sees, but the fidelity is always checked against the same ground truth. We maintain that the observed ordering (fidelity failing first) reflects the agent's internal state corruption rather than a measurement artifact, as the same pattern holds across multiple observation modes. Nevertheless, we will revise the methods section to explicitly discuss the measurement procedure and its assumptions to address this concern.

revision: partial

Referee: [grid search description] The abstract states that a 'large grid search' reveals a phase diagram with a 'narrow transition band,' yet supplies neither the number of trials per cell, error bars on the transition location, nor any statistical test distinguishing the band from sampling noise. Without these quantities the reported sharpness of the transition cannot be evaluated and the distinction between plateau, band, and floor remains qualitative.

Authors: We agree that the manuscript would benefit from more quantitative details on the grid search. In the revised version, we will report the number of trials per cell (10 independent runs), include error bars on the phase boundaries based on these runs, and add a statistical test (such as a permutation test) to support the identification of the narrow transition band. This will provide a more rigorous basis for the phase diagram claims.

revision: yes
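
To make the proposed check concrete, here is a minimal sketch of an exact two-sided permutation test on the difference in success rate between two adjacent grid cells with 10 runs each. The rebuttal names only "a permutation test", so the test statistic, the pairing of adjacent cells, and the example outcomes are assumptions chosen for illustration.

```python
from itertools import combinations


def permutation_p_value(cell_a, cell_b):
    """Exact two-sided permutation test on the difference in mean success."""
    if not cell_a or not cell_b:
        raise ValueError("both cells need at least one run")
    pooled = list(cell_a) + list(cell_b)
    n_a, n_b = len(cell_a), len(cell_b)
    total = sum(pooled)
    observed = abs(sum(cell_a) / n_a - sum(cell_b) / n_b)
    extreme = 0
    count = 0
    # Enumerate every way of relabelling the pooled runs as "cell A".
    for idx in combinations(range(n_a + n_b), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical outcomes (1 = solved episode) on either side of a sharp boundary.
p_sharp = permutation_p_value([1] * 10, [0] * 10)
print(p_sharp)
```

With 10 runs per cell the enumeration covers 184,756 relabellings, which is small enough to compute exactly; larger cells would call for random sampling of relabellings instead.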

Circularity Check

0 steps flagged

No circularity: empirical parameter sweep with independent measurements

The paper reports results from a large grid search over discrete parameters (state cardinality, dependency density, horizon, branching, observation mode, mutation rate) in a deterministic task family that supplies exact per-step gold states. The phase diagram, transition band, and ordering of fidelity vs. action failure are direct outputs of these sweeps and per-step trace comparisons. No equations, fitted parameters, or derivations are presented that reduce to self-defined terms or self-citations. The central claim rests on observable empirical patterns rather than any load-bearing mathematical reduction or ansatz smuggled via prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on abstract; no free parameters, axioms, or invented entities are specified.


[1] [1]

International Conference on Learning Representations (ICLR) , year =

Shridhar, Mohit and Yuan, Xingdi and C. International Conference on Learning Representations (ICLR) , year =

[2] [2]

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

Wang, Ruoyao and Jansen, Peter and C. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP) , year =

2022

[3] [3]

2022 , url =

Yao, Shunyu and Chen, Howard and Yang, John and Narasimhan, Karthik , booktitle =. 2022 , url =

2022

[4] [4]

Zhou, Shuyan and Xu, Frank F. and Zhu, Hao and Zhou, Xuhui and Lo, Robert and Sridhar, Abishek and Cheng, Xianyi and Ou, Tianyue and Bisk, Yonatan and Fried, Daniel and Alon, Uri and Neubig, Graham , booktitle =. 2024 , url =

2024

[5] [5]

and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , journal =

Jimenez, Carlos E. and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik , journal =. 2023 , url =

2023

[6] [6]

GAIA: a benchmark for General AI Assistants

Mialon, Gr. arXiv preprint arXiv:2311.12983 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

2023 , url =

Liu, Xiao and Yu, Hao and Zhang, Hanchen and Xu, Yifan and Lei, Xuanyu and Lai, Hanyu and Gu, Yu and Ding, Hangliang and Men, Kaiwen and Yang, Kejuan and others , journal =. 2023 , url =

2023

[8] [8]

Advances in Neural Information Processing Systems (NeurIPS) , year =

-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

[9] [9]

2024 , url =

Xie, Jian and Zhang, Kai and Chen, Jiangjie and Zhu, Tinghui and Lou, Renze and Tian, Yuandong and Xiao, Yanghua and Su, Yu , booktitle =. 2024 , url =

2024

[10] [10]

and Wettig, Alexander and Lieret, Kilian and Yao, Shunyu and Narasimhan, Karthik and Press, Ofir , journal =

Yang, John and Jimenez, Carlos E. and Wettig, Alexander and Lieret, Kilian and Yao, Shunyu and Narasimhan, Karthik and Press, Ofir , journal =. 2024 , url =

2024

[11] [11]

2025 , url =

Luo, Haotian and Zhang, Huaisong and Zhang, Xuelin and others , journal =. 2025 , url =

2025

[12] [12]

2025 , url =

Imajuku, Yuki and Horie, Kohki and Iwata, Yoichi and others , journal =. 2025 , url =

2025

[13] [13]

Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks , author =. arXiv preprint arXiv:2503.09572 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[14] Yao, Shunyu and Zhao, Jeffrey and Yu, Dian and Du, Nan and Shafran, Izhak and Narasimhan, Karthik and Cao, Yuan. 2023.

[15] Tree of Thoughts: Deliberate Problem Solving with Large Language Models. Advances in Neural Information Processing Systems (NeurIPS).

[16] Graph of Thoughts: Solving Elaborate Problems with Large Language Models. Proceedings of the AAAI Conference on Artificial Intelligence.

[17] Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in Neural Information Processing Systems (NeurIPS).

[18] Qin, Yujia and Liang, Shihao and Ye, Yining and Zhu, Kunlun and Yan, Lan and Lu, Yaxi and Lin, Yankai and Cong, Xin and Tang, Xiangru and Qian, Bill and Zhao, Sihan and Hong, Lauren and Tian, Runchu and Xie, Ruobing and Zhou, Jie and Gerstein, Mark and Li, Dahai and Liu, Zhiyuan and Sun, Maosong. 2023.

[19] Guo, Zhicheng and Cheng, Sijie and Wang, Hao and Liang, Shihao and Qin, Yujia and Li, Peng and Liu, Zhiyuan and Sun, Maosong and Liu, Yang. 2024.

[20] Zhao, Ziliang and Xu, Zenan and Wang, Shuting and Qian, Hongjin and Lei, Yan and Hu, Minda and Wang, Zhao and Dou, Shihan and Dou, Zhicheng and Zhou, Pluto. 2026.

[21] 2025.

[22] 2024.

[23] arXiv preprint arXiv:2410.21276.

[24] Grattafiori, Aaron and Dubey, Abhimanyu and Jauhri, Abhinav and Pandey, Abhinav and Kadian, Abhishek and Al-Dahle, Ahmad and Letman, Aiesha and Mathur, Akhil and Schelten, Alan and Vaughan, Alex and others. The. 2024.

[25] Ding, Shuangrui and Dai, Xuanlang and Xing, Long and Ding, Shengyuan and Liu, Ziyu and JingYi, Yang and Yang, Penghui and Zhang, Zhixiong and Wei, Xilin and Fang, Xinyu and Ma, Yubo and Duan, Haodong and Shao, Jing and Wang, Jiaqi and Lin, Dahua and Chen, Kai and Zang, Yuhang. 2026.

[26] Fang, Shicheng and Wang, Yuxin and Liu, Xiaoran and Lu, Jiahao and Tan, Chuanyuan and Chen, Xinchi and Zheng, Yining and Huang, Xuanjing and Qiu, Xipeng. 2026.

[27] Kapoor, Sayash and Stroebl, Benedikt and Siegel, Zachary S. and Nadgir, Nitya and Narayanan, Arvind. 2025.

[28] Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics.

[29] Chung, Andy and Zhang, Yichi and Lin, Kaixiang and others. Evaluating Long-Context Reasoning in. 2025.

[30] Context Discipline and Performance Correlation: Analyzing LLM Performance and Quality Degradation Under Varying Context Lengths. 2025.

[31] Zhou, Zijian and others. 2025.

[32] Arslan, Mustafa. Aeon: High-Performance Neuro-Symbolic Memory Management for Long-Horizon. 2026.

[33] Shi, Yaorui and Chen, Yuxin and Wang, Siyuan and Li, Sihang and Cai, Hengxing and Gu, Qi and Wang, Xiang and Zhang, An. Look Back to Reason Forward: Revisitable Memory for Long-Context. 2025.

[34] Chess as a Testbed for Language Model State Tracking. arXiv preprint arXiv:2102.13249.

[35] Chen, Siwei and Xiao, Anxing and Hsu, David. 2023.

[36] Chang, Edward Y. and Geng, Longling. 2025.

[37] Zhu, Wangrong and Yi, Qiutong Tony and Jia, Robin. 2026.

[38] Hou, Dengzhe and Jiang, Lingyu and Li, Dengjie and others. 2026.

[39] Samiei, Mahdi and Mansouri, M. and Baghshah, M. The Illusion of Procedural Reasoning: Measuring Long-Horizon. 2025.

[40] Ge, Zhiqi and Huang, Hongzhe and others. 2024.

[41] Chen, Delong and Chung, Willy and Bang, Yejin and Ji, Ziwei and Fung, Pascale. 2025.

[42] Chao, Hanxiang and Bai, Yihan and Sheng, Rui and Li, Tianle and Sun, Yushi. 2026.

[43] Reflexion: Language Agents with Verbal Reinforcement Learning. Advances in Neural Information Processing Systems (NeurIPS).

[44] Zhang, Qingjie and Qiu, Han and others. Understanding the Dark Side of. 2024.

[45] Liu, Guang-Da and Mao, Haitao and others. On the Intrinsic Self-Correction Capability of. 2024.

[46] Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate. arXiv preprint arXiv:2509.05396.

[47] On the Failure of Latent State Persistence in Large Language Models. arXiv preprint arXiv:2505.10571.

[48] Emergent Abilities of Large Language Models. Transactions on Machine Learning Research.

[49] Are Emergent Abilities of Large Language Models a Mirage? Advances in Neural Information Processing Systems (NeurIPS).

[50] Critical Phase Transition in Large Language Models. arXiv preprint arXiv:2406.05335.

[51] Capable but Unreliable: Canonical Path Deviation as a Causal Mechanism of Agent Failure in Long-Horizon Tasks. arXiv preprint arXiv:2602.19008.

[52] Revealing the Barriers of Language Agents in Planning. arXiv preprint arXiv:2410.12409.

[53] Valmeekam, Karthik and Stechly, Kaya and Gundawar, Atharva and Kambhampati, Subbarao. Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of. 2024.

[54] Valmeekam, Karthik and Olmo, Alberto and Sreedharan, Sarath and Kambhampati, Subbarao. 2022.

[55] Valmeekam, Karthik and others. On the Planning Abilities of Large Language Models (. 2023.

[56] Kambhampati, Subbarao and Valmeekam, Karthik and Guan, Lin and Stechly, Kaya and Verma, Mudit and Bhambri, Siddhant and Saldyt, Lucas and Murthy, Anil. 2024.

[57] AgentAtlas: Beyond Outcome Leaderboards for LLM Agents. 2026.

[58] Chen, Yanan and Pesaranghader, Ali and Sadhu, Tanmana and others. Can We Rely on. 2024.

[59] Zhang, Zhiwei and Zhao, Fei and others. Robust Tool Use via. 2026.

[60] Comparative Analysis of Two Rates. Statistics in Medicine, 1985.

[61] Progress Measures for Grokking via Mechanistic Interpretability. International Conference on Learning Representations (ICLR).

[62] Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.

[63] Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. International Conference on Learning Representations (ICLR).

[64] In-Context Learning and Induction Heads. Transformer Circuits Thread / arXiv preprint arXiv:2209.11895.

[65] Genie: Generative Interactive Environments. International Conference on Machine Learning (ICML).

[66] Dissociating Language and Thought in Large Language Models. Trends in Cognitive Sciences, 2024.

[67] On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks. arXiv preprint arXiv:2402.08115.

[68] Huang, Xu and Liu, Weiwen and Chen, Xiaolong and Wang, Xingmei and Wang, Hao and Lian, Defu and Wang, Yasheng and Tang, Ruiming and Chen, Enhong. Understanding the Planning of. 2024.

[69] Liu, Bo and Jiang, Yuqian and Zhang, Xiaohan and Liu, Qiang and Zhang, Shiqi and Biswas, Joydeep and Stone, Peter. 2023.

[70] Silver, Tom and Dan, Soham and Srinivas, Kavitha and Tenenbaum, Joshua B. and Kaelbling, Leslie Pack and Katz, Michael. Generalized Planning in. 2024.

[71] Generative Agents: Interactive Simulacra of Human Behavior. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST).

[72] Cognitive Architectures for Language Agents. Transactions on Machine Learning Research (TMLR).

[73] A Survey on Large Language Model Based Autonomous Agents. Frontiers of Computer Science, 2024.

[74] Packer, Charles and Wooders, Sarah and Lin, Kevin and Fang, Vivian and Patil, Shishir G. and Stoica, Ion and Gonzalez, Joseph E. 2024.

[75] Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-off. Proceedings of the National Academy of Sciences (PNAS), 2019.

[76] Deep Double Descent: Where Bigger Models and More Data Hurt. International Conference on Learning Representations (ICLR).

[77] The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks. 2022.

[78] World Models. Advances in Neural Information Processing Systems (NeurIPS).

[79] A Path Towards Autonomous Machine Intelligence (Version 0.9.2). 2022.

[80] Mastering Diverse Domains through World Models. arXiv preprint arXiv:2301.04104.