Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

Eric P. Xing; Jinyu Hou; Lara Sá Neves; Mingkai Deng; Taylor W. Killian; Varad Pimpalkhute; Zhengzhong Liu

arXiv: 2605.22138 · v1 · pith:4LN6XSLPnew · submitted 2026-05-21 · 💻 cs.AI · cs.CL · cs.LG · cs.RO

Pith reviewed 2026-05-22 06:28 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG · cs.RO

keywords agentic reasoning · simulative planning · self-regulation · LLM agents · reinforcement learning · token efficiency · planning horizon

The pith

Decomposing agent reasoning into simulation, self-regulation, and reaction lets smaller models match much larger ones with far less computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that agentic reasoning improves when an LLM separates its thinking into three parts: a simulative system that predicts future states using itself as a world model, a self-regulator that chooses when and how much to plan, and a reactive system for immediate actions. This setup avoids the inefficiency of always doing long chain-of-thought reasoning. Experiments across math, science, and web tasks find that a 30 billion parameter version performs as well as systems with hundreds of billions or trillions of parameters, but consumes between 26 and 95 percent fewer reasoning tokens. Reinforcement learning on this structure lengthens the average planning horizon by about 23 percent while barely increasing how often the planner is called.

Core claim

SR²AM realizes simulative reasoning and self-regulation as distinct stages in an LLM's chain-of-thought, with the base model serving as the world model for predicting future states. Supervised training followed by reinforcement learning produces agents that invoke planning selectively and extend their planning depth, achieving competitive accuracy on diverse tasks with substantially reduced token consumption compared to larger reactive or always-planning baselines.

What carries the argument

Self-regulated simulative reasoning, in which the LLM simulates future states to ground deliberation and a learned configurator decides the presence, structure, and horizon of planning.
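The division of labor described above can be sketched as a small control loop. This is a toy illustration under stated assumptions, not the paper's implementation: `PlanConfig`, `choose_action`, and every callable passed in are hypothetical names, states are plain integers, and the "world model" is a hand-written function rather than an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PlanConfig:
    """Hypothetical output of the self-regulator (System III)."""
    plan: bool     # invoke the simulative planner at all?
    horizon: int   # how many future steps to simulate


def choose_action(
    state: int,
    actions: List[int],
    regulate: Callable[[int], PlanConfig],    # System III (assumed interface)
    world_model: Callable[[int, int], int],   # predicts the next state (assumed)
    value: Callable[[int], float],            # scores a predicted state (assumed)
    react: Callable[[int], int],              # System I fallback (assumed)
) -> int:
    cfg = regulate(state)
    if not cfg.plan or cfg.horizon < 1:
        # No planning requested: act reactively.
        return react(state)

    def rollout(s: int, depth: int) -> float:
        # Best value reachable from s within `depth` further simulated steps.
        if depth == 0:
            return value(s)
        return max(rollout(world_model(s, a), depth - 1) for a in actions)

    # System II: simulate each first action, keep the one with the best outlook.
    return max(actions, key=lambda a: rollout(world_model(state, a), cfg.horizon - 1))


# Toy task: reach 10 from 4. Action 0 adds one, action 1 doubles.
def toy_world(s: int, a: int) -> int:
    return s + 1 if a == 0 else s * 2


def toy_value(s: int) -> float:
    return -abs(10 - s)


def toy_react(s: int) -> int:
    return 1  # always double, without looking ahead


ACTIONS = [0, 1]
short = choose_action(4, ACTIONS, lambda s: PlanConfig(True, 1), toy_world, toy_value, toy_react)
deep = choose_action(4, ACTIONS, lambda s: PlanConfig(True, 2), toy_world, toy_value, toy_react)
none = choose_action(4, ACTIONS, lambda s: PlanConfig(False, 0), toy_world, toy_value, toy_react)
```

In this toy, a one-step lookahead prefers doubling (4 to 8), while a two-step lookahead prefers adding one first (4 to 5 to 10), which illustrates why the horizon chosen by the regulator can change the action and not only the cost.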

If this is right

A 30B model can reach Pass@1 levels comparable to 685B-1T systems on math, science, tabular, and web tasks.
Reasoning token usage drops by 25.8 to 95.3 percent relative to comparable agentic LLMs.
Reinforcement learning boosts average planning horizon by 22.8 percent with only a 2.0 percent rise in planning frequency.
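To make the reported percentages concrete, the arithmetic below applies them to a hypothetical baseline of 10,000 reasoning tokens (that number is an assumption for illustration, not a figure from the paper). It also combines the two RL figures under a loudly simplifying assumption that total simulated steps scale as planning frequency times average horizon; the paper does not state this relationship.

```python
def tokens_after_reduction(baseline_tokens: float, reduction: float) -> float:
    """Tokens remaining after a fractional reduction (0.258 means 25.8% fewer)."""
    return baseline_tokens * (1.0 - reduction)


def combined_growth(horizon_gain: float, frequency_gain: float) -> float:
    """Relative growth of frequency * horizon (an assumed proxy for total
    simulated steps), given the relative growth of each factor."""
    return (1.0 + horizon_gain) * (1.0 + frequency_gain) - 1.0


BASELINE = 10_000  # hypothetical baseline token count
low_end = tokens_after_reduction(BASELINE, 0.258)   # about 7,420 tokens remain
high_end = tokens_after_reduction(BASELINE, 0.953)  # about 470 tokens remain
total_growth = combined_growth(0.228, 0.020)        # about 0.2526, i.e. ~25.3%
```

Under that proxy, nearly all of the growth in simulated steps would come from longer plans rather than more frequent ones, which matches the reading that RL teaches the agent to plan further ahead rather than more often.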

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The self-regulation principle may generalize to controlling other agent behaviors such as when to update internal knowledge or adjust exploration rates.
Deploying such agents could become more practical in settings where token budgets or compute are limited.
Further scaling the approach might show whether the LLM world model remains reliable as task complexity increases beyond the tested domains.

Load-bearing premise

An LLM can function as a reliable world model for predicting future states across many tasks without any per-domain engineering or extra training.

What would settle it

A direct comparison on a new task domain where the simulative agent's success rate falls below that of a non-planning reactive baseline would indicate the world-model assumption does not hold.

Original abstract

How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR²AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain-of-thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper decomposes agentic reasoning into simulative planning with the LLM as world model, learned self-regulation, and reactive execution, then uses RL to extend planning horizons while holding frequency steady, claiming a 30B model matches much larger systems with far fewer tokens.

The main thing here is a clean three-part split for agentic LLMs: simulative reasoning that treats the base model as a world model for future-state prediction, a self-regulator that learns when and how far to plan, and plain reactive execution. They build this inside chain-of-thought, train first on structured plans reconstructed from traces of pretrained reasoning LLMs, then apply RL, which lengthens the planning horizon rather than increasing how often planning happens. The reported outcome is that their 30B version reaches Pass@1 numbers competitive with 685B-1T systems while using 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs across math, science, tabular, and web tasks, and the RL step delivers a 22.8% horizon gain with only 2.0% more planning calls. That separation and the RL horizon result are the clearest new pieces.

The framework is straightforward to follow and the quantitative targets are specific enough to be checkable.

The soft spot is the missing direct test of simulation quality. Nothing in the abstract shows how often the predicted future states actually match what happens after execution, so it remains possible that the efficiency comes mainly from more structured prompting rather than grounded simulation. Baselines, error bars, and data splits are also not detailed here, which leaves the performance parity claim harder to evaluate.

This is useful reading for anyone working on controlling compute in LLM agents. The ideas are coherent and the claims are concrete, so it deserves a serious referee who can ask for the missing validation runs and full experimental controls.

Referee Report

2 major / 2 minor

Summary. The paper introduces SR²AM, a framework for efficient agentic reasoning that decomposes decision-making into three systems: System I for reactive execution, System II for simulative reasoning using the LLM as a world model for future-state prediction, and System III for self-regulation, which decides when and how deeply to plan. It presents two versions: v0.1, based on a prompted multi-module system, and v1.0, reconstructed from traces of pretrained reasoning LLMs. Both are trained with supervised learning followed by reinforcement learning. Evaluations on math, science, tabular analysis, and web tasks show v1.0-30B achieving Pass@1 competitive with 685B-1T models while using 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs, with RL increasing planning horizon by 22.8% and planning frequency by only 2.0%.

Significance. Should the empirical claims prove robust upon detailed verification, this work could significantly advance efficient agentic systems by providing explicit control over planning through self-regulation and simulation, leading to substantial token savings without sacrificing performance. The demonstration that RL can extend planning horizons with minimal increase in frequency is a valuable insight. However, the reliance on the base LLM as an accurate world model without per-domain training or validation raises questions about generalizability.

major comments (2)

[Results and Evaluation] The abstract and results claim competitive Pass@1 and large token reductions for v1.0-30B compared to 685B-1T systems, but provide no specifics on the baselines (e.g., which agentic LLMs), exact benchmarks with dataset names, number of evaluation runs, statistical significance tests, or error bars. This information is essential to substantiate the efficiency claims and rule out confounds from task selection or prompting variations.
[Simulative Reasoning and World Model] The approach posits that the LLM can serve as a reliable world model for simulative future-state prediction across diverse tasks without additional training or per-domain engineering. However, there are no reported metrics on simulation fidelity, such as the accuracy of predicted states versus actual outcomes in math, science, or web tasks. If the simulations largely amount to rephrased reasoning steps rather than grounded predictions, the decomposition into Systems I/II/III may not provide benefits beyond standard multi-stage prompting, undermining the central efficiency and RL horizon-extension results.
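The fidelity metric the second comment asks for could take roughly the following form. This is a minimal sketch under assumptions: the paper reports no such metric, `simulation_fidelity` and `normalize` are hypothetical names, and exact match after whitespace and case normalization is only one possible matching rule (a semantic or task-specific comparison may be more appropriate).

```python
from typing import Callable, Optional, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def simulation_fidelity(
    predicted: Sequence[str],
    observed: Sequence[str],
    match: Optional[Callable[[str, str], bool]] = None,
) -> float:
    """Fraction of simulated states that agree with the state actually
    observed after execution. `match` is an assumed, pluggable comparison."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must be aligned step by step")
    if not predicted:
        raise ValueError("need at least one predicted state")
    if match is None:
        match = lambda p, o: normalize(p) == normalize(o)
    hits = sum(1 for p, o in zip(predicted, observed) if match(p, o))
    return hits / len(predicted)


# Illustrative data only, not results from the paper.
predicted_states = ["x = 4", "Page  shows price", "sum is 10"]
observed_states = ["x = 4", "page shows price", "sum is 12"]
fidelity = simulation_fidelity(predicted_states, observed_states)
```

Reporting this number per domain, alongside task accuracy, would help separate gains from grounded prediction from gains that come from structured prompting alone.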

minor comments (2)

[Terminology] The terms 'System I', 'System II', and 'System III' are introduced without a clear reference to their origins or a diagram illustrating their interactions, which could aid reader comprehension.
[Abstract] The abstract states 'RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%', but does not define how planning horizon and frequency are measured.
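One plausible way to define the two quantities is sketched below. These definitions are assumptions made for illustration, since the abstract does not give them: planning frequency is taken as the share of agent turns that invoke the planner, and planning horizon as the mean number of simulated steps per invocation. `planning_stats` and the trace format are hypothetical.

```python
from typing import List, Tuple


def planning_stats(trace: List[int]) -> Tuple[float, float]:
    """Compute (frequency, mean horizon) from one episode trace.

    trace[i] is the number of future steps simulated at agent turn i,
    with 0 meaning the planner was not invoked on that turn.
    """
    if not trace:
        raise ValueError("trace must contain at least one turn")
    planned = [h for h in trace if h > 0]
    frequency = len(planned) / len(trace)
    mean_horizon = sum(planned) / len(planned) if planned else 0.0
    return frequency, mean_horizon


# Illustrative trace: the planner runs on 3 of 8 turns.
frequency, mean_horizon = planning_stats([0, 3, 0, 0, 5, 0, 2, 0])
```

Under these definitions the two measures are independent, so a report that one rises by 22.8% while the other rises by 2.0% is at least well-formed; other definitions (for example, horizon measured in tokens) would need to be stated explicitly.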

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving the clarity and substantiation of our empirical claims and the conceptual framing of simulative reasoning. We address each major comment below and have revised the manuscript accordingly.

Referee: [Results and Evaluation] The abstract and results claim competitive Pass@1 and large token reductions for v1.0-30B compared to 685B-1T systems, but provide no specifics on the baselines (e.g., which agentic LLMs), exact benchmarks with dataset names, number of evaluation runs, statistical significance tests, or error bars. This information is essential to substantiate the efficiency claims and rule out confounds from task selection or prompting variations.

Authors: We agree that greater specificity is needed to allow independent verification. The original manuscript described the evaluation at a high level across math, science, tabular analysis, and web tasks. In the revised version, we have expanded the Experiments and Evaluation sections to explicitly name the agentic baselines (including ReAct-style agents, Reflexion, and comparisons against specific large models in the 685B-1T range), list the precise datasets and splits used, report results over five independent runs with standard error bars, and include paired statistical significance tests against baselines. These additions directly address potential confounds from task selection or prompting.
Revision: yes
Referee: [Simulative Reasoning and World Model] The approach posits that the LLM can serve as a reliable world model for simulative future-state prediction across diverse tasks without additional training or per-domain engineering. However, there are no reported metrics on simulation fidelity, such as the accuracy of predicted states versus actual outcomes in math, science, or web tasks. If the simulations largely amount to rephrased reasoning steps rather than grounded predictions, the decomposition into Systems I/II/III may not provide benefits beyond standard multi-stage prompting, undermining the central efficiency and RL horizon-extension results.

Authors: This is a fair and substantive concern regarding the grounding of the world-model assumption. We maintain that the explicit three-system decomposition, combined with the observed RL effects (22.8% longer planning horizons with only 2.0% increase in planning frequency), provides evidence that the simulative component contributes beyond standard prompting. Nevertheless, we acknowledge the absence of direct fidelity metrics. In the revision we have added an appendix with qualitative examples of state predictions on math and science tasks together with a discussion of how prediction accuracy can be assessed on verifiable subtasks; we also clarify the distinction from multi-stage prompting by emphasizing the learned self-regulation of when and how far to simulate. We note that full per-domain quantitative validation would require additional annotation effort beyond the scope of the current study.
Revision: partial
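The reporting promised in the first response (several runs, standard errors, a paired test) can be sketched with the standard library alone. Everything here is an assumption for illustration: the scores are made up, and an exact sign-flip permutation test is one choice of paired test, not necessarily the one the authors would use.

```python
import itertools
import math
import statistics
from typing import Sequence, Tuple


def mean_and_stderr(scores: Sequence[float]) -> Tuple[float, float]:
    """Sample mean and standard error (sample stdev / sqrt(n))."""
    return statistics.mean(scores), statistics.stdev(scores) / math.sqrt(len(scores))


def paired_sign_flip_p(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact sign-flip permutation test on paired differences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two aligned, non-empty score lists")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Made-up Pass@1 scores over five runs, for illustration only.
agent_runs = [0.62, 0.60, 0.64, 0.61, 0.63]
baseline_runs = [0.58, 0.57, 0.59, 0.58, 0.60]
mean_score, stderr = mean_and_stderr(agent_runs)
p_value = paired_sign_flip_p(agent_runs, baseline_runs)
```

One consequence worth noting: with only five paired runs, this exact test cannot return a two-sided p-value below 2/32 = 0.0625, even when the agent wins every run, so a significance claim at the 0.05 level would need more runs or a test with stronger distributional assumptions.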

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained against external benchmarks

The paper advances a conceptual decomposition of agentic reasoning into simulative (System II), self-regulatory (System III), and reactive (System I) components, then implements SR²AM via prompted CoT stages and RL on traces from existing pretrained models. All reported outcomes—Pass@1 parity with larger models, 25.8-95.3% token reduction, and RL-driven 22.8% horizon increase—are presented as measured empirical results from evaluations on math, science, tabular, and web tasks rather than quantities derived by construction from fitted parameters or self-referential definitions. No equations, uniqueness theorems, or load-bearing self-citations appear in the provided text that would collapse the architecture or performance claims back to the inputs. The LLM-as-world-model assumption is stated explicitly and tested through overall task performance, not smuggled in via circular fit.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on the domain assumption that LLMs can serve as world models and on the empirical results of RL training; no explicit free parameters are named beyond the reported horizon increase, and no new physical or mathematical entities are postulated.

free parameters (1)

average planning horizon
RL training produces a 22.8% increase; this quantity is an outcome of optimization rather than an input constant.

axioms (1)

domain assumption: An LLM can act as a world model for future-state prediction without per-domain engineering
Invoked when describing simulative reasoning (System II) as providing unified planning across tasks.

invented entities (1)

System III self-regulation configurator (no independent evidence)
purpose: Learned module that decides when and how deeply to invoke simulative planning
Conceptual component introduced to control planning presence and structure.

pith-pipeline@v0.9.0 · 5922 in / 1540 out tokens · 63295 ms · 2026-05-22T06:28:33.920969+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Paper passage: "decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I)"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: "with the LLM itself serving as the world model in language space"

Each tag describes the relation between the paper passage and the cited Recognition theorem.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

154 extracted references · 154 canonical work pages · 25 internal anchors

[1]

GPT-4 Technical Report

JoshAchiam,StevenAdler,SandhiniAgarwal,LamaAhmad,IlgeAkkaya,FlorenciaLeoniAleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXivpreprintarXiv:2303.08774,2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

PranjalAggarwalandSeanWelleck. L1: Controllinghowlongareasoningmodelthinkswithreinforce- mentlearning.arXivpreprintarXiv:2503.04697,2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Claude3.7sonnetandclaudecode,February2025

Anthropic. Claude3.7sonnetandclaudecode,February2025

work page
[4]

Training language models to reason efficiently.arXiv preprint arXiv:2502.04463,2025

Daman Arora and Andrea Zanette. Training language models to reason efficiently.arXiv preprint arXiv:2502.04463,2025

work page arXiv 2025
[5]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

MidoAssran,AdrienBardes,DavidFan,QuentinGarrido,RussellHowes,MatthewMuckley,Ammar Rizvi,ClaireRoberts,KoustuvSinha,ArtemZholus,etal. V-jepa2: Self-supervisedvideomodelsenable understanding,predictionandplanning.arXivpreprintarXiv:2506.09985,2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Axolotl: Opensourcellmpost-training

Axolotlmaintainersandcontributors. Axolotl: Opensourcellmpost-training. https://github.com/ axolotl-ai-cloud/axolotl,May2023. Software

work page
[7]

Navigationworldmodels

AmirBar,GaoyueZhou,DannyTran,TrevorDarrell,andYannLeCun. Navigationworldmodels. In ProceedingsoftheComputerVisionandPatternRecognitionConference,pages15791–15801,2025

work page 2025
[8]

Llama-nemotron: Efficient reasoning models,

Akhiad Bercovich, Itay Levy, Izik Golan, et al. Llama-nemotron: Efficient reasoning models.arXiv preprintarXiv:2505.00949,2025

work page arXiv 2025
[9]

Language modelsarefew-shotlearners

TomBrown,BenjaminMann,NickRyder,MelanieSubbiah,JaredDKaplan,PrafullaDhariwal,Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, GretchenKrueger,TomHenighan,RewonChild,AdityaRamesh,DanielZiegler,JeffreyWu,Clemens Winter,ChrisHesse,MarkChen,EricSigler,MateuszLitwin,ScottGray,BenjaminChess,JackClark, Christo...

work page 1901
[10]

Deerflow,2026

ByteDance. Deerflow,2026. Usethemain-1.xbranchforDeerFlow1.x;accessed2026-04-04

work page 2026
[11]

AdvancedTextbooksinControlandSignal Processing.SpringerLondon,2edition,2007

E.F.CamachoandC.Bordons.ModelPredictiveControl. AdvancedTextbooksinControlandSignal Processing.SpringerLondon,2edition,2007

work page 2007
[12]

Evaluating Large Language Models Trained on Code

MarkChen,JerryTworek,HeewooJun,QimingYuan,HenriquePondeDeOliveiraPinto,JaredKaplan, HarriEdwards,YuriBurda,NicholasJoseph,GregBrockman,etal. Evaluatinglargelanguagemodels trainedoncode.arXivpreprintarXiv:2107.03374,2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

A 2FM:Anadaptiveagentfoundationmodel fortool-awarehybridreasoning.arXivpreprintarXiv:2510.12838,2025

QianbenChen,JingyiCao,JiayuZhang,TianruiQin,etal. A 2FM:Anadaptiveagentfoundationmodel fortool-awarehybridreasoning.arXivpreprintarXiv:2510.12838,2025

work page arXiv 2025
[14]

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Xingyu Chen et al. Do not think that much for 2+3=? on the overthinking of o1-like LLMs.arXiv preprintarXiv:2412.21187,2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

FinQA:A datasetofnumericalreasoningoverfinancialdata

ZhiyuChen,WenhuChen,ChareseSmiley,SameenaShah,IanaBorber,SebastianYe,etal. FinQA:A datasetofnumericalreasoningoverfinancialdata. InEMNLP,2021

work page 2021
[16]

Fullstackbench: Evaluatingllmsasfullstackcoders

YaoCheng,JianfengChen,JieChen,LiChen,LiyuChen,WentaoChen,ZhengyuChen,ShijieGeng, AoyanLi,BoLi,BowenLi,LinyiLi,BoyiLiu,JiahengLiu,KaiboLiu,QiLiu,ShukaiLiu,SiyaoLiu, TianyiLiu,TingkaiLiu,YongfeiLiu,RuiLong,JingMai,GuanghanNing,Z.Y.Peng,KaiShen,Jiahao Su, Jing Su, Tao Sun, Yifan Sun, Yunzhe Tao, Guoyin Wang, Siwei Wang, Xuwu Wang, Yite Wang, ZihanWang,Jinxia...

work page arXiv 2024
[17]

Killian, Haonan Li, Mikhail Yurochkin, Eric P

ZhoujunCheng,ShiboHao,TianyangLiu,FanZhou,YutaoXie,FengYao,YuexinBian,NilabjoDey, Yonghao Zhuang, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Taylor W. Killian, Haonan Li, Mikhail Yurochkin, Eric P. Xing, ZhengzhongLiu,andZhitingHu. RevisitingreinforcementlearningforLLMreasoningfromacross- do...

work page 2026
[18]

Deepseek-v3.2: Efficientreasoning&agenticai,December2025

DeepSeek. Deepseek-v3.2: Efficientreasoning&agenticai,December2025

work page
[19]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI. DeepSeek-R1: IncentivizingreasoningcapabilityinLLMsviareinforcementlearning. arXivpreprintarXiv:2501.12948,2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

General Agentic Planning Through Simulative Reasoning with World Models

MingkaiDeng,JinyuHou,ZhitingHu,andEricXing. Generalagenticplanningthroughsimulative reasoningwithworldmodels.arXivpreprintarXiv:2507.23773,2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Xinrun Du, Yifan Sun, Kaixin Zhu, Junying Liu, Bangyan Zhao, et al. SuperGPQA: Scaling LLM evaluationacross285graduatedisciplines.arXivpreprintarXiv:2502.14739,2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[22]

Megascience: Pushing the frontiers of post-training datasetsforsciencereasoning.arXivpreprintarXiv:2507.16812,2025

Run-Ze Fan, Zengzhi Wang, and Pengfei Liu. Megascience: Pushing the frontiers of post-training datasetsforsciencereasoning.arXivpreprintarXiv:2507.16812,2025

work page arXiv 2025
[23]

Thinkless: LLMlearnswhentothink.arXivpreprint arXiv:2505.13379,2025

GongfanFang,XinyinMa,andXinchaoWang. Thinkless: LLMlearnswhentothink.arXivpreprint arXiv:2505.13379,2025

work page arXiv 2025
[24]

Helix: A vision-language-action model for generalist humanoid control.https://www

Figure AI. Helix: A vision-language-action model for generalist humanoid control.https://www. figure.ai/news/helix,2025. Accessed: 2026-05-06

work page 2025
[26]

World reasoning arena.arXivpreprintarXiv:2603.25887, 2026

Qiyue Gao, Kun Zhou, Jiannan Xiang, Zihan Liu, Dequan Yang, Junrong Chen, Arif Ahmad, Cong Zeng, Ganesh Bannur, Xinqi Huang, et al. World reasoning arena.arXivpreprintarXiv:2603.25887, 2026

work page arXiv 2026
[27]

Modelpredictivecontrol: Theoryandpractice—a survey.Automatica,25(3):335–348,1989

CarlosE.Garcia,DavidM.Prett,andManfredMorari. Modelpredictivecontrol: Theoryandpractice—a survey.Automatica,25(3):335–348,1989

work page 1989
[28]

Inversescalingintest-timecompute.TransactionsonMachineLearning Research,2025

AryoPradiptaGema,AlexanderHägele,RunjinChen,AndyArditi,JacobGoldman-Wetzler,KitFraser- Taliente,HenrySleight,LindaPetrini,JulianMichael,BeatriceAlex,PasqualeMinervini,YandaChen, JoeBenton,andEthanPerez. Inversescalingintest-timecompute.TransactionsonMachineLearning Research,2025

work page 2025
[29]

Trydeepresearchandournewexperimentalmodelingemini,youraiassistant

Google. Trydeepresearchandournewexperimentalmodelingemini,youraiassistant. https://blog. google/products-and-platforms/products/gemini/google-gemini-deep-research/,Decem- ber2024. Accessed: 2026-04-04

work page 2026
[30]

World Models

DavidHaandJürgenSchmidhuber. Worldmodels.arXivpreprintarXiv:1803.10122,2(3):440,2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[31]

Dream to Control: Learning Behaviors by Latent Imagination

DanijarHafner,TimothyLillicrap,JimmyBa,andMohammadNorouzi. Dreamtocontrol: Learning behaviorsbylatentimagination.arXivpreprintarXiv:1912.01603,2019

work page internal anchor Pith review Pith/arXiv arXiv 1912
[32]

Learninglatentdynamicsforplanningfrompixels

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James Davidson. Learninglatentdynamicsforplanningfrompixels. InProceedingsofthe36thInternational ConferenceonMachineLearning,2019

work page 2019
[33]

Reasoningwith language model is planning with world model

ShiboHao,YiGu,HaodiMa,JoshuaHong,ZhenWang,DaisyWang,andZhitingHu. Reasoningwith language model is planning with world model. InProceedings of the 2023 Conference on Empirical MethodsinNaturalLanguageProcessing,pages8154–8173,2023

work page 2023
[34]

Whentoreason,whentoact: Aunifiedpolicyforadaptivereasoningandacting.arXiv preprintarXiv:2505.07363,2025

ZhaoyiHeetal. Whentoreason,whentoact: Aunifiedpolicyforadaptivereasoningandacting.arXiv preprintarXiv:2505.07363,2025

work page arXiv 2025
[35]

MeasuringmathematicalproblemsolvingwiththeMATHdataset.NeurIPS,2021

DanHendrycks,CollinBurns,SauravKadavath,AkulArora,StevenBasart,EricTang,DawnSong,and JacobSteinhardt. MeasuringmathematicalproblemsolvingwiththeMATHdataset.NeurIPS,2021. 12

work page 2021
[36]

Constructingamulti-hop qadatasetforcomprehensiveevaluationofreasoningsteps

XanhHo,Anh-KhoaDuongNguyen,SakuSugawara,andAkikoAizawa. Constructingamulti-hop qadatasetforcomprehensiveevaluationofreasoningsteps. InProceedingsofthe28thInternational ConferenceonComputationalLinguistics,pages6609–6625,2020

work page 2020
[37]

Metagpt: Meta programming for a multi-agent collaborativeframework

SiruiHong,MingchenZhuge,JonathanChen,XiawuZheng,YuhengCheng,JinlinWang,CeyaoZhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborativeframework. InThetwelfthinternationalconferenceonlearningrepresentations,2023

work page 2023
[38]

Language models, agent models, and world models: The law for machine reasoning and planning.arXiv preprint arXiv:2312.05230, 2023

ZhitingHuandTianminShu. Languagemodels,agentmodels,andworldmodels: Thelawformachine reasoningandplanning.arXivpreprintarXiv:2312.05230,2023

work page arXiv 2023
[39]

${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

Physical Intelligence, Bo Ai, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Greg Balke, Kevin Black,GeorgeBokinsky,ShihaoCao,ThomasCharbonnier,etal. 𝜋0.7: asteerablegeneralistrobotic foundationmodelwithemergentcapabilities.arXivpreprintarXiv:2604.15483,2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[40]

Thinkonlywhenyouneedwithlargehybrid-reasoningmodels.arXivpreprint arXiv:2505.14631,2025

XiaoYuanJiangetal. Thinkonlywhenyouneedwithlargehybrid-reasoningmodels.arXivpreprint arXiv:2505.14631,2025

work page arXiv 2025
[41]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

BowenJin,HansiYue,ZhichengDou,JiayiYu,HaoPeng,andJiaweiHan. Search-R1: TrainingLLMs toreasonandleveragesearchengineswithreinforcementlearning.arXivpreprintarXiv:2503.09516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Farrar,StrausandGiroux,2011

DanielKahneman.Thinking,FastandSlow. Farrar,StrausandGiroux,2011

work page 2011
[43]

C3ot: Generatingshorterchain-of-thoughtwithoutcompromisingeffectiveness.arXiv preprintarXiv:2412.11664,2024

RyanKangetal. C3ot: Generatingshorterchain-of-thoughtwithoutcompromisingeffectiveness.arXiv preprintarXiv:2412.11664,2024

work page arXiv 2024
[44]

Kingma and Jimmy Ba

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. InInternational ConferenceonLearningRepresentations,2015

work page 2015
[45]

Langgraph, 2026

LangChain Inc. Langgraph, 2026. Open-source framework for building stateful agents; accessed 2026-04-04

work page 2026
[46]

PhDthesis,UniversitàdellaSvizzeraitaliana,June2008

ShaneLegg.MachineSuperIntelligence. PhDthesis,UniversitàdellaSvizzeraitaliana,June2008

work page
[47]

ARM:Adaptivereasoningmodel.arXivpreprintarXiv:2505.20258,2025

JianLietal. ARM:Adaptivereasoningmodel.arXivpreprintarXiv:2505.20258,2025

work page arXiv 2025
[48]

Chain-of-agents: End-to-endagentfoundationmodelsviamulti-agentdistillation andagenticRL.arXivpreprintarXiv:2508.13167,2025

Weizhen Li, Jianbo Lin, Zhuosong Jiang, Jingyi Cao, Xinpeng Liu, Jiayu Zhang, Zhenqiang Huang, Qianben Chen, Weichen Sun, Qiexiang Wang, Hongxuan Lu, Tianrui Qin, Chenghao Zhu, Yi Yao, ShuyingFan,XiaowanLi,TiannanWang,PaiLiu,KingZhu,HeZhu,DingfengShi,PiaohongWang, YeyiGuan,XiangruTang,MinghaoLiu,YuchenEleanorJiang,JianYang,JiahengLiu,GeZhang,and Wangchuns...

work page arXiv 2025
[49]

Webexplorer: Exploreandevolvefortraininglong-horizonwebagents.arXivpreprint arXiv:2509.06501,2025

JuntengLiu,YunjiLi,ChiZhang,JingyangLi,AiliChen,KeJi,WeiyuCheng,ZijiaWu,ChengyuDu, QidiXu,etal. Webexplorer: Exploreandevolvefortraininglong-horizonwebagents.arXivpreprint arXiv:2509.06501,2025

work page arXiv 2025
[50]

K2-V2: A 360-open, reasoning-enhanced LLM,

ZhengzhongLiu,LipingTang,LinghaoJin,HaonanLi,NikhilRanjan,DesaiFan,ShauryaRohatgi, Richard Fan, Omkar Pangarkar, Huijuan Wang, et al. K2-v2: A 360-open, reasoning-enhanced llm. arXivpreprintarXiv:2512.06201,2025

work page arXiv 2025
[51]

Understandingr1-zero-liketraining: Acriticalperspective,2025

ZichenLiu,ChangyuChen,WenjunLi,PenghuiQi,TianyuPang,ChaoDu,WeeSunLee,andMinLin. Understandingr1-zero-liketraining: Acriticalperspective,2025

work page 2025
[52]

AdaCoT:Pareto-optimaladaptivechain-of-thoughttriggeringviareinforcement learning.arXivpreprintarXiv:2505.11896,2025

ChenweiLou,ZeweiSun,XinnianLiang,MengQu,WeiShen,WenqiWang,YuntaoLi,QingpingYang, andShuangzhiWu. AdaCoT:Pareto-optimaladaptivechain-of-thoughttriggeringviareinforcement learning.arXivpreprintarXiv:2505.11896,2025

work page arXiv 2025
[53]

2024-25aimethresholdsareavailable,December2024

MAACommunications. 2024-25aimethresholdsareavailable,December2024. UpdatedJanuary6, 2025

work page 2024
[54]

Americaninvitationalmathematicsexamination(AIME),2024

MathematicalAssociationofAmerica. Americaninvitationalmathematicsexamination(AIME),2024. 13

work page 2024
[55]

Aproposalforthe dartmouthsummerresearchprojectonartificialintelligence,August1955

JohnMcCarthy,MarvinL.Minsky,NathanielRochester,andClaudeE.Shannon. Aproposalforthe dartmouthsummerresearchprojectonartificialintelligence,August1955. ProposaldatedAugust31, 1955

[56] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. arXiv preprint arXiv:2311.12983, 2023.

[57] MiroMindAI. Miroflow, 2026. Open-source research-agent framework; accessed 2026-04-04.

[58] Moonshot AI. Kimi k2.5: Visual agentic intelligence. https://www.kimi.com/blog/kimi-k2-5. Accessed: 2026-04-05.

[59] Tergel Munkhbat et al. Self-training elicits concise reasoning in large language models. arXiv preprint arXiv:2502.14922, 2025.

[60] Allen Newell, John C Shaw, and Herbert A Simon. Report on a general problem solving program. In IFIP congress, volume 256, page 1959. Pittsburgh, PA, 1959.

[61] OpenAI. Computer-using agent.

[62] OpenAI. Learning to reason with LLMs. 2024.

[63] OpenAI. OpenAI o1 and new tools for developers, December 2024.

[64] OpenAI. BrowseComp: A simple challenge for browsing agents. arXiv preprint arXiv:2501.15896, 2025.

[65] OpenAI. gpt-oss-120b & gpt-oss-20b model card, August 2025.

[66] OpenAI. Introducing codex. https://openai.com/index/introducing-codex/, May 2025. Accessed: 2026-04-04.

[67] OpenAI. Introducing deep research, 2025.

[68] OpenAI. Introducing GPT-4.1 in the API. https://openai.com/index/gpt-4-1/, April 2025. Accessed: 2026-04-05.

[69] OpenAI. OpenAI o3 and o4-mini system card. https://openai.com/index/o3-o4-mini-system-card/, April 2025. Official system card for OpenAI o3 and o4-mini.

[70] OpenAI. Introducing GPT-5.4, March 2026.

[71] Davide Paglieri, Bartłomiej Cupiał, Jonathan Cook, Ulyana Piterbarg, Jens Tuyls, Edward Grefenstette, Jakob Nicolaus Foerster, Jack Parker-Holder, and Tim Rocktäschel. Learning when to plan: Efficiently allocating test-time compute for LLM agents, 2026.

[72] Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025. Also indexed as: A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649(8099):1139–1146, 2026.

[73] Physical Intelligence. Physical Intelligence (𝜋). https://www.pi.website/, 2024. Accessed: 2026-05-06.

[74] Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, et al. Current agents fail to leverage world model as tool for foresight. arXiv preprint arXiv:2601.03905, 2026.

[75] Qwen Team. Qwen3-235b-a22b-instruct-2507. https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507. Model card, accessed 2026-04-05.

[76] Qwen Team. Qwen3-235b-a22b-thinking-2507. https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507. Model card, accessed 2026-04-05.

[77] Qwen Team. Qwen3-30b-a3b-instruct-2507. https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507. Model card, accessed 2026-04-05.

[78] Qwen Team. Qwen3-30b-a3b-thinking-2507. https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507. Model card, accessed 2026-04-05.

[79] Qwen Team. Qwen3-8b. https://huggingface.co/Qwen/Qwen3-8B. Model card, accessed 2026-04-05.

[80] Qwen Team. Qwen3-next-80b-a3b-instruct. https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct. Model card, accessed 2026-04-05.

[81] Qwen Team. Qwen3-32b. https://huggingface.co/Qwen/Qwen3-32B, May 2025. Model card.

