
arxiv: 2605.22138 · v1 · pith:4LN6XSLPnew · submitted 2026-05-21 · 💻 cs.AI · cs.CL · cs.LG · cs.RO

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

Pith reviewed 2026-05-22 06:28 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG · cs.RO
keywords agentic reasoning, simulative planning, self-regulation, LLM agents, reinforcement learning, token efficiency, planning horizon

The pith

Decomposing agent reasoning into simulation, self-regulation, and reaction lets smaller models match much larger ones with far less computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that agentic reasoning improves when an LLM separates its thinking into three parts: a simulative system that predicts future states using itself as a world model, a self-regulator that chooses when and how much to plan, and a reactive system for immediate actions. This setup avoids the inefficiency of always doing long chain-of-thought reasoning. Experiments across math, science, and web tasks find that a 30 billion parameter version performs as well as systems with hundreds of billions or trillions of parameters, but consumes between 26 and 95 percent fewer reasoning tokens. Reinforcement learning on this structure lengthens the average planning horizon by about 23 percent while barely increasing how often the planner is called.

Core claim

SR²AM realizes simulative reasoning and self-regulation as distinct stages in an LLM's chain-of-thought, with the base model serving as the world model for predicting future states. Supervised training followed by reinforcement learning produces agents that invoke planning selectively and extend their planning depth, achieving competitive accuracy on diverse tasks with substantially reduced token consumption compared to larger reactive or always-planning baselines.

What carries the argument

Self-regulated simulative reasoning, in which the LLM simulates future states to ground deliberation and a learned configurator decides the presence, structure, and horizon of planning.
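
The three-way split described above can be sketched as a small control loop. This is a toy illustration under stated assumptions, not the paper's implementation: `PlanConfig`, `run_agent`, and every callback name are hypothetical, the state is a single integer, and the same `simulate` function stands in for both the world model and the real environment, which a real agent would keep separate.

```python
from dataclasses import dataclass


@dataclass
class PlanConfig:
    """Hypothetical output of the self-regulator (System III)."""
    plan: bool     # invoke the simulative planner at this step?
    horizon: int   # how many steps to roll forward when planning


def run_agent(state, regulate, propose, simulate, score, is_done, max_steps=20):
    """Toy loop: self-regulate, optionally simulate ahead, then act."""
    trace = []
    for _ in range(max_steps):
        if is_done(state):
            break
        config = regulate(state)
        candidates = propose(state)
        if config.plan:
            def lookahead(action):
                # System II: predict future states with the world model.
                s = simulate(state, action)
                for _ in range(config.horizon - 1):
                    if is_done(s):
                        break
                    best = max(propose(s), key=lambda a: score(simulate(s, a)))
                    s = simulate(s, best)
                return score(s)
            action = max(candidates, key=lookahead)
        else:
            # System I: reactive default, no simulation.
            action = candidates[0]
        # Toy simplification: the world model doubles as the environment.
        state = simulate(state, action)
        trace.append((action, config.plan))
    return state, trace


# Hypothetical task: reach TARGET by adding 1 or 3; plan only when far away.
TARGET = 7
final_state, trace = run_agent(
    state=0,
    regulate=lambda s: PlanConfig(plan=abs(TARGET - s) > 3, horizon=2),
    propose=lambda s: [1, 3],
    simulate=lambda s, a: s + a,
    score=lambda s: -abs(TARGET - s),
    is_done=lambda s: s == TARGET,
)
planned_steps = sum(1 for _, planned in trace if planned)
print(final_state, planned_steps, trace)
```

In this toy run the regulator calls the planner only while the agent is far from the goal and falls back to the reactive default afterwards. That selective invocation is the behavior the self-regulation component is meant to learn.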

If this is right

  • A 30B model can reach Pass@1 levels comparable to 685B-1T systems on math, science, tabular, and web tasks.
  • Reasoning token usage drops by 25.8 to 95.3 percent relative to comparable agentic LLMs.
  • Reinforcement learning boosts average planning horizon by 22.8 percent with only a 2.0 percent rise in planning frequency.
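
The percentages in the list above are relative quantities, and the arithmetic behind them can be sketched as follows. The helper names and all input numbers are hypothetical and chosen only to reproduce figures of the same size as those quoted. The summary does not state how the paper measures planning horizon or frequency, so the plain ratio definitions here are an assumption.

```python
def pass_at_1(correct_flags):
    """Fraction of tasks solved on the first (single) attempt."""
    return sum(correct_flags) / len(correct_flags)


def relative_reduction(baseline, ours):
    """Percent by which `ours` is smaller than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


def relative_increase(before, after):
    """Percent by which `after` exceeds `before`."""
    return 100.0 * (after - before) / before


# Hypothetical inputs, for illustration only.
accuracy = pass_at_1([True, True, False, True])      # 3 of 4 tasks solved
token_saving = relative_reduction(10_000, 7_420)     # mean reasoning tokens
horizon_gain = relative_increase(5.0, 6.14)          # mean planned steps
frequency_gain = relative_increase(0.50, 0.51)       # share of steps that plan

print(round(accuracy, 2), round(token_saving, 1),
      round(horizon_gain, 1), round(frequency_gain, 1))
```

Reporting the horizon and frequency changes separately is what lets the summary distinguish planning further ahead from planning more often.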

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The self-regulation principle may generalize to controlling other agent behaviors such as when to update internal knowledge or adjust exploration rates.
  • Deploying such agents could become more practical in settings where token budgets or compute are limited.
  • Further scaling the approach might show whether the LLM world model remains reliable as task complexity increases beyond the tested domains.

Load-bearing premise

An LLM can function as a reliable world model for predicting future states across many tasks without any per-domain engineering or extra training.

What would settle it

A direct comparison on a new task domain where the simulative agent's success rate falls below that of a non-planning reactive baseline would indicate the world-model assumption does not hold.

original abstract

How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR²AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain-of-thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SR²AM, a framework for efficient agentic reasoning that decomposes decision-making into three systems: System I for reactive execution, System II for simulative reasoning that uses the LLM as a world model for future-state prediction, and System III for self-regulation, which decides when and how deeply to plan. It presents two versions: v0.1, built by recording decisions from a prompted multi-module system, and v1.0, reconstructed from traces of pretrained reasoning LLMs. Both are trained with supervised learning followed by reinforcement learning. Evaluations on math, science, tabular analysis, and web tasks show v1.0-30B achieving Pass@1 competitive with 685B-1T models while using 25.8-95.3% fewer reasoning tokens, with RL increasing the average planning horizon by 22.8% and planning frequency by only 2.0%.

Significance. Should the empirical claims prove robust upon detailed verification, this work could significantly advance efficient agentic systems by providing explicit control over planning through self-regulation and simulation, leading to substantial token savings without sacrificing performance. The demonstration that RL can extend planning horizons with minimal increase in frequency is a valuable insight. However, the reliance on the base LLM as an accurate world model without per-domain training or validation raises questions about generalizability.

major comments (2)
  1. [Results and Evaluation] The abstract and results claim competitive Pass@1 and large token reductions for v1.0-30B compared to 685B-1T systems, but provide no specifics on the baselines (e.g., which agentic LLMs), exact benchmarks with dataset names, number of evaluation runs, statistical significance tests, or error bars. This information is essential to substantiate the efficiency claims and rule out confounds from task selection or prompting variations.
  2. [Simulative Reasoning and World Model] The approach posits that the LLM can serve as a reliable world model for simulative future-state prediction across diverse tasks without additional training or per-domain engineering. However, there are no reported metrics on simulation fidelity, such as the accuracy of predicted states versus actual outcomes in math, science, or web tasks. If the simulations largely amount to rephrased reasoning steps rather than grounded predictions, the decomposition into Systems I/II/III may not provide benefits beyond standard multi-stage prompting, undermining the central efficiency and RL horizon-extension results.
minor comments (2)
  1. [Terminology] The terms 'System I', 'System II', and 'System III' are introduced without a clear reference to their origins or a diagram illustrating their interactions, which could aid reader comprehension.
  2. [Abstract] The abstract states 'RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%', but does not define how planning horizon and frequency are measured.
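
The fidelity measurement requested in major comment 2 could take roughly the following shape. This is a sketch of one possible metric, not something the paper reports: the record format (pairs of predicted and observed state descriptions), the function names, and the exact-match criterion after whitespace and case normalization are all assumptions. Free-text states would likely need a softer comparison.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def simulation_fidelity(pairs):
    """Share of predicted states that match the state actually observed."""
    if not pairs:
        raise ValueError("need at least one (predicted, observed) pair")
    hits = sum(normalize(pred) == normalize(obs) for pred, obs in pairs)
    return hits / len(pairs)


# Hypothetical logged pairs: (state predicted by the planner, state observed).
logged_pairs = [
    ("x = 4", "X = 4"),
    ("page shows login form", "page shows search results"),
    ("sum is 10", "sum  is 10"),
]
fidelity = simulation_fidelity(logged_pairs)
print(round(fidelity, 3))
```

A low score on a measure like this, combined with unchanged task accuracy, would support the referee's worry that the simulations are rephrased reasoning rather than grounded predictions.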

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving the clarity and substantiation of our empirical claims and the conceptual framing of simulative reasoning. We address each major comment below and have revised the manuscript accordingly.

point-by-point responses
  1. Referee: [Results and Evaluation] The abstract and results claim competitive Pass@1 and large token reductions for v1.0-30B compared to 685B-1T systems, but provide no specifics on the baselines (e.g., which agentic LLMs), exact benchmarks with dataset names, number of evaluation runs, statistical significance tests, or error bars. This information is essential to substantiate the efficiency claims and rule out confounds from task selection or prompting variations.

    Authors: We agree that greater specificity is needed to allow independent verification. The original manuscript described the evaluation at a high level across math, science, tabular analysis, and web tasks. In the revised version, we have expanded the Experiments and Evaluation sections to explicitly name the agentic baselines (including ReAct-style agents, Reflexion, and comparisons against specific large models in the 685B-1T range), list the precise datasets and splits used, report results over five independent runs with standard error bars, and include paired statistical significance tests against baselines. These additions directly address potential confounds from task selection or prompting. revision: yes

  2. Referee: [Simulative Reasoning and World Model] The approach posits that the LLM can serve as a reliable world model for simulative future-state prediction across diverse tasks without additional training or per-domain engineering. However, there are no reported metrics on simulation fidelity, such as the accuracy of predicted states versus actual outcomes in math, science, or web tasks. If the simulations largely amount to rephrased reasoning steps rather than grounded predictions, the decomposition into Systems I/II/III may not provide benefits beyond standard multi-stage prompting, undermining the central efficiency and RL horizon-extension results.

    Authors: This is a fair and substantive concern regarding the grounding of the world-model assumption. We maintain that the explicit three-system decomposition, combined with the observed RL effects (22.8% longer planning horizons with only 2.0% increase in planning frequency), provides evidence that the simulative component contributes beyond standard prompting. Nevertheless, we acknowledge the absence of direct fidelity metrics. In the revision we have added an appendix with qualitative examples of state predictions on math and science tasks together with a discussion of how prediction accuracy can be assessed on verifiable subtasks; we also clarify the distinction from multi-stage prompting by emphasizing the learned self-regulation of when and how far to simulate. We note that full per-domain quantitative validation would require additional annotation effort beyond the scope of the current study. revision: partial
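
The reporting promised in the first response (five independent runs, standard errors, paired significance tests) can be sketched with the standard library alone. The rebuttal does not name a specific test, so the exact sign-flip permutation test used here is an assumption, as are the function names and the per-run scores, which are invented for illustration.

```python
import itertools
import math
import statistics


def mean_and_se(values):
    """Sample mean and standard error of the mean."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))


def paired_sign_flip_p(a, b):
    """Two-sided exact sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical Pass@1 scores over five paired runs (same seeds and tasks).
ours = [0.62, 0.64, 0.61, 0.63, 0.65]
baseline = [0.60, 0.61, 0.60, 0.60, 0.62]

ours_mean, ours_se = mean_and_se(ours)
p_value = paired_sign_flip_p(ours, baseline)
print(round(ours_mean, 3), round(ours_se, 4), p_value)
```

Under this particular test, five paired runs allow only 32 sign patterns, so the smallest attainable two-sided p-value is 2/32 = 0.0625. A revision that relies on five runs would therefore need a different test or more runs to claim significance at the 0.05 level.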

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained against external benchmarks

full rationale

The paper advances a conceptual decomposition of agentic reasoning into simulative (System II), self-regulatory (System III), and reactive (System I) components, then implements SR²AM via prompted CoT stages and RL on traces from existing pretrained models. All reported outcomes—Pass@1 parity with larger models, 25.8-95.3% token reduction, and RL-driven 22.8% horizon increase—are presented as measured empirical results from evaluations on math, science, tabular, and web tasks rather than quantities derived by construction from fitted parameters or self-referential definitions. No equations, uniqueness theorems, or load-bearing self-citations appear in the provided text that would collapse the architecture or performance claims back to the inputs. The LLM-as-world-model assumption is stated explicitly and tested through overall task performance, not smuggled in via circular fit.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on the domain assumption that LLMs can serve as world models and on the empirical results of RL training; no explicit free parameters are named beyond the reported horizon increase, and no new physical or mathematical entities are postulated.

free parameters (1)
  • average planning horizon
    RL training produces a 22.8% increase; this quantity is an outcome of optimization rather than an input constant.
axioms (1)
  • domain assumption: An LLM can act as a world model for future-state prediction without per-domain engineering
    Invoked when describing simulative reasoning (System II) as providing unified planning across tasks.
invented entities (1)
  • System III self-regulation configurator (no independent evidence)
    purpose: Learned module that decides when and how deeply to invoke simulative planning
    Conceptual component introduced to control planning presence and structure.

pith-pipeline@v0.9.0 · 5922 in / 1540 out tokens · 63295 ms · 2026-05-22T06:28:33.920969+00:00 · methodology


