pith. machine review for the scientific record. sign in

arxiv: 2604.18131 · v1 · submitted 2026-04-20 · 💻 cs.AI

Recognition: unknown

Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:32 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agentsself-evolutionworld knowledge explorationoutcome-based rewardreward-free inferencespontaneous adaptationmeta-evolutionweb navigation agents
0
0 comments X

The pith

LLM agents can be trained to spontaneously explore and summarize world knowledge in unseen environments without rewards or instructions at runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that agents can develop an internal capacity to explore unknown settings and generate useful world knowledge before tackling tasks. This capacity is instilled by applying an outcome-based reward only during training that scores how much the agent's self-created knowledge raises success rates on later tasks. Once trained, the agent operates without any external rewards, rules, or human input and still adapts on its own. A sympathetic reader cares because current agents stop improving the moment external supervision ends, which limits their usefulness in open or changing environments.

Core claim

By optimizing an outcome-based reward that directly measures the downstream task benefit produced by the agent's self-generated world knowledge, the model acquires a native meta-evolution ability to perform spontaneous exploration and summarization of completely unseen environments using only its internal parameters once training ends.

What carries the argument

The outcome-based reward mechanism used exclusively in training that quantifies the improvement in task success rates attributable to the agent's self-generated world knowledge.

If this is right

  • Qwen3-30B and Seed-OSS-36B models show roughly 20 percent higher success rates on WebVoyager and WebWalker after the training shift.
  • A 14B Qwen3 model equipped with the generated knowledge outperforms the unassisted Gemini-2.5-Flash on the same tasks.
  • The trained agents adapt to unknown environments using only their parameters and without external rewards or human guidance.
  • Evolution moves from reward-dependent processes that halt without supervision to intrinsic processes that continue spontaneously.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same training pattern could support agents that keep improving across long sequences of novel tasks without repeated reward engineering.
  • The method might extend to non-web domains where agents must discover structure in new physical or digital settings.
  • Repeated application could allow a single model to accumulate useful knowledge across many unrelated environments without task-specific fine-tuning.

Load-bearing premise

The training reward successfully installs a general skill for exploration and knowledge summarization that activates and remains useful in environments never seen during training and without any external signals.

What would settle it

Place the trained agent in an entirely new simulated environment never used in training or evaluation, provide no rewards or instructions, and measure whether its self-generated world knowledge still produces measurable gains in task success.

read the original abstract

Most agents today ``self-evolve'' by following rewards and rules defined by humans. However, this process remains fundamentally dependent on external supervision; without human guidance, the evolution stops. In this work, we train agents to possess an intrinsic meta-evolution capability to spontaneously learn about unseen environments prior to task execution. To instill this ability, we design an outcome-based reward mechanism that measures how much an agent's self-generated world knowledge improves its success rate on downstream tasks. This reward signal is used exclusively during the training phase to teach the model how to explore and summarize effectively. At inference time, the agent requires no external rewards or human instructions. It spontaneously performs native self-evolution to adapt to unknown environments using its internal parameters. When applied to Qwen3-30B and Seed-OSS-36B, this shift to native evolution yields a 20% performance increase on WebVoyager and WebWalker. Most strikingly, the generated world knowledge even enables a compact 14B Qwen3 model to outperform the unassisted Gemini-2.5-Flash, establishing a new paradigm for truly evolving agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents a method for training LLM agents to achieve spontaneous self-evolution by using an outcome-based reward during training that is based on how self-generated world knowledge improves performance on downstream tasks. At inference, the agents are claimed to explore unknown environments and generate useful knowledge without any external rewards, instructions, or signals, leading to reported performance gains of 20% on WebVoyager and WebWalker for Qwen3-30B and Seed-OSS-36B models, and enabling a 14B Qwen3 model to outperform unassisted Gemini-2.5-Flash.

Significance. Should the results be reproducible and the method shown to produce genuinely intrinsic exploration capabilities that generalize to disjoint environments, this work would have high significance for the development of autonomous AI agents. It proposes a path toward agents that can self-improve in novel settings without ongoing human supervision, which is a key challenge in current agent systems. The cross-model performance claims, if validated, would also suggest practical benefits for deploying smaller models effectively.

major comments (3)
  1. Abstract: The central claim that the agent performs 'spontaneous' self-evolution at inference without external signals is not accompanied by any description of the inference prompt or confirmation that no task-related cues are provided; given that the reward is outcome-based on task success during training, it is unclear if the behavior generalizes beyond learned patterns from the training distribution.
  2. Results: The reported 20% performance increase and the outperformance of Gemini-2.5-Flash by the 14B model are presented without any experimental details, baselines, error bars, or statistical analysis, which is necessary to evaluate the significance of these gains.
  3. Methods/Experiments: There is no information on how environments are partitioned between training and inference, or ablations demonstrating that the exploration behavior is not due to residual task-specific learning from the reward optimization.
minor comments (1)
  1. Abstract: The term 'native self-evolution' is used without a precise definition or contrast to standard fine-tuning or in-context learning effects.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for clarification and strengthening. We address each major comment point by point below, providing additional context from the work and indicating revisions to the manuscript.

read point-by-point responses
  1. Referee: Abstract: The central claim that the agent performs 'spontaneous' self-evolution at inference without external signals is not accompanied by any description of the inference prompt or confirmation that no task-related cues are provided; given that the reward is outcome-based on task success during training, it is unclear if the behavior generalizes beyond learned patterns from the training distribution.

    Authors: We agree that the abstract and main text would benefit from explicit details on the inference setup to substantiate the spontaneous self-evolution claim. The revised manuscript adds a new subsection in Methods that quotes the exact inference prompt (a general directive to explore the current environment and generate structured world knowledge summaries, with no task descriptions, success criteria, or reward references). We also include an appendix with the full prompt template and confirmation that zero task-related cues are provided. To address generalization, we have added results on a held-out set of environments with no overlap in structure or content from training, showing the exploration behavior persists. revision: yes

  2. Referee: Results: The reported 20% performance increase and the outperformance of Gemini-2.5-Flash by the 14B model are presented without any experimental details, baselines, error bars, or statistical analysis, which is necessary to evaluate the significance of these gains.

    Authors: We acknowledge that the initial submission presented aggregate gains without sufficient supporting statistics. The revised manuscript expands Section 4 with full experimental details: per-task breakdowns for WebVoyager and WebWalker, comparisons against standard baselines (including ReAct, Reflexion, and other self-improvement agents), error bars from five independent runs with different seeds, and statistical significance via paired t-tests (p < 0.01 for the 20% average gain). The 14B Qwen3 vs. Gemini-2.5-Flash comparison is now reported with the exact evaluation protocol and variance measures. revision: yes

  3. Referee: Methods/Experiments: There is no information on how environments are partitioned between training and inference, or ablations demonstrating that the exploration behavior is not due to residual task-specific learning from the reward optimization.

    Authors: We have revised the Experiments and Methods sections to explicitly document the partitioning: training uses a collection of 50 web environments for reward computation on self-generated knowledge, while inference evaluation uses 20 completely disjoint environments (different domains, no shared pages or task templates). We also add ablation experiments that replace our outcome-based reward with direct task-success rewards during training; these show that residual task-specific patterns do not produce the same spontaneous exploration at inference, whereas our knowledge-improvement reward does. revision: yes

Circularity Check

0 steps flagged

No significant circularity; training-to-inference transfer follows standard RL structure without self-referential reduction

full rationale

The paper describes training an agent using an outcome-based reward that quantifies improvement in downstream task success from self-generated world knowledge, then deploys the resulting policy at inference with no rewards or instructions. This is a conventional RL setup in which the reward shapes the policy during training and is absent at test time. The abstract supplies no equations, uniqueness theorems, or derivations that reduce the claimed spontaneous inference behavior to the training reward by construction. Reported performance gains (20% on WebVoyager/WebWalker, 14B model outperforming Gemini-2.5-Flash) are presented as empirical outcomes rather than predictions forced by the input metric. No self-citations or ansatzes are invoked to justify core claims in the provided text. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is minimal. The central claim rests on the unstated assumption that the training reward teaches transferable exploration and summarization skills. No explicit free parameters, axioms, or invented entities are detailed in the provided text.

pith-pipeline@v0.9.0 · 5515 in / 1360 out tokens · 39433 ms · 2026-05-10T04:32:24.487449+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

81 extracted references · 41 canonical work pages · 17 internal anchors

  1. [1]

    Webvoyager: Building an end-to-end web agent with large multimodal models,

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models.arXiv preprint arXiv:2401.13919, 2024

  2. [2]

    arXiv preprint arXiv:2501.07572 , year=

    Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, et al. Webwalker: Benchmarking llms in web traversal.arXiv preprint arXiv:2501.07572, 2025

  3. [3]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  4. [4]

    Seed-oss open-source models.https://github.com/ByteDance-Seed/seed-oss, 2025

    ByteDance Seed Team. Seed-oss open-source models.https://github.com/ByteDance-Seed/seed-oss, 2025

  5. [5]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025

  6. [6]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  7. [7]

    A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: On path to artificial super intelligence.arXiv preprint arXiv:2507.21046, 2025

  8. [8]

    Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

    Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models.arXiv preprint arXiv:2510.04618, 2025

  9. [9]

    Cogito, ergo ludo: An agent that learns to play by reasoning and planning.arXiv preprint arXiv:2509.25052, 2025b

    Sai Wang, Yu Wu, and Zhongwen Xu. Cogito, ergo ludo: An agent that learns to play by reasoning and planning. arXiv preprint arXiv:2509.25052, 2025

  10. [10]

    Self-supervised prompt optimization.arXiv preprint arXiv:2502.06855, 2025

    Jinyu Xiang, Jiayi Zhang, Zhaoyang Yu, Fengwei Teng, Jinhao Tu, Xinbing Liang, Sirui Hong, Chenglin Wu, and Yuyu Luo. Self-supervised prompt optimization.arXiv preprint arXiv:2502.06855, 2025

  11. [11]

    Agentsquare: Automatic llm agent search in modular design space, 2025

    Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, and Yong Li. Agentsquare: Automatic llm agent search in modular design space.arXiv preprint arXiv:2410.06153, 2024

  12. [12]

    Yin and Z

    Li Yin and Zhangyang Wang. Llm-autodiff: Auto-differentiate any llm workflow.arXiv preprint arXiv:2501.16673, 2025

  13. [13]

    ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

    Siru Ouyang, Jun Yan, I Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T Le, Samira Daruki, Xiangru Tang, et al. Reasoningbank: Scaling agent self-evolving with reasoning memory.arXiv preprint arXiv:2509.25140, 2025

  14. [14]

    URLhttps://arxiv.org/abs/2512.18746

    Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. Memevolve: Meta-evolution of agent memory systems.arXiv preprint arXiv:2512.18746, 2025

  15. [15]

    Expel: Llm agents are experiential learners

    Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024

  16. [16]

    Autoguide: Automated generation and selection of state-aware guidelines for large language model agents

    Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, and Honglak Lee. Autoguide: Automated generation and selection of state-aware guidelines for large language model agents. CoRR, 2024

  17. [17]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.arXiv preprint arXiv:2502.12110, 2025

  18. [18]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production- ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

  19. [19]

    Darwin G

    Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. Darwin godel machine: Open-ended evolution of self-improving agents.arXiv preprint arXiv:2505.22954, 2025. 11

  20. [20]

    SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

    Boyuan Zheng, Michael Y Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, et al. Skillweaver: Web agents can self-improve by discovering and honing skills.arXiv preprint arXiv:2504.07079, 2025

  21. [21]

    From exploration to mastery: Enabling llms to master tools via self-driven interactions.arXiv preprint arXiv:2410.08197, 2024

    Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. From exploration to mastery: Enabling llms to master tools via self-driven interactions.arXiv preprint arXiv:2410.08197, 2024

  22. [22]

    Tool- Gen: Unified tool retrieval and calling via generation,

    Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, and Haonan Li. Toolgen: Unified tool retrieval and calling via generation.arXiv preprint arXiv:2410.03439, 2024

  23. [23]

    Agent learning via early experience.arXiv preprint arXiv:2510.08558, 2025

    Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, et al. Agent learning via early experience.arXiv preprint arXiv:2510.08558, 2025

  24. [24]

    arXiv preprint arXiv:2504.21024 , year=

    Tianqing Fang, Hongming Zhang, Zhisong Zhang, Kaixin Ma, Wenhao Yu, Haitao Mi, and Dong Yu. Webevolver: Enhancing web agent self-improvement with coevolving world model.arXiv preprint arXiv:2504.21024, 2025

  25. [25]

    arXiv preprint arXiv:2506.15651 , year=

    Tevin Wang and Chenyan Xiong. Autorule: Reasoning chain-of-thought extracted rule-based rewards improve preference learning.arXiv preprint arXiv:2506.15651, 2025

  26. [26]

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning.arXiv preprint arXiv:2504.20073, 2025

  27. [27]

    Learn-by-interact: A data-centric framework for self-adaptive agents in realistic environments,

    Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, and Sercan Ö Arık. Learn-by-interact: A data-centric framework for self-adaptive agents in realistic environments.arXiv preprint arXiv:2501.10893, 2025

  28. [28]

    WebAggregator: Enhancing Compositional Reasoning Capabilities of Deep Research Agent Foundation Models

    Rui Wang, Ce Zhang, Jun-Yu Ma, Jianshu Zhang, Hongru Wang, Yi Chen, Boyang Xue, Tianqing Fang, Zhisong Zhang, Hongming Zhang, et al. Explore to evolve: Scaling evolved aggregation logic via proactive online exploration for deep research agents.arXiv preprint arXiv:2510.14438, 2025

  29. [29]

    Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification

    Yuxuan Wan, Tianqing Fang, Zaitang Li, Yintong Huo, Wenxuan Wang, Haitao Mi, Dong Yu, and Michael R Lyu. Inference-time scaling of verification: Self-evolving deep research agents via test-time rubric-guided verification. arXiv preprint arXiv:2601.15808, 2026

  30. [30]

    Spice: Self-play in corpus environments improves reasoning.arXiv, 2025

    Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, and Jason Weston. Spice: Self-play in corpus environments improves reasoning.arXiv preprint arXiv:2510.24684, 2025

  31. [31]

    R-Zero: Self-Evolving Reasoning LLM from Zero Data

    Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data.arXiv preprint arXiv:2508.05004, 2025

  32. [32]

    instruction

    Yifei Zhou, Sergey Levine, Jason Weston, Xian Li, and Sainbayar Sukhbaatar. Self-challenging language model agents.arXiv preprint arXiv:2506.01716, 2025

  33. [33]

    Rlsr: Reinforcement learning from self reward

    Toby Simonds, Kevin Lopez, Akira Yoshiyama, and Dominique Garmier. Self rewarding self improving.arXiv preprint arXiv:2505.08827, 2025

  34. [34]

    Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, and Dong Wang. Dr. zero: Self-evolving search agents without training data.arXiv preprint arXiv:2601.07055, 2026

  35. [35]

    Test-time training with self-supervision for generalization under distribution shifts

    Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. InInternational conference on machine learning, pages 9229–9248. PMLR, 2020

  36. [36]

    Atlas: Learning to optimally memorize the context at test time, 2025

    Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni. Atlas: Learning to optimally memorize the context at test time.arXiv preprint arXiv:2505.23735, 2025

  37. [37]

    Titans: Learning to Memorize at Test Time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663, 2024

  38. [38]

    Nested learning: The illusion of deep learning architectures.arXiv preprint arXiv:2512.24695, 2025a

    Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. Nested learning: The illusion of deep learning architectures.arXiv preprint arXiv:2512.24695, 2025. 12

  39. [39]

    Learning to (Learn at Test Time): RNNs with Expressive Hidden States

    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states.arXiv preprint arXiv:2407.04620, 2024

  40. [40]

    With greater text comes greater necessity: Inference-time training helps long text generation.arXiv preprint arXiv:2401.11504, 2024

    Yan Wang, Dongyang Ma, and Deng Cai. With greater text comes greater necessity: Inference-time training helps long text generation.arXiv preprint arXiv:2401.11504, 2024

  41. [41]

    Test-Time Training with KV Binding Is Secretly Linear Attention

    Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, and Ruilong Li. Test-time training with kv binding is secretly linear attention.arXiv preprint arXiv:2602.21204, 2026

  42. [42]

    Locas: Your models are principled initializers of locally-supported parametric memories.arXiv preprint arXiv:2602.05085, 2026

    Sidi Lu, Zhenwen Liang, Dongyang Ma, Yan Wang, Haitao Mi, and Dong Yu. Locas: Your models are principled initializers of locally-supported parametric memories.arXiv preprint arXiv:2602.05085, 2026

  43. [43]

    Continuous self-improvement of large language models by test-time training with verifier-driven sample selection.arXiv preprint arXiv:2505.19475,

    Mohammad Mahdi Moradi, Hossam Amer, Sudhir Mudur, Weiwei Zhang, Yang Liu, and Walid Ahmed. Continuous self-improvement of large language models by test-time training with verifier-driven sample selection.arXiv preprint arXiv:2505.19475, 2025

  44. [44]

    Test-time learning for large language models.arXiv preprint arXiv:2505.20633, 2025

    Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. Test-time learning for large language models.arXiv preprint arXiv:2505.20633, 2025

  45. [45]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  46. [46]

    Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale

    Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. InSC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages...

  47. [47]

    Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

    Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, et al. Cognitive kernel-pro: A framework for deep research agents and agent foundation models training.arXiv preprint arXiv:2508.00414, 2025

  48. [48]

    Qwen2.5: A party of foundation models, September 2024

    Qwen Team. Qwen2.5: A party of foundation models, September 2024

  49. [49]

    gpt-oss-120b and gpt-oss-20b model card, 2025

    OpenAI. gpt-oss-120b and gpt-oss-20b model card, 2025

  50. [50]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025. 13 A Details of Input Processing To reduce noise in large-scale web data and focus on domain-relevant information, we performimportance scoringandclus...

  51. [51]

    Feel free to choose any category that seems underdeveloped or has interesting URLs you haven’t explored yet

    Identify areas for expansion:Review your plan and the current guidebook. Feel free to choose any category that seems underdeveloped or has interesting URLs you haven’t explored yet

  52. [52]

    Review existing content:Use read_guidebook() to see what has already been written for your chosen category

  53. [53]

    Please rely on the actual webpage content to inspire your expansion and ensure accuracy

    Explore and gather new data:Scrape additional URLs within that category to discover fresh details. Please rely on the actual webpage content to inspire your expansion and ensure accuracy

  54. [54]

    You can expand summaries, add new page entries, or provide deeper insights to make the section richer and more comprehensive

    Integrate and enrich:Seamlessly weave your new discoveries into the existing text. You can expand summaries, add new page entries, or provide deeper insights to make the section richer and more comprehensive

  55. [55]

    Update the guidebook:Use the rewrite_category_section(category_name, new_section_- text)function to replace the old section with your newly expanded version

  56. [56]

    Continue this exploration process until your guidebook reaches at least{min_token}tokens

    Check progress:Use count_guidebook_tokens() to see how close you are to your goal. Continue this exploration process until your guidebook reaches at least{min_token}tokens. Step 2: Add Overview Header & Save •Callread_guidebook()to get the full current content. •Prepend an Overview section at the top: # [Website Domain] Guidebook ## Overview - **Website:*...

  57. [57]

    Call parse_cluster_stats() to read the file header and get the total number of URLs and categories. Based on the site size, decide your processing mode: for small sites (≤ 250 URLs), useFULL mode where every URL is included; for larger sites, useSELECTIVE modewhere you pick the most important URLs per category (ranked byscore, up to 20 per category if≤8 c...

  58. [58]

    Create a token allocation plan — distribute the target Guidebook length ({min_token}–{token_- limit} tokens) across categories proportionally by each category’seffective URL count(i.e., the number of URLs you will actually scrape, after applying the per-category cap from step 1 — not the raw total), then save it withwrite_plan()

  59. [59]

    Repeat until all categories are done

    Process categories one by one: callget_next_category() to load a category, scrape its selected URLs with web_agent(), write the category section withappend_to_guidebook(), then callmark_- category_done()to advance. Repeat until all categories are done

  60. [60]

    If it exceeds {token_limit}, compress verbose sections withrewrite_category_section()

    After all categories are processed, check the total length withcount_guidebook_tokens(). If it exceeds {token_limit}, compress verbose sections withrewrite_category_section(). If it falls below {min_token}, expand by scraping additional URLs. Finally, prepend an Overview header and callsave_final_guidebook(). Output format per category: ## Category: [Name...

  61. [61]

    Output 1 if the answer correctly answers the question and has the same meaning as the ground truth

  62. [62]

    The answer does NOT need to exactly match the ground truth

  63. [63]

    Differences in wording, format, order, or level of detail are acceptable as long as the meaning is equivalent

  64. [64]

    Concise answers should NOT be judged as incorrect simply because they are shorter than the ground 22 truth

  65. [65]

    Different formats that express the same information (e.g., numbers only, different date formats, paraphrases) should be considered correct

  66. [66]

    Output 0 only if the answer is incorrect, contradicts the ground truth, or fails to answer the question. Examples: Example 1 Question:What are the 2024 suggested retail prices of the Yamaha PAC612 electric guitar and the Sonogenic SHS-300 shoulder keyboard? Ground truth:PAC612 electric guitar suggested retail price: 8,400 RMB. SHS-300 shoulder keyboard su...

  67. [67]

    Web Task Instruction: A natural language instruction describing the task to be completed (e.g., search, verify, compare, summarize)

  68. [68]

    Result Response: The final textual response generated after performing the task

  69. [69]

    Evaluation rules:

    Accessibility Trees: Structured representations of the webpages at each step, serving as evidence of the actions taken. Evaluation rules:

  70. [70]

    You do NOT need to interact with websites or perform any real actions

  71. [71]

    Do NOT assume missing information

    You must base your judgment only on the provided instruction, response, and accessibility trees. Do NOT assume missing information

  72. [72]

    Your primary goal is to evaluate whether the actions reflected in the trees and the final response correctly follow the instruction

  73. [73]

    Missing any part leads to NOT SUCCESS

    If the task contains multiple requirements (e.g., find information and summarize it), all must be 23 completed. Missing any part leads to NOT SUCCESS

  74. [74]

    The accessibility trees serve as ground truth evidence of what actually happened during execution

  75. [75]

    If the Result Response contradicts the trees, trust the trees

  76. [76]

    Instructions:You should briefly explain your reasoning before giving the final verdict

    If the Result Response contains information not present in the trees, trust the response. Instructions:You should briefly explain your reasoning before giving the final verdict. Now evaluate the following case. TASK:{task} Result Response:{answer} Accessibility Trees:{trees} Output your final verdict as one of the following: SUCCESS NOT SUCCESS Do not out...

  77. [77]

    Target Question:{Question}

  78. [78]

    content summary

    Sub-page Summaries:{World Knowledge} [Your Decision Logic] Please carefully read the "content summary" of each sub-page in the World Knowledge and analyze its relevance to the "Target Question":

  79. [79]

    content summary

    Answer Directly (Rare): If the "content summary" in the World Knowledge already contains the specific factual data needed to fully answer the question, please provide the answer directly

  80. [80]

    About Us - Team Introduction

    Explore Sub-pages (Most Common): If the topic of one or more sub-pages in the World Knowledge is highly relevant to the question (e.g., the question is about finding executives, and a sub-page summary is "About Us - Team Introduction"), please extract the URLs of these sub-pages and explore them to find the answer.If you find a potential answer on these s...

Showing first 80 references.