arxiv: 2604.18131 · v1 · submitted 2026-04-20 · 💻 cs.AI

Recognition: unknown

Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

Qifan Zhang , Dongyang Ma , Tianqing Fang , Jia Li , Jing Tang , Nuo Chen , Haitao Mi , Yan Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:32 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agentsself-evolutionworld knowledge explorationoutcome-based rewardreward-free inferencespontaneous adaptationmeta-evolutionweb navigation agents

0 comments

The pith

LLM agents can be trained to spontaneously explore and summarize world knowledge in unseen environments without rewards or instructions at runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that agents can develop an internal capacity to explore unknown settings and generate useful world knowledge before tackling tasks. This capacity is instilled by applying an outcome-based reward only during training that scores how much the agent's self-created knowledge raises success rates on later tasks. Once trained, the agent operates without any external rewards, rules, or human input and still adapts on its own. A sympathetic reader cares because current agents stop improving the moment external supervision ends, which limits their usefulness in open or changing environments.

Core claim

By optimizing an outcome-based reward that directly measures the downstream task benefit produced by the agent's self-generated world knowledge, the model acquires a native meta-evolution ability to perform spontaneous exploration and summarization of completely unseen environments using only its internal parameters once training ends.

What carries the argument

The outcome-based reward mechanism used exclusively in training that quantifies the improvement in task success rates attributable to the agent's self-generated world knowledge.

If this is right

Qwen3-30B and Seed-OSS-36B models show roughly 20 percent higher success rates on WebVoyager and WebWalker after the training shift.
A 14B Qwen3 model equipped with the generated knowledge outperforms the unassisted Gemini-2.5-Flash on the same tasks.
The trained agents adapt to unknown environments using only their parameters and without external rewards or human guidance.
Evolution moves from reward-dependent processes that halt without supervision to intrinsic processes that continue spontaneously.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same training pattern could support agents that keep improving across long sequences of novel tasks without repeated reward engineering.
The method might extend to non-web domains where agents must discover structure in new physical or digital settings.
Repeated application could allow a single model to accumulate useful knowledge across many unrelated environments without task-specific fine-tuning.

Load-bearing premise

The training reward successfully installs a general skill for exploration and knowledge summarization that activates and remains useful in environments never seen during training and without any external signals.

What would settle it

Place the trained agent in an entirely new simulated environment never used in training or evaluation, provide no rewards or instructions, and measure whether its self-generated world knowledge still produces measurable gains in task success.

read the original abstract

Most agents today ``self-evolve'' by following rewards and rules defined by humans. However, this process remains fundamentally dependent on external supervision; without human guidance, the evolution stops. In this work, we train agents to possess an intrinsic meta-evolution capability to spontaneously learn about unseen environments prior to task execution. To instill this ability, we design an outcome-based reward mechanism that measures how much an agent's self-generated world knowledge improves its success rate on downstream tasks. This reward signal is used exclusively during the training phase to teach the model how to explore and summarize effectively. At inference time, the agent requires no external rewards or human instructions. It spontaneously performs native self-evolution to adapt to unknown environments using its internal parameters. When applied to Qwen3-30B and Seed-OSS-36B, this shift to native evolution yields a 20% performance increase on WebVoyager and WebWalker. Most strikingly, the generated world knowledge even enables a compact 14B Qwen3 model to outperform the unassisted Gemini-2.5-Flash, establishing a new paradigm for truly evolving agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper trains agents with outcome-based rewards on self-generated knowledge during training to produce spontaneous exploration at inference, claiming 20% gains and a 14B model beating Gemini, but the generalization claim needs verification.

read the letter

The main point is that they use downstream task success as the training signal for how well an agent's self-generated world knowledge helps, then drop all rewards and instructions at test time so the model explores and summarizes on its own. This produces reported 20% lifts on WebVoyager and WebWalker for the 30B and 36B models, plus the striking result that the 14B version with its generated knowledge beats unassisted Gemini-2.5-Flash. The framing of an intrinsic meta-evolution skill learned this way is the clearest new angle; it moves beyond standard reward modeling by making the exploration policy part of the model's parameters rather than an external loop. The results give a concrete data point that smaller models can close gaps on web-agent benchmarks when they first build their own knowledge. The soft spot is the untested leap from training signal to truly spontaneous behavior in disjoint environments. Because the reward is computed from task success after knowledge generation, the optimization necessarily sees task structures and success criteria, and nothing in the abstract shows environment splits, inference prompt wording, or ablations that would confirm the policy fires without any residual task heuristics. The performance claims also sit on thin ground without baselines, variance numbers, or controls for the Gemini comparison. This work is aimed at people building autonomous web agents, robotics planners, or personal assistants that need to adapt without constant human feedback. A reader focused on reward-free adaptation would get value from the training recipe and the scale of the reported gains, even while wanting the full methods section. It deserves a serious referee to check whether the intrinsic claim survives proper controls and whether the numbers hold up under scrutiny.

Referee Report

3 major / 1 minor

Summary. The manuscript presents a method for training LLM agents to achieve spontaneous self-evolution by using an outcome-based reward during training that is based on how self-generated world knowledge improves performance on downstream tasks. At inference, the agents are claimed to explore unknown environments and generate useful knowledge without any external rewards, instructions, or signals, leading to reported performance gains of 20% on WebVoyager and WebWalker for Qwen3-30B and Seed-OSS-36B models, and enabling a 14B Qwen3 model to outperform unassisted Gemini-2.5-Flash.

Significance. Should the results be reproducible and the method shown to produce genuinely intrinsic exploration capabilities that generalize to disjoint environments, this work would have high significance for the development of autonomous AI agents. It proposes a path toward agents that can self-improve in novel settings without ongoing human supervision, which is a key challenge in current agent systems. The cross-model performance claims, if validated, would also suggest practical benefits for deploying smaller models effectively.

major comments (3)

Abstract: The central claim that the agent performs 'spontaneous' self-evolution at inference without external signals is not accompanied by any description of the inference prompt or confirmation that no task-related cues are provided; given that the reward is outcome-based on task success during training, it is unclear if the behavior generalizes beyond learned patterns from the training distribution.
Results: The reported 20% performance increase and the outperformance of Gemini-2.5-Flash by the 14B model are presented without any experimental details, baselines, error bars, or statistical analysis, which is necessary to evaluate the significance of these gains.
Methods/Experiments: There is no information on how environments are partitioned between training and inference, or ablations demonstrating that the exploration behavior is not due to residual task-specific learning from the reward optimization.

minor comments (1)

Abstract: The term 'native self-evolution' is used without a precise definition or contrast to standard fine-tuning or in-context learning effects.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for clarification and strengthening. We address each major comment point by point below, providing additional context from the work and indicating revisions to the manuscript.

read point-by-point responses

Referee: Abstract: The central claim that the agent performs 'spontaneous' self-evolution at inference without external signals is not accompanied by any description of the inference prompt or confirmation that no task-related cues are provided; given that the reward is outcome-based on task success during training, it is unclear if the behavior generalizes beyond learned patterns from the training distribution.

Authors: We agree that the abstract and main text would benefit from explicit details on the inference setup to substantiate the spontaneous self-evolution claim. The revised manuscript adds a new subsection in Methods that quotes the exact inference prompt (a general directive to explore the current environment and generate structured world knowledge summaries, with no task descriptions, success criteria, or reward references). We also include an appendix with the full prompt template and confirmation that zero task-related cues are provided. To address generalization, we have added results on a held-out set of environments with no overlap in structure or content from training, showing the exploration behavior persists. revision: yes
Referee: Results: The reported 20% performance increase and the outperformance of Gemini-2.5-Flash by the 14B model are presented without any experimental details, baselines, error bars, or statistical analysis, which is necessary to evaluate the significance of these gains.

Authors: We acknowledge that the initial submission presented aggregate gains without sufficient supporting statistics. The revised manuscript expands Section 4 with full experimental details: per-task breakdowns for WebVoyager and WebWalker, comparisons against standard baselines (including ReAct, Reflexion, and other self-improvement agents), error bars from five independent runs with different seeds, and statistical significance via paired t-tests (p < 0.01 for the 20% average gain). The 14B Qwen3 vs. Gemini-2.5-Flash comparison is now reported with the exact evaluation protocol and variance measures. revision: yes
Referee: Methods/Experiments: There is no information on how environments are partitioned between training and inference, or ablations demonstrating that the exploration behavior is not due to residual task-specific learning from the reward optimization.

Authors: We have revised the Experiments and Methods sections to explicitly document the partitioning: training uses a collection of 50 web environments for reward computation on self-generated knowledge, while inference evaluation uses 20 completely disjoint environments (different domains, no shared pages or task templates). We also add ablation experiments that replace our outcome-based reward with direct task-success rewards during training; these show that residual task-specific patterns do not produce the same spontaneous exploration at inference, whereas our knowledge-improvement reward does. revision: yes

Circularity Check

0 steps flagged

No significant circularity; training-to-inference transfer follows standard RL structure without self-referential reduction

full rationale

The paper describes training an agent using an outcome-based reward that quantifies improvement in downstream task success from self-generated world knowledge, then deploys the resulting policy at inference with no rewards or instructions. This is a conventional RL setup in which the reward shapes the policy during training and is absent at test time. The abstract supplies no equations, uniqueness theorems, or derivations that reduce the claimed spontaneous inference behavior to the training reward by construction. Reported performance gains (20% on WebVoyager/WebWalker, 14B model outperforming Gemini-2.5-Flash) are presented as empirical outcomes rather than predictions forced by the input metric. No self-citations or ansatzes are invoked to justify core claims in the provided text. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so the ledger is minimal. The central claim rests on the unstated assumption that the training reward teaches transferable exploration and summarization skills. No explicit free parameters, axioms, or invented entities are detailed in the provided text.

pith-pipeline@v0.9.0 · 5515 in / 1360 out tokens · 39433 ms · 2026-05-10T04:32:24.487449+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

81 extracted references · 41 canonical work pages · 17 internal anchors

[1]

Webvoyager: Building an end-to-end web agent with large multimodal models,

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models.arXiv preprint arXiv:2401.13919, 2024

work page arXiv 2024
[2]

arXiv preprint arXiv:2501.07572 , year=

Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, et al. Webwalker: Benchmarking llms in web traversal.arXiv preprint arXiv:2501.07572, 2025

work page arXiv 2025
[3]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Seed-oss open-source models.https://github.com/ByteDance-Seed/seed-oss, 2025

ByteDance Seed Team. Seed-oss open-source models.https://github.com/ByteDance-Seed/seed-oss, 2025

2025
[5]

Qwen3 technical report, 2025

Qwen Team. Qwen3 technical report, 2025

2025
[6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: On path to artificial super intelligence.arXiv preprint arXiv:2507.21046, 2025

work page internal anchor Pith review arXiv 2025
[8]

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, et al. Agentic context engineering: Evolving contexts for self-improving language models.arXiv preprint arXiv:2510.04618, 2025

work page internal anchor Pith review arXiv 2025
[9]

Cogito, ergo ludo: An agent that learns to play by reasoning and planning.arXiv preprint arXiv:2509.25052, 2025b

Sai Wang, Yu Wu, and Zhongwen Xu. Cogito, ergo ludo: An agent that learns to play by reasoning and planning. arXiv preprint arXiv:2509.25052, 2025

work page arXiv 2025
[10]

Self-supervised prompt optimization.arXiv preprint arXiv:2502.06855, 2025

Jinyu Xiang, Jiayi Zhang, Zhaoyang Yu, Fengwei Teng, Jinhao Tu, Xinbing Liang, Sirui Hong, Chenglin Wu, and Yuyu Luo. Self-supervised prompt optimization.arXiv preprint arXiv:2502.06855, 2025

work page arXiv 2025
[11]

Agentsquare: Automatic llm agent search in modular design space, 2025

Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, and Yong Li. Agentsquare: Automatic llm agent search in modular design space.arXiv preprint arXiv:2410.06153, 2024

work page arXiv 2024
[12]

Yin and Z

Li Yin and Zhangyang Wang. Llm-autodiff: Auto-differentiate any llm workflow.arXiv preprint arXiv:2501.16673, 2025

work page arXiv 2025
[13]

ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

Siru Ouyang, Jun Yan, I Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T Le, Samira Daruki, Xiangru Tang, et al. Reasoningbank: Scaling agent self-evolving with reasoning memory.arXiv preprint arXiv:2509.25140, 2025

work page internal anchor Pith review arXiv 2025
[14]

URLhttps://arxiv.org/abs/2512.18746

Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. Memevolve: Meta-evolution of agent memory systems.arXiv preprint arXiv:2512.18746, 2025

work page arXiv 2025
[15]

Expel: Llm agents are experiential learners

Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel: Llm agents are experiential learners. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19632–19642, 2024

2024
[16]

Autoguide: Automated generation and selection of state-aware guidelines for large language model agents

Yao Fu, Dong-Ki Kim, Jaekyeom Kim, Sungryull Sohn, Lajanugen Logeswaran, Kyunghoon Bae, and Honglak Lee. Autoguide: Automated generation and selection of state-aware guidelines for large language model agents. CoRR, 2024

2024
[17]

A-MEM: Agentic Memory for LLM Agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents.arXiv preprint arXiv:2502.12110, 2025

work page internal anchor Pith review arXiv 2025
[18]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production- ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

work page internal anchor Pith review arXiv 2025
[19]

Darwin G

Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, and Jeff Clune. Darwin godel machine: Open-ended evolution of self-improving agents.arXiv preprint arXiv:2505.22954, 2025. 11

work page arXiv 2025
[20]

SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills

Boyuan Zheng, Michael Y Fatemi, Xiaolong Jin, Zora Zhiruo Wang, Apurva Gandhi, Yueqi Song, Yu Gu, Jayanth Srinivasa, Gaowen Liu, Graham Neubig, et al. Skillweaver: Web agents can self-improve by discovering and honing skills.arXiv preprint arXiv:2504.07079, 2025

work page internal anchor Pith review arXiv 2025
[21]

From exploration to mastery: Enabling llms to master tools via self-driven interactions.arXiv preprint arXiv:2410.08197, 2024

Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-Rong Wen. From exploration to mastery: Enabling llms to master tools via self-driven interactions.arXiv preprint arXiv:2410.08197, 2024

work page arXiv 2024
[22]

Tool- Gen: Unified tool retrieval and calling via generation,

Renxi Wang, Xudong Han, Lei Ji, Shu Wang, Timothy Baldwin, and Haonan Li. Toolgen: Unified tool retrieval and calling via generation.arXiv preprint arXiv:2410.03439, 2024

work page arXiv 2024
[23]

Agent learning via early experience.arXiv preprint arXiv:2510.08558, 2025

Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, et al. Agent learning via early experience.arXiv preprint arXiv:2510.08558, 2025

work page arXiv 2025
[24]

arXiv preprint arXiv:2504.21024 , year=

Tianqing Fang, Hongming Zhang, Zhisong Zhang, Kaixin Ma, Wenhao Yu, Haitao Mi, and Dong Yu. Webevolver: Enhancing web agent self-improvement with coevolving world model.arXiv preprint arXiv:2504.21024, 2025

work page arXiv 2025
[25]

arXiv preprint arXiv:2506.15651 , year=

Tevin Wang and Chenyan Xiong. Autorule: Reasoning chain-of-thought extracted rule-based rewards improve preference learning.arXiv preprint arXiv:2506.15651, 2025

work page arXiv 2025
[26]

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, et al. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning.arXiv preprint arXiv:2504.20073, 2025

work page internal anchor Pith review arXiv 2025
[27]

Learn-by-interact: A data-centric framework for self-adaptive agents in realistic environments,

Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, and Sercan Ö Arık. Learn-by-interact: A data-centric framework for self-adaptive agents in realistic environments.arXiv preprint arXiv:2501.10893, 2025

work page arXiv 2025
[28]

WebAggregator: Enhancing Compositional Reasoning Capabilities of Deep Research Agent Foundation Models

Rui Wang, Ce Zhang, Jun-Yu Ma, Jianshu Zhang, Hongru Wang, Yi Chen, Boyang Xue, Tianqing Fang, Zhisong Zhang, Hongming Zhang, et al. Explore to evolve: Scaling evolved aggregation logic via proactive online exploration for deep research agents.arXiv preprint arXiv:2510.14438, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification

Yuxuan Wan, Tianqing Fang, Zaitang Li, Yintong Huo, Wenxuan Wang, Haitao Mi, Dong Yu, and Michael R Lyu. Inference-time scaling of verification: Self-evolving deep research agents via test-time rubric-guided verification. arXiv preprint arXiv:2601.15808, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

Spice: Self-play in corpus environments improves reasoning.arXiv, 2025

Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, and Jason Weston. Spice: Self-play in corpus environments improves reasoning.arXiv preprint arXiv:2510.24684, 2025

work page arXiv 2025
[31]

R-Zero: Self-Evolving Reasoning LLM from Zero Data

Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, and Dong Yu. R-zero: Self-evolving reasoning llm from zero data.arXiv preprint arXiv:2508.05004, 2025

work page internal anchor Pith review arXiv 2025
[32]

instruction

Yifei Zhou, Sergey Levine, Jason Weston, Xian Li, and Sainbayar Sukhbaatar. Self-challenging language model agents.arXiv preprint arXiv:2506.01716, 2025

work page arXiv 2025
[33]

Rlsr: Reinforcement learning from self reward

Toby Simonds, Kevin Lopez, Akira Yoshiyama, and Dominique Garmier. Self rewarding self improving.arXiv preprint arXiv:2505.08827, 2025

work page arXiv 2025
[34]

Zhenrui Yue, Kartikeya Upasani, Xianjun Yang, Suyu Ge, Shaoliang Nie, Yuning Mao, Zhe Liu, and Dong Wang. Dr. zero: Self-evolving search agents without training data.arXiv preprint arXiv:2601.07055, 2026

work page arXiv 2026
[35]

Test-time training with self-supervision for generalization under distribution shifts

Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. InInternational conference on machine learning, pages 9229–9248. PMLR, 2020

2020
[36]

Atlas: Learning to optimally memorize the context at test time, 2025

Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, and Vahab Mirrokni. Atlas: Learning to optimally memorize the context at test time.arXiv preprint arXiv:2505.23735, 2025

work page arXiv 2025
[37]

Titans: Learning to Memorize at Test Time

Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time.arXiv preprint arXiv:2501.00663, 2024

work page internal anchor Pith review arXiv 2024
[38]

Nested learning: The illusion of deep learning architectures.arXiv preprint arXiv:2512.24695, 2025a

Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. Nested learning: The illusion of deep learning architectures.arXiv preprint arXiv:2512.24695, 2025. 12

work page arXiv 2025
[39]

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states.arXiv preprint arXiv:2407.04620, 2024

work page internal anchor Pith review arXiv 2024
[40]

With greater text comes greater necessity: Inference-time training helps long text generation.arXiv preprint arXiv:2401.11504, 2024

Yan Wang, Dongyang Ma, and Deng Cai. With greater text comes greater necessity: Inference-time training helps long text generation.arXiv preprint arXiv:2401.11504, 2024

work page arXiv 2024
[41]

Test-Time Training with KV Binding Is Secretly Linear Attention

Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, and Ruilong Li. Test-time training with kv binding is secretly linear attention.arXiv preprint arXiv:2602.21204, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[42]

Locas: Your models are principled initializers of locally-supported parametric memories.arXiv preprint arXiv:2602.05085, 2026

Sidi Lu, Zhenwen Liang, Dongyang Ma, Yan Wang, Haitao Mi, and Dong Yu. Locas: Your models are principled initializers of locally-supported parametric memories.arXiv preprint arXiv:2602.05085, 2026

work page arXiv 2026
[43]

Continuous self-improvement of large language models by test-time training with verifier-driven sample selection.arXiv preprint arXiv:2505.19475,

Mohammad Mahdi Moradi, Hossam Amer, Sudhir Mudur, Weiwei Zhang, Yang Liu, and Walid Ahmed. Continuous self-improvement of large language models by test-time training with verifier-driven sample selection.arXiv preprint arXiv:2505.19475, 2025

work page arXiv 2025
[44]

Test-time learning for large language models.arXiv preprint arXiv:2505.20633, 2025

Jinwu Hu, Zhitian Zhang, Guohao Chen, Xutao Wen, Chao Shuai, Wei Luo, Bin Xiao, Yuanqing Li, and Mingkui Tan. Test-time learning for large language models.arXiv preprint arXiv:2505.20633, 2025

work page arXiv 2025
[45]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

2023
[46]

Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale

Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. InSC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages...

2022
[47]

Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, et al. Cognitive kernel-pro: A framework for deep research agents and agent foundation models training.arXiv preprint arXiv:2508.00414, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Qwen2.5: A party of foundation models, September 2024

Qwen Team. Qwen2.5: A party of foundation models, September 2024

2024
[49]

gpt-oss-120b and gpt-oss-20b model card, 2025

OpenAI. gpt-oss-120b and gpt-oss-20b model card, 2025

2025
[50]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025. 13 A Details of Input Processing To reduce noise in large-scale web data and focus on domain-relevant information, we performimportance scoringandclus...

work page internal anchor Pith review arXiv 2025
[51]

Feel free to choose any category that seems underdeveloped or has interesting URLs you haven’t explored yet

Identify areas for expansion:Review your plan and the current guidebook. Feel free to choose any category that seems underdeveloped or has interesting URLs you haven’t explored yet
[52]

Review existing content:Use read_guidebook() to see what has already been written for your chosen category
[53]

Please rely on the actual webpage content to inspire your expansion and ensure accuracy

Explore and gather new data:Scrape additional URLs within that category to discover fresh details. Please rely on the actual webpage content to inspire your expansion and ensure accuracy
[54]

You can expand summaries, add new page entries, or provide deeper insights to make the section richer and more comprehensive

Integrate and enrich:Seamlessly weave your new discoveries into the existing text. You can expand summaries, add new page entries, or provide deeper insights to make the section richer and more comprehensive
[55]

Update the guidebook:Use the rewrite_category_section(category_name, new_section_- text)function to replace the old section with your newly expanded version
[56]

Continue this exploration process until your guidebook reaches at least{min_token}tokens

Check progress:Use count_guidebook_tokens() to see how close you are to your goal. Continue this exploration process until your guidebook reaches at least{min_token}tokens. Step 2: Add Overview Header & Save •Callread_guidebook()to get the full current content. •Prepend an Overview section at the top: # [Website Domain] Guidebook ## Overview - **Website:*...
[57]

Call parse_cluster_stats() to read the file header and get the total number of URLs and categories. Based on the site size, decide your processing mode: for small sites (≤ 250 URLs), useFULL mode where every URL is included; for larger sites, useSELECTIVE modewhere you pick the most important URLs per category (ranked byscore, up to 20 per category if≤8 c...
[58]

Create a token allocation plan — distribute the target Guidebook length ({min_token}–{token_- limit} tokens) across categories proportionally by each category’seffective URL count(i.e., the number of URLs you will actually scrape, after applying the per-category cap from step 1 — not the raw total), then save it withwrite_plan()
[59]

Repeat until all categories are done

Process categories one by one: callget_next_category() to load a category, scrape its selected URLs with web_agent(), write the category section withappend_to_guidebook(), then callmark_- category_done()to advance. Repeat until all categories are done
[60]

If it exceeds {token_limit}, compress verbose sections withrewrite_category_section()

After all categories are processed, check the total length withcount_guidebook_tokens(). If it exceeds {token_limit}, compress verbose sections withrewrite_category_section(). If it falls below {min_token}, expand by scraping additional URLs. Finally, prepend an Overview header and callsave_final_guidebook(). Output format per category: ## Category: [Name...
[61]

Output 1 if the answer correctly answers the question and has the same meaning as the ground truth
[62]

The answer does NOT need to exactly match the ground truth
[63]

Differences in wording, format, order, or level of detail are acceptable as long as the meaning is equivalent
[64]

Concise answers should NOT be judged as incorrect simply because they are shorter than the ground 22 truth
[65]

Different formats that express the same information (e.g., numbers only, different date formats, paraphrases) should be considered correct
[66]

Output 0 only if the answer is incorrect, contradicts the ground truth, or fails to answer the question. Examples: Example 1 Question:What are the 2024 suggested retail prices of the Yamaha PAC612 electric guitar and the Sonogenic SHS-300 shoulder keyboard? Ground truth:PAC612 electric guitar suggested retail price: 8,400 RMB. SHS-300 shoulder keyboard su...

2024
[67]

Web Task Instruction: A natural language instruction describing the task to be completed (e.g., search, verify, compare, summarize)
[68]

Result Response: The final textual response generated after performing the task
[69]

Evaluation rules:

Accessibility Trees: Structured representations of the webpages at each step, serving as evidence of the actions taken. Evaluation rules:
[70]

You do NOT need to interact with websites or perform any real actions
[71]

Do NOT assume missing information

You must base your judgment only on the provided instruction, response, and accessibility trees. Do NOT assume missing information
[72]

Your primary goal is to evaluate whether the actions reflected in the trees and the final response correctly follow the instruction
[73]

Missing any part leads to NOT SUCCESS

If the task contains multiple requirements (e.g., find information and summarize it), all must be 23 completed. Missing any part leads to NOT SUCCESS
[74]

The accessibility trees serve as ground truth evidence of what actually happened during execution
[75]

If the Result Response contradicts the trees, trust the trees
[76]

Instructions:You should briefly explain your reasoning before giving the final verdict

If the Result Response contains information not present in the trees, trust the response. Instructions:You should briefly explain your reasoning before giving the final verdict. Now evaluate the following case. TASK:{task} Result Response:{answer} Accessibility Trees:{trees} Output your final verdict as one of the following: SUCCESS NOT SUCCESS Do not out...
[77]

Target Question:{Question}
[78]

content summary

Sub-page Summaries:{World Knowledge} [Your Decision Logic] Please carefully read the "content summary" of each sub-page in the World Knowledge and analyze its relevance to the "Target Question":
[79]

content summary

Answer Directly (Rare): If the "content summary" in the World Knowledge already contains the specific factual data needed to fully answer the question, please provide the answer directly
[80]

About Us - Team Introduction

Explore Sub-pages (Most Common): If the topic of one or more sub-pages in the World Knowledge is highly relevant to the question (e.g., the question is about finding executives, and a sub-page summary is "About Us - Team Introduction"), please extract the URLs of these sub-pages and explore them to find the answer.If you find a potential answer on these s...

Showing first 80 references.