
arxiv: 2605.16217 · v3 · pith:34ZDSE7Tnew · submitted 2026-05-15 · 💻 cs.CL · cs.AI · cs.IR

Argus: Evidence Assembly for Scalable Deep Research Agents

Pith reviewed 2026-05-21 07:38 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR
keywords deep research agents · evidence graph · multi-agent coordination · parallel search · ReAct · reinforcement learning · information seeking · agentic systems

The pith

Argus uses a Navigator to maintain a shared evidence graph that dispatches Searchers for missing pieces instead of letting parallel rollouts duplicate work.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that deep research answers consist of complementary evidence pieces and that current parallel search methods waste effort on duplicates while inflating context sizes. Argus therefore splits the work: Searchers run standard ReAct rollouts to gather evidence for sub-queries, while a Navigator tracks an evidence graph, identifies gaps, sends new Searchers to fill them, and finally reasons over the completed graph to produce a sourced answer. Only the Navigator is trained with reinforcement learning; the Searcher is trained independently as a standard ReAct agent, so the same Navigator supports one Searcher or many without retraining. This matters because it targets the diminishing returns that appear when simply adding more parallel trajectories.

Core claim

Argus treats deep research as assembling a jigsaw from complementary evidence pieces. The Searcher collects traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifies which pieces are still missing, dispatches Searchers to gather them, and reasons over the completed graph to produce a source-traced final answer. The Navigator is trained with reinforcement learning to verify, dispatch, and synthesize, while the Searcher is trained independently as a standard ReAct agent. This design supports rollouts with a single Searcher or many in parallel without retraining.
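
The verify, dispatch, and synthesize loop described above can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the paper's implementation: the names `EvidencePiece`, `EvidenceGraph`, and `navigate` are hypothetical, the "graph" is reduced to a dictionary keyed by sub-query, and the Searcher is any callable standing in for a ReAct rollout.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class EvidencePiece:
    """One slot in the evidence graph (hypothetical field names)."""
    sub_query: str
    finding: Optional[str] = None  # None means the piece is still missing
    source: Optional[str] = None   # where the finding came from


@dataclass
class EvidenceGraph:
    pieces: Dict[str, EvidencePiece] = field(default_factory=dict)

    def add_gap(self, sub_query: str) -> None:
        # Keyed by sub-query, so the same gap is never registered twice.
        self.pieces.setdefault(sub_query, EvidencePiece(sub_query))

    def missing(self) -> List[str]:
        return [q for q, p in self.pieces.items() if p.finding is None]

    def fill(self, sub_query: str, finding: str, source: str) -> None:
        piece = self.pieces[sub_query]
        piece.finding, piece.source = finding, source


# A Searcher maps a sub-query to (finding, source), or None if it found nothing.
Searcher = Callable[[str], Optional[Tuple[str, str]]]


def navigate(graph: EvidenceGraph, searcher: Searcher,
             max_rounds: int = 3) -> List[Tuple[str, str, str]]:
    """Dispatch one Searcher call per distinct gap, then list traced findings."""
    for _ in range(max_rounds):
        gaps = graph.missing()
        if not gaps:
            break
        for sub_query in gaps:
            result = searcher(sub_query)
            if result is not None:
                graph.fill(sub_query, *result)
    # Stand-in for synthesis: every returned claim carries its source.
    return [(p.sub_query, p.finding, p.source)
            for p in graph.pieces.values() if p.finding is not None]
```

In this toy version, calling `searcher` once per gap corresponds to a single Searcher; running the inner loop concurrently would correspond to the parallel mode, with the graph unchanged. The trained Navigator's verification and reasoning steps are not modelled here.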

What carries the argument

The shared evidence graph maintained by the Navigator, which tracks collected pieces, identifies gaps, and dispatches Searchers to gather complementary evidence without duplication.

Load-bearing premise

Deep research answers are built from distinct complementary evidence pieces that parallel rollouts tend to duplicate rather than complete, and the Navigator can reliably detect gaps and dispatch new Searchers without creating fresh duplication or context bloat.

What would settle it

Measure duplication rate and performance as the number of parallel Searchers increases from 8 to 64; if duplication stays high or gains plateau while Navigator context grows beyond 21.5K tokens, the assembly mechanism would fail to deliver its claimed benefit.
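
One simple way to operationalize the duplication rate in this test is sketched below. It assumes each Searcher's output has already been normalized to a list of evidence identifiers (for example canonical URLs or claim IDs); that normalization step, and the function name `duplication_rate`, are assumptions for illustration and are not taken from the paper.

```python
from typing import Iterable, Set


def duplication_rate(evidence_per_searcher: Iterable[Iterable[str]]) -> float:
    """Fraction of collected evidence items that repeat an earlier item."""
    seen: Set[str] = set()
    total = 0
    duplicates = 0
    for items in evidence_per_searcher:
        for item in items:
            total += 1
            if item in seen:
                duplicates += 1  # already collected by this or another Searcher
            else:
                seen.add(item)
    return duplicates / total if total else 0.0


# Three Searchers collect five items, two of which repeat earlier ones.
rate = duplication_rate([["e1", "e2"], ["e2", "e3"], ["e1"]])
print(rate)  # 0.4
```

Computing this value at each Searcher count (8, 16, 32, 64) alongside accuracy would show whether added Searchers contribute new pieces or mostly repeats.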

Figures

Figures reproduced from arXiv: 2605.16217 by Bo An, Haotian Xu, Kaiyu Yang, Liangcai Su, Lidong Bing, Simon Shaolei Du, Xiang Lin, Xinyu Wang, Zhen Zhang, Zhuo Chen.

Figure 1: Argus operating modes. (a) Standalone Searcher, single path. (b) Navigator identifies unfilled pieces and dispatches targeted queries. (c) Parallel Searchers each target a distinct piece. … notion of which pieces of evidence have been gathered, which support or contradict one another, and which are still missing. Existing parallel-agent methods inherit this flatness: self-consistency [5], best-of-N [8, 7], l…
Figure 2: Argus assembles answers like a jigsaw on a BrowseComp-style question. (I) Parallel exploration: Searchers execute ReAct rollouts. (II) Navigator-guided verification: the Navigator consolidates findings onto a shared evidence board (green: corroborated pieces; red: discarded probes) and dispatches Searchers at distinct gaps. (III) Synthesis: the Navigator traces each claim to its evidence $E_i$ and outputs the…
Figure 3: Argus GRPO training pipeline. Given a question $q$ and a pre-collected Searcher trajectory $T$, $\pi_\theta$ samples $N$ rollouts, each producing a full synthesis $y^{\star}_{\text{w/ v}}$ over the post-verification graph and a shadow synthesis $y^{\star}_{\text{w/o v}}$ over the pre-verification graph. Their contrast yields the trajectory reward, from which GRPO computes group-relative advantages regularized by KL to a fixed reference. … synthesis twice ov…
Figure 4: Accuracy on BrowseComp scales log-linearly with aggregation context budget, surpassing Gemini-3.1-Pro at 64× base compute. This decoupling is crucial. Most agentic systems hit a context wall long before exhausting compute limits. Argus instead restricts the bottleneck to the Searcher. The 21.5k token graph view at the largest budget compresses accumulated Searcher output by roughly 1200 to 1. This comf…
Figure 5: Synthesis and verification improve jointly during GRPO training. (a) Argus-Solo accuracy
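
The Figure 3 caption mentions group-relative advantages. As a hedged illustration, the sketch below shows the standard GRPO normalization (each rollout's reward minus the group mean, divided by the group standard deviation) on a list of already-computed trajectory rewards. How Argus derives those rewards from the contrast between full and shadow synthesis is not specified in the material above, so the rewards are simply inputs here; the function name and the `eps` value are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the N rollouts of one question
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one question: two rewarded, two not.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Rollouts that beat their group average receive positive advantages and the rest negative ones, so no separate value model is required. The KL regularization to a fixed reference mentioned in the caption is part of the training objective and is not shown.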
Original abstract

Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Argus, a cooperative agentic system in which a Navigator maintains a shared evidence graph, verifies missing complementary pieces, and dispatches one or more Searchers (each performing ReAct-style rollouts on sub-queries) to assemble evidence for deep research tasks. The Navigator is trained via reinforcement learning on verification, dispatching, and synthesis while the Searcher is trained independently; the architecture is claimed to support scaling from 1 to 64 Searchers without retraining. Reported results on a 35B-A3B MoE backbone include average gains of 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers while keeping Navigator reasoning context under 21.5K tokens.

Significance. If the empirical results and scaling behavior are substantiated, the work would be significant for inference-time scaling of research agents. Framing deep research as jigsaw-style assembly of complementary evidence rather than duplicated parallel trajectories directly targets diminishing returns and context limits in current systems. The separation of Navigator and Searcher training, allowing flexible parallelism without retraining, is a practical strength that could influence future multi-agent designs.

major comments (2)
  1. Abstract: the headline performance numbers (5.5-point gain with one Searcher, 12.7-point gain with eight parallel Searchers averaged over eight benchmarks, and 86.2 on BrowseComp with 64 Searchers) are presented without any description of baselines, error bars, run counts, dataset splits, or ablation controls. Because these numbers are the primary support for the claim that evidence-graph assembly outperforms standard parallel rollouts, the absence of experimental details is load-bearing for the central empirical contribution.
  2. Abstract: the scalability argument rests on the Navigator reliably identifying missing complementary evidence in the shared graph and dispatching Searchers without duplication or context growth beyond 21.5K tokens. No quantitative metrics on gap-detection precision, duplication rates, or evidence-graph size as a function of Searcher count are supplied, leaving the core 'jigsaw assembly' mechanism untested at the 64-Searcher scale where the strongest result is reported.
minor comments (2)
  1. Abstract: 'state of the art systems' should be hyphenated as 'state-of-the-art systems'.
  2. Consider adding a diagram of the evidence graph and the Navigator-Searcher interaction loop; the textual description alone makes it difficult to visualize how verification and dispatch avoid overlap.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of framing deep research as complementary evidence assembly. We address each major comment point by point below and are prepared to revise the manuscript accordingly.

Point-by-point responses
  1. Referee: Abstract: the headline performance numbers (5.5-point gain with one Searcher, 12.7-point gain with eight parallel Searchers averaged over eight benchmarks, and 86.2 on BrowseComp with 64 Searchers) are presented without any description of baselines, error bars, run counts, dataset splits, or ablation controls. Because these numbers are the primary support for the claim that evidence-graph assembly outperforms standard parallel rollouts, the absence of experimental details is load-bearing for the central empirical contribution.

    Authors: We agree that the abstract would be strengthened by additional context on the experimental setup. In the revised version we will expand the abstract to briefly identify the primary baselines (standard ReAct rollouts and parallel aggregation methods), note that results are averaged across multiple runs with error bars reported in the main text, and direct readers to the Experiments section for full details on dataset splits, run counts, and ablations. This keeps the abstract concise while making the headline numbers more interpretable. revision: yes

  2. Referee: Abstract: the scalability argument rests on the Navigator reliably identifying missing complementary evidence in the shared graph and dispatching Searchers without duplication or context growth beyond 21.5K tokens. No quantitative metrics on gap-detection precision, duplication rates, or evidence-graph size as a function of Searcher count are supplied, leaving the core 'jigsaw assembly' mechanism untested at the 64-Searcher scale where the strongest result is reported.

    Authors: The reported scaling results up to 64 Searchers, together with the bounded Navigator context length, provide empirical support that the Navigator successfully identifies complementary gaps and avoids excessive duplication. We nevertheless recognize the value of direct internal metrics. In the revision we will add a dedicated analysis (new figure or subsection) that reports gap-detection precision (via held-out evidence checks), average duplication rates, and evidence-graph size as functions of Searcher count, computed from the existing experimental runs. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture with benchmark results only

full rationale

The paper proposes an agentic system (Searcher + Navigator with shared evidence graph) and reports empirical benchmark gains from training the Navigator with RL while the Searcher is trained independently as a standard ReAct agent. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described claims. The performance numbers are direct experimental outcomes rather than reductions to inputs by construction. The central mechanism is presented as a design choice whose value is measured externally on benchmarks, with no load-bearing step that collapses to a self-definition or prior self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review based on abstract only; full implementation details unavailable. The approach implicitly assumes evidence can be cleanly decomposed and reassembled.

axioms (1)
  • domain assumption: Deep research answers are composed of complementary pieces of evidence that can be identified and completed without duplication or information loss.
    Stated directly in the abstract as the motivation for moving beyond parallel rollouts.
invented entities (1)
  • Evidence graph (no independent evidence)
    purpose: Shared structure maintained by Navigator to track collected evidence, identify gaps, and coordinate dispatch of Searchers.
    Core new component introduced to enable the assembly process.

pith-pipeline@v0.9.0 · 5841 in / 1500 out tokens · 57291 ms · 2026-05-21T07:38:33.141425+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 15 internal anchors

  1. [1]

    Deep research system card, 2025

    OpenAI. Deep research system card, 2025. URL https://openai.com/index/deep-research-system-card

  2. [2]

    Gemini deep research overview, 2025

    Google. Gemini deep research overview, 2025. URL https://gemini.google/overview/deep-research/

  3. [3]

    Grok 3 beta — the age of reasoning agents, February 2025

    xAI. Grok 3 beta — the age of reasoning agents, February 2025. URL https://x.ai/news/grok-3

  4. [4]

    Tongyi DeepResearch Technical Report

    Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi deepresearch technical report. arXiv preprint arXiv:2510.24701, 2025

  5. [5]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw

  6. [6]

    Solving math word problems with process- and outcome-based feedback

    Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback, 2022. URL https://arxiv.org/abs/2211.14275

  7. [7]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168

  8. [8]

    Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=4FWAwZtd2n

  9. [9]

    Parallelmuse: Agentic parallel thinking for deep information seeking, 2025

    Baixuan Li, Dingchu Zhang, Jialong Wu, Wenbiao Yin, Zhengwei Tao, Yida Zhao, Liwen Zhang, Haiyang Shen, Runnan Fang, Pengjun Xie, Jingren Zhou, and Yong Jiang. Parallelmuse: Agentic parallel thinking for deep information seeking, 2025. URL https://arxiv.org/abs/2510.24698

  10. [10]

    Pacore: Learning to scale test-time compute with parallel coordinated reasoning, 2026

    Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, Xiangwen Kong, Chengyuan Yao, Kaiwen Yan, Ailin Huang, Hongyu Zhou, Qi Han, Zheng Ge, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Pacore: Learning to scale test-time compute with parallel coordinated reasoning, 2026. URL https://ar...

  11. [11]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024. URL https://arxiv.org/abs/2407.21787

  12. [12]

    Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

    Yoonsang Lee, Howard Yen, Xi Ye, and Danqi Chen. Agentic aggregation for parallel scaling of long-horizon agentic tasks. arXiv preprint arXiv:2604.11753, 2026

  13. [13]

    Pushing test-time scaling limits of deep search with asymmetric verification

    Weihao Zeng, Keqing He, Chuqiao Kuang, Xiaoguang Li, and Junxian He. Pushing test-time scaling limits of deep search with asymmetric verification. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=hxL4Uf9tR3

  14. [14]

    Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026

  15. [15]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022. URL https://openreview.net/forum?id=tvI4u1ylcqs

  16. [16]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2022. URL https:/...

  17. [17]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URL https://arxiv.org/abs/2307.13854

  18. [18]

    Search-o1: Agentic search-enhanced large reasoning models

    Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420– ...

  19. [19]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023

  20. [20]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization, 2025. URL https://arxiv.org/abs/2404.16130

  21. [21]

    HippoRAG: Neurobiologically inspired long-term memory for large language models

    Bernal Jimenez Gutierrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=hkujvAPVsg

  22. [22]

    FActScore: Fine-grained atomic evaluation of factual precision in long form text generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Proc...

  23. [23]

    Long-form factuality in large language models

    Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Zixia Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, and Quoc V Le. Long-form factuality in large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=4M9f8VMt2C

  24. [24]

    Factool: Factuality detection in generative AI - a tool augmented framework for multi-task and multi-domain scenarios, 2024

    I-Chun Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, and Pengfei Liu. Factool: Factuality detection in generative AI - a tool augmented framework for multi-task and multi-domain scenarios, 2024. URL https://openreview.net/forum?id=jolYuxpVn1

  25. [25]

    Chain-of-verification reduces hallucination in large language models

    Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. Chain-of-verification reduces hallucination in large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 3563–3578, Bangkok, Thailand, August 2024. A...

  26. [26]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Thirty-seventh Conference on Neural Informati...

  27. [27]

    Measuring and narrowing the compositionality gap in language models

    Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=feiAVaSXdb

  28. [28]

    Qwen3.5: Accelerating productivity with native multimodal agents, February

    Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February

  29. [29]

    URL https://qwen.ai/blog?id=qwen3.5

  30. [30]

    WebSailor: Navigating Super-human Reasoning for Web Agent

    Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, et al. Websailor: Navigating super-human reasoning for web agent. arXiv preprint arXiv:2507.02592, 2025

  31. [31]

    Websailor-v2: Bridging the chasm to proprietary agents via synthetic data and scalable reinforcement learning

    Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Ding-Chu Zhang, Xixi Wu, Xinmiao Yu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Zhi-Qin John Xu, Shuai Wang, Minhao Cheng, and Jingren Zhou. Websailor-v2: Bridging the chasm to proprietary agents via synthetic data and scalable rein- for...

  32. [32]

    URL https://openreview.net/forum?id=HuP16O5SJf

  33. [33]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  34. [34]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi

  35. [35]

    Browsecomp: A simple yet challenging benchmark for browsing agents, 2025

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025. URL https://arxiv.org/abs/2504.12516

  36. [36]

    BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

    Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, et al. Browsecomp-zh: Benchmarking web browsing ability of large language models in chinese. arXiv preprint arXiv:2504.19314, 2025

  37. [37]

    xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations, 2025

    Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, Chen Sun, Han Hou, Hui Yang, James Pan, Jianan Lou, Jiayi Mao, Jizheng Liu, Jinpeng Li, Kangyi Liu, Kenkun Liu, Rui Wang, Run Li, Tong Niu, Wenlong Zhang, Wenqi Yan, Xuanzheng Wang, Yuchen Zhang, Yi-Hsin Hung, Yuan Jiang, Zexuan Liu,...

  38. [38]

    Gaia: a benchmark for general ai assistants

    Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In International Conference on Learning Representations, volume 2024, pages 9025–9049, 2024

  39. [39]

    SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

    Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, and Tu Vu. Sealqa: Raising the bar for reasoning in search-augmented language models. arXiv preprint arXiv:2506.01062, 2025

  40. [40]

    A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649(8099):1139–1146, 2026

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025

  41. [41]

    Frontierscience: Evaluating ai’s ability to perform expert-level scientific tasks

    Miles Wang, Robi Lin, Kat Hu, Joy Jiao, Neil Chowdhury, Ethan Chang, and Tejal Patwardhan. Frontierscience: Evaluating ai’s ability to perform expert-level scientific tasks. arXiv preprint arXiv:2601.21165, 2026

  42. [42]

    Openai gpt-5.2 system card, 2026

    OpenAI. Openai gpt-5.2 system card, 2026. URL https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf

  43. [43]

    System card claude sonnet 4.6, 2026

    Anthropic. System card claude sonnet 4.6, 2026. URL https://www-cdn.anthropic.com/bbd8ef16d70b7a1665f14f306ee88b53f686aa75.pdf

  44. [44]

    Seed 2.0 model card: Towards intelligence frontier for real-world complexity, 2026

    ByteDance. Seed 2.0 model card: Towards intelligence frontier for real-world complexity, 2026. URL https://lf3-static.bytednsdoc.com/obj/eden-cn/lapzild-tss/ljhwZthlaukjlkulzlp/seed2/0214/Seed2.0%20Model%20Card.pdf

  45. [45]

    GLM-5: from Vibe Coding to Agentic Engineering

    Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026

  46. [46]

    Kimi k2.6 technical blog, 2026

    Moonshot AI. Kimi k2.6 technical blog, 2026. URL https://www.kimi.com/blog/kimi-k2-6

  47. [47]

    Deepseek v4 technical report, 2026

    DeepSeek AI. Deepseek v4 technical report, 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf

  48. [48]

    Mirothinker-1.7 & h1: Towards heavy-duty research agents via verification

    MiroMind Team, S Bai, L Bing, L Lei, R Li, X Li, X Lin, E Min, L Su, B Wang, et al. Mirothinker-1.7 & h1: Towards heavy-duty research agents via verification. arXiv preprint arXiv:2603.15726, 2026

  49. [49]

    Webwalker: Benchmarking llms in web traversal

    Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, et al. Webwalker: Benchmarking llms in web traversal. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10290–10305, 2025

  50. [50]

    Webthinker: Empowering large reasoning models with deep research capability

    Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability. Advances in Neural Information Processing Systems, 38:120091–120131, 2026

  51. [51]

    Webdancer: Towards autonomous information seeking agency

    Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhenglin Wang, Zhengwei Tao, Ding-Chu Zhang, Zekun Xi, Xiangru Tang, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webdancer: Towards autonomous information seeking agency. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net...

  52. [52]

    Webresearcher: Unleashing unbounded reasoning capability in long-horizon agents, 2025

    Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webresearcher: Unleashing unbounded reasoning capability in long-horizon agents, 2025. URL https://arxiv.org/abs/2509.13309

  53. [53]

    Webexplorer: Explore and evolve for training long-horizon web agents

    Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, Jiayuan Song, Zhengmao Zhu, Wenhu Chen, Pengyu Zhao, and Junxian He. Webexplorer: Explore and evolve for training long-horizon web agents, 2025. URL https://arxiv.org/abs/2509.06501

  54. [54]

    Agentfold: Long-horizon web agents with proactive context management

    Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, et al. Agentfold: Long-horizon web agents with proactive context management. arXiv preprint arXiv:2510.24699, 2025

  55. [55]

    Resum: Unlocking long-horizon search intelligence via context summarization

    Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Xinmiao Yu, Dingchu Zhang, Yong Jiang, et al. Resum: Unlocking long-horizon search intelligence via context summarization. arXiv preprint arXiv:2509.13313, 2025

  56. [56]

    Reasoning with language model is planning with world model

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=VTWWvYtF1R

  57. [57]

    Autogen: Enabling next-gen llm applications via multi-agent conversations

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First conference on language modeling, 2024.

  58. [58]

    Camel: Communicative agents for "mind" exploration of large language model society

    Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. Advances in neural information processing systems, 36:51991–52008, 2023

  59. [59]

    Metagpt: Meta programming for a multi-agent collaborative framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Steven Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, volume 2024, pages 23247–23275, 2024

  60. [60]

    Improving factuality and reasoning in language models through multiagent debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Proceedings of the 41st International Conference on Machine Learning, pages 11733–11763, 2024

  61. [61]

    Mindstorms in natural language-based societies of mind

    Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, et al. Mindstorms in natural language-based societies of mind. arXiv preprint arXiv:2305.17066, 2023.

    A Training Details. Searcher. The Searcher shares the Navigator Qwen3.5-35B-A3B ...