Argus: Evidence Assembly for Scalable Deep Research Agents

Bo An; Haotian Xu; Kaiyu Yang; Liangcai Su; Lidong Bing; Simon Shaolei Du; Xiang Lin; Xinyu Wang; Zhen Zhang; Zhuo Chen

arXiv: 2605.16217 · v3 · pith:34ZDSE7Tnew · submitted 2026-05-15 · 💻 cs.CL · cs.AI · cs.IR

Pith reviewed 2026-05-21 07:38 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR

keywords deep research agents · evidence graph · multi-agent coordination · parallel search · ReAct · reinforcement learning · information seeking · agentic systems

The pith

Argus uses a Navigator to maintain a shared evidence graph that dispatches Searchers for missing pieces instead of letting parallel rollouts duplicate work.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that deep research answers consist of complementary evidence pieces and that current parallel search methods waste effort on duplicates while inflating context sizes. Argus therefore splits the work: Searchers run standard ReAct rollouts to gather evidence for sub-queries, while a Navigator tracks an evidence graph, identifies gaps, sends new Searchers to fill them, and finally reasons over the completed graph to produce a sourced answer. Only the Navigator is trained with reinforcement learning; the Searcher stays unchanged, so the same system supports one Searcher or many without retraining. This matters because it targets the diminishing returns that appear when simply adding more parallel trajectories.

Core claim

Argus treats deep research as assembling a jigsaw from complementary evidence pieces. The Searcher collects traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifies which pieces are still missing, dispatches Searchers to gather them, and reasons over the completed graph to produce a source-traced final answer. The Navigator is trained with reinforcement learning to verify, dispatch, and synthesize, while the Searcher is trained independently as a standard ReAct agent. This design supports rollouts with a single Searcher or many in parallel without retraining.
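The verify, dispatch, and synthesize loop described above can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the authors' implementation: `EvidenceGraph`, `navigate`, and the `searcher` callback are hypothetical names, the set of required pieces is fixed up front (in Argus the Navigator is an RL-trained model that decides this itself), Searchers within a round run sequentially rather than in parallel, and a dictionary lookup stands in for a ReAct rollout.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class EvidenceGraph:
    """Hypothetical evidence board: one slot per piece, None while missing."""
    pieces: Dict[str, Optional[str]] = field(default_factory=dict)

    def gaps(self) -> List[str]:
        # Pieces that no Searcher has filled yet.
        return [name for name, finding in self.pieces.items() if finding is None]

    def add(self, name: str, finding: str) -> None:
        # Keep the first finding; later duplicates do not overwrite it.
        if self.pieces.get(name) is None:
            self.pieces[name] = finding


def navigate(
    required: List[str],
    searcher: Callable[[str], Optional[str]],
    parallelism: int = 2,
    max_rounds: int = 4,
) -> Tuple[EvidenceGraph, List[str]]:
    """Dispatch Searchers only at current gaps, at most `parallelism` per round."""
    graph = EvidenceGraph({name: None for name in required})
    dispatched: List[str] = []
    for _ in range(max_rounds):
        missing = graph.gaps()
        if not missing:
            break  # board complete; synthesis would happen here
        for name in missing[:parallelism]:  # distinct gaps within a round
            dispatched.append(name)
            finding = searcher(name)  # stand-in for one ReAct rollout
            if finding is not None:
                graph.add(name, finding)
    return graph, dispatched


# Toy "web": the Searcher is a dictionary lookup (assumption for illustration).
corpus = {"author": "source A", "year": "source B", "venue": "source C"}
graph, dispatched = navigate(list(corpus), corpus.get)
print(dispatched, graph.gaps())
```

The design point the sketch tries to capture is that the dispatch list is drawn from `gaps()` rather than from the original question, so a piece that has already been filled is never queried again.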

What carries the argument

The shared evidence graph maintained by the Navigator, which tracks collected pieces, identifies gaps, and dispatches Searchers to gather complementary evidence without duplication.

Load-bearing premise

Deep research answers are built from distinct complementary evidence pieces that parallel rollouts tend to duplicate rather than complete, and the Navigator can reliably detect gaps and dispatch new Searchers without creating fresh duplication or context bloat.

What would settle it

Measure duplication rate and performance as the number of parallel Searchers increases from 8 to 64; if duplication stays high or gains plateau while Navigator context grows beyond 21.5K tokens, the assembly mechanism would fail to deliver its claimed benefit.
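One way to make this test concrete is to score duplication directly. The sketch below assumes each Searcher's output has already been reduced to a set of evidence identifiers (for example, normalized source URLs). That reduction, the function name `duplication_rate`, and the definition itself are assumptions for illustration, since the page does not say how the paper would measure duplication.

```python
from typing import List, Set


def duplication_rate(evidence_per_searcher: List[Set[str]]) -> float:
    """Share of collected items that are redundant copies of another item.

    Assumed definition: 1 - (unique items / total items across Searchers).
    """
    total = sum(len(items) for items in evidence_per_searcher)
    if total == 0:
        return 0.0  # nothing collected, so nothing duplicated
    unique = len(set().union(*evidence_per_searcher))
    return 1.0 - unique / total


# Hypothetical runs: identifiers are made up, not data from the paper.
overlapping = [{"u1", "u2"}, {"u2", "u3"}]  # "u2" found twice
disjoint = [{"u1"}, {"u2"}, {"u3"}]         # every Searcher fills a new piece
print(duplication_rate(overlapping), duplication_rate(disjoint))
```

Tracking this value as the Searcher count grows from 8 to 64, alongside accuracy and Navigator context length, would show whether dispatch stays complementary as parallelism increases.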

Figures

Figures reproduced from arXiv: 2605.16217 by Bo An, Haotian Xu, Kaiyu Yang, Liangcai Su, Lidong Bing, Simon Shaolei Du, Xiang Lin, Xinyu Wang, Zhen Zhang, Zhuo Chen.

**Figure 1.** Argus operating modes. (a) Standalone Searcher, single path. (b) Navigator identifies unfilled pieces and dispatches targeted queries. (c) Parallel Searchers each target a distinct piece. … notion of which pieces of evidence have been gathered, which support or contradict one another, and which are still missing. Existing parallel-agent methods inherit this flatness: self-consistency [5], best-of-N [8, 7], l…

**Figure 2.** Argus assembles answers like a jigsaw on a BrowseComp-style question. (I) Parallel exploration: Searchers execute ReAct rollouts. (II) Navigator-guided verification: the Navigator consolidates findings onto a shared evidence board (green: corroborated pieces; red: discarded probes) and dispatches Searchers at distinct gaps. (III) Synthesis: the Navigator traces each claim to its evidence E_i and outputs the…

**Figure 3.** Argus GRPO training pipeline. Given a question $q$ and a pre-collected Searcher trajectory $T$, $\pi_\theta$ samples $N$ rollouts, each producing a full synthesis $y^{\star}_{\text{w/ v}}$ over the post-verification graph and a shadow synthesis $y^{\star}_{\text{w/o v}}$ over the pre-verification graph. Their contrast yields the trajectory reward, from which GRPO computes group-relative advantages regularized by KL to a fixed reference.

**Figure 4.** Accuracy on BrowseComp scales log-linearly with aggregation context budget, surpassing Gemini-3.1-Pro at 64× base compute. This decoupling is crucial. Most agentic systems hit a context wall long before exhausting compute limits. Argus instead restricts the bottleneck to the Searcher. The 21.5K-token graph view at the largest budget compresses accumulated Searcher output by roughly 1200 to 1. This comf…

**Figure 5.** Synthesis and verification improve jointly during GRPO training. (a) Argus-Solo accuracy …
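The training signal in the Figure 3 caption can be illustrated with a small sketch. The group-relative normalization below is the usual GRPO step (reward minus group mean, divided by group standard deviation). The reward function is only a guess at what "contrast" might mean, because the caption does not give its form: `contrast_reward`, its `bonus` weight, and the use of population standard deviation are assumptions, and the KL regularizer is omitted.

```python
from statistics import mean, pstdev
from typing import List


def contrast_reward(full_correct: bool, shadow_correct: bool, bonus: float = 0.5) -> float:
    """Assumed reward: outcome credit plus a signed term for what verification changed."""
    full, shadow = float(full_correct), float(shadow_correct)
    return full + bonus * (full - shadow)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one question: (full synthesis correct, shadow correct).
outcomes = [(True, False), (True, True), (False, True), (False, False)]
rewards = [contrast_reward(f, s) for f, s in outcomes]
advantages = group_relative_advantages(rewards)
print(rewards)
```

Under this assumed reward, a rollout whose verification turns a wrong shadow answer into a correct one ranks highest in its group, and one whose verification breaks a correct shadow answer ranks lowest.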

Original abstract

Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.
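Taking the Figure 4 caption at face value, the context figures imply a rough size for the raw Searcher output at the largest budget. This is back-of-the-envelope arithmetic from two numbers quoted on this page (a 21.5K-token graph view and a compression ratio of roughly 1200 to 1), not a figure reported by the authors:

```latex
\[
  \underbrace{21.5 \times 10^{3}}_{\text{graph view (tokens)}}
  \times
  \underbrace{1200}_{\text{compression ratio}}
  \approx 2.6 \times 10^{7}\ \text{tokens of accumulated Searcher output}
\]
```

An estimate in the tens of millions of tokens is well beyond typical model context windows, which is consistent with the paper's emphasis on having the Navigator reason over the graph rather than over raw trajectories.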

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Argus splits the agent into an RL-trained Navigator, which maintains an evidence graph for gap verification and dispatch, and independently trained ReAct Searchers. It reports clear benchmark gains while keeping the Navigator's context small.

The main point is that Argus treats deep research as assembling complementary evidence pieces via a shared graph rather than running parallel rollouts that duplicate work. A Navigator verifies which pieces are missing and dispatches Searchers, while the Searcher stays a plain ReAct agent trained separately, so the number of Searchers can be scaled without retraining anything. Both use the same 35B-A3B MoE backbone.

The abstract reports average gains of 5.5 points with one Searcher and 12.7 points with eight across eight benchmarks, plus 86.2 on BrowseComp with 64 Searchers, all while keeping Navigator context under 21.5K tokens. The BrowseComp result beats every proprietary agent the authors benchmarked, and the design directly addresses the diminishing returns from duplication in standard parallel setups. The jigsaw framing and independent training are the clearest new elements compared with plain ReAct or simple aggregation, and the results are stated in straightforward numbers that make the scaling claim easy to grasp.

The experimental evidence is the weak spot. The abstract gives headline scores but no baselines, dataset details, error bars, or ablations, so it is hard to tell how much the evidence graph and gap verification actually drive the gains versus just adding more parallel agents. There are also no reported metrics on gap-detection precision or duplication rates at high parallelism, which leaves the central mechanism untested exactly where it is under the most stress. If verification is loose, context could still grow or overlaps could persist.

This paper is for people building multi-agent systems for information-seeking and retrieval tasks. Readers who want concrete coordination patterns for scaling inference compute will find the architecture worth looking at even if the current evidence is preliminary. I would send it for peer review so the methods and controls can be checked properly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Argus, a cooperative agentic system in which a Navigator maintains a shared evidence graph, verifies missing complementary pieces, and dispatches one or more Searchers (each performing ReAct-style rollouts on sub-queries) to assemble evidence for deep research tasks. The Navigator is trained via reinforcement learning on verification, dispatching, and synthesis while the Searcher is trained independently; the architecture is claimed to support scaling from 1 to 64 Searchers without retraining. Reported results on a 35B-A3B MoE backbone include average gains of 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers while keeping Navigator reasoning context under 21.5K tokens.

Significance. If the empirical results and scaling behavior are substantiated, the work would be significant for inference-time scaling of research agents. Framing deep research as jigsaw-style assembly of complementary evidence rather than duplicated parallel trajectories directly targets diminishing returns and context limits in current systems. The separation of Navigator and Searcher training, allowing flexible parallelism without retraining, is a practical strength that could influence future multi-agent designs.

major comments (2)

Abstract: the headline performance numbers (5.5-point gain with one Searcher, 12.7-point gain with eight parallel Searchers averaged over eight benchmarks, and 86.2 on BrowseComp with 64 Searchers) are presented without any description of baselines, error bars, run counts, dataset splits, or ablation controls. Because these numbers are the primary support for the claim that evidence-graph assembly outperforms standard parallel rollouts, the absence of experimental details is load-bearing for the central empirical contribution.
Abstract: the scalability argument rests on the Navigator reliably identifying missing complementary evidence in the shared graph and dispatching Searchers without duplication or context growth beyond 21.5K tokens. No quantitative metrics on gap-detection precision, duplication rates, or evidence-graph size as a function of Searcher count are supplied, leaving the core 'jigsaw assembly' mechanism untested at the 64-Searcher scale where the strongest result is reported.

minor comments (2)

Abstract: 'state of the art systems' should be hyphenated as 'state-of-the-art systems'.
Consider adding a diagram of the evidence graph and the Navigator-Searcher interaction loop; the textual description alone makes it difficult to visualize how verification and dispatch avoid overlap.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of framing deep research as complementary evidence assembly. We address each major comment point by point below and are prepared to revise the manuscript accordingly.

Referee: Abstract: the headline performance numbers (5.5-point gain with one Searcher, 12.7-point gain with eight parallel Searchers averaged over eight benchmarks, and 86.2 on BrowseComp with 64 Searchers) are presented without any description of baselines, error bars, run counts, dataset splits, or ablation controls. Because these numbers are the primary support for the claim that evidence-graph assembly outperforms standard parallel rollouts, the absence of experimental details is load-bearing for the central empirical contribution.

Authors: We agree that the abstract would be strengthened by additional context on the experimental setup. In the revised version we will expand the abstract to briefly identify the primary baselines (standard ReAct rollouts and parallel aggregation methods), note that results are averaged across multiple runs with error bars reported in the main text, and direct readers to the Experiments section for full details on dataset splits, run counts, and ablations. This keeps the abstract concise while making the headline numbers more interpretable.
Revision: yes
Referee: Abstract: the scalability argument rests on the Navigator reliably identifying missing complementary evidence in the shared graph and dispatching Searchers without duplication or context growth beyond 21.5K tokens. No quantitative metrics on gap-detection precision, duplication rates, or evidence-graph size as a function of Searcher count are supplied, leaving the core 'jigsaw assembly' mechanism untested at the 64-Searcher scale where the strongest result is reported.

Authors: The reported scaling results up to 64 Searchers, together with the bounded Navigator context length, provide empirical support that the Navigator successfully identifies complementary gaps and avoids excessive duplication. We nevertheless recognize the value of direct internal metrics. In the revision we will add a dedicated analysis (new figure or subsection) that reports gap-detection precision (via held-out evidence checks), average duplication rates, and evidence-graph size as functions of Searcher count, computed from the existing experimental runs.
Revision: yes
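The gap-detection metric promised in this response could be computed along the following lines. This is a sketch under assumptions: it supposes each question comes with a held-out set of evidence pieces that are genuinely missing from the graph at verification time, and the function name is hypothetical rather than taken from the paper.

```python
from typing import Optional, Set, Tuple


def gap_detection_scores(
    flagged: Set[str], truly_missing: Set[str]
) -> Tuple[Optional[float], Optional[float]]:
    """Precision and recall of the gaps the Navigator flags against held-out truth."""
    hits = len(flagged & truly_missing)
    precision = hits / len(flagged) if flagged else None  # undefined with no flags
    recall = hits / len(truly_missing) if truly_missing else None
    return precision, recall


# Hypothetical example: the Navigator flags two gaps, one of which is real.
precision, recall = gap_detection_scores({"venue", "year"}, {"venue", "author"})
print(precision, recall)
```

Read this way, low precision means Searchers are dispatched at pieces that were already filled, which shows up as duplication, while low recall means real gaps go unqueried.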

Circularity Check

0 steps flagged

No circularity: empirical architecture with benchmark results only

The paper proposes an agentic system (Searcher + Navigator with shared evidence graph) and reports empirical benchmark gains from an RL-trained Navigator paired with an independently trained ReAct Searcher. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described claims. The performance numbers are direct experimental outcomes rather than reductions to inputs by construction. The central mechanism is presented as a design choice whose value is measured externally on benchmarks, with no load-bearing step that collapses to a self-definition or prior self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review based on abstract only; full implementation details unavailable. The approach implicitly assumes evidence can be cleanly decomposed and reassembled.

axioms (1)

Domain assumption: Deep research answers are composed of complementary pieces of evidence that can be identified and completed without duplication or information loss.
Stated directly in the abstract as the motivation for moving beyond parallel rollouts.

invented entities (1)

Evidence graph (no independent evidence)
purpose: Shared structure maintained by Navigator to track collected evidence, identify gaps, and coordinate dispatch of Searchers.
Core new component introduced to enable the assembly process.

pith-pipeline@v0.9.0 · 5841 in / 1500 out tokens · 57291 ms · 2026-05-21T07:38:33.141425+00:00 · methodology

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 15 internal anchors

[1] OpenAI. Deep research system card, 2025. URL https://openai.com/index/deep-research-system-card
[2] Google. Gemini deep research overview, 2025. URL https://gemini.google/overview/deep-research/
[3] xAI. Grok 3 beta — the age of reasoning agents, February 2025. URL https://x.ai/news/grok-3
[4] Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi DeepResearch technical report. arXiv preprint arXiv:2510.24701, 2025
[5] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw
[6] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback, 2022. URL https://arxiv.org/abs/2211.14275
[7] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168
[8] Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=4FWAwZtd2n
[9] Baixuan Li, Dingchu Zhang, Jialong Wu, Wenbiao Yin, Zhengwei Tao, Yida Zhao, Liwen Zhang, Haiyang Shen, Runnan Fang, Pengjun Xie, Jingren Zhou, and Yong Jiang. Parallelmuse: Agentic parallel thinking for deep information seeking, 2025. URL https://arxiv.org/abs/2510.24698
[10] Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, Xiangwen Kong, Chengyuan Yao, Kaiwen Yan, Ailin Huang, Hongyu Zhou, Qi Han, Zheng Ge, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Pacore: Learning to scale test-time compute with parallel coordinated reasoning, 2026. URL https://ar...
[11] Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2024. URL https://arxiv.org/abs/2407.21787
[12] Yoonsang Lee, Howard Yen, Xi Ye, and Danqi Chen. Agentic aggregation for parallel scaling of long-horizon agentic tasks. arXiv preprint arXiv:2604.11753, 2026
[13] Weihao Zeng, Keqing He, Chuqiao Kuang, Xiaoguang Li, and Junxian He. Pushing test-time scaling limits of deep search with asymmetric verification. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=hxL4Uf9tR3
[14] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026
[15] Shunyu Yao, Jeffrey Zhao, Dian Yu, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In NeurIPS 2022 Foundation Models for Decision Making Workshop, 2022. URL https://openreview.net/forum?id=tvI4u1ylcqs
[16] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. Webgpt: Browser-assisted question-answering with human feedback, 2022. URL https:/...
[17] Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. Webarena: A realistic web environment for building autonomous agents, 2024. URL https://arxiv.org/abs/2307.13854
[18] Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5420– ... doi:10.18653/v1/2025.emnlp-main.276
[19] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023
[20] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization, 2025. URL https://arxiv.org/abs/2404.16130
[21] Bernal Jimenez Gutierrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=hkujvAPVsg
[22] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Proc... doi:10.18653/v1/2023.emnlp-main.741
[23] Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Zixia Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, and Quoc V Le. Long-form factuality in large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=4M9f8VMt2C
[24] I-Chun Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, and Pengfei Liu. Factool: Factuality detection in generative AI - a tool augmented framework for multi-task and multi-domain scenarios, 2024. URL https://openreview.net/forum?id=jolYuxpVn1
[25] Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. Chain-of-verification reduces hallucination in large language models. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 3563–3578, Bangkok, Thailand, August 2024. A... doi:10.18653/v1/2024.findings-acl.212
[26] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Thirty-seventh Conference on Neural Informati...
[27] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=feiAVaSXdb
[28] Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February
[29] URL https://qwen.ai/blog?id=qwen3.5
[30] Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, et al. Websailor: Navigating super-human reasoning for web agent. arXiv preprint arXiv:2507.02592, 2025
[31] Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Ding-Chu Zhang, Xixi Wu, Xinmiao Yu, Jialong Wu, Xinyu Wang, Zile Qiao, Zhen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Zhi-Qin John Xu, Shuai Wang, Minhao Cheng, and Jingren Zhou. Websailor-v2: Bridging the chasm to proprietary agents via synthetic data and scalable reinfor...
[32] URL https://openreview.net/forum?id=HuP16O5SJf
[33] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
[34] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi
[35] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025. URL https://arxiv.org/abs/2504.12516
[36] Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, et al. Browsecomp-zh: Benchmarking web browsing ability of large language models in chinese. arXiv preprint arXiv:2504.19314, 2025
[37] Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, Chen Sun, Han Hou, Hui Yang, James Pan, Jianan Lou, Jiayi Mao, Jizheng Liu, Jinpeng Li, Kangyi Liu, Kenkun Liu, Rui Wang, Run Li, Tong Niu, Wenlong Zhang, Wenqi Yan, Xuanzheng Wang, Yuchen Zhang, Yi-Hsin Hung, Yuan Jiang, Zexuan Liu,... xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations, 2025
[38] Grégoire Mialon, Clémentine Fourrier, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. In International Conference on Learning Representations, volume 2024, pages 9025–9049, 2024
[39] Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, and Tu Vu. Sealqa: Raising the bar for reasoning in search-augmented language models. arXiv preprint arXiv:2506.01062, 2025
[40] Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025. Also listed as: A benchmark of expert-level academic questions to assess AI capabilities. Nature, 649(8099):1139–1146, 2026
[41] Miles Wang, Robi Lin, Kat Hu, Joy Jiao, Neil Chowdhury, Ethan Chang, and Tejal Patwardhan. Frontierscience: Evaluating ai’s ability to perform expert-level scientific tasks. arXiv preprint arXiv:2601.21165, 2026
[42] OpenAI. Openai gpt-5.2 system card, 2026. URL https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf
[43] Anthropic. System card claude sonnet 4.6, 2026. URL https://www-cdn.anthropic.com/bbd8ef16d70b7a1665f14f306ee88b53f686aa75.pdf
[44]

Seed 2.0 model card: Towards intelligence frontier for real-world complex- ity, 2026

ByteDance. Seed 2.0 model card: Towards intelligence frontier for real-world complex- ity, 2026. URL https://lf3-static.bytednsdoc.com/obj/eden-cn/lapzild-tss/ ljhwZthlaukjlkulzlp/seed2/0214/Seed2.0%20Model%20Card.pdf

work page 2026
[45]

GLM-5: from Vibe Coding to Agentic Engineering

Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[46]

Moonshot AI. Kimi k2.6 technical blog, 2026. URL https://www.kimi.com/blog/kimi-k2-6
[47]

DeepSeek AI. Deepseek v4 technical report, 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
[48]

MiroMind Team, S Bai, L Bing, L Lei, R Li, X Li, X Lin, E Min, L Su, B Wang, et al. Mirothinker-1.7 & h1: Towards heavy-duty research agents via verification. arXiv preprint arXiv:2603.15726, 2026
[49]

Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Linhai Zhang, Yulan He, Deyu Zhou, Pengjun Xie, et al. Webwalker: Benchmarking llms in web traversal. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10290–10305, 2025
[50]

Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yongkang Wu, Ji-Rong Wen, Yutao Zhu, and Zhicheng Dou. Webthinker: Empowering large reasoning models with deep research capability. Advances in Neural Information Processing Systems, 38:120091–120131, 2026
[51]

Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhenglin Wang, Zhengwei Tao, Ding-Chu Zhang, Zekun Xi, Xiangru Tang, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webdancer: Towards autonomous information seeking agency. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net...
[52]

Zile Qiao, Guoxin Chen, Xuanzhong Chen, Donglei Yu, Wenbiao Yin, Xinyu Wang, Zhen Zhang, Baixuan Li, Huifeng Yin, Kuan Li, Rui Min, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webresearcher: Unleashing unbounded reasoning capability in long-horizon agents, 2025. URL https://arxiv.org/abs/2509.13309
[53]

Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, Jiayuan Song, Zhengmao Zhu, Wenhu Chen, Pengyu Zhao, and Junxian He. Webexplorer: Explore and evolve for training long-horizon web agents, 2025. URL https://arxiv.org/abs/2509.06501
[54]

Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, et al. Agentfold: Long-horizon web agents with proactive context management. arXiv preprint arXiv:2510.24699, 2025
[55]

Xixi Wu, Kuan Li, Yida Zhao, Liwen Zhang, Litu Ou, Huifeng Yin, Zhongwang Zhang, Xinmiao Yu, Dingchu Zhang, Yong Jiang, et al. Resum: Unlocking long-horizon search intelligence via context summarization. arXiv preprint arXiv:2509.13313, 2025
[56]

Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=VTWWvYtF1R
[57]

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First conference on language modeling, 2024
[58]

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. Advances in neural information processing systems, 36:51991–52008, 2023
[59]

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Steven Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, volume 2024, pages 23247–23275, 2024
[60]

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Proceedings of the 41st International Conference on Machine Learning, pages 11733–11763, 2024
[61]

Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, et al. Mindstorms in natural language-based societies of mind. arXiv preprint arXiv:2305.17066, 2023

A Training Details

Searcher. The Searcher shares the Navigator Qwen3.5-35B-A3B ...
