Argus: Evidence Assembly for Scalable Deep Research Agents
Pith reviewed 2026-05-21 07:38 UTC · model grok-4.3
The pith
Argus uses a Navigator to maintain a shared evidence graph that dispatches Searchers for missing pieces instead of letting parallel rollouts duplicate work.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Argus treats deep research as assembling a jigsaw from complementary evidence pieces. The Searcher collects traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifies which pieces are still missing, dispatches Searchers to gather them, and reasons over the completed graph to produce a source-traced final answer. The Navigator is trained with reinforcement learning to verify, dispatch, and synthesize, while the Searcher is trained independently as a standard ReAct agent. This design supports rollouts with a single Searcher or many in parallel without retraining.
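The verify, dispatch, and synthesize loop described above can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: `EvidenceGraph`, `navigate`, and `stub_searcher` are hypothetical names, the required pieces are listed up front rather than inferred by a trained Navigator, and no reinforcement learning is involved.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Optional

Evidence = tuple[str, str]  # (claim, source identifier); hypothetical shape


@dataclass
class EvidenceGraph:
    """Toy stand-in for the shared evidence graph (assumed structure)."""
    required: list[str]                                   # sub-queries the answer needs
    found: dict[str, Evidence] = field(default_factory=dict)

    def missing(self) -> list[str]:
        """Verify step: list the pieces that have not been collected yet."""
        return [q for q in self.required if q not in self.found]


def navigate(
    graph: EvidenceGraph,
    searcher: Callable[[str], Optional[Evidence]],
    max_parallel: int = 8,
    max_rounds: int = 4,
) -> str:
    """Dispatch Searchers only for missing pieces, then synthesize."""
    for _ in range(max_rounds):
        gaps = graph.missing()
        if not gaps:
            break
        batch = gaps[:max_parallel]          # one distinct sub-query per Searcher
        with ThreadPoolExecutor(max_workers=max_parallel) as pool:
            results = list(pool.map(searcher, batch))
        for sub_query, result in zip(batch, results):
            if result is not None:           # keep only pieces that were found
                graph.found[sub_query] = result
    # Synthesize step: a source-traced answer built from the collected pieces.
    return "\n".join(
        f"{q}: {graph.found[q][0]} [{graph.found[q][1]}]"
        for q in graph.required
        if q in graph.found
    )


def stub_searcher(sub_query: str) -> Optional[Evidence]:
    """Hypothetical Searcher: a lookup table in place of a ReAct rollout."""
    corpus = {
        "birthplace": ("Born in Ulm", "doc-1"),
        "award year": ("Nobel Prize in 1921", "doc-2"),
    }
    return corpus.get(sub_query)
```

Because each round dispatches only the sub-queries still missing from the graph, the same loop runs unchanged with `max_parallel=1` or a larger value, which mirrors the claim that Searcher count can vary without retraining.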
What carries the argument
The shared evidence graph maintained by the Navigator, which tracks collected pieces, identifies gaps, and dispatches Searchers to gather complementary evidence without duplication.
Load-bearing premise
Deep research answers are built from distinct complementary evidence pieces that parallel rollouts tend to duplicate rather than complete, and the Navigator can reliably detect gaps and dispatch new Searchers without creating fresh duplication or context bloat.
What would settle it
Measure duplication rate and performance as the number of parallel Searchers increases from 8 to 64; if duplication stays high or gains plateau while Navigator context grows beyond 21.5K tokens, the assembly mechanism would fail to deliver its claimed benefit.
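One way to run this measurement is sketched below, under stated assumptions: each evidence piece is identified by a normalized key (for example, its source URL), the duplication rate is the share of collected pieces that repeat an earlier one, and the plateau check uses a hypothetical threshold `min_gain`. None of these definitions come from the paper.

```python
def duplication_rate(pieces_per_searcher):
    """Fraction of collected pieces that repeat a piece already collected.

    pieces_per_searcher: one iterable of evidence keys per Searcher.
    """
    seen = set()
    total = 0
    duplicates = 0
    for pieces in pieces_per_searcher:
        for key in pieces:
            total += 1
            if key in seen:
                duplicates += 1
            else:
                seen.add(key)
    return duplicates / total if total else 0.0


def gains_plateau(scores_by_count, min_gain=1.0):
    """True if the largest Searcher count adds fewer than min_gain points
    over the next-largest count (min_gain is an assumed threshold)."""
    counts = sorted(scores_by_count)
    if len(counts) < 2:
        return False
    return scores_by_count[counts[-1]] - scores_by_count[counts[-2]] < min_gain
```

Running both functions at each Searcher count between 8 and 64, alongside the Navigator's context length, would give the three curves the test calls for.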
Original abstract
Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Argus, a cooperative agentic system in which a Navigator maintains a shared evidence graph, verifies missing complementary pieces, and dispatches one or more Searchers (each performing ReAct-style rollouts on sub-queries) to assemble evidence for deep research tasks. The Navigator is trained via reinforcement learning on verification, dispatching, and synthesis while the Searcher is trained independently; the architecture is claimed to support scaling from 1 to 64 Searchers without retraining. Reported results on a 35B-A3B MoE backbone include average gains of 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers while keeping Navigator reasoning context under 21.5K tokens.
Significance. If the empirical results and scaling behavior are substantiated, the work would be significant for inference-time scaling of research agents. Framing deep research as jigsaw-style assembly of complementary evidence rather than duplicated parallel trajectories directly targets diminishing returns and context limits in current systems. The separation of Navigator and Searcher training, allowing flexible parallelism without retraining, is a practical strength that could influence future multi-agent designs.
major comments (2)
- Abstract: the headline performance numbers (5.5-point gain with one Searcher, 12.7-point gain with eight parallel Searchers averaged over eight benchmarks, and 86.2 on BrowseComp with 64 Searchers) are presented without any description of baselines, error bars, run counts, dataset splits, or ablation controls. Because these numbers are the primary support for the claim that evidence-graph assembly outperforms standard parallel rollouts, the absence of experimental details is load-bearing for the central empirical contribution.
- Abstract: the scalability argument rests on the Navigator reliably identifying missing complementary evidence in the shared graph and dispatching Searchers without duplication or context growth beyond 21.5K tokens. No quantitative metrics on gap-detection precision, duplication rates, or evidence-graph size as a function of Searcher count are supplied, leaving the core 'jigsaw assembly' mechanism untested at the 64-Searcher scale where the strongest result is reported.
minor comments (2)
- Abstract: 'state of the art systems' should be hyphenated as 'state-of-the-art systems'.
- Consider adding a diagram of the evidence graph and the Navigator-Searcher interaction loop; the textual description alone makes it difficult to visualize how verification and dispatch avoid overlap.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential significance of framing deep research as complementary evidence assembly. We address each major comment point by point below and are prepared to revise the manuscript accordingly.
Point-by-point responses
- Referee: Abstract: the headline performance numbers (5.5-point gain with one Searcher, 12.7-point gain with eight parallel Searchers averaged over eight benchmarks, and 86.2 on BrowseComp with 64 Searchers) are presented without any description of baselines, error bars, run counts, dataset splits, or ablation controls. Because these numbers are the primary support for the claim that evidence-graph assembly outperforms standard parallel rollouts, the absence of experimental details is load-bearing for the central empirical contribution.
  Authors: We agree that the abstract would be strengthened by additional context on the experimental setup. In the revised version we will expand the abstract to briefly identify the primary baselines (standard ReAct rollouts and parallel aggregation methods), note that results are averaged across multiple runs with error bars reported in the main text, and direct readers to the Experiments section for full details on dataset splits, run counts, and ablations. This keeps the abstract concise while making the headline numbers more interpretable.
  Revision: yes
- Referee: Abstract: the scalability argument rests on the Navigator reliably identifying missing complementary evidence in the shared graph and dispatching Searchers without duplication or context growth beyond 21.5K tokens. No quantitative metrics on gap-detection precision, duplication rates, or evidence-graph size as a function of Searcher count are supplied, leaving the core 'jigsaw assembly' mechanism untested at the 64-Searcher scale where the strongest result is reported.
  Authors: The reported scaling results up to 64 Searchers, together with the bounded Navigator context length, provide empirical support that the Navigator successfully identifies complementary gaps and avoids excessive duplication. We nevertheless recognize the value of direct internal metrics. In the revision we will add a dedicated analysis (new figure or subsection) that reports gap-detection precision (via held-out evidence checks), average duplication rates, and evidence-graph size as functions of Searcher count, computed from the existing experimental runs.
  Revision: yes
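The gap-detection precision promised in the second response could be computed along these lines. This is a sketch under assumptions, not the authors' protocol: each benchmark item is assumed to come with a held-out set of gold evidence pieces, a gap flagged by the Navigator counts as correct when it names a gold piece not yet in the graph, and the function and argument names are hypothetical.

```python
def gap_detection_scores(flagged, gold, collected):
    """Precision and recall of the gaps the Navigator flags.

    flagged:   pieces the Navigator reports as missing
    gold:      held-out set of pieces the answer actually needs (assumed available)
    collected: pieces already in the evidence graph
    """
    true_gaps = set(gold) - set(collected)
    flagged = set(flagged)
    hits = flagged & true_gaps
    precision = len(hits) / len(flagged) if flagged else 0.0
    recall = len(hits) / len(true_gaps) if true_gaps else 1.0
    return precision, recall
```

Reporting recall next to precision guards against a Navigator that flags very few gaps and so scores well on precision while leaving the graph incomplete.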
Circularity Check
No circularity: empirical architecture with benchmark results only
Full rationale
The paper proposes an agentic system (Searcher + Navigator with shared evidence graph) and reports empirical benchmark gains from training the Navigator with RL and, independently, the Searcher as a standard ReAct agent. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described claims. The performance numbers are direct experimental outcomes rather than reductions to inputs by construction. The central mechanism is presented as a design choice whose value is measured externally on benchmarks, with no load-bearing step that collapses to a self-definition or prior self-citation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Deep research answers are composed of complementary pieces of evidence that can be identified and completed without duplication or information loss.
invented entities (1)
- Evidence graph: no independent evidence
A Training Details
Searcher. The Searcher shares the Navigator Qwen3.5-35B-A3B ...