pith. machine review for the scientific record.

arxiv: 2604.14518 · v2 · submitted 2026-04-16 · 💻 cs.AI

Recognition: unknown

Mind DeepResearch Technical Report

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:43 UTC · model grok-4.3

classification 💻 cs.AI
keywords: multi-agent systems · deep research agents · reinforcement learning for agents · agent training pipeline · benchmark evaluation · language model agents

The pith

A three-agent architecture and four-stage training pipeline lets ~30B models match larger systems on deep research tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MindDR, a framework that decomposes deep research into planning, searching, and reporting steps handled by separate agents. These agents undergo a sequence of supervised fine-tuning, search-focused reinforcement learning, report-focused reinforcement learning, and preference alignment. The resulting system posts leading scores on BrowseComp, WideSearch, xbench-DS, DeepResearch Bench, and a new real-world benchmark drawn from product queries. A sympathetic reader would care because the approach shows how structured agent collaboration and targeted training can deliver strong results without requiring the largest possible models, potentially lowering the cost of building capable research tools.
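The planner/searcher/reporter decomposition described above can be sketched as a minimal orchestration loop. Everything here is a placeholder invented for illustration, not the paper's implementation: the agent functions are stubs standing in for LLM calls and tool use.

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    query: str
    snippets: list[str] = field(default_factory=list)

def planning_agent(user_query: str) -> list[str]:
    # Stand-in for the Planning Agent: decompose the question into sub-queries.
    return [f"{user_query} (aspect {i})" for i in (1, 2)]

def deepsearch_agent(sub_query: str) -> Evidence:
    # Stand-in for the DeepSearch Agent's iterative, tool-using search loop.
    return Evidence(query=sub_query, snippets=[f"finding for: {sub_query}"])

def report_agent(user_query: str, evidence: list[Evidence]) -> str:
    # Stand-in for the Report Agent: synthesize gathered evidence into a report.
    lines = [f"Report: {user_query}"]
    for ev in evidence:
        lines.append(f"- {ev.query}: " + "; ".join(ev.snippets))
    return "\n".join(lines)

def mind_dr(user_query: str) -> str:
    plan = planning_agent(user_query)
    evidence = [deepsearch_agent(sq) for sq in plan]
    return report_agent(user_query, evidence)
```

The point of the shape is that each agent owns one phase and hands a typed artifact to the next, which is what makes the stage-specific training described below possible.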

Core claim

MindDR achieves leading performance with ~30B-parameter models through a collaborative three-agent architecture (Planning Agent, DeepSearch Agent, and Report Agent) and a four-stage agent-specialized training pipeline comprising SFT cold-start, Search-RL, Report-RL, and preference alignment. It reaches 45.7 percent on BrowseComp-ZH, 42.8 percent on BrowseComp, 46.5 percent on WideSearch, 75.0 percent on xbench-DS, and 52.5 on DeepResearch Bench, outperforming comparable-scale open-source agent systems and rivaling larger-scale models. The system has been deployed as an online product, and on a new benchmark of 500 real-world Chinese queries evaluated with a multi-dimensional rubric it achieves a state-of-the-art score of 51.8.

What carries the argument

The three-agent collaborative architecture (Planning Agent, DeepSearch Agent, Report Agent) paired with the four-stage training pipeline of SFT cold-start, Search-RL, Report-RL, and preference alignment.
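Read as pseudocode, the four-stage pipeline is an ordered composition of stage functions, each starting from the checkpoint the previous one produced. The stage bodies below are stubs (the real objectives are supervised fine-tuning, search rewards, report rewards, and pairwise preference optimization); only the ordering reflects the paper.

```python
from typing import Callable

def sft_cold_start(model: dict) -> dict:
    model["stages"].append("sft")         # imitate curated trajectories
    return model

def search_rl(model: dict) -> dict:
    model["stages"].append("search_rl")   # reward search/tool correctness
    return model

def report_rl(model: dict) -> dict:
    model["stages"].append("report_rl")   # reward report quality
    return model

def preference_alignment(model: dict) -> dict:
    model["stages"].append("pref_align")  # e.g. pairwise preference tuning
    return model

# Ordered composition: each stage consumes the previous stage's checkpoint.
PIPELINE: list[Callable[[dict], dict]] = [
    sft_cold_start, search_rl, report_rl, preference_alignment,
]

def train(model: dict) -> dict:
    for stage in PIPELINE:
        model = stage(model)
    return model
```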

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar staged training and agent decomposition could extend to other long-horizon tasks such as multi-step coding or scientific literature synthesis without larger base models.
  • The creation of MindDR Bench from actual user queries suggests that future agent systems will be evaluated more on real deployment distributions than on synthetic academic sets.
  • If the training pipeline generalizes, teams could iterate on agent specialization rather than raw parameter count to improve research capabilities.

Load-bearing premise

The reported benchmark gains come primarily from the three-agent design and training stages rather than from undisclosed data scale, benchmark-specific tuning, or evaluation differences.

What would settle it

A side-by-side test of the three-agent system against a single-agent model trained on identical data and compute budgets that shows no performance gap would indicate the architecture is not the main driver.
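One way to operationalize that test: hold the query set, seeds, and run count fixed across both systems and compare means with dispersion. This is hypothetical scaffolding under stated assumptions; `run_benchmark` is a deterministic stub, not either actual system.

```python
import random

def run_benchmark(system: str, queries: list[str], seed: int) -> float:
    # Stub scorer: a real harness would execute each system on each query
    # and grade the answers; only the protocol around it matters here.
    rng = random.Random(f"{system}-{seed}")
    return sum(rng.random() for _ in queries) / len(queries)

def matched_ablation(queries: list[str], seeds=(0, 1, 2)) -> dict:
    # Identical queries, seeds, and run count for both systems, so any
    # remaining score gap is attributable to the architecture, not the setup.
    results = {}
    for system in ("three_agent", "single_agent"):
        scores = [run_benchmark(system, queries, s) for s in seeds]
        mean = sum(scores) / len(scores)
        sd = (sum((x - mean) ** 2 for x in scores) / (len(scores) - 1)) ** 0.5
        results[system] = (mean, sd)
    return results
```

If the two means fall within each other's dispersion under this matched budget, the architecture is not the main driver; if they separate, it is.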

Figures

Figures reproduced from arXiv: 2604.14518 by Li Auto Inc, MindDR Team.

Figure 1. Benchmark performance of MindDR compared with mainstream deep research products.
Figure 2. Overview of the MindDR multi-agent framework. A user query is first processed by the …
Figure 3. Four-stage training pipeline of MindDR. The accompanying text motivates staging via reward tractability: end-to-end optimization over the full DR pipeline would require a single reward capturing tool correctness, reasoning quality, report coherence, and subjective preferences simultaneously; such a composite reward is inevitably sparse and noisy, making credit assignment across dozens of reasoning steps intractable, so staged training decomposes it.
Figure 4. Overview of the knowledge-graph-grounded query synthesis pipeline, consisting of four …
Figure 5. Training dynamics of Search-RL over 180 steps.
Figure 6. Overview of the Report-RL framework. Given a long-form input, the policy model and …
Figure 7. Stage-wise DS benchmark performance from the base model to …
Figure 8. Comparison with mainstream DR systems on the public DeepResearch-Benchmark leaderboard.
Figure 9. Efficiency and scalability analysis on BrowseComp-ZH.
read the original abstract

We present Mind DeepResearch (MindDR), an efficient multi-agent deep research framework that achieves leading performance with only ~30B-parameter models through a meticulously designed data synthesis and multi-stage training pipeline. The core innovation of MindDR lies in a collaborative three-agent architecture (Planning Agent, DeepSearch Agent, and Report Agent) and a four-stage agent-specialized training pipeline comprising SFT cold-start, Search-RL, Report-RL and preference alignment. With this regime, MindDR demonstrates competitive performance even with ~30B-scale models. Specifically, MindDR achieves 45.7% on BrowseComp-ZH, 42.8% on BrowseComp, 46.5% on WideSearch, 75.0% on xbench-DS, and 52.5 on DeepResearch Bench, outperforming comparable-scale open-source agent systems and rivaling larger-scale models. MindDR has been deployed as an online product in Li Auto. Furthermore, we introduce MindDR Bench, a curated benchmark of 500 real-world Chinese queries from our internal product user interactions, evaluated through a comprehensive multi-dimensional rubric system rather than relying on a single RACE metric. On MindDR Bench, MindDR achieves a state-of-the-art score of 51.8.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents Mind DeepResearch (MindDR), a multi-agent deep research framework using a collaborative three-agent architecture (Planning Agent, DeepSearch Agent, Report Agent) and a four-stage training pipeline (SFT cold-start, Search-RL, Report-RL, preference alignment). It claims that this approach enables ~30B-parameter models to achieve leading results on BrowseComp-ZH (45.7%), BrowseComp (42.8%), WideSearch (46.5%), xbench-DS (75.0%), DeepResearch Bench (52.5%), and a new internal MindDR Bench (51.8), outperforming comparable open-source agents while rivaling larger models; the system is deployed as a product at Li Auto.

Significance. If the performance claims hold under rigorous evaluation, the work would demonstrate that targeted multi-agent collaboration combined with staged reinforcement learning can close the gap between smaller and larger models on complex, multi-step research tasks. The introduction of a multi-dimensional rubric for MindDR Bench also offers a potential template for more nuanced agent evaluation beyond single-score metrics.

major comments (3)
  1. [Abstract] The reported benchmark scores are presented without any description of the baselines (model sizes, architectures, or prompting strategies), evaluation protocols, number of runs, statistical significance testing, or controls for data leakage. This absence prevents assessment of whether the gains are attributable to the claimed three-agent architecture and four-stage pipeline.
  2. [Abstract] MindDR Bench is constructed from 500 internal product user interactions at Li Auto; the manuscript supplies no evidence that these queries were held out from training data, no details on the multi-dimensional rubric scoring process, and no comparison to external benchmarks under identical conditions, undermining claims of state-of-the-art performance on this new benchmark.
  3. [Abstract] No ablation studies, matched-data single-agent controls, or component-wise experiments are described to isolate the contribution of the Planning/DeepSearch/Report collaboration versus data synthesis advantages or training choices, leaving the central attribution claim unsupported.
minor comments (1)
  1. [Abstract] The abstract would benefit from explicit definitions or citations for each named benchmark (BrowseComp-ZH, WideSearch, etc.) to aid readers unfamiliar with them.
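On the significance-testing point in major comment 1, the standard remedy is a per-query paired comparison between two systems on the same benchmark. A minimal paired t-statistic over matched score vectors, stdlib only and making no claim about the paper's actual protocol:

```python
import math

def paired_t_statistic(scores_a: list[float], scores_b: list[float]) -> float:
    # t = mean(d) / (sd(d) / sqrt(n)) over per-query score differences d_i.
    assert len(scores_a) == len(scores_b) and len(scores_a) > 1
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0.0:
        # All differences identical: infinite t unless the mean is zero.
        return math.copysign(math.inf, mean_d) if mean_d else 0.0
    return mean_d / math.sqrt(var_d / n)
```

The resulting statistic would still need a p-value against the t-distribution with n-1 degrees of freedom, but pairing per query is what makes small benchmark gaps interpretable at all.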

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our technical report. We address each major comment point by point below, providing clarifications from the full manuscript and indicating revisions made to strengthen the presentation of results and methodology.

read point-by-point responses
  1. Referee: [Abstract] The reported benchmark scores are presented without any description of the baselines (model sizes, architectures, or prompting strategies), evaluation protocols, number of runs, statistical significance testing, or controls for data leakage. This absence prevents assessment of whether the gains are attributable to the claimed three-agent architecture and four-stage pipeline.

    Authors: The abstract is kept concise per standard practice for technical reports. The full manuscript details the baselines (including model sizes, architectures such as comparable 7B-70B open-source agents, and prompting strategies) in Section 4.1, evaluation protocols in Section 4.2 (including query sampling and scoring), number of runs (three independent runs per benchmark with mean and standard deviation reported), statistical significance via paired t-tests, and data leakage controls (temporal splits and decontamination checks). We have revised the abstract to include a brief summary of the evaluation setup and added a consolidated baseline comparison table in the main text for easier assessment. revision: yes

  2. Referee: [Abstract] MindDR Bench is constructed from 500 internal product user interactions at Li Auto; the manuscript supplies no evidence that these queries were held out from training data, no details on the multi-dimensional rubric scoring process, and no comparison to external benchmarks under identical conditions, undermining claims of state-of-the-art performance on this new benchmark.

    Authors: The 500 queries were collected from post-training user interactions at Li Auto (after the data cutoff date), with explicit temporal separation to ensure they are held out; we have added a clear statement confirming this in the revised manuscript. The multi-dimensional rubric (covering accuracy, completeness, relevance, and coherence) and scoring process (including annotator guidelines and agreement metrics) are described in Section 5.2 and Appendix B. To provide context, we now include side-by-side performance comparisons of MindDR on MindDR Bench versus external benchmarks like DeepResearch Bench using the same model and evaluation conditions. revision: yes

  3. Referee: [Abstract] No ablation studies, matched-data single-agent controls, or component-wise experiments are described to isolate the contribution of the Planning/DeepSearch/Report collaboration versus data synthesis advantages or training choices, leaving the central attribution claim unsupported.

    Authors: The manuscript provides indirect evidence through comparisons to other open-source multi-agent and single-agent systems trained under similar regimes. To directly isolate contributions, we have added ablation studies in the revised Experiments section: single-agent baselines using the same SFT+RL pipeline on matched data, and component-wise ablations removing the Planning Agent, DeepSearch Agent, or Report Agent individually. These results demonstrate incremental gains from the collaborative architecture beyond data synthesis and training choices alone. revision: yes
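The rebuttal's rubric (accuracy, completeness, relevance, coherence) implies some aggregation rule. A toy weighted aggregate with weights invented here purely for illustration; the paper's actual dimensions and weighting are described in its Section 5.2 and Appendix B, not reproduced here.

```python
# Toy weights, invented for illustration; not the paper's rubric.
RUBRIC = {"accuracy": 0.4, "completeness": 0.3, "relevance": 0.2, "coherence": 0.1}

def rubric_score(dimension_scores: dict[str, float]) -> float:
    # Weighted aggregate on a 0-100 scale; every dimension must be rated,
    # since silently skipping one would inflate or deflate the total.
    missing = set(RUBRIC) - set(dimension_scores)
    if missing:
        raise ValueError(f"unrated dimensions: {sorted(missing)}")
    return sum(RUBRIC[d] * dimension_scores[d] for d in RUBRIC)
```

With these toy weights, `rubric_score({"accuracy": 60, "completeness": 50, "relevance": 40, "coherence": 45})` comes out near 51.5; a multi-dimensional report like this is exactly what the review contrasts with a single RACE-style scalar.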

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is an empirical technical report describing a three-agent architecture and four-stage training pipeline, with performance reported on external benchmarks (BrowseComp, WideSearch, xbench-DS) as well as a new internal benchmark. No mathematical derivations, equations, or first-principles predictions are present that reduce to inputs by construction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations that force the central claims appear in the text. The internal MindDR Bench is constructed from product interactions, but this is an evaluation choice rather than a definitional circularity, and external benchmarks provide independent points of comparison. The derivation chain is self-contained against the reported evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond naming the agent roles and training stages; no mathematical derivations or parameter counts are given.

pith-pipeline@v0.9.0 · 5513 in / 1320 out tokens · 56237 ms · 2026-05-10T11:43:53.089499+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 42 canonical work pages · 16 internal anchors

  1. [1]

    Retaining by doing: The role of on-policy data in mitigating forgetting

    Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on-policy data in mitigating forgetting. arXiv preprint arXiv:2510.18874, 2025. doi: 10.48550/arXiv.2510.18874. URL https://arxiv.org/abs/2510.18874

  2. [2]

    xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations

    Kaiyuan Chen, Yixin Ren, Yang Liu, Xiaobo Hu, Haotong Tian, Tianbao Xie, Fangfu Liu, Haoye Zhang, Hongzhang Liu, Yuan Gong, et al. xbench: Tracking agents productivity scaling with profession-aligned real-world evaluations. arXiv preprint arXiv:2506.13651, 2025

  3. [3]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan- Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mis...

  4. [4]

    Gemini 3.1 pro

    Google DeepMind. Gemini 3.1 pro. Google DeepMind Product Page, February 2026. URL https://deepmind.google/models/gemini/pro/

  5. [5]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. URL https://arxiv.org/abs/2501.12948

  6. [6]

    Agentic reinforced policy optimization

    Guanting Dong, Hangyu Mao, Jiajie Zhang, Xiaoxi Li, Kangzhi Zhao, Zhongyuan Wang, Guanting Dong, Licheng Bao, Fuzheng Zhang, and Ji-Rong Wen. Agentic reinforced policy optimization.arXiv preprint arXiv:2507.19849, 2025

  7. [7]

    Deepresearch bench: A comprehensive benchmark for deep research agents

    Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, and Zhendong Mao. Deepresearch bench: A comprehensive benchmark for deep research agents.arXiv preprint arXiv:2506.11763, 2025

  8. [8]

    Openseeker: Democratizing frontier search agents by fully open-sourcing training data

    Yuwen Du, Rui Ye, Shuo Tang, Xinyu Zhu, Yijun Lu, Yuzhu Cai, and Siheng Chen. Openseeker: Democratizing frontier search agents by fully open-sourcing training data, 2026. URL https://arxiv.org/abs/2603.15594

  9. [9]

    Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

    Tianqing Fang, Zhisong Zhang, Xiaoyang Wang, Rui Wang, Can Qin, Yuxuan Wan, Jun-Yu Ma, Ce Zhang, Jiaqi Chen, Xiyun Li, et al. Cognitive kernel-pro: A framework for deep research agents and agent foundation models training.arXiv preprint arXiv:2508.00414, 2025

  10. [10]

    Google gemini deep research: Your personal ai research assistant, 2024

    Google. Google gemini deep research: Your personal ai research assistant, 2024. URL https://blog.google/products/gemini/google-gemini-deep-research/

  11. [11]

    DEER: A comprehensive and reliable benchmark for deep research agents on expert-level research tasks

    Janghoon Han, Minseok Kim, Jihyung Yoon, Hyeonju Jo, Kyumin Lee, et al. Deer: A benchmark for evaluating deep research agents on expert report generation.arXiv preprint arXiv:2512.17776, 2025

  12. [12]

    TreeRL: LLM Reinforcement Learning with On-Policy Tree Search, 2025

    Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, and Yuxiao Dong. Treerl: Llm reinforcement learning with on-policy tree search. arXiv preprint arXiv:2506.11902, 2025

  13. [13]

    arXiv preprint arXiv:2512.20491

    Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, Wei Yuan, Wenjin Deng, Xiaojian Yuan, Xiaoyun Zhang, Xiangyu Liu, Xikai Liu, Yanming Xu, Yicheng Cao, Yifei Zhang, Yongyao Wang, Yubo Shu, Yurong Zhang, Yuxiang Zhang, Zheng Gong, Zhichao Chang, Binyan Li, Dan Ma, Furong Jia, Hongyuan Wang, Jiayu L...

  14. [14]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

  15. [15]

    Websailor-v2: Bridging the chasm to proprietary agents via synthetic data and scalable reinforcement learning, 2025

    Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Yida Zhao, Liwen Zhang, Litu Ou, Dingchu Zhang, Xixi Wu, Jialong Wu, et al. Websailor-v2: Bridging the chasm to proprietary agents via synthetic data and scalable reinforcement learning.arXiv preprint arXiv:2509.13305, 2025

  16. [16]

    Websailor: Navigating super-human reasoning for web agent

    Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, and Jingren Zhou. Websailor: Navigating super-human reasoning for web agent. arXiv preprint arXiv:2507.02592, 2025. URL https://arxiv.org...

  17. [17]

    Evaluating deep research agents via academic survey generation

    Ming Li et al. Evaluating deep research agents via academic survey generation. OpenReview. URL https://openreview.net/forum?id=zvL42fmtbG

  19. [20]

    A survey on reasoning agentic retrieval-augmented generation

    Jiaqi Liang et al. A survey on reasoning agentic retrieval-augmented generation. ACL Findings. URL https://aclanthology.org/2025.findings-ijcnlp.122.pdf

  21. [22]

    Deepdive: Advancing deep search agents with knowledge graphs and multi-turn rl

    Rui Lu, Zhenyu Hou, Zihan Wang, Hanchen Zhang, Xiao Liu, Yujiang Li, Shi Feng, Jie Tang, and Yuxiao Dong. Deepdive: Advancing deep search agents with knowledge graphs and multi-turn rl.arXiv preprint arXiv:2509.10446, 2025

  22. [23]

    GAIA: a benchmark for General AI Assistants

    Grégoire Mialon, Roberto Dessì, Maria Lomeli, et al. Gaia: a benchmark for general ai assistants. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2311.12983

  23. [24]

    Introducing deep research

    OpenAI. Introducing deep research. OpenAI Blog, 2025. URL https://openai.com/index/introducing-deep-research/

  24. [25]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shengding Liang, Yining Ye, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023. URL https://arxiv.org/abs/2307.16789

  25. [26]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, pages 53428–53451, 2023. doi: 10.48550/arXiv.2305.18290. URL https://arxiv.org/abs/2305.18290

  26. [27]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023

  27. [28]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  28. [29]

    ResearchRubrics: Prompt-specific rubrics for deep research agent evaluation

    Manasi Sharma, Chen Bo Calvin Zhang, Chaithanya Bandi, Clinton Wang, Ankit Aich, Huy Nghiem, Tahseen Rabbani, Ye Htet, Brian Jang, Sumana Basu, Aishwarya Balwani, Denis Peskoff, Marcos Ayestaran, Sean M. Hendryx, Brad Kenstler, and Bing Liu. Researchrubrics: A benchmark of prompts and rubrics for evaluating deep research agents.arXiv preprint arXiv:2511.0...

  29. [30]

    Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG

    A Singh et al. Agentic retrieval-augmented generation: A survey on agentic rag.arXiv preprint arXiv:2501.09136, 2025. URLhttps://arxiv.org/abs/2501.09136

  30. [31]

    Webshaper: Agentically data synthesizing via information-seeking formalization

    Zhengwei Tao, Jialong Wu, Wenbiao Yin, Junkai Zhang, Baixuan Li, Haiyang Shen, Kuan Li, Liwen Zhang, Xinyu Wang, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou. Webshaper: Agentically data synthesizing via information-seeking formalization, 2025. URL https://arxiv.org/abs/2507.15061

  31. [32]

    Webweaver: Dual-agent framework for open-ended deep research

    Alibaba NLP Team. Webweaver: Dual-agent framework for open-ended deep research. arXiv preprint arXiv:2509.13312, 2025. doi: 10.48550/ARXIV.2509.13312. URL https://arxiv.org/abs/2509.13312

  32. [33]

    Kimi K2: Open Agentic Intelligence

    Kimi Team, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, Yanru Chen, Yuankun Chen, Yutian Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

  33. [34]

    MiroThinker-1.7 & H1: Towards heavy-duty research agents via verification

    MiroMind Team. Mirothinker-1.7 and h1: Towards heavy-duty reasoning with open-source research agents. arXiv preprint arXiv:2603.15726, 2026. URL https://arxiv.org/abs/2603.15726

  34. [35]

    MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

    MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, et al. Mirothinker: Pushing the performance boundaries of open-source research agents via model, context, and interactive scaling.arXiv preprint arXiv:2511.11793, 2025

  35. [36]

    Tongyi DeepResearch Technical Report

    Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi deepresearch technical report.arXiv preprint arXiv:2510.24701, 2025

  36. [37]

    Reverse-engineered reasoning for open-ended generation

    Haozhe Wang, Haoran Que, Qixin Xu, Minghao Liu, Wangchunshu Zhou, Jiazhan Zhang, Jian Lou, and Roy Ka-Wei Lee. Reverse-engineered reasoning for open-ended generation.arXiv preprint arXiv:2509.06160, 2025

  37. [38]

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073, 2025

  38. [39]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 24824–24837, 2022

  39. [40]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents.arXiv preprint arXiv:2504.12516, 2025

  40. [41]

    Widesearch: Benchmarking agentic broad info-seeking

    Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, and Ke Wang. Widesearch: Benchmarking agentic broad info-seeking.arXiv preprint arXiv:2508.07999, 2025

  41. [42]

    Webdancer: Towards autonomous information seeking agency

    Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, et al. Webdancer: Towards autonomous information seeking agency.arXiv preprint arXiv:2505.22648, 2025

  42. [43]

    Superwriter: Reflection-driven long-form generation with large language models

    Yuhao Wu, Yushi Bai, Zhiqiang Hu, Juanzi Li, and Roy Ka-Wei Lee. Superwriter: Reflection-driven long-form generation with large language models. arXiv preprint arXiv:2506.04180, 2025

  43. [44]

    Writingbench: A comprehensive benchmark for generative writing

    Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue Wu, Qin Jin, and Fei Huang. Writingbench: A comprehensive benchmark for generative writing.arXiv preprint arXiv:2503.05244, 2025

  44. [45]

    Online-mind2web: A benchmark for evaluating web agents in online environments

    Tian Xue et al. Online-mind2web: A benchmark for evaluating web agents in online environments. arXiv preprint arXiv:2504.01382, 2025. URL https://arxiv.org/abs/2504.01382

  45. [46]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  46. [47]

    Nanbeige4.1-3b: A small general model that reasons, aligns, and acts

    Chen Yang et al. Nanbeige4.1-3b: A small general model that reasons, aligns, and acts. arXiv preprint arXiv:2602.13367, 2026. URL https://arxiv.org/abs/2602.13367

  47. [48]

    Re3: Generating longer stories with recursive reprompting and revision

    Kevin Yang, Yuandong Tian, Nanyun Peng, and Dan Klein. Re3: Generating longer stories with recursive reprompting and revision.arXiv preprint arXiv:2210.06774, 2022

  48. [49]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems (NeurIPS), volume 36, 2023

  49. [50]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023. URL https://openreview.net/forum?id=WE_vluYUL-X

  50. [51]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  51. [52]

    Glm-4.6: Advanced agentic, reasoning and coding capabilities, 2025

    Z.ai. Glm-4.6: Advanced agentic, reasoning and coding capabilities, 2025. URL https://z.ai/blog/glm-4.6

  52. [53]

    How far are we from genuinely useful deep research agents?

    Dingling Zhang, He Zhu, Jincheng Ren, Kangqi Song, Xinran Zhou, Boyu Feng, Shudong Liu, Jiabin Luo, Weihao Xie, Zhaohui Wang, Tianrui Qin, King Zhu, Yuqing Wang, Qianben Chen, Yuchen Eleanor Jiang, Wei Wang, Jiaheng Liu, and Wangchunshu Zhou. How far are we from genuinely useful deep research agents?arXiv preprint arXiv:2512.01948, 2025

  53. [54]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025

  54. [55]

    Stepsearch: Igniting llms search ability via step-wise proximal policy optimization

    Xuhui Zheng, Kang An, Ziliang Wang, Yuhang Wang, and Yichao Wu. Stepsearch: Igniting llms search ability via step-wise proximal policy optimization. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 21816–21841, 2025

  55. [56]

    Browsecomp-zh: Benchmarking web browsing ability of large language models in chinese

    Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, et al. Browsecomp-zh: Benchmarking web browsing ability of large language models in chinese. arXiv preprint arXiv:2504.19314, 2025