Tongyi DeepResearch Technical Report

Bo Zhang; Chenxi Wang; Dingchu Zhang; Donglei Yu; Fei Huang; Gang Fu; Guangyu Li; Guoxin Chen; Hailong Yin; Haiyang Shen

arxiv: 2510.24701 · v3 · pith:W6AUY5MMnew · submitted 2025-10-28 · 💻 cs.CL · cs.AI · cs.IR · cs.LG · cs.MA


Tongyi DeepResearch Team: Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Minpeng Liao, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang


Pith reviewed 2026-05-21 19:18 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG · cs.MA

keywords agentic large language model · deep research · automatic data synthesis · information seeking · long-horizon tasks · MoE model · benchmark performance · end-to-end training


The pith

A 30.5 billion parameter agentic model activates only 3.3 billion parameters per token to reach state-of-the-art results on long-horizon deep research benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Tongyi DeepResearch as an agentic large language model built for long-horizon deep information-seeking research tasks. It develops the model through an end-to-end training framework that combines agentic mid-training and agentic post-training. A fully automatic data synthesis pipeline supplies all training data without human annotation, and customized environments maintain stable interactions at every stage. This approach yields superior performance on multiple agentic benchmarks including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES, and xbench-DeepSearch-2510. The model is released with its framework and solutions to support further community development.

Core claim

Tongyi DeepResearch is an agentic large language model specifically designed for long-horizon, deep information-seeking research tasks. It is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabled by a highly scalable fully automatic data synthesis pipeline that requires no human annotation. Customized environments for each training stage support stable and consistent interactions. Featuring 30.5 billion total parameters with only 3.3 billion activated per token, the model achieves state-of-the-art performance across agentic deep research benchmarks such as Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-

What carries the argument

The end-to-end training framework of agentic mid-training and agentic post-training, powered by a fully automatic scalable data synthesis pipeline and stage-specific customized environments.

If this is right

The framework enables scalable reasoning and information seeking across complex long-horizon tasks.
Stable interactions occur throughout all training stages due to the customized environments.
Superior results are obtained on a range of agentic deep research benchmarks without human-annotated data.
The open-sourced model, framework, and complete solutions allow direct community extension and application.
Sparse activation of 3.3 billion parameters per token supports efficient operation on demanding research tasks.
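The sparse-activation figure in the last point can be made concrete with a little arithmetic. The sketch below uses only the two parameter counts quoted in the report; the function and variable names are illustrative choices, not anything defined by the authors.

```python
TOTAL_PARAMS = 30.5e9   # total parameters, as stated in the report
ACTIVE_PARAMS = 3.3e9   # parameters activated per token, as stated in the report


def activation_ratio(active: float, total: float) -> float:
    """Return the fraction of parameters that are active for each token."""
    if total <= 0 or not 0 <= active <= total:
        raise ValueError("expected 0 <= active <= total and total > 0")
    return active / total


ratio = activation_ratio(ACTIVE_PARAMS, TOTAL_PARAMS)
sparsity_factor = TOTAL_PARAMS / ACTIVE_PARAMS

print(f"active fraction per token: {ratio:.1%}")             # roughly 10.8%
print(f"total-to-active ratio:     {sparsity_factor:.1f}x")  # roughly 9.2x
```

The ratio describes per-token compute only. In a mixture-of-experts model the full parameter set typically still has to be stored, so memory requirements follow the 30.5 billion figure rather than the 3.3 billion one.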

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Automatic synthesis pipelines of this kind could speed up development of agentic systems in domains other than research.
The mid-training plus post-training sequence may generalize as a template for building other long-horizon agents.
Sparse models trained this way might make advanced research agents more practical to deploy at scale.
Applying the same pipeline to different base architectures could test how broadly the performance gains transfer.

Load-bearing premise

The fully automatic data synthesis pipeline generates training data of sufficient quality and diversity to produce stable agentic behavior and superior benchmark performance.

What would settle it

Independent evaluation showing that Tongyi DeepResearch scores below prior state-of-the-art systems on BrowseComp or Humanity's Last Exam when the custom environments or automatic synthesis pipeline are removed.

Original abstract

We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Tongyi DeepResearch is a technical report on an open-sourced 30.5B/3.3B MoE agentic model trained end-to-end with automatic data synthesis, claiming SOTA on several deep-research benchmarks.


This report describes Tongyi DeepResearch, a sparsely activated model with 30.5 billion total parameters and 3.3 billion active per token, built for long-horizon information-seeking tasks. It uses an end-to-end framework of agentic mid-training and post-training, backed by a fully automatic data synthesis pipeline that avoids human annotation and by customized environments for each stage. The authors claim state-of-the-art results on benchmarks including Humanity's Last Exam, BrowseComp, WebWalkerQA, xbench-DeepSearch, and xbench-DeepSearch-2510, and they open-source the model, framework, and solutions.

Referee Report

2 major / 3 minor

Summary. The manuscript presents Tongyi DeepResearch, a 30.5B-parameter MoE agentic LLM (3.3B activated per token) for long-horizon deep research tasks. It is trained end-to-end via agentic mid-training and post-training using a fully automatic data synthesis pipeline that constructs customized environments per stage, without human annotation. The model is reported to achieve SOTA results on benchmarks including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES, and xbench-DeepSearch-2510. The model, framework, and solutions are open-sourced.

Significance. If the results hold, the work demonstrates a scalable path to high-performing agentic systems for complex information-seeking via automated data generation and staged training. The open-sourcing of the complete pipeline is a clear strength supporting reproducibility. The customized environments for stable interactions address a practical challenge in agent training. The stress-test concern regarding data quality in the automatic pipeline does not appear to undermine the claims, as the manuscript describes a self-consistent construction with no identified internal inconsistencies.

major comments (2)

[§3.2] §3.2 (Data Synthesis Pipeline): the claim that the fully automatic pipeline produces data of sufficient quality and diversity for superior agentic performance is load-bearing for the central result; additional specifics on quality control mechanisms, diversity metrics, or ablation studies showing their impact on downstream benchmark scores would strengthen this.
[Evaluation section] Evaluation section, main results table: the SOTA assertions on Humanity's Last Exam and BrowseComp would be more robust with explicit numerical scores for the proposed model, direct baseline comparisons, and any reported variance or statistical tests.

minor comments (3)

[Abstract] Abstract: the benchmark list is dense; separating or grouping them would improve readability.
[Model Architecture] Model description: the 30.5B/3.3B MoE activation ratio should be accompanied by an equation or diagram for clarity.
[Introduction] Ensure first mentions of all benchmarks (e.g., FRAMES) include citations to their source papers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their positive assessment of our work and the recommendation for minor revision. We appreciate the constructive feedback on strengthening the presentation of the data synthesis pipeline and evaluation results. Below we address each major comment point by point, indicating the revisions we will make to the manuscript.


Referee: [§3.2] §3.2 (Data Synthesis Pipeline): the claim that the fully automatic pipeline produces data of sufficient quality and diversity for superior agentic performance is load-bearing for the central result; additional specifics on quality control mechanisms, diversity metrics, or ablation studies showing their impact on downstream benchmark scores would strengthen this.

Authors: We agree that further elaboration on the data synthesis pipeline would strengthen the manuscript. Section 3.2 already describes the fully automatic, self-consistent construction process with customized environments per stage to ensure stable interactions. In the revised version, we will expand this section to include additional details on quality control mechanisms (such as automated consistency checks and filtering criteria), quantitative diversity metrics (e.g., distribution of task complexity and domain coverage), and a summary of internal ablation studies demonstrating the pipeline's contribution to downstream performance. These additions will be presented without introducing new experiments.

revision: yes

Referee: [Evaluation section] Evaluation section, main results table: the SOTA assertions on Humanity's Last Exam and BrowseComp would be more robust with explicit numerical scores for the proposed model, direct baseline comparisons, and any reported variance or statistical tests.

Authors: The main results table in the Evaluation section already reports the explicit numerical scores for Tongyi DeepResearch alongside direct baseline comparisons across all listed benchmarks, including Humanity's Last Exam and BrowseComp. To address the referee's point, we will revise the table and surrounding text to more prominently highlight these scores and comparisons. Regarding variance and statistical tests, we will add any available multi-run variance estimates; however, formal statistical significance tests were not performed for all benchmarks due to the standardized nature of the evaluation suites and computational constraints, which we will explicitly note in the revision.

revision: partial
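A multi-run variance estimate of the kind mentioned in this response could be reported along the following lines. This is a minimal sketch, assuming each benchmark is evaluated several times and the per-run accuracies are available as plain numbers. The helper name and the scores are placeholders invented for illustration; they are not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated evaluation runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return mean(scores), stdev(scores)


# Placeholder accuracies (percent) for three hypothetical runs of one benchmark.
placeholder_runs = [40.0, 42.0, 44.0]

avg, sd = summarize_runs(placeholder_runs)
print(f"{avg:.1f} ± {sd:.1f}")  # 42.0 ± 2.0
```

Reporting the mean together with the sample standard deviation lets a reader judge whether a gap to a baseline exceeds run-to-run noise, even where no formal significance test is run.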

Circularity Check

0 steps flagged

No significant circularity detected


The paper is an empirical technical report describing the architecture, end-to-end agentic training framework, and fully automatic data synthesis pipeline for Tongyi DeepResearch (30.5B/3.3B MoE). Central claims consist of benchmark results on external suites (Humanity's Last Exam, BrowseComp, WebWalkerQA, etc.) rather than any first-principles derivation, fitted parameter, or self-referential definition. No equations, uniqueness theorems, or ansatzes are presented that reduce outputs to inputs by construction; the pipeline and results are reported as self-consistent without load-bearing self-citations or renaming of known patterns. The derivation chain is therefore independent of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the premise that automatic data synthesis and stage-specific environments produce high-quality agentic training signals; no explicit free parameters, invented entities, or additional axioms are detailed in the abstract.

axioms (2)

domain assumption: Agentic mid-training and post-training can incentivize autonomous deep research agency
Core premise stated for the development framework.
domain assumption: A fully automatic data synthesis pipeline can generate sufficient high-quality data without human annotation
Enables all training stages as described.




Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
cs.AI 2026-04 accept novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
cs.CL 2026-05 unverdicted novelty 7.0

REFLECT benchmark shows current LLM judges achieve below 55% accuracy detecting failures in evidence-based research agents, especially on evidence verification.
Argus: Evidence Assembly for Scalable Deep Research Agents
cs.CL 2026-05 unverdicted novelty 7.0

Argus coordinates a Navigator and multiple Searchers via an evidence graph to assemble complete, source-traced answers, yielding benchmark gains up to 12.7 points with 8 parallel agents and 86.2 on BrowseComp with 64 agents.
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 unverdicted novelty 7.0

ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and tha...
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI 2026-05 conditional novelty 7.0

ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
Learning Agentic Policy from Action Guidance
cs.CL 2026-05 unverdicted novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
cs.AI 2026-05 unverdicted novelty 7.0

CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
cs.CL 2026-05 unverdicted novelty 7.0

A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent
cs.AI 2026-05 unverdicted novelty 7.0

AIDA is the first end-to-end autonomous agent that combines a domain-specific language with Pareto-guided reinforcement learning to discover insights from complex business data.
Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
cs.LG 2026-05 conditional novelty 7.0

Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.
Fine-Tuning Small Reasoning Models for Quantum Field Theory
cs.LG 2026-04 unverdicted novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
Evaluating the Search Agent in a Parallel World
cs.AI 2026-03 unverdicted novelty 7.0

Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping ...
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
cs.AI 2026-05 unverdicted novelty 6.0

SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.
Argus: Evidence Assembly for Scalable Deep Research Agents
cs.CL 2026-05 unverdicted novelty 6.0

Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmar...
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
cs.AI 2026-05 unverdicted novelty 6.0

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
cs.LG 2026-05 unverdicted novelty 6.0

HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
cs.LG 2026-05 unverdicted novelty 6.0

An LLM entity-tagging pipeline plus multi-agent system extracts ~6.3M nuanced records from 22.5M PubMed papers across six tasks with lower measured error than existing curated databases.
LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents
cs.AI 2026-05 unverdicted novelty 6.0

Context-ReAct enables agents to dynamically manage context via five atomic operations, and LongSeeker fine-tuned on 10k trajectories achieves 61.5% and 62.5% on BrowseComp benchmarks, outperforming prior agents.
Towards Knowledgeable Deep Research: Framework and Benchmark
cs.AI 2026-04 unverdicted novelty 6.0

The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models
cs.CL 2026-04 unverdicted novelty 6.0

TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.
Learning to Retrieve from Agent Trajectories
cs.IR 2026-03 conditional novelty 6.0

Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.
Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search
cs.AI 2026-03 conditional novelty 6.0

EASP adds a Probe-then-Plan step so LLMs ground their search plans in actual retrieval snapshots and inventory, yielding higher recall and business metrics in sub-second production search.
EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
cs.CL 2026-02 conditional novelty 6.0

EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
cs.CL 2025-11 unverdicted novelty 6.0

MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.
Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
cs.AI 2026-05 unverdicted novelty 5.0

MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.
ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
cs.CV 2026-05 unverdicted novelty 5.0

ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.
CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
cs.AI 2026-05 unverdicted novelty 5.0

CuSearch reallocates fixed training budget toward deeper-search rollouts in RLVR for agentic RAG, treating search depth as an annotation-free proxy for supervision density and reporting up to 11.8 exact-match gains ov...
Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent
cs.AI 2026-05 unverdicted novelty 5.0

AIDA is a reinforcement learning agent that explores complex business databases using a proprietary DSL and Pareto-guided reasoning to discover actionable insights autonomously.
Mind DeepResearch Technical Report
cs.AI 2026-04 unverdicted novelty 5.0

MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
cs.LG 2026-04 unverdicted novelty 5.0

The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...
Valley3: Scaling Omni Foundation Models for E-commerce
cs.AI 2026-05 unverdicted novelty 4.0

Valley3 is an omni MLLM for e-commerce that uses a four-stage pre-training pipeline plus post-training for controllable reasoning and agentic search, outperforming baselines on e-commerce benchmarks while staying comp...
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
cs.CL 2026-02 unverdicted novelty 4.0

MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.