Tongyi DeepResearch Technical Report
Pith reviewed 2026-05-21 19:18 UTC · model grok-4.3
The pith
A 30.5 billion parameter agentic model activates only 3.3 billion parameters per token to reach state-of-the-art results on long-horizon deep research benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Tongyi DeepResearch is an agentic large language model specifically designed for long-horizon, deep information-seeking research tasks. It is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabled by a highly scalable fully automatic data synthesis pipeline that requires no human annotation. Customized environments for each training stage support stable and consistent interactions. Featuring 30.5 billion total parameters with only 3.3 billion activated per token, the model achieves state-of-the-art performance across agentic deep research benchmarks such as Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-
What carries the argument
The end-to-end training framework of agentic mid-training and agentic post-training, powered by a fully automatic scalable data synthesis pipeline and stage-specific customized environments.
If this is right
- The framework enables scalable reasoning and information seeking across complex long-horizon tasks.
- Stable interactions occur throughout all training stages due to the customized environments.
- Superior results are obtained on a range of agentic deep research benchmarks without human-annotated data.
- The open-sourced model, framework, and complete solutions allow direct community extension and application.
- Sparse activation of 3.3 billion parameters per token supports efficient operation on demanding research tasks.
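The sparsity in the last point is simple arithmetic. A minimal Python sketch, using only the two parameter counts quoted in the report; the helper name `active_fraction` is illustrative and not part of the released code:

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Return the fraction of parameters used per token."""
    if total_params_b <= 0:
        raise ValueError("total parameter count must be positive")
    return active_params_b / total_params_b

TOTAL_B = 30.5   # total parameters, in billions (from the report)
ACTIVE_B = 3.3   # parameters activated per token, in billions (from the report)

fraction = active_fraction(TOTAL_B, ACTIVE_B)
print(f"{fraction:.1%} of parameters are active per token")
```

The ratio comes out to roughly 10.8%. It describes per-token compute only: in a typical mixture-of-experts deployment all 30.5 billion parameters still have to be stored.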
Where Pith is reading between the lines
- Automatic synthesis pipelines of this kind could speed up development of agentic systems in domains other than research.
- The mid-training plus post-training sequence may generalize as a template for building other long-horizon agents.
- Sparse models trained this way might make advanced research agents more practical to deploy at scale.
- Applying the same pipeline to different base architectures could test how broadly the performance gains transfer.
Load-bearing premise
The fully automatic data synthesis pipeline generates training data of sufficient quality and diversity to produce stable agentic behavior and superior benchmark performance.
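One way to make "diversity" in this premise measurable is normalized Shannon entropy over a categorical label attached to each synthesized task. This is a sketch under stated assumptions, not the paper's metric: the domain labels and both sample lists below are hypothetical placeholders.

```python
import math
from collections import Counter

def normalized_entropy(labels):
    """Shannon entropy of the label distribution, scaled to [0, 1].

    1.0 means the observed labels are uniformly spread;
    0.0 means only one label occurs (or the list is empty).
    """
    counts = Counter(labels)
    n = len(labels)
    if n == 0 or len(counts) < 2:
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))

# Hypothetical domain tags for synthesized tasks (illustration only).
balanced = ["history", "biology", "law", "finance"] * 25
skewed = ["history"] * 97 + ["biology", "law", "finance"]

balanced_score = normalized_entropy(balanced)
skewed_score = normalized_entropy(skewed)
print(round(balanced_score, 3), round(skewed_score, 3))
```

The score is normalized by the number of labels actually observed, so it does not penalize domains that are missing altogether; a fixed target taxonomy would be needed to capture that.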
What would settle it
Independent evaluation showing that Tongyi DeepResearch scores below prior state-of-the-art systems on BrowseComp or Humanity's Last Exam when the custom environments or automatic synthesis pipeline are removed.
Original abstract
We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Tongyi DeepResearch, a 30.5B-parameter MoE agentic LLM (3.3B activated per token) for long-horizon deep research tasks. It is trained end-to-end via agentic mid-training and post-training using a fully automatic data synthesis pipeline that constructs customized environments per stage, without human annotation. The model is reported to achieve SOTA results on benchmarks including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES, and xbench-DeepSearch-2510. The model, framework, and solutions are open-sourced.
Significance. If the results hold, the work demonstrates a scalable path to high-performing agentic systems for complex information-seeking via automated data generation and staged training. The open-sourcing of the complete pipeline is a clear strength supporting reproducibility. The customized environments for stable interactions address a practical challenge in agent training. The stress-test concern regarding data quality in the automatic pipeline does not appear to undermine the claims, as the manuscript describes a self-consistent construction with no identified internal inconsistencies.
major comments (2)
- [§3.2] §3.2 (Data Synthesis Pipeline): the claim that the fully automatic pipeline produces data of sufficient quality and diversity for superior agentic performance is load-bearing for the central result; additional specifics on quality control mechanisms, diversity metrics, or ablation studies showing their impact on downstream benchmark scores would strengthen this.
- [Evaluation section] Evaluation section, main results table: the SOTA assertions on Humanity's Last Exam and BrowseComp would be more robust with explicit numerical scores for the proposed model, direct baseline comparisons, and any reported variance or statistical tests.
minor comments (3)
- [Abstract] Abstract: the benchmark list is dense; separating or grouping them would improve readability.
- [Model Architecture] Model description: the 30.5B/3.3B MoE activation ratio should be accompanied by an equation or diagram for clarity.
- [Introduction] Ensure first mentions of all benchmarks (e.g., FRAMES) include citations to their source papers.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our work and the recommendation for minor revision. We appreciate the constructive feedback on strengthening the presentation of the data synthesis pipeline and evaluation results. Below we address each major comment point by point, indicating the revisions we will make to the manuscript.
Point-by-point responses
- Referee: [§3.2] §3.2 (Data Synthesis Pipeline): the claim that the fully automatic pipeline produces data of sufficient quality and diversity for superior agentic performance is load-bearing for the central result; additional specifics on quality control mechanisms, diversity metrics, or ablation studies showing their impact on downstream benchmark scores would strengthen this.
  Authors: We agree that further elaboration on the data synthesis pipeline would strengthen the manuscript. Section 3.2 already describes the fully automatic, self-consistent construction process with customized environments per stage to ensure stable interactions. In the revised version, we will expand this section to include additional details on quality control mechanisms (such as automated consistency checks and filtering criteria), quantitative diversity metrics (e.g., distribution of task complexity and domain coverage), and a summary of internal ablation studies demonstrating the pipeline's contribution to downstream performance. These additions will be presented without introducing new experiments.
  Revision: yes
- Referee: [Evaluation section] Evaluation section, main results table: the SOTA assertions on Humanity's Last Exam and BrowseComp would be more robust with explicit numerical scores for the proposed model, direct baseline comparisons, and any reported variance or statistical tests.
  Authors: The main results table in the Evaluation section already reports the explicit numerical scores for Tongyi DeepResearch alongside direct baseline comparisons across all listed benchmarks, including Humanity's Last Exam and BrowseComp. To address the referee's point, we will revise the table and surrounding text to more prominently highlight these scores and comparisons. Regarding variance and statistical tests, we will add any available multi-run variance estimates; however, formal statistical significance tests were not performed for all benchmarks due to the standardized nature of the evaluation suites and computational constraints, which we will explicitly note in the revision.
  Revision: partial
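The multi-run variance estimates mentioned in this response could be complemented by a per-question bootstrap, which needs no additional model runs. A minimal sketch, assuming each benchmark question yields a binary correct/incorrect outcome; the outcome vector below is synthetic placeholder data, not a reported result, and the function name is illustrative.

```python
import random

def bootstrap_accuracy_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy.

    outcomes: list of 0/1 values, one per benchmark question.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)          # fixed seed for reproducibility
    n = len(outcomes)
    point = sum(outcomes) / n
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(outcomes, k=n)  # sample with replacement
        means.append(sum(resample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return point, means[lo_idx], means[hi_idx]

# Synthetic placeholder: 200 questions, 80 answered correctly.
outcomes = [1] * 80 + [0] * 120
point, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy={point:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

This interval captures sampling noise over the question set only. It does not capture run-to-run variation from stochastic decoding or live web tools, which is why multi-run estimates remain useful alongside it.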
Circularity Check
No significant circularity detected
Full rationale
The paper is an empirical technical report describing the architecture, end-to-end agentic training framework, and fully automatic data synthesis pipeline for Tongyi DeepResearch (30.5B/3.3B MoE). Central claims consist of benchmark results on external suites (Humanity's Last Exam, BrowseComp, WebWalkerQA, etc.) rather than any first-principles derivation, fitted parameter, or self-referential definition. No equations, uniqueness theorems, or ansatzes are presented that reduce outputs to inputs by construction; the pipeline and results are reported as self-consistent without load-bearing self-citations or renaming of known patterns. The derivation chain is therefore independent of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Agentic mid-training and post-training can incentivize autonomous deep research agency
- domain assumption: A fully automatic data synthesis pipeline can generate sufficient high-quality data without human annotation
Forward citations
Cited by 32 Pith papers
- AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
  AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
- Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
  REFLECT benchmark shows current LLM judges achieve below 55% accuracy detecting failures in evidence-based research agents, especially on evidence verification.
- Argus: Evidence Assembly for Scalable Deep Research Agents
  Argus coordinates a Navigator and multiple Searchers via an evidence graph to assemble complete, source-traced answers, yielding benchmark gains up to 12.7 points with 8 parallel agents and 86.2 on BrowseComp with 64 agents.
- ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
  ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and tha...
- ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
  ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
- Learning Agentic Policy from Action Guidance
  ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
- CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
  CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.
- Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
  A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
- Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent
  AIDA is the first end-to-end autonomous agent that combines a domain-specific language with Pareto-guided reinforcement learning to discover insights from complex business data.
- Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
  Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.
- Fine-Tuning Small Reasoning Models for Quantum Field Theory
  Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
- Evaluating the Search Agent in a Parallel World
  Mind-ParaWorld creates parallel worlds with atomic facts to evaluate search agents on future scenarios, showing they synthesize evidence well but struggle with collection, coverage, sufficiency judgment, and stopping ...
- Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
  SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.
- Argus: Evidence Assembly for Scalable Deep Research Agents
  Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmar...
- ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
  ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
- HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
  HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
- Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
  An LLM entity-tagging pipeline plus multi-agent system extracts ~6.3M nuanced records from 22.5M PubMed papers across six tasks with lower measured error than existing curated databases.
- LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents
  Context-ReAct enables agents to dynamically manage context via five atomic operations, and LongSeeker fine-tuned on 10k trajectories achieves 61.5% and 62.5% on BrowseComp benchmarks, outperforming prior agents.
- Towards Knowledgeable Deep Research: Framework and Benchmark
  The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
- TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models
  TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.
- Learning to Retrieve from Agent Trajectories
  Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.
- Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search
  EASP adds a Probe-then-Plan step so LLMs ground their search plans in actual retrieval snapshots and inventory, yielding higher recall and business metrics in sub-second production search.
- EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
  EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
- MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
  MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.
- Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
  MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.
- ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
  ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.
- CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
  CuSearch reallocates fixed training budget toward deeper-search rollouts in RLVR for agentic RAG, treating search depth as an annotation-free proxy for supervision density and reporting up to 11.8 exact-match gains ov...
- Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent
  AIDA is a reinforcement learning agent that explores complex business databases using a proprietary DSL and Pareto-guided reasoning to discover actionable insights autonomously.
- Mind DeepResearch Technical Report
  MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
- Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
  The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...
- Valley3: Scaling Omni Foundation Models for E-commerce
  Valley3 is an omni MLLM for e-commerce that uses a four-stage pre-training pipeline plus post-training for controllable reasoning and agentic search, outperforming baselines on e-commerce benchmarks while staying comp...
- MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
  MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.