arXiv: 2510.24701 · v2 · submitted 2025-10-28 · 💻 cs.CL · cs.AI · cs.IR · cs.LG · cs.MA

Recognition: 3 theorem links

Tongyi DeepResearch Technical Report

Tongyi DeepResearch Team: Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Minpeng Liao, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang

Pith reviewed 2026-05-15 08:52 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG · cs.MA

keywords: agentic model · large language model · deep research · information seeking · automatic data synthesis · long-horizon tasks · state-of-the-art · sparse activation

The pith

Tongyi DeepResearch, a sparsely activated 30.5-billion-parameter agentic model, achieves state-of-the-art performance on long-horizon deep research benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Tongyi DeepResearch as an agentic large language model built for complex, extended information-seeking tasks. It trains this model end-to-end using agentic mid-training and post-training stages supported by a fully automatic data synthesis pipeline that requires no human annotation. Customized environments ensure consistent interactions during training. This approach yields leading results on benchmarks including Humanity's Last Exam, BrowseComp, and others. A sympathetic reader would care because it points toward scalable methods for creating AI systems capable of independent research without heavy reliance on manual data labeling.

Core claim

Tongyi DeepResearch is an agentic LLM with 30.5 billion total parameters and 3.3 billion activated per token. It is developed via an end-to-end framework of agentic mid-training and post-training enabled by a highly scalable, fully automatic data synthesis pipeline. This pipeline constructs customized environments for each stage to support stable interactions. The resulting model achieves state-of-the-art performance across agentic deep research benchmarks such as Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES, and xbench-DeepSearch-2510. The model, framework, and solutions are open-sourced.
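As a quick check on the sparsity figures above, the share of parameters activated per token follows directly from the two stated counts. The sketch below only does that arithmetic; the function name is illustrative, and nothing here describes how the model selects which parameters to activate.

```python
# Parameter counts as stated in the report, in billions.
TOTAL_PARAMS_B = 30.5
ACTIVE_PARAMS_B = 3.3


def active_fraction(active_b: float, total_b: float) -> float:
    """Return the share of parameters used per token (hypothetical helper)."""
    if total_b <= 0 or not 0 <= active_b <= total_b:
        raise ValueError("expected 0 <= active_b <= total_b and total_b > 0")
    return active_b / total_b


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"{fraction:.1%} of parameters active per token")  # prints "10.8% ..."
```

Roughly one parameter in nine is used for any given token, which is the sense in which the model is sparsely activated.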

What carries the argument

The end-to-end training framework that integrates agentic mid-training and agentic post-training, powered by a fully automatic data synthesis pipeline without human annotation.

Load-bearing premise

The fully automatic data synthesis pipeline generates training data of sufficient quality and diversity to foster genuine long-horizon research capabilities in the model.

What would settle it

Demonstrating that the model requires human-annotated data to achieve comparable performance on new long-horizon benchmarks would falsify the claim that the automatic pipeline suffices.

Original abstract

We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Tongyi DeepResearch is an open-sourced agentic model using a fully automatic data pipeline for long-horizon research tasks, with claimed SOTA results but almost no supporting details available.

The main point is that this technical report presents Tongyi DeepResearch as an agentic model for long-horizon research tasks, trained end-to-end with a fully automatic data synthesis pipeline and claiming state-of-the-art results on benchmarks like Humanity's Last Exam and various web search and browse tasks. The model uses a sparse activation setup with 30.5 billion total parameters but only 3.3 billion active per token, and the authors release the model, framework, and solutions.

What stands out as new is the complete integration of agentic mid-training and post-training supported by a scalable, annotation-free data pipeline that builds customized environments for consistent interactions across stages. This addresses practical issues in training agents for complex information-seeking without heavy human involvement. The open-sourcing is a strong move that gives the community something concrete to work with and test.

The soft spots are clear from the text. There are no ablation studies, no specific benchmark scores or methodology details, no error analysis, and no explanation of how the data pipeline ensures quality or handles diversity for genuine research agency. The performance claims are stated but not supported with evidence here, so it's impossible to judge if the approach delivers real gains or if other factors are at play. The assumption that automatic synthesis alone suffices for long-horizon tasks is central but untested in the provided description.

This paper is for researchers in LLM agents and synthetic data generation who want to explore scalable training for autonomous systems. Readers interested in practical agent frameworks will find the high-level design and release useful for their own experiments.

It deserves serious peer review. The claims are specific enough and the release substantial, so referees can request the missing experimental details and verify the results. I would recommend sending it out rather than desk rejecting, as the topic and the open-source aspect make it worth a closer look.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Tongyi DeepResearch, an agentic LLM with 30.5 billion total parameters (3.3 billion activated per token) for long-horizon deep research tasks. It is trained via an end-to-end framework combining agentic mid-training and post-training, powered by a fully automatic data synthesis pipeline without human annotation. The model is claimed to achieve state-of-the-art results on benchmarks including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES, and xbench-DeepSearch-2510. The model, framework, and solutions are open-sourced.

Significance. If substantiated, the work could meaningfully advance scalable training of autonomous agents for complex information-seeking tasks by demonstrating an annotation-free pipeline. Open-sourcing the model and framework would support reproducibility and community follow-up. The current text, however, supplies no quantitative results, ablations, or methodological details, so the significance remains provisional.

major comments (2)

[Abstract] The central SOTA claim is stated without any benchmark scores, baseline comparisons, error bars, or evaluation protocol. This is load-bearing because the paper's primary contribution is empirical performance on the listed agentic research tasks.
[Training Framework] The 'highly scalable data synthesis pipeline' and 'customized environments' are described only at a high level; no concrete mechanisms for task generation, quality control, or long-horizon stability are given, leaving the key assumption that fully automatic synthesis suffices for genuine research agency untestable.
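To make the first comment's request for error bars concrete, here is a minimal sketch of one common way to attach an interval to a benchmark accuracy: the Wilson score interval for a binomial proportion. The counts used are purely illustrative and are not results from the paper; the function name is an assumption, and the report may use a different protocol entirely.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an accuracy of correct/total.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("expected 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Illustrative counts only: 45 correct answers out of 100 questions.
low, high = wilson_interval(correct=45, total=100)
print(f"accuracy 0.450, 95% interval [{low:.3f}, {high:.3f}]")
```

On a benchmark with only a few hundred questions, intervals of this width can exceed the gap between neighboring systems, which is why the referee treats their absence as a major issue.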

minor comments (1)

[Abstract] Clarify whether the 30.5 B / 3.3 B activation pattern corresponds to a Mixture-of-Experts architecture and provide a brief architectural diagram or reference.
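For readers unfamiliar with why the referee asks about Mixture-of-Experts, the sketch below shows how top-k expert routing produces a gap between total and activated parameter counts. It is a generic illustration, not the report's architecture: the expert count, k, and parameter sizes are hypothetical toy values, and the reviewed text does not confirm that the model uses this design.

```python
def route_top_k(router_scores: list[float], k: int) -> list[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    if not 1 <= k <= len(router_scores):
        raise ValueError("k must be between 1 and the number of experts")
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])


def total_params(shared: float, per_expert: float, num_experts: int) -> float:
    """All parameters stored in the layer: shared weights plus every expert."""
    return shared + num_experts * per_expert


def active_params(shared: float, per_expert: float, k: int) -> float:
    """Parameters actually used for one token: shared weights plus k experts."""
    return shared + k * per_expert


# Hypothetical toy configuration (units arbitrary), NOT the model's real one.
SHARED, PER_EXPERT, NUM_EXPERTS, TOP_K = 1.0, 0.5, 8, 2

chosen = route_top_k([0.1, 0.9, 0.3, 0.7, 0.2, 0.0, 0.4, 0.6], TOP_K)
print("experts used for this token:", chosen)
print("total:", total_params(SHARED, PER_EXPERT, NUM_EXPERTS))
print("active:", active_params(SHARED, PER_EXPERT, TOP_K))
```

In this toy setup every token touches the shared weights but only two of eight experts, so storage cost and per-token compute diverge in the same way the 30.5 B / 3.3 B figures do.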

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We have revised the manuscript to address the concerns about the abstract and the level of detail in the training framework description.

Referee: [Abstract] The central SOTA claim is stated without any benchmark scores, baseline comparisons, error bars, or evaluation protocol. This is load-bearing because the paper's primary contribution is empirical performance on the listed agentic research tasks.

Authors: We agree that the abstract should include concrete quantitative support for the SOTA claims. In the revised manuscript we have added the key benchmark scores (e.g., exact accuracies on Humanity's Last Exam and BrowseComp), direct comparisons to the strongest published baselines, and a brief statement of the evaluation protocol. Full tables with error bars and statistical details remain in the Experiments section and are now explicitly referenced from the abstract.
Revision: yes

Referee: [Training Framework] The 'highly scalable data synthesis pipeline' and 'customized environments' are described only at a high level; no concrete mechanisms for task generation, quality control, or long-horizon stability are given, leaving the key assumption that fully automatic synthesis suffices for genuine research agency untestable.

Authors: We acknowledge that the original description was insufficiently concrete. We have expanded the Training Framework section with explicit mechanisms: recursive task decomposition for generation, multi-stage self-verification and filtering for quality control, and environment reinitialization combined with multi-turn reward shaping for long-horizon stability. Pseudocode and illustrative examples are now included so that the pipeline can be evaluated and reproduced.
Revision: yes
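The simulated rebuttal names multi-stage filtering as the quality-control mechanism without showing what that could look like. The sketch below is one hypothetical shape for such a filter: each synthesized question passes through an ordered list of checks, and the first failing check is recorded. Every name, field, and threshold here is an assumption made for illustration; none of it is taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    """A synthesized training item (hypothetical schema)."""
    question: str
    answer: str
    num_steps: int  # tool calls the reference solution needed


Check = Callable[[Candidate], bool]


def has_answer(c: Candidate) -> bool:
    return bool(c.answer.strip())


def is_long_horizon(c: Candidate) -> bool:
    return c.num_steps >= 3  # arbitrary illustrative threshold


def answer_not_leaked(c: Candidate) -> bool:
    return c.answer.lower() not in c.question.lower()


def filter_candidates(
    candidates: list[Candidate], checks: list[tuple[str, Check]]
) -> tuple[list[Candidate], dict[str, int]]:
    """Keep items passing every check; count rejections by first failing stage."""
    kept: list[Candidate] = []
    rejected: dict[str, int] = {}
    for cand in candidates:
        for name, check in checks:
            if not check(cand):
                rejected[name] = rejected.get(name, 0) + 1
                break
        else:  # no check failed
            kept.append(cand)
    return kept, rejected


STAGES = [("has_answer", has_answer),
          ("is_long_horizon", is_long_horizon),
          ("answer_not_leaked", answer_not_leaked)]

pool = [
    Candidate("Which river flows through the city where X was founded?", "Danube", 4),
    Candidate("What is 2+2?", "4", 1),
    Candidate("Name the capital of France, Paris.", "Paris", 5),
    Candidate("Unanswered question", "  ", 6),
]
kept, rejected = filter_candidates(pool, STAGES)
print(len(kept), "kept;", rejected)
```

Recording per-stage rejection counts is the kind of detail a referee could use to judge whether the filter is doing real work or passing nearly everything through.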

Circularity Check

0 steps flagged

No significant circularity

The provided manuscript text consists entirely of high-level descriptive claims about model architecture, training stages, and benchmark results with no equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations. The central claims rest on empirical outcomes from an automatic data-synthesis pipeline and agentic training, which are presented as independent engineering results rather than any chain that reduces by construction to its own inputs. No self-definitional loops, renamed known results, or uniqueness theorems imported from prior author work appear. The work is therefore self-contained as a technical report.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard assumptions from LLM scaling and agent training literature plus the domain assumption that automatic synthesis suffices for high-quality agentic data; no new physical entities or ad-hoc constants are introduced beyond typical hyperparameter choices.

free parameters (1)

Total and activated parameter counts (30.5B / 3.3B)
Architecture scale chosen to balance capability and efficiency; specific values are presented as design decisions.

axioms (1)

Domain assumption: Fully automatic data synthesis without human annotation can produce training signals sufficient for stable long-horizon agentic reasoning.
Invoked as the foundation for all training stages in the abstract.

pith-pipeline@v0.9.0 · 5697 in / 1278 out tokens · 38682 ms · 2026-05-15T08:52:19.962797+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel · unclear
Paper passage: Tongyi DeepResearch... featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across... Humanity’s Last Exam, BrowseComp...

IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear
Paper passage: We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation

IndisputableMonolith.Foundation.LedgerCanonicality · reality_from_one_distinction · unclear
Paper passage: agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
cs.AI · 2026-04 · accept · novelty 8.0
AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
cs.AI · 2026-05 · conditional · novelty 7.0
ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.

Learning Agentic Policy from Action Guidance
cs.CL · 2026-05 · unverdicted · novelty 7.0
ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
cs.AI · 2026-05 · unverdicted · novelty 7.0
CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
cs.CL · 2026-05 · unverdicted · novelty 7.0
A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.

Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent
cs.AI · 2026-05 · unverdicted · novelty 7.0
AIDA is the first end-to-end autonomous agent that combines a domain-specific language with Pareto-guided reinforcement learning to discover insights from complex business data.

Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
cs.LG · 2026-05 · conditional · novelty 7.0
Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.

Fine-Tuning Small Reasoning Models for Quantum Field Theory
cs.LG · 2026-04 · unverdicted · novelty 7.0
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
cs.AI · 2026-05 · unverdicted · novelty 6.0
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
cs.LG · 2026-05 · unverdicted · novelty 6.0
HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...

LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents
cs.AI · 2026-05 · unverdicted · novelty 6.0
Context-ReAct enables agents to dynamically manage context via five atomic operations, and LongSeeker fine-tuned on 10k trajectories achieves 61.5% and 62.5% on BrowseComp benchmarks, outperforming prior agents.

Towards Knowledgeable Deep Research: Framework and Benchmark
cs.AI · 2026-04 · unverdicted · novelty 6.0
The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.

TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models
cs.CL · 2026-04 · unverdicted · novelty 6.0
TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.

Learning to Retrieve from Agent Trajectories
cs.IR · 2026-03 · conditional · novelty 6.0
Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.

Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search
cs.AI · 2026-03 · conditional · novelty 6.0
EASP adds a Probe-then-Plan step so LLMs ground their search plans in actual retrieval snapshots and inventory, yielding higher recall and business metrics in sub-second production search.

Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
cs.AI · 2026-05 · unverdicted · novelty 5.0
MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.

ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
cs.CV · 2026-05 · unverdicted · novelty 5.0
ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
cs.AI · 2026-05 · unverdicted · novelty 5.0
CuSearch reallocates fixed training budget toward deeper-search rollouts in RLVR for agentic RAG, treating search depth as an annotation-free proxy for supervision density and reporting up to 11.8 exact-match gains ov...

Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent
cs.AI · 2026-05 · unverdicted · novelty 5.0
AIDA is a reinforcement learning agent that explores complex business databases using a proprietary DSL and Pareto-guided reasoning to discover actionable insights autonomously.

Mind DeepResearch Technical Report
cs.AI · 2026-04 · unverdicted · novelty 5.0
MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
cs.LG · 2026-04 · unverdicted · novelty 5.0
The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

Valley3: Scaling Omni Foundation Models for E-commerce
cs.AI · 2026-05 · unverdicted · novelty 4.0
Valley3 is an omni MLLM for e-commerce that uses a four-stage pre-training pipeline plus post-training for controllable reasoning and agentic search, outperforming baselines on e-commerce benchmarks while staying comp...