Recognition: 3 theorem links · Lean Theorem
Tongyi DeepResearch Technical Report
Pith reviewed 2026-05-15 08:52 UTC · model grok-4.3
The pith
Tongyi DeepResearch, a sparsely activated 30.5-billion-parameter agentic model, achieves state-of-the-art performance on long-horizon deep research benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Tongyi DeepResearch is an agentic LLM with 30.5 billion total parameters and 3.3 billion activated per token. It is developed via an end-to-end framework of agentic mid-training and post-training enabled by a highly scalable, fully automatic data synthesis pipeline. This pipeline constructs customized environments for each stage to support stable interactions. The resulting model achieves state-of-the-art performance across agentic deep research benchmarks such as Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES, and xbench-DeepSearch-2510. The model, framework, and solutions are open-sourced.
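To make the sparsity concrete, the two parameter counts quoted above can be turned into an activation ratio. The snippet below is a small arithmetic sketch only; the function name `activated_fraction` is illustrative and does not come from the report.

```python
# Parameter counts as quoted in the report, in billions.
TOTAL_PARAMS_B = 30.5
ACTIVE_PARAMS_B = 3.3


def activated_fraction(total_b: float, active_b: float) -> float:
    """Return the share of parameters used for a single token."""
    if total_b <= 0 or not 0 <= active_b <= total_b:
        raise ValueError("expected 0 <= active_b <= total_b and total_b > 0")
    return active_b / total_b


fraction = activated_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
print(f"{fraction:.1%} of parameters are active per token")
```

Roughly one parameter in nine is used for any given token, which is the sense in which the model is described as sparsely activated.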
What carries the argument
The end-to-end training framework that integrates agentic mid-training and agentic post-training, powered by a fully automatic data synthesis pipeline without human annotation.
Load-bearing premise
The fully automatic data synthesis pipeline generates training data of sufficient quality and diversity to foster genuine long-horizon research capabilities in the model.
What would settle it
Demonstrating that the model requires human-annotated data to achieve comparable performance on new long-horizon benchmarks would falsify the claim that the automatic pipeline suffices.
Original abstract
We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Tongyi DeepResearch, an agentic LLM with 30.5 billion total parameters (3.3 billion activated per token) for long-horizon deep research tasks. It is trained via an end-to-end framework combining agentic mid-training and post-training, powered by a fully automatic data synthesis pipeline without human annotation. The model is claimed to achieve state-of-the-art results on benchmarks including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES, and xbench-DeepSearch-2510. The model, framework, and solutions are open-sourced.
Significance. If substantiated, the work could meaningfully advance scalable training of autonomous agents for complex information-seeking tasks by demonstrating an annotation-free pipeline. Open-sourcing the model and framework would support reproducibility and community follow-up. The current text, however, supplies no quantitative results, ablations, or methodological details, so the significance remains provisional.
major comments (2)
- [Abstract] Abstract: the central SOTA claim is stated without any benchmark scores, baseline comparisons, error bars, or evaluation protocol. This is load-bearing because the paper's primary contribution is empirical performance on the listed agentic research tasks.
- [Training Framework] Training Framework section: the 'highly scalable data synthesis pipeline' and 'customized environments' are described only at a high level; no concrete mechanisms for task generation, quality control, or long-horizon stability are given, leaving the key assumption that fully automatic synthesis suffices for genuine research agency untestable.
minor comments (1)
- [Abstract] Abstract: clarify whether the 30.5 B / 3.3 B activation pattern corresponds to a Mixture-of-Experts architecture and provide a brief architectural diagram or reference.
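For readers unfamiliar with the question raised in the minor comment, the following is a generic sketch of top-k expert routing, the usual way a Mixture-of-Experts layer activates only part of its parameters per token. It is an assumption for illustration: the text reviewed here does not confirm that Tongyi DeepResearch uses this design, and the names `top_k_route` and `moe_forward`, the toy experts, and the router logits are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x, experts, router_logits, k):
    """Evaluate only the selected experts and mix their outputs by gate weight."""
    gates = top_k_route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Toy experts (hypothetical): each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 1.5]

gates = top_k_route(router_logits, k=2)
output = moe_forward(1.0, experts, router_logits, k=2)
print(sorted(gates), output)
```

Only the experts selected by the router run for a given token, so the activated parameter count can be far below the total. Whether that is what produces the 30.5 B / 3.3 B split is exactly what the referee asks the authors to state.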
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We have revised the manuscript to address the concerns about the abstract and the level of detail in the training framework description.
Point-by-point responses
- Referee: [Abstract] Abstract: the central SOTA claim is stated without any benchmark scores, baseline comparisons, error bars, or evaluation protocol. This is load-bearing because the paper's primary contribution is empirical performance on the listed agentic research tasks.
  Authors: We agree that the abstract should include concrete quantitative support for the SOTA claims. In the revised manuscript we have added the key benchmark scores (e.g., exact accuracies on Humanity's Last Exam and BrowseComp), direct comparisons to the strongest published baselines, and a brief statement of the evaluation protocol. Full tables with error bars and statistical details remain in the Experiments section and are now explicitly referenced from the abstract.
  Revision: yes
- Referee: [Training Framework] Training Framework section: the 'highly scalable data synthesis pipeline' and 'customized environments' are described only at a high level; no concrete mechanisms for task generation, quality control, or long-horizon stability are given, leaving the key assumption that fully automatic synthesis suffices for genuine research agency untestable.
  Authors: We acknowledge that the original description was insufficiently concrete. We have expanded the Training Framework section with explicit mechanisms: recursive task decomposition for generation, multi-stage self-verification and filtering for quality control, and environment reinitialization combined with multi-turn reward shaping for long-horizon stability. Pseudocode and illustrative examples are now included so that the pipeline can be evaluated and reproduced.
  Revision: yes
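As a rough illustration of what "multi-stage self-verification and filtering" could look like in practice, the sketch below chains simple checks over candidate samples and records how many survive each stage. It is not the authors' pipeline: the stage names, the sample fields (`answer`, `recheck`, `steps`), and the function `run_filters` are all hypothetical.

```python
from typing import Callable

Sample = dict
Check = Callable[[Sample], bool]


def run_filters(samples: list[Sample], stages: list[tuple[str, Check]]):
    """Apply each check in order; return survivors and a per-stage survivor count."""
    kept = list(samples)
    report = []
    for name, check in stages:
        kept = [s for s in kept if check(s)]
        report.append((name, len(kept)))
    return kept, report


# Hypothetical stages: non-empty answer, at least two steps, and an
# independent re-derivation ("recheck") that agrees with the answer.
stages = [
    ("has_answer", lambda s: bool(s.get("answer"))),
    ("multi_step", lambda s: len(s.get("steps", [])) >= 2),
    ("self_consistent", lambda s: s.get("answer") == s.get("recheck")),
]

candidates = [
    {"answer": "Paris", "recheck": "Paris", "steps": ["search", "read"]},
    {"answer": "", "recheck": "", "steps": ["search", "read"]},
    {"answer": "42", "recheck": "41", "steps": ["search", "read", "verify"]},
    {"answer": "Oslo", "recheck": "Oslo", "steps": ["search"]},
]

kept, report = run_filters(candidates, stages)
for name, count in report:
    print(f"{name}: {count} samples remain")
```

Reporting the survivor count per stage is the kind of detail the referee asks for, since it shows how aggressively each check prunes the synthetic data.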
Circularity Check
No significant circularity
full rationale
The provided manuscript text consists entirely of high-level descriptive claims about model architecture, training stages, and benchmark results with no equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations. The central claims rest on empirical outcomes from an automatic data-synthesis pipeline and agentic training, which are presented as independent engineering results rather than any chain that reduces by construction to its own inputs. No self-definitional loops, renamed known results, or uniqueness theorems imported from prior author work appear. The work is therefore self-contained as a technical report.
Axiom & Free-Parameter Ledger
free parameters (1)
- Total and activated parameter counts (30.5B / 3.3B)
axioms (1)
- domain assumption: Fully automatic data synthesis without human annotation can produce training signals sufficient for stable long-horizon agentic reasoning.
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: Tongyi DeepResearch... featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across... Humanity’s Last Exam, BrowseComp...
- IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation
- IndisputableMonolith.Foundation.LedgerCanonicality · reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
  AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
- ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
  ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
- Learning Agentic Policy from Action Guidance
  ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.
- CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
  CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.
- Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
  A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
- Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent
  AIDA is the first end-to-end autonomous agent that combines a domain-specific language with Pareto-guided reinforcement learning to discover insights from complex business data.
- Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
  Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.
- Fine-Tuning Small Reasoning Models for Quantum Field Theory
  Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
- ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
  ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
- HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
  HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchm...
- LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents
  Context-ReAct enables agents to dynamically manage context via five atomic operations, and LongSeeker fine-tuned on 10k trajectories achieves 61.5% and 62.5% on BrowseComp benchmarks, outperforming prior agents.
- Towards Knowledgeable Deep Research: Framework and Benchmark
  The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
- TimelineReasoner: Advancing Timeline Summarization with Large Reasoning Models
  TimelineReasoner applies large reasoning models in a Global Cognition plus Detail Exploration loop to produce more accurate, complete, and coherent timelines from news than prior LLM-based methods.
- Learning to Retrieve from Agent Trajectories
  Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.
- Probe-then-Plan: Environment-Aware Planning for Industrial E-commerce Search
  EASP adds a Probe-then-Plan step so LLMs ground their search plans in actual retrieval snapshots and inventory, yielding higher recall and business metrics in sub-second production search.
- Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging
  MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.
- ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
  ViDR treats source figures as retrievable and verifiable evidence objects in multimodal deep research reports and introduces MMR Bench+ to measure improvements in visual integration and verifiability.
- CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
  CuSearch reallocates fixed training budget toward deeper-search rollouts in RLVR for agentic RAG, treating search depth as an annotation-free proxy for supervision density and reporting up to 11.8 exact-match gains ov...
- Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent
  AIDA is a reinforcement learning agent that explores complex business databases using a proprietary DSL and Pareto-guided reasoning to discover actionable insights autonomously.
- Mind DeepResearch Technical Report
  MindDR combines a Planning Agent, DeepSearch Agent, and Report Agent with SFT cold-start, Search-RL, Report-RL, and preference alignment to reach competitive scores on research benchmarks using 30B-scale models.
- Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
  The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...
- Valley3: Scaling Omni Foundation Models for E-commerce
  Valley3 is an omni MLLM for e-commerce that uses a four-stage pre-training pipeline plus post-training for controllable reasoning and agentic search, outperforming baselines on e-commerce benchmarks while staying comp...