Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies

Jiabao Zhuang; Jiahui Lin; Jingyi Deng; Kexin Tan; Long Ma; Maxm Pan; Mingqi Wu; Ming Zhang; Ning Luo; Qiyuan Peng


Deep research agents retrieve only 20.92 percent of expert-cited papers and produce taxonomies with high sibling overlap and structural imbalance.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.3

2026-05-21 15:10 UTC pith:73VCVUBW

Load-bearing objection: TaxoBench gives concrete numbers on agent retrieval and taxonomy gaps but the reference-free metrics need human baselines to pin down what counts as failure.

arxiv 2601.12369 v4 pith:73VCVUBW submitted 2026-01-18 cs.CL


Ming Zhang, Jiabao Zhuang, Wenqing Jing, Kexin Tan, Ziyu Kong, Jingyi Deng, Yujiong Shen, Yuhui Wang, Zhenghao Xiang, Qiyuan Peng, Yuhang Zhao, Ning Luo, Renzhe Zheng, Jiahui Lin, Mingqi Wu, Long Ma, Shihan Dou, Maxm Pan, Tao Gui, Qi Zhang, Xuanjing Huang


classification cs.CL

keywords deep research agents · taxonomy generation · paper retrieval · hierarchical organization · LLM benchmarks · synthesis evaluation · expert taxonomies · reference-free metrics

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates whether deep research agents and frontier LLMs can match human experts at retrieving essential papers and organizing them into hierarchical taxonomies for survey generation. It introduces TaxoBench, built from 72 expert-authored LLM surveys with their taxonomy trees and 3,815 mapped papers, to test retrieval via standard Recall/Precision/F1 and organization via new hierarchical measures. On the capability side, agents miss most expert-cited papers, and model taxonomies show 75.9 percent sibling overlap, 51.2 percent MECE violations, and 83.4 percent structural imbalance, the latter three detectable without any reference. Alignment tests show all 12 LLMs reach only 28-29 percent Semantic Path Similarity, below the 47-58 percent range from independent human annotators on identical paper sets. The evaluation separates reference-free capability shortfalls from reference-dependent alignment differences to isolate genuine failures.

Core claim

Evaluating 7 deep research agents and 12 frontier LLMs on TaxoBench reveals a dual bottleneck in research synthesis: retrieval recall tops out at 20.92 percent of expert-cited papers, while 1,000 model taxonomies exhibit 75.9 percent sibling overlap, 51.2 percent MECE violations, and 83.4 percent structural imbalance detectable without references; on alignment, all LLMs converge to 28-29 percent Semantic Path Similarity against expert trees, below the 47-58 percent achieved by three human-annotator groups on the same paper sets.

What carries the argument

TaxoBench benchmark using expert taxonomy trees for 72 LLM surveys, with leaf-level paper-to-category assignment and hierarchy-level evaluation via Unordered Semantic Tree Edit Distance and Semantic Path Similarity metrics.

Load-bearing premise

Expert-authored taxonomy trees provide a stable reference standard for judging how well models organize the same papers into hierarchies.

What would settle it

A deep research agent or LLM that retrieves over 40 percent of expert-cited papers and reaches Semantic Path Similarity above 40 percent on the same paper sets while matching human annotator variability ranges would falsify the dual-bottleneck finding.


If this is right

The strongest agent still retrieves only 20.92 percent of papers cited by experts in the surveys.
Model taxonomies show 75.9 percent sibling overlap, 51.2 percent MECE violations, and 83.4 percent structural imbalance on reference-free checks.
All 12 tested LLMs converge to 28-29 percent Semantic Path Similarity with expert trees.
Three independent human-annotator groups achieve 47-58 percent Semantic Path Similarity on identical paper sets.
Partitioning results into capability-based and alignment-based groups separates genuine failure from valid disagreement with one expert's choices.
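
The reference-free checks named above can be illustrated with a small sketch. This review does not say how the paper computes sibling overlap or structural imbalance, so everything here is an assumption for illustration: the tree layout (nested dicts with `name`, `subtopics`, and `papers` keys), the token-Jaccard proxy for overlap, the size-ratio proxy for imbalance, both thresholds, and the reading of each percentage as a share of flagged taxonomies.

```python
from itertools import combinations


def paper_count(node: dict) -> int:
    """Number of papers under a node (leaves hold 'papers', internal nodes hold 'subtopics')."""
    if node.get("subtopics"):
        return sum(paper_count(child) for child in node["subtopics"])
    return len(node.get("papers", []))


def name_jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two category names (a crude lexical proxy)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def has_sibling_overlap(node: dict, threshold: float = 0.5) -> bool:
    """True if any two siblings anywhere in the tree have name similarity >= threshold."""
    children = node.get("subtopics", [])
    if any(name_jaccard(x["name"], y["name"]) >= threshold
           for x, y in combinations(children, 2)):
        return True
    return any(has_sibling_overlap(child, threshold) for child in children)


def is_imbalanced(node: dict, max_ratio: float = 5.0) -> bool:
    """True if, under any parent, the largest child holds more than max_ratio times
    the papers of the smallest child (an empty child also counts as imbalance)."""
    children = node.get("subtopics", [])
    if len(children) >= 2:
        sizes = [paper_count(c) for c in children]
        if min(sizes) == 0 or max(sizes) / min(sizes) > max_ratio:
            return True
    return any(is_imbalanced(c, max_ratio) for c in children)


def flag_rate(trees: list[dict], check) -> float:
    """Share of trees for which a check fires (hypothetical aggregation rule)."""
    return sum(1 for t in trees if check(t)) / len(trees) if trees else 0.0
```

Because neither check consults an expert tree, the same functions could be run unchanged on human-built taxonomies, which is the calibration the referee report asks for.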

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Separate advances in retrieval systems and in hierarchical reasoning may be needed, since the two shortfalls appear independent in the results.
Reference-free metrics could serve as diagnostic tools to improve model taxonomies before any expert comparison is introduced.
The consistent low alignment across all LLMs points to a shared limitation in organizing knowledge into balanced, non-overlapping hierarchies.
Extending the benchmark to non-LLM scientific domains would test whether the retrieval and organization gaps generalize beyond this field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

TaxoBench gives concrete numbers on agent retrieval and taxonomy gaps but the reference-free metrics need human baselines to pin down what counts as failure.


The main thing to know is that this paper builds TaxoBench from 72 expert LLM surveys and shows agents top out at 20.92% recall on the key papers while model-built taxonomies hit 75.9% sibling overlap, 51.2% MECE violations, and 83.4% imbalance. On the alignment side the models sit at 28-29% Sem-Path against 47-58% for human annotators on the same sets. That dual-bottleneck picture is the useful output.

What the work actually adds is the benchmark itself plus the two new tree metrics, US-TED/US-NTED and Sem-Path, which try to capture hierarchy instead of just flat clustering or citation counts. They also split results into reference-free capability checks and reference-dependent alignment checks, and they release the data. That setup is cleaner than most agent papers that just score final write-ups.

The soft spot is exactly the one the stress-test flagged. The capability claims rest on those reference-free percentages as evidence of model shortcomings, yet the paper does not report the same three metrics on the human taxonomies it already collected for the Sem-Path comparison. Without that calibration it is hard to tell whether the model numbers are worse than ordinary expert variation or just different. The partition into capability and alignment is a good idea, but it leaves the capability half unanchored.

Readers who work on literature-synthesis agents or hierarchical evaluation will find the numbers and the public repo worth looking at. The paper shows clear thinking about what to measure and why, so it deserves a serious referee. I would send it for review and ask the authors to add the missing human baselines on the reference-free metrics before final acceptance.

Referee Report

2 major / 2 minor

Summary. The paper introduces TaxoBench, a benchmark of 72 expert-authored taxonomy trees for highly cited LLM surveys with 3,815 mapped papers. It evaluates 7 deep research agents and 12 frontier LLMs on retrieval (Recall/Precision/F1) and organization using new metrics Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). Results are partitioned into reference-free capability metrics (e.g., 75.9% sibling overlap, 51.2% MECE violations, 83.4% structural imbalance in 1,000 model taxonomies; best retrieval 20.92%) and reference-dependent alignment metrics (LLMs at 28-29% Sem-Path vs. 47-58% for human annotators). Two modes are supported: end-to-end Deep Research and Bottom-Up organization-only.
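
For readers unfamiliar with the retrieval side, set-based Recall/Precision/F1 against the expert-cited papers can be sketched as follows. The title-normalization matching rule and the names `normalize` and `retrieval_scores` are illustrative assumptions; the review does not state how the paper matches retrieved papers to expert citations.

```python
def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so trivially different titles match."""
    return " ".join(title.lower().split())


def retrieval_scores(retrieved: list[str], expert_cited: list[str]) -> dict[str, float]:
    """Set-based precision, recall, and F1 of retrieved papers against the expert-cited set."""
    got = {normalize(t) for t in retrieved}
    gold = {normalize(t) for t in expert_cited}
    hits = len(got & gold)
    precision = hits / len(got) if got else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Under this reading, the reported 20.92% is the recall term: the fraction of expert-cited papers that the best agent's output contains.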

Significance. If validated, the work is significant for exposing concrete bottlenecks in automated research synthesis, separating retrieval failures from organizational ones and capability from alignment issues. Strengths include the public benchmark release, the explicit partitioning of reference-free vs. reference-dependent results, and the use of expert taxonomies to ground evaluation beyond standard clustering metrics.

major comments (2)

[Results on reference-free metrics and human baselines] The capability-side claim of a synthesis bottleneck rests on the reference-free metrics (75.9% sibling overlap, 51.2% MECE violations, 83.4% structural imbalance) computed over 1,000 model taxonomies. These are presented as evidence of model shortcomings without any reference, yet the same three metrics are not reported for the three independent human-annotator groups whose taxonomies achieve 47-58% Sem-Path. This leaves the rates uncalibrated and weakens the interpretation that they indicate genuine failure rather than typical human variation in taxonomy construction. See the results section on reference-free capability metrics and the human comparison setup.
[Metric definitions (US-TED, Sem-Path)] The definitions of the new metrics US-TED/US-NTED and Sem-Path are central to both the hierarchy-level evaluation and the alignment claims. The manuscript should include explicit formulas or pseudocode for how unordered semantic tree edits are computed and how semantic path similarity aggregates over the expert trees, including any handling of partial matches or depth weighting, to allow reproduction and to confirm they do not inadvertently favor certain structures.
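
To make concrete what the second comment asks the authors to pin down, here is one hypothetical shape a path-based alignment score could take. It is not the paper's Sem-Path: the tree layout (`name`, `subtopics`, `papers`), the token-Jaccard stand-in for semantic similarity, the depth-aligned averaging, and the zero score for unplaced papers are all assumptions chosen only to show where partial matches and depth handling enter.

```python
def paper_paths(node: dict, prefix: tuple[str, ...] = ()) -> dict[str, tuple[str, ...]]:
    """Map each paper title to its root-to-leaf path of category names."""
    path = prefix + (node["name"],)
    paths: dict[str, tuple[str, ...]] = {}
    for child in node.get("subtopics", []):
        paths.update(paper_paths(child, path))
    for title in node.get("papers", []):
        paths[title] = path
    return paths


def label_sim(a: str, b: str) -> float:
    """Stand-in for semantic similarity: token Jaccard. An embedding model could replace it."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def path_sim(p: tuple[str, ...], q: tuple[str, ...]) -> float:
    """Depth-aligned average label similarity; levels missing from the shorter path count as zero."""
    depth = max(len(p), len(q))
    return sum(label_sim(a, b) for a, b in zip(p, q)) / depth if depth else 0.0


def mean_path_similarity(model_tree: dict, expert_tree: dict) -> float:
    """Average path similarity over papers in the expert tree; unplaced papers score zero."""
    model, expert = paper_paths(model_tree), paper_paths(expert_tree)
    if not expert:
        return 0.0
    total = sum(path_sim(model[t], expert[t]) if t in model else 0.0 for t in expert)
    return total / len(expert)
```

Each design choice in this sketch (uniform versus depth-weighted levels, how mismatched depths are penalized, how missing papers are scored) changes the number, which is why explicit formulas matter for reproducing the 28-29% versus 47-58% comparison.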

minor comments (2)

[Abstract and evaluation setup] Clarify in the abstract and methods whether the 1,000 model taxonomies are generated from the same 72 topics or a sampled subset, and report variance or confidence intervals for the reported percentages.
[Evaluation modes] The Bottom-Up mode is useful for isolating organization, but the manuscript could add a short discussion of how results differ between the two modes to strengthen the dual-bottleneck narrative.
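
The first minor comment asks for variance or confidence intervals on the reported percentages. A minimal sketch using the Wilson score interval is given below, assuming each percentage is a simple proportion of flagged taxonomies out of 1,000 and that taxonomies are independent draws; both are assumptions, and the function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Example: 759 flagged out of 1,000, i.e. the 75.9% rate read as a simple proportion.
low, high = wilson_interval(759, 1000)
```

If the 1,000 taxonomies cluster by topic or by model, they are not independent, and an interval computed this way would be too narrow; a bootstrap over topics would be a more cautious choice in that case.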

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important opportunities to strengthen the calibration of our reference-free metrics and the reproducibility of our proposed evaluation measures. We address each major comment below and will revise the manuscript to incorporate the suggested changes.


Referee: [Results on reference-free metrics and human baselines] The capability-side claim of a synthesis bottleneck rests on the reference-free metrics (75.9% sibling overlap, 51.2% MECE violations, 83.4% structural imbalance) computed over 1,000 model taxonomies. These are presented as evidence of model shortcomings without any reference, yet the same three metrics are not reported for the three independent human-annotator groups whose taxonomies achieve 47-58% Sem-Path. This leaves the rates uncalibrated and weakens the interpretation that they indicate genuine failure rather than typical human variation in taxonomy construction. See the results section on reference-free capability metrics and the human comparison setup.

Authors: The reference-free metrics are intentionally defined without reference to any expert taxonomy to surface intrinsic structural deficiencies (e.g., excessive sibling overlap or MECE violations) that can be observed directly in model outputs. We maintain that these quantities still provide useful evidence of capability limitations. Nevertheless, we agree that reporting the identical metrics on the human-annotator taxonomies would improve calibration and allow readers to assess whether the observed rates exceed typical human variation. In the revised manuscript we will compute and present sibling overlap, MECE violation rates, and structural imbalance for the three human groups alongside the model results. revision: yes
Referee: [Metric definitions (US-TED, Sem-Path)] The definitions of the new metrics US-TED/US-NTED and Sem-Path are central to both the hierarchy-level evaluation and the alignment claims. The manuscript should include explicit formulas or pseudocode for how unordered semantic tree edits are computed and how semantic path similarity aggregates over the expert trees, including any handling of partial matches or depth weighting, to allow reproduction and to confirm they do not inadvertently favor certain structures.

Authors: We concur that explicit, reproducible definitions are required. The current manuscript introduces US-TED/US-NTED and Sem-Path at a high level but does not supply the full algorithmic details. In the revision we will add a dedicated subsection containing the mathematical formulations, pseudocode for the unordered semantic tree-edit procedure, the aggregation rule for semantic path similarity, and explicit statements on the treatment of partial matches and depth weighting. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper introduces TaxoBench with expert-authored taxonomy trees as external reference and defines new metrics (US-TED/US-NTED, Sem-Path) plus reference-free ones (sibling overlap, MECE violations, structural imbalance) computed directly from model outputs. Results are partitioned into capability (reference-free) and alignment (reference-dependent) groups, with explicit human-annotator comparisons on Sem-Path. No equations or steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the evaluation relies on independent external benchmarks and human baselines rather than tautological renaming or imported uniqueness from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that expert taxonomies provide a meaningful gold standard for organization quality and that the introduced metrics validly capture synthesis capability independent of any single reference.

axioms (1)

domain assumption: Expert-authored taxonomy trees from LLM surveys represent high-quality hierarchical organization of the literature.
The benchmark uses these trees as the reference for alignment-based evaluation and for defining structural issues such as MECE violations.

invented entities (2)

Unordered Semantic Tree Edit Distance (US-TED / US-NTED): no independent evidence
purpose: Measure hierarchy-level differences between model-generated and expert taxonomies
New metric introduced to evaluate structural organization beyond leaf-level assignment.

Semantic Path Similarity (Sem-Path): no independent evidence
purpose: Quantify alignment between model and expert category paths for papers
New metric introduced to compare organization quality in a reference-dependent way.

pith-pipeline@v0.9.0 · 5920 in / 1502 out tokens · 57535 ms · 2026-05-21T15:10:22.966519+00:00 · methodology


Original abstract

Deep Research Agents increasingly automate survey generation, yet whether they match human experts at retrieving essential papers and organizing them into expert-like taxonomies remains unclear. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics ignore hierarchical structure. We introduce TaxoBench, a benchmark of 72 highly cited LLM surveys with expert-authored taxonomy trees and 3,815 papers mapped to paper categories. TaxoBench evaluates (1) retrieval via Recall/Precision/F1, and (2) organization at a leaf level (paper-to-category assignment) and a hierarchy level via two new metrics: Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path). Two modes are supported: Deep Research (topic-only, end-to-end) and Bottom-Up (expert paper set provided, organization-only). To distinguish disagreement with a single expert reference from genuine model failure, we explicitly partition findings into capability-based (reference-free) and alignment-based (reference-dependent) groups. Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck. On the capability side, the best agent retrieves only 20.92% of expert-cited papers, and 1,000 model taxonomies show 75.9% sibling overlap, 51.2% MECE violations, and 83.4% structural imbalance, all detectable without any reference. On the alignment side, all 12 LLMs converge to Sem-Path 28-29%, well below 47-58% achieved by three independent human-annotator groups on the same paper sets. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench.


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AI Can Learn Scientific Taste
cs.CL 2026-03 conditional novelty 6.0

Reinforcement learning on citation-preference pairs teaches a model to predict which papers will be cited more and to propose ideas that LLM judges rate as likely to be cited more—but "taste" here means citation impact.
LLM-Oriented Information Retrieval: A Denoising-First Perspective
cs.IR 2026-05 unverdicted novelty 5.0

Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...
LLM-Oriented Information Retrieval: A Denoising-First Perspective
cs.IR 2026-05 unverdicted novelty 4.0

Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.
WisPaper: Your AI Scholar Search Engine
cs.IR 2025-12 unverdicted novelty 3.0

WisPaper integrates semantic search with agent-based validation, library organization, and personalized AI feeds into a closed-loop system that improves academic paper discovery and long-term awareness.
