pith. machine review for the scientific record.

arxiv: 2601.12369 · v3 · submitted 2026-01-18 · 💻 cs.CL

Recognition: 2 theorem links


Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies

Authors on Pith no claims yet

Pith reviewed 2026-05-16 13:28 UTC · model grok-4.3

classification 💻 cs.CL
keywords deep research agents · TaxoBench · LLM surveys · taxonomy evaluation · paper retrieval · hierarchical organization · MECE violations · semantic path similarity

The pith

Deep research agents retrieve only about 21 percent of expert-cited papers and produce taxonomies that align with expert structures far less closely than human annotators' do.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TaxoBench to test whether deep research agents can match human experts at retrieving essential papers on a topic and organizing them into coherent taxonomies. It evaluates seven agents and twelve frontier LLMs across seventy-two expert surveys containing thousands of papers, using retrieval scores plus new metrics that measure hierarchical structure without relying solely on one reference. The results identify two separate shortfalls: agents miss most of the papers experts cite, and the taxonomies they produce show high sibling overlap, violations of mutual exclusivity, and imbalance even when checked without any expert reference. Human annotators reach substantially higher alignment scores on identical paper sets. This matters because automated survey generation is expanding, yet these gaps indicate that current systems still require human correction to achieve expert-quality synthesis.

Core claim

Evaluating seven deep research agents and twelve frontier LLMs on TaxoBench reveals a dual bottleneck. On the capability side, the best agent retrieves only 20.92 percent of expert-cited papers, and one thousand model taxonomies exhibit 75.9 percent sibling overlap, 51.2 percent MECE violations, and 83.4 percent structural imbalance detectable without any reference taxonomy. On the alignment side, all twelve LLMs converge to Sem-Path scores of 28-29 percent, well below the 47-58 percent range achieved by three independent human-annotator groups on the same paper sets.
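The capability-side retrieval numbers reduce to plain set overlap between the papers an agent returns and the papers the expert survey cites. A minimal sketch, assuming exact-match paper identifiers (TaxoBench's actual title normalization and deduplication rules are not specified here):

```python
# Hedged sketch: retrieval Recall/Precision/F1 as set overlap between
# an agent's returned papers and the expert survey's citation list.
# Exact-match identifiers are an assumption, not the benchmark's rule.

def retrieval_scores(retrieved: set, expert_cited: set) -> dict:
    hits = len(retrieved & expert_cited)
    recall = hits / len(expert_cited) if expert_cited else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"recall": recall, "precision": precision, "f1": f1}
```

On this scale, the best agent's 20.92 percent recall means roughly one expert-cited paper in five is recovered when only the topic is given.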

What carries the argument

TaxoBench benchmark pairing expert-authored taxonomy trees with 3,815 mapped papers, evaluated via retrieval metrics plus new Unordered Semantic Tree Edit Distance (US-TED/US-NTED) and Semantic Path Similarity (Sem-Path) for leaf assignments and hierarchical organization.
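The abstract does not spell out how Sem-Path is computed. One plausible reading, sketched here under explicit assumptions (the embedding function, the " > " path serialization, and greedy one-to-one matching are all stand-ins, not the authors' definitions), scores two taxonomies by matching their root-to-leaf paths in embedding space:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sem_path(paths_a, paths_b, embed):
    """Serialize each root-to-leaf path, embed it, greedily match paths
    one-to-one by similarity, and average over the larger taxonomy so
    unmatched paths count as zero."""
    sims = sorted(
        ((cosine(embed(" > ".join(p)), embed(" > ".join(q))), i, j)
         for i, p in enumerate(paths_a)
         for j, q in enumerate(paths_b)),
        reverse=True,
    )
    used_a, used_b, total = set(), set(), 0.0
    for s, i, j in sims:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += s
    n = max(len(paths_a), len(paths_b))
    return total / n if n else 0.0
```

Under this reading, identical taxonomies score 1.0, and a taxonomy missing half its expert paths is capped at 0.5 regardless of how well the remaining paths match.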

If this is right

  • Retrieval remains the primary limit even for the strongest agents when given only the topic.
  • Model taxonomies display consistent structural defects that can be measured without expert references.
  • All tested LLMs plateau at similar but low alignment levels with expert structures.
  • Both end-to-end deep research and bottom-up organization tasks expose the same shortfalls.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dedicated training on hierarchical construction may be needed beyond standard language modeling.
  • Reference-free checks could enable ongoing evaluation in fields that lack expert taxonomies.
  • Closing the gap may require combining improved retrieval with explicit synthesis objectives.
  • The benchmark could be applied to other scientific domains to test whether the synthesis shortfall is general.

Load-bearing premise

Expert-authored taxonomies provide a stable gold standard, and the new metrics validly quantify synthesis quality independent of any single reference.

What would settle it

An agent that retrieves more than 40 percent of expert-cited papers on TaxoBench while producing taxonomies with Sem-Path alignment above 45 percent and under 20 percent sibling overlap would falsify the reported dual bottleneck.

read the original abstract

Deep Research Agents increasingly automate survey generation, yet whether they match human experts at retrieving essential papers and organizing them into expert-like taxonomies remains unclear. Existing benchmarks emphasize writing quality or citation correctness, while standard clustering metrics ignore hierarchical structure. We introduce TaxoBench, a benchmark of 72 highly-cited LLM surveys with expert-authored taxonomy trees and 3,815 papers mapped to paper categories. TaxoBench evaluates (1) retrieval via Recall/Precision/F1, and (2) organization at a leaf level (paper-to-category assignment) and a hierarchy level via novel metrics, namely Unordered Semantic Tree Edit Distance US-TED/US-NTED and Semantic Path Similarity Sem-Path. Two modes are supported: Deep Research (topic-only, end-to-end) and Bottom-Up (expert paper set provided, organization-only). To distinguish disagreement with a single expert reference from genuine model failure, we explicitly partition findings into capability-based (reference-free) and alignment-based (reference-dependent). Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck: capability-side, the best agent retrieves only 20.92% of expert-cited papers, and 1,000 model taxonomies show 75.9% sibling overlap, 51.2% MECE violations, and 83.4% structural imbalance, all detectable without any reference; alignment-side, all 12 LLMs converge to Sem-Path 28--29%, well below 47--58% achieved by three independent human-annotator groups on the same paper sets. Our benchmark is publicly available at https://github.com/KongLongGeFDU/TaxoBench

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TaxoBench, a benchmark of 72 highly-cited LLM surveys with expert-authored taxonomy trees and 3,815 mapped papers. It evaluates 7 deep research agents and 12 frontier LLMs on retrieval (Recall/Precision/F1) and organization using leaf-level assignment plus hierarchy-level metrics US-TED, US-NTED, and Sem-Path. Findings are partitioned into reference-free capability results (best retrieval recall 20.92%; 1,000 model taxonomies exhibit 75.9% sibling overlap, 51.2% MECE violations, 83.4% structural imbalance) and reference-dependent alignment results (LLMs converge to Sem-Path 28-29% versus 47-58% for three human-annotator groups), with two supported modes (end-to-end Deep Research and Bottom-Up).

Significance. If the quantitative outcomes hold, the work is significant for introducing a structured benchmark that moves beyond citation correctness or writing quality to assess synthesis via hierarchical organization, for providing reference-free diagnostics of taxonomy defects, and for releasing the full dataset and code. The explicit capability/alignment partition and use of independent human groups are constructive design choices.

major comments (2)
  1. [Abstract and §4 (Alignment Evaluation)] The alignment-side bottleneck (Sem-Path 28-29% for LLMs versus 47-58% for humans) treats each expert taxonomy as a stable gold standard, yet no inter-expert agreement, sensitivity to alternative hierarchies on the same 72 topics, or cross-taxonomy Sem-Path/US-TED scores are reported. If independent experts produce Sem-Path values in the 30-45% range, the reported human-model gap would shrink and the alignment claim would require re-interpretation.
  2. [§3 (Metric Definition) and Methods] The novel metrics US-TED, US-NTED, and Sem-Path are central to both capability and alignment claims, but the manuscript provides insufficient detail on semantic embedding choices, tree-edit operations for unordered hierarchies, and validation against human judgments, preventing independent reproduction of the reported 20.92% recall, 75.9% sibling overlap, and 51.2% MECE figures.
minor comments (2)
  1. [Results tables/figures] Table or figure captions should explicitly state the number of model taxonomies (1,000) and human groups (3) used for each statistic.
  2. [§3.2] Clarify the exact definition and detection procedure for 'MECE violations' and 'structural imbalance' with a short illustrative example in the main text.
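To make minor comment 2 concrete: one way a reference-free MECE check could work, offered as an illustration rather than the paper's actual procedure (the 0.8 similarity threshold, the embedding, and the nested-dict tree shape are all assumptions), is to flag sibling categories whose labels embed too close together:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sibling_overlap(tree, embed, threshold=0.8):
    """Flag parents whose sibling categories embed too similarly, one
    plausible proxy for a MECE violation. `tree` maps a category name
    to a dict of its children; leaves map to {}."""
    flagged = []

    def walk(parent, children):
        labels = list(children)
        vecs = [embed(label) for label in labels]
        for i in range(len(labels)):
            for j in range(i + 1, len(labels)):
                if cosine(vecs[i], vecs[j]) > threshold:
                    flagged.append((parent, labels[i], labels[j]))
        for child, sub in children.items():
            walk(child, sub)

    walk("root", tree)
    return flagged
```

For example, siblings named "retrieval methods" and "retrieval techniques" would be flagged under most embeddings, while "retrieval" and "ranking" would not.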

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments, which help clarify the presentation of our alignment results and the reproducibility of the proposed metrics. We address each point below and will incorporate the suggested expansions in the revised manuscript.

read point-by-point responses
  1. Referee: Abstract and §4 (Alignment Evaluation): the alignment-side bottleneck (Sem-Path 28-29% for LLMs versus 47-58% for humans) treats each expert taxonomy as a stable gold standard, yet no inter-expert agreement, sensitivity to alternative hierarchies on the same 72 topics, or cross-taxonomy Sem-Path/US-TED scores are reported. If independent experts produce Sem-Path values in the 30-45% range, the reported human-model gap would shrink and the alignment claim would require re-interpretation.

    Authors: The three independent human-annotator groups were given identical paper sets and produced taxonomies whose pairwise Sem-Path scores fall in the 47-58% range; these scores already constitute a direct empirical measure of inter-expert agreement on the same inputs. Nevertheless, we agree that explicitly reporting (i) pairwise Sem-Path and US-TED between the original expert taxonomies and each human group and (ii) a sensitivity analysis across plausible alternative hierarchies would further substantiate the gap. We will add these cross-taxonomy statistics and a short sensitivity subsection in the revised §4. revision: yes

  2. Referee: §3 (Metric Definition) and Methods: the novel metrics US-TED, US-NTED, and Sem-Path are central to both capability and alignment claims, but the manuscript provides insufficient detail on semantic embedding choices, tree-edit operations for unordered hierarchies, and validation against human judgments, preventing independent reproduction of the reported 20.92% recall, 75.9% sibling overlap, and 51.2% MECE figures.

    Authors: We acknowledge that the current description of US-TED/US-NTED and Sem-Path is too terse for full reproducibility. In the revised manuscript we will expand §3 with: (1) the exact sentence embedding model and similarity threshold used for node matching, (2) the precise set of unordered tree-edit operations (insert, delete, relabel) together with the semantic matching procedure, and (3) a validation subsection reporting correlation of the new metrics with human judgments on a held-out subset of taxonomies. Note that the 20.92% recall figure is computed from standard set overlap and does not depend on these metrics; the sibling-overlap and MECE statistics are reference-free and will also be described with explicit formulas. revision: yes
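The operations the rebuttal lists (insert, delete, relabel with semantic node matching) can be illustrated, though not reproduced, by a greedy approximation. Exact unordered tree edit distance is NP-hard in general, and the 0.7 threshold, the embedding, the greedy child matching, and the max-size normalization below are all stand-ins for the paper's unspecified procedure:

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def size(tree):
    label, children = tree
    return 1 + sum(size(c) for c in children)

def us_ted(a, b, embed, thr=0.7):
    """Greedy approximation of an unordered semantic tree edit distance
    over trees shaped (label, [children]). Relabel costs 1 unless the
    labels embed above `thr`; unmatched subtrees cost their node count
    (a delete on one side, an insert on the other)."""
    la, ka = a
    lb, kb = b
    cost = 0 if cosine(embed(la), embed(lb)) >= thr else 1
    used = set()
    for ca in ka:
        best, best_j = None, None
        for j, cb in enumerate(kb):
            if j in used:
                continue
            c = us_ted(ca, cb, embed, thr)
            if best is None or c < best:
                best, best_j = c, j
        if best_j is None:
            cost += size(ca)          # delete unmatched subtree of a
        else:
            used.add(best_j)
            cost += best
    # insert subtrees of b that found no partner in a
    cost += sum(size(cb) for j, cb in enumerate(kb) if j not in used)
    return cost

def us_nted(a, b, embed, thr=0.7):
    # One plausible normalization: divide by the larger tree's node count.
    return us_ted(a, b, embed, thr) / max(size(a), size(b))
```

Identical trees score 0; dropping one leaf from a three-node tree yields US-TED 1 and US-NTED 1/3 under this normalization.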

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical measurements on external references

full rationale

The paper introduces TaxoBench with 72 expert-authored taxonomies and reports empirical retrieval/organization results for agents and LLMs. Capability metrics (20.92% recall, 75.9% sibling overlap, 51.2% MECE violations, 83.4% imbalance) are explicitly reference-free. Alignment metrics compare model Sem-Path (28-29%) to separate human-annotator groups (47-58%) on the same paper sets. No equations, fitted parameters, or self-citations reduce any reported value to a quantity defined inside the paper by construction. The derivation chain consists of benchmark construction followed by measurement, with no self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim depends on treating expert taxonomies as authoritative references and on the validity of the newly proposed tree-comparison metrics. No numerical free parameters are described. The two domain assumptions below are the primary unproven premises.

axioms (2)
  • domain assumption Expert-authored taxonomies represent a stable and appropriate gold standard for measuring AI synthesis quality.
    Invoked when reporting alignment-based Sem-Path scores and when partitioning results into capability versus alignment.
  • domain assumption US-TED, US-NTED, and Sem-Path metrics correctly quantify hierarchical organization quality.
    The abstract introduces these metrics as novel, without prior validation or comparison to existing tree-edit distances.
invented entities (2)
  • US-TED / US-NTED no independent evidence
    purpose: Unordered semantic tree edit distance for comparing taxonomy hierarchies
    New metric family proposed for the benchmark; no independent evidence supplied in abstract.
  • Sem-Path no independent evidence
    purpose: Semantic path similarity for measuring alignment between taxonomies
    New metric introduced to capture path-level agreement; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5681 in / 1766 out tokens · 48441 ms · 2026-05-16T13:28:44.482992+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLM-Oriented Information Retrieval: A Denoising-First Perspective

    cs.IR 2026-05 unverdicted novelty 5.0

    Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...

  2. WisPaper: Your AI Scholar Search Engine

    cs.IR 2025-12 unverdicted novelty 3.0

    WisPaper integrates semantic search with agent-based validation, library organization, and personalized AI feeds into a closed-loop system that improves academic paper discovery and long-term awareness.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 2 Pith papers
