Pith Number

pith:73VCVUBW

pith:2026:73VCVUBWDUMDDM2U3NAJBAPBZH

not attested · not anchored · not stored · refs pending

Can Deep Research Agents Retrieve and Organize? Evaluating the Synthesis Gap with Expert Taxonomies

Jiabao Zhuang, Jiahui Lin, Jingyi Deng, Kexin Tan, Long Ma, Maxm Pan, Mingqi Wu, Ming Zhang, Ning Luo, Qiyuan Peng, Qi Zhang, Renzhe Zheng, Shihan Dou, Tao Gui, Wenqing Jing, Xuanjing Huang, Yuhang Zhao, Yuhui Wang, Yujiong Shen, Zhenghao Xiang, Ziyu Kong

Deep research agents retrieve only 21 percent of expert-cited papers and organize taxonomies far below human alignment levels.

arxiv:2601.12369 v4 · 2026-01-18 · cs.CL


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{73VCVUBWDUMDDM2U3NAJBAPBZH}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
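To illustrate why a deterministic merge lets any mirror recompute the same state, here is a minimal sketch. It is not the Pith merge algorithm: the event fields `ts`, `field`, and `value`, the ordering rule, and the function names are all hypothetical, and signature verification is omitted. The only point is that sorting events by a total order before folding them makes the result independent of arrival order.

```python
import hashlib
import json
from typing import Iterable


def event_key(event: dict) -> tuple:
    """Hypothetical total order: timestamp first, then a content hash as tie-breaker."""
    body = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return (event["ts"], hashlib.sha256(body).hexdigest())


def merge(canonical: dict, events: Iterable[dict]) -> dict:
    """Fold events into a state dict in a fixed order (illustrative only)."""
    state = {"record": canonical, "fields": {}}
    seen = set()
    for event in sorted(events, key=event_key):
        key = event_key(event)
        if key in seen:  # the same event may arrive from several mirrors
            continue
        seen.add(key)
        # Last writer in the total order wins for each field.
        state["fields"][event["field"]] = event["value"]
    return state


# Made-up events, used only to show order independence.
events = [
    {"ts": "2026-05-20T01:05:06Z", "field": "citations", "value": "open"},
    {"ts": "2026-05-21T00:00:00Z", "field": "citations", "value": "3 papers"},
    {"ts": "2026-05-20T02:00:00Z", "field": "replications", "value": "open"},
]
record = {"source": {"id": "2601.12369", "kind": "arxiv", "version": 4}}

print(merge(record, events) == merge(record, list(reversed(events))))
```

Two mirrors that hold the same set of events in different orders, or with duplicates, end up with identical state under this scheme.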

Claims

C1 · strongest claim

Evaluating 7 Deep Research Agents and 12 frontier LLMs reveals a dual bottleneck: capability-side, the best agent retrieves only 20.92% of expert-cited papers, and 1,000 model taxonomies show 75.9% sibling overlap, 51.2% MECE violations, and 83.4% structural imbalance, all detectable without any reference; alignment-side, all 12 LLMs converge to Sem-Path 28-29%, well below 47-58% achieved by three independent human-annotator groups on the same paper sets.

C2 · weakest assumption

Expert-authored taxonomies constitute an appropriate and stable gold standard against which model outputs can be meaningfully compared, and the newly introduced metrics (US-TED, US-NTED, Sem-Path) validly quantify synthesis quality independent of any single reference taxonomy.

C3 · one line summary

TaxoBench shows deep research agents retrieve 20.92% of expert-cited papers and produce taxonomies with 75.9% sibling overlap, 51.2% MECE violations, and 83.4% imbalance, while LLMs reach only 28-29% semantic path similarity versus 47-58% for human groups.

Formal links

2 machine-checked theorem links

Cited by

3 papers in Pith

LLM-Oriented Information Retrieval: A Denoising-First Perspective

WisPaper: Your AI Scholar Search Engine

LLM-Oriented Information Retrieval: A Denoising-First Perspective

Receipt and verification

First computed	2026-05-20T01:05:06.339939Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

feea2ad0361d1831b354db409081e1c9cd2b1e64caaca074ea42f148c04cf851

Aliases

arxiv: 2601.12369 · arxiv_version: 2601.12369v4 · doi: 10.48550/arxiv.2601.12369 · pith_short_12: 73VCVUBWDUMD · pith_short_16: 73VCVUBWDUMDDM2U · pith_short_8: 73VCVUBW
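The identifiers in this record appear to be related in a simple way: the short aliases are prefixes of the full Pith Number, and the Pith Number itself matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. The sketch below checks that relationship for the values shown on this page. It is an observation about this one record, not a documented derivation rule, and the helper names are illustrative.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "feea2ad0361d1831b354db409081e1c9cd2b1e64caaca074ea42f148c04cf851"
PITH_NUMBER = "73VCVUBWDUMDDM2U3NAJBAPBZH"


def base32_prefix(hex_digest: str, length: int) -> str:
    """First `length` characters of the Base32 encoding of a hex digest."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str) -> dict:
    """Prefix aliases of length 8, 12 and 16, as listed under Aliases."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


# Does the Base32 prefix of the hash reproduce the Pith Number?
print(base32_prefix(CANONICAL_HASH, len(PITH_NUMBER)) == PITH_NUMBER)
print(short_aliases(PITH_NUMBER))
```

If the relationship holds in general, a Pith Number can be checked against its canonical hash offline, without calling the resolver.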

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/73VCVUBWDUMDDM2U3NAJBAPBZH \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: feea2ad0361d1831b354db409081e1c9cd2b1e64caaca074ea42f148c04cf851

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "e09a6c3573b14f6c0d91cfe480768a67a2bdca92801dd59e864ee5f6d25e3e99",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-01-18T11:57:09Z",
    "title_canon_sha256": "a2da680b0b94035cc84356880c0da6cb74af656d8733a4535c8129caebe0bf5b"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2601.12369",
    "kind": "arxiv",
    "version": 4
  }
}
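The shell one-liner above can also be run offline against the record as printed here. The sketch below applies the same canonicalization (sorted keys, compact separators, UTF-8, no ASCII escaping) and compares the digest to the published canonical hash. It assumes the JSON shown on this page is the same as the `canonical_record` served by the resolver; a mismatch would suggest the two differ or that the served record uses a different serialization.

```python
import hashlib
import json

EXPECTED = "feea2ad0361d1831b354db409081e1c9cd2b1e64caaca074ea42f148c04cf851"

# Canonical record copied from this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "e09a6c3573b14f6c0d91cfe480768a67a2bdca92801dd59e864ee5f6d25e3e99",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-01-18T11:57:09Z",
    "title_canon_sha256": "a2da680b0b94035cc84356880c0da6cb74af656d8733a4535c8129caebe0bf5b"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2601.12369",
    "kind": "arxiv",
    "version": 4
  }
}
"""


def canonical_sha256(record: dict) -> str:
    """SHA-256 over the compact, key-sorted JSON form used in the shell command."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the key order or whitespace of the input JSON.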