Why Your Deep Research Agent Fails? On Hallucination Evaluation in Full Research Trajectory

Chao Huang; Linxuan Huang; Tianyu Fan; Yuhao Zhan; Zirui Guo

arxiv: 2601.22984 · v2 · pith:ZMIXFOIWnew · submitted 2026-01-30 · 💻 cs.AI

Yuhao Zhan, Tianyu Fan, Linxuan Huang, Zirui Guo, Chao Huang

classification 💻 cs.AI

keywords evaluation · research · hallucinations · trajectory · deep · deephallubench · dras · framework

Diagnosing failure patterns in Deep Research Agents (DRAs) remains a critical challenge. Existing benchmarks predominantly rely on end-to-end evaluation, obscuring intermediate hallucinations that accumulate throughout the research trajectory. To bridge this gap, we propose a shift from outcome-based to process-aware evaluation by auditing hallucinations in the full plan-search-summarize trajectory. We introduce the PING Taxonomy, which categorizes DRA hallucinations into four complementary types: Propagation, Intent, Noise-induced, and Grounding. We further instantiate this taxonomy into a fine-grained evaluation framework that decomposes trajectories into atomic actions, claims, and sub-queries for rigorous verification. Leveraging this framework to isolate 100 distinctively hallucination-prone tasks, including adversarial scenarios, we curate DeepHalluBench. Experiments on six representative DRAs show that, on our hallucination-prone stress-test set, all evaluated systems still exhibit non-negligible reliability gaps. Furthermore, our diagnostic analysis traces these failures to systemic deficits, especially hallucination propagation and cognitive biases, providing actionable insights for future architectural optimization. Code and data are available at https://github.com/yuhao-zhan/DeepHalluBench.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
cs.CL · 2026-05 · unverdicted · novelty 7.0

The REFLECT benchmark shows that current LLM judges achieve below 55% accuracy in detecting failures in evidence-based research agents, especially on evidence verification.