One Interaction Is Worth a Thousand Guesses: Benchmarking the Interactive Capabilities of Deep Research Agents

Anthony K. H. Tung; Jun Yu; Qiang Huang; Wei Chen; Xiaoya Xie; Yingchaojie Feng; Zhaorui Yang

arxiv: 2601.06676 · v2 · pith:UPQH2SVHnew · submitted 2026-01-10 · 💻 cs.CL · cs.AI · cs.HC

Yingchaojie Feng, Qiang Huang, Xiaoya Xie, Zhaorui Yang, Jun Yu, Wei Chen, Anthony K. H. Tung

keywords: research, deep, interactive, agents, interaction, user, idrbench, benchmark

Deep research agents powered by Large Language Models (LLMs) can perform multi-step reasoning, web exploration, and long-form report generation. However, existing systems remain largely autonomous, assuming fully specified user intent and evaluating only final outputs. In practice, research goals are often underspecified and evolve during exploration, yet current benchmarks neither model dynamic user feedback nor measure interaction costs. To address this gap, we introduce IDRBench, the first Interactive Deep Research Benchmark for systematically evaluating the interactive capabilities of deep research agents. IDRBench formulates deep research as an interactive process where agents may solicit clarification to better align with user intent. It integrates a modular interactive framework, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite that jointly measures alignment gains and interaction overhead. Experiments on seven representative proprietary and open-weight LLMs show that interaction consistently improves research quality and robustness, while revealing substantial differences in interaction efficiency across models. These findings establish interactive capability as a distinct evaluation dimension and position IDRBench as a reusable benchmark for future user-aligned deep research agents.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Co-Evolving Skill Generation and Policy Optimization
cs.CL 2026-06 unverdicted novelty 7.0

Framework estimates context-dependent marginal utility of candidate skills via reward gaps in matched base vs. skill-augmented rollouts to filter skills and co-train policy as generator.

AI for Auto-Research: Roadmap & User Guide
cs.AI 2026-05 unverdicted novelty 4.0

The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.