arxiv: 2601.06676 · v2 · pith:UPQH2SVHnew · submitted 2026-01-10 · 💻 cs.CL · cs.AI · cs.HC

One Interaction Is Worth a Thousand Guesses: Benchmarking the Interactive Capabilities of Deep Research Agents

classification 💻 cs.CL cs.AI cs.HC
keywords research deep interactive agents interaction user idrbench benchmark

Deep research agents powered by Large Language Models (LLMs) can perform multi-step reasoning, web exploration, and long-form report generation. However, existing systems remain largely autonomous, assuming fully specified user intent and evaluating only final outputs. In practice, research goals are often underspecified and evolve during exploration, yet current benchmarks neither model dynamic user feedback nor measure interaction costs. To address this gap, we introduce IDRBench, the first Interactive Deep Research Benchmark for systematically evaluating the interactive capabilities of deep research agents. IDRBench formulates deep research as an interactive process where agents may solicit clarification to better align with user intent. It integrates a modular interactive framework, a scalable reference-grounded user simulator, and an interaction-aware evaluation suite that jointly measures alignment gains and interaction overhead. Experiments on seven representative proprietary and open-weight LLMs show that interaction consistently improves research quality and robustness, while revealing substantial differences in interaction efficiency across models. These findings establish interactive capability as a distinct evaluation dimension and position IDRBench as a reusable benchmark for future user-aligned deep research agents.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Co-Evolving Skill Generation and Policy Optimization

    cs.CL 2026-06 unverdicted novelty 7.0

    Framework estimates context-dependent marginal utility of candidate skills via reward gaps in matched base vs. skill-augmented rollouts to filter skills and co-train policy as generator.

  2. AI for Auto-Research: Roadmap & User Guide

    cs.AI 2026-05 unverdicted novelty 4.0

    The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.