EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents

2026 · cs.CL · arXiv 2604.05557

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Scientific research follows multi-turn, multi-step workflows that require proactively searching the literature, consulting figures and tables, and integrating evidence across papers to align experimental settings and support reproducible conclusions. Existing benchmarks do not systematically assess this joint capability; they largely under-evaluate proactive search, multi-evidence integration, and sustained evidence use over time. In this work, we introduce EpiBench, an episodic multi-turn multimodal benchmark that instantiates short research workflows. Given a research task, agents must navigate across papers over multiple turns, align evidence from figures and tables, and use the evidence accumulated in memory to answer objective questions that require cross-paper comparisons and multi-figure integration. EpiBench introduces a process-level evaluation framework for fine-grained testing and diagnosis of research agents. Our experiments show that even the leading model achieves an accuracy of only 29.23% on the hard split, indicating substantial room for improvement in multi-turn, multi-evidence research workflows. EpiBench thus provides an evaluation platform for verifiable and reproducible research agents.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

TRACER attaches verifiable sentence-level provenance records to multimodal agent outputs using tool-turn alignment and semantic relations, yielding 78.23% answer accuracy and fewer tool calls than baselines on TRACE-Bench.

Claw AI Lab: An Autonomous Multi-Agent Research Team

cs.AI · 2026-05-21 · unverdicted · novelty 4.0

Claw AI Lab presents an interactive multi-agent platform for autonomous AI research that supports customizable teams, real-time control, and a code harness for experiment integration and result integrity.

citing papers explorer

Showing 2 of 2 citing papers.

TRACER: Verifiable Generative Provenance for Multimodal Tool-Using Agents cs.CL · 2026-05-11 · unverdicted · none · ref 18 · internal anchor
Claw AI Lab: An Autonomous Multi-Agent Research Team cs.AI · 2026-05-21 · unverdicted · none · ref 20 · internal anchor
