Auto-Discovery-Bench: Diagnosing Structured State Tracking in Oracle-Guided Discovery

Anirudh Goyal; Beibei Lin; Dianbo Liu; Qiran Zou; Srinivas Anumasa; Tingting Chen; Vedant Shah; Zifeng Yuan

arxiv: 2502.15224 · v2 · pith:CA7G57KQnew · submitted 2025-02-21 · 💻 cs.LG · cs.AI

Tingting Chen, Beibei Lin, Srinivas Anumasa, Vedant Shah, Zifeng Yuan, Qiran Zou, Anirudh Goyal, Dianbo Liu

keywords discovery · agents · auto-discovery-bench · diagnostic · oracle-guided · structured · benchmark · capability

Interactive discovery requires agents to maintain and update structured beliefs over many rounds of feedback. Before evaluating agents in noisy, open-ended scientific environments, it is useful to isolate this prerequisite capability under controlled conditions. We introduce Auto-Discovery-Bench, a deterministic oracle-guided diagnostic benchmark in which agents recover hidden structures through repeated hypothesis-intervention-feedback cycles. The benchmark instantiates three controlled discovery abstractions: directed graph discovery, undirected relational discovery, and symbolic equation discovery. Across models, performance degrades as the number of variables, trajectory length, and distractors increase. A separate trajectory-tracking diagnostic shows that many failures persist even when intervention selection and hypothesis generation are removed, suggesting that limitations in maintaining and integrating long-range structured information are an important bottleneck for oracle-guided discovery. Auto-Discovery-Bench is not intended to replace realistic discovery environments; rather, it provides a reproducible, low-confound diagnostic testbed for isolating a prerequisite capability for interactive scientific agents.
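To make the hypothesis-intervention-feedback cycle concrete, here is a minimal Python sketch for the directed graph setting. It is not the benchmark's actual interface, which the abstract does not specify. The `Oracle` class, its `intervene` and `check` methods, the feedback format (the direct children of the intervened node), and the `discover` loop are all hypothetical names and assumptions chosen for illustration.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]


class Oracle:
    """Deterministic oracle over a hidden directed graph (hypothetical interface)."""

    def __init__(self, nodes: Iterable[str], edges: Iterable[Edge]) -> None:
        self.nodes: List[str] = list(nodes)
        self._edges: Set[Edge] = set(edges)  # hidden from the agent

    def intervene(self, node: str) -> List[str]:
        # Assumed feedback format: the direct children of the intervened node.
        return sorted(v for (u, v) in self._edges if u == node)

    def check(self, hypothesis: Set[Edge]) -> bool:
        # Assumed success signal: exact match with the hidden edge set.
        return set(hypothesis) == self._edges


def discover(oracle: Oracle, max_rounds: int) -> Tuple[Set[Edge], int]:
    """Intervene on one node per round and fold the feedback into the belief state."""
    belief: Set[Edge] = set()  # the structured state the agent must track
    rounds_used = 0
    for node in oracle.nodes[:max_rounds]:
        rounds_used += 1
        children = oracle.intervene(node)            # intervention + feedback
        belief |= {(node, child) for child in children}  # belief update
        if oracle.check(belief):                     # submit current hypothesis
            break
    return belief, rounds_used


# Usage: "D" plays the role of a distractor variable with no edges.
oracle = Oracle(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
belief, rounds_used = discover(oracle, max_rounds=4)
print(sorted(belief), rounds_used)
```

In this toy setting the belief state is a plain set, so nothing is ever forgotten. The abstract's claim is that language-model agents struggle with exactly this bookkeeping as the number of variables, the trajectory length, and the number of distractors grow.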

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes
cs.CV 2026-05 unverdicted novelty 5.0

GlowGS improves 3D Gaussian Splatting in nighttime glow scenes via semantic feature generation from diffusion models and novel-view semantic learning with vision foundation models.

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery
cs.AI 2026-05 unverdicted novelty 4.0

A survey organizing AI-powered research automation into five workflow stages, defining AutoResearch and Vibe Research, and proposing five evaluation dimensions while noting domain-conditioned limits on autonomy.