AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

Gal Chechik; Talor Abramovich

arXiv: 2507.08038 · v3 · pith:C6MKQSQRnew · submitted 2025-07-09 · cs.CL · cs.AI

keywords: tasks · ablation · ablationbench · ablations · evaluating · experiments · research · agents

Abstract

Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challenge. A key mechanism to obtain such insights is through ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 45% of the original ablations on average, below human-level performance. We observe an inverse performance trend between the author and reviewer tasks, which we attribute to differences in model grounding. Lastly, we analyze the limitations of current LMs on these tasks, and find that chain-of-thought prompting outperforms an agent-based approach. Our data is available at https://huggingface.co/collections/ai-coscientist/ablationbench, and our code is available at https://github.com/ai-scientist-bench/ablation-bench.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Hidden Secrets in the arXiv: Discovering, Analyzing, and Preventing Unintentional Information Disclosure in Source Files of Scientific Preprints
cs.CR 2026-04 unverdicted novelty 7.0

Nearly every arXiv submission leaks hidden sensitive information through its source files, existing cleaners fail, and ALC-NG provides a more reliable fix.

AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories
cs.AI 2026-04 unverdicted novelty 5.0

AblateCell reproduces baselines in three single-cell perturbation repositories with 88.9% success and recovers ground-truth critical components with 93.3% accuracy via closed-loop ablation.