ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

Yujie Liu , Zonglin Yang , Tong Xie , Jinjie Ni , Ben Gao , Yuqiang Li , Shixiang Tang , Wanli Ouyang

show 2 more authors

Erik Cambria Dongzhan Zhou

Authors on Pith no claims yet

classification 💻 cs.CL cs.AIcs.CE

keywords discoveryllmsbenchmarkscientificabilityacrossautomatedautomatic

0 comments

read the original abstract

Large language models (LLMs) have shown potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs on a sufficient set of scientific discovery sub-tasks-inspiration retrieval, hypothesis composition, and hypothesis ranking-where sufficient means that perfectly solving these sub-tasks perfectly solves the overall discovery task. We develop an automated LLM-based framework that extracts critical components-research questions, background surveys, inspirations, and hypotheses-from papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on publications from 2024 onward, ensuring minimal overlap with LLM pretraining data; our automated framework further enables automatic extraction of even more recent papers as LLM pretraining cutoffs advance, supporting scalable and contamination-free automatic renewal of this discovery benchmark. Our evaluation shows that, across disciplines, LLMs excel at inspiration retrieval-an out-of-distribution task-suggesting their ability to surface novel knowledge associations.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification
astro-ph.IM 2026-05 unverdicted novelty 7.0

AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.
AI scientists produce results without reasoning scientifically
cs.AI 2026-04 conditional novelty 7.0

LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
cs.LG 2026-05 unverdicted novelty 6.0

MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
cs.AI 2025-04 accept novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.