ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

2025 · cs.CL · arXiv 2503.21248

8 Pith papers cite this work. Polarity classification is still indexing.



abstract

Large language models (LLMs) have shown potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs on a sufficient set of scientific discovery sub-tasks (inspiration retrieval, hypothesis composition, and hypothesis ranking), where sufficient means that perfectly solving these sub-tasks perfectly solves the overall discovery task. We develop an automated LLM-based framework that extracts critical components (research questions, background surveys, inspirations, and hypotheses) from papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on publications from 2024 onward, ensuring minimal overlap with LLM pretraining data; our automated framework further enables automatic extraction of even more recent papers as LLM pretraining cutoffs advance, supporting scalable and contamination-free automatic renewal of this discovery benchmark. Our evaluation shows that, across disciplines, LLMs excel at inspiration retrieval, an out-of-distribution task, suggesting their ability to surface novel knowledge associations.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Forecasting Scientific Progress with Artificial Intelligence

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.

AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification

astro-ph.IM · 2026-05-07 · unverdicted · novelty 7.0

AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.

AI scientists produce results without reasoning scientifically

cs.AI · 2026-04-20 · conditional · novelty 7.0

LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.

LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

cs.AI · 2026-05-22 · unverdicted · novelty 4.0

A survey organizing AI-powered research automation into five workflow stages, defining AutoResearch and Vibe Research, and proposing five evaluation dimensions while noting domain-conditioned limits on autonomy.

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

cs.AI · 2025-04-28 · accept · novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

cs.LG · 2026-05-09

IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research

cs.CL · 2025-07-21

citing papers explorer

Showing 8 of 8 citing papers.

Forecasting Scientific Progress with Artificial Intelligence
cs.AI · 2026-05-21 · unverdicted · none · ref 25 · internal anchor
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.

AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification
astro-ph.IM · 2026-05-07 · unverdicted · none · ref 6 · internal anchor
AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.

AI scientists produce results without reasoning scientifically
cs.AI · 2026-04-20 · conditional · none · ref 9 · internal anchor
LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.

LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design
cs.LG · 2026-05-14 · unverdicted · none · ref 8 · internal anchor
LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.

AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery
cs.AI · 2026-05-22 · unverdicted · none · ref 56 · internal anchor
A survey organizing AI-powered research automation into five workflow stages, defining AutoResearch and Vibe Research, and proposing five evaluation dimensions while noting domain-conditioned limits on autonomy.

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
cs.AI · 2025-04-28 · accept · none · ref 235 · internal anchor
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
cs.LG · 2026-05-09 · unreviewed · ref 56 · internal anchor

IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research
cs.CL · 2025-07-21 · unreviewed · ref 23 · internal anchor
