hub Canonical reference

ScienceAgentBench: Toward rigorous assessment of language agents for data-driven scientific discovery

Chen, Z · 2024 · arXiv 2410.05080

Canonical reference. 100% of citing Pith papers cite this work as background.

19 Pith papers citing it

Background 100% of classified citations

read on arXiv browse 19 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 dataset 2

citation-polarity summary

background 7

representative citing papers

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.

Sheaf-Theoretic Transport and Obstruction for Detecting Scientific Theory Shift in AI Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

A finite sheaf-theoretic framework ranks obstruction measures to identify when an AI agent's theory must deform within its language or extend to a new one, validated on a controlled transition benchmark.

Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

AI agents handle individual data-loading and reformatting steps on neuroscience datasets but rarely complete fully error-free end-to-end pipelines, and AI judges are unreliable without ground-truth references.

Agentic-imodels: Evolving agentic interpretability tools via autoresearch

cs.AI · 2026-05-05 · unverdicted · novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

AI scientists produce results without reasoning scientifically

cs.AI · 2026-04-20 · conditional · novelty 7.0

LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.

How Far Are We From True Auto-Research?

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

cs.LG · 2026-05-17 · accept · novelty 6.0

FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.

MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility

cs.LG · 2026-05-15 · conditional · novelty 6.0

MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin

gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy

gr-qc · 2026-05-11 · unverdicted · novelty 6.0

LLM coding agents cannot reach the 10^{-4} relative accuracy required for gravitational wave modeling tasks and show systematic failures including metric misuse and result fabrication.

TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

cs.AI · 2026-04-15 · unverdicted · novelty 6.0

TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

cs.LG · 2025-11-26 · unverdicted · novelty 6.0

Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.

CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

cs.CL · 2025-09-19 · unverdicted · novelty 6.0

CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.

RExBench: Can coding agents autonomously implement AI research extensions?

cs.CL · 2025-06-27 · unverdicted · novelty 6.0

RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

cs.CL · 2025-06-13 · conditional · novelty 6.0

DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

Experiment-as-Code Labs: A Declarative Stack for AI-Driven Scientific Discovery

eess.SY · 2026-05-06 · unverdicted · novelty 5.0 · 2 refs

The paper introduces Experiment-as-Code Labs as a declarative stack synthesizing AI agents, systems orchestration, and physical lab control for AI-driven discovery.

EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale

cs.AI · 2026-04-19 · unverdicted · novelty 5.0

EvoMaster is a self-evolving agent framework that achieves state-of-the-art results on scientific benchmarks by enabling iterative hypothesis refinement and knowledge accumulation across domains.

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

cs.AI · 2025-07-28 · accept · novelty 4.0

The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

cs.AI · 2025-04-28 · accept · novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning

cs.CL · 2025-02-05 · unverdicted · novelty 2.0

Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.

citing papers explorer

Showing 19 of 19 citing papers.

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science cs.AI · 2026-05-18 · unverdicted · none · ref 14
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
Sheaf-Theoretic Transport and Obstruction for Detecting Scientific Theory Shift in AI Agents cs.AI · 2026-05-13 · unverdicted · none · ref 7
A finite sheaf-theoretic framework ranks obstruction measures to identify when an AI agent's theory must deform within its language or extend to a new one, validated on a controlled transition benchmark.
Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse cs.LG · 2026-05-12 · unverdicted · none · ref 20 · 2 links
AI agents handle individual data-loading and reformatting steps on neuroscience datasets but rarely complete fully error-free end-to-end pipelines, and AI judges are unreliable without ground-truth references.
Agentic-imodels: Evolving agentic interpretability tools via autoresearch cs.AI · 2026-05-05 · unverdicted · none · ref 3
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
AI scientists produce results without reasoning scientifically cs.AI · 2026-04-20 · conditional · none · ref 4
LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
How Far Are We From True Auto-Research? cs.AI · 2026-05-18 · unverdicted · none · ref 22
ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics cs.LG · 2026-05-17 · accept · none · ref 12
FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.
MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility cs.LG · 2026-05-15 · conditional · none · ref 8
MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin
gwBenchmarks: Stress-Testing LLM Agents on High-Precision Gravitational Wave Astronomy gr-qc · 2026-05-11 · unverdicted · none · ref 16
LLM coding agents cannot reach the 10^{-4} relative accuracy required for gravitational wave modeling tasks and show systematic failures including metric misuse and result fabrication.
TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration cs.AI · 2026-04-15 · unverdicted · none · ref 4
TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.
Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression cs.LG · 2025-11-26 · unverdicted · none · ref 6
Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.
CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics cs.CL · 2025-09-19 · unverdicted · none · ref 15
CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.
RExBench: Can coding agents autonomously implement AI research extensions? cs.CL · 2025-06-27 · unverdicted · none · ref 5
RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents cs.CL · 2025-06-13 · conditional · none · ref 3
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
Experiment-as-Code Labs: A Declarative Stack for AI-Driven Scientific Discovery eess.SY · 2026-05-06 · unverdicted · none · ref 8 · 2 links
The paper introduces Experiment-as-Code Labs as a declarative stack synthesizing AI agents, systems orchestration, and physical lab control for AI-driven discovery.
EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale cs.AI · 2026-04-19 · unverdicted · none · ref 3
EvoMaster is a self-evolving agent framework that achieves state-of-the-art results on scientific benchmarks by enabling iterative hypothesis refinement and knowledge accumulation across domains.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence cs.AI · 2025-07-28 · accept · none · ref 74
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review cs.AI · 2025-04-28 · accept · none · ref 98
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
Position: Multimodal Large Language Models Can Significantly Advance Scientific Reasoning cs.CL · 2025-02-05 · unverdicted · none · ref 30
Position paper claims multimodal LLMs can significantly advance scientific reasoning and proposes a four-stage roadmap plus challenges and suggestions.

ScienceAgentBench: Toward rigorous assessment of language agents for data-driven scientific discovery

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer