NatureBench evaluates ten frontier AI coding agents on 90 tasks from Nature papers under web-search-disabled conditions and finds the strongest agent surpasses published SOTA on only 17.8% of tasks, succeeding mainly by translating problems into familiar supervised learning setups.
hub
Nature , year =
10 Pith papers cite this work. Polarity classification is still indexing.
hub tools
years
2026 10verdicts
UNVERDICTED 10representative citing papers
Closed-loop LM-agent auto research finds some transferable gains on molecular property prediction benchmarks via external data but shows non-transfer for model and feature edits selected on validation.
Presents MedSci Skills, an open-source toolkit with deterministic integrity gates for verifying LLM-assisted clinical manuscripts against reporting guidelines like STARD, PRISMA, and STROBE.
DN-Hypo-Pipeline operationalizes three philosophy-of-science accounts to direct LLMs toward principle-based hypothesis generation, claims superior performance over direct prompting, and derives two new transformer algorithms from the resulting hypotheses.
Speak-to-Objective is a modular agentic pipeline that translates spoken or written commands into fully differentiable objective functions for optofluidic microparticle assembly using LLMs, inverse solvers, and experimental platforms.
A multi-LLM council scores predictive processing papers on an expert ontology, maps results in 3D hypothesis space, and introduces a dispersion metric showing greater spread in global versus local oddball paradigms.
Human-AI collaboration expanded a meta-idea on rational approximation into sign-embedding quantum algorithms for matrix problems, with humans retaining final judgment on routes and refinements.
Coordinated AI agents improve scientific inference from partial evidence in cross-domain tasks when single sources are incomplete, as demonstrated by AUROC gains in vector-borne disease and exoplanet benchmarks but tied performance in others.
Develops a framework representing AI-assisted research via five operators and principles for evidence-licensed claims, distinguishing claim semantics and introducing epistemic debt.
The paper proposes the Cybersecurity AI Scientist as a modular multi-agent architecture for automating cybersecurity research, distinguished by its focus on non-stationary threats and anchored in a four-zeros risk-trust-incident-energy frame.
citing papers explorer
-
Closed-loop Auto Research for Molecular Property Prediction: Discovering and Certifying Generalizable Improvements
Closed-loop LM-agent auto research finds some transferable gains on molecular property prediction benchmarks via external data but shows non-transfer for model and feature edits selected on validation.
-
Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture
Presents MedSci Skills, an open-source toolkit with deterministic integrity gates for verifying LLM-assisted clinical manuscripts against reporting guidelines like STARD, PRISMA, and STROBE.
-
DN-Hypo-Pipeline: An AI-Driven Workflow for Generating Hypotheses using Large Language Models and Scientific Explanations
DN-Hypo-Pipeline operationalizes three philosophy-of-science accounts to direct LLMs toward principle-based hypothesis generation, claims superior performance over direct prompting, and derives two new transformer algorithms from the resulting hypotheses.
-
Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence
Coordinated AI agents improve scientific inference from partial evidence in cross-domain tasks when single sources are incomplete, as demonstrated by AUROC gains in vector-borne disease and exoplanet benchmarks but tied performance in others.