FrontierScience: Evaluating AI’s Ability to Perform Expert-Level Scientific Tasks

Miles Wang, Robi Lin, Kat Hu, Joy Jiao, Neil Chowdhury, Ethan Chang, Tejal Patwardhan · 2026 · arXiv 2601.21165

18 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset: 2 · background: 1

citation-polarity summary

use dataset: 2 · background: 1

representative citing papers

SFBench: The SciFy Scientific Feasibility Benchmark

cs.AI · 2026-06-28 · unverdicted · novelty 7.0

SFBench provides 197 expert-created materials science claims with feasibility scores and explanations to evaluate AI systems on scientific feasibility assessment.

Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark

physics.comp-ph · 2026-06-17 · unverdicted · novelty 7.0

PhySciBench benchmark shows current AI models achieve at most 33.5% accuracy on physical science tasks; DelveAgent framework improves accuracy by up to 7.5 points and cuts costs to one-third.

Forecasting Scientific Progress with Artificial Intelligence

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.

CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency

stat.ML · 2026-05-07 · unverdicted · novelty 7.0

CITE certifies that a prespecified answer is the unique mode of an LLM response distribution with anytime-valid error control under arbitrary data-driven stopping and without prior knowledge of the answer set.

Agents-K1: Towards Agent-native Knowledge Orchestration

cs.AI · 2026-06-11 · unverdicted · novelty 6.0

Agents-K1 is an end-to-end pipeline with a multimodal parser, 4B GRPO-trained extractor, and agent CLI that builds scientific knowledge graphs from full papers and was run on 2.46 million documents to produce Scholar-KG.

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.

Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

An economy of agents using auctions and wealth accumulation produces emergent multi-step reasoning that outperforms monolithic baselines on five agentic tasks.

Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches

cs.AI · 2026-05-31 · unverdicted · novelty 6.0

A survey of RLM use in 28 disciplines reveals uneven adoption and introduces a maturity assessment framework showing larger gaps when limited to public resources.

Argus: Evidence Assembly for Scalable Deep Research Agents

cs.CL · 2026-05-15 · unverdicted · novelty 6.0 · 2 refs

Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers.

Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.

LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

cs.AI · 2026-02-04 · unverdicted · novelty 6.0

LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.

Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

cs.AI · 2026-05-13 · unverdicted · novelty 5.0

A 30B model trained via reverse-perplexity SFT, two-stage RL, and test-time scaling achieves gold-medal-level results on IMO 2025 and IPhO 2024/2025.

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

cs.AI · 2026-05-02 · unverdicted · novelty 5.0

SciResearcher is a new agentic data-construction framework that trains an 8B model via supervised fine-tuning and reinforcement learning to reach 19.46% on HLE-Bio/Chem-Gold and 13-15% gains on related biology and literature benchmarks.

COMPOSITE-STEM

cs.AI · 2026-04-10 · conditional · novelty 5.0

COMPOSITE-STEM is a new benchmark of 70 expert-curated STEM tasks where frontier AI agents score at most 21% using flexible exact-match and rubric-based grading.

AI for Auto-Research: Roadmap & User Guide

cs.AI · 2026-05-18 · unverdicted · novelty 4.0

The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

AI and the Research-Education Environment of Physics

physics.ed-ph · 2026-05-04 · unverdicted · novelty 1.0

A summary of expert opinions on AI's impact on the research-education environment in physics from a KITP discussion session.

EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale

cs.AI · 2026-04-19

citing papers explorer

Showing 18 of 18 citing papers.

SFBench: The SciFy Scientific Feasibility Benchmark · cs.AI · 2026-06-28 · unverdicted · none · ref 7
Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark · physics.comp-ph · 2026-06-17 · unverdicted · none · ref 30
Forecasting Scientific Progress with Artificial Intelligence · cs.AI · 2026-05-21 · unverdicted · none · ref 46
Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction · cs.LG · 2026-05-13 · unverdicted · none · ref 15
CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency · stat.ML · 2026-05-07 · unverdicted · none · ref 22
Agents-K1: Towards Agent-native Knowledge Orchestration · cs.AI · 2026-06-11 · unverdicted · none · ref 13
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement · cs.CL · 2026-06-10 · unverdicted · none · ref 157
Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions · cs.CL · 2026-06-01 · unverdicted · none · ref 43
Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches · cs.AI · 2026-05-31 · unverdicted · none · ref 285
Argus: Evidence Assembly for Scalable Deep Research Agents · cs.CL · 2026-05-15 · unverdicted · none · ref 41 · 2 links
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning · cs.CL · 2026-05-07 · unverdicted · none · ref 38
LABBench2: An Improved Benchmark for AI Systems Performing Biology Research · cs.AI · 2026-02-04 · unverdicted · none · ref 29
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling · cs.AI · 2026-05-13 · unverdicted · none · ref 1
SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning · cs.AI · 2026-05-02 · unverdicted · none · ref 40
COMPOSITE-STEM · cs.AI · 2026-04-10 · conditional · none · ref 9
AI for Auto-Research: Roadmap & User Guide · cs.AI · 2026-05-18 · unverdicted · none · ref 209
AI and the Research-Education Environment of Physics · physics.ed-ph · 2026-05-04 · unverdicted · none · ref 10
EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale · cs.AI · 2026-04-19 · unreviewed · ref 14
