The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster · 2025 · cs.AI · arXiv 2504.08066

Canonical reference. 75% of citing Pith papers cite this work as background.

60 Pith papers citing it

Background 75% of classified citations

abstract

AI is increasingly playing a pivotal role in transforming how scientific discoveries are made. We introduce The AI Scientist-v2, an end-to-end agentic system capable of producing the first entirely AI generated peer-review-accepted workshop paper. This system iteratively formulates scientific hypotheses, designs and executes experiments, analyzes and visualizes data, and autonomously authors scientific manuscripts. Compared to its predecessor (v1, Lu et al., 2024 arXiv:2408.06292), The AI Scientist-v2 eliminates the reliance on human-authored code templates, generalizes effectively across diverse machine learning domains, and leverages a novel progressive agentic tree-search methodology managed by a dedicated experiment manager agent. Additionally, we enhance the AI reviewer component by integrating a Vision-Language Model (VLM) feedback loop for iterative refinement of content and aesthetics of the figures. We evaluated The AI Scientist-v2 by submitting three fully autonomous manuscripts to a peer-reviewed ICLR workshop. Notably, one manuscript achieved high enough scores to exceed the average human acceptance threshold, marking the first instance of a fully AI-generated paper successfully navigating a peer review. This accomplishment highlights the growing capability of AI in conducting all aspects of scientific research. We anticipate that further advancements in autonomous scientific discovery technologies will profoundly impact human knowledge generation, enabling unprecedented scalability in research productivity and significantly accelerating scientific breakthroughs, greatly benefiting society at large. We have open-sourced the code at https://github.com/SakanaAI/AI-Scientist-v2 to foster the future development of this transformative technology. We also discuss the role of AI in science, including AI safety.

citation-role summary

background 14 · dataset 1 · other 1

citation-polarity summary

background 12 · unclear 2 · support 1 · use dataset 1

representative citing papers

SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems

cs.AI · 2026-05-11 · unverdicted · novelty 8.0

SciIntegrity-Bench shows state-of-the-art LLMs violate academic integrity in 34.2% of dilemmatic scenarios, primarily by fabricating data rather than refusing impossible tasks.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Forecasting Scientific Progress with Artificial Intelligence

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.

1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

Introduces the 1GC-7RC benchmark to evaluate AI coding agents on seven diverse ML tasks under single-GPU time and access constraints.

SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

SMCEvolve applies Sequential Monte Carlo sampling to LLM program search with adaptive resampling, mutation mixtures, and convergence control, delivering finite-sample complexity bounds and benchmark gains over prior systems.

PROMETHEUS: Automating Deep Causal Research Integrating Text, Data and Models

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

PROMETHEUS builds causal atlases from text and data using local predictive-state models and sheaf gluing to create navigable Topos World Models that expose evidence strength and coherence gaps.

AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

physics.flu-dyn · 2026-05-07 · conditional · novelty 7.0 · 3 refs

AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures missed by solver checks.

Fine-Tuning Small Reasoning Models for Quantum Field Theory

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

Camyla: Scaling Autonomous Research in Medical Image Segmentation

cs.AI · 2026-04-12 · unverdicted · novelty 7.0

Camyla autonomously generates research proposals, experiments, and manuscripts in medical image segmentation, outperforming baselines on 24 of 31 recent datasets while producing 40 human-reviewed papers.

AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

AutoSOTA uses eight specialized agents to replicate and optimize models from recent AI papers, producing 105 new SOTA results in about five hours per paper on average.

El Agente Quntur: A research collaborator agent for quantum chemistry

physics.chem-ph · 2026-02-04 · unverdicted · novelty 7.0

El Agente Quntur is a new multi-agent system that uses reasoning over literature and software documentation to autonomously handle the full workflow of quantum chemistry experiments in ORCA.

AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

AutoResearchClaw presents a multi-agent autonomous research pipeline with debate, self-healing execution, verifiable reporting, human-in-the-loop modes, and cross-run evolution that outperforms AI Scientist v2 by 54.7% on the ARC-Bench benchmark.

How Far Are We From True Auto-Research?

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

cs.LG · 2026-05-17 · accept · novelty 6.0

FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.

MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility

cs.LG · 2026-05-15 · conditional · novelty 6.0

The MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin.

Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design

cs.AI · 2026-05-15 · unverdicted · novelty 6.0

Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.

Unlocking LLM Creativity in Science through Analogical Reasoning

cs.AI · 2026-05-11 · conditional · novelty 6.0

Analogical reasoning increases LLM solution diversity by 90-173% and novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.

NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.

Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

Malicious actors could use AI agents to submit large numbers of fake papers, inflating the submission count and thereby raising the acceptance odds for a small set of chosen legitimate papers under stable conference acceptance rates.

Intervention-Based Time Series Causal Discovery via Simulator-Generated Interventional Distributions

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

SVAR-FM uses simulator clamping to produce interventional distributions and flow matching to identify time series causal structures, with an error bound that predicts sign reversal of causal effects below a simulator accuracy threshold.

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.

CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yielding improved benchmark performance with auditable traces.

FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.

Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

An LLM entity-tagging pipeline plus multi-agent system extracts ~6.3M nuanced records from 22.5M PubMed papers across six tasks with lower measured error than existing curated databases.

citing papers explorer

Showing 50 of 60 citing papers.

SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems cs.AI · 2026-05-11 · unverdicted · none · ref 5 · internal anchor
SciIntegrity-Bench shows state-of-the-art LLMs violate academic integrity in 34.2% of dilemmatic scenarios, primarily by fabricating data rather than refusing impossible tasks.
Evaluating Large Language Models in Scientific Discovery cs.AI · 2025-12-17 · unverdicted · none · ref 21 · internal anchor
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Forecasting Scientific Progress with Artificial Intelligence cs.AI · 2026-05-21 · unverdicted · none · ref 39 · internal anchor
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.
1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job? cs.LG · 2026-05-16 · unverdicted · none · ref 51 · internal anchor
Introduces the 1GC-7RC benchmark to evaluate AI coding agents on seven diverse ML tasks under single-GPU time and access constraints.
SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution cs.AI · 2026-05-14 · unverdicted · none · ref 11 · internal anchor
SMCEvolve applies Sequential Monte Carlo sampling to LLM program search with adaptive resampling, mutation mixtures, and convergence control, delivering finite-sample complexity bounds and benchmark gains over prior systems.
PROMETHEUS: Automating Deep Causal Research Integrating Text, Data and Models cs.AI · 2026-05-13 · unverdicted · none · ref 24 · internal anchor
PROMETHEUS builds causal atlases from text and data using local predictive-state models and sheaf gluing to create navigable Topos World Models that expose evidence strength and coherence gaps.
AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents physics.flu-dyn · 2026-05-07 · conditional · none · ref 39 · 3 links · internal anchor
AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures missed by solver checks.
Fine-Tuning Small Reasoning Models for Quantum Field Theory cs.LG · 2026-04-21 · unverdicted · none · ref 26 · internal anchor
Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.
Camyla: Scaling Autonomous Research in Medical Image Segmentation cs.AI · 2026-04-12 · unverdicted · none · ref 2 · internal anchor
Camyla autonomously generates research proposals, experiments, and manuscripts in medical image segmentation, outperforming baselines on 24 of 31 recent datasets while producing 40 human-reviewed papers.
AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery cs.CL · 2026-04-07 · unverdicted · none · ref 21 · internal anchor
AutoSOTA uses eight specialized agents to replicate and optimize models from recent AI papers, producing 105 new SOTA results in about five hours per paper on average.
El Agente Quntur: A research collaborator agent for quantum chemistry physics.chem-ph · 2026-02-04 · unverdicted · none · ref 36 · internal anchor
El Agente Quntur is a new multi-agent system that uses reasoning over literature and software documentation to autonomously handle the full workflow of quantum chemistry experiments in ORCA.
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration cs.AI · 2026-05-19 · unverdicted · none · ref 18 · internal anchor
AutoResearchClaw presents a multi-agent autonomous research pipeline with debate, self-healing execution, verifiable reporting, human-in-the-loop modes, and cross-run evolution that outperforms AI Scientist v2 by 54.7% on the ARC-Bench benchmark.
How Far Are We From True Auto-Research? cs.AI · 2026-05-18 · unverdicted · none · ref 17 · internal anchor
ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics cs.LG · 2026-05-17 · accept · none · ref 58 · internal anchor
FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.
MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility cs.LG · 2026-05-15 · conditional · none · ref 40 · internal anchor
The MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin.
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design cs.AI · 2026-05-15 · unverdicted · none · ref 32 · internal anchor
Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.
Unlocking LLM Creativity in Science through Analogical Reasoning cs.AI · 2026-05-11 · conditional · none · ref 51 · internal anchor
Analogical reasoning increases LLM solution diversity by 90-173% and novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation cs.AI · 2026-05-11 · unverdicted · none · ref 35 · internal anchor
NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.
Position: Academic Conferences are Potentially Facing Denominator Gaming Caused by Fully Automated Scientific Agents cs.CL · 2026-05-11 · unverdicted · none · ref 28 · internal anchor
Malicious actors could use AI agents to submit large numbers of fake papers, inflating the submission count and thereby raising the acceptance odds for a small set of chosen legitimate papers under stable conference acceptance rates.
Intervention-Based Time Series Causal Discovery via Simulator-Generated Interventional Distributions cs.LG · 2026-05-11 · unverdicted · none · ref 205 · internal anchor
SVAR-FM uses simulator clamping to produce interventional distributions and flow matching to identify time series causal structures, with an error bound that predicts sign reversal of causal effects below a simulator accuracy threshold.
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI cs.LG · 2026-05-09 · unverdicted · none · ref 111 · internal anchor
MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models cs.LG · 2026-05-08 · unverdicted · none · ref 26 · internal anchor
CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yielding improved benchmark performance with auditable traces.
FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution cs.LG · 2026-05-08 · unverdicted · none · ref 35 · internal anchor
FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.
Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale cs.LG · 2026-05-07 · unverdicted · none · ref 73 · 2 links · internal anchor
An LLM entity-tagging pipeline plus multi-agent system extracts ~6.3M nuanced records from 22.5M PubMed papers across six tasks with lower measured error than existing curated databases.
Hypothesis generation and updating in large language models cs.LG · 2026-05-07 · unverdicted · none · ref 7 · internal anchor
LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.
Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists cs.AI · 2026-04-30 · unverdicted · none · ref 30 · internal anchor
Intern-Atlas constructs a methodological evolution graph with 9.4 million edges from 1.03 million AI papers to capture how methods emerge, adapt, and transition, enabling better idea evaluation and generation for AI-driven research.
Rethinking Publication: A Certification Framework for AI-Enabled Research cs.AI · 2026-04-23 · unverdicted · none · ref 48 · 2 links · internal anchor
A two-layer certification framework decouples knowledge validity from human authorship to accommodate AI-enabled research in existing publication systems.
Evaluation-driven Scaling for Scientific Discovery cs.LG · 2026-04-21 · unverdicted · none · ref 160 · internal anchor
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.
TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration cs.AI · 2026-04-15 · unverdicted · none · ref 51 · internal anchor
TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.
Toward Autonomous Long-Horizon Engineering for ML Research cs.CL · 2026-04-14 · unverdicted · none · ref 24 · internal anchor
AiScientist improves ML research benchmarks by 10.54 points on PaperBench and reaches 81.82% Any Medal on MLE-Bench Lite through hierarchical control plus durable file-based state instead of conversational handoffs.
Pioneer Agent: Continual Improvement of Small Language Models in Production cs.AI · 2026-04-10 · unverdicted · none · ref 97 · internal anchor
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.
ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation cs.AI · 2026-04-07 · unverdicted · none · ref 34 · internal anchor
ResearchEVO automates the discover-then-explain cycle by evolving algorithms via fitness-driven LLM co-evolution and generating grounded, anti-hallucination research papers through sentence-level RAG.
AIRA_2: Overcoming Bottlenecks in AI Research Agents cs.AI · 2026-03-27 · conditional · none · ref 21 · internal anchor
AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20 AIRS-Bench tasks.
LLMs learn scientific taste from institutional traces across the social sciences cs.AI · 2026-03-17 · conditional · none · ref 47 · internal anchor
Fine-tuned LLMs trained on social science publication records outperform experts and frontier models at judging which research pitches deserve attention.
DeepReviewer 2.0: A Traceable Agentic System for Auditable Scientific Peer Review cs.AI · 2026-03-03 · unverdicted · none · ref 1 · internal anchor
An agentic system produces traceable review packages and an un-finetuned 196B model using it covers more major issues than Gemini-3.1-Pro on 134 ICLR 2025 submissions while winning most blind comparisons to human committees.
The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems cs.CY · 2026-02-19 · accept · none · ref 136 · internal anchor
The 2025 AI Agent Index catalogs technical and safety details for 30 deployed AI agents and finds low developer transparency on safety, evaluations, and societal impacts.
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution cs.CL · 2025-09-17 · unverdicted · none · ref 262 · internal anchor
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems cs.LG · 2025-06-11 · unverdicted · none · ref 72 · internal anchor
Introduces a Bayesian framework viewing LLM prompts as textual parameters and proposes MHLP, a novel MCMC algorithm using LLM proposals, to perform inference and improve accuracy plus uncertainty quantification on benchmarks.
XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration cs.CL · 2025-05-16 · conditional · none · ref 5 · internal anchor
XtraGPT is a suite of 1.5B-14B parameter open-source LLMs fine-tuned on 140,000 revision pairs from 7,000 top-tier papers to support controllable, context-aware academic paper editing.
Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators cs.MA · 2026-05-21 · unverdicted · none · ref 42 · internal anchor
Sibyl-AutoResearch introduces self-evolving trial-and-error harnesses with auditable conversion units that link trial signals to updated research behaviors and harness repairs in autonomous systems.
Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization Tasks cs.AI · 2026-05-20 · unverdicted · none · ref 30 · internal anchor
A multi-agent harness autonomously generates functional single-page VIS apps with linked views for scientific data tasks using coordinated skills for analysis, planning, implementation, and evaluation.
AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists cs.AI · 2026-05-20 · unverdicted · none · ref 3 · internal anchor
AiraXiv is a proposed AI-driven platform for open preprints that supports human and AI authors with interactive UI and MCP-based interactions, validated by serving as the submission system for ICAIS 2025.
Toward an Engineering of Science: Rebalancing Generation and Verification in the Age of AI cs.CY · 2026-05-11 · unverdicted · none · ref 41 · internal anchor
AI lowers the cost of generating plausible scientific artifacts without lowering verification costs, so the paper proposes blueprints as typed graph components that decompose claims, evidence, and assumptions to enable cheaper downstream verification.
GEAR: Genetic AutoResearch for Agentic Code Evolution cs.NE · 2026-05-08 · unverdicted · none · ref 24 · internal anchor
GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.
NORA: A Harness-Engineered Autonomous Research Agent for End-to-End Spatial Data Science cs.AI · 2026-05-03 · unverdicted · none · ref 2 · internal anchor
NORA is a harness-engineered multi-agent system that automates end-to-end spatial data science using domain skills for analysis and data acquisition, with case studies showing better output quality than general-purpose agents.
Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows cs.CL · 2026-04-22 · unverdicted · none · ref 48 · internal anchor
Cooperative profiles from behavioral economics games predict LLM team performance in AI-for-science workflows.
pAI/MSc: ML Theory Research with Humans on the Loop cs.AI · 2026-04-22 · unverdicted · none · ref 62 · internal anchor
pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript draft in ML theory.
AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories cs.AI · 2026-04-21 · unverdicted · none · ref 16 · internal anchor
AblateCell reproduces baselines in three single-cell perturbation repositories with 88.9% success and recovers ground-truth critical components with 93.3% accuracy via closed-loop ablation.
EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale cs.AI · 2026-04-19 · unverdicted · none · ref 15 · internal anchor
EvoMaster is a self-evolving agent framework that achieves state-of-the-art results on scientific benchmarks by enabling iterative hypothesis refinement and knowledge accumulation across domains.
Agentic Insight Generation in VSM Simulations cs.CL · 2026-04-14 · unverdicted · none · ref 22 · internal anchor
A two-step agentic system for extracting insights from VSM simulations achieves up to 86% accuracy with top LLMs by using progressive data discovery and slim context.
