Accelerating scientific discovery with autonomous goal-evolving agents
11 Pith papers cite this work.
years
2026: 11
representative citing papers
LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
citing papers explorer
- Benchmarking AI Agents for Addressing Scientific Challenges Across Scales
  SciAgentArena is a new interactive benchmark for AI agents on scientific tasks that finds agents handle clear data-analysis workflows but struggle with novel insights, self-directed exploration, and open-ended questions.
- Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution
  IFCodeEvolve synthesizes coding data via actor-schema co-evolution with MCTS, boosting a 32B model's performance to match proprietary SOTA on instruction following.
- Towards Diverse Scientific Hypothesis Search with Large Language Models
  A parallel-tempering evolutionary framework for LLM hypothesis search improves both quality and diversity of candidates in molecular, equation, and algorithm discovery under fixed validation budgets.
- DrugSAGE: Self-evolving Agent Experience for Efficient State-of-the-Art Drug Discovery
  DrugSAGE accumulates cross-task memory of skills, statistical evidence, and recurring errors to let LLM agents achieve top-ranked performance on molecular property prediction tasks with reduced or zero test-time search.
- Open-Ended Task Discovery via Bayesian Optimization
  Generate-Select-Refine is an open-ended Bayesian optimization method that generates tasks and concentrates evaluations on the best one with only logarithmic regret overhead relative to standard single-task optimization.
- A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization
  Hygieia is a new AI agent system that integrates phenotypes, genetics, and records to achieve superior rare disease diagnosis and gene prioritization with confidence scores.
- Scientific discovery as meta-optimization: a combinatorial optimization case study
  Introduces consensus objective aggregation for meta-optimization of scientific discovery and reports improved scaling and speedup for 3-SAT algorithm discovery using digital MemComputing machines.
- Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators
  Sibyl-AutoResearch introduces self-evolving trial-and-error harnesses with auditable conversion units that link trial signals to updated research behaviors and harness repairs in autonomous systems.
- On the Creativity of AI Agents
  LLM agents produce outputs that meet basic functional criteria for creativity but lack the process-level, social, and personal elements required for ontological creativity.