super hub Canonical reference

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Adam Zsolt Wagner, Alexander Novikov, Emilien Dupont, Marvin Eisenberger, Po-Sen Huang · 2025 · cs.AI · arXiv 2506.13131

Canonical reference. 74% of citing Pith papers cite this work as background.

238 Pith papers citing it

abstract

In this white paper, we present AlphaEvolve, an evolutionary coding agent that substantially enhances capabilities of state-of-the-art LLMs on highly challenging tasks such as tackling open scientific problems or optimizing critical pieces of computational infrastructure. AlphaEvolve orchestrates an autonomous pipeline of LLMs, whose task is to improve an algorithm by making direct changes to the code. Using an evolutionary approach, continuously receiving feedback from one or more evaluators, AlphaEvolve iteratively improves the algorithm, potentially leading to new scientific and practical discoveries. We demonstrate the broad applicability of this approach by applying it to a number of important computational problems. When applied to optimizing critical components of large-scale computational stacks at Google, AlphaEvolve developed a more efficient scheduling algorithm for data centers, found a functionally equivalent simplification in the circuit design of hardware accelerators, and accelerated the training of the LLM underpinning AlphaEvolve itself. Furthermore, AlphaEvolve discovered novel, provably correct algorithms that surpass state-of-the-art solutions on a spectrum of problems in mathematics and computer science, significantly expanding the scope of prior automated discovery methods (Romera-Paredes et al., 2023). Notably, AlphaEvolve developed a search algorithm that found a procedure to multiply two $4 \times 4$ complex-valued matrices using $48$ scalar multiplications; offering the first improvement, after 56 years, over Strassen's algorithm in this setting. We believe AlphaEvolve and coding agents like it can have a significant impact in improving solutions of problems across many areas of science and computation.
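The 48-multiplication figure is easier to place next to the baseline it improves on. Strassen's scheme multiplies two $2 \times 2$ matrices with 7 products instead of 8; applying it recursively to a $4 \times 4$ matrix viewed as a $2 \times 2$ grid of $2 \times 2$ blocks costs $7 \times 7 = 49$ scalar multiplications, against 64 for the schoolbook method. The sketch below only counts multiplications for that recursive Strassen baseline. It does not reproduce the 48-multiplication procedure from the paper, and the helper names (`strassen_2x2`, `strassen_4x4`, `naive_4x4`) are illustrative, not taken from the source.

```python
import random


def strassen_2x2(a, b, mul, add, sub):
    """Multiply two 2x2 matrices with Strassen's seven products.

    Entries may be scalars or blocks; `mul`, `add`, `sub` define the
    arithmetic, and operand order is preserved (no commutativity assumed).
    """
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b
    m1 = mul(add(a11, a22), add(b11, b22))
    m2 = mul(add(a21, a22), b11)
    m3 = mul(a11, sub(b12, b22))
    m4 = mul(a22, sub(b21, b11))
    m5 = mul(add(a11, a12), b22)
    m6 = mul(sub(a21, a11), add(b11, b12))
    m7 = mul(sub(a12, a22), add(b21, b22))
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    return ((c11, c12), (c21, c22))


def strassen_4x4(A, B):
    """Return (A @ B, number of scalar multiplications) for 4x4 inputs."""
    count = 0

    def smul(x, y):
        nonlocal count
        count += 1  # every scalar product is counted here
        return x * y

    def sadd(x, y):
        return x + y

    def ssub(x, y):
        return x - y

    def bmul(X, Y):  # block product = Strassen on scalars
        return strassen_2x2(X, Y, smul, sadd, ssub)

    def badd(X, Y):
        return tuple(tuple(x + y for x, y in zip(rx, ry)) for rx, ry in zip(X, Y))

    def bsub(X, Y):
        return tuple(tuple(x - y for x, y in zip(rx, ry)) for rx, ry in zip(X, Y))

    def block(M, i, j):  # extract the 2x2 block at block-row i, block-column j
        return tuple(tuple(M[2 * i + r][2 * j + c] for c in range(2)) for r in range(2))

    Ab = tuple(tuple(block(A, i, j) for j in range(2)) for i in range(2))
    Bb = tuple(tuple(block(B, i, j) for j in range(2)) for i in range(2))
    Cb = strassen_2x2(Ab, Bb, bmul, badd, bsub)

    # Reassemble the four 2x2 result blocks into one 4x4 matrix.
    C = [[Cb[i // 2][j // 2][i % 2][j % 2] for j in range(4)] for i in range(4)]
    return C, count


def naive_4x4(A, B):
    """Schoolbook product, used only to check correctness (64 multiplications)."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def random_complex_matrix(rng):
    # Small integer real/imaginary parts keep floating-point arithmetic exact.
    return [[complex(rng.randint(-5, 5), rng.randint(-5, 5)) for _ in range(4)] for _ in range(4)]


if __name__ == "__main__":
    rng = random.Random(0)
    A, B = random_complex_matrix(rng), random_complex_matrix(rng)
    C, count = strassen_4x4(A, B)
    print(count)                   # 49
    print(C == naive_4x4(A, B))    # True
```

Counting through an injected `mul` callback keeps the multiplication tally independent of the entry type, so the same `strassen_2x2` serves both the block level and the scalar level.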

citation-role summary

background 33 · baseline 3 · method 3 · dataset 2 · other 1

citation-polarity summary

background 31 · baseline 3 · use method 3 · unclear 2 · use dataset 2 · support 1

authors

Adam Zsolt Wagner, Alexander Novikov, Emilien Dupont, Marvin Eisenberger, Ngân Vũ, Po-Sen Huang

representative citing papers

Resolving the Schwartz Quadratic Meander Number Conjecture

math.CO · 2026-06-10 · unverdicted · novelty 8.0 · 2 refs

The maximum meander number for cyclic permutations on n letters is bounded above and below by quadratic functions of n.

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

cs.AI · 2026-06-03 · unverdicted · novelty 8.0

AutoLab benchmark shows frontier models mostly fail at sustained iterative optimization due to premature termination, with persistence as the key success factor.

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

cs.AI · 2026-06-03 · unverdicted · novelty 8.0

The Meta-Agent Challenge shows frontier AI models rarely match human-engineered agent baselines when tasked with autonomous development, with proprietary models succeeding most often and some exhibiting cheating under pressure.

LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning

cs.AI · 2026-05-28 · unverdicted · novelty 8.0

LLM-guided evolutionary search yields the first domain-independent C++ planning heuristics that exceed the strongest hand-engineered baselines on coverage and speed trade-offs across unseen domains.

FastKernels: Benchmarking GPU Kernel Generation in Production

cs.LG · 2026-05-22 · conditional · novelty 8.0

FastKernels is a production-aligned benchmark covering 96.2% of HuggingFace Transformers that reveals state-of-the-art kernel agents deliver at most 0.94x aggregate speedup.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

cs.CL · 2026-05-08 · conditional · novelty 8.0 · 2 refs

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

MappingEvolve: LLM-Driven Code Evolution for Technology Mapping

cs.CE · 2026-04-29 · unverdicted · novelty 8.0

MappingEvolve applies LLMs through Planner-Evolver-Evaluator agents to evolve technology mapping code, delivering 10.04% area reduction versus ABC and 7.93% versus mockturtle on EPFL benchmarks.

Prism: Symbolic Superoptimization of Tensor Programs

cs.PL · 2026-04-16 · unverdicted · novelty 8.0

Prism is the first symbolic superoptimizer for tensor programs that uses sGraph for compact representation of program families, two-level search, e-graph equivalence checking, and auto-tuning to achieve up to 2.2x speedup over prior superoptimizers on LLM workloads.

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

cs.CL · 2026-04-14 · unverdicted · novelty 8.0

InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.

CHIA: An open-source framework for principled, agentic AI-driven hardware/software co-design research

cs.AR · 2026-06-25 · unverdicted · novelty 7.0

CHIA introduces a framework for building and deploying agentic AI co-design flows as CHIA loops with tool nodes, reliability mechanisms, and five case-study demonstrations.

A catalog of fast matrix multiplication algorithms with frontier-closure search

cs.SC · 2026-06-11 · unverdicted · novelty 7.0

A machine-checkable catalog of low-rank matrix multiplication algorithms up to 32x32x32 is built over multiple fields via frontier-closure search that recombines entries while preserving a non-overlap property with prior bilinear cores.

Mathematical perspective on genetic algorithms with optimization guided operators

cs.NE · 2026-06-10 · unverdicted · novelty 7.0

Presents a query-complexity framework for genetic algorithms with guided operators and shows necessity of multiple operators and tight bounds for diversity in solution pools.

AgentCanary: A Security Evaluation Framework for Autonomous AI Agents in Real Executable Environments

cs.CR · 2026-06-09 · unverdicted · novelty 7.0

AgentCanary introduces an Entry × Impact risk taxonomy, high-fidelity real tool environments with persistent state, and multi-dimensional trajectory evaluation to assess AI agent security across models and attacks.

Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

EinsteinArena is a platform for AI agents to collectively discover new mathematical results through open interaction, achieving 12 new state-of-the-art outcomes including raising the 11-dimensional kissing number lower bound from 593 to 604.

Self-Harness: Harnesses That Improve Themselves

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.

FunctionEvolve: Structure-Guided Symbolic Regression with LLMs

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

FunctionEvolve recovers 107 exact symbolic forms out of 129 synthetic tasks (82.9% SA@50) by using expression-tree structure for evolutionary search, parent selection, mutation, and coefficient scoring with LLMs.

MotionDisco: Motion Discovery for Extreme Humanoid Loco-Manipulation

cs.RO · 2026-06-04 · unverdicted · novelty 7.0

MotionDisco discovers long-horizon humanoid loco-manipulation motions from scratch via LLM-guided evolutionary search, trajectory optimization, and pruning, then transfers them to real robots with RL policies.

An automated proof that R(B_8,B_10)=37

math.CO · 2026-06-04 · accept · novelty 7.0

Proves R(B_8, B_10) = 37 via an AI-assisted short proof with a Lean formalization of the upper bound.

LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

LeanMarathon uses four contract-scoped agents on an evolving blueprint coordinated by a two-stage orchestrator to formalize seven theorems from Erdős problems in Lean, proving 258 lemmas with no sorry across three runs.

Classification of independent sets in signed Johnson graphs and applications to kissing arrangements

cs.IT · 2026-06-02 · unverdicted · novelty 7.0

Enumeration yields 1579 non-isomorphic maximum independent sets in J±(12,4) giving non-isometric kissing arrangements of size 840, with a proof that for n≡2 or 4 mod 6 all such sets arise from Steiner quadruple systems.

MobEvolve: An Agentic Self-Evolving Heuristic System for Interpretable Human Mobility Generation

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

MobEvolve is an agentic self-evolving heuristic framework that generates interpretable human mobility trajectories and outperforms deep generative and LLM-based methods on Singapore and Montreal benchmarks.

When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

cs.LG · 2026-05-29 · unverdicted · novelty 7.0 · 2 refs

PromptPO shows LLMs can act as black-box policy optimizers for sequential RL when leveraging prior knowledge, matching baselines in exploration and robotics but underperforming in MuJoCo.

citing papers explorer

Showing 50 of 89 citing papers after filters.

AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks? cs.AI · 2026-06-03 · unverdicted · none · ref 53 · internal anchor
AutoLab benchmark shows frontier models mostly fail at sustained iterative optimization due to premature termination, with persistence as the key success factor.
The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development? cs.AI · 2026-06-03 · unverdicted · none · ref 18 · internal anchor
The Meta-Agent Challenge shows frontier AI models rarely match human-engineered agent baselines when tasked with autonomous development, with proprietary models succeeding most often and some exhibiting cheating under pressure.
LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning cs.AI · 2026-05-28 · unverdicted · none · ref 32 · internal anchor
LLM-guided evolutionary search yields the first domain-independent C++ planning heuristics that exceed the strongest hand-engineered baselines on coverage and speed trade-offs across unseen domains.
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems? cs.AI · 2026-05-07 · unverdicted · none · ref 54 · internal anchor
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization cs.AI · 2026-06-03 · unverdicted · none · ref 21 · internal anchor
LeanMarathon uses four contract-scoped agents on an evolving blueprint coordinated by a two-stage orchestrator to formalize seven theorems from Erdős problems in Lean, proving 258 lemmas with no sorry across three runs.
MobEvolve: An Agentic Self-Evolving Heuristic System for Interpretable Human Mobility Generation cs.AI · 2026-06-01 · unverdicted · none · ref 97 · internal anchor
MobEvolve is an agentic self-evolving heuristic framework that generates interpretable human mobility trajectories and outperforms deep generative and LLM-based methods on Singapore and Montreal benchmarks.
FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization cs.AI · 2026-05-24 · unverdicted · none · ref 16 · internal anchor
FrontierOR benchmark shows frontier LLMs outperform Gurobi on solution quality and efficiency in only 31% of one-shot cases and 50% with test-time evolution on hard large-scale optimization tasks.
Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems cs.AI · 2026-05-22 · unverdicted · none · ref 35 · internal anchor
IDS is an agentic LLM system that incrementally synthesizes both implementation and proof for distributed key-value stores, succeeding on all 7 specs where prior agents succeeded on only 2.
Advancing Mathematics Research with AI-Driven Formal Proof Search cs.AI · 2026-05-21 · conditional · none · ref 46 · 2 links · internal anchor
An LLM-based agent with Lean verification autonomously solved multiple open Erdős problems and OEIS conjectures in the first large-scale test.
Forecasting Scientific Progress with Artificial Intelligence cs.AI · 2026-05-21 · unverdicted · none · ref 10 · internal anchor
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.
Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents cs.AI · 2026-05-21 · unverdicted · none · ref 39 · internal anchor
Life-Harness evolves reusable interventions from training trajectories to enhance frozen LLM agents on unseen tasks across seven deterministic environments, yielding 88.5% average relative improvement in 116 of 126 model-environment settings.
Latent Heuristic Search: Continuous Optimization for Automated Algorithm Design cs.AI · 2026-05-16 · unverdicted · none · ref 16 · internal anchor
Latent Heuristic Search performs continuous optimization over learned embeddings of heuristics, using normalizing flows and LLM prompting to discover competitive solvers for TSP, CVRP, KSP, and OBP.
Property-Guided LLM Program Synthesis for Planning cs.AI · 2026-05-15 · unverdicted · none · ref 42 · internal anchor
Property-guided LLM program synthesis with counterexample feedback creates direct heuristics for PDDL planning domains that require far fewer generations and less evaluation cost than score-based baselines.
SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution cs.AI · 2026-05-14 · unverdicted · none · ref 4 · internal anchor
SMCEvolve applies Sequential Monte Carlo sampling to LLM program search with adaptive resampling, mutation mixtures, and convergence control, delivering finite-sample complexity bounds and benchmark gains over prior systems.
Harnessing Agentic Evolution cs.AI · 2026-05-13 · unverdicted · none · ref 19 · internal anchor
AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
Budget-Efficient Automatic Algorithm Design via Code Graph cs.AI · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
A code-graph and correction-based LLM search framework outperforms full-algorithm generation at equal token budgets on three combinatorial optimization problems.
Agentic MIP Research: Accelerated Constraint Handler Generation cs.AI · 2026-05-09 · unverdicted · none · ref 14 · internal anchor
LLM agents in a solver-aware harness recover global constraints from MIP formulations, generate executable propagation-only handlers for SCIP, and solve five additional MIPLIB 2017 instances.
AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design cs.AI · 2026-05-09 · unverdicted · none · ref 8 · internal anchor
AHD Agent trains a 4B-parameter LLM via agentic RL to actively use tools for automatic heuristic design, matching or exceeding larger baselines across eight domains with fewer evaluations.
AI co-mathematician: Accelerating mathematicians with agentic AI cs.AI · 2026-05-07 · unverdicted · none · ref 19 · 2 links · internal anchor
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents cs.AI · 2026-05-07 · unverdicted · none · ref 23 · internal anchor
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs cs.AI · 2026-05-07 · unverdicted · none · ref 2 · internal anchor
A knowledge-first approach to LLM-driven automatic heuristic design in combinatorial optimization yields better discovery efficiency, transfer, and generalization than code-centric baselines by formalizing a distortion-compression trade-off.
Agentic-imodels: Evolving agentic interpretability tools via autoresearch cs.AI · 2026-05-05 · unverdicted · none · ref 51 · internal anchor
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
AI scientists produce results without reasoning scientifically cs.AI · 2026-04-20 · conditional · none · ref 25 · internal anchor
LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM-Powered Heuristic Design cs.AI · 2026-04-14 · unverdicted · none · ref 9 · internal anchor
BEAM reformulates LLM-based heuristic design as bi-level optimization using GA for structures, MCTS for placeholders, and adaptive memory to outperform prior single-layer methods on CVRP and MIS tasks.
The AI Telco Engineer: Toward Autonomous Discovery of Wireless Communications Algorithms cs.AI · 2026-04-11 · unverdicted · none · ref 8 · internal anchor
An LLM-powered agentic framework autonomously designs competitive and sometimes superior explainable algorithms for wireless PHY and MAC layer tasks.
SignalClaw: LLM-Guided Evolutionary Synthesis of Interpretable Traffic Signal Control Skills cs.AI · 2026-04-07 · unverdicted · none · ref 6 · internal anchor
SignalClaw synthesizes interpretable, composable traffic signal control skills through LLM-guided evolution that matches top baselines on routine SUMO scenarios and outperforms them on emergency and transit events while remaining editable by engineers.
Meta-Harness: End-to-End Optimization of Model Harnesses cs.AI · 2026-03-30 · unverdicted · none · ref 37 · internal anchor
Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across held-out models.
FactorEngine: A Program-level Knowledge-Infused Factor Mining Framework for Quantitative Investment cs.AI · 2026-03-17 · unverdicted · none · ref 12 · internal anchor
FactorEngine mines alpha factors as Turing-complete code via LLM-guided directional search, parameter separation, and a multi-agent pipeline that converts financial reports into executable programs, delivering higher IC/ICIR and Sharpe ratios than baselines in backtests.
MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants cs.AI · 2026-03-10 · unverdicted · none · ref 1 · internal anchor
MiniAppBench is the first benchmark for LLMs to generate principle-driven interactive HTML MiniApps from 500 tasks across six domains, evaluated by the agentic MiniAppEval framework on intention, static, and dynamic dimensions.
IC3-Evolve: Proof-/Witness-Gated Offline LLM-Driven Heuristic Evolution for IC3 Hardware Model Checking cs.AI · 2026-01-18 · unverdicted · none · ref 13 · internal anchor
IC3-Evolve evolves IC3 heuristics via offline LLM patches that are admitted only after passing proof or witness validation, yielding standalone improved checkers evaluated on HWMCC and unseen benchmarks.
One Reflection Is Not Enough: Self-Correcting Autonomous Research via Multi-Hypothesis Failure Attribution cs.AI · 2026-06-30 · unverdicted · none · ref 106 · internal anchor
SAGE with MHFA improves failure recovery in autonomous research agents, raising metrics-bearing outputs from 42% to 92% on a 12-topic benchmark versus single-reflection baselines.
Learning the ARTS of Search for Automated Discovery cs.AI · 2026-06-20 · unverdicted · none · ref 40 · internal anchor
ARTS improves automated scientific discovery by using reasoning LMs with test-time training to separate hypothesis merit from execution quality in tree search, achieving 15.3% relative gains on 22 MLGym and MLEBench tasks.
Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness cs.AI · 2026-06-17 · unverdicted · none · ref 20 · internal anchor
Xcientist externalizes research synthesis and validation in AI scientists via contract-governed artifacts to maintain traceable trajectories and avoid claim drift across three domains.
AIChilles: Automatically Uncovering Hidden Weaknesses in AI-Evolved Systems cs.AI · 2026-06-14 · unverdicted · none · ref 30 · internal anchor
AIChilles finds 49 distinct hidden weaknesses across 30 AI-evolved programs in five applications by combining workload extraction, agent-based constraint inference, differential oracles, and coverage to expose regressions.
Structured Testbench Generation for LLM-Driven HDL Design and Verification-Oriented Data Curation cs.AI · 2026-06-11 · unverdicted · none · ref 12 · internal anchor
STG generates deterministic testbenches 720x faster than iterative LLM flows with higher coverage and fewer false passes, while serving as an 11x faster data curation engine with 127x less energy.
Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents cs.AI · 2026-06-10 · unverdicted · none · ref 10 · internal anchor
Evoflux applies evolutionary search at inference time to repair executable tool workflows for compact agents, outperforming SFT and SFT+DPO on held-out MCP-Bench tasks with live servers and 250 tools.
A-Evolve-Training: Autonomous Post-Training of a 30B Model cs.AI · 2026-06-09 · unverdicted · none · ref 3 · internal anchor
An autonomous post-training system for a 30B model achieves near-top human performance on a reasoning leaderboard and revises its search policy after detecting that its dev metric had become misleading.
Self-Evolving Scientific Agent Discovers Generalizable Physically-Reasoned Fluid Control cs.AI · 2026-06-07 · unverdicted · none · ref 16 · internal anchor
An LLM-based self-evolving agent discovers a traveling-wave controller with body-frame guidance and yaw feedback that generalizes to unseen targets for an underactuated fluid swimmer.
Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution cs.AI · 2026-06-03 · unverdicted · none · ref 31 · internal anchor
LLM-driven program mutation converges to restricted structural attractors, with 87% of chains showing over 93% structural revisits and most variation limited to terminal substitutions, unlike classical GP.
EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents cs.AI · 2026-06-02 · unverdicted · none · ref 17 · internal anchor
EvoDrive presents an LLM-based agentic evolution framework that generates diverse safety-critical autonomous driving scenarios by maintaining a Pareto archive of attack-realism trade-offs using simulator feedback.
EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning cs.AI · 2026-06-02 · unverdicted · none · ref 58 · internal anchor
EvoTrainer co-evolves LLM policies and training harnesses via empirical feedback to match or exceed human-engineered RL on math reasoning, code generation, and long-horizon software engineering.
AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation cs.AI · 2026-05-27 · unverdicted · none · ref 36 · internal anchor
Decentralized AI agent teams self-organize around hypotheses, critique proposals, and share knowledge to outperform single-agent baselines on biomedical ML, language-model optimization, and protein fitness tasks.
ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence cs.AI · 2026-05-25 · unverdicted · none · ref 23 · internal anchor
ScientistOne introduces Chain-of-Evidence and an audit system that achieves zero hallucinated references, perfect score verification, and top method-code alignment while matching or beating human experts on five frontier tasks and generalizing to six more.
DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations cs.AI · 2026-05-23 · unverdicted · none · ref 26 · internal anchor
DemoEvolve bootstraps harness evolution with demonstrations to achieve more stable and effective edits than self-rollout search in sparse-feedback environments like Balatro.
Towards Direct Evaluation of Harness Optimizers via Priority Ranking cs.AI · 2026-05-21 · unverdicted · none · ref 23 · internal anchor
Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.
Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches cs.AI · 2026-05-18 · unverdicted · none · ref 14 · internal anchor
LLM agent translates user prompts into model patches and selects primal-aware re-optimization techniques for large-scale dynamic problems, shown on supply-chain and exam-scheduling cases.
Capturing LLM Capabilities via Evidence-Calibrated Query Clustering cs.AI · 2026-05-16 · unverdicted · none · ref 32 · 2 links · internal anchor
ECC calibrates semantic embeddings with model comparisons via Bradley-Terry profiles and mixture weights to cluster queries by latent LLM capabilities, claiming 17-18 point gains in ranking quality over baselines.
Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design cs.AI · 2026-05-15 · unverdicted · none · ref 34 · internal anchor
Multi-agent LLM systems discover new Transformer and hybrid architectures that outperform Llama 3.2 at 1B scale and approach human SOTA on long-range benchmarks.
OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation cs.AI · 2026-05-14 · conditional · none · ref 14 · 2 links · internal anchor
OpenDeepThink uses Bradley-Terry aggregation of LLM pairwise judgments to rank and evolve parallel reasoning traces, improving Gemini 3.1 Pro Codeforces Elo by 405 points over eight rounds.
Shepherd: Enabling Programmable Meta-Agents via Reversible Agentic Execution Traces cs.AI · 2026-05-11 · unverdicted · none · ref 26 · 2 links · internal anchor
Shepherd provides a reversible execution trace substrate for LLM agents that enables meta-agents to inspect and transform runs, yielding reported gains on coding and terminal benchmarks via supervision, counterfactual repair, and RL credit assignment.
