AlphaEvolve: A coding agent for scientific and algorithmic discovery

Adam Zsolt Wagner, Alexander Novikov, Emilien Dupont, Marvin Eisenberger, Po-Sen Huang · 2025 · cs.AI · arXiv 2506.13131

Canonical reference. 74% of citing Pith papers cite this work as background.

147 Pith papers citing it

Background 74% of classified citations

abstract

In this white paper, we present AlphaEvolve, an evolutionary coding agent that substantially enhances capabilities of state-of-the-art LLMs on highly challenging tasks such as tackling open scientific problems or optimizing critical pieces of computational infrastructure. AlphaEvolve orchestrates an autonomous pipeline of LLMs, whose task is to improve an algorithm by making direct changes to the code. Using an evolutionary approach, continuously receiving feedback from one or more evaluators, AlphaEvolve iteratively improves the algorithm, potentially leading to new scientific and practical discoveries. We demonstrate the broad applicability of this approach by applying it to a number of important computational problems. When applied to optimizing critical components of large-scale computational stacks at Google, AlphaEvolve developed a more efficient scheduling algorithm for data centers, found a functionally equivalent simplification in the circuit design of hardware accelerators, and accelerated the training of the LLM underpinning AlphaEvolve itself. Furthermore, AlphaEvolve discovered novel, provably correct algorithms that surpass state-of-the-art solutions on a spectrum of problems in mathematics and computer science, significantly expanding the scope of prior automated discovery methods (Romera-Paredes et al., 2023). Notably, AlphaEvolve developed a search algorithm that found a procedure to multiply two $4 \times 4$ complex-valued matrices using $48$ scalar multiplications; offering the first improvement, after 56 years, over Strassen's algorithm in this setting. We believe AlphaEvolve and coding agents like it can have a significant impact in improving solutions of problems across many areas of science and computation.
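For context on the 48-multiplication figure: applying Strassen's 2×2 scheme recursively to a 4×4 matrix costs 7² = 49 scalar multiplications, against 64 for the schoolbook method. The sketch below only illustrates that baseline count. It is not AlphaEvolve's algorithm or its discovered procedure, and the helper names (`add`, `sub`, `quadrants`, `strassen`, `naive`) are hypothetical, chosen for this example.

```python
import random

def add(X, Y):
    """Entrywise sum of two equally sized matrices (lists of rows)."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def sub(X, Y):
    """Entrywise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def quadrants(X):
    """Split a 2h x 2h matrix into its four h x h blocks."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def strassen(A, B, counter):
    """Strassen product for sizes that are powers of two; counter[0] tallies scalar multiplications."""
    if len(A) == 1:
        counter[0] += 1
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = quadrants(A)
    B11, B12, B21, B22 = quadrants(B)
    # Seven block products instead of the eight used by the schoolbook method.
    M1 = strassen(add(A11, A22), add(B11, B22), counter)
    M2 = strassen(add(A21, A22), B11, counter)
    M3 = strassen(A11, sub(B12, B22), counter)
    M4 = strassen(A22, sub(B21, B11), counter)
    M5 = strassen(add(A11, A12), B22, counter)
    M6 = strassen(sub(A21, A11), add(B11, B12), counter)
    M7 = strassen(sub(A12, A22), add(B21, B22), counter)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom

def naive(A, B, counter):
    """Schoolbook product; counter[0] tallies scalar multiplications."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                counter[0] += 1
                C[i][j] += A[i][k] * B[k][j]
    return C

def random_complex_matrix(n, rng):
    """Complex entries with small integer parts, so all arithmetic is exact."""
    return [[complex(rng.randint(-5, 5), rng.randint(-5, 5)) for _ in range(n)]
            for _ in range(n)]

rng = random.Random(0)
A = random_complex_matrix(4, rng)
B = random_complex_matrix(4, rng)

strassen_counter, naive_counter = [0], [0]
product = strassen(A, B, strassen_counter)
reference = naive(A, B, naive_counter)
strassen_count, naive_count = strassen_counter[0], naive_counter[0]

print(strassen_count, naive_count, product == reference)
```

Each complex product is counted as one scalar multiplication here, matching the convention of counting multiplications in the underlying field. The abstract's result of 48 is one fewer than the recursive Strassen count of 49 that this sketch reproduces.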

citation-role summary

background 33 · baseline 3 · method 3 · dataset 2 · other 1

citation-polarity summary

background 31 · baseline 3 · use method 3 · unclear 2 · use dataset 2 · support 1

authors

Adam Zsolt Wagner, Alexander Novikov, Emilien Dupont, Marvin Eisenberger, Ngân Vũ, Po-Sen Huang

representative citing papers

FastKernels: Benchmarking GPU Kernel Generation in Production

cs.LG · 2026-05-22 · conditional · novelty 8.0

FastKernels is a production-aligned benchmark covering 96.2% of HuggingFace Transformers that reveals state-of-the-art kernel agents deliver at most 0.94x aggregate speedup.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

cs.CL · 2026-05-08 · conditional · novelty 8.0 · 2 refs

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

MappingEvolve: LLM-Driven Code Evolution for Technology Mapping

cs.CE · 2026-04-29 · unverdicted · novelty 8.0

MappingEvolve applies LLMs through Planner-Evolver-Evaluator agents to evolve technology mapping code, delivering 10.04% area reduction versus ABC and 7.93% versus mockturtle on EPFL benchmarks.

Prism: Symbolic Superoptimization of Tensor Programs

cs.PL · 2026-04-16 · unverdicted · novelty 8.0

Prism is the first symbolic superoptimizer for tensor programs that uses sGraph for compact representation of program families, two-level search, e-graph equivalence checking, and auto-tuning to achieve up to 2.2x speedup over prior superoptimizers on LLM workloads.

InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

cs.CL · 2026-04-14 · unverdicted · novelty 8.0

InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.

Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

IDS is an agentic LLM system that incrementally synthesizes both implementation and proof for distributed key-value stores, succeeding on all 7 specs where prior agents succeeded on only 2.

Advancing Mathematics Research with AI-Driven Formal Proof Search

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

LLM-based agents in Lean solved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures at a few hundred dollars each.

Forecasting Scientific Progress with Artificial Intelligence

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.

What Do Evolutionary Coding Agents Evolve?

cs.NE · 2026-05-19 · unverdicted · novelty 7.0

Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.

Automated Kernel Discovery Towards Understanding High-dimensional Bayesian Optimization

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

An LLM-based evolutionary search discovers novel kernels for high-dimensional Bayesian optimization, achieving an average rank of 1.2 out of 17 on five benchmarks via two-stage proposal and LOO-CRPS selection.

Latent Heuristic Search: Continuous Optimization for Automated Algorithm Design

cs.AI · 2026-05-16 · unverdicted · novelty 7.0

Latent Heuristic Search performs continuous optimization over learned embeddings of heuristics, using normalizing flows and LLM prompting to discover competitive solvers for TSP, CVRP, KSP, and OBP.

Probabilistic Seasonal Streamflow Forecasting Across California's Sierra Nevada Watersheds with Agentic AI

physics.ao-ph · 2026-05-15 · unverdicted · novelty 7.0

An agentic AI workflow evolves an adaptive XGBoost quantile regression ensemble that reduces watershed-averaged forecast error by up to 29% versus California's operational forecasts for April-July runoff at 1-6 month leads across 23 Sierra Nevada sites.

Property-Guided LLM Program Synthesis for Planning

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

Property-guided LLM program synthesis with counterexample feedback creates direct heuristics for PDDL planning domains that require far fewer generations and less evaluation cost than score-based baselines.

From I/O to Code with Discovery Agent

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

DIO-Agent frames IO2Code as LLM-driven evolutionary search over programs with a Transformation Priority Premise to favor simple hypotheses, outperforming baselines on a new IO2CodeBench.

SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

SMCEvolve applies Sequential Monte Carlo sampling to LLM program search with adaptive resampling, mutation mixtures, and convergence control, delivering finite-sample complexity bounds and benchmark gains over prior systems.

SemaTune: Semantic-Aware Online OS Tuning with Large Language Models

cs.OS · 2026-05-14 · unverdicted · novelty 7.0

SemaTune uses LLM guidance with semantic context to tune up to 41 Linux OS parameters, delivering 72.5% performance gains over defaults and 153.3% over non-LLM baselines on 13 workloads while avoiding degraded states.

Test-Time Learning with an Evolving Library

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without parameter updates or supervision.

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

cs.LG · 2026-05-14 · conditional · novelty 7.0

FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.

Harnessing Agentic Evolution

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

Learning POMDP World Models from Observations with Language-Model Priors

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Pinductor leverages language-model priors to learn POMDP world models from limited trajectories, matching privileged-access methods in performance and exceeding tabular baselines in sample efficiency.

AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.

Budget-Efficient Automatic Algorithm Design via Code Graph

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

A code-graph and correction-based LLM search framework outperforms full-algorithm generation at equal token budgets on three combinatorial optimization problems.

citing papers explorer

Showing 50 of 147 citing papers.

FastKernels: Benchmarking GPU Kernel Generation in Production cs.LG · 2026-05-22 · conditional · none · ref 17 · internal anchor
FastKernels is a production-aligned benchmark covering 96.2% of HuggingFace Transformers that reveals state-of-the-art kernel agents deliver at most 0.94x aggregate speedup.
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty cs.CL · 2026-05-12 · unverdicted · none · ref 11 · internal anchor
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling cs.CL · 2026-05-08 · conditional · none · ref 42 · 2 links · internal anchor
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning tasks at low cost.
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems? cs.AI · 2026-05-07 · unverdicted · none · ref 54 · internal anchor
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
MappingEvolve: LLM-Driven Code Evolution for Technology Mapping cs.CE · 2026-04-29 · unverdicted · none · ref 21 · internal anchor
MappingEvolve applies LLMs through Planner-Evolver-Evaluator agents to evolve technology mapping code, delivering 10.04% area reduction versus ABC and 7.93% versus mockturtle on EPFL benchmarks.
Prism: Symbolic Superoptimization of Tensor Programs cs.PL · 2026-04-16 · unverdicted · none · ref 24 · internal anchor
Prism is the first symbolic superoptimizer for tensor programs that uses sGraph for compact representation of program families, two-level search, e-graph equivalence checking, and auto-tuning to achieve up to 2.2x speedup over prior superoptimizers on LLM workloads.
InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis cs.CL · 2026-04-14 · unverdicted · none · ref 20 · internal anchor
InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems cs.AI · 2026-05-22 · unverdicted · none · ref 35 · internal anchor
IDS is an agentic LLM system that incrementally synthesizes both implementation and proof for distributed key-value stores, succeeding on all 7 specs where prior agents succeeded on only 2.
Advancing Mathematics Research with AI-Driven Formal Proof Search cs.AI · 2026-05-21 · unverdicted · none · ref 46 · internal anchor
LLM-based agents in Lean solved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures at a few hundred dollars each.
Forecasting Scientific Progress with Artificial Intelligence cs.AI · 2026-05-21 · unverdicted · none · ref 10 · internal anchor
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.
What Do Evolutionary Coding Agents Evolve? cs.NE · 2026-05-19 · unverdicted · none · ref 2 · internal anchor
Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.
Automated Kernel Discovery Towards Understanding High-dimensional Bayesian Optimization cs.LG · 2026-05-18 · unverdicted · none · ref 31 · internal anchor
An LLM-based evolutionary search discovers novel kernels for high-dimensional Bayesian optimization, achieving an average rank of 1.2 out of 17 on five benchmarks via two-stage proposal and LOO-CRPS selection.
Latent Heuristic Search: Continuous Optimization for Automated Algorithm Design cs.AI · 2026-05-16 · unverdicted · none · ref 16 · internal anchor
Latent Heuristic Search performs continuous optimization over learned embeddings of heuristics, using normalizing flows and LLM prompting to discover competitive solvers for TSP, CVRP, KSP, and OBP.
Probabilistic Seasonal Streamflow Forecasting Across California's Sierra Nevada Watersheds with Agentic AI physics.ao-ph · 2026-05-15 · unverdicted · none · ref 63 · internal anchor
An agentic AI workflow evolves an adaptive XGBoost quantile regression ensemble that reduces watershed-averaged forecast error by up to 29% versus California's operational forecasts for April-July runoff at 1-6 month leads across 23 Sierra Nevada sites.
Property-Guided LLM Program Synthesis for Planning cs.AI · 2026-05-15 · unverdicted · none · ref 42 · internal anchor
Property-guided LLM program synthesis with counterexample feedback creates direct heuristics for PDDL planning domains that require far fewer generations and less evaluation cost than score-based baselines.
From I/O to Code with Discovery Agent cs.LG · 2026-05-14 · unverdicted · none · ref 17 · internal anchor
DIO-Agent frames IO2Code as LLM-driven evolutionary search over programs with a Transformation Priority Premise to favor simple hypotheses, outperforming baselines on a new IO2CodeBench.
SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution cs.AI · 2026-05-14 · unverdicted · none · ref 4 · internal anchor
SMCEvolve applies Sequential Monte Carlo sampling to LLM program search with adaptive resampling, mutation mixtures, and convergence control, delivering finite-sample complexity bounds and benchmark gains over prior systems.
SemaTune: Semantic-Aware Online OS Tuning with Large Language Models cs.OS · 2026-05-14 · unverdicted · none · ref 67 · internal anchor
SemaTune uses LLM guidance with semantic context to tune up to 41 Linux OS parameters, delivering 72.5% performance gains over defaults and 153.3% over non-LLM baselines on 13 workloads while avoiding degraded states.
Test-Time Learning with an Evolving Library cs.LG · 2026-05-14 · unverdicted · none · ref 31 · internal anchor
EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without parameter updates or supervision.
FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale cs.LG · 2026-05-14 · conditional · none · ref 26 · internal anchor
FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
Harnessing Agentic Evolution cs.AI · 2026-05-13 · unverdicted · none · ref 19 · internal anchor
AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
Learning POMDP World Models from Observations with Language-Model Priors cs.LG · 2026-05-13 · unverdicted · none · ref 23 · internal anchor
Pinductor leverages language-model priors to learn POMDP world models from limited trajectories, matching privileged-access methods in performance and exceeding tabular baselines in sample efficiency.
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents cs.LG · 2026-05-11 · unverdicted · none · ref 91 · internal anchor
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
Budget-Efficient Automatic Algorithm Design via Code Graph cs.AI · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
A code-graph and correction-based LLM search framework outperforms full-algorithm generation at equal token budgets on three combinatorial optimization problems.
Agentic MIP Research: Accelerated Constraint Handler Generation cs.AI · 2026-05-09 · unverdicted · none · ref 14 · internal anchor
LLM agents in a solver-aware harness recover global constraints from MIP formulations, generate executable propagation-only handlers for SCIP, and solve five additional MIPLIB 2017 instances.
AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design cs.AI · 2026-05-09 · unverdicted · none · ref 8 · internal anchor
AHD Agent trains a 4B-parameter LLM via agentic RL to actively use tools for automatic heuristic design, matching or exceeding larger baselines across eight domains with fewer evaluations.
AI co-mathematician: Accelerating mathematicians with agentic AI cs.AI · 2026-05-07 · unverdicted · none · ref 19 · 2 links · internal anchor
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents cs.AI · 2026-05-07 · unverdicted · none · ref 23 · internal anchor
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs cs.AI · 2026-05-07 · unverdicted · none · ref 2 · internal anchor
A knowledge-first approach to LLM-driven automatic heuristic design in combinatorial optimization yields better discovery efficiency, transfer, and generalization than code-centric baselines by formalizing a distortion-compression trade-off.
Out-of-the-Box Global Optimization for Packing Problems: New Models and Improved Solutions math.OC · 2026-05-06 · unverdicted · none · ref 7 · internal anchor
New nonlinear formulations for geometric packing problems, solved with FICO Xpress and SCIP, produce improved and first-known solutions for several variants.
Agentic-imodels: Evolving agentic interpretability tools via autoresearch cs.AI · 2026-05-05 · unverdicted · none · ref 51 · internal anchor
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
A revision of Litvak's conjecture on Gaussian minima and a volumetric zone conjecture math.PR · 2026-05-03 · unverdicted · none · ref 10 · internal anchor
Litvak's conjecture on minimizing moments of Gaussian minima is disproved by a cosine-based correlation matrix for small n and p, with a new conjecture proposed that this matrix is the general minimizer, supported conditionally via a volumetric extension of Fejes Tóth's zone conjecture.
BIM Information Extraction Through LLM-based Adaptive Exploration cs.CL · 2026-05-03 · unverdicted · none · ref 52 · internal anchor
LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses cs.CL · 2026-04-28 · unverdicted · none · ref 25 · 2 links · internal anchor
AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with cross-benchmark and cross-model transfer.
AI scientists produce results without reasoning scientifically cs.AI · 2026-04-20 · conditional · none · ref 25 · internal anchor
LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
Neuro-Symbolic ODE Discovery with Latent Grammar Flow cs.LG · 2026-04-17 · unverdicted · none · ref 26 · internal anchor
Latent Grammar Flow discovers ODEs by placing grammar-based equation representations in a discrete latent space, using a behavioral loss to cluster similar equations, and sampling via a discrete flow model guided by data fit and constraints.
Optimal and Near-Optimal Constructions for Bootstrap Percolation in Hypercubes math.CO · 2026-04-16 · unverdicted · none · ref 23 · internal anchor
m(Q_d;4) equals d(d² + 3d + 14)/24 + 1 for infinitely many d, with an O(d)-additive upper bound for all d.
BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM-Powered Heuristic Design cs.AI · 2026-04-14 · unverdicted · none · ref 9 · internal anchor
BEAM reformulates LLM-based heuristic design as bi-level optimization using GA for structures, MCTS for placeholders, and adaptive memory to outperform prior single-layer methods on CVRP and MIS tasks.
The AI Telco Engineer: Toward Autonomous Discovery of Wireless Communications Algorithms cs.AI · 2026-04-11 · unverdicted · none · ref 8 · internal anchor
An LLM-powered agentic framework autonomously designs competitive and sometimes superior explainable algorithms for wireless PHY and MAC layer tasks.
A Minimal Model of Representation Collapse: Frustration, Stop-Gradient, and Dynamics cond-mat.dis-nn · 2026-04-11 · unverdicted · none · ref 12 · internal anchor
A minimal embedding model shows representation collapse arises from frustrated samples through slow dynamics and is prevented by stop-gradient.
$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture cs.MS · 2026-04-08 · accept · none · ref 34 · internal anchor
k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics cs.DC · 2026-04-08 · unverdicted · none · ref 21 · internal anchor
Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.
SignalClaw: LLM-Guided Evolutionary Synthesis of Interpretable Traffic Signal Control Skills cs.AI · 2026-04-07 · unverdicted · none · ref 6 · internal anchor
SignalClaw synthesizes interpretable, composable traffic signal control skills through LLM-guided evolution that matches top baselines on routine SUMO scenarios and outperforms them on emergency and transit events while remaining editable by engineers.
Meta-Harness: End-to-End Optimization of Model Harnesses cs.AI · 2026-03-30 · unverdicted · none · ref 37 · internal anchor
Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across held-out models.
FactorEngine: A Program-level Knowledge-Infused Factor Mining Framework for Quantitative Investment cs.AI · 2026-03-17 · unverdicted · none · ref 12 · internal anchor
FactorEngine mines alpha factors as Turing-complete code via LLM-guided directional search, parameter separation, and a multi-agent pipeline that converts financial reports into executable programs, delivering higher IC/ICIR and Sharpe ratios than baselines in backtests.
MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants cs.AI · 2026-03-10 · unverdicted · none · ref 1 · internal anchor
MiniAppBench is the first benchmark for LLMs to generate principle-driven interactive HTML MiniApps from 500 tasks across six domains, evaluated by the agentic MiniAppEval framework on intention, static, and dynamic dimensions.
Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution cs.SE · 2026-02-27 · unverdicted · none · ref 20 · internal anchor
IFCodeEvolve synthesizes coding data via actor-schema co-evolution with MCTS, boosting a 32B model's performance to match proprietary SOTA on instruction following.
VeRO: An Evaluation Harness for Agents to Optimize Agents cs.AI · 2026-02-25 · unverdicted · none · ref 19 · internal anchor
VeRO supplies a versioned harness, benchmark suite, and empirical comparison of optimizer configurations for coding agents that improve other agents.
EvoDiagram: Agentic Editable Diagram Creation via Design Expertise Evolution cs.HC · 2026-02-20 · unverdicted · none · ref 8 · internal anchor
EvoDiagram uses a coordinated multi-agent system and design knowledge evolution to generate editable diagrams via canvas schema, with a new CanvasBench benchmark showing strong performance over baselines.
Learning to Discover at Test Time cs.LG · 2026-01-22 · unverdicted · none · ref 50 · internal anchor
TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.
