arXiv: 2506.13131 · v1 · submitted 2025-06-16 · 💻 cs.AI · cs.LG · cs.NE

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, Matej Balog

Pith reviewed 2026-05-10 21:21 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.NE

keywords evolutionary algorithms · large language models · matrix multiplication · algorithm discovery · code optimization · automated improvement

The pith

An evolutionary coding agent discovered a new algorithm for multiplying 4x4 complex matrices using 48 multiplications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AlphaEvolve as a system that directs large language models to edit code in repeated cycles, using evaluator feedback to keep only changes that are both correct and faster. This process is applied to real infrastructure tasks and to longstanding mathematical questions. One outcome is an algorithm for 4x4 complex matrix multiplication that needs only 48 scalar multiplications. The authors argue this is the first such improvement since Strassen's 1969 method, which required 49. The work claims the same loop can be pointed at many other algorithmic bottlenecks.
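The 49-multiplication baseline mentioned above can be reproduced with a short sketch. It applies Strassen's 7-multiplication scheme for 2x2 blocks recursively to a 4x4 matrix, which gives 7 × 7 = 49 scalar multiplications against 4³ = 64 for the schoolbook method. The helper names (`strassen`, `naive`, `split`) and the Gaussian-integer test inputs are choices made for this illustration; this is not the 48-multiplication procedure reported in the paper.

```python
import random

def add(X, Y):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def split(M):
    # Split a square matrix into its four quadrants.
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def strassen(A, B, counter):
    # counter[0] accumulates the number of scalar multiplications.
    if len(A) == 1:
        counter[0] += 1
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = strassen(add(A11, A22), add(B11, B22), counter)
    M2 = strassen(add(A21, A22), B11, counter)
    M3 = strassen(A11, sub(B12, B22), counter)
    M4 = strassen(A22, sub(B21, B11), counter)
    M5 = strassen(add(A11, A12), B22, counter)
    M6 = strassen(sub(A21, A11), add(B11, B12), counter)
    M7 = strassen(sub(A12, A22), add(B21, B22), counter)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom

def naive(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Small Gaussian integers keep the complex arithmetic exact in floating point.
rng = random.Random(0)
A = [[complex(rng.randint(-5, 5), rng.randint(-5, 5)) for _ in range(4)] for _ in range(4)]
B = [[complex(rng.randint(-5, 5), rng.randint(-5, 5)) for _ in range(4)] for _ in range(4)]

counter = [0]
C = strassen(A, B, counter)
print(counter[0])        # 49
print(C == naive(A, B))  # True
```

Counting only the products at the 1x1 base case matches the convention in the claim: additions and subtractions are not counted.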

Core claim

AlphaEvolve runs an evolutionary loop in which language models propose direct edits to source code; each candidate is tested by one or more evaluator functions that return signals about correctness and runtime cost. When applied to the matrix-multiplication problem, the system located a procedure that multiplies two 4-by-4 complex matrices with 48 scalar multiplications, a reduction from the 49 required by Strassen's algorithm.

What carries the argument

The evolutionary pipeline in which language models generate code edits that are retained only if evaluator functions confirm both functional correctness and measurable performance gains.
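A minimal sketch of such a propose-evaluate-select loop follows. Everything in it is a stand-in chosen for illustration: `propose_edit` uses a random perturbation where the paper uses LLM-generated code edits, a candidate is a short list of numbers rather than a program, and `evaluate` is a toy scoring function. It shows the shape of the loop only, not AlphaEvolve's implementation.

```python
import random

def evaluate(candidate):
    # Hypothetical evaluator. A candidate counts as "correct" if every entry
    # lies in [-10, 10]; its score is the negative squared distance to a
    # fixed target, so higher is better.
    target = [3.0, -1.0, 4.0]
    if any(abs(c) > 10 for c in candidate):
        return False, float("-inf")
    return True, -sum((c - t) ** 2 for c, t in zip(candidate, target))

def propose_edit(parent, rng):
    # Stand-in for the LLM proposal step: perturb one entry at random.
    child = list(parent)
    i = rng.randrange(len(child))
    child[i] += rng.gauss(0.0, 1.0)
    return child

def evolve(seed_candidate, generations=500, population_size=8, seed=0):
    rng = random.Random(seed)
    ok, score = evaluate(seed_candidate)
    assert ok, "the starting candidate must pass the evaluator"
    population = [(score, seed_candidate)]
    for _ in range(generations):
        _, parent = rng.choice(population)       # pick a parent
        child = propose_edit(parent, rng)        # propose an edit
        ok, child_score = evaluate(child)        # score it
        if not ok:
            continue                             # discard incorrect candidates
        population.append((child_score, child))
        population.sort(key=lambda pair: pair[0], reverse=True)
        del population[population_size:]         # keep only the best few
    return population[0]

best_score, best = evolve([0.0, 0.0, 0.0])
print(best_score, best)
```

Because the loop never sees anything except the evaluator's verdict, whatever the evaluator fails to check is invisible to the search as well.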

If this is right

Scheduling algorithms for data centers can be made more efficient by the same code-evolution process.
Hardware-accelerator circuit designs can be simplified while preserving function.
Training runs of the underlying language model itself can be accelerated through discovered code changes.
Other open problems in mathematics and computer science can yield provably better algorithms when the same loop is applied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If reliable evaluators can be written for proof verification or physical simulation, the approach could be directed at open conjectures or model discovery.
Scaling the number of parallel evolutionary runs or pairing the system with stronger base models would likely increase the rate of new findings.
The method offers a route for domain experts to supply only an evaluator and a starting code skeleton while the agent handles the search.

Load-bearing premise

The evaluator functions reliably detect both correctness and genuine performance improvements without overlooking subtle errors or accepting superficial code changes.
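One way this premise can fail is easy to demonstrate. In the sketch below, a deliberately wrong candidate (it multiplies in the wrong order) passes an evaluator that samples only diagonal matrices, and fails one that samples dense matrices. The function names and input generators are hypothetical and exist only for this illustration; nothing here describes the paper's evaluators.

```python
import random

def naive_matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def wrong_matmul(A, B):
    # Deliberately incorrect candidate: computes B*A instead of A*B.
    return naive_matmul(B, A)

def diagonal_input(rng, n=4):
    # Narrow input distribution: diagonal matrices always commute.
    return [[complex(rng.randint(-5, 5), rng.randint(-5, 5)) if i == j else 0j
             for j in range(n)] for i in range(n)]

def dense_input(rng, n=4):
    # Broad input distribution: every entry is a random Gaussian integer.
    return [[complex(rng.randint(-5, 5), rng.randint(-5, 5)) for _ in range(n)]
            for _ in range(n)]

def passes(candidate, make_input, trials=50, seed=0):
    # Toy evaluator: accept the candidate if it matches the reference
    # on every sampled input pair.
    rng = random.Random(seed)
    for _ in range(trials):
        A, B = make_input(rng), make_input(rng)
        if candidate(A, B) != naive_matmul(A, B):
            return False
    return True

narrow_verdict = passes(wrong_matmul, diagonal_input)
dense_verdict = passes(wrong_matmul, dense_input)
print(narrow_verdict, dense_verdict)
```

The narrow evaluator accepts the wrong candidate because diagonal matrices commute, which is the kind of blind spot the premise has to rule out.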

What would settle it

An independent check that either shows the reported 4-by-4 complex matrix algorithm is incorrect or that it requires more than 48 scalar multiplications.
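A check of this kind can be done exactly, with no sampling, when the algorithm is written as a bilinear decomposition: factor matrices U, V, W such that product r is (Σ U[r][a]·A_a)(Σ V[r][b]·B_b) and output c is Σ W[r][c]·m_r. The algorithm is correct for all inputs exactly when Σ_r U[r][a]·V[r][b]·W[r][c] equals the matrix-multiplication tensor, and the number of rows is the multiplication count. The sketch below runs this check on Strassen's well-known 7-product scheme for 2x2 matrices; it assumes row-major flattening and does not reproduce the paper's 4x4 factors, which would be checked the same way with n = 4 and 48 rows.

```python
def matmul_tensor(n):
    # T[a][b][c] = 1 when entry a of A times entry b of B contributes to
    # entry c of C = A*B (row-major flattening), else 0.
    size = n * n
    T = [[[0] * size for _ in range(size)] for _ in range(size)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                T[i * n + k][k * n + j][i * n + j] = 1
    return T

def verify_decomposition(U, V, W, n):
    # Exact integer check that the rank-len(U) decomposition equals T.
    T = matmul_tensor(n)
    size = n * n
    return all(
        sum(u[a] * v[b] * w[c] for u, v, w in zip(U, V, W)) == T[a][b][c]
        for a in range(size) for b in range(size) for c in range(size)
    )

# Strassen's seven products; columns are ordered [11, 12, 21, 22].
U = [[1, 0, 0, 1], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 0, 1],
     [1, 1, 0, 0], [-1, 0, 1, 0], [0, 1, 0, -1]]
V = [[1, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, -1], [-1, 0, 1, 0],
     [0, 0, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
W = [[1, 0, 0, 1], [0, 0, 1, -1], [0, 1, 0, 1], [1, 0, 1, 0],
     [-1, 1, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0]]

is_valid = verify_decomposition(U, V, W, n=2)
num_multiplications = len(U)
print(is_valid, num_multiplications)  # True 7
```

For a decomposition with complex coefficients, the same comparison would need exact arithmetic (for example, rational real and imaginary parts) rather than floating point.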

Original abstract

In this white paper, we present AlphaEvolve, an evolutionary coding agent that substantially enhances capabilities of state-of-the-art LLMs on highly challenging tasks such as tackling open scientific problems or optimizing critical pieces of computational infrastructure. AlphaEvolve orchestrates an autonomous pipeline of LLMs, whose task is to improve an algorithm by making direct changes to the code. Using an evolutionary approach, continuously receiving feedback from one or more evaluators, AlphaEvolve iteratively improves the algorithm, potentially leading to new scientific and practical discoveries. We demonstrate the broad applicability of this approach by applying it to a number of important computational problems. When applied to optimizing critical components of large-scale computational stacks at Google, AlphaEvolve developed a more efficient scheduling algorithm for data centers, found a functionally equivalent simplification in the circuit design of hardware accelerators, and accelerated the training of the LLM underpinning AlphaEvolve itself. Furthermore, AlphaEvolve discovered novel, provably correct algorithms that surpass state-of-the-art solutions on a spectrum of problems in mathematics and computer science, significantly expanding the scope of prior automated discovery methods (Romera-Paredes et al., 2023). Notably, AlphaEvolve developed a search algorithm that found a procedure to multiply two $4 \times 4$ complex-valued matrices using $48$ scalar multiplications; offering the first improvement, after 56 years, over Strassen's algorithm in this setting. We believe AlphaEvolve and coding agents like it can have a significant impact in improving solutions of problems across many areas of science and computation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

AlphaEvolve shows an LLM evolutionary loop can edit full code to beat Strassen on 4x4 complex matrices and tweak Google infrastructure, but the matrix claim hinges on unshown evaluator details.

The paper's core contribution is an evolutionary agent that lets LLMs make direct edits to code, scores the results with evaluators, and iterates. It reports three practical wins at Google scale plus one mathematical discovery: a 48-multiplication algorithm for 4x4 complex matrix multiplication that improves on Strassen's 49 after 56 years. That matrix result is the clearest new claim relative to the cited FunSearch work (Romera-Paredes et al., 2023), which searched over a single function rather than editing full programs.

The infrastructure examples (data-center scheduling, accelerator circuit simplification, and the self-acceleration of the underlying LLM training) show the loop can be applied to production code without obvious circularity in the search process. Those applications are the part that feels most grounded, provided the measurements are reproducible.

The matrix result is presented as empirically discovered rather than algebraically derived, which fits the method but still requires the evaluator to have caught every possible functional mismatch over the complex field. The abstract gives no test-suite description, no symbolic verification steps, and no sample of the discovered code, so the stress-test concern about evaluator blind spots is fair until the methods section shows how equivalence and multiplication count were confirmed. Minor gaps include missing error bars on the infrastructure speedups and no discussion of how the search avoided solutions that only work on narrow input distributions.

The paper is aimed at researchers building LLM agents for algorithmic discovery and at teams that optimize large computational stacks. It is worth a serious referee because the claims are concrete enough to verify or falsify, and the extension beyond FunSearch is straightforward to evaluate once the evaluator logic is laid out. I would send it to review rather than desk-reject.

Referee Report

2 major / 0 minor

Summary. The paper introduces AlphaEvolve, an evolutionary coding agent that orchestrates LLMs to iteratively modify code for algorithm improvement under evaluator feedback. It reports applications to Google-scale infrastructure tasks (data-center scheduling, hardware-accelerator circuit simplification, and acceleration of the underlying LLM training) as well as to open mathematical and algorithmic problems, with the headline result being a search-derived procedure for multiplying two 4×4 complex matrices using only 48 scalar multiplications—the first improvement over Strassen’s 49-multiplication construction in 56 years.

Significance. If the reported infrastructure wins and the new matrix-multiplication algorithm are independently verified, the work would demonstrate that LLM-orchestrated evolutionary search can produce both practical computational gains and novel, non-obvious algorithmic results, thereby extending the scope of automated discovery methods beyond the symbolic and theorem-proving settings explored in prior work.

major comments (2)

[Abstract] The central claim that AlphaEvolve produced a 'provably correct' 4×4 complex matrix multiplication algorithm using 48 scalar multiplications is presented without any description of the evaluator’s test suite, symbolic verification method, input coverage over the complex field, or the discovered code itself. Because the evaluator is the sole source of both the correctness label and the performance delta, this omission is load-bearing for the discovery claim.

[Abstract] The statements that AlphaEvolve 'developed a more efficient scheduling algorithm,' 'found a functionally equivalent simplification in the circuit design,' and 'accelerated the training of the LLM' are given without quantitative metrics, baselines, statistical significance, or error bars, preventing assessment of whether these constitute genuine, reproducible improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the abstract. We have revised the abstract to include additional details on verification methods and quantitative metrics while preserving its concise nature. We respond to each major comment below.

Referee: [Abstract] The central claim that AlphaEvolve produced a 'provably correct' 4×4 complex matrix multiplication algorithm using 48 scalar multiplications is presented without any description of the evaluator’s test suite, symbolic verification method, input coverage over the complex field, or the discovered code itself. Because the evaluator is the sole source of both the correctness label and the performance delta, this omission is load-bearing for the discovery claim.

Authors: We agree that the abstract would benefit from more context on verification. The full manuscript describes the evaluator in the section on mathematical discovery tasks, which combines symbolic algebraic verification (using exact equality checks via computer algebra) with dense random sampling over the complex field to confirm correctness for arbitrary inputs. The discovered procedure is presented explicitly in the results. We have revised the abstract to note that the 48-multiplication algorithm was obtained and verified through this symbolic and numerical process, with pointers to the full description.

revision: yes

Referee: [Abstract] The statements that AlphaEvolve 'developed a more efficient scheduling algorithm,' 'found a functionally equivalent simplification in the circuit design,' and 'accelerated the training of the LLM' are given without quantitative metrics, baselines, statistical significance, or error bars, preventing assessment of whether these constitute genuine, reproducible improvements.

Authors: We acknowledge that quantitative support strengthens the claims. The manuscript body reports the specific metrics, baselines, and statistical details (including multiple runs and significance tests) for each infrastructure application. We have revised the abstract to include concise summaries of these performance deltas and verification approaches.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical search results independent of inputs

The paper presents AlphaEvolve as an evolutionary coding agent that applies LLMs to modify code and uses separate evaluators to score correctness and performance. The headline discovery—a 4×4 complex matrix multiplication algorithm using 48 scalar multiplications—is reported as an output of this search process rather than any algebraic derivation, fitted parameter, or self-referential definition. No equations, uniqueness theorems, or ansatzes are introduced that reduce the claimed result to the method's own inputs by construction. Prior citations (e.g., Romera-Paredes et al. 2023) supply background on automated discovery but do not carry the load of the new empirical finding, which rests on external verification of the discovered code. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on the assumption that LLMs can generate useful code edits and that evaluators can be trusted to guide evolution toward correct solutions; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption: LLMs can produce syntactically valid and semantically useful code modifications when prompted with performance feedback. Invoked implicitly by the evolutionary pipeline description.

domain assumption: Evaluator programs provide accurate and complete signals of correctness and performance. Required for the evolutionary selection step to converge on improved algorithms.

pith-pipeline@v0.9.0 · 5658 in / 1365 out tokens · 44442 ms · 2026-05-10T21:21:46.086365+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
cs.CL 2026-05 unverdicted novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
cs.CL 2026-05 conditional novelty 8.0

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
cs.AI 2026-05 unverdicted novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
MappingEvolve: LLM-Driven Code Evolution for Technology Mapping
cs.CE 2026-04 unverdicted novelty 8.0

MappingEvolve applies LLMs through Planner-Evolver-Evaluator agents to evolve technology mapping code, delivering 10.04% area reduction versus ABC and 7.93% versus mockturtle on EPFL benchmarks.
Prism: Symbolic Superoptimization of Tensor Programs
cs.PL 2026-04 unverdicted novelty 8.0

Prism is the first symbolic superoptimizer for tensor programs that uses sGraph for compact representation of program families, two-level search, e-graph equivalence checking, and auto-tuning to achieve up to 2.2x spe...
InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis
cs.CL 2026-04 unverdicted novelty 8.0

InfiniteScienceGym procedurally generates unbounded scientific repositories with exact ground-truth QA pairs to benchmark LLMs on data reasoning, abstention, and tool use without static datasets.
SemaTune: Semantic-Aware Online OS Tuning with Large Language Models
cs.OS 2026-05 unverdicted novelty 7.0

SemaTune uses LLM guidance with semantic context to tune up to 41 Linux OS parameters, delivering 72.5% performance gains over defaults and 153.3% over non-LLM baselines on 13 workloads while avoiding degraded states.
Test-Time Learning with an Evolving Library
cs.LG 2026-05 unverdicted novelty 7.0

EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...
FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
cs.LG 2026-05 conditional novelty 7.0

FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
Harnessing Agentic Evolution
cs.AI 2026-05 unverdicted novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
Learning POMDP World Models from Observations with Language-Model Priors
cs.LG 2026-05 unverdicted novelty 7.0

Pinductor leverages language-model priors to learn POMDP world models from limited trajectories, matching privileged-access methods in performance and exceeding tabular baselines in sample efficiency.
Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
cs.LG 2026-05 unverdicted novelty 7.0

A softmax-weighted centroid of the local top-K documents interpolated with the query improves nDCG@10 for frozen embedding models across seven families on held-out BEIR data.
Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
cs.LG 2026-05 unverdicted novelty 7.0

Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
cs.LG 2026-05 unverdicted novelty 7.0

AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
Budget-Efficient Automatic Algorithm Design via Code Graph
cs.AI 2026-05 unverdicted novelty 7.0

A code-graph and correction-based LLM search framework outperforms full-algorithm generation at equal token budgets on three combinatorial optimization problems.
Agentic MIP Research: Accelerated Constraint Handler Generation
cs.AI 2026-05 unverdicted novelty 7.0

LLM agents in a solver-aware harness recover global constraints from MIP formulations, generate executable propagation-only handlers for SCIP, and solve five additional MIPLIB 2017 instances.
AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design
cs.AI 2026-05 unverdicted novelty 7.0

AHD Agent trains a 4B-parameter LLM via agentic RL to actively use tools for automatic heuristic design, matching or exceeding larger baselines across eight domains with fewer evaluations.
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
cs.CL 2026-05 unverdicted novelty 7.0

AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.
AI co-mathematician: Accelerating mathematicians with agentic AI
cs.AI 2026-05 unverdicted novelty 7.0

An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
cs.AI 2026-05 unverdicted novelty 7.0

Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar op...
Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs
cs.AI 2026-05 unverdicted novelty 7.0

A knowledge-first approach to LLM-driven automatic heuristic design in combinatorial optimization yields better discovery efficiency, transfer, and generalization than code-centric baselines by formalizing a distortio...
Out-of-the-Box Global Optimization for Packing Problems: New Models and Improved Solutions
math.OC 2026-05 unverdicted novelty 7.0

New nonlinear formulations for geometric packing problems, solved with FICO Xpress and SCIP, produce improved and first-known solutions for several variants.
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
cs.AI 2026-05 unverdicted novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
A revision of Litvak's conjecture on Gaussian minima and a volumetric zone conjecture
math.PR 2026-05 unverdicted novelty 7.0

Litvak's conjecture on minimizing moments of Gaussian minima is disproved by a cosine-based correlation matrix for small n and p, with a new conjecture proposed that this matrix is the general minimizer, supported con...
BIM Information Extraction Through LLM-based Adaptive Exploration
cs.CL 2026-05 unverdicted novelty 7.0

LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
cs.CL 2026-04 unverdicted novelty 7.0

AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with transfer gains across models and benchmarks.
AI scientists produce results without reasoning scientifically
cs.AI 2026-04 conditional novelty 7.0

LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
Neuro-Symbolic ODE Discovery with Latent Grammar Flow
cs.LG 2026-04 unverdicted novelty 7.0

Latent Grammar Flow discovers ODEs by placing grammar-based equation representations in a discrete latent space, using a behavioral loss to cluster similar equations, and sampling via a discrete flow model guided by d...
Optimal and Near-Optimal Constructions for Bootstrap Percolation in Hypercubes
math.CO 2026-04 unverdicted novelty 7.0

m(Q_d;4) equals d(d² + 3d + 14)/24 + 1 for infinitely many d, with an O(d)-additive upper bound for all d.
BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM-Powered Heuristic Design
cs.AI 2026-04 unverdicted novelty 7.0

BEAM reformulates LLM-based heuristic design as bi-level optimization using GA for structures, MCTS for placeholders, and adaptive memory to outperform prior single-layer methods on CVRP and MIS tasks.
The AI Telco Engineer: Toward Autonomous Discovery of Wireless Communications Algorithms
cs.AI 2026-04 unverdicted novelty 7.0

An LLM-powered agentic framework autonomously designs competitive and sometimes superior explainable algorithms for wireless PHY and MAC layer tasks.
A Minimal Model of Representation Collapse: Frustration, Stop-Gradient, and Dynamics
cond-mat.dis-nn 2026-04 unverdicted novelty 7.0

A minimal embedding model shows representation collapse arises from frustrated samples through slow dynamics and is prevented by stop-gradient.
M$^\star$: Every Task Deserves Its Own Memory Harness
cs.PL 2026-04 unverdicted novelty 7.0

M* evolves distinct Python memory programs per task via population-based reflective search, outperforming fixed-memory baselines on conversation, planning, and reasoning benchmarks.
$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture
cs.MS 2026-04 accept novelty 7.0

k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
Autopoiesis: A Self-Evolving System Paradigm for LLM Serving Under Runtime Dynamics
cs.DC 2026-04 unverdicted novelty 7.0

Autopoiesis uses LLM-driven program synthesis to evolve serving policies online during deployment, delivering up to 53% and average 34% gains over prior LLM serving systems under runtime dynamics.
AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery
cs.CL 2026-04 unverdicted novelty 7.0

AutoSOTA uses eight specialized agents to replicate and optimize models from recent AI papers, producing 105 new SOTA results in about five hours per paper on average.
SignalClaw: LLM-Guided Evolutionary Synthesis of Interpretable Traffic Signal Control Skills
cs.AI 2026-04 unverdicted novelty 7.0

SignalClaw synthesizes interpretable, composable traffic signal control skills through LLM-guided evolution that matches top baselines on routine SUMO scenarios and outperforms them on emergency and transit events whi...
Meta-Harness: End-to-End Optimization of Model Harnesses
cs.AI 2026-03 unverdicted novelty 7.0

Meta-Harness discovers improved harness code for LLMs via agentic search over prior execution traces, yielding 7.7-point gains on text classification with 4x fewer tokens and 4.7-point gains on math reasoning across h...
FactorEngine: A Program-level Knowledge-Infused Factor Mining Framework for Quantitative Investment
cs.AI 2026-03 unverdicted novelty 7.0

FactorEngine mines alpha factors as Turing-complete code via LLM-guided directional search, parameter separation, and a multi-agent pipeline that converts financial reports into executable programs, delivering higher ...
MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants
cs.AI 2026-03 unverdicted novelty 7.0

MiniAppBench is the first benchmark for LLMs to generate principle-driven interactive HTML MiniApps from 500 tasks across six domains, evaluated by the agentic MiniAppEval framework on intention, static, and dynamic d...
OpenDeepThink: Parallel Reasoning via Bradley--Terry Aggregation
cs.AI 2026-05 unverdicted novelty 6.0

OpenDeepThink improves LLM reasoning by ranking parallel candidate traces via Bradley-Terry aggregation of LLM pairwise judgments, achieving a +405 Codeforces Elo gain on Gemini 3.1 Pro after eight rounds.
Epistemic Uncertainty for Test-Time Discovery
cs.LG 2026-05 unverdicted novelty 6.0

UG-TTT adds epistemic uncertainty measured by adapter disagreement as an exploration bonus in RL for LLMs, raising maximum reward and diversity on scientific discovery benchmarks.
What should post-training optimize? A test-time scaling law perspective
cs.LG 2026-05 unverdicted novelty 6.0

Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
LLM4Branch: Large Language Model for Discovering Efficient Branching Policies of Integer Programs
cs.AI 2026-05 unverdicted novelty 6.0

LLM4Branch discovers branching policies for MILP solvers as LLM-generated executable programs whose parameters are tuned via zeroth-order optimization on solver performance.
SCALAR: A Neurosymbolic Framework for Automated Conjecture and Reasoning in Quantum Circuit Analysis
quant-ph 2026-05 unverdicted novelty 6.0

SCALAR generates conjectures linking optimal QAOA parameters to graph invariants, recovers known periodicity and parameter-transfer properties, and identifies correlations with optimization landscapes across thousands...
Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon
cs.LG 2026-05 unverdicted novelty 6.0

Metal-Sci is a benchmark and harness for LLM evolutionary optimization of Apple Silicon Metal kernels that uses held-out sizes to detect silent regressions missed by in-distribution scores.
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
cs.LG 2026-05 unverdicted novelty 6.0

MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
cs.LG 2026-05 unverdicted novelty 6.0

FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
Open-Ended Task Discovery via Bayesian Optimization
cs.AI 2026-05 unverdicted novelty 6.0

Generate-Select-Refine is an open-ended Bayesian optimization method that generates tasks and concentrates evaluations on the best one with only logarithmic regret overhead relative to standard single-task optimization.
AI co-mathematician: Accelerating mathematicians with agentic AI
cs.AI 2026-05 unverdicted novelty 6.0

An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.
Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery
cs.AI 2026-05 unverdicted novelty 6.0

Expert mathematicians using an AI coding agent for discovery engage in repeated cycles of intentmaking to define goals and sensemaking to interpret outputs.
CodeEvolve: LLM-Driven Evolutionary Optimization with Runtime-Enriched Target Selection for Multi-Language Code Enhancement
cs.SE 2026-05 unverdicted novelty 6.0

CodeEvolve uses runtime-guided target selection and MCTS-augmented LLM evolution to optimize real Java and Apex code, reporting 15.22x average speedup on seven hotspots while preserving correctness.
CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment
cs.AI 2026-05 unverdicted novelty 6.0

CASCADE enables LLMs to continually adapt at deployment via case-based episodic memory and contextual bandits, improving macro-averaged success by 20.9% over zero-shot on 16 tasks spanning medicine, law, code, and robotics.
Programmatic Context Augmentation for LLM-based Symbolic Regression
cs.AI 2026-05 unverdicted novelty 6.0

Programmatic context augmentation lets LLM-based symbolic regression perform code-driven data analysis during search, yielding superior efficiency and accuracy over baselines on LLM-SRBench.
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
cs.AI 2026-05 unverdicted novelty 6.0

FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.
Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM Serving
cs.NI 2026-04 unverdicted novelty 6.0

Switchless topologies such as 3D full-mesh are 20.6-56.2% more cost-effective than scale-up networks for MoE LLM serving, with current link bandwidths over-provisioned by up to 27%.
CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness
q-bio.NC 2026-04 unverdicted novelty 6.0

CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
Agentic Architect: An Agentic AI Framework for Architecture Design Exploration and Optimization
cs.AI 2026-04 accept novelty 6.0

An LLM-driven agentic system evolves microarchitectural policies for cache replacement, data prefetching, and branch prediction, producing designs that match or exceed prior state-of-the-art in IPC on standard benchmarks.
GR-Evolve: Design-Adaptive Global Routing via LLM-Driven Algorithm Evolution
cs.AR 2026-04 unverdicted novelty 6.0

GR-Evolve applies LLM-driven code evolution to global routing, reporting up to 8.72% post-detailed-routing wirelength reduction on seven benchmarks across three technology nodes.
PrismaDV: Automated Task-Aware Data Unit Test Generation
cs.LG 2026-04 unverdicted novelty 6.0

PrismaDV generates task-aware data unit tests by jointly analyzing downstream code and dataset profiles, outperforming task-agnostic baselines on new benchmarks spanning 60 tasks, with SIFTA enabling automatic prompt ...

Reference graph

Works this paper leans on

123 extracted references · 123 canonical work pages · cited by 88 Pith papers · 4 internal anchors

[1]

Abramson, J

J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger, L. Willmore, A. J. Ballard, J. Bambrick, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016):493–500, 2024

work page 2024
[2]

Aglietti, I

V. Aglietti, I. Ktena, J. Schrouff, E. Sgouritsa, F. J. R. Ruiz, A. Malek, A. Bellot, and S. Chiappa. FunBO: Discovering acquisition functions for Bayesian optimization with FunSearch. In International Conference on Machine Learning, 2025

work page 2025
[3]

AI achieves silver-medal standard solving International Mathematical Olympiad problems, 2024

AlphaProof and AlphaGeometry teams. AI achieves silver-medal standard solving International Mathematical Olympiad problems, 2024. URL https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level

work page 2024
[4]

Neural Machine Translation by Jointly Learning to Align and Translate

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014

work page internal anchor Pith review arXiv 2014
[5]

Banzhaf, P

W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francone. Genetic Programming: An Introduction on the Automatic Evolution of Computer Programs and Its Applications. The Morgan Kaufmann Series in Artificial Intelligence, 1998

work page 1998
[6]

Bausch, A

J. Bausch, A. W. Senior, F. J. H. Heras, T. Edlich, A. Davies, M. Newman, C. Jones, K. Satzinger, M. Y. Niu, S. Blackwell, G. Holland, D. Kafri, J. Atalaya, C. Gidney, D. Hassabis, S. Boixo, H. Neven, and P. Kohli. Learning high-accuracy error decoding for quantum processors. Nature, 635(8040):834–840, 2024. doi: 10.1038/s41586-024-08148-8

work page doi:10.1038/s41586-024-08148-8 2024
[7]

D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023. doi: 10.1038/s41586-023-06792-0

work page doi:10.1038/s41586-023-06792-0 2023
[8]

Boyvalenkov, S

P. Boyvalenkov, S. Dodunekov, and O. Musin. A survey on the kissing numbers. Serdica Math. J., 38(4):507–522, 2012. ISSN 1310-6600

work page 2012
[9]

Bradbury, R

J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/jax-ml/jax.

work page 2018
[10]

Bran, A., Cox, S., Schilter, O., Baldassari, C., White, A.D., Schwaller, P.: Augmenting large language models with chemistry tools

A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller. Augmenting large language models with chemistry tools. Nature Machine Intelligence, 6(5):525–535, 2024. doi: 10.1038/s42256-024-00832-8

work page doi:10.1038/s42256-024-00832-8 2024
[11]

A. M. Bran, T. A. Neukomm, D. P. Armstrong, Z. Jončev, and P. Schwaller. Chemical reasoning in LLMs unlocks steerable synthesis planning and reaction mechanism elucidation. In arXiv preprint arXiv:2503.08537, 2025

work page arXiv 2025
[12]

Caldas Ramos, C

M. Caldas Ramos, C. J. Collison, and A. D. White. A review of large language models and autonomous agents in chemistry. Chemical Science, 16:2514–2572, 2025. doi: 10.1039/D4SC03921A

work page doi:10.1039/d4sc03921a 2025
[13]

P. S. Castro, N. Tomasev, A. Anand, N. Sharma, R. Mohanta, A. Dev, K. Perlin, S. Jain, K. Levin, N. Éltető, W. Dabney, A. Novikov, G. C. Turner, M. K. Eckstein, N. D. Daw, K. J. Miller, and K. L. Stachenfeld. Discovering symbolic cognitive models from human and animal behavior. In International Conference on Machine Learning, 2025

work page 2025
[14]

A. Chen, D. M. Dohan, and D. R. So. EvoPrompting: Language models for code-level neural architecture search. In Advances in Neural Information Processing Systems, 2023

work page 2023
[15]

T. Chen, B. Xu, and K. Devleker. Automating GPU kernel generation with DeepSeek-R1 and inference time scaling, 2025. URL https://developer.nvidia.com/blog/automating-gpu-kernel-generation-with-deepseek-r1-and-inference-time-scaling

work page 2025
[16]

X. Chen, C. Liang, D. Huang, E. Real, K. Wang, H. Pham, X. Dong, T. Luong, C.-J. Hsieh, Y. Lu, and Q. V. Le. Symbolic discovery of optimization algorithms. Advances in Neural Information Processing Systems, 2023

work page 2023
[17]

On suprema of autoconvolutions with an application to Sidon sets. Proceedings of the American Mathematical Society, 145(8):3191–3200, 2017

A. Cloninger and S. Steinerberger. On suprema of autoconvolutions with an application to Sidon sets. Proceedings of the American Mathematical Society, 145(8):3191–3200, 2017

work page 2017
[18]

Cohn and F

H. Cohn and F. Gonçalves. An optimal uncertainty principle in twelve dimensions via modular forms. Inventiones mathematicae, 217:799–831, 2019

work page 2019
[19]

K. M. Collins, A. Q. Jiang, S. Frieder, L. Wong, M. Zilka, U. Bhatt, T. Lukasiewicz, Y. Wu, J. B. Tenenbaum, W. Hart, et al. Evaluating language models for mathematics through interactions. Proceedings of the National Academy of Sciences, 121(24):e2318124121, 2024

work page 2024
[20]

K. D. Cooper, D. Subramanian, and L. Torczon. Adaptive optimizing compilers for the 21st century. The Journal of Supercomputing, 23:7–22, 2002

work page 2002
[21]

M. Cranmer. Interpretable machine learning for science with PySR and SymbolicRegression.jl. arXiv preprint arXiv:2305.01582, 2023

work page internal anchor Pith review arXiv 2023
[22]

T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.

work page 2022
[23]

Nature 600(7887), 70–74 (2021) https://doi.org/10.1038/s41586-021-04086-x

A. Davies, P. Veličković, L. Buesing, S. Blackwell, D. Zheng, N. Tomašev, R. Tanburn, P. Battaglia, C. Blundell, A. Juhász, M. Lackenby, G. Williamson, D. Hassabis, and P. Kohli. Advancing mathematics by guiding human intuition with AI. Nature, 600(7887):70–74, 2021. doi: 10.1038/s41586-021-04086-x

work page doi:10.1038/s41586-021-04086-x 2021
[24]

J. S. Ellenberg, C. S. Fraser-Taliente, T. R. Harvey, K. Srivastava, and A. V. Sutherland. Generative modelling for mathematical discovery. arXiv preprint arXiv:2503.11061, 2025

work page arXiv 2025
[25]

P. Erdős. Some remarks on number theory. Riveon Lematematika, 9:45–48, 1955

work page 1955
[26]

A. Fawzi, M. Balog, A. Huang, T. Hubert, B. Romera-Paredes, M. Barekatain, A. Novikov, F. J. R. Ruiz, J. Schrittwieser, G. Swirszcz, D. Silver, D. Hassabis, and P. Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, 2022. doi: 10.1038/s41586-022-05172-4

work page doi:10.1038/s41586-022-05172-4 2022
[27]

Promptbreeder: Self-referential self-improvement via prompt evolution

C. Fernando, D. Banarse, H. Michalewski, S. Osindero, and T. Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797, 2023

work page arXiv 2023
[28]

Ferruz and B

N. Ferruz and B. Höcker. Controllable protein design with language models. Nature Machine Intelligence, 4(6):521–532, 2022

work page 2022
[29]

Friedman

E. Friedman. Erich’s Packing Center. https://erich-friedman.github.io/packing/, 2025. Accessed: 2025-04-22

work page 2025
[30]

Frohnert, X

F. Frohnert, X. Gu, M. Krenn, and E. van Nieuwenburg. Discovering emergent connections in quantum physics research via dynamic word embeddings. Machine Learning: Science and Technology, 6(1):015029, 2025. doi: 10.1088/2632-2153/adb00a

work page doi:10.1088/2632-2153/adb00a 2025
[31]

Ganzhinov

M. Ganzhinov. Highly symmetric lines. In arXiv preprint arXiv:2207.08266v1, 2022

work page arXiv 2022
[32]

Gemini 2.5: Our most intelligent AI model, 2025

Gemini team. Gemini 2.5: Our most intelligent AI model, 2025. URL https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025

work page 2025
[33]

Gonçalves, D

F. Gonçalves, D. O. e Silva, and S. Steinerberger. Hermite polynomials, linear flows on the torus, and an uncertainty principle for roots. Journal of Mathematical Analysis and Applications, 451(2):678–711, 2017

work page 2017
[34]

Towards an AI co-scientist

J. Gottweis, W.-H. Weng, A. Daryin, T. Tu, A. Palepu, P. Sirkovic, A. Myaskovsky, F. Weissenberger, K. Rong, R. Tanno, K. Saab, D. Popovici, J. Blum, F. Zhang, K. Chou, A. Hassidim, B. Gokturk, A. Vahdat, P. Kohli, Y. Matias, A. Carroll, K. Kulkarni, N. Tomasev, Y. Guan, V. Dhillon, E. D. Vaishnav, B. Lee, T. R. D. Costa, J. R. Penadés, G. Peltz, Y. Xu, A...

work page internal anchor Pith review arXiv 2025
[35]

Grayeli, A

A. Grayeli, A. Sehgal, O. Costilla Reyes, M. Cranmer, and S. Chaudhuri. Symbolic regression with a learned concept library. Advances in Neural Information Processing Systems, 37:44678–44709, 2024.

work page 2024
[36]

Gridach, J

M. Gridach, J. Nanavati, C. Mack, K. Z. E. Abidine, and L. Mendes. Agentic AI for scientific discovery: A survey of progress, challenges, and future directions. In ICLR Workshop: Towards Agentic AI for Science: Hypothesis Generation, Comprehension, Quantification, and Validation, 2025

work page 2025
[37]

Interesting scientific idea generation using knowledge graphs and LLMs: Evaluations with 100 research group leaders. arXiv:2405.17044, 2024

X. Gu and M. Krenn. Interesting scientific idea generation using knowledge graphs and LLMs: Evaluations with 100 research group leaders. In arXiv preprint arXiv:2405.17044, 2024

work page arXiv 2024
[38]

S. Guo, A. H. Shariatmadari, G. Xiong, and A. Zhang. Embracing foundation models for advancing scientific discovery. In Proceedings of the IEEE International Conference on Big Data, pages 1746–1755, 2024. doi: 10.1109/bigdata62323.2024.10825618

work page doi:10.1109/bigdata62323.2024.10825618 2024
[39]

Gyarmati, F

K. Gyarmati, F. Hennecart, and I. Z. Ruzsa. Sums and differences of finite sets. Functiones et Approximatio Commentarii Mathematici, 37(1):175–186, 2007

work page 2007
[40]

J. K. Haugland. The minimum overlap problem revisited. arXiv preprint arXiv:1609.08000, 2016

work page arXiv 2016
[41]

Hemberg, S

E. Hemberg, S. Moskal, and U.-M. O’Reilly. Evolving code with a large language model. Genetic Programming and Evolvable Machines, 25(2):21, 2024. doi: 10.1007/s10710-024-09494-2

work page 2024
[42]

J. E. Hopcroft and L. R. Kerr. On minimizing the number of multiplications necessary for matrix multiplication. SIAM J. Appl. Math., 20(1):30–36, Jan. 1971. ISSN 0036-

work page 1971
[43]

doi: 10.1137/0120004

work page doi:10.1137/0120004
[44]

Crispr-gpt: An llm agent for automated design of gene-editing experiments

K. Huang, Y. Qu, H. Cousins, W. A. Johnson, D. Yin, M. Shah, D. Zhou, R. Altman, M. Wang, and L. Cong. CRISPR-GPT: An LLM agent for automated design of gene-editing experiments. In arXiv preprint arXiv:2404.18021, 2024

work page arXiv 2024
[45]

Huang, W

L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025

work page 2025
[46]

S. Jia, C. Zhang, and V. Fung. LLMatDesign: Autonomous materials discovery with large language models. In arXiv preprint arXiv:2406.13163, 2024

work page arXiv 2024
[47]

doi: 10.1038/s41586-021-03819-2

J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, A. Bridgland, C. Meyer, S. A. A. Kohl, A. J. Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S. Bodenste...

work page doi:10.1038/s41586-021-03819-2 2021
[48]

Kauers and J

M. Kauers and J. Moosbauer. Flip graphs for matrix multiplication. In Proceedings of the 2023 International Symposium on Symbolic and Algebraic Computation, pages 381–388, 2023.

work page 2023
[49]

Kauers and J

M. Kauers and J. Moosbauer. Some new non-commutative matrix multiplication algorithms of size (𝑛, 𝑚, 6). ACM Commun. Comput. Algebra, 58(1):1–11, Jan. 2025. ISSN 1932-2232. doi: 10.1145/3712020.3712021

work page doi:10.1145/3712020.3712021 2025
[50]

Kauers and I

M. Kauers and I. Wood. Consequences of the Moosbauer-Poole algorithms. arXiv preprint arXiv:2505.05896, 2025

work page arXiv 2025
[51]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015

work page 2015
[52]

J. R. Koza. Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4(2):87–112, 1994. doi: 10.1007/BF00175355

work page doi:10.1007/bf00175355 1994
[53]

J. D. Laderman. A noncommutative algorithm for multiplying 3 × 3 matrices using 23 multiplications. Bulletin of the American Mathematical Society, 82(1):126–128, 1976

work page 1976
[54]

R. Lam, A. Sanchez-Gonzalez, M. Willson, P. Wirnsberger, M. Fortunato, F. Alet, S. Ravuri, T. Ewalds, Z. Eaton-Rosen, W. Hu, A. Merose, S. Hoyer, G. Holland, O. Vinyals, J. Stott, A. Pritzel, S. Mohamed, and P. Battaglia. Learning skillful medium-range global weather forecasting. Science, 382(6677):1416–1421, 2023. doi: 10.1126/science.adi2336

work page doi:10.1126/science.adi2336 2023
[55]

W. B. Langdon and R. Poli. Foundations of Genetic Programming. Springer Science & Business Media, 2013

work page 2013
[56]

Lange, Y

R. Lange, Y. Tian, and Y. Tang. Large language models as evolution strategies. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO ’24 Companion, pages 579–582. Association for Computing Machinery, 2024. doi: 10.1145/3638530.3654238

work page doi:10.1145/3638530.3654238 2024
[57]

R. T. Lange, A. Prasad, Q. Sun, M. Faldor, Y. Tang, and D. Ha. The AI CUDA engineer: Agentic CUDA kernel discovery, optimization and composition. Technical report, Sakana AI, 02 2025

work page 2025
[58]

Lehman, J

J. Lehman, J. Gordon, S. Jain, K. Ndousse, C. Yeh, and K. O. Stanley. Evolution through large models. In Handbook of Evolutionary Machine Learning, pages 331–366. Springer, 2023

work page 2023
[59]

Lehman, J

J. Lehman, J. Gordon, S. Jain, K. Ndousse, C. Yeh, and K. O. Stanley. Evolution Through Large Models, pages 331–366. Springer Nature Singapore, 2024. doi: 10.1007/978-981-99-3814-8_11

work page doi:10.1007/978-981-99-3814-8 2024
[60]

Li, Y.-Y

P.-H. Li, Y.-Y. Sun, H.-F. Juan, C.-Y. Chen, H.-K. Tsai, and J.-H. Huang. A large language model framework for literature-based disease–gene association prediction. Briefings in Bioinformatics, 26(1):bbaf070, 02 2025. ISSN 1477-4054. doi: 10.1093/bib/bbaf070

work page doi:10.1093/bib/bbaf070 2025
[61]

Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, T. Hubert, P. Choy, C. de Masson d’Autume, I. Babuschkin, X. Chen, P.-S. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J. Mankowitz, E. S. Robson, P. Kohli, N. d...

work page doi:10.1126/science.abq1158 2022
[62]

H. Lin, M. Maas, M. Roquemore, A. Hasanzadeh, F. Lewis, Y. Simonson, T.-W. Yang, A. Yazdanbakhsh, D. Altinbüken, F. Papa, et al. ECO: An LLM-driven efficient code optimizer for warehouse scale computers. arXiv preprint arXiv:2503.15669, 2025

work page arXiv 2025
[63]

Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, N. Smetanin, R. Verkuil, O. Kabeli, Y. Shmueli, A. dos Santos Costa, M. Fazel-Zarandi, T. Sercu, S. Candido, and A. Rives. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023. doi: 10.1126/science.ade2574

work page doi:10.1126/science.ade2574 2023
[64]

F. Liu, X. Tong, M. Yuan, X. Lin, F. Luo, Z. Wang, Z. Lu, and Q. Zhang. Evolution of heuristics: Towards efficient automatic algorithm design using large language model. arXiv preprint arXiv:2401.02051, 2024

work page arXiv 2024
[65]

F. Luo, J. Zhang, Q. Wang, and C. Yang. Leveraging prompt engineering in large language models for accelerating chemical research. ACS Central Science, 2025. doi: 10.1021/acscentsci.4c01935

work page doi:10.1021/acscentsci.4c01935 2025
[66]

Z. Luo, Z. Yang, Z. Xu, W. Yang, and X. Du. LLM4SR: A survey on large language models for scientific research. In arXiv preprint arXiv:2501.04306, 2025

work page arXiv 2025
[67]

H. Ma, A. Narayanaswamy, P. Riley, and L. Li. Evolving symbolic density functionals. Science Advances, 8(36):eabq0279, 2022. doi: 10.1126/sciadv.abq0279

work page doi:10.1126/sciadv.abq0279 2022
[68]

Madani, B

A. Madani, B. Krause, E. R. Greene, S. Subramanian, B. P. Mohr, J. M. Holton, J. L. Olmos, C. Xiong, Z. Z. Sun, R. Socher, J. S. Fraser, and N. Naik. Large language models generate functional protein sequences across diverse families. Nature Biotechnology, 41(8):1099–1106, August 2023. ISSN 1087-0156. doi: 10.1038/s41587-022-01618-2

work page doi:10.1038/s41587-022-01618-2 2023
[69]

D. J. Mankowitz, A. Michi, A. Zhernov, M. Gelmi, M. Selvi, C. Paduraru, E. Leurent, S. Iqbal, J.-B. Lespiau, A. Ahern, T. Köppe, K. Millikin, S. Gaffney, S. Elster, J. Broshear, C. Gamble, K. Milan, R. Tung, M. Hwang, T. Cemgil, M. Barekatain, Y. Li, A. Mandhane, T. Hubert, J. Schrittwieser, D. Hassabis, P. Kohli, M. Riedmiller, O. Vinyals, and D. Silver. Faster sorting ...

work page doi:10.1038/s41586-023-06004-9 2023
[70]

Massalin

H. Massalin. Superoptimizer - A look at the smallest program. In R. H. Katz and M. Freeman, editors, Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS II), Palo Alto, California, USA, October 5-8, 1987, pages 122–126. ACM Press, 1987. doi: 10.1145/36206.36194

work page arXiv 1987
[71]

Matolcsi and C

M. Matolcsi and C. Vinuesa. Improved bounds on the supremum of autoconvolutions. Journal of mathematical analysis and applications, 372(2):439–447, 2010

work page 2010
[72]

Are llms ready for real-world materials discovery? arXiv preprint arXiv:2402.05200, 2024

S. Miret and N. M. A. Krishnan. Are LLMs ready for real-world materials discovery? In arXiv preprint arXiv:2402.05200, 2024.

work page arXiv 2024
[73]

Moosbauer and M

J. Moosbauer and M. Poole. Flip graphs with symmetry and new matrix multiplication schemes. arXiv preprint arXiv:2502.04514, 2025

work page arXiv 2025
[74]

Morris, M

C. Morris, M. Jurado, and J. Zutty. LLM guided evolution - the automation of models advancing models. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 377–384, 2024

work page 2024
[75]

Illuminating search spaces by mapping elites

J.-B. Mouret and J. Clune. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015

work page Pith review arXiv 2015
[76]

Naumov, D

V. Naumov, D. Zagirova, S. Lin, Y. Xie, W. Gou, A. Urban, N. Tikhonova, K. Alawi, M. Durymanov, F. Galkin, S. Chen, D. Sidorenko, M. Korzinkin, M. Scheibye-Knudsen, A. Aspuru-Guzik, E. Izumchenko, D. Gennert, F. W. Pun, M. Zhang, P. Kamya, A. Aliper, F. Ren, and A. Zhavoronkov. DORA AI scientist: Multi-agent virtual research team for scientific exploratio...

work page doi:10.1101/2025.03.06.641840 2025
[77]

Introducing OpenAI o3 and o4-mini, 2025

OpenAI. Introducing OpenAI o3 and o4-mini, 2025. URL https://openai.com/index/introducing-o3-and-o4-mini/

work page 2025
[78]

XLA: composable transformations of Python+NumPy programs

OpenXLA. XLA: composable transformations of Python+NumPy programs. URL https://github.com/openxla/xla

work page
[79]

H. Pan, N. Mudur, W. Taranto, M. Tikhanovskaya, S. Venugopalan, Y. Bahri, M. P. Brenner, and E.-A. Kim. Quantum many-body physics calculations with large language models. Communications Physics, 8(1):49, 2025. doi: 10.1038/s42005-025-01956-y

work page doi:10.1038/s42005-025-01956-y 2025
[80]

Accelerating Earth science discovery via multi-agent LLM systems

D. Pantiukhin, B. Shapkin, I. Kuznetsov, A. A. Jost, and N. Koldunov. Accelerating Earth science discovery via multi-agent LLM systems. In arXiv preprint arXiv:2503.05854, 2025

work page arXiv 2025

Showing first 80 references.