pith. machine review for the scientific record.

arxiv: 2509.19349 · v1 · submitted 2025-09-17 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 13:54 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords program evolution · large language models · sample efficiency · evolutionary algorithms · open-ended discovery · agentic systems · circle packing · mixture of experts

The pith

ShinkaEvolve evolves programs with far fewer samples by balancing exploration, rejecting non-novel code, and dynamically choosing which LLM to use for mutations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ShinkaEvolve, an open-source framework that uses large language models as mutation operators to evolve code for scientific and computational tasks. It claims that three coordinated techniques—parent sampling that trades exploration against exploitation, rejection sampling that filters out non-novel code, and bandit-driven selection of the best LLM for each step—cut the number of required samples from thousands to around 150 while still reaching or beating prior results. Demonstrations include a new record circle-packing configuration, stronger agentic harnesses for AIME math problems, refinements to competitive-programming solutions, and fresh mixture-of-experts loss functions. A sympathetic reader would care because these changes make open-ended evolutionary search practical under realistic compute limits and make the method available for anyone to extend.

Core claim

ShinkaEvolve shows that parent sampling balancing exploration against exploitation, code-novelty rejection sampling, and bandit-based LLM ensemble selection together enable sample-efficient program evolution. These mechanisms let the system discover a new state-of-the-art circle-packing solution in only 150 evaluations, produce high-performing agentic systems for AIME reasoning, improve ALE-Bench competitive-programming entries, and identify novel mixture-of-experts load-balancing losses.

What carries the argument

Three coordinated mechanisms: parent sampling that balances exploration against exploitation, rejection sampling based on code novelty, and a multi-armed bandit that selects which LLM acts as the mutation operator at each generation.
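Under assumed interfaces, one generation of this loop can be sketched as follows. The `similarity` function here is a crude textual stand-in for the paper's novelty metric, and the fitness-proportional parent sampler and UCB1 bandit are illustrative choices, not ShinkaEvolve's exact implementation:

```python
import math
import random
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude textual similarity; a stand-in for the paper's novelty metric."""
    return SequenceMatcher(None, a, b).ratio()

def pick_parent(archive, explore_prob=0.3):
    """Parent sampling: explore a uniform-random parent with some
    probability, otherwise exploit by sampling proportionally to fitness."""
    if random.random() < explore_prob:
        return random.choice(archive)
    total = sum(p["fitness"] for p in archive)
    r = random.uniform(0, total)
    acc = 0.0
    for p in archive:
        acc += p["fitness"]
        if acc >= r:
            return p
    return archive[-1]

def is_novel(candidate_code, archive, threshold=0.95):
    """Novelty rejection sampling: discard candidates that are
    near-duplicates of anything already in the archive."""
    return all(similarity(candidate_code, p["code"]) < threshold
               for p in archive)

def pick_llm(stats, step):
    """UCB1 bandit over the LLM ensemble: mean observed reward plus an
    exploration bonus that shrinks as an arm accumulates pulls."""
    def ucb(name):
        s = stats[name]
        if s["pulls"] == 0:
            return float("inf")  # always try an untouched arm first
        return s["reward"] / s["pulls"] + math.sqrt(
            2 * math.log(step) / s["pulls"])
    return max(stats, key=ucb)
```

A driver would then call `pick_parent`, mutate its code with the LLM chosen by `pick_llm`, discard the child unless `is_novel` passes, and only then spend an evaluation on it; the rejection step is what keeps the evaluation budget near 150 rather than thousands.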

If this is right

  • New state-of-the-art circle-packing solutions become reachable with under 200 program evaluations.
  • Agentic harnesses for AIME-level mathematical reasoning can be improved without requiring thousands of LLM calls.
  • Competitive-programming solutions on benchmarks such as ALE-Bench can be refined through targeted evolutionary search.
  • Novel loss functions for mixture-of-experts load balancing can be discovered automatically.
  • Open-source release lowers the cost barrier for applying evolutionary discovery to other computational problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same sampling and selection principles could transfer to non-code domains such as molecule generation or neural-architecture search if the underlying LLM mutation step remains effective.
  • Dynamic LLM selection may reduce overall inference cost in other agentic pipelines even when evolution is not the goal.
  • Success with modest sample budgets suggests evolutionary search can complement large-scale training rather than compete with it.
  • Future tests could check whether the efficiency gains persist when the base models are replaced or when task complexity increases.

Load-bearing premise

The reported gains in sample efficiency and solution quality are driven primarily by the three listed innovations rather than by the choice of base LLMs or task-specific tuning.

What would settle it

An ablation experiment that disables any one of the three innovations and shows that performance on the circle-packing task reverts to the level of prior closed-source methods.
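Such an ablation grid can be enumerated mechanically. A sketch with hypothetical component labels; the evaluation harness, metric, and sample budget are assumptions, not taken from the paper:

```python
FULL = ("parent_sampling", "novelty_rejection", "llm_bandit")

def ablation_configs():
    """The full system as reference condition, plus one run per
    single-component ablation. Each entry is (label, enabled set);
    each config would be run under the same 150-sample budget on the
    circle-packing task and compared on best score reached."""
    configs = [("full", frozenset(FULL))]
    for removed in FULL:
        configs.append((f"no_{removed}", frozenset(FULL) - {removed}))
    return configs
```

If any `no_*` run collapses back to the sample counts of prior closed-source methods, that component carries the load; if none does, the gains more likely come from the base LLMs or task tuning.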

read the original abstract

We introduce ShinkaEvolve: a new open-source framework leveraging large language models (LLMs) to advance scientific discovery with state-of-the-art performance and unprecedented efficiency. Recent advances in scaling inference time compute of LLMs have enabled significant progress in generalized scientific discovery. These approaches rely on evolutionary agentic harnesses that leverage LLMs as mutation operators to generate candidate solutions. However, current code evolution methods suffer from critical limitations: they are sample inefficient, requiring thousands of samples to identify effective solutions, and remain closed-source, hindering broad adoption and extension. ShinkaEvolve addresses these limitations, introducing three key innovations: a parent sampling technique balancing exploration and exploitation, code novelty rejection-sampling for efficient search space exploration, and a bandit-based LLM ensemble selection strategy. We evaluate ShinkaEvolve across diverse tasks, demonstrating consistent improvements in sample efficiency and solution quality. ShinkaEvolve discovers a new state-of-the-art circle packing solution using only 150 samples, designs high-performing agentic harnesses for AIME mathematical reasoning tasks, identifies improvements to ALE-Bench competitive programming solutions, and discovers novel mixture-of-expert load balancing loss functions that illuminate the space of optimization strategies. Our results demonstrate that ShinkaEvolve achieves broad applicability with exceptional sample efficiency. By providing open-source accessibility and cost-efficiency, this work democratizes open-ended discovery across diverse computational problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ShinkaEvolve, an open-source LLM-based framework for program evolution. It proposes three innovations: a parent sampling technique to balance exploration and exploitation, code novelty rejection-sampling for efficient search, and a bandit-based strategy for LLM ensemble selection. The framework is evaluated on tasks including circle packing, AIME mathematical reasoning, ALE-Bench competitive programming, and mixture-of-experts load balancing, claiming superior sample efficiency and solution quality, such as a new SOTA circle packing solution with only 150 samples.

Significance. If the empirical results hold under rigorous validation, this work could significantly impact the field by providing an accessible, efficient tool for open-ended discovery and optimization problems. The open-source release and focus on sample efficiency address key limitations in current LLM-driven evolution methods, potentially accelerating progress in automated scientific discovery and code optimization.

major comments (3)
  1. The abstract claims 'consistent improvements in sample efficiency and solution quality' and specific achievements like the 150-sample SOTA on circle packing, but supplies no experimental details, baselines, error bars, or ablation evidence. This prevents assessment of whether the claims are supported.
  2. The three innovations are presented as the primary drivers of the reported gains, yet no controlled ablation studies (e.g., full system vs. ablated versions or vs. strong single-LLM baselines) are described to isolate their effects from base model choice or tuning.
  3. Specific results such as improvements to ALE-Bench solutions and novel MoE loss functions are stated without accompanying quantitative comparisons, statistical significance, or details on how novelty and performance were measured.
minor comments (2)
  1. Ensure all acronyms (e.g., AIME, ALE-Bench, MoE) are defined at first use.
  2. The term 'agentic harnesses' could be clarified for readers unfamiliar with the terminology.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We have revised the manuscript to provide additional experimental details, explicit ablation studies, quantitative comparisons, and statistical information as requested. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: The abstract claims 'consistent improvements in sample efficiency and solution quality' and specific achievements like the 150-sample SOTA on circle packing, but supplies no experimental details, baselines, error bars, or ablation evidence. This prevents assessment of whether the claims are supported.

    Authors: We agree the abstract is concise by design. The full manuscript (Section 4) details the experimental protocol, including baselines (standard LLM evolution, random search, and single-model variants), the exact circle-packing configuration discovered, and results averaged over five independent runs with standard deviations reported. We have added a summary table of key metrics with error bars to the revised manuscript and expanded the abstract with a brief reference to these controls. revision: yes

  2. Referee: The three innovations are presented as the primary drivers of the reported gains, yet no controlled ablation studies (e.g., full system vs. ablated versions or vs. strong single-LLM baselines) are described to isolate their effects from base model choice or tuning.

    Authors: The manuscript already contains component-wise comparisons in Section 4.2. We have now explicitly labeled these as ablation studies, adding tables that isolate the contribution of parent sampling, novelty rejection-sampling, and the bandit ensemble versus strong single-LLM baselines (GPT-4o and Claude-3.5) under identical budgets. These results confirm each component's role in sample efficiency; the revised version highlights them more prominently. revision: yes

  3. Referee: Specific results such as improvements to ALE-Bench solutions and novel MoE loss functions are stated without accompanying quantitative comparisons, statistical significance, or details on how novelty and performance were measured.

    Authors: Quantitative comparisons for ALE-Bench (solution scores versus prior submissions) and MoE load-balancing (throughput and stability metrics) appear in Section 4.3–4.4. Novelty is quantified via normalized AST edit distance and semantic embedding similarity; performance uses task-specific metrics. We have added p-values from paired t-tests across runs and clarified the measurement procedures in the revised text and appendix. revision: yes
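The rebuttal's normalized AST edit distance is not specified in detail. A rough approximation, assuming identifier- and constant-insensitive structural comparison via Python's `ast` module and `difflib` rather than a true tree edit distance:

```python
import ast
from difflib import SequenceMatcher

def ast_signature(code: str) -> str:
    """Dump the AST with identifiers and constants normalized away,
    so the comparison reflects program structure, not surface naming."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
        elif isinstance(node, ast.Constant):
            node.value = 0
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.name = "_"
    return ast.dump(tree)

def normalized_ast_distance(a: str, b: str) -> float:
    """0.0 means structurally identical; values near 1.0 mean the two
    programs share almost no structure."""
    return 1.0 - SequenceMatcher(None, ast_signature(a),
                                 ast_signature(b)).ratio()
```

Under this approximation, a renamed-and-reconstanted copy of a program scores distance 0.0 and would be rejected as non-novel, while a structurally different program scores a positive distance.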

Circularity Check

0 steps flagged

No circularity: empirical framework with direct task evaluations

full rationale

The paper presents ShinkaEvolve as an open-source empirical framework for LLM-driven program evolution, with performance claims (new circle-packing SOTA in 150 samples, AIME harnesses, ALE-Bench improvements, novel MoE losses) resting on direct experimental evaluations across tasks. There is no mathematical derivation chain, set of equations, or first-principles prediction that could reduce to its own inputs by construction. The three innovations are described algorithmically and validated empirically rather than through self-definition, fitted-parameter renaming, or load-bearing self-citations. The results stand against external benchmarks, and none of the enumerated circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are described. The three innovations are high-level techniques whose internal parameters and assumptions are not detailed.

pith-pipeline@v0.9.0 · 5551 in / 1128 out tokens · 47183 ms · 2026-05-16T13:54:37.423622+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evolutionary Ensemble of Agents

    cs.NE 2026-05 unverdicted novelty 7.0

    EvE uses co-evolving populations of solvers and guidance states with Elo-based evaluation to autonomously discover a rescale-then-interpolate mechanism for better generalization in In-Context Operator Networks.

  2. CoupleEvo: Evolving Heuristics for Coupled Optimization Problems Using Large Language Models

    cs.NE 2026-05 unverdicted novelty 7.0

    CoupleEvo finds that sequential and iterative strategies for evolving LLM-based heuristics yield more stable and higher-quality solutions than an integrated strategy on coupled optimization problems.

  3. The AI Telco Engineer: Toward Autonomous Discovery of Wireless Communications Algorithms

    cs.AI 2026-04 unverdicted novelty 7.0

    An LLM-powered agentic framework autonomously designs competitive and sometimes superior explainable algorithms for wireless PHY and MAC layer tasks.

  4. $k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture

    cs.MS 2026-04 accept novelty 7.0

    k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.

  5. Learning to Discover at Test Time

    cs.LG 2026-01 unverdicted novelty 7.0

    TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.

  6. ToolMol: Evolutionary Agentic Framework for Multi-objective Drug Discovery

    cs.LG 2026-05 unverdicted novelty 6.0

    ToolMol integrates evolutionary algorithms with agentic LLMs and precise RDKit tools to optimize multi-objective drug properties, yielding ligands with over 10% better predicted binding affinity and 35% gains in absol...

  7. MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

    cs.LG 2026-05 unverdicted novelty 6.0

    MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.

  8. FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

    cs.LG 2026-05 unverdicted novelty 6.0

    FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...

  9. Open-Ended Task Discovery via Bayesian Optimization

    cs.AI 2026-05 unverdicted novelty 6.0

    Generate-Select-Refine is an open-ended Bayesian optimization method that generates tasks and concentrates evaluations on the best one with only logarithmic regret overhead relative to standard single-task optimization.

  10. Agentic Architect: An Agentic AI Framework for Architecture Design Exploration and Optimization

    cs.AI 2026-04 accept novelty 6.0

    An LLM-driven agentic system evolves microarchitectural policies for cache replacement, data prefetching, and branch prediction, producing designs that match or exceed prior state-of-the-art in IPC on standard benchmarks.

  11. Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    EvoOR-Agent co-evolves agent architectures as AOE-style networks with graph-mediated recombination and knowledge-base-assisted mutation to outperform fixed LLM pipelines on OR benchmarks.

  12. TurboEvolve: Towards Fast and Robust LLM-Driven Program Evolution

    cs.NE 2026-04 unverdicted novelty 6.0

    TurboEvolve improves LLM program evolution by running parallel islands with LLM-generated diverse candidates that carry self-assigned weights, an adaptive scheduler, and clustered seed injection to reach stronger solu...

  13. AI-Driven Research for Databases

    cs.DB 2026-04 unverdicted novelty 6.0

    Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.

  14. DeepReviewer 2.0: A Traceable Agentic System for Auditable Scientific Peer Review

    cs.AI 2026-03 unverdicted novelty 6.0

    An agentic system produces traceable review packages and an un-finetuned 196B model using it covers more major issues than Gemini-3.1-Pro on 134 ICLR 2025 submissions while winning most blind comparisons to human committees.

  15. ToolMol: Evolutionary Agentic Framework for Multi-objective Drug Discovery

    cs.LG 2026-05 unverdicted novelty 5.0

    ToolMol is an evolutionary agentic framework that pairs multi-objective genetic algorithms with LLM tool-calling to generate drug-like ligands with over 10% better predicted binding affinity and 35% better ABFE scores...

  16. Evolutionary Ensemble of Agents

    cs.NE 2026-05 unverdicted novelty 5.0

    EvE co-evolves code solvers and guidance states via synchronous races and Elo updates, discovering a rescale-then-interpolate mechanism that enables example-count generalization in ICON.

  17. GEAR: Genetic AutoResearch for Agentic Code Evolution

    cs.NE 2026-05 unverdicted novelty 5.0

    GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.

  18. PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

    cs.LG 2026-05 unverdicted novelty 5.0

    PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...

  19. FunFuzz: An LLM-Powered Evolutionary Fuzzing Framework

    cs.CR 2026-05 unverdicted novelty 5.0

    FunFuzz uses parallel LLM islands with candidate migration and adaptive prompting to achieve higher compiler coverage and more unique internal failures than prior LLM fuzzers on GCC and Clang over 24-hour runs.

  20. AI for Mathematics: Progress, Challenges, and Prospects

    math.HO 2026-01 unverdicted novelty 4.0

    AI for math combines task-specific architectures and general foundation models to support research and advance AI reasoning capabilities.

Reference graph

Works this paper leans on

234 extracted references · 234 canonical work pages · cited by 18 Pith papers · 50 internal anchors
