pith. machine review for the scientific record.

arxiv: 2509.19349 · v1 · submitted 2025-09-17 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 13:54 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords program evolution · large language models · sample efficiency · evolutionary algorithms · open-ended discovery · agentic systems · circle packing · mixture of experts

The pith

ShinkaEvolve evolves programs with far fewer samples by balancing exploration, rejecting non-novel code, and dynamically choosing which LLM to use for mutations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ShinkaEvolve, an open-source framework that uses large language models as mutation operators to evolve code for scientific and computational tasks. It claims that three coordinated techniques—parent sampling that trades exploration against exploitation, rejection sampling that filters out non-novel code, and bandit-driven selection of the best LLM for each step—cut the number of required samples from thousands to around 150 while still reaching or beating prior results. Demonstrations include a new record circle-packing configuration, stronger agentic harnesses for AIME math problems, refinements to competitive-programming solutions, and fresh mixture-of-experts loss functions. A sympathetic reader would care because these changes make open-ended evolutionary search practical under realistic compute limits and make the method available for anyone to extend.

Core claim

ShinkaEvolve shows that parent sampling balancing exploration against exploitation, code-novelty rejection sampling, and bandit-based LLM ensemble selection together enable sample-efficient program evolution. These mechanisms let the system discover a new state-of-the-art circle-packing solution in only 150 evaluations, produce high-performing agentic systems for AIME reasoning, improve ALE-Bench competitive-programming entries, and identify novel mixture-of-experts load-balancing losses.

What carries the argument

Three coordinated mechanisms: parent sampling that balances exploration against exploitation, rejection sampling based on code novelty, and a multi-armed bandit that selects which LLM acts as the mutation operator at each generation.
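Under assumed interfaces, one generation of this loop can be sketched as follows. The `similarity` function here is a crude textual stand-in for the paper's novelty metric, and the fitness-proportional parent sampler and UCB1 bandit are illustrative choices, not ShinkaEvolve's exact implementation:

```python
import math
import random
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude textual similarity; a stand-in for the paper's novelty metric."""
    return SequenceMatcher(None, a, b).ratio()

def pick_parent(archive, explore_prob=0.3):
    """Parent sampling: explore a uniform-random parent with some
    probability, otherwise exploit by sampling proportionally to fitness."""
    if random.random() < explore_prob:
        return random.choice(archive)
    total = sum(p["fitness"] for p in archive)
    r = random.uniform(0, total)
    acc = 0.0
    for p in archive:
        acc += p["fitness"]
        if acc >= r:
            return p
    return archive[-1]

def is_novel(candidate_code, archive, threshold=0.95):
    """Novelty rejection sampling: discard candidates that are
    near-duplicates of anything already in the archive."""
    return all(similarity(candidate_code, p["code"]) < threshold
               for p in archive)

def pick_llm(stats, step):
    """UCB1 bandit over the LLM ensemble: mean observed reward plus an
    exploration bonus that shrinks as an arm accumulates pulls."""
    def ucb(name):
        s = stats[name]
        if s["pulls"] == 0:
            return float("inf")  # always try an untouched arm first
        return s["reward"] / s["pulls"] + math.sqrt(
            2 * math.log(step) / s["pulls"])
    return max(stats, key=ucb)
```

A driver would then call `pick_parent`, mutate its code with the LLM chosen by `pick_llm`, discard the child unless `is_novel` passes, and only then spend an evaluation on it; the rejection step is what keeps the evaluation budget near 150 rather than thousands.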

If this is right

  • New state-of-the-art circle-packing solutions become reachable with under 200 program evaluations.
  • Agentic harnesses for AIME-level mathematical reasoning can be improved without requiring thousands of LLM calls.
  • Competitive-programming solutions on benchmarks such as ALE-Bench can be refined through targeted evolutionary search.
  • Novel loss functions for mixture-of-experts load balancing can be discovered automatically.
  • Open-source release lowers the cost barrier for applying evolutionary discovery to other computational problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same sampling and selection principles could transfer to non-code domains such as molecule generation or neural-architecture search if the underlying LLM mutation step remains effective.
  • Dynamic LLM selection may reduce overall inference cost in other agentic pipelines even when evolution is not the goal.
  • Success with modest sample budgets suggests evolutionary search can complement large-scale training rather than compete with it.
  • Future tests could check whether the efficiency gains persist when the base models are replaced or when task complexity increases.

Load-bearing premise

The reported gains in sample efficiency and solution quality are driven primarily by the three listed innovations rather than by the choice of base LLMs or task-specific tuning.

What would settle it

An ablation experiment that disables any one of the three innovations and shows that performance on the circle-packing task reverts to the level of prior closed-source methods.
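Such an ablation grid can be enumerated mechanically. A sketch with hypothetical component labels; the evaluation harness, metric, and sample budget are assumptions, not taken from the paper:

```python
FULL = ("parent_sampling", "novelty_rejection", "llm_bandit")

def ablation_configs():
    """The full system as reference condition, plus one run per
    single-component ablation. Each entry is (label, enabled set);
    each config would be run under the same 150-sample budget on the
    circle-packing task and compared on best score reached."""
    configs = [("full", frozenset(FULL))]
    for removed in FULL:
        configs.append((f"no_{removed}", frozenset(FULL) - {removed}))
    return configs
```

If any `no_*` run collapses back to the sample counts of prior closed-source methods, that component carries the load; if none does, the gains more likely come from the base LLMs or task tuning.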

read the original abstract

We introduce ShinkaEvolve: a new open-source framework leveraging large language models (LLMs) to advance scientific discovery with state-of-the-art performance and unprecedented efficiency. Recent advances in scaling inference time compute of LLMs have enabled significant progress in generalized scientific discovery. These approaches rely on evolutionary agentic harnesses that leverage LLMs as mutation operators to generate candidate solutions. However, current code evolution methods suffer from critical limitations: they are sample inefficient, requiring thousands of samples to identify effective solutions, and remain closed-source, hindering broad adoption and extension. ShinkaEvolve addresses these limitations, introducing three key innovations: a parent sampling technique balancing exploration and exploitation, code novelty rejection-sampling for efficient search space exploration, and a bandit-based LLM ensemble selection strategy. We evaluate ShinkaEvolve across diverse tasks, demonstrating consistent improvements in sample efficiency and solution quality. ShinkaEvolve discovers a new state-of-the-art circle packing solution using only 150 samples, designs high-performing agentic harnesses for AIME mathematical reasoning tasks, identifies improvements to ALE-Bench competitive programming solutions, and discovers novel mixture-of-expert load balancing loss functions that illuminate the space of optimization strategies. Our results demonstrate that ShinkaEvolve achieves broad applicability with exceptional sample efficiency. By providing open-source accessibility and cost-efficiency, this work democratizes open-ended discovery across diverse computational problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ShinkaEvolve, an open-source LLM-based framework for program evolution. It proposes three innovations: a parent sampling technique to balance exploration and exploitation, code novelty rejection-sampling for efficient search, and a bandit-based strategy for LLM ensemble selection. The framework is evaluated on tasks including circle packing, AIME mathematical reasoning, ALE-Bench competitive programming, and mixture-of-experts load balancing, claiming superior sample efficiency and solution quality, such as a new SOTA circle packing solution with only 150 samples.

Significance. If the empirical results hold under rigorous validation, this work could significantly impact the field by providing an accessible, efficient tool for open-ended discovery and optimization problems. The open-source release and focus on sample efficiency address key limitations in current LLM-driven evolution methods, potentially accelerating progress in automated scientific discovery and code optimization.

major comments (3)
  1. The abstract claims 'consistent improvements in sample efficiency and solution quality' and specific achievements like the 150-sample SOTA on circle packing, but supplies no experimental details, baselines, error bars, or ablation evidence. This prevents assessment of whether the claims are supported.
  2. The three innovations are presented as the primary drivers of the reported gains, yet no controlled ablation studies (e.g., full system vs. ablated versions or vs. strong single-LLM baselines) are described to isolate their effects from base model choice or tuning.
  3. Specific results such as improvements to ALE-Bench solutions and novel MoE loss functions are stated without accompanying quantitative comparisons, statistical significance, or details on how novelty and performance were measured.
minor comments (2)
  1. Ensure all acronyms (e.g., AIME, ALE-Bench, MoE) are defined at first use.
  2. The term 'agentic harnesses' could be clarified for readers unfamiliar with the terminology.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We have revised the manuscript to provide additional experimental details, explicit ablation studies, quantitative comparisons, and statistical information as requested. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: The abstract claims 'consistent improvements in sample efficiency and solution quality' and specific achievements like the 150-sample SOTA on circle packing, but supplies no experimental details, baselines, error bars, or ablation evidence. This prevents assessment of whether the claims are supported.

    Authors: We agree the abstract is concise by design. The full manuscript (Section 4) details the experimental protocol, including baselines (standard LLM evolution, random search, and single-model variants), the exact circle-packing configuration discovered, and results averaged over five independent runs with standard deviations reported. We have added a summary table of key metrics with error bars to the revised manuscript and expanded the abstract with a brief reference to these controls. revision: yes

  2. Referee: The three innovations are presented as the primary drivers of the reported gains, yet no controlled ablation studies (e.g., full system vs. ablated versions or vs. strong single-LLM baselines) are described to isolate their effects from base model choice or tuning.

    Authors: The manuscript already contains component-wise comparisons in Section 4.2. We have now explicitly labeled these as ablation studies, adding tables that isolate the contribution of parent sampling, novelty rejection-sampling, and the bandit ensemble versus strong single-LLM baselines (GPT-4o and Claude-3.5) under identical budgets. These results confirm each component's role in sample efficiency; the revised version highlights them more prominently. revision: yes

  3. Referee: Specific results such as improvements to ALE-Bench solutions and novel MoE loss functions are stated without accompanying quantitative comparisons, statistical significance, or details on how novelty and performance were measured.

    Authors: Quantitative comparisons for ALE-Bench (solution scores versus prior submissions) and MoE load-balancing (throughput and stability metrics) appear in Section 4.3–4.4. Novelty is quantified via normalized AST edit distance and semantic embedding similarity; performance uses task-specific metrics. We have added p-values from paired t-tests across runs and clarified the measurement procedures in the revised text and appendix. revision: yes
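The rebuttal's normalized AST edit distance is not specified in detail. A rough approximation, assuming identifier- and constant-insensitive structural comparison via Python's `ast` module and `difflib` rather than a true tree edit distance:

```python
import ast
from difflib import SequenceMatcher

def ast_signature(code: str) -> str:
    """Dump the AST with identifiers and constants normalized away,
    so the comparison reflects program structure, not surface naming."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            node.id = "_"
        elif isinstance(node, ast.arg):
            node.arg = "_"
        elif isinstance(node, ast.Constant):
            node.value = 0
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.name = "_"
    return ast.dump(tree)

def normalized_ast_distance(a: str, b: str) -> float:
    """0.0 means structurally identical; values near 1.0 mean the two
    programs share almost no structure."""
    return 1.0 - SequenceMatcher(None, ast_signature(a),
                                 ast_signature(b)).ratio()
```

Under this approximation, a renamed-and-reconstanted copy of a program scores distance 0.0 and would be rejected as non-novel, while a structurally different program scores a positive distance.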

Circularity Check

0 steps flagged

No circularity: empirical framework with direct task evaluations

full rationale

The paper presents ShinkaEvolve as an open-source empirical framework for LLM-driven program evolution, with performance claims (new circle-packing SOTA in 150 samples, AIME harnesses, ALE-Bench improvements, novel MoE losses) resting on direct experimental evaluations across tasks. There is no mathematical derivation chain, set of equations, or first-principles prediction that could reduce to its own inputs by construction. The three innovations are described algorithmically and validated empirically rather than through self-definition, fitted-parameter renaming, or load-bearing self-citations. The results stand against external benchmarks, and none of the enumerated circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are described. The three innovations are high-level techniques whose internal parameters and assumptions are not detailed.

pith-pipeline@v0.9.0 · 5551 in / 1128 out tokens · 47183 ms · 2026-05-16T13:54:37.423622+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evolutionary Ensemble of Agents

    cs.NE 2026-05 unverdicted novelty 7.0

    EvE uses co-evolving populations of solvers and guidance states with Elo-based evaluation to autonomously discover a rescale-then-interpolate mechanism for better generalization in In-Context Operator Networks.

  2. CoupleEvo: Evolving Heuristics for Coupled Optimization Problems Using Large Language Models

    cs.NE 2026-05 unverdicted novelty 7.0

    CoupleEvo finds that sequential and iterative strategies for evolving LLM-based heuristics yield more stable and higher-quality solutions than an integrated strategy on coupled optimization problems.

  3. The AI Telco Engineer: Toward Autonomous Discovery of Wireless Communications Algorithms

    cs.AI 2026-04 unverdicted novelty 7.0

    An LLM-powered agentic framework autonomously designs competitive and sometimes superior explainable algorithms for wireless PHY and MAC layer tasks.

  4. $k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture

    cs.MS 2026-04 accept novelty 7.0

    k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.

  5. Learning to Discover at Test Time

    cs.LG 2026-01 unverdicted novelty 7.0

    TTT-Discover applies test-time RL to set new state-of-the-art results on math inequalities, GPU kernels, algorithm contests, and single-cell denoising using an open model and public code.

  6. ToolMol: Evolutionary Agentic Framework for Multi-objective Drug Discovery

    cs.LG 2026-05 unverdicted novelty 6.0

    ToolMol integrates evolutionary algorithms with agentic LLMs and precise RDKit tools to optimize multi-objective drug properties, yielding ligands with over 10% better predicted binding affinity and 35% gains in absol...

  7. MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

    cs.LG 2026-05 unverdicted novelty 6.0

    MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.

  8. FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

    cs.LG 2026-05 unverdicted novelty 6.0

    FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...

  9. Open-Ended Task Discovery via Bayesian Optimization

    cs.AI 2026-05 unverdicted novelty 6.0

    Generate-Select-Refine is an open-ended Bayesian optimization method that generates tasks and concentrates evaluations on the best one with only logarithmic regret overhead relative to standard single-task optimization.

  10. Agentic Architect: An Agentic AI Framework for Architecture Design Exploration and Optimization

    cs.AI 2026-04 accept novelty 6.0

    An LLM-driven agentic system evolves microarchitectural policies for cache replacement, data prefetching, and branch prediction, producing designs that match or exceed prior state-of-the-art in IPC on standard benchmarks.

  11. Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    EvoOR-Agent co-evolves agent architectures as AOE-style networks with graph-mediated recombination and knowledge-base-assisted mutation to outperform fixed LLM pipelines on OR benchmarks.

  12. TurboEvolve: Towards Fast and Robust LLM-Driven Program Evolution

    cs.NE 2026-04 unverdicted novelty 6.0

    TurboEvolve improves LLM program evolution by running parallel islands with LLM-generated diverse candidates that carry self-assigned weights, an adaptive scheduler, and clustered seed injection to reach stronger solu...

  13. AI-Driven Research for Databases

    cs.DB 2026-04 unverdicted novelty 6.0

    Co-evolving LLM-generated solutions with their evaluators enables discovery of novel database algorithms that outperform state-of-the-art baselines, including a query rewrite policy with up to 6.8x lower latency.

  14. DeepReviewer 2.0: A Traceable Agentic System for Auditable Scientific Peer Review

    cs.AI 2026-03 unverdicted novelty 6.0

    An agentic system produces traceable review packages and an un-finetuned 196B model using it covers more major issues than Gemini-3.1-Pro on 134 ICLR 2025 submissions while winning most blind comparisons to human committees.

  15. ToolMol: Evolutionary Agentic Framework for Multi-objective Drug Discovery

    cs.LG 2026-05 unverdicted novelty 5.0

    ToolMol is an evolutionary agentic framework that pairs multi-objective genetic algorithms with LLM tool-calling to generate drug-like ligands with over 10% better predicted binding affinity and 35% better ABFE scores...

  16. Evolutionary Ensemble of Agents

    cs.NE 2026-05 unverdicted novelty 5.0

    EvE co-evolves code solvers and guidance states via synchronous races and Elo updates, discovering a rescale-then-interpolate mechanism that enables example-count generalization in ICON.

  17. GEAR: Genetic AutoResearch for Agentic Code Evolution

    cs.NE 2026-05 unverdicted novelty 5.0

    GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.

  18. PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

    cs.LG 2026-05 unverdicted novelty 5.0

    PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...

  19. FunFuzz: An LLM-Powered Evolutionary Fuzzing Framework

    cs.CR 2026-05 unverdicted novelty 5.0

    FunFuzz uses parallel LLM islands with candidate migration and adaptive prompting to achieve higher compiler coverage and more unique internal failures than prior LLM fuzzers on GCC and Clang over 24-hour runs.

  20. AI for Mathematics: Progress, Challenges, and Prospects

    math.HO 2026-01 unverdicted novelty 4.0

    AI for math combines task-specific architectures and general foundation models to support research and advance AI reasoning capabilities.

Reference graph

Works this paper leans on

234 extracted references · 234 canonical work pages · cited by 18 Pith papers · 50 internal anchors
