pith. machine review for the scientific record.

arxiv: 2605.09018 · v2 · submitted 2026-05-09 · 💻 cs.NE · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

Evolutionary Ensemble of Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:10 UTC · model grok-4.3

classification 💻 cs.NE · cs.AI · cs.LG
keywords evolutionary ensemble · coding agents · Elo ratings · In-Context Operator Networks · agent adaptation · co-evolving populations · algorithmic discovery · rescale-then-interpolate

The pith

A self-revising ensemble of coding agents overcomes static performance ceilings by co-evolving solvers and guidance states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Evolutionary Ensemble (EvE) as a framework that organizes existing coding agents into two co-evolving populations: functional code solvers and agent guidance states. These populations compete in synchronous races where agents receive Elo ratings based on the marginal improvements they deliver to the current solver. Applied to refining In-Context Operator Networks, the system autonomously identifies a rescale-then-interpolate mechanism that supports reliable generalization across example counts. Controlled comparisons show that only the live, stage-dependent adaptation of agents succeeds in navigating the shifting search spaces of complex code, while fixed or frozen agents encounter phase mismatches and stall. The central argument is that this self-revising ensemble structure, rather than any single agent improvement, supplies the mechanism for sustained progress beyond static limits.

Core claim

By fixing the base agent substrate and instead evolving cumulative guidance and skills through populations of solvers and states that update via Elo ratings on marginal gains, EvE produces autonomous discoveries such as the rescale-then-interpolate operator for ICON; ablations confirm that stage-dependent adaptation is required to avoid phase mismatch, establishing that the self-revising ensemble itself drives escape from static performance ceilings.

What carries the argument

The Evolutionary Ensemble (EvE) mechanism of two co-evolving populations (functional code solvers and agent guidance states) evaluated through synchronous races and updated by Elo ratings on marginal contributions.
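The race-and-rating machinery can be sketched with the standard Elo update. The mapping from marginal gains to a win/draw/loss outcome and the K-factor below are assumptions for illustration, not details taken from the paper:

```python
def expected_score(r_a, r_b, scale=400.0):
    """Standard logistic expectation that agent A outperforms agent B."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))

def elo_update(r_a, r_b, gain_a, gain_b, k=32.0):
    """Update two agents' ratings after one synchronous race.

    gain_a / gain_b are the marginal improvements each agent delivered
    to the current solver (a hypothetical stand-in for the paper's metric).
    """
    # Convert marginal gains into a win / draw / loss outcome for A.
    if gain_a > gain_b:
        s_a = 1.0
    elif gain_a < gain_b:
        s_a = 0.0
    else:
        s_a = 0.5
    e_a = expected_score(r_a, r_b)
    # Symmetric zero-sum update: what A gains, B loses.
    r_a_new = r_a + k * (s_a - e_a)
    r_b_new = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return r_a_new, r_b_new
```

For example, two equal-rated agents at 1000 where A delivers the larger marginal gain end the race at 1016 and 984; the update only reweights selection, it does not change how many evaluations each agent receives.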

If this is right

  • The rescale-then-interpolate mechanism discovered by EvE enables reliable example-count generalization in In-Context Operator Networks.
  • Stage-dependent adaptation prevents the phase mismatch that halts progress under fixed or frozen agent conditions.
  • Organizing agents into a self-revising ensemble supplies the driver for sustained gains once static ceilings are reached.
  • The decentralized Elo-based evaluation allows the system to identify which guidance states contribute most at each stage of code evolution.
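One plausible reading of the rescale-then-interpolate mechanism, as a hypothetical sketch rather than the paper's implementation: rescale the k query slot positions onto the index range of a positional-encoding table trained at a fixed example count, then linearly interpolate between the nearest trained encodings.

```python
def rescale_then_interpolate(pe_table, k):
    """Resample a positional-encoding table trained for len(pe_table)
    example slots to k slots (k may be smaller or larger).

    pe_table: list of embedding vectors (lists of floats), one per
    trained example slot; a hypothetical stand-in for ICON's PE.
    """
    n = len(pe_table)
    dim = len(pe_table[0])
    out = []
    for i in range(k):
        # Rescale slot index i (of k) onto the trained index range [0, n - 1].
        t = 0.0 if k == 1 else i * (n - 1) / (k - 1)
        lo, hi = int(t), min(int(t) + 1, n - 1)
        w = t - lo
        # Linearly interpolate between the two nearest trained encodings.
        out.append([(1 - w) * pe_table[lo][d] + w * pe_table[hi][d]
                    for d in range(dim)])
    return out
```

Under this reading, a model trained with n example slots can serve any k at inference because every query position lands inside the trained encoding range instead of extrapolating beyond it.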

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same co-evolution pattern could be tested on non-coding agent tasks such as scientific simulation or theorem proving to check whether ensemble revision generalizes beyond code.
  • If performance ceilings in other LLM-driven domains arise from static guidance rather than model capacity, inserting live adaptation stages might produce comparable lifts without retraining.
  • Extending the two-population structure to include explicit memory of past stage transitions could reduce the computational cost of re-evaluating marginal gains at every step.

Load-bearing premise

Stage-dependent adaptation of agents is required to track the changing search landscapes inside complex codebases.

What would settle it

A controlled run on the ICON task in which a non-adapting fixed initial agent or a frozen best-evolved agent reaches or exceeds the rescale-then-interpolate performance would falsify the necessity of live stage-dependent revision.

Figures

Figures reproduced from arXiv: 2605.09018 by Liu Yang and Zongmin Yu.

Figure 1. Three paradigms of LLM-driven algorithmic discovery.

Figure 2. The EvE framework. EvE maintains two co-evolving populations.

Figure 3. Search trajectories for all three variants (two independent runs each).

Figure 4. Per-example-count error curves (k = 1 through k = 10) at two training budgets. Each variant contributes the best PE method from each of its two independent runs; the Seed (gray, ICON vanilla PE) is the reference baseline.

Figure 5. Run 1: from agent guidance to solver code.

Figure 6. Run 2: from agent guidance to solver code, with the same layout.

Figure 4 (detail). Error jumps from ∼0.05 (in-distribution) to ∼0.9 (out-of-distribution), an 18× degradation. The Seed demonstrates that positional-encoding design, not model capacity, is the bottleneck for example-count generalization. EvE Run 1: InterpolatedDemoPE (agent: evolved). The same PE family first appeared at iteration 2, but the best solver using this PE design was produced at iteration 15. Best at: iteration 15 (…
read the original abstract

We introduce Evolutionary Ensemble (EvE), a decentralized framework that organizes existing, highly capable coding agents into a live, co-evolving system for algorithmic discovery. Rather than reinventing the wheel within the "LLMs as optimizers" paradigm, EvE fixes the base agent substrate and focuses entirely on evolving the cumulative guidance and skills that dictate agent behaviors. By maintaining two co-evolving populations, namely functional code solvers and agent guidance states, the system evaluates agents through a synchronous race, updating their empirical Elo ratings based on the marginal gains they contribute to the current solver state. When applied to a research bottleneck in In-Context Operator Networks (ICON), EvE autonomously discovered a robust rescale-then-interpolate mechanism that enables reliable example-count generalization. Crucially, controlled ablations reveal the absolute necessity of stage-dependent agent adaptation to navigate the shifting search landscapes of complex codebases. Compared to variants driven by a fixed initial agent or even a frozen "best-evolved" agent, EvE uniquely avoids phase mismatch, demonstrating that organizing agents into a self-revising ensemble is the fundamental driver for breaking through static performance ceilings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Evolutionary Ensemble (EvE), a decentralized framework that maintains two co-evolving populations—functional code solvers and agent guidance states—evaluated through synchronous races with Elo-rating updates based on marginal solver gains. Applied to a research bottleneck in In-Context Operator Networks (ICON), EvE autonomously discovers a rescale-then-interpolate mechanism for reliable example-count generalization. Controlled ablations are used to argue that stage-dependent agent adaptation is absolutely necessary to avoid phase mismatch and break static performance ceilings, outperforming fixed-initial and frozen-best agent variants.

Significance. If the empirical results hold after controlling for compute, the work would advance evolutionary multi-agent systems for algorithmic discovery by showing how self-revising ensembles can navigate shifting code-search landscapes. The concrete ICON discovery and emphasis on evolving guidance states rather than base agents are strengths; reproducible code or parameter-free derivations are not mentioned.

major comments (2)
  1. [Ablation studies] Ablation studies: the central claim that stage-dependent adaptation is 'absolutely necessary' rests on comparisons to fixed-initial and frozen-best baselines. It is unclear whether total agent evaluations, solver iterations, and cumulative compute are explicitly matched across conditions; the synchronous race and Elo updates may grant the adaptive condition higher effective search budget, leaving open the possibility that gains arise from resource differences rather than the ensemble mechanism.
  2. [ICON results] ICON results section: the rescale-then-interpolate discovery is presented as evidence of autonomous discovery, but the manuscript must report per-condition resource accounting, validation metrics for generalization, and direct comparisons to non-EvE baselines to substantiate that the ensemble (rather than search effort) is the driver.
minor comments (2)
  1. [Introduction] A brief definition or reference for In-Context Operator Networks (ICON) early in the introduction would aid readers outside the immediate subfield.
  2. [Framework description] Notation for Elo updates and marginal-gain calculation should be formalized with an equation to improve reproducibility.
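For concreteness, the standard Elo formulation the referee is requesting would read as follows; the outcome encoding S_a is an assumption about how marginal gains are compared, since the paper's exact notation is not given here:

```latex
E_a = \frac{1}{1 + 10^{(R_b - R_a)/400}}, \qquad
R_a' = R_a + K\,(S_a - E_a),
```

where $S_a \in \{0, \tfrac{1}{2}, 1\}$ records whether agent $a$'s marginal gain to the current solver was smaller than, equal to, or greater than its opponent's, and $K$ is the update step size.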

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our work. We address the major concerns point by point below, agreeing that further clarification on computational resources is needed to strengthen the claims.

read point-by-point responses
  1. Referee: [Ablation studies] Ablation studies: the central claim that stage-dependent adaptation is 'absolutely necessary' rests on comparisons to fixed-initial and frozen-best baselines. It is unclear whether total agent evaluations, solver iterations, and cumulative compute are explicitly matched across conditions; the synchronous race and Elo updates may grant the adaptive condition higher effective search budget, leaving open the possibility that gains arise from resource differences rather than the ensemble mechanism.

    Authors: We thank the referee for highlighting this important point. The experimental protocol runs all conditions for an identical number of synchronous races (N=50) and solver iterations per race, with each agent participating in the same number of evaluations per race. The Elo update mechanism does not alter the number of evaluations; it only affects selection probabilities in subsequent races. Nevertheless, to address the concern directly, we will include a dedicated resource accounting table in the revised manuscript that reports total agent evaluations, solver iterations, and estimated compute (in FLOPs) for each ablation condition, confirming they are matched. revision: yes

  2. Referee: [ICON results] ICON results section: the rescale-then-interpolate discovery is presented as evidence of autonomous discovery, but the manuscript must report per-condition resource accounting, validation metrics for generalization, and direct comparisons to non-EvE baselines to substantiate that the ensemble (rather than search effort) is the driver.

    Authors: We agree that additional details are required here. The manuscript already compares EvE to fixed-initial and frozen-best agent variants as controls for the ensemble mechanism. For the ICON application, we report generalization performance on validation sets with varying example counts. In the revision, we will add per-condition resource accounting (as noted above), explicit validation metrics (e.g., mean squared error across different numbers of in-context examples), and a direct comparison to a standard ICON baseline without evolutionary ensemble to better demonstrate that the discovered mechanism and performance gains stem from the co-evolution rather than raw search effort. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical framework with standard Elo-based evaluation and ablations

full rationale

The paper presents EvE as an empirical framework organizing agents via co-evolving populations, synchronous races, and standard Elo rating updates. No mathematical derivations, predictions, or first-principles results are claimed that reduce to fitted inputs or self-definitions by construction. Ablations compare stage-dependent adaptation against fixed-initial and frozen-best baselines using empirical performance metrics; these are not statistically forced by parameter fitting within the same data. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing. The central claim rests on observable differences in search behavior across conditions rather than renaming or re-deriving its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify specific free parameters, axioms, or invented entities; the Elo rating system is treated as a standard external method.

pith-pipeline@v0.9.0 · 5484 in / 1022 out tokens · 50916 ms · 2026-05-15T06:10:39.826366+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 4 internal anchors
