pith. machine review for the scientific record.

arxiv: 2605.10598 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Budget-Efficient Automatic Algorithm Design via Code Graph

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 03:41 UTC · model grok-4.3

classification 💻 cs.AI
keywords automatic algorithm design · LLM corrections · code graph · directed acyclic graph · combinatorial optimization · budget efficiency · credit assignment

The pith

Representing algorithms as graphs of LLM corrections allows higher fitness at the same token budget than generating full algorithms each time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard LLM pipelines for automatic algorithm design waste budget by repeatedly rewriting the same code structures and throwing away partial candidates that still hold useful pieces. It introduces a directed acyclic graph where each LLM output is treated as a small correction that adds, swaps, or deletes code blocks, so new algorithms are built by composing prior corrections rather than starting over. This structure supports credit assignment down to the level of individual corrections, guiding the next queries more precisely. The authors supply both a theoretical account of how to trade search depth against breadth under a fixed budget and empirical results showing the graph approach beats full-algorithm search on three combinatorial optimization tasks.

Core claim

By modeling algorithms as a directed acyclic graph of compact LLM corrections rather than as independent full programs, the search reuses valuable substructures across candidates and performs correction-level credit assignment, which together produce algorithms of higher realized fitness for any given token budget.

What carries the argument

The directed acyclic graph whose nodes are LLM-generated corrections (compact operators that add, replace, or remove code blocks), allowing algorithms to be decomposed into reusable corrections and recomposed without redundant rewriting.
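The mechanism is easy to sketch. The toy below (the `Correction` type, block-keyed composition, and all identifiers) is invented for illustration and is not the authors' implementation; it only shows how two candidate algorithms can share a prefix of corrections instead of being rewritten from scratch:

```python
from dataclasses import dataclass

# Hypothetical sketch of the DAG-of-corrections idea: each node is a compact
# correction (add / replace / remove of a named code block), and an algorithm
# is recovered by composing the corrections along a root-to-node path.

@dataclass(frozen=True)
class Correction:
    op: str          # "add" | "replace" | "remove"
    block: str       # name of the code block the correction touches
    code: str = ""   # new body for "add"/"replace"; unused for "remove"

def compose(path):
    """Apply a root-to-node path of corrections to an empty algorithm.

    The algorithm is modeled as an insertion-ordered dict: block name -> code.
    """
    blocks = {}
    for c in path:
        if c.op == "add":
            blocks[c.block] = c.code
        elif c.op == "replace":
            if c.block not in blocks:
                raise ValueError(f"replace of unknown block {c.block!r}")
            blocks[c.block] = c.code
        elif c.op == "remove":
            blocks.pop(c.block, None)
    return "\n".join(blocks.values())

# Two paths that share a prefix reuse the shared correction instead of
# re-generating it, which is the budget saving the paper argues for.
root = [Correction("add", "greedy", "tour = nearest_neighbor(cities)")]
path_a = root + [Correction("add", "polish", "tour = two_opt(tour)")]
path_b = root + [Correction("replace", "greedy", "tour = random_insertion(cities)")]
print(compose(path_a))
```

Because `path_a` and `path_b` share `root`, the tokens spent generating the root correction are paid once and amortized across both candidates.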

If this is right

  • At any fixed token budget the graph method yields higher average fitness than full-algorithm regeneration.
  • Credit assignment at the correction level improves query selection over random or fitness-only selection.
  • Theoretical guidance on depth-versus-breadth trade-offs can be used to allocate the remaining budget after early corrections are evaluated.
  • Rich context helps performance only when the LLM starts with shallow prior knowledge of the problem domain.
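The depth-versus-breadth point can be made concrete with a toy grid search over the budget split. The payoff curves `mu` (depth) and `q` (breadth) below are invented placeholders with diminishing returns; only the structure, realized fitness as a depth payoff plus a breadth payoff at N = B/ω chains, follows the paper's framing:

```python
import math

# Toy sketch of the depth-versus-breadth allocation: a token budget B is
# split into N = B / omega independent correction chains of depth omega.
# mu and q are invented placeholder curves, not the paper's.

def mu(omega: float) -> float:
    """Assumed payoff of deepening one chain with omega tokens."""
    return 1.0 - math.exp(-omega / 300.0)

def q(n: float) -> float:
    """Assumed payoff of exploring n chains in parallel."""
    return 0.6 * (1.0 - math.exp(-n / 6.0))

def realized_fitness(budget: float, omega: float) -> float:
    return mu(omega) + q(budget / omega)

def best_split(budget: float, depths=(50, 100, 200, 300, 600, 1000)):
    """Grid-search the depth omega maximizing the toy objective."""
    feasible = [w for w in depths if w <= budget]
    return max(feasible, key=lambda w: realized_fitness(budget, w))

for budget in (600, 6000):
    w = best_split(budget)
    print(f"B={budget}: depth omega={w}, breadth N={budget / w:.1f}")
```

With these particular curves the preferred depth saturates while the surplus budget flows into breadth; the paper's analysis characterizes this balance in general rather than for one invented pair of curves.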

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same correction-graph structure could be combined with non-LLM search operators such as local mutation or evolutionary recombination.
  • If correction credit assignment proves stable, the approach might scale to algorithm design tasks whose full code exceeds current context windows.
  • The depth-breadth analysis could be tested on non-combinatorial problems where substructure reuse is even more valuable.

Load-bearing premise

LLM-generated corrections can be reliably composed inside the graph without creating semantic inconsistencies that break the resulting algorithms.
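A minimal sketch of guarding that premise: replay a correction sequence and flag structurally inconsistent steps before composing them. The conflict rules here are illustrative, not the authors' procedure:

```python
# Replay a sequence of (op, block) corrections and collect inconsistencies,
# e.g. replacing or removing a block that no longer exists. A real system
# would also need semantic checks; this covers only structural conflicts.

def check_composition(corrections):
    """Return a list of inconsistencies found while replaying corrections."""
    present, problems = set(), []
    for i, (op, block) in enumerate(corrections):
        if op == "add":
            if block in present:
                problems.append(f"step {i}: add of existing block {block!r}")
            present.add(block)
        elif op == "replace":
            if block not in present:
                problems.append(f"step {i}: replace of missing block {block!r}")
        elif op == "remove":
            if block not in present:
                problems.append(f"step {i}: remove of missing block {block!r}")
            present.discard(block)
    return problems

ok = [("add", "init"), ("add", "loop"), ("replace", "loop")]
bad = [("add", "init"), ("remove", "loop"), ("replace", "init")]
print(check_composition(ok))    # []
print(check_composition(bad))   # one problem: remove of missing 'loop'
```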

What would settle it

The claim fails if, on the same three combinatorial optimization problems and with identical token budgets, the graph-based method does not produce consistently higher-fitness algorithms than the full-algorithm baseline.

Figures

Figures reproduced from arXiv: 2605.10598 by Manxi Wu, Maxime Bouscary, Saurabh Amin.

Figure 1. Single iteration of the graph-based search. Each iteration generates multiple corrections. …
Figure 2. Fitness against cumulative tokens on TSP and LRP. The solid, dashed, and dotted lines …
Figure 3. Proportion of iterations that strictly improve …
Figure 4. Augmentation of a DAG with two corrections. The green correction, "Move the refinement …
Figure 5. Contour plots of the Traveling Salesman Problem (top), Location Routing Problem (middle), …
Figure 6. Proportion of iterations that strictly improve the incumbent's best fitness for each problem.
read the original abstract

Large language models (LLMs) have emerged as powerful tools for automatic algorithm design (AAD). However, existing pipelines remain inefficient. They operate at the granularity of full algorithms, redundantly rewriting recurring substructures and discarding low-fitness candidates that may contain valuable algorithmic features. We formalize budget-efficient automatic algorithm design, wherein the search policy maximizes realized fitness subject to limited computational cost. We propose a directed acyclic graph representation of algorithms and build a search framework that fully exploits the LLM's output. Instead of querying the LLM for full algorithms, we use it to obtain corrections: compact operators that add, replace, or remove code blocks. Each correction augments the graph, yielding new algorithms that compose with prior corrections. This graph structure decomposes algorithms into sets of corrections, enabling correction-level credit assignment that informs subsequent queries. We complement this framework with theoretical insights into the ideal balance between search depth and breadth at different budget levels. We validate our method empirically on three combinatorial optimization problems, demonstrating consistent superiority of our graph-based search over full-algorithm search at equal token budget. Finally, our experiments suggest that rich contexts help only when the LLM's prior knowledge is shallow, and can hinder performance otherwise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formalizes budget-efficient automatic algorithm design (AAD) with LLMs and proposes representing algorithms as directed acyclic graphs (DAGs) whose nodes are compact LLM-generated corrections (add/replace/remove code blocks). This structure permits composition of new algorithms from prior corrections, correction-level credit assignment to guide future LLM queries, and avoidance of redundant full-algorithm rewrites. The manuscript supplies theoretical analysis of the depth-breadth tradeoff under token budgets and reports empirical results on three combinatorial optimization problems showing that the graph-based search consistently outperforms full-algorithm search at equal token budget.

Significance. If the empirical superiority holds under rigorous controls, the DAG-plus-correction framework offers a concrete mechanism for more token-efficient AAD by salvaging useful substructures and directing search via credit assignment. The explicit treatment of budget constraints and the depth-breadth analysis constitute a useful theoretical contribution that could inform future LLM-based search policies. The work is grounded in a clear structural proposal rather than ad-hoc prompting.
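For intuition, correction-level credit assignment can be sketched as an exact Shapley computation over a toy set of three corrections. The fitness function is an invented stand-in, and the exhaustive permutation loop only works at this tiny scale (the paper's appendix points to tree-based SHAP approximations for larger sets):

```python
import itertools

# Exact Shapley values over three toy corrections: average each
# correction's marginal fitness contribution over all join orders.

def fitness(corrections: frozenset) -> float:
    # Invented fitness: "greedy" is worth 1.0; "two_opt" adds 0.5 but
    # only on top of "greedy"; "verbose_log" contributes nothing.
    score = 0.0
    if "greedy" in corrections:
        score += 1.0
        if "two_opt" in corrections:
            score += 0.5
    return score

def shapley(players):
    """Exact Shapley value of each correction under `fitness`."""
    players = list(players)
    values = {p: 0.0 for p in players}
    perms = list(itertools.permutations(players))
    for perm in perms:
        sofar = set()
        for p in perm:
            before = fitness(frozenset(sofar))
            sofar.add(p)
            values[p] += fitness(frozenset(sofar)) - before
    return {p: v / len(perms) for p, v in values.items()}

credit = shapley(["greedy", "two_opt", "verbose_log"])
print(credit)  # greedy gets the most credit, verbose_log gets none
```

The point of the exercise: credit lands on the corrections that actually move fitness, which is the signal a search policy can use to decide what to query next.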

major comments (2)
  1. [Abstract / Empirical Evaluation] Abstract and Empirical Evaluation section: the central claim of 'consistent superiority' over full-algorithm search at equal token budget is load-bearing, yet the manuscript supplies no description of the three combinatorial problems, the precise definition of token budget (input vs. output tokens, prompt overhead), the full-algorithm baseline implementation, number of independent runs, or any statistical tests. Without these elements the empirical result cannot be verified or reproduced.
  2. [Method / DAG Representation] Framework description (DAG construction and correction composition): the assumption that LLM-generated corrections can be reliably composed into valid algorithms without semantic inconsistencies is central to the credit-assignment mechanism, yet no verification procedure, rejection sampling strategy, or failure-rate statistics are reported. This directly affects whether the reported efficiency gains are attributable to the graph structure or to unstated filtering.
minor comments (2)
  1. [Method] Notation for the DAG and correction operators should be introduced with a small illustrative example (e.g., a 3-node graph) early in the method section to improve readability.
  2. [Experiments] The abstract states that 'rich contexts help only when the LLM's prior knowledge is shallow'; this observation is interesting but would benefit from a short table or plot in the experiments section showing the interaction effect.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below, indicating the revisions we will incorporate to enhance reproducibility and methodological transparency.

read point-by-point responses
  1. Referee: [Abstract / Empirical Evaluation] Abstract and Empirical Evaluation section: the central claim of 'consistent superiority' over full-algorithm search at equal token budget is load-bearing, yet the manuscript supplies no description of the three combinatorial problems, the precise definition of token budget (input vs. output tokens, prompt overhead), the full-algorithm baseline implementation, number of independent runs, or any statistical tests. Without these elements the empirical result cannot be verified or reproduced.

    Authors: We agree that these details are essential for verifying and reproducing the central empirical claim. The submitted manuscript provides only high-level references to the experimental setup. In the revised version, we will expand the Experimental Setup and Results sections to include: explicit descriptions of the three combinatorial optimization problems, a precise definition of token budget (total tokens consumed across all LLM API calls, counting both input prompts and output completions including overhead), the full-algorithm baseline (identical LLM and prompting but generating complete algorithms without graph composition), the number of independent runs, and the statistical tests used for comparison. revision: yes

  2. Referee: [Method / DAG Representation] Framework description (DAG construction and correction composition): the assumption that LLM-generated corrections can be reliably composed into valid algorithms without semantic inconsistencies is central to the credit-assignment mechanism, yet no verification procedure, rejection sampling strategy, or failure-rate statistics are reported. This directly affects whether the reported efficiency gains are attributable to the graph structure or to unstated filtering.

    Authors: We acknowledge that the manuscript does not report verification procedures or failure statistics for correction composition. In the revision, we will augment the Method section with a description of the composition process (AST-based merging to ensure syntactic validity) and add a new subsection on verification, including any rejection sampling applied and the observed failure rates from our experiments. This will allow readers to evaluate whether the reported gains stem from the DAG structure. revision: yes
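A hedged sketch of what such a verification loop could look like, using `ast.parse` as the syntactic gate. The string-substitution merge is purely illustrative, not the authors' AST-based procedure; the point is rejecting corrections that break the program and tracking the failure rate:

```python
import ast

# After applying a candidate correction, re-parse the composed program;
# reject corrections that leave it syntactically invalid and keep
# failure-rate statistics, as the rebuttal promises to report.

def apply_correction(program: str, old: str, new: str) -> str:
    """Naive illustrative merge: textual substitution of one block."""
    return program.replace(old, new)

def accept(program: str) -> bool:
    """Syntactic gate: does the composed program still parse?"""
    try:
        ast.parse(program)
        return True
    except SyntaxError:
        return False

base = "def tour(cities):\n    return sorted(cities)\n"
candidates = [
    ("sorted(cities)", "nearest_neighbor(cities)"),   # parses: accepted
    ("sorted(cities)", "nearest_neighbor(cities"),    # unbalanced paren: rejected
]

accepted, rejected = [], 0
for old, new in candidates:
    merged = apply_correction(base, old, new)
    if accept(merged):
        accepted.append(merged)
    else:
        rejected += 1
print(f"failure rate: {rejected}/{len(candidates)}")
```

Note that a parse check catches only syntactic breakage; whether the efficiency gains survive once semantic failures are also counted is exactly what the referee asks the revision to report.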

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's core contribution is a structural framework for budget-efficient AAD that decomposes algorithms into composable LLM-generated corrections within a DAG, with correction-level credit assignment. The central claim is an empirical result showing graph-based search outperforms full-algorithm search at fixed token budget across three combinatorial problems. No equations, fitted parameters, or self-citations appear in the derivation chain that would reduce any prediction or theoretical insight to its own inputs by construction. The framework is presented as an independent proposal whose validity rests on external experimental validation rather than internal tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the effectiveness of the new graph representation and correction mechanism whose utility is asserted via empirical comparison; limited information is available from the abstract alone.

axioms (1)
  • domain assumption Large language models can generate compact, composable code corrections that improve algorithmic fitness
    Core premise enabling the correction-based search and credit assignment.
invented entities (2)
  • Directed acyclic graph representation of algorithms no independent evidence
    purpose: To decompose algorithms into composable corrections and enable credit assignment
    New structural abstraction introduced to replace full-algorithm rewriting.
  • Correction operators (add/replace/remove code blocks) no independent evidence
    purpose: Compact LLM outputs that augment the algorithm graph
    New search primitives proposed instead of full algorithm generation.

pith-pipeline@v0.9.0 · 5507 in / 1239 out tokens · 53284 ms · 2026-05-12T03:41:30.207420+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 3 internal anchors

  1. [1] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 2024.
  2. [2] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025.
  3. [3] Fei Liu, Xialiang Tong, Mingxuan Yuan, Xi Lin, Fu Luo, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Evolution of heuristics: Towards efficient automatic algorithm design using large language model. arXiv preprint arXiv:2401.02051, 2024.
  4. [4] Haoran Ye, Jiarui Wang, Zhiguang Cao, Federico Berto, Chuanbo Hua, Haeyeon Kim, Jinkyoo Park, and Guojie Song. ReEvo: Large language models as hyper-heuristics with reflective evolution. Advances in Neural Information Processing Systems, 37:43571–43608, 2024.
  5. [5] Zhi Zheng, Zhuoliang Xie, Zhenkun Wang, and Bryan Hooi. Monte Carlo tree search for comprehensive exploration in LLM-based automatic heuristic design. arXiv preprint arXiv:2501.08603, 2025.
  6. [6] Milad Yazdani, Mahdi Mostajabdaveh, Samin Aref, and Zirui Zhou. EvoCut: Strengthening integer programs via evolution-guided language models. arXiv preprint arXiv:2508.11850, 2025.
  7. [7] Anja Surina, Amin Mansouri, Lars Quaedvlieg, Amal Seddas, Maryna Viazovska, Emmanuel Abbe, and Caglar Gulcehre. Algorithm discovery with LLMs: Evolutionary search meets reinforcement learning. arXiv preprint arXiv:2504.05108, 2025.
  8. [8] Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, and Tianyi Zhang. Where do large language models fail when generating code? arXiv preprint arXiv:2406.08731, 2024.
  9. [9] Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Muling Wu, Yunbo Tao, Ming Zhang, Mingxu Chai, Jessica Fan, Zhiheng Xi, et al. What is wrong with your code generated by large language models? An extensive study. Science China Information Sciences, 69(1):112107, 2026.
  10. [10] Wei Cheng, Yongchang Cao, Chen Shen, Binhua Li, Jue Chen, Yongbin Li, and Wei Hu. To diff or not to diff? Structure-aware and adaptive output formats for efficient LLM-based code editing. arXiv preprint arXiv:2604.27296, 2026.
  11. [11] Shunyu Yao, Fei Liu, Xi Lin, Zhichao Lu, Zhenkun Wang, and Qingfu Zhang. Multi-objective evolution of heuristic using large language model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27144–27152, 2025.
  12. [12] Chentong Chen, Mengyuan Zhong, Ye Fan, Jialong Shi, and Jianyong Sun. TiDE: Tuning-integrated dynamic evolution for LLM-based automated heuristic design. arXiv preprint arXiv:2601.21239, 2026.
  13. [13] Asankhaya Sharma. OpenEvolve: An open-source evolutionary coding agent, 2025.
  14. [14] Hui Wang, Xufeng Zhang, and Chaoxu Mu. Planning of heuristics: Strategic planning on large language models with Monte Carlo tree search for automating heuristic optimization. arXiv preprint arXiv:2502.11422, 2025.
  15. [15] Jierui Li, Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. CodeTree: Agent-guided tree search for code generation with large language models, 2025.
  16. [16] Nicolás Astorga, Tennison Liu, Yuanzhang Xiao, and Mihaela van der Schaar. Autoformulation of mathematical optimization models using LLMs. arXiv preprint arXiv:2411.01679, 2024.
  17. [17] Chenyu Huang, Zhengyang Tang, Shixi Hu, Ruoqing Jiang, Xin Zheng, Dongdong Ge, Benyou Wang, and Zizhuo Wang. ORLM: A customizable framework in training large models for automated optimization modeling. Operations Research, 2025.
  18. [18] Zeyi Chen, Xinzhi Zhang, Humishka Zope, Hugo Barbalho, Konstantina Mellou, Marco Molinaro, Janardhan Kulkarni, Ishai Menache, and Sirui Li. OptiMind: Teaching LLMs to think like optimization experts. arXiv preprint arXiv:2509.22979, 2025.
  19. [19] Sirui Li, Janardhan Kulkarni, Ishai Menache, Cathy Wu, and Beibin Li. Towards foundation models for mixed integer linear programming. arXiv preprint arXiv:2410.08288, 2024.
  20. [20] Vinicius Lima, Dzung T Phan, Jayant Kalagnanam, Dhaval Patel, and Nianjun Zhou. Toward a trustworthy optimization modeling agent via verifiable synthetic data generation. arXiv preprint arXiv:2508.03117, 2025.
  21. [21] Anantaram Balakrishnan, James E Ward, and Richard T Wong. Integrated facility location and vehicle routing models: Recent work and future prospects. American Journal of Mathematical and Management Sciences, 7(1-2):35–61, 1987.
  22. [22] Dimitris Bertsimas, Arthur Delarue, and Sebastien Martin. Optimizing schools' start time and bus routes. Proceedings of the National Academy of Sciences, 116(13):5943–5948, 2019.
  23. [23] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
  24. [24] Mahmoud Masoud, Ahmed Abdelhay, and Mohammed Elhenawy. Exploring combinatorial problem solving with large language models: A case study on the travelling salesman problem using GPT-3.5 Turbo. arXiv preprint arXiv:2405.01997, 2024.
  25. [25] Mohammed Elhenawy, Ahmed Abdelhay, Taqwa I Alhadidi, Huthaifa I Ashqar, Shadi Jaradat, Ahmed Jaber, Sebastien Glaser, and Andry Rakotonirainy. Eyeballing combinatorial problems: A case study of using multimodal large language models to solve traveling salesman problems. In International Conference on Intelligent Systems, Blockchain, and Communication Techn…
  26. [26] Scott M Lundberg, Gabriel G Erion, and Su-In Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.
  27. [27] Jilei Yang. Fast TreeSHAP: Accelerating SHAP value computation for trees. arXiv preprint arXiv:2109.09847, 2021.

  28. [28] Gerhard Reinelt. TSPLIB: A traveling salesman problem library. ORSA Journal on Computing, 3(4):376–384, 1991.