pith. machine review for the scientific record.

arxiv: 2605.10598 · v1 · submitted 2026-05-11 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

Budget-Efficient Automatic Algorithm Design via Code Graph

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 03:41 UTC · model grok-4.3

classification 💻 cs.AI
keywords automatic algorithm design · LLM corrections · code graph · directed acyclic graph · combinatorial optimization · budget efficiency · credit assignment

The pith

Representing algorithms as graphs of LLM corrections allows higher fitness at the same token budget than generating full algorithms each time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard LLM pipelines for automatic algorithm design waste budget by repeatedly rewriting the same code structures and throwing away partial candidates that still hold useful pieces. It introduces a directed acyclic graph where each LLM output is treated as a small correction that adds, swaps, or deletes code blocks, so new algorithms are built by composing prior corrections rather than starting over. This structure supports credit assignment down to the level of individual corrections, guiding the next queries more precisely. The authors supply both a theoretical account of how to trade search depth against breadth under a fixed budget and empirical results showing the graph approach beats full-algorithm search on three combinatorial optimization tasks.

Core claim

By modeling algorithms as a directed acyclic graph of compact LLM corrections rather than as independent full programs, the search reuses valuable substructures across candidates and performs correction-level credit assignment, which together produce algorithms of higher realized fitness for any given token budget.

What carries the argument

The directed acyclic graph whose nodes are LLM-generated corrections (compact operators that add, replace, or remove code blocks), allowing algorithms to be decomposed into reusable corrections and recomposed without redundant rewriting.
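The mechanism is easy to sketch. The toy below (the `Correction` type, block-keyed composition, and all identifiers) is invented for illustration and is not the authors' implementation; it only shows how two candidate algorithms can share a prefix of corrections instead of being rewritten from scratch:

```python
from dataclasses import dataclass

# Hypothetical sketch of the DAG-of-corrections idea: each node is a compact
# correction (add / replace / remove of a named code block), and an algorithm
# is recovered by composing the corrections along a root-to-node path.

@dataclass(frozen=True)
class Correction:
    op: str          # "add" | "replace" | "remove"
    block: str       # name of the code block the correction touches
    code: str = ""   # new body for "add"/"replace"; unused for "remove"

def compose(path):
    """Apply a root-to-node path of corrections to an empty algorithm.

    The algorithm is modeled as an insertion-ordered dict: block name -> code.
    """
    blocks = {}
    for c in path:
        if c.op == "add":
            blocks[c.block] = c.code
        elif c.op == "replace":
            if c.block not in blocks:
                raise ValueError(f"replace of unknown block {c.block!r}")
            blocks[c.block] = c.code
        elif c.op == "remove":
            blocks.pop(c.block, None)
    return "\n".join(blocks.values())

# Two paths that share a prefix reuse the shared correction instead of
# re-generating it, which is the budget saving the paper argues for.
root = [Correction("add", "greedy", "tour = nearest_neighbor(cities)")]
path_a = root + [Correction("add", "polish", "tour = two_opt(tour)")]
path_b = root + [Correction("replace", "greedy", "tour = random_insertion(cities)")]
print(compose(path_a))
```

Because `path_a` and `path_b` share `root`, the tokens spent generating the root correction are paid once and amortized across both candidates.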

If this is right

  • At any fixed token budget the graph method yields higher average fitness than full-algorithm regeneration.
  • Credit assignment at the correction level improves query selection over random or fitness-only selection.
  • Theoretical guidance on depth-versus-breadth trade-offs can be used to allocate the remaining budget after early corrections are evaluated.
  • Rich context helps performance only when the LLM starts with shallow prior knowledge of the problem domain.
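The depth-versus-breadth point can be made concrete with a toy grid search over the budget split. The payoff curves `mu` (depth) and `q` (breadth) below are invented placeholders with diminishing returns; only the structure, realized fitness as a depth payoff plus a breadth payoff at N = B/ω chains, follows the paper's framing:

```python
import math

# Toy sketch of the depth-versus-breadth allocation: a token budget B is
# split into N = B / omega independent correction chains of depth omega.
# mu and q are invented placeholder curves, not the paper's.

def mu(omega: float) -> float:
    """Assumed payoff of deepening one chain with omega tokens."""
    return 1.0 - math.exp(-omega / 300.0)

def q(n: float) -> float:
    """Assumed payoff of exploring n chains in parallel."""
    return 0.6 * (1.0 - math.exp(-n / 6.0))

def realized_fitness(budget: float, omega: float) -> float:
    return mu(omega) + q(budget / omega)

def best_split(budget: float, depths=(50, 100, 200, 300, 600, 1000)):
    """Grid-search the depth omega maximizing the toy objective."""
    feasible = [w for w in depths if w <= budget]
    return max(feasible, key=lambda w: realized_fitness(budget, w))

for budget in (600, 6000):
    w = best_split(budget)
    print(f"B={budget}: depth omega={w}, breadth N={budget / w:.1f}")
```

With these particular curves the preferred depth saturates while the surplus budget flows into breadth; the paper's analysis characterizes this balance in general rather than for one invented pair of curves.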

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same correction-graph structure could be combined with non-LLM search operators such as local mutation or evolutionary recombination.
  • If correction credit assignment proves stable, the approach might scale to algorithm design tasks whose full code exceeds current context windows.
  • The depth-breadth analysis could be tested on non-combinatorial problems where substructure reuse is even more valuable.

Load-bearing premise

LLM-generated corrections can be reliably composed inside the graph without creating semantic inconsistencies that break the resulting algorithms.
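A minimal sketch of guarding that premise: replay a correction sequence and flag structurally inconsistent steps before composing them. The conflict rules here are illustrative, not the authors' procedure:

```python
# Replay a sequence of (op, block) corrections and collect inconsistencies,
# e.g. replacing or removing a block that no longer exists. A real system
# would also need semantic checks; this covers only structural conflicts.

def check_composition(corrections):
    """Return a list of inconsistencies found while replaying corrections."""
    present, problems = set(), []
    for i, (op, block) in enumerate(corrections):
        if op == "add":
            if block in present:
                problems.append(f"step {i}: add of existing block {block!r}")
            present.add(block)
        elif op == "replace":
            if block not in present:
                problems.append(f"step {i}: replace of missing block {block!r}")
        elif op == "remove":
            if block not in present:
                problems.append(f"step {i}: remove of missing block {block!r}")
            present.discard(block)
    return problems

ok = [("add", "init"), ("add", "loop"), ("replace", "loop")]
bad = [("add", "init"), ("remove", "loop"), ("replace", "init")]
print(check_composition(ok))    # []
print(check_composition(bad))   # one problem: remove of missing 'loop'
```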

What would settle it

The claim fails if, on the same three combinatorial optimization problems and with identical token budgets, the graph-based method does not produce consistently higher-fitness algorithms than the full-algorithm baseline.

Figures

Figures reproduced from arXiv: 2605.10598 by Manxi Wu, Maxime Bouscary, Saurabh Amin.

Figure 1. Single iteration of the graph-based search. Each iteration generates multiple corrections. …
Figure 2. Fitness against cumulative tokens on TSP and LRP. The solid, dashed, and dotted lines …
Figure 3. Proportion of iterations that strictly improve …
Figure 4. Augmentation of a DAG with two corrections. The green correction, "Move the refinement …
Figure 5. Contour plots of the Traveling Salesman Problem (top), Location Routing Problem (middle), …
Figure 6. Proportion of iterations that strictly improve the incumbent's best fitness for each problem.
read the original abstract

Large language models (LLMs) have emerged as powerful tools for automatic algorithm design (AAD). However, existing pipelines remain inefficient. They operate at the granularity of full algorithms, redundantly rewriting recurring substructures and discarding low-fitness candidates that may contain valuable algorithmic features. We formalize budget-efficient automatic algorithm design, wherein the search policy maximizes realized fitness subject to limited computational cost. We propose a directed acyclic graph representation of algorithms and build a search framework that fully exploits the LLM's output. Instead of querying the LLM for full algorithms, we use it to obtain corrections: compact operators that add, replace, or remove code blocks. Each correction augments the graph, yielding new algorithms that compose with prior corrections. This graph structure decomposes algorithms into sets of corrections, enabling correction-level credit assignment that informs subsequent queries. We complement this framework with theoretical insights into the ideal balance between search depth and breadth at different budget levels. We validate our method empirically on three combinatorial optimization problems, demonstrating consistent superiority of our graph-based search over full-algorithm search at equal token budget. Finally, our experiments suggest that rich contexts help only when the LLM's prior knowledge is shallow, and can hinder performance otherwise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper formalizes budget-efficient automatic algorithm design (AAD) with LLMs and proposes representing algorithms as directed acyclic graphs (DAGs) whose nodes are compact LLM-generated corrections (add/replace/remove code blocks). This structure permits composition of new algorithms from prior corrections, correction-level credit assignment to guide future LLM queries, and avoidance of redundant full-algorithm rewrites. The manuscript supplies theoretical analysis of the depth-breadth tradeoff under token budgets and reports empirical results on three combinatorial optimization problems showing that the graph-based search consistently outperforms full-algorithm search at equal token budget.

Significance. If the empirical superiority holds under rigorous controls, the DAG-plus-correction framework offers a concrete mechanism for more token-efficient AAD by salvaging useful substructures and directing search via credit assignment. The explicit treatment of budget constraints and the depth-breadth analysis constitute a useful theoretical contribution that could inform future LLM-based search policies. The work is grounded in a clear structural proposal rather than ad-hoc prompting.
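For intuition, correction-level credit assignment can be sketched as an exact Shapley computation over a toy set of three corrections. The fitness function is an invented stand-in, and the exhaustive permutation loop only works at this tiny scale (the paper's appendix points to tree-based SHAP approximations for larger sets):

```python
import itertools

# Exact Shapley values over three toy corrections: average each
# correction's marginal fitness contribution over all join orders.

def fitness(corrections: frozenset) -> float:
    # Invented fitness: "greedy" is worth 1.0; "two_opt" adds 0.5 but
    # only on top of "greedy"; "verbose_log" contributes nothing.
    score = 0.0
    if "greedy" in corrections:
        score += 1.0
        if "two_opt" in corrections:
            score += 0.5
    return score

def shapley(players):
    """Exact Shapley value of each correction under `fitness`."""
    players = list(players)
    values = {p: 0.0 for p in players}
    perms = list(itertools.permutations(players))
    for perm in perms:
        sofar = set()
        for p in perm:
            before = fitness(frozenset(sofar))
            sofar.add(p)
            values[p] += fitness(frozenset(sofar)) - before
    return {p: v / len(perms) for p, v in values.items()}

credit = shapley(["greedy", "two_opt", "verbose_log"])
print(credit)  # greedy gets the most credit, verbose_log gets none
```

The point of the exercise: credit lands on the corrections that actually move fitness, which is the signal a search policy can use to decide what to query next.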

major comments (2)
  1. [Abstract / Empirical Evaluation] Abstract and Empirical Evaluation section: the central claim of 'consistent superiority' over full-algorithm search at equal token budget is load-bearing, yet the manuscript supplies no description of the three combinatorial problems, the precise definition of token budget (input vs. output tokens, prompt overhead), the full-algorithm baseline implementation, number of independent runs, or any statistical tests. Without these elements the empirical result cannot be verified or reproduced.
  2. [Method / DAG Representation] Framework description (DAG construction and correction composition): the assumption that LLM-generated corrections can be reliably composed into valid algorithms without semantic inconsistencies is central to the credit-assignment mechanism, yet no verification procedure, rejection sampling strategy, or failure-rate statistics are reported. This directly affects whether the reported efficiency gains are attributable to the graph structure or to unstated filtering.
minor comments (2)
  1. [Method] Notation for the DAG and correction operators should be introduced with a small illustrative example (e.g., a 3-node graph) early in the method section to improve readability.
  2. [Experiments] The abstract states that 'rich contexts help only when the LLM's prior knowledge is shallow'; this observation is interesting but would benefit from a short table or plot in the experiments section showing the interaction effect.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below, indicating the revisions we will incorporate to enhance reproducibility and methodological transparency.

read point-by-point responses
  1. Referee: [Abstract / Empirical Evaluation] Abstract and Empirical Evaluation section: the central claim of 'consistent superiority' over full-algorithm search at equal token budget is load-bearing, yet the manuscript supplies no description of the three combinatorial problems, the precise definition of token budget (input vs. output tokens, prompt overhead), the full-algorithm baseline implementation, number of independent runs, or any statistical tests. Without these elements the empirical result cannot be verified or reproduced.

    Authors: We agree that these details are essential for verifying and reproducing the central empirical claim. The submitted manuscript provides only high-level references to the experimental setup. In the revised version, we will expand the Experimental Setup and Results sections to include: explicit descriptions of the three combinatorial optimization problems, a precise definition of token budget (total tokens consumed across all LLM API calls, counting both input prompts and output completions including overhead), the full-algorithm baseline (identical LLM and prompting but generating complete algorithms without graph composition), the number of independent runs, and the statistical tests used for comparison. revision: yes

  2. Referee: [Method / DAG Representation] Framework description (DAG construction and correction composition): the assumption that LLM-generated corrections can be reliably composed into valid algorithms without semantic inconsistencies is central to the credit-assignment mechanism, yet no verification procedure, rejection sampling strategy, or failure-rate statistics are reported. This directly affects whether the reported efficiency gains are attributable to the graph structure or to unstated filtering.

    Authors: We acknowledge that the manuscript does not report verification procedures or failure statistics for correction composition. In the revision, we will augment the Method section with a description of the composition process (AST-based merging to ensure syntactic validity) and add a new subsection on verification, including any rejection sampling applied and the observed failure rates from our experiments. This will allow readers to evaluate whether the reported gains stem from the DAG structure. revision: yes
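A hedged sketch of what such a verification loop could look like, using `ast.parse` as the syntactic gate. The string-substitution merge is purely illustrative, not the authors' AST-based procedure; the point is rejecting corrections that break the program and tracking the failure rate:

```python
import ast

# After applying a candidate correction, re-parse the composed program;
# reject corrections that leave it syntactically invalid and keep
# failure-rate statistics, as the rebuttal promises to report.

def apply_correction(program: str, old: str, new: str) -> str:
    """Naive illustrative merge: textual substitution of one block."""
    return program.replace(old, new)

def accept(program: str) -> bool:
    """Syntactic gate: does the composed program still parse?"""
    try:
        ast.parse(program)
        return True
    except SyntaxError:
        return False

base = "def tour(cities):\n    return sorted(cities)\n"
candidates = [
    ("sorted(cities)", "nearest_neighbor(cities)"),   # parses: accepted
    ("sorted(cities)", "nearest_neighbor(cities"),    # unbalanced paren: rejected
]

accepted, rejected = [], 0
for old, new in candidates:
    merged = apply_correction(base, old, new)
    if accept(merged):
        accepted.append(merged)
    else:
        rejected += 1
print(f"failure rate: {rejected}/{len(candidates)}")
```

Note that a parse check catches only syntactic breakage; whether the efficiency gains survive once semantic failures are also counted is exactly what the referee asks the revision to report.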

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's core contribution is a structural framework for budget-efficient AAD that decomposes algorithms into composable LLM-generated corrections within a DAG, with correction-level credit assignment. The central claim is an empirical result showing graph-based search outperforms full-algorithm search at fixed token budget across three combinatorial problems. No equations, fitted parameters, or self-citations appear in the derivation chain that would reduce any prediction or theoretical insight to its own inputs by construction. The framework is presented as an independent proposal whose validity rests on external experimental validation rather than internal tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the effectiveness of the new graph representation and correction mechanism whose utility is asserted via empirical comparison; limited information is available from the abstract alone.

axioms (1)
  • domain assumption Large language models can generate compact, composable code corrections that improve algorithmic fitness
    Core premise enabling the correction-based search and credit assignment.
invented entities (2)
  • Directed acyclic graph representation of algorithms no independent evidence
    purpose: To decompose algorithms into composable corrections and enable credit assignment
    New structural abstraction introduced to replace full-algorithm rewriting.
  • Correction operators (add/replace/remove code blocks) no independent evidence
    purpose: Compact LLM outputs that augment the algorithm graph
    New search primitives proposed instead of full algorithm generation.

pith-pipeline@v0.9.0 · 5507 in / 1239 out tokens · 53284 ms · 2026-05-12T03:41:30.207420+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 3 internal anchors

  1. [1] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 2024.
  2. [2] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025.
  3. [3] Fei Liu, Xialiang Tong, Mingxuan Yuan, Xi Lin, Fu Luo, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Evolution of heuristics: Towards efficient automatic algorithm design using large language model. arXiv preprint arXiv:2401.02051, 2024.
  4. [4] Haoran Ye, Jiarui Wang, Zhiguang Cao, Federico Berto, Chuanbo Hua, Haeyeon Kim, Jinkyoo Park, and Guojie Song. ReEvo: Large language models as hyper-heuristics with reflective evolution. Advances in Neural Information Processing Systems, 37:43571–43608, 2024.
  5. [5] Zhi Zheng, Zhuoliang Xie, Zhenkun Wang, and Bryan Hooi. Monte Carlo tree search for comprehensive exploration in LLM-based automatic heuristic design. arXiv preprint arXiv:2501.08603, 2025.
  6. [6] Milad Yazdani, Mahdi Mostajabdaveh, Samin Aref, and Zirui Zhou. EvoCut: Strengthening integer programs via evolution-guided language models. arXiv preprint arXiv:2508.11850, 2025.
  7. [7] Anja Surina, Amin Mansouri, Lars Quaedvlieg, Amal Seddas, Maryna Viazovska, Emmanuel Abbe, and Caglar Gulcehre. Algorithm discovery with LLMs: Evolutionary search meets reinforcement learning. arXiv preprint arXiv:2504.05108, 2025.
  8. [8] Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, and Tianyi Zhang. Where do large language models fail when generating code? arXiv preprint arXiv:2406.08731, 2024.
  9. [9] Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Muling Wu, Yunbo Tao, Ming Zhang, Mingxu Chai, Jessica Fan, Zhiheng Xi, et al. What is wrong with your code generated by large language models? An extensive study. Science China Information Sciences, 69(1):112107, 2026.
  10. [10] Wei Cheng, Yongchang Cao, Chen Shen, Binhua Li, Jue Chen, Yongbin Li, and Wei Hu. To diff or not to diff? Structure-aware and adaptive output formats for efficient LLM-based code editing. arXiv preprint arXiv:2604.27296, 2026.
  11. [11] Shunyu Yao, Fei Liu, Xi Lin, Zhichao Lu, Zhenkun Wang, and Qingfu Zhang. Multi-objective evolution of heuristic using large language model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27144–27152, 2025.
  12. [12] Chentong Chen, Mengyuan Zhong, Ye Fan, Jialong Shi, and Jianyong Sun. TiDE: Tuning-integrated dynamic evolution for LLM-based automated heuristic design. arXiv preprint arXiv:2601.21239, 2026.
  13. [13] Asankhaya Sharma. OpenEvolve: An open-source evolutionary coding agent, 2025.
  14. [14] Hui Wang, Xufeng Zhang, and Chaoxu Mu. Planning of heuristics: Strategic planning on large language models with Monte Carlo tree search for automating heuristic optimization. arXiv preprint arXiv:2502.11422, 2025.
  15. [15] Jierui Li, Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. CodeTree: Agent-guided tree search for code generation with large language models, 2025.
  16. [16] Nicolás Astorga, Tennison Liu, Yuanzhang Xiao, and Mihaela van der Schaar. Autoformulation of mathematical optimization models using LLMs. arXiv preprint arXiv:2411.01679, 2024.
  17. [17] Chenyu Huang, Zhengyang Tang, Shixi Hu, Ruoqing Jiang, Xin Zheng, Dongdong Ge, Benyou Wang, and Zizhuo Wang. ORLM: A customizable framework in training large models for automated optimization modeling. Operations Research, 2025.
  18. [18] Zeyi Chen, Xinzhi Zhang, Humishka Zope, Hugo Barbalho, Konstantina Mellou, Marco Molinaro, Janardhan Kulkarni, Ishai Menache, and Sirui Li. OptiMind: Teaching LLMs to think like optimization experts. arXiv preprint arXiv:2509.22979, 2025.
  19. [19] Sirui Li, Janardhan Kulkarni, Ishai Menache, Cathy Wu, and Beibin Li. Towards foundation models for mixed integer linear programming. arXiv preprint arXiv:2410.08288, 2024.
  20. [20] Vinicius Lima, Dzung T Phan, Jayant Kalagnanam, Dhaval Patel, and Nianjun Zhou. Toward a trustworthy optimization modeling agent via verifiable synthetic data generation. arXiv preprint arXiv:2508.03117, 2025.
  21. [21] Anantaram Balakrishnan, James E Ward, and Richard T Wong. Integrated facility location and vehicle routing models: Recent work and future prospects. American Journal of Mathematical and Management Sciences, 7(1-2):35–61, 1987.
  22. [22] Dimitris Bertsimas, Arthur Delarue, and Sebastien Martin. Optimizing schools' start time and bus routes. Proceedings of the National Academy of Sciences, 116(13):5943–5948, 2019.
  23. [23] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
  24. [24] Mahmoud Masoud, Ahmed Abdelhay, and Mohammed Elhenawy. Exploring combinatorial problem solving with large language models: A case study on the travelling salesman problem using GPT-3.5 Turbo. arXiv preprint arXiv:2405.01997, 2024.
  25. [25] Mohammed Elhenawy, Ahmed Abdelhay, Taqwa I Alhadidi, Huthaifa I Ashqar, Shadi Jaradat, Ahmed Jaber, Sebastien Glaser, and Andry Rakotonirainy. Eyeballing combinatorial problems: A case study of using multimodal large language models to solve traveling salesman problems. In International Conference on Intelligent Systems, Blockchain, and Communication Techn…
  26. [26] Scott M Lundberg, Gabriel G Erion, and Su-In Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.
  27. [27] Jilei Yang. Fast TreeSHAP: Accelerating SHAP value computation for trees. arXiv preprint arXiv:2109.09847, 2021.

  28. [28] Gerhard Reinelt. TSPLIB: A traveling salesman problem library. ORSA Journal on Computing, 3(4):376–384, 1991.