Budget-Efficient Automatic Algorithm Design via Code Graph
Pith reviewed 2026-05-12 03:41 UTC · model grok-4.3
The pith
Representing algorithms as a graph of compact LLM corrections achieves higher fitness at the same token budget than regenerating a full algorithm for every candidate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By modeling algorithms as a directed acyclic graph of compact LLM corrections rather than as independent full programs, the search reuses valuable substructures across candidates and performs correction-level credit assignment, which together produce algorithms of higher realized fitness for any given token budget.
What carries the argument
The directed acyclic graph whose nodes are LLM-generated corrections (compact operators that add, replace, or remove code blocks), allowing algorithms to be decomposed into reusable corrections and recomposed without redundant rewriting.
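The review surfaces no concrete data structures, so here is a minimal sketch of one plausible representation, assuming each node stores a single add/replace/remove operation on a named code block and an algorithm is obtained by composing the corrections along a path. None of the names or fields below come from the paper; they are illustrative only.

```python
from dataclasses import dataclass, field

@dataclass
class Correction:
    """One compact LLM-generated edit: add, replace, or remove a named code block."""
    op: str              # "add" | "replace" | "remove"
    block: str           # name of the targeted code block
    code: str = ""       # new block body (unused for "remove")
    parents: tuple = ()  # corrections this one was generated on top of

@dataclass
class CorrectionGraph:
    """DAG whose nodes are corrections; an algorithm is a composition of corrections."""
    nodes: list = field(default_factory=list)

    def add(self, c: Correction) -> int:
        self.nodes.append(c)
        return len(self.nodes) - 1

    def compose(self, path: list) -> str:
        """Apply corrections in path order to an empty program; return its source."""
        blocks: dict = {}
        for i in path:
            c = self.nodes[i]
            if c.op in ("add", "replace"):
                blocks[c.block] = c.code
            else:  # "remove"
                blocks.pop(c.block, None)
        return "\n\n".join(blocks.values())

# Three-node illustration: greedy construction, a 2-opt refinement on top of it,
# and an alternative correction that swaps the construction for a random start.
g = CorrectionGraph()
greedy = g.add(Correction("add", "construct", "def construct(d): ...  # greedy"))
twoopt = g.add(Correction("add", "refine", "def refine(t, d): ...  # 2-opt", parents=(greedy,)))
rand = g.add(Correction("replace", "construct", "def construct(d): ...  # random", parents=(greedy,)))

algo_a = g.compose([greedy, twoopt])        # greedy + 2-opt
algo_b = g.compose([greedy, rand, twoopt])  # random start + 2-opt, reusing `twoopt`
```

The last line is the budget argument in miniature: `algo_b` reuses the 2-opt correction without spending a single new LLM token. It also happens to be the kind of three-node illustrative example the referee asks for under minor comments.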
If this is right
- At any fixed token budget the graph method yields higher average fitness than full-algorithm regeneration.
- Credit assignment at the correction level improves query selection over random or fitness-only selection (a minimal stand-in is sketched after this list).
- Theoretical guidance on depth-versus-breadth trade-offs can be used to allocate the remaining budget after early corrections are evaluated.
- Rich context helps performance only when the LLM starts with shallow prior knowledge of the problem domain.
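Neither the pith nor the report spells out the credit-assignment rule. Fragments visible in the extraction mention Shapley values distributed over interacting corrections, made tractable with the tree methods of [26] and [27], plus a surrogate that cheaply ranks unevaluated paths so only top-scoring candidates are run. A deliberately simple stand-in that already captures the idea, assuming each evaluated algorithm is a set of correction ids with a scalar fitness, is:

```python
from collections import defaultdict

def correction_credit(evaluated):
    """Score each correction by the mean fitness of algorithms containing it.

    `evaluated` is a list of (path, fitness) pairs, where `path` is the set of
    correction ids composing one evaluated algorithm. A per-correction mean is
    a crude substitute for Shapley-style attribution, but it shows what
    correction-level credit assignment buys: a ranking over reusable parts.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for path, fitness in evaluated:
        for cid in path:
            totals[cid] += fitness
            counts[cid] += 1
    return {cid: totals[cid] / counts[cid] for cid in totals}

# Correction 1 co-occurs with the high-fitness algorithms, so it gets extended first.
scores = correction_credit([({0, 1}, 0.8), ({0, 2}, 0.5), ({0, 1, 2}, 0.9)])
assert max(scores, key=scores.get) == 1
```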
Where Pith is reading between the lines
- The same correction-graph structure could be combined with non-LLM search operators such as local mutation or evolutionary recombination.
- If correction credit assignment proves stable, the approach might scale to algorithm design tasks whose full code exceeds current context windows.
- The depth-breadth analysis could be tested on non-combinatorial problems where substructure reuse is even more valuable.
Load-bearing premise
LLM-generated corrections can be reliably composed inside the graph without creating semantic inconsistencies that break the resulting algorithms.
What would settle it
On the same three combinatorial optimization problems and with identical token budgets, the graph-based method fails to produce consistently higher-fitness algorithms than the full-algorithm baseline.
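Operationally, settling it is a paired comparison at matched budgets. A minimal harness for that test, in which `graph_search` and `full_search` are hypothetical callables that spend at most `budget` tokens on one instance and return the best realized fitness, might be:

```python
from statistics import mean

def compare_at_budget(graph_search, full_search, problems, budget, n_runs=20):
    """Run both search policies on each problem at an identical token budget.

    Seeds are shared across methods so per-seed differences support a paired
    test (sign test or Wilcoxon) rather than a bare comparison of means.
    """
    results = {}
    for name, problem in problems.items():
        graph_scores = [graph_search(problem, budget, seed=s) for s in range(n_runs)]
        full_scores = [full_search(problem, budget, seed=s) for s in range(n_runs)]
        results[name] = (mean(graph_scores), mean(full_scores))
    return results
```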
Original abstract
Large language models (LLMs) have emerged as powerful tools for automatic algorithm design (AAD). However, existing pipelines remain inefficient. They operate at the granularity of full algorithms, redundantly rewriting recurring substructures and discarding low-fitness candidates that may contain valuable algorithmic features. We formalize budget-efficient automatic algorithm design, wherein the search policy maximizes realized fitness subject to limited computational cost. We propose a directed acyclic graph representation of algorithms and build a search framework that fully exploits the LLM's output. Instead of querying the LLM for full algorithms, we use it to obtain corrections: compact operators that add, replace, or remove code blocks. Each correction augments the graph, yielding new algorithms that compose with prior corrections. This graph structure decomposes algorithms into sets of corrections, enabling correction-level credit assignment that informs subsequent queries. We complement this framework with theoretical insights into the ideal balance between search depth and breadth at different budget levels. We validate our method empirically on three combinatorial optimization problems, demonstrating consistent superiority of our graph-based search over full-algorithm search at equal token budget. Finally, our experiments suggest that rich contexts help only when the LLM's prior knowledge is shallow, and can hinder performance otherwise.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes budget-efficient automatic algorithm design (AAD) with LLMs and proposes representing algorithms as directed acyclic graphs (DAGs) whose nodes are compact LLM-generated corrections (add/replace/remove code blocks). This structure permits composition of new algorithms from prior corrections, correction-level credit assignment to guide future LLM queries, and avoidance of redundant full-algorithm rewrites. The manuscript supplies theoretical analysis of the depth-breadth tradeoff under token budgets and reports empirical results on three combinatorial optimization problems showing that the graph-based search consistently outperforms full-algorithm search at equal token budget.
Significance. If the empirical superiority holds under rigorous controls, the DAG-plus-correction framework offers a concrete mechanism for more token-efficient AAD by salvaging useful substructures and directing search via credit assignment. The explicit treatment of budget constraints and the depth-breadth analysis constitute a useful theoretical contribution that could inform future LLM-based search policies. The work is grounded in a clear structural proposal rather than ad-hoc prompting.
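An appendix fragment that leaks into the extraction (around the TSPLIB reference) fixes the shape of that depth-breadth analysis. Only the two derivatives appear verbatim; the form of the objective Z_B and the symbol readings (per-query depth ω, breadth N = B/ω under total budget B, with µ_π the depth payoff and Q_π the breadth payoff of policy π) are inferred, not confirmed:

```latex
% Hedged reconstruction from the appendix fragment; only the derivatives are
% verbatim, the form of Z_B and all symbol readings are inferred.
\begin{align*}
Z_B(\pi,\omega) &= \mu_\pi(\omega) + Q_\pi\!\left(\tfrac{B}{\omega}\right),
  \qquad N = \tfrac{B}{\omega},\\
\frac{\partial Z_B}{\partial \omega}(\pi,\omega)
  &= \mu_\pi'(\omega) - \frac{B}{\omega^{2}}\,Q_\pi'\!\left(\tfrac{B}{\omega}\right)
   = \mu_\pi'(\omega) - \frac{N}{\omega}\,Q_\pi'(N),\\
\frac{\partial^{2} Z_B}{\partial \omega^{2}}(\pi,\omega)
  &= \mu_\pi''(\omega) + \frac{2B}{\omega^{3}}\,Q_\pi'\!\left(\tfrac{B}{\omega}\right)
   + \frac{B^{2}}{\omega^{4}}\,Q_\pi''\!\left(\tfrac{B}{\omega}\right).
\end{align*}
```

Setting the first derivative to zero balances the marginal return of deeper refinement on one path against the marginal return of opening another path, which is presumably the source of the budget-dependent depth-breadth prescriptions.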
major comments (2)
- [Abstract / Empirical Evaluation] Abstract and Empirical Evaluation section: the central claim of 'consistent superiority' over full-algorithm search at equal token budget is load-bearing, yet the manuscript supplies no description of the three combinatorial problems, the precise definition of token budget (input vs. output tokens, prompt overhead), the full-algorithm baseline implementation, number of independent runs, or any statistical tests. Without these elements the empirical result cannot be verified or reproduced.
- [Method / DAG Representation] Framework description (DAG construction and correction composition): the assumption that LLM-generated corrections can be reliably composed into valid algorithms without semantic inconsistencies is central to the credit-assignment mechanism, yet no verification procedure, rejection sampling strategy, or failure-rate statistics are reported. This directly affects whether the reported efficiency gains are attributable to the graph structure or to unstated filtering.
minor comments (2)
- [Method] Notation for the DAG and correction operators should be introduced with a small illustrative example (e.g., a 3-node graph) early in the method section to improve readability.
- [Experiments] The abstract states that 'rich contexts help only when the LLM's prior knowledge is shallow'; this observation is interesting but would benefit from a short table or plot in the experiments section showing the interaction effect.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major comment below, indicating the revisions we will incorporate to enhance reproducibility and methodological transparency.
Point-by-point responses
-
Referee: [Abstract / Empirical Evaluation] Abstract and Empirical Evaluation section: the central claim of 'consistent superiority' over full-algorithm search at equal token budget is load-bearing, yet the manuscript supplies no description of the three combinatorial problems, the precise definition of token budget (input vs. output tokens, prompt overhead), the full-algorithm baseline implementation, number of independent runs, or any statistical tests. Without these elements the empirical result cannot be verified or reproduced.
Authors: We agree that these details are essential for verifying and reproducing the central empirical claim. The submitted manuscript provides only high-level references to the experimental setup. In the revised version, we will expand the Experimental Setup and Results sections to include: explicit descriptions of the three combinatorial optimization problems, a precise definition of token budget (total tokens consumed across all LLM API calls, counting both input prompts and output completions including overhead), the full-algorithm baseline (identical LLM and prompting but generating complete algorithms without graph composition), the number of independent runs, and the statistical tests used for comparison. revision: yes
-
Referee: [Method / DAG Representation] Framework description (DAG construction and correction composition): the assumption that LLM-generated corrections can be reliably composed into valid algorithms without semantic inconsistencies is central to the credit-assignment mechanism, yet no verification procedure, rejection sampling strategy, or failure-rate statistics are reported. This directly affects whether the reported efficiency gains are attributable to the graph structure or to unstated filtering.
Authors: We acknowledge that the manuscript does not report verification procedures or failure statistics for correction composition. In the revision, we will augment the Method section with a description of the composition process (AST-based merging to ensure syntactic validity) and add a new subsection on verification, including any rejection sampling applied and the observed failure rates from our experiments. This will allow readers to evaluate whether the reported gains stem from the DAG structure. revision: yes
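Two of the promised revisions are easy to pin down in code. For the budget definition in the first response, a minimal accounting wrapper, with `llm_call` and `count_tokens` as hypothetical stand-ins for whatever client and tokenizer are actually used, might be:

```python
class TokenBudget:
    """Charge every LLM call, prompt and completion alike, against one budget."""

    def __init__(self, budget: int):
        self.budget = budget
        self.spent = 0

    def charge(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.spent += prompt_tokens + completion_tokens
        if self.spent > self.budget:
            raise RuntimeError("token budget exhausted")

# budget = TokenBudget(200_000)
# reply = llm_call(prompt)                                   # hypothetical client
# budget.charge(count_tokens(prompt), count_tokens(reply))   # hypothetical tokenizer
```

For the second response, a rejection filter in the spirit of the promised AST-based verification, parsing the composed program and smoke-testing it on a toy instance before the candidate enters the graph (the `smoke_test` callable is an assumption, not the paper's procedure), could be:

```python
import ast

def accept_candidate(source: str, smoke_test) -> bool:
    """Reject a composed candidate unless it parses and survives a cheap run.

    Syntactic validity alone cannot catch semantic inconsistencies between
    corrections, which is why the behavioral check matters; rejections here
    would feed the failure-rate statistics the referee asks for.
    """
    try:
        ast.parse(source)          # syntactic validity of the merged program
    except SyntaxError:
        return False
    try:
        return bool(smoke_test(source))  # behavioral check on a toy instance
    except Exception:
        return False
```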
Circularity Check
No significant circularity identified
full rationale
The paper's core contribution is a structural framework for budget-efficient AAD that decomposes algorithms into composable LLM-generated corrections within a DAG, with correction-level credit assignment. The central claim is an empirical result showing graph-based search outperforms full-algorithm search at fixed token budget across three combinatorial problems. No equations, fitted parameters, or self-citations appear in the derivation chain that would reduce any prediction or theoretical insight to its own inputs by construction. The framework is presented as an independent proposal whose validity rests on external experimental validation rather than internal tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: large language models can generate compact, composable code corrections that improve algorithmic fitness.
invented entities (2)
- Directed acyclic graph representation of algorithms: no independent evidence
- Correction operators (add/replace/remove code blocks): no independent evidence
Reference graph
Works this paper leans on
- [1] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M Pawan Kumar, Emilien Dupont, Francisco JR Ruiz, Jordan S Ellenberg, Pengming Wang, Omar Fawzi, et al. Mathematical discoveries from program search with large language models. Nature, 2024.
- [2] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco JR Ruiz, Abbas Mehrabian, et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131, 2025.
- [3] Fei Liu, Xialiang Tong, Mingxuan Yuan, Xi Lin, Fu Luo, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Evolution of heuristics: Towards efficient automatic algorithm design using large language model. arXiv preprint arXiv:2401.02051, 2024.
- [4] Haoran Ye, Jiarui Wang, Zhiguang Cao, Federico Berto, Chuanbo Hua, Haeyeon Kim, Jinkyoo Park, and Guojie Song. ReEvo: Large language models as hyper-heuristics with reflective evolution. Advances in Neural Information Processing Systems, 37:43571–43608, 2024.
- [5] Zhi Zheng, Zhuoliang Xie, Zhenkun Wang, and Bryan Hooi. Monte Carlo tree search for comprehensive exploration in LLM-based automatic heuristic design. arXiv preprint arXiv:2501.08603, 2025.
- [6] Milad Yazdani, Mahdi Mostajabdaveh, Samin Aref, and Zirui Zhou. EvoCut: Strengthening integer programs via evolution-guided language models. arXiv preprint arXiv:2508.11850, 2025.
- [7] Anja Surina, Amin Mansouri, Lars Quaedvlieg, Amal Seddas, Maryna Viazovska, Emmanuel Abbe, and Caglar Gulcehre. Algorithm discovery with LLMs: Evolutionary search meets reinforcement learning. arXiv preprint arXiv:2504.05108, 2025.
- [8] Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, and Tianyi Zhang. Where do large language models fail when generating code? arXiv preprint arXiv:2406.08731, 2024.
- [9] Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Muling Wu, Yunbo Tao, Ming Zhang, Mingxu Chai, Jessica Fan, Zhiheng Xi, et al. What is wrong with your code generated by large language models? An extensive study. Science China Information Sciences, 69(1):112107, 2026.
- [10] Wei Cheng, Yongchang Cao, Chen Shen, Binhua Li, Jue Chen, Yongbin Li, and Wei Hu. To diff or not to diff? Structure-aware and adaptive output formats for efficient LLM-based code editing. arXiv preprint arXiv:2604.27296, 2026.
- [11] Shunyu Yao, Fei Liu, Xi Lin, Zhichao Lu, Zhenkun Wang, and Qingfu Zhang. Multi-objective evolution of heuristic using large language model. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 27144–27152, 2025.
- [12] Chentong Chen, Mengyuan Zhong, Ye Fan, Jialong Shi, and Jianyong Sun. TiDE: Tuning-integrated dynamic evolution for LLM-based automated heuristic design. arXiv preprint arXiv:2601.21239, 2026.
- [13] Asankhaya Sharma. OpenEvolve: An open-source evolutionary coding agent, 2025.
- [14] Hui Wang, Xufeng Zhang, and Chaoxu Mu. Planning of heuristics: Strategic planning on large language models with Monte Carlo tree search for automating heuristic optimization. arXiv preprint arXiv:2502.11422, 2025.
- [15] Jierui Li, Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. CodeTree: Agent-guided tree search for code generation with large language models. 2025.
- [16] Nicolás Astorga, Tennison Liu, Yuanzhang Xiao, and Mihaela van der Schaar. Autoformulation of mathematical optimization models using LLMs. arXiv preprint arXiv:2411.01679, 2024.
- [17] Chenyu Huang, Zhengyang Tang, Shixi Hu, Ruoqing Jiang, Xin Zheng, Dongdong Ge, Benyou Wang, and Zizhuo Wang. ORLM: A customizable framework in training large models for automated optimization modeling. Operations Research, 2025.
- [18] Zeyi Chen, Xinzhi Zhang, Humishka Zope, Hugo Barbalho, Konstantina Mellou, Marco Molinaro, Janardhan Kulkarni, Ishai Menache, and Sirui Li. OptiMind: Teaching LLMs to think like optimization experts. arXiv preprint arXiv:2509.22979, 2025.
- [19] Sirui Li, Janardhan Kulkarni, Ishai Menache, Cathy Wu, and Beibin Li. Towards foundation models for mixed integer linear programming. arXiv preprint arXiv:2410.08288, 2024.
- [20] Vinicius Lima, Dzung T Phan, Jayant Kalagnanam, Dhaval Patel, and Nianjun Zhou. Toward a trustworthy optimization modeling agent via verifiable synthetic data generation. arXiv preprint arXiv:2508.03117, 2025.
- [21] Anantaram Balakrishnan, James E Ward, and Richard T Wong. Integrated facility location and vehicle routing models: Recent work and future prospects. American Journal of Mathematical and Management Sciences, 7(1-2):35–61, 1987.
- [22] Dimitris Bertsimas, Arthur Delarue, and Sebastien Martin. Optimizing schools' start time and bus routes. Proceedings of the National Academy of Sciences, 116(13):5943–5948, 2019.
- [23] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. DeepSeek-V3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025.
- [24] Mahmoud Masoud, Ahmed Abdelhay, and Mohammed Elhenawy. Exploring combinatorial problem solving with large language models: A case study on the travelling salesman problem using GPT-3.5 Turbo. arXiv preprint arXiv:2405.01997, 2024.
- [25] Mohammed Elhenawy, Ahmed Abdelhay, Taqwa I Alhadidi, Huthaifa I Ashqar, Shadi Jaradat, Ahmed Jaber, Sebastien Glaser, and Andry Rakotonirainy. Eyeballing combinatorial problems: A case study of using multimodal large language models to solve traveling salesman problems. In International Conference on Intelligent Systems, Blockchain, and Communication Technologies, 2024.
- [26] Scott M Lundberg, Gabriel G Erion, and Su-In Lee. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.
- [27] Jilei Yang. Fast TreeSHAP: Accelerating SHAP value computation for trees. arXiv preprint arXiv:2109.09847, 2021.
- [28] Gerhard Reinelt. TSPLIB - a traveling salesman problem library. ORSA Journal on Computing, 3(4):376–384, 1991.