Teacher-Aware Evolution of Heuristic Programs from Learned Optimization Policies
Pith reviewed 2026-05-12 05:29 UTC · model grok-4.3
The pith
Learned optimization policies can guide the evolution of better static heuristics by supplying local behavioral feedback instead of relying only on final performance scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that independently trained learned optimization policies can be repurposed as behavioral teachers: by querying them for action preferences on states visited by evolving candidate programs, the search discovers static executable heuristics that outperform those evolved using only task performance, as validated on scheduling, routing, and graph optimization benchmarks without any neural inference required at deployment.
What carries the argument
Teacher-aware evolutionary framework that extracts local action preferences from learned policies to supplement endpoint performance in LLM-driven heuristic search.
If this is right
- Heuristics discovered this way achieve higher performance on standard combinatorial optimization benchmarks than performance-driven LLM evolution baselines.
- The final heuristics are static executable programs that require no neural network inference or model deployment at runtime.
- Learned optimization policies can be reused as sources of behavioral guidance rather than being directly imitated or deployed.
- The method applies to multiple problem classes including scheduling, routing, and graph optimization.
Where Pith is reading between the lines
- The separation of policy training from heuristic deployment allows one trained policy to potentially inform many different heuristic design tasks without repeated neural costs.
- This feedback style could be tested in domains outside combinatorial optimization where learned models exist but fast rule-based implementations are preferred.
- Multiple policies trained under different objectives might be queried together to produce richer behavioral signals for evolution.
Load-bearing premise
Querying learned optimization policies on the intermediate states visited by candidate heuristics produces reliable local behavioral feedback that meaningfully improves evolution beyond endpoint performance signals alone.
What would settle it
An experiment in which heuristics evolved with the teacher-derived action preferences show no consistent performance advantage over those evolved with performance feedback only, across the same scheduling, routing, and graph benchmarks.
Figures
read the original abstract
LLM-based automatic heuristic design has shown promise for generating executable heuristics for combinatorial optimization, but existing methods mainly rely on delayed endpoint performance. We propose a \emph{teacher-aware evolutionary framework} that uses independently trained learned optimization policies as behavioral teachers. Instead of deploying or imitating the teacher, our method queries it on states visited by candidate heuristic programs and uses its action preferences as local feedback for evolution. The resulting search discovers static executable heuristics guided by both task performance and teacher-derived behavioral signals. Experiments on scheduling, routing, and graph optimization benchmarks show that our method improves over performance-driven LLM heuristic evolution baselines while requiring no neural inference at deployment. These results suggest that learned optimization policies can be repurposed as behavioral feedback sources for automatic heuristic discovery.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a teacher-aware evolutionary framework for LLM-based automatic design of executable heuristic programs for combinatorial optimization tasks. Independently trained learned optimization policies serve as behavioral teachers: candidate heuristics are evaluated not only on endpoint performance but also by querying the teacher on visited states to obtain local action-preference feedback that guides the evolutionary search. The resulting static heuristics require no neural inference at deployment. Experiments on scheduling, routing, and graph optimization benchmarks report improvements over performance-driven LLM heuristic evolution baselines.
Significance. If the central claim holds, the work offers a way to repurpose learned optimization policies as sources of dense behavioral signals for heuristic evolution without imitation or deployment costs. This addresses the delayed-feedback limitation of prior performance-only methods and produces directly executable programs. The approach is notable for its parameter-free integration of teacher signals and its focus on falsifiable benchmark improvements.
major comments (1)
- [§4 and §5] §4 (Method) and §5 (Experiments): The fitness function combines endpoint performance with teacher-derived action preferences, yet no ablation is reported that removes only the teacher-action-preference term while preserving identical performance evaluation, search budget, and LLM prompt structure. Without this control, the reported gains on scheduling/routing/graph benchmarks cannot be attributed to the teacher component rather than the evolutionary loop or baseline LLM capabilities.
minor comments (2)
- [Abstract] Abstract: states that the method 'improves over performance-driven LLM heuristic evolution baselines' but supplies no quantitative deltas, error bars, number of runs, or statistical tests.
- [§4] Notation for the combined fitness function is introduced without an explicit equation number; cross-referencing the exact weighting between performance and teacher signals is difficult.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for identifying a point that merits clarification. We respond to the major comment below.
read point-by-point responses
-
Referee: [§4 and §5] §4 (Method) and §5 (Experiments): The fitness function combines endpoint performance with teacher-derived action preferences, yet no ablation is reported that removes only the teacher-action-preference term while preserving identical performance evaluation, search budget, and LLM prompt structure. Without this control, the reported gains on scheduling/routing/graph benchmarks cannot be attributed to the teacher component rather than the evolutionary loop or baseline LLM capabilities.
Authors: We thank the referee for this observation. The performance-driven LLM heuristic evolution baseline reported in §5 is constructed precisely as the requested control: it employs the same evolutionary loop, identical search budget, and the same LLM prompt structure for generating and refining candidate heuristics. The only difference is the fitness function, which uses solely endpoint performance and omits the teacher-action-preference term. The reported improvements are therefore measured against this exact ablation. To remove any potential ambiguity, we will revise the manuscript to state this equivalence explicitly in §4 and §5 and to add a short paragraph confirming that all other experimental factors (prompt wording, population size, number of generations, etc.) are held constant. revision: partial
Circularity Check
No significant circularity; derivation relies on independent teacher training and external benchmarks
full rationale
The paper's core method trains optimization policies independently, then queries them on states from candidate heuristics to supply behavioral feedback during LLM-driven evolution. This feedback is combined with endpoint performance but does not reduce to a self-definition, fitted parameter renamed as prediction, or self-citation chain. The central claim rests on experimental gains versus performance-only baselines on scheduling/routing/graph tasks, with no equations or steps that equate the output to the input by construction. The absence of an ablation isolating the teacher term is an evidence gap, not a circularity in the derivation itself.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Verifiability via iterative policy extraction
Osbert Bastani, Yewen Pu, and Armando Solar-Lezama. Verifiability via iterative policy extraction. InAdvances in Neural Information Processing Systems 31 (NeurIPS 2018),
work page 2018
-
[2]
Le, Mohammad Norouzi, and Samy Bengio
Irwan Bello, Hieu Pham, Quoc V . Le, Mohammad Norouzi, and Samy Bengio. Neural combinatorial optimization with reinforcement learning. InProceedings of the International Conference on Learning Representations (ICLR 2017),
work page 2017
-
[3]
Wouter Kool, Herke van Hoof, and Max Welling. Attention, learn to solve routing problems! In Proceedings of the International Conference on Learning Representations (ICLR 2019),
work page 2019
-
[4]
Pomo: Policy optimization with multiple optima for reinforcement learning
Yeong-Dae Kwon, Jinho Choo, Byoungjip Kim, Iljoo Yoon, Youngjune Gwon, and Seungjai Min. Pomo: Policy optimization with multiple optima for reinforcement learning. InAdvances in Neural Information Processing Systems 33 (NeurIPS 2020), pages 21188–21198,
work page 2020
-
[5]
Evolution of heuristics: Towards efficient automatic algorithm design using large language models
Fei Liu, Xialiang Tong, Mingxuan Yuan, Xi Lin, Fu Luo, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Evolution of heuristics: Towards efficient automatic algorithm design using large language models. InProceedings of the 41st International Conference on Machine Learning (ICML 2024), pages 32201–32223, 2024a. Fei Liu, Yilu Liu, Qingfu Zhang, Xialiang Tong, a...
work page 2024
-
[6]
Large language models as evolutionary optimizers
10 Shengcai Liu, Caishun Chen, Xinghua Qu, Ke Tang, and Yew-Soon Ong. Large language models as evolutionary optimizers. InIEEE Congress on Evolutionary Computation, pages 1–8, 2024b. Fu Luo, Xi Lin, Fei Liu, Qingfu Zhang, and Zhenkun Wang. Neural combinatorial optimization with heavy decoder: Toward large scale generalization. InAdvances in Neural Informa...
work page 2023
-
[7]
Adjustable robust reinforcement learning for online 3d bin packing
Yuxin Pan, Yize Chen, and Fangzhen Lin. Adjustable robust reinforcement learning for online 3d bin packing. InAdvances in Neural Information Processing Systems 36 (NeurIPS 2023), pages 51926–51954,
work page 2023
-
[8]
Neuro-symbolic program synthesis
Emilio Parisotto, Abdel-Rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Push- meet Kohli. Neuro-symbolic program synthesis. InProceedings of the International Conference on Learning Representations (ICLR 2017),
work page 2017
-
[9]
Andrei A. Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirk- patrick, Razvan Pascanu, V olodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. InProceedings of the International Conference on Learning Representations (ICLR 2016),
work page 2016
-
[10]
Programmatically interpretable reinforcement learning
Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. InProceedings of the 35th International Conference on Machine Learning (ICML 2018), pages 5045–5054,
work page 2018
-
[11]
Xuan Wu, Di Wang, Chunguo Wu, Lijie Wen, Chunyan Miao, Yubin Xiao, and You Zhou. Efficient heuristics generation for solving combinatorial optimization problems using large language models. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2025), pages 3228–3239,
work page 2025
-
[12]
Reevo: Large language models as hyper-heuristics with reflective evolution
Haoran Ye, Jiarui Wang, Zhiguang Cao, Federico Berto, Chuanbo Hua, Haeyeon Kim, Jinkyoo Park, and Guojie Song. Reevo: Large language models as hyper-heuristics with reflective evolution. InAdvances in Neural Information Processing Systems 37 (NeurIPS 2024), pages 43571–43608,
work page 2024
-
[13]
Learning to dispatch for job shop scheduling via deep reinforcement learning
11 Cong Zhang, Wen Song, Zhiguang Cao, Jie Zhang, Puay Siew Tan, and Xu Chi. Learning to dispatch for job shop scheduling via deep reinforcement learning. InAdvances in Neural Information Processing Systems 33 (NeurIPS 2020), pages 1621–1632,
work page 2020
-
[14]
Monte carlo tree search for comprehen- sive exploration in llm-based automatic heuristic design
Zhi Zheng, Zhuoliang Xie, Zhenkun Wang, and Bryan Hooi. Monte carlo tree search for comprehen- sive exploration in llm-based automatic heuristic design. InProceedings of the 42nd International Conference on Machine Learning (ICML 2025),
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.