pith. sign in

arxiv: 2605.29649 · v2 · pith:SAWUXQ3Lnew · submitted 2026-05-28 · 💻 cs.AI

LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning

Pith reviewed 2026-06-29 07:25 UTC · model grok-4.3

classification 💻 cs.AI
keywords heuristic searchAI planninglarge language modelsevolutionary algorithmsdomain-independent heuristicsMAP-Elitessymbolic planninginformedness-speed tradeoff
0
0 comments X

The pith

LLM-driven evolution on training domains produces the first domain-independent heuristics that solve more tasks than the strongest hand-engineered baselines on entirely new domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evolves C++ heuristic functions for symbolic planning by letting an LLM mutate existing programs, archiving variants in a grid sorted by how informative versus fast each one is, and scoring them on a mix of how many tasks they finish and how long they take. Training happens only on a fixed collection of domains; the resulting programs are then tested on held-out domains never seen during evolution. A sympathetic reader would care because heuristic quality has been the main limit on what symbolic planners can solve, and an automated route that beats decades of expert design would change how such systems are built. The programs remain ordinary C++ so they drop straight into existing planners and keep their formal guarantees.

Core claim

By seeding an evolutionary loop with an LLM that proposes mutations to parent heuristics written in C++, maintaining a MAP-Elites archive indexed by informedness and speed, and assigning fitness from a blend of coverage and runtime on training domains, the method yields heuristics whose best member solves more tasks than the strongest baseline on unseen test domains while the full collection of evolved programs spans the Pareto frontier of the informedness-speed tradeoff.

What carries the argument

MAP-Elites archive keyed on informedness and speed, with LLM mutations generating new C++ heuristic candidates whose fitness is computed from coverage blended with solving time on training domains.

If this is right

  • The evolved heuristics slot into existing planners as ordinary C++ functions and inherit soundness and completeness.
  • Seeding the evolutionary process from the blind heuristic can produce stronger final programs than seeding from the FF heuristic.
  • LLM reasoning effort during mutation affects how often candidates compile far more than the quality of the candidates that do compile.
  • A broad benchmark of hand-engineered heuristics on the informedness-speed tradeoff is now available for comparison.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mutation-plus-archive loop could be applied to other search problems that rely on hand-crafted heuristics, such as game playing or verification.
  • Planners could dynamically select among the Pareto-optimal evolved heuristics according to current time or memory limits.
  • Varying the diversity or number of training domains would test how robust the generalization is to changes in the training distribution.
  • The plain C++ form opens the possibility of further manual or automated refinement after evolution.

Load-bearing premise

Heuristics evolved only on a finite set of training domains will generalize their performance advantage to completely new planning domains without overfitting to the training distribution or the particular coverage-runtime fitness blend.

What would settle it

A collection of new planning domains in which none of the evolved heuristics solves more tasks than the strongest hand-engineered baseline would falsify the generalization result.

Figures

Figures reproduced from arXiv: 2605.29649 by Elliot Gestrin, Jendrik Seipp.

Figure 2
Figure 2. Figure 2: Cactus plot on the held-out 2023 IPC Learning [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Per-run trajectories of the best-so-far training score (left axis) alongside a kernel-density estimate of improvement [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of scores grouped by reasoning effort. [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Cactus plot on the 2023 IPC Learning Track test instances with the four notable evolved heuristics from Section 5 [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Unfiltered informedness–speed scatter across all heuristics in our benchmark, including the weak baselines that [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Stacked task outcomes per heuristic on the held-out test set: solved, out of time, out of memory, translator out of [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: For each ordered pair of heuristics, the share of test tasks that the row solves and the column does not. [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Jaccard similarity between the per-heuristic sets of solved tasks from Figure 7. High values mark heuristics that are [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Distribution of training scores at each generation across all evolved runs. Later generations trend toward higher scores, [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
read the original abstract

Heuristic search is the dominant paradigm in symbolic AI planning, and the strongest heuristics are the result of decades of work by planning researchers. Recent work has shown that large language models (LLMs) can design heuristics for individual planning domains, but no LLM-generated heuristic has so far worked on arbitrary planning tasks. In this paper, we use evolutionary search to produce the first LLM-generated domain-independent heuristics that exceed the hand-engineered state of the art. We let an LLM mutate parent heuristics written in C++, store candidates in a MAP-Elites archive keyed on informedness and speed and calculate fitness scores by blending coverage with solving time. To place the evolved programs in context, we additionally benchmark a broad set of hand-engineered heuristics on their informedness-speed tradeoff, which to our knowledge has not been done before. On unseen testing domains, our best evolved heuristic solves more tasks than even the strongest baseline, with our full heuristic suite spanning the Pareto frontier of said tradeoff. We also find that seeding evolution from the trivial blind heuristic outperforms seeding from the strong FF heuristic, even when the resulting program is itself an FF variant, and that LLM reasoning effort affects how often candidates compile much more than the quality of those that do. Because the evolved programs are plain C++, they slot into existing planners as drop-in replacements and inherit the soundness and completeness guarantees of the underlying search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to introduce the first LLM-generated domain-independent heuristics for symbolic AI planning that outperform hand-engineered state-of-the-art baselines. It uses evolutionary search where an LLM mutates C++ heuristic code, stores candidates in a MAP-Elites archive keyed on informedness and speed, and computes fitness by blending coverage with solving time. On unseen test domains, the best evolved heuristic reportedly solves more tasks than the strongest baseline, with the full suite spanning the informedness-speed Pareto frontier. Additional results include that seeding from the blind heuristic outperforms seeding from FF, and that LLM reasoning effort primarily affects compilation rates rather than quality of successful candidates. The evolved C++ programs are presented as drop-in replacements inheriting planner guarantees.

Significance. If the generalization results hold under rigorous evaluation, this would represent a notable advance by showing that LLM-driven evolution can produce domain-independent planning heuristics exceeding decades of hand-engineered work, with practical integration via C++ code. The broad benchmarking of hand-engineered heuristics on the informedness-speed tradeoff is a useful and previously unreported contribution that provides context for the evolved results.

major comments (3)
  1. [Experimental Evaluation] Experimental Evaluation (likely §4 or equivalent): The central claim that the best evolved heuristic outperforms the strongest baseline on unseen testing domains, and that the suite spans the Pareto frontier, is presented without any details on the number of training vs. testing domains, the procedure for creating the train/test split, domain diversity metrics, selection criteria for domains, or any statistical tests (e.g., significance of coverage differences). These omissions directly undermine evaluation of the generalization step highlighted as the weakest assumption.
  2. [Method / Fitness] Fitness computation description (likely §3): The fitness score is described as a blend of coverage and solving time, but the specific weights, how they were chosen, and any sensitivity analysis are not reported. This is load-bearing because the MAP-Elites archive and Pareto claims depend on this blending, and different weights could alter which candidates are retained or which evolved heuristic is deemed best.
  3. [Results] Results on seeding and LLM effort (likely §5): Claims that seeding from the blind heuristic outperforms FF seeding (even for FF-variant outputs) and that reasoning effort affects compilation more than quality lack reported run counts, variance across runs, or controls for other factors, making it impossible to assess whether these are robust findings supporting the overall approach.
minor comments (2)
  1. [Abstract / Introduction] The abstract and introduction could more explicitly reference the specific tables or figures showing the Pareto frontier comparison to allow readers to locate the supporting data immediately.
  2. [Method] Notation for informedness and speed metrics in the MAP-Elites archive could be defined more formally with equations or pseudocode to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which identify areas where additional clarity and rigor will strengthen the presentation of our results. We respond to each major comment below and will incorporate the suggested improvements in the revised manuscript.

read point-by-point responses
  1. Referee: [Experimental Evaluation] Experimental Evaluation (likely §4 or equivalent): The central claim that the best evolved heuristic outperforms the strongest baseline on unseen testing domains, and that the suite spans the Pareto frontier, is presented without any details on the number of training vs. testing domains, the procedure for creating the train/test split, domain diversity metrics, selection criteria for domains, or any statistical tests (e.g., significance of coverage differences). These omissions directly undermine evaluation of the generalization step highlighted as the weakest assumption.

    Authors: We agree that the experimental section requires expanded reporting to allow readers to assess the generalization claims. In the revision we will add: the precise counts of training and testing domains, the exact procedure used to form the train/test split (including any stratification or diversity considerations), quantitative domain diversity metrics, the domain selection criteria, and statistical tests (such as Wilcoxon signed-rank tests) with p-values for coverage differences between the best evolved heuristic and the strongest baseline. These details will be placed in a dedicated subsection of the experimental evaluation. revision: yes

  2. Referee: [Method / Fitness] Fitness computation description (likely §3): The fitness score is described as a blend of coverage and solving time, but the specific weights, how they were chosen, and any sensitivity analysis are not reported. This is load-bearing because the MAP-Elites archive and Pareto claims depend on this blending, and different weights could alter which candidates are retained or which evolved heuristic is deemed best.

    Authors: We acknowledge that the fitness function description is incomplete. The revised manuscript will explicitly state the weights applied to coverage and solving time, explain the rationale and any preliminary experiments used to select them, and include a sensitivity analysis that varies the weights and reports the resulting changes to the retained archive members and the identity of the best heuristic. This information will be added to Section 3. revision: yes

  3. Referee: [Results] Results on seeding and LLM effort (likely §5): Claims that seeding from the blind heuristic outperforms FF seeding (even for FF-variant outputs) and that reasoning effort affects compilation more than quality lack reported run counts, variance across runs, or controls for other factors, making it impossible to assess whether these are robust findings supporting the overall approach.

    Authors: We agree that the robustness of the seeding and LLM-effort results needs better documentation. The revision will report the number of independent runs performed for each condition, means and standard deviations (or other variance measures) across runs, and any controls applied for confounding factors. These statistics will be added to the relevant results subsection. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking against external baselines

full rationale

The paper reports an experimental protocol that evolves C++ heuristics via LLM mutation on a finite training-domain set, then measures coverage and runtime on a disjoint unseen test set, comparing directly to a broad suite of hand-engineered baselines. No equations, fitted parameters, or self-referential definitions appear; the reported advantage is a measured outcome, not a quantity forced by construction from the training fitness function or archive. Self-citations (if any) are not load-bearing for the central empirical claim, which remains falsifiable by external replication on the same domain split.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on the abstract alone, the central claim rests on the standard assumption that heuristic search remains sound when the heuristic is admissible or consistent, plus the empirical assumption that the chosen training domains and fitness blend produce generalizable programs. No free parameters or invented entities are explicitly named.

axioms (1)
  • standard math Heuristic search algorithms remain sound and complete when supplied with admissible or consistent heuristics.
    Implicit background assumption of all symbolic planning work; invoked when the abstract states that the evolved C++ programs inherit the guarantees of the underlying search.

pith-pipeline@v0.9.1-grok · 5773 in / 1313 out tokens · 28652 ms · 2026-06-29T07:25:17.910803+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references · 5 canonical work pages · 4 internal anchors

  1. [1]

    , " * write output.state after.block = add.period write newline

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    Aler, R.; Borrajo, D.; and Isasi, P. 2001. Learning to Solve Planning Problems Efficiently by Means of Genetic Programming. Evolutionary Computation, 9(4): 387--420

  4. [4]

    a ckstr \

    B \"a ckstr \"o m, C.; and Nebel, B. 1995. Complexity Results for SAS ^ + Planning. Computational Intelligence, 11(4): 625--655

  5. [5]

    Bonet, B. 2013. An Admissible Heuristic for SAS ^ + Planning Obtained from the State Equation. In Rossi, F., ed., Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI 2013), 2268--2274. AAAI Press

  6. [6]

    Bonet, B.; and Geffner, H. 2001. Planning as Heuristic Search. Artificial Intelligence, 129(1): 5--33

  7. [7]

    Bylander, T. 1994. The Computational Complexity of Propositional STRIPS Planning. Artificial Intelligence, 69(1--2): 165--204

  8. [8]

    Z.; Thi \'e baux, S.; and Trevizan, F

    Chen, D. Z.; Thi \'e baux, S.; and Trevizan, F. 2024. Learning Domain-Independent Heuristics for Grounded and Lifted Planning. In Dy, J.; and Natarajan, S., eds., Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence ( AAAI 2024) , 20078--20086. AAAI Press

  9. [9]

    Z.; Zenn, J.; Cinquin, T.; and McIlraith, S

    Chen, D. Z.; Zenn, J.; Cinquin, T.; and McIlraith, S. A. 2025. Language Models for Generalised PDDL Planning: Synthesising Sound and Programmatic Policies. In ICAPS 2025 Workshop on Planning in the Era of LLMs (LM4Plan)

  10. [10]

    Frontier Large Language Models Rival State-of-the-Art Planners

    Corr \^e a, A. B.; Pereira, A. G.; and Seipp, J. 2025. The 2025 Planning Performance of Frontier Large Language Models. arXiv:2511.09378

  11. [11]

    B.; Pereira, A

    Corr\^ e a, A. B.; Pereira, A. G.; and Seipp, J. 2025. Classical Planning with LLM -Generated Heuristics: Challenging the State of the Art with Python Code. In Proceedings of the Thirty-Ninth Annual Conference on Neural Information Processing Systems ( NeurIPS 2025)

  12. [12]

    Edelkamp, S. 2002. Symbolic Pattern Databases in Heuristic Search Planning. In Ghallab, M.; Hertzberg, J.; and Traverso, P., eds., Proceedings of the Sixth International Conference on Artificial Intelligence Planning and Scheduling ( AIPS 2002) , 274--283. AAAI Press

  13. [13]

    E.; and Smith, J

    Eiben, A. E.; and Smith, J. E. 2015. Introduction to Evolutionary Computing. Natural Computing Series. Springer, 2nd edition

  14. [14]

    Fox, M.; and Long, D. 2003. PDDL2.1 : An Extension to PDDL for Expressing Temporal Planning Domains. Journal of Artificial Intelligence Research, 20: 61--124

  15. [15]

    Fritzsche, M.; Gestrin, E.; and Seipp, J. 2026. Symmetry-Aware Transformer Training for Automated Planning. In Jenkins, C.; and Taylor, M., eds., Proceedings of the Fortieth AAAI Conference on Artificial Intelligence ( AAAI 2026) . AAAI Press

  16. [16]

    Fukunaga, A. S. 2008. Automated Discovery of Local Search Heuristics for Satisfiability Testing. Evolutionary Computation, 16(1): 31--61

  17. [17]

    Gestrin, E.; Kuhlmann, M.; and Seipp, J. 2024. NL2Plan : Robust LLM -Driven Planning from Minimal Text Descriptions. In ICAPS 2024 Workshop on Human-Aware and Explainable Planning (HAXP)

  18. [18]

    Haslum, P.; Botea, A.; Helmert, M.; Bonet, B.; and Koenig, S. 2007. Domain-Independent Construction of Pattern Database Heuristics for Cost-Optimal Planning. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence ( AAAI 2007) , 1007--1012. AAAI Press

  19. [19]

    Haslum, P.; and Geffner, H. 2000. Admissible Heuristics for Optimal Planning. In Chien, S.; Kambhampati, S.; and Knoblock, C. A., eds., Proceedings of the Fifth International Conference on Artificial Intelligence Planning and Scheduling ( AIPS 2000) , 140--149. AAAI Press

  20. [20]

    Helmert, M. 2004. A Planning Heuristic Based on Causal Graph Analysis. In Zilberstein, S.; Koehler, J.; and Koenig, S., eds., Proceedings of the Fourteenth International Conference on Automated Planning and Scheduling ( ICAPS 2004) , 161--170. AAAI Press

  21. [21]

    Helmert, M. 2006. The Fast Downward Planning System. Journal of Artificial Intelligence Research, 26: 191--246

  22. [22]

    Helmert, M.; and Domshlak, C. 2009. Landmarks, Critical Paths and Abstractions: What's the Difference Anyway? In Gerevini, A.; Howe, A.; Cesta, A.; and Refanidis, I., eds., Proceedings of the Nineteenth International Conference on Automated Planning and Scheduling (ICAPS 2009), 162--169. AAAI Press

  23. [23]

    Helmert, M.; and Geffner, H. 2008. Unifying the Causal Graph and Additive Heuristics. In Rintanen, J.; Nebel, B.; Beck, J. C.; and Hansen, E., eds., Proceedings of the Eighteenth International Conference on Automated Planning and Scheduling (ICAPS 2008), 140--147. AAAI Press

  24. [24]

    Helmert, M.; Haslum, P.; Hoffmann, J.; and Nissim, R. 2014. Merge-and-Shrink Abstraction: A Method for Generating Lower Bounds in Factored State Spaces. Journal of the ACM, 61(3): 16:1--63

  25. [25]

    Hoffmann, J.; and Nebel, B. 2001. The FF Planning System: Fast Plan Generation Through Heuristic Search. Journal of Artificial Intelligence Research, 14: 253--302

  26. [26]

    Karpas, E.; and Domshlak, C. 2009. Cost-Optimal Planning with Landmarks. In Boutilier, C., ed., Proceedings of the 21st International Joint Conference on Artificial Intelligence ( IJCAI 2009) , 1728--1733. AAAI Press

  27. [27]

    Liu, B.; Jiang, Y.; Zhang, X.; Liu, Q.; Zhang, S.; Biswas, J.; and Stone, P. 2023. LLM+P : Empowering Large Language Models with Optimal Planning Proficiency. arXiv preprint arXiv:2304.11477

  28. [28]

    Liu, F.; Tong, X.; Yuan, M.; Lin, X.; Luo, F.; Wang, Z.; Lu, Z.; and Zhang, Q. 2024. Evolution of heuristics: towards efficient automatic algorithm design using large language model. In Proceedings of the 41st International Conference on Machine Learning, ICML'24. JMLR.org

  29. [29]

    McDermott, D.; Ghallab, M.; Howe, A.; Knoblock, C.; Ram, A.; Veloso, M.; Weld, D.; and Wilkins, D. 1998. PDDL -- The Planning Domain Definition Language -- Version 1.2. Technical Report CVC TR-98-003/DCS TR-1165, Yale Center for Computational Vision and Control, Yale University

  30. [30]

    Mouret, J.-B.; and Clune, J. 2015. Illuminating search spaces by mapping elites. arXiv:1504.04909

  31. [31]

    Murray, A.; Dervovic, D.; Pozanco, A.; and Cashmore, M. 2026. GenePlan : Evolving Better Generalized PDDL Plans using Large Language Models. In Proceedings of the Thirty-Sixth International Conference on Automated Planning and Scheduling (ICAPS 2026)

  32. [32]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Novikov, A.; V \ u , N.; Eisenberger, M.; Dupont, E.; Huang, P.-S.; Wagner, A. Z.; Shirobokov, S.; Kozlovskii, B.; Ruiz, F. J. R.; Mehrabian, A.; Kumar, M. P.; See, A.; Chaudhuri, S.; Holland, G.; Davies, A.; Nowozin, S.; Kohli, P.; and Balog, M. 2025. AlphaEvolve : A Coding Agent for Scientific and Algorithmic Discovery. arXiv:2506.13131

  33. [33]

    Pommerening, F.; Helmert, M.; R\"oger , G.; and Seipp, J. 2015. From Non-Negative to General Operator Cost Partitioning. In Bonet, B.; and Koenig, S., eds., Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence ( AAAI 2015) , 3335--3341. AAAI Press

  34. [34]

    Pommerening, F.; R\"oger , G.; and Helmert, M. 2013. Getting the Most Out of Pattern Databases for Classical Planning. In Rossi, F., ed., Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI 2013), 2357--2364. AAAI Press

  35. [35]

    Pommerening, F.; R\"oger , G.; Helmert, M.; and Bonet, B. 2014. LP -based Heuristics for Cost-optimal Planning. In Chien, S.; Fern, A.; Ruml, W.; and Do, M., eds., Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling (ICAPS 2014), 226--234. AAAI Press

  36. [36]

    F.; and Rintanen, J

    Rankooh, M. F.; and Rintanen, J. 2022. Efficient Computation and Informative Estimation of h^+ by Integer and Linear Programming. In Thi \'e baux, S.; and Yeoh, W., eds., Proceedings of the Thirty-Second International Conference on Automated Planning and Scheduling (ICAPS 2022), 71--79. AAAI Press

  37. [37]

    Richter, S.; and Westphal, M. 2010. The LAMA Planner: Guiding Cost-Based Anytime Planning with Landmarks. Journal of Artificial Intelligence Research, 39: 127--177

  38. [38]

    P.; Dupont, E.; Ruiz, F

    Romera-Paredes, B.; Barekatain, M.; Novikov, A.; Balog, M.; Kumar, M. P.; Dupont, E.; Ruiz, F. J. R.; Ellenberg, J. S.; Wang, P.; Fawzi, O.; Kohli, P.; and Fawzi, A. 2024. Mathematical Discoveries from Program Search with Large Language Models. Nature, 625: 468--475

  39. [39]

    Seipp, J.; and Helmert, M. 2018. Counterexample-Guided Cartesian Abstraction Refinement for Classical Planning. Journal of Artificial Intelligence Research, 62: 535--577

  40. [40]

    Seipp, J.; Keller, T.; and Helmert, M. 2020. Saturated Cost Partitioning for Optimal Classical Planning. Journal of Artificial Intelligence Research, 67: 129--167

  41. [41]

    Seipp, J.; Pommerening, F.; and Helmert, M. 2015. New Optimization Functions for Potential Heuristics. In Brafman, R.; Domshlak, C.; Haslum, P.; and Zilberstein, S., eds., Proceedings of the Twenty-Fifth International Conference on Automated Planning and Scheduling (ICAPS 2015), 193--201. AAAI Press

  42. [42]

    Sharma, A. 2025. OpenEvolve: an open-source evolutionary coding agent

  43. [43]

    Sievers, S.; and Helmert, M. 2021. Merge-and-Shrink: A Compositional Theory of Transformations of Factored Transition Systems. Journal of Artificial Intelligence Research, 71: 781--883

  44. [44]

    St hlberg, S.; Bonet, B.; and Geffner, H. 2022. Learning General Optimal Policies with Graph Neural Networks: Expressive Power, Transparency, and Limits. In Thi \'e baux, S.; and Yeoh, W., eds., Proceedings of the Thirty-Second International Conference on Automated Planning and Scheduling (ICAPS 2022), 629--637. AAAI Press

  45. [45]

    Stein, K.; Hodel, N.; Fi s er, D.; Hoffmann, J.; Katz, M.; and Koller, A. 2026. Improved Generalized Planning with LLMs through Strategy Refinement and Reflection. In Proceedings of the Thirty-Sixth International Conference on Automated Planning and Scheduling (ICAPS 2026)

  46. [46]

    Taitler, A.; Alford, R.; Espasa, J.; Behnke, G.; Fi s er, D.; Gimelfarb, M.; Pommerening, F.; Sanner, S.; Scala, E.; Schreiber, D.; Segovia-Aguas, J.; and Seipp, J. 2024. The 2023 International Planning Competition . AI Magazine, 45(2): 280--296

  47. [47]

    Tantakoun, M.; Muise, C.; and Zhu, X. 2025. LLM s as Planning Formalizers: A Survey for Leveraging Large Language Models to Construct Automated Planning Models. In Findings of the Association for Computational Linguistics: ACL 2025, 25167--25188. Vienna, Austria: Association for Computational Linguistics. ISBN 979-8-89176-256-5

  48. [48]

    Torralba, \'A .; Seipp, J.; and Sievers, S. 2021. Automatic Instance Generation for Classical Planning. In Goldman, R. P.; Biundo, S.; and Katz, M., eds., Proceedings of the Thirty-First International Conference on Automated Planning and Scheduling (ICAPS 2021), 376--384. AAAI Press

  49. [49]

    Tuisov, A.; Vernik, Y.; and Shleyfman, A. 2025. Successor-Generator Planning with LLM -generated Heuristics. arXiv:2501.18784