pith. sign in

arxiv: 2605.16142 · v2 · pith:NCLCVWQ4new · submitted 2026-05-15 · 💻 cs.AI · cs.LG

Property-Guided LLM Program Synthesis for Planning

Pith reviewed 2026-05-20 17:43 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords property-guided synthesisLLM program synthesisplanning heuristicsdirect heuristicscounterexample feedbackPDDL planninghill-climbing
0
0 comments X

The pith

Property-guided LLM synthesis with counterexample feedback creates direct planning heuristics that solve tasks without search using seven times fewer programs than score-based methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that replacing numeric scores with checks against a formal property, plus early feedback of concrete counterexamples on violations, lets LLMs synthesize effective heuristic functions for PDDL planning domains. The property requires that every state reached by a strictly improving transition has a strictly improving successor, so hill-climbing reaches the goal directly. A reader would care because the loop generates one candidate at a time, stops evaluation on the first violation, and produces heuristics that work on virtually all out-of-distribution test tasks while cutting generations by a factor of seven, solving more problems without search, and lowering evaluation cost by orders of magnitude. The approach applies whenever a domain admits a verifiable property that can be checked on a training set.

Core claim

The authors show that a counterexample-guided repair loop, which checks whether every improving state has an improving successor over a training set and returns the first violating case, guides an LLM to synthesize direct heuristics. On ten planning domains the resulting programs are effectively direct on virtually all test tasks, generate seven times fewer programs per domain on average than the best prior method, solve more tasks without search, and require several orders of magnitude less computation to evaluate candidates.

What carries the argument

The counterexample-guided repair loop that enforces the direct-heuristic property by halting evaluation at the first violation and feeding the concrete failure case back to the LLM.

If this is right

  • Hill-climbing with the heuristics reaches goal states directly on virtually all test tasks.
  • The method works across ten planning domains using out-of-distribution test sets.
  • Average program generations per domain drop to one-seventh the number needed by prior score-based generation.
  • Candidate evaluation cost falls by several orders of magnitude due to early stopping on violations.
  • More tasks are solved without invoking any search procedure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same property-check loop could be reused in other synthesis tasks where a domain admits an efficiently checkable correctness or progress property.
  • Hybrid systems that pair LLM generation with symbolic property checkers may reduce the need for expensive search in broader automated planning.
  • Testing the approach on properties stricter than direct heuristics or on larger state spaces would show where the cost savings break down.
  • Efficient property verifiers become a new bottleneck once LLM generation itself is cheap.

Load-bearing premise

A single formally defined property plus counterexample feedback is enough to steer the LLM to high-quality direct heuristics across diverse planning domains without extra mechanisms.

What would settle it

Measure whether hill-climbing with the synthesized heuristics reaches the goal on the out-of-distribution test tasks in virtually every case without any search steps.

Figures

Figures reproduced from arXiv: 2605.16142 by Andr\'e G. Pereira, Augusto B. Corr\^ea, Jendrik Seipp.

Figure 1
Figure 1. Figure 1: Counterexample-driven repair loop for property-guided heuristic function synthesis. Given [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Example descending and direct heuristics for the same state space. Nodes are placed on [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Size of training and test tasks measured by the main domain parameter. Each line spans the [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: State expansions of direct + HC versus each baseline, on test tasks solved by both methods. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
read the original abstract

LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers no guidance on why a program failed, the system must generate and evaluate many candidates hoping some succeed, increasing LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes property-guided LLM program synthesis to generate direct heuristic functions for PDDL planning domains. The formal property requires that every state reachable by strictly improving transitions has a strictly improving successor, allowing hill-climbing search to reach the goal without backtracking. A counterexample-guided repair loop checks candidate programs against this property on a training set and feeds violations back to the LLM. The approach is evaluated on ten planning domains using an out-of-distribution test set, with claims of generating seven times fewer programs per domain on average, solving more tasks without search, and requiring orders of magnitude less computation than the best prior generation method.

Significance. If the empirical claims hold after verification details are added, the work shows that formally checkable properties plus counterexample feedback can substantially reduce LLM inference cost and improve program quality compared to numeric-score-based synthesis. This is a concrete strength for domains where such properties are definable, with potential extension beyond planning.

major comments (3)
  1. [Abstract] Abstract: the headline claim that 'the synthesized heuristics are effectively direct on virtually all test tasks' lacks any description of post-synthesis verification that the property holds on OOD test states (as opposed to only the training set used in the repair loop). This is load-bearing for both the 'solves more tasks without using search' and 'seven times fewer programs' comparisons.
  2. [Evaluation] Evaluation section: no details are given on statistical significance of the reported gains, the precise baselines compared against, or the exact number of tasks solved per domain. These omissions leave the central quantitative results only partially supported.
  3. [Method] Method: the assumption that a program satisfying the property on all checked training states will preserve the direct property on unseen test states is not justified or tested; a local maximum reachable only from OOD initial states would still require search or cause failure, directly affecting the no-search success claims.
minor comments (2)
  1. [Abstract] Abstract: define 'effectively direct' quantitatively and state how it was measured on the test set.
  2. [Introduction] Add a short related-work paragraph contrasting the counterexample mechanism with prior LLM synthesis approaches that rely solely on numeric scores.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below, indicating where revisions will be made to improve clarity and support for the claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim that 'the synthesized heuristics are effectively direct on virtually all test tasks' lacks any description of post-synthesis verification that the property holds on OOD test states (as opposed to only the training set used in the repair loop). This is load-bearing for both the 'solves more tasks without using search' and 'seven times fewer programs' comparisons.

    Authors: We agree that the abstract would benefit from greater precision on this point. The claim that the heuristics are 'effectively direct' on test tasks is grounded in the empirical results, where the synthesized programs enable solving the large majority of OOD test tasks without any search; this outcome would be impossible if the direct property failed to hold on the states actually encountered during test execution. To make the verification explicit, we will revise the abstract and add a short paragraph in the evaluation section describing the post-synthesis checks we performed on sampled OOD states from the test distribution. These checks were run during our experiments but were not reported in the original submission. revision: yes

  2. Referee: [Evaluation] Evaluation section: no details are given on statistical significance of the reported gains, the precise baselines compared against, or the exact number of tasks solved per domain. These omissions leave the central quantitative results only partially supported.

    Authors: We accept that these omissions reduce the strength of the quantitative claims. In the revised manuscript we will expand the evaluation section to include a per-domain table listing the exact number of tasks solved by each method, a clear enumeration of all baselines with citations and implementation notes, and statistical significance results (paired t-tests with p-values and 95% confidence intervals) for the key metrics of program generations and no-search success rate. The underlying data already exist and will be used to populate these additions. revision: yes

  3. Referee: [Method] Method: the assumption that a program satisfying the property on all checked training states will preserve the direct property on unseen test states is not justified or tested; a local maximum reachable only from OOD initial states would still require search or cause failure, directly affecting the no-search success claims.

    Authors: This is a substantive concern about generalization. The counterexample loop guarantees the property only on the training distribution; we did not conduct an exhaustive search for OOD-specific local maxima beyond the observed no-search success rates. Nevertheless, the fact that our method solves substantially more test tasks without search than the baselines supplies indirect evidence that such failures are infrequent in the domains studied. We will add a dedicated paragraph in the method section that states the generalization assumption explicitly, discusses its limitations, and reports an auxiliary verification experiment in which the property was checked on additional OOD states drawn from the test distribution. This is a partial revision because we strengthen the justification and add supporting evidence without introducing entirely new experimental campaigns. revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmarks

full rationale

The paper defines a formal property for direct heuristics and uses counterexample-guided LLM synthesis on a training set, then reports empirical success rates on out-of-distribution test tasks from standard PDDL benchmarks. No equations, fitted parameters, or self-citations reduce the reported improvements (fewer programs, more tasks solved without search) to quantities defined by the authors' own inputs by construction. The evaluation compares against prior generation methods on independent planning domains, making the derivation self-contained against external benchmarks. This is the expected honest non-finding for an empirical synthesis paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the chosen property guarantees direct heuristics and that counterexample feedback reliably improves LLM outputs. No free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption A heuristic satisfying the property that every state reachable by strictly improving transitions has a strictly improving successor leads hill-climbing directly to a goal state.
    This property is the load-bearing definition used to guide synthesis and to claim that the resulting heuristics are direct.

pith-pipeline@v0.9.0 · 5831 in / 1243 out tokens · 44668 ms · 2026-05-20T17:43:24.791869+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · 4 internal anchors

  1. [1]

    Counterexample guided inductive synthesis modulo theories

    Alessandro Abate, Cristina David, Pascal Kesseli, Daniel Kroening, and Elizabeth Polgreen. Counterexample guided inductive synthesis modulo theories. InInternational Conference Computer Aided Verification, 2018

  2. [2]

    Mohammad Abdulaziz, Florian Pommerening, and Augusto B. Corrêa. Mechanically proving guarantees of generalized heuristics: First results and ongoing work. InICAPS Workshop on Heuristics and Search for Domain-independent Planning, 2022

  3. [3]

    Pyperplan.https://doi.org/10.5281/zenodo.3700819, 2020

    Yusra Alkhazraji, Matthias Frorath, Markus Grützner, Malte Helmert, Thomas Liebetraut, Robert Mattmüller, Manuela Ortlieb, Jendrik Seipp, Tobias Springenberg, Philip Stahl, and Jan Wülfing. Pyperplan.https://doi.org/10.5281/zenodo.3700819, 2020

  4. [4]

    Arfaee, Sandra Zilles, and Robert C

    Shahab J. Arfaee, Sandra Zilles, and Robert C. Holte. Learning heuristic functions for large state spaces.Artificial Intelligence, 175:2075–2098, 2011

  5. [5]

    Cambridge University Press, 2017

    Franz Baader, Ian Horrocks, Carsten Lutz, and Uli Sattler, editors.An Introduction to Descrip- tion Logic. Cambridge University Press, 2017

  6. [6]

    Understanding sample generation strategies for learning heuristic functions in classical planning.Journal of Artificial Intelligence Research, 80:243–271, 2024

    Rafael V Bettker, Pedro P Minini, André G Pereira, and Marcus Ritt. Understanding sample generation strategies for learning heuristic functions in classical planning.Journal of Artificial Intelligence Research, 80:243–271, 2024

  7. [7]

    Exploring and benchmarking the planning capabilities of large language models

    Bernd Bohnet, Azade Nova, Aaron T Parisi, Kevin Swersky, Katayoon Goshvadi, Hanjun Dai, Dale Schuurmans, Noah Fiedel, and Hanie Sedghi. Exploring and benchmarking the planning capabilities of large language models. arXiv:2406.13094 [cs.CL], 2024

  8. [8]

    Planning as heuristic search.Artificial Intelligence, 129(1): 5–33, 2001

    Blai Bonet and Héctor Geffner. Planning as heuristic search.Artificial Intelligence, 129(1): 5–33, 2001

  9. [9]

    Learning features and abstract actions for computing generalized plans

    Blai Bonet, Guillem Francès, and Héctor Geffner. Learning features and abstract actions for computing generalized plans. InProc. AAAI 2019, pages 2703–2710, 2019

  10. [10]

    Corrêa, Salomé Eriksson, Patrick Ferber, Jendrik Seipp, and Silvan Sievers

    Clemens Büchner, Remo Christen, Augusto B. Corrêa, Salomé Eriksson, Patrick Ferber, Jendrik Seipp, and Silvan Sievers. Fast Downward Stone Soup 2023. InIPC-10 Planner Abstracts, 2023

  11. [11]

    Chen, Sylvie Thiébaux, and Felipe Trevizan

    Dillon Z. Chen, Sylvie Thiébaux, and Felipe Trevizan. Learning domain-independent heuristics for grounded and lifted planning. InProc. AAAI 2024, pages 20078–20086, 2024

  12. [12]

    Chen, Johannes Zenn, Tristan Cinquin, and Sheila A

    Dillon Z. Chen, Johannes Zenn, Tristan Cinquin, and Sheila A. McIlraith. Language models for generalised pddl planning: Synthesising sound and programmatic policies. arXiv:2508.18507 [cs.CL], 2025

  13. [13]

    A unified theory of heuristic evaluation functions and its application to learning

    Jens Christensen and Richard E Korf. A unified theory of heuristic evaluation functions and its application to learning. In Tom Kehler and Stanley J. Rosenschein, editors,Proc. AAAI 1986, pages 148–152, 1986

  14. [14]

    Corrêa and Giuseppe De Giacomo

    Augusto B. Corrêa and Giuseppe De Giacomo. Lifted planning: Recent advances in planning using first-order representations. InProc. IJCAI 2024, pages 8010–8019, 2024

  15. [15]

    Corrêa, Guillem Francès, Markus Hecher, Davide Mario Longo, and Jendrik Seipp

    Augusto B. Corrêa, Guillem Francès, Markus Hecher, Davide Mario Longo, and Jendrik Seipp. Scorpion Maidu: Width search in the Scorpion planning system. InIPC-10 Planner Abstracts, 2023. 10

  16. [16]

    Corrêa, André Grahl Pereira, and Jendrik Seipp

    Augusto B. Corrêa, André Grahl Pereira, and Jendrik Seipp. Classical planning with LLM- generated heuristics: Challenging the state of the art with Python code. InProc. NeurIPS 2025, 2025

  17. [17]

    Frontier Large Language Models Rival State-of-the-Art Planners

    Augusto B. Corrêa, André G. Pereira, and Jendrik Seipp. The 2025 planning performance of frontier large language models. arXiv:2511.09378 [cs.CL], 2025

  18. [18]

    Culberson

    Joseph C. Culberson. Sokoban is PSPACE-complete. Technical Report TR 97-02, Department of Computing Science, The University of Alberta, Edmonton, Alberta, Canada, 1997

  19. [19]

    Novelty vs

    Simon Dold and Malte Helmert. Novelty vs. potential heuristics: A comparison of hardness measures for satisficing planning. InProc. AAAI 2024, pages 20692–20699, 2024

  20. [20]

    Learning sketches for decomposing planning problems into subproblems of bounded width

    Dominik Drexler, Jendrik Seipp, and Hector Geffner. Learning sketches for decomposing planning problems into subproblems of bounded width. InProc. ICAPS 2022, pages 62–70, 2022

  21. [21]

    Expressing and exploiting subgoal structure in classical planning using sketches.Journal of Artificial Intelligence Research, 80: 171–208, 2024

    Dominik Drexler, Jendrik Seipp, and Hector Geffner. Expressing and exploiting subgoal structure in classical planning using sketches.Journal of Artificial Intelligence Research, 80: 171–208, 2024

  22. [22]

    Nau, and V

    Kutluhan Erol, Dana S. Nau, and V . S. Subrahmanian. Complexity, decidability and unde- cidability results for domain-independent planning.Artificial Intelligence, 76(1–2):75–88, 1995

  23. [23]

    Neural network heuristic functions for classical planning: Bootstrapping and comparison to other methods

    Patrick Ferber, Florian Geißer, Felipe Trevizan, Malte Helmert, and Jörg Hoffmann. Neural network heuristic functions for classical planning: Bootstrapping and comparison to other methods. InProc. ICAPS 2022, pages 583–587, 2022

  24. [24]

    Corrêa, Cedric Geissmann, and Florian Pommerening

    Guillem Francès, Augusto B. Corrêa, Cedric Geissmann, and Florian Pommerening. Generalized potential heuristics for classical planning. InProc. IJCAI 2019, pages 5554–5561, 2019

  25. [25]

    Gemini 3.1 Pro model card

    Gemini Team Google. Gemini 3.1 Pro model card. Google DeepMind, 2026

  26. [26]

    Morgan Kaufmann, 2004

    Malik Ghallab, Dana Nau, and Paolo Traverso.Automated Planning: Theory and Practice. Morgan Kaufmann, 2004

  27. [27]

    DecStar-2023

    Daniel Gnad, Álvaro Torralba, and Alexander Shleyfman. DecStar-2023. InIPC-10 Planner Abstracts, 2023

  28. [28]

    Leveraging pre-trained large language models to construct and utilize world models for model-based task planning

    Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. InProc. NeurIPS 2023, pages 79081–79094, 2023

  29. [29]

    Guiding GBFS through learned pairwise rankings

    Mingyu Hao, Felipe Trevizan, Sylvie Thiébaux, Patrick Ferber, and Jörg Hoffmann. Guiding GBFS through learned pairwise rankings. InProc. IJCAI 2024, pages 6724–6732, 2024

  30. [30]

    Morgan & Claypool, 2019

    Patrik Haslum, Nir Lipovetzky, Daniele Magazzeni, and Christian Muise.An Introduction to the Planning Domain Definition Language, volume 13 ofSynthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2019

  31. [31]

    The Fast Downward planning system.Journal of Artificial Intelligence Research, 26:191–246, 2006

    Malte Helmert. The Fast Downward planning system.Journal of Artificial Intelligence Research, 26:191–246, 2006

  32. [32]

    Malte Helmert, Silvan Sievers, Alexander Rovner, and Augusto B. Corrêa. On the complexity of heuristic synthesis for satisficing classical planning: Potential heuristics and beyond. InProc. ICAPS 2022, pages 124–133, 2022

  33. [33]

    The FF planning system: Fast plan generation through heuristic search.Journal of Artificial Intelligence Research, 14:253–302, 2001

    Jörg Hoffmann and Bernhard Nebel. The FF planning system: Fast plan generation through heuristic search.Journal of Artificial Intelligence Research, 14:253–302, 2001

  34. [34]

    Chasing progress, not perfection: Revisiting strategies for end-to-end LLM plan generation

    Sukai Huang, Trevor Cohn, and Nir Lipovetzky. Chasing progress, not perfection: Revisiting strategies for end-to-end LLM plan generation. arXiv:2412.10675 [cs.CL], 2024. 11

  35. [35]

    Seshia, and Ashish Tiwari

    Susmit Jha, Sumit Gulwani, Sanjit A. Seshia, and Ashish Tiwari. Oracle-guided component- based program synthesis. InACM/IEEE International Conference on Software, 2010

  36. [36]

    Competition-Level Code Generation with AlphaCode

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

  37. [37]

    Best-first width search: Exploration and exploitation in classical planning

    Nir Lipovetzky and Hector Geffner. Best-first width search: Exploration and exploitation in classical planning. InProc. AAAI 2017, pages 3590–3596, 2017

  38. [38]

    LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Pe- ter Stone. LLM+P: Empowering large language models with optimal planning proficiency. arXiv:2304.11477 [cs.CL], 2023

  39. [39]

    Evolution of heuristics: Towards efficient automatic algorithm design using large language model

    Fei Liu, Xialiang Tong, Mingxuan Yuan, Xi Lin, Fu Luo, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Evolution of heuristics: Towards efficient automatic algorithm design using large language model. InProc. ICML 2024, 2024

  40. [40]

    The 1998 AI Planning Systems competition.AI Magazine, 21(2):35–55, 2000

    Drew McDermott. The 1998 AI Planning Systems competition.AI Magazine, 21(2):35–55, 2000

  41. [41]

    GenePlan: Evolv- ing better generalized PDDL plans using large language models

    Andrew Murray, Danial Dervovic, Alberto Pozanco, and Michael Cashmore. GenePlan: Evolv- ing better generalized PDDL plans using large language models. InProc. ICAPS 2026, 2026

  42. [42]

    Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algor...

  43. [43]

    Manquinho

    Pedro Orvalho, Mikolás Janota, and Vasco M. Manquinho. Counterexample guided program repair using zero-shot learning and maxsat-based fault localization. InAAAI Conference on Artificial Intelligence, 2025

  44. [44]

    Large language models as planning domain generators

    James Oswald, Kavitha Srinivas, Harsha Kokel, Junkyu Lee, Michael Katz, and Shirin Sohrabi. Large language models as planning domain generators. InProc. ICAPS 2024, pages 423–431, 2024

  45. [45]

    Stefan O’Toole, Miquel Ramirez, Nir Lipovetzky, and Adrian R. Pearce. Sampling from pre- images to learn heuristic functions for classical planning (extended abstract). InProc. SoCS 2022, pages 308–310, 2022

  46. [46]

    Addison- Wesley, 1984

    Judea Pearl.Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison- Wesley, 1984

  47. [47]

    First results on the effect of error in heuristic search

    Ira Pohl. First results on the effect of error in heuristic search. In Bernard Meltzer and Donald Michie, editors,Machine Intelligence 5, pages 219–236. Edinburgh University Press, 1969

  48. [48]

    From non-negative to general operator cost partitioning

    Florian Pommerening, Malte Helmert, Gabriele Röger, and Jendrik Seipp. From non-negative to general operator cost partitioning. InProc. AAAI 2015, pages 3335–3341, 2015

  49. [49]

    The LAMA planner: Guiding cost-based anytime planning with landmarks.Journal of Artificial Intelligence Research, 39:127–177, 2010

    Silvia Richter and Matthias Westphal. The LAMA planner: Guiding cost-based anytime planning with landmarks.Journal of Artificial Intelligence Research, 39:127–177, 2010

  50. [50]

    Pawan Kumar, Emilien Dupont, Francisco J

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models.Nature, 625(7995):468–475, 2024

  51. [51]

    Learning general policies for planning through GPT models

    Nicholas Rossetti, Massimiliano Tummolo, Alfonso Emilio Gerevini, Luca Putelli, Ivan Serina, Mattia Chiari, and Matteo Olivato. Learning general policies for planning through GPT models. InProc. ICAPS 2024, pages 500–508, 2024. 12

  52. [52]

    Learning from multiple heuristics

    Mehdi Samadi, Ariel Felner, and Jonathan Schaeffer. Learning from multiple heuristics. In Proc. AAAI 2008, pages 357–362, 2008

  53. [53]

    Some studies in machine learning using the game of checkers.IBM Journal of research and development, 1959

    Arthur L Samuel. Some studies in machine learning using the game of checkers.IBM Journal of research and development, 1959

  54. [54]

    Correlation complex- ity of classical planning domains

    Jendrik Seipp, Florian Pommerening, Gabriele Röger, and Malte Helmert. Correlation complex- ity of classical planning domains. InProc. IJCAI 2016, pages 3242–3250, 2016

  55. [55]

    Downward Lab

    Jendrik Seipp, Florian Pommerening, Silvan Sievers, and Malte Helmert. Downward Lab. https://doi.org/10.5281/zenodo.790461, 2017

  56. [56]

    Learning domain-independent planning heuristics with hypergraph networks

    William Shen, Felipe Trevizan, and Sylvie Thiébaux. Learning domain-independent planning heuristics with hypergraph networks. InProc. ICAPS 2020, pages 574–584, 2020

  57. [57]

    Rabbah, Rastislav Bodík, and Kemal Ebcioglu

    Armando Solar-Lezama, Rodric M. Rabbah, Rastislav Bodík, and Kemal Ebcioglu. Program- ming by sketching for bit-streaming programs. In Vivek Sarkar and Mary W. Hall, editors,ACM SIGPLAN 2005 Conference on Programming anguage Design and Implementation, 2005

  58. [58]

    Learning general optimal policies with graph neural networks: Expressive power, transparency, and limits

    Simon Ståhlberg, Blai Bonet, and Hector Geffner. Learning general optimal policies with graph neural networks: Expressive power, transparency, and limits. InProc. ICAPS 2022, pages 629–637, 2022

  59. [59]

    Chain of thoughtlessness? an analysis of CoT in planning

    Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. Chain of thoughtlessness? an analysis of CoT in planning. InProc. NeurIPS 2024, pages 29106–29141, 2024

  60. [60]

    The 2023 International Planning Competition.AI Magazine, 45(2):280–296,

    Ayal Taitler, Ron Alford, Joan Espasa, Gregor Behnke, Daniel Fišer, Michael Gimelfarb, Florian Pommerening, Scott Sanner, Enrico Scala, Dominik Schreiber, Javier Segovia-Aguas, and Jendrik Seipp. The 2023 International Planning Competition.AI Magazine, 45(2):280–296,

  61. [61]

    doi: 10.1002/aaai.12169

  62. [62]

    LLMs as planning formalizers: A survey for leveraging large language models to construct automated planning models

    Marcus Tantakoun, Christian Muise, and Xiaodan Zhu. LLMs as planning formalizers: A survey for leveraging large language models to construct automated planning models. InFindings of the Association for Computational Linguistics: ACL 2025, 2025

  63. [63]

    Successor-generator planning with LLM-generated heuristics

    Alexander Tuisov, Yonatan Vernik, and Alexander Shleyfman. Successor-generator planning with LLM-generated heuristics. InProc. ICAPS 2026, 2026

  64. [64]

    PlanBench: An extensible benchmark for evaluating large language models on planning and reasoning about change

    Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. PlanBench: An extensible benchmark for evaluating large language models on planning and reasoning about change. InProc. NeurIPS 2023, pages 38975–38987, 2023

  65. [65]

    On the planning abilities of large language models - A critical investigation

    Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models - A critical investigation. InProc. NeurIPS 2023, pages 75993–76005, 2023. 13 A Benchmark Domains We evaluate on the ten domains of the Learning Track of the IPC 2023 [ 60]. We give a brief description of each domain below, b...

  66. [66]

    Alive State : r e a c h a b l e from the initial state , solvable , and not a goal state

  67. [67]

    I m p r o v i n g S u c c e s s o r : a s u c c e s s o r s ’ such that h (s ’) < h ( s )

  68. [68]

    All states that DFS can reach must contain an i m p r o v i n g s u c c e s s o r

    Direct : every alive state r e a c h a b l e from the initial state via a se qu en ce of s tr ict ly d e c r e a s i n g h values has at least one i m p r o v i n g s u c c e s s o r . All states that DFS can reach must contain an i m p r o v i n g s u c c e s s o r

  69. [69]

    Dead - End Av oi di ng : no i m p r o v i n g s u c c e s s o r of an alive state r e a c h a b l e by DFS is itself u n s o l v a b l e ( h = i nfi ni ty ) . </ success - criteria > < previous - iterations > I t e r a t i o n 0: - Prompt Type : Init - Solved 5 i n s t a n c e s ( Average ex pan de d nodes : 2.4) - Direct v a l i d a t e d on 5 tasks ( la...

  70. [70]

    action =( stack b3 b2 ) , h =6 , added =[ ’( clear b3 ) ’, ’( on b3 b2 ) ’, ’( arm - empty ) ’] , deleted =[ ’( holding b3 ) ’, ’( clear b2 ) ’]

  71. [71]

    action =( stack b3 b1 ) , h =6 , added =[ ’( clear b3 ) ’, ’( on b3 b1 ) ’, ’( arm - empty ) ’] , deleted =[ ’( holding b3 ) ’, ’( clear b1 ) ’]

  72. [72]

    </ previous - heuristic - code > I t e r a t i o n 1: - Prompt Type : Pr ope rt y - Main Idea :

    action =( putdown b3 ) , h =4 , added =[ ’( clear b3 ) ’, ’( on - table b3 ) ’, ’( arm - empty ) ’] , deleted =[ ’( holding b3 ) ’] < previous - heuristic - code > ... </ previous - heuristic - code > I t e r a t i o n 1: - Prompt Type : Pr ope rt y - Main Idea : ... - Solved 16 i n s t a n c e s ( Average ex pa nd ed nodes : 7.0) - Direct v a l i d a t e...

  73. [73]

    action =( unstack b3 b2 ) , h =11 , added =[ ’( holding b3 ) ’, ’( clear b2 ) ’] , deleted =[ ’( clear b3 ) ’, ’( on b3 b2 ) ’, ’( arm - empty ) ’]

  74. [74]

    action =( pickup b4 ) , h =11 , added =[ ’( holding b4 ) ’] , deleted =[ ’( on - table b4 ) ’, ’( clear b4 ) ’, ’( arm - empty ) ’]

  75. [75]

    </ previous - heuristic - code > I t e r a t i o n 2: - Prompt Type : Pr ope rt y - Main Idea :

    action =( pickup b1 ) , h =11 , added =[ ’( holding b1 ) ’] , deleted =[ ’( clear b1 ) ’, ’( on - table b1 ) ’, ’( arm - empty ) ’] < previous - heuristic - code > ... </ previous - heuristic - code > I t e r a t i o n 2: - Prompt Type : Pr ope rt y - Main Idea : ... - Solved 17 i n s t a n c e s ( Average ex pa nd ed nodes : 7.8) - Direct v a l i d a t e...

  76. [76]

    action =( unstack b5 b2 ) , h =9 , added =[ ’( holding b5 ) ’, ’( clear b2 ) ’] , deleted =[ ’( on b5 b2 ) ’, ’( clear b5 ) ’, ’( arm - empty ) ’]

  77. [77]

    action =( pickup b1 ) , h =9 , added =[ ’( holding b1 ) ’] , deleted =[ ’( clear b1 ) ’, ’( on - table b1 ) ’, ’( arm - empty ) ’]

  78. [78]

    </ previous - heuristic - code > </ previous - iterations > < current - status > Domain P ro gre ss : solving 99 i n s t a n c e s of i n c r e a s i n g d i f f i c u l t y

    action =( pickup b3 ) , h =9 , added =[ ’( holding b3 ) ’] , deleted =[ ’( clear b3 ) ’, ’( on - table b3 ) ’, ’( arm - empty ) ’] < previous - heuristic - code > ... </ previous - heuristic - code > </ previous - iterations > < current - status > Domain P ro gre ss : solving 99 i n s t a n c e s of i n c r e a s i n g d i f f i c u l t y . V a l i d a t ...

  79. [79]

    action =( unstack b15 b12 ) , h =39 , added =[ ’( holding b15 ) ’, ’( clear b12 ) ’] , deleted =[ ’( clear b15 ) ’, ’( on b15 b12 ) ’, ’( arm - empty ) ’]

  80. [80]

    action =( unstack b9 b16 ) , h =39 , added =[ ’( holding b9 ) ’, ’( clear b16 ) ’] , deleted =[ ’( on b9 b16 ) ’, ’( clear b9 ) ’, ’( arm - empty ) ’]

Showing first 80 references.