Property-Guided LLM Program Synthesis for Planning

Andr\'e G. Pereira; Augusto B. Corr\^ea; Jendrik Seipp

arxiv: 2605.16142 · v2 · pith:NCLCVWQ4new · submitted 2026-05-15 · 💻 cs.AI · cs.LG

Property-Guided LLM Program Synthesis for Planning

Andr\'e G. Pereira , Augusto B. Corr\^ea , Jendrik Seipp This is my paper

Pith reviewed 2026-05-20 17:43 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords property-guided synthesisLLM program synthesisplanning heuristicsdirect heuristicscounterexample feedbackPDDL planninghill-climbing

0 comments

The pith

Property-guided LLM synthesis with counterexample feedback creates direct planning heuristics that solve tasks without search using seven times fewer programs than score-based methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that replacing numeric scores with checks against a formal property, plus early feedback of concrete counterexamples on violations, lets LLMs synthesize effective heuristic functions for PDDL planning domains. The property requires that every state reached by a strictly improving transition has a strictly improving successor, so hill-climbing reaches the goal directly. A reader would care because the loop generates one candidate at a time, stops evaluation on the first violation, and produces heuristics that work on virtually all out-of-distribution test tasks while cutting generations by a factor of seven, solving more problems without search, and lowering evaluation cost by orders of magnitude. The approach applies whenever a domain admits a verifiable property that can be checked on a training set.

Core claim

The authors show that a counterexample-guided repair loop, which checks whether every improving state has an improving successor over a training set and returns the first violating case, guides an LLM to synthesize direct heuristics. On ten planning domains the resulting programs are effectively direct on virtually all test tasks, generate seven times fewer programs per domain on average than the best prior method, solve more tasks without search, and require several orders of magnitude less computation to evaluate candidates.

What carries the argument

The counterexample-guided repair loop that enforces the direct-heuristic property by halting evaluation at the first violation and feeding the concrete failure case back to the LLM.

If this is right

Hill-climbing with the heuristics reaches goal states directly on virtually all test tasks.
The method works across ten planning domains using out-of-distribution test sets.
Average program generations per domain drop to one-seventh the number needed by prior score-based generation.
Candidate evaluation cost falls by several orders of magnitude due to early stopping on violations.
More tasks are solved without invoking any search procedure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same property-check loop could be reused in other synthesis tasks where a domain admits an efficiently checkable correctness or progress property.
Hybrid systems that pair LLM generation with symbolic property checkers may reduce the need for expensive search in broader automated planning.
Testing the approach on properties stricter than direct heuristics or on larger state spaces would show where the cost savings break down.
Efficient property verifiers become a new bottleneck once LLM generation itself is cheap.

Load-bearing premise

A single formally defined property plus counterexample feedback is enough to steer the LLM to high-quality direct heuristics across diverse planning domains without extra mechanisms.

What would settle it

Measure whether hill-climbing with the synthesized heuristics reaches the goal on the out-of-distribution test tasks in virtually every case without any search steps.

Figures

Figures reproduced from arXiv: 2605.16142 by Andr\'e G. Pereira, Augusto B. Corr\^ea, Jendrik Seipp.

**Figure 2.** Figure 2: Example descending and direct heuristics for the same state space. Nodes are placed on [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Size of training and test tasks measured by the main domain parameter. Each line spans the [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: State expansions of direct + HC versus each baseline, on test tasks solved by both methods. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

read the original abstract

LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers no guidance on why a program failed, the system must generate and evaluate many candidates hoping some succeed, increasing LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes property-guided LLM program synthesis to generate direct heuristic functions for PDDL planning domains. The formal property requires that every state reachable by strictly improving transitions has a strictly improving successor, allowing hill-climbing search to reach the goal without backtracking. A counterexample-guided repair loop checks candidate programs against this property on a training set and feeds violations back to the LLM. The approach is evaluated on ten planning domains using an out-of-distribution test set, with claims of generating seven times fewer programs per domain on average, solving more tasks without search, and requiring orders of magnitude less computation than the best prior generation method.

Significance. If the empirical claims hold after verification details are added, the work shows that formally checkable properties plus counterexample feedback can substantially reduce LLM inference cost and improve program quality compared to numeric-score-based synthesis. This is a concrete strength for domains where such properties are definable, with potential extension beyond planning.

major comments (3)

[Abstract] Abstract: the headline claim that 'the synthesized heuristics are effectively direct on virtually all test tasks' lacks any description of post-synthesis verification that the property holds on OOD test states (as opposed to only the training set used in the repair loop). This is load-bearing for both the 'solves more tasks without using search' and 'seven times fewer programs' comparisons.
[Evaluation] Evaluation section: no details are given on statistical significance of the reported gains, the precise baselines compared against, or the exact number of tasks solved per domain. These omissions leave the central quantitative results only partially supported.
[Method] Method: the assumption that a program satisfying the property on all checked training states will preserve the direct property on unseen test states is not justified or tested; a local maximum reachable only from OOD initial states would still require search or cause failure, directly affecting the no-search success claims.

minor comments (2)

[Abstract] Abstract: define 'effectively direct' quantitatively and state how it was measured on the test set.
[Introduction] Add a short related-work paragraph contrasting the counterexample mechanism with prior LLM synthesis approaches that rely solely on numeric scores.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below, indicating where revisions will be made to improve clarity and support for the claims.

read point-by-point responses

Referee: [Abstract] Abstract: the headline claim that 'the synthesized heuristics are effectively direct on virtually all test tasks' lacks any description of post-synthesis verification that the property holds on OOD test states (as opposed to only the training set used in the repair loop). This is load-bearing for both the 'solves more tasks without using search' and 'seven times fewer programs' comparisons.

Authors: We agree that the abstract would benefit from greater precision on this point. The claim that the heuristics are 'effectively direct' on test tasks is grounded in the empirical results, where the synthesized programs enable solving the large majority of OOD test tasks without any search; this outcome would be impossible if the direct property failed to hold on the states actually encountered during test execution. To make the verification explicit, we will revise the abstract and add a short paragraph in the evaluation section describing the post-synthesis checks we performed on sampled OOD states from the test distribution. These checks were run during our experiments but were not reported in the original submission. revision: yes
Referee: [Evaluation] Evaluation section: no details are given on statistical significance of the reported gains, the precise baselines compared against, or the exact number of tasks solved per domain. These omissions leave the central quantitative results only partially supported.

Authors: We accept that these omissions reduce the strength of the quantitative claims. In the revised manuscript we will expand the evaluation section to include a per-domain table listing the exact number of tasks solved by each method, a clear enumeration of all baselines with citations and implementation notes, and statistical significance results (paired t-tests with p-values and 95% confidence intervals) for the key metrics of program generations and no-search success rate. The underlying data already exist and will be used to populate these additions. revision: yes
Referee: [Method] Method: the assumption that a program satisfying the property on all checked training states will preserve the direct property on unseen test states is not justified or tested; a local maximum reachable only from OOD initial states would still require search or cause failure, directly affecting the no-search success claims.

Authors: This is a substantive concern about generalization. The counterexample loop guarantees the property only on the training distribution; we did not conduct an exhaustive search for OOD-specific local maxima beyond the observed no-search success rates. Nevertheless, the fact that our method solves substantially more test tasks without search than the baselines supplies indirect evidence that such failures are infrequent in the domains studied. We will add a dedicated paragraph in the method section that states the generalization assumption explicitly, discusses its limitations, and reports an auxiliary verification experiment in which the property was checked on additional OOD states drawn from the test distribution. This is a partial revision because we strengthen the justification and add supporting evidence without introducing entirely new experimental campaigns. revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on external benchmarks

full rationale

The paper defines a formal property for direct heuristics and uses counterexample-guided LLM synthesis on a training set, then reports empirical success rates on out-of-distribution test tasks from standard PDDL benchmarks. No equations, fitted parameters, or self-citations reduce the reported improvements (fewer programs, more tasks solved without search) to quantities defined by the authors' own inputs by construction. The evaluation compares against prior generation methods on independent planning domains, making the derivation self-contained against external benchmarks. This is the expected honest non-finding for an empirical synthesis paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the chosen property guarantees direct heuristics and that counterexample feedback reliably improves LLM outputs. No free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption A heuristic satisfying the property that every state reachable by strictly improving transitions has a strictly improving successor leads hill-climbing directly to a goal state.
This property is the load-bearing definition used to guide synthesis and to claim that the resulting heuristics are direct.

pith-pipeline@v0.9.0 · 5831 in / 1243 out tokens · 44668 ms · 2026-05-20T17:43:24.791869+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We target the direct property: every state reachable from s0 by strictly improving transitions has a strictly improving successor.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

counterexample-guided repair loop generates one candidate program, checks the property over a training set

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · 4 internal anchors

[1]

Counterexample guided inductive synthesis modulo theories

Alessandro Abate, Cristina David, Pascal Kesseli, Daniel Kroening, and Elizabeth Polgreen. Counterexample guided inductive synthesis modulo theories. InInternational Conference Computer Aided Verification, 2018

work page 2018
[2]

Mohammad Abdulaziz, Florian Pommerening, and Augusto B. Corrêa. Mechanically proving guarantees of generalized heuristics: First results and ongoing work. InICAPS Workshop on Heuristics and Search for Domain-independent Planning, 2022

work page 2022
[3]

Pyperplan.https://doi.org/10.5281/zenodo.3700819, 2020

Yusra Alkhazraji, Matthias Frorath, Markus Grützner, Malte Helmert, Thomas Liebetraut, Robert Mattmüller, Manuela Ortlieb, Jendrik Seipp, Tobias Springenberg, Philip Stahl, and Jan Wülfing. Pyperplan.https://doi.org/10.5281/zenodo.3700819, 2020

work page doi:10.5281/zenodo.3700819 2020
[4]

Arfaee, Sandra Zilles, and Robert C

Shahab J. Arfaee, Sandra Zilles, and Robert C. Holte. Learning heuristic functions for large state spaces.Artificial Intelligence, 175:2075–2098, 2011

work page 2075
[5]

Cambridge University Press, 2017

Franz Baader, Ian Horrocks, Carsten Lutz, and Uli Sattler, editors.An Introduction to Descrip- tion Logic. Cambridge University Press, 2017

work page 2017
[6]

Understanding sample generation strategies for learning heuristic functions in classical planning.Journal of Artificial Intelligence Research, 80:243–271, 2024

Rafael V Bettker, Pedro P Minini, André G Pereira, and Marcus Ritt. Understanding sample generation strategies for learning heuristic functions in classical planning.Journal of Artificial Intelligence Research, 80:243–271, 2024

work page 2024
[7]

Exploring and benchmarking the planning capabilities of large language models

Bernd Bohnet, Azade Nova, Aaron T Parisi, Kevin Swersky, Katayoon Goshvadi, Hanjun Dai, Dale Schuurmans, Noah Fiedel, and Hanie Sedghi. Exploring and benchmarking the planning capabilities of large language models. arXiv:2406.13094 [cs.CL], 2024

work page arXiv 2024
[8]

Planning as heuristic search.Artificial Intelligence, 129(1): 5–33, 2001

Blai Bonet and Héctor Geffner. Planning as heuristic search.Artificial Intelligence, 129(1): 5–33, 2001

work page 2001
[9]

Learning features and abstract actions for computing generalized plans

Blai Bonet, Guillem Francès, and Héctor Geffner. Learning features and abstract actions for computing generalized plans. InProc. AAAI 2019, pages 2703–2710, 2019

work page 2019
[10]

Corrêa, Salomé Eriksson, Patrick Ferber, Jendrik Seipp, and Silvan Sievers

Clemens Büchner, Remo Christen, Augusto B. Corrêa, Salomé Eriksson, Patrick Ferber, Jendrik Seipp, and Silvan Sievers. Fast Downward Stone Soup 2023. InIPC-10 Planner Abstracts, 2023

work page 2023
[11]

Chen, Sylvie Thiébaux, and Felipe Trevizan

Dillon Z. Chen, Sylvie Thiébaux, and Felipe Trevizan. Learning domain-independent heuristics for grounded and lifted planning. InProc. AAAI 2024, pages 20078–20086, 2024

work page 2024
[12]

Chen, Johannes Zenn, Tristan Cinquin, and Sheila A

Dillon Z. Chen, Johannes Zenn, Tristan Cinquin, and Sheila A. McIlraith. Language models for generalised pddl planning: Synthesising sound and programmatic policies. arXiv:2508.18507 [cs.CL], 2025

work page arXiv 2025
[13]

A unified theory of heuristic evaluation functions and its application to learning

Jens Christensen and Richard E Korf. A unified theory of heuristic evaluation functions and its application to learning. In Tom Kehler and Stanley J. Rosenschein, editors,Proc. AAAI 1986, pages 148–152, 1986

work page 1986
[14]

Corrêa and Giuseppe De Giacomo

Augusto B. Corrêa and Giuseppe De Giacomo. Lifted planning: Recent advances in planning using first-order representations. InProc. IJCAI 2024, pages 8010–8019, 2024

work page 2024
[15]

Corrêa, Guillem Francès, Markus Hecher, Davide Mario Longo, and Jendrik Seipp

Augusto B. Corrêa, Guillem Francès, Markus Hecher, Davide Mario Longo, and Jendrik Seipp. Scorpion Maidu: Width search in the Scorpion planning system. InIPC-10 Planner Abstracts, 2023. 10

work page 2023
[16]

Corrêa, André Grahl Pereira, and Jendrik Seipp

Augusto B. Corrêa, André Grahl Pereira, and Jendrik Seipp. Classical planning with LLM- generated heuristics: Challenging the state of the art with Python code. InProc. NeurIPS 2025, 2025

work page 2025
[17]

Frontier Large Language Models Rival State-of-the-Art Planners

Augusto B. Corrêa, André G. Pereira, and Jendrik Seipp. The 2025 planning performance of frontier large language models. arXiv:2511.09378 [cs.CL], 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Culberson

Joseph C. Culberson. Sokoban is PSPACE-complete. Technical Report TR 97-02, Department of Computing Science, The University of Alberta, Edmonton, Alberta, Canada, 1997

work page 1997
[19]

Novelty vs

Simon Dold and Malte Helmert. Novelty vs. potential heuristics: A comparison of hardness measures for satisficing planning. InProc. AAAI 2024, pages 20692–20699, 2024

work page 2024
[20]

Learning sketches for decomposing planning problems into subproblems of bounded width

Dominik Drexler, Jendrik Seipp, and Hector Geffner. Learning sketches for decomposing planning problems into subproblems of bounded width. InProc. ICAPS 2022, pages 62–70, 2022

work page 2022
[21]

Expressing and exploiting subgoal structure in classical planning using sketches.Journal of Artificial Intelligence Research, 80: 171–208, 2024

Dominik Drexler, Jendrik Seipp, and Hector Geffner. Expressing and exploiting subgoal structure in classical planning using sketches.Journal of Artificial Intelligence Research, 80: 171–208, 2024

work page 2024
[22]

Nau, and V

Kutluhan Erol, Dana S. Nau, and V . S. Subrahmanian. Complexity, decidability and unde- cidability results for domain-independent planning.Artificial Intelligence, 76(1–2):75–88, 1995

work page 1995
[23]

Neural network heuristic functions for classical planning: Bootstrapping and comparison to other methods

Patrick Ferber, Florian Geißer, Felipe Trevizan, Malte Helmert, and Jörg Hoffmann. Neural network heuristic functions for classical planning: Bootstrapping and comparison to other methods. InProc. ICAPS 2022, pages 583–587, 2022

work page 2022
[24]

Corrêa, Cedric Geissmann, and Florian Pommerening

Guillem Francès, Augusto B. Corrêa, Cedric Geissmann, and Florian Pommerening. Generalized potential heuristics for classical planning. InProc. IJCAI 2019, pages 5554–5561, 2019

work page 2019
[25]

Gemini 3.1 Pro model card

Gemini Team Google. Gemini 3.1 Pro model card. Google DeepMind, 2026

work page 2026
[26]

Morgan Kaufmann, 2004

Malik Ghallab, Dana Nau, and Paolo Traverso.Automated Planning: Theory and Practice. Morgan Kaufmann, 2004

work page 2004
[27]

DecStar-2023

Daniel Gnad, Álvaro Torralba, and Alexander Shleyfman. DecStar-2023. InIPC-10 Planner Abstracts, 2023

work page 2023
[28]

Leveraging pre-trained large language models to construct and utilize world models for model-based task planning

Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. InProc. NeurIPS 2023, pages 79081–79094, 2023

work page 2023
[29]

Guiding GBFS through learned pairwise rankings

Mingyu Hao, Felipe Trevizan, Sylvie Thiébaux, Patrick Ferber, and Jörg Hoffmann. Guiding GBFS through learned pairwise rankings. InProc. IJCAI 2024, pages 6724–6732, 2024

work page 2024
[30]

Morgan & Claypool, 2019

Patrik Haslum, Nir Lipovetzky, Daniele Magazzeni, and Christian Muise.An Introduction to the Planning Domain Definition Language, volume 13 ofSynthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2019

work page 2019
[31]

The Fast Downward planning system.Journal of Artificial Intelligence Research, 26:191–246, 2006

Malte Helmert. The Fast Downward planning system.Journal of Artificial Intelligence Research, 26:191–246, 2006

work page 2006
[32]

Malte Helmert, Silvan Sievers, Alexander Rovner, and Augusto B. Corrêa. On the complexity of heuristic synthesis for satisficing classical planning: Potential heuristics and beyond. InProc. ICAPS 2022, pages 124–133, 2022

work page 2022
[33]

The FF planning system: Fast plan generation through heuristic search.Journal of Artificial Intelligence Research, 14:253–302, 2001

Jörg Hoffmann and Bernhard Nebel. The FF planning system: Fast plan generation through heuristic search.Journal of Artificial Intelligence Research, 14:253–302, 2001

work page 2001
[34]

Chasing progress, not perfection: Revisiting strategies for end-to-end LLM plan generation

Sukai Huang, Trevor Cohn, and Nir Lipovetzky. Chasing progress, not perfection: Revisiting strategies for end-to-end LLM plan generation. arXiv:2412.10675 [cs.CL], 2024. 11

work page arXiv 2024
[35]

Seshia, and Ashish Tiwari

Susmit Jha, Sumit Gulwani, Sanjit A. Seshia, and Ashish Tiwari. Oracle-guided component- based program synthesis. InACM/IEEE International Conference on Software, 2010

work page 2010
[36]

Competition-Level Code Generation with AlphaCode

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

Best-first width search: Exploration and exploitation in classical planning

Nir Lipovetzky and Hector Geffner. Best-first width search: Exploration and exploitation in classical planning. InProc. AAAI 2017, pages 3590–3596, 2017

work page 2017
[38]

LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Pe- ter Stone. LLM+P: Empowering large language models with optimal planning proficiency. arXiv:2304.11477 [cs.CL], 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Evolution of heuristics: Towards efficient automatic algorithm design using large language model

Fei Liu, Xialiang Tong, Mingxuan Yuan, Xi Lin, Fu Luo, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Evolution of heuristics: Towards efficient automatic algorithm design using large language model. InProc. ICML 2024, 2024

work page 2024
[40]

The 1998 AI Planning Systems competition.AI Magazine, 21(2):35–55, 2000

Drew McDermott. The 1998 AI Planning Systems competition.AI Magazine, 21(2):35–55, 2000

work page 1998
[41]

GenePlan: Evolv- ing better generalized PDDL plans using large language models

Andrew Murray, Danial Dervovic, Alberto Pozanco, and Michael Cashmore. GenePlan: Evolv- ing better generalized PDDL plans using large language models. InProc. ICAPS 2026, 2026

work page 2026
[42]

Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algor...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Manquinho

Pedro Orvalho, Mikolás Janota, and Vasco M. Manquinho. Counterexample guided program repair using zero-shot learning and maxsat-based fault localization. InAAAI Conference on Artificial Intelligence, 2025

work page 2025
[44]

Large language models as planning domain generators

James Oswald, Kavitha Srinivas, Harsha Kokel, Junkyu Lee, Michael Katz, and Shirin Sohrabi. Large language models as planning domain generators. InProc. ICAPS 2024, pages 423–431, 2024

work page 2024
[45]

Stefan O’Toole, Miquel Ramirez, Nir Lipovetzky, and Adrian R. Pearce. Sampling from pre- images to learn heuristic functions for classical planning (extended abstract). InProc. SoCS 2022, pages 308–310, 2022

work page 2022
[46]

Addison- Wesley, 1984

Judea Pearl.Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison- Wesley, 1984

work page 1984
[47]

First results on the effect of error in heuristic search

Ira Pohl. First results on the effect of error in heuristic search. In Bernard Meltzer and Donald Michie, editors,Machine Intelligence 5, pages 219–236. Edinburgh University Press, 1969

work page 1969
[48]

From non-negative to general operator cost partitioning

Florian Pommerening, Malte Helmert, Gabriele Röger, and Jendrik Seipp. From non-negative to general operator cost partitioning. InProc. AAAI 2015, pages 3335–3341, 2015

work page 2015
[49]

The LAMA planner: Guiding cost-based anytime planning with landmarks.Journal of Artificial Intelligence Research, 39:127–177, 2010

Silvia Richter and Matthias Westphal. The LAMA planner: Guiding cost-based anytime planning with landmarks.Journal of Artificial Intelligence Research, 39:127–177, 2010

work page 2010
[50]

Pawan Kumar, Emilien Dupont, Francisco J

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models.Nature, 625(7995):468–475, 2024

work page 2024
[51]

Learning general policies for planning through GPT models

Nicholas Rossetti, Massimiliano Tummolo, Alfonso Emilio Gerevini, Luca Putelli, Ivan Serina, Mattia Chiari, and Matteo Olivato. Learning general policies for planning through GPT models. InProc. ICAPS 2024, pages 500–508, 2024. 12

work page 2024
[52]

Learning from multiple heuristics

Mehdi Samadi, Ariel Felner, and Jonathan Schaeffer. Learning from multiple heuristics. In Proc. AAAI 2008, pages 357–362, 2008

work page 2008
[53]

Some studies in machine learning using the game of checkers.IBM Journal of research and development, 1959

Arthur L Samuel. Some studies in machine learning using the game of checkers.IBM Journal of research and development, 1959

work page 1959
[54]

Correlation complex- ity of classical planning domains

Jendrik Seipp, Florian Pommerening, Gabriele Röger, and Malte Helmert. Correlation complex- ity of classical planning domains. InProc. IJCAI 2016, pages 3242–3250, 2016

work page 2016
[55]

Downward Lab

Jendrik Seipp, Florian Pommerening, Silvan Sievers, and Malte Helmert. Downward Lab. https://doi.org/10.5281/zenodo.790461, 2017

work page doi:10.5281/zenodo.790461 2017
[56]

Learning domain-independent planning heuristics with hypergraph networks

William Shen, Felipe Trevizan, and Sylvie Thiébaux. Learning domain-independent planning heuristics with hypergraph networks. InProc. ICAPS 2020, pages 574–584, 2020

work page 2020
[57]

Rabbah, Rastislav Bodík, and Kemal Ebcioglu

Armando Solar-Lezama, Rodric M. Rabbah, Rastislav Bodík, and Kemal Ebcioglu. Program- ming by sketching for bit-streaming programs. In Vivek Sarkar and Mary W. Hall, editors,ACM SIGPLAN 2005 Conference on Programming anguage Design and Implementation, 2005

work page 2005
[58]

Learning general optimal policies with graph neural networks: Expressive power, transparency, and limits

Simon Ståhlberg, Blai Bonet, and Hector Geffner. Learning general optimal policies with graph neural networks: Expressive power, transparency, and limits. InProc. ICAPS 2022, pages 629–637, 2022

work page 2022
[59]

Chain of thoughtlessness? an analysis of CoT in planning

Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. Chain of thoughtlessness? an analysis of CoT in planning. InProc. NeurIPS 2024, pages 29106–29141, 2024

work page 2024
[60]

The 2023 International Planning Competition.AI Magazine, 45(2):280–296,

Ayal Taitler, Ron Alford, Joan Espasa, Gregor Behnke, Daniel Fišer, Michael Gimelfarb, Florian Pommerening, Scott Sanner, Enrico Scala, Dominik Schreiber, Javier Segovia-Aguas, and Jendrik Seipp. The 2023 International Planning Competition.AI Magazine, 45(2):280–296,

work page 2023
[61]

doi: 10.1002/aaai.12169

work page doi:10.1002/aaai.12169
[62]

LLMs as planning formalizers: A survey for leveraging large language models to construct automated planning models

Marcus Tantakoun, Christian Muise, and Xiaodan Zhu. LLMs as planning formalizers: A survey for leveraging large language models to construct automated planning models. InFindings of the Association for Computational Linguistics: ACL 2025, 2025

work page 2025
[63]

Successor-generator planning with LLM-generated heuristics

Alexander Tuisov, Yonatan Vernik, and Alexander Shleyfman. Successor-generator planning with LLM-generated heuristics. InProc. ICAPS 2026, 2026

work page 2026
[64]

PlanBench: An extensible benchmark for evaluating large language models on planning and reasoning about change

Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. PlanBench: An extensible benchmark for evaluating large language models on planning and reasoning about change. InProc. NeurIPS 2023, pages 38975–38987, 2023

work page 2023
[65]

On the planning abilities of large language models - A critical investigation

Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models - A critical investigation. InProc. NeurIPS 2023, pages 75993–76005, 2023. 13 A Benchmark Domains We evaluate on the ten domains of the Learning Track of the IPC 2023 [ 60]. We give a brief description of each domain below, b...

work page 2023
[66]

Alive State : r e a c h a b l e from the initial state , solvable , and not a goal state

work page
[67]

I m p r o v i n g S u c c e s s o r : a s u c c e s s o r s ’ such that h (s ’) < h ( s )

work page
[68]

All states that DFS can reach must contain an i m p r o v i n g s u c c e s s o r

Direct : every alive state r e a c h a b l e from the initial state via a se qu en ce of s tr ict ly d e c r e a s i n g h values has at least one i m p r o v i n g s u c c e s s o r . All states that DFS can reach must contain an i m p r o v i n g s u c c e s s o r

work page
[69]

Dead - End Av oi di ng : no i m p r o v i n g s u c c e s s o r of an alive state r e a c h a b l e by DFS is itself u n s o l v a b l e ( h = i nfi ni ty ) . </ success - criteria > < previous - iterations > I t e r a t i o n 0: - Prompt Type : Init - Solved 5 i n s t a n c e s ( Average ex pan de d nodes : 2.4) - Direct v a l i d a t e d on 5 tasks ( la...

work page
[70]

action =( stack b3 b2 ) , h =6 , added =[ ’( clear b3 ) ’, ’( on b3 b2 ) ’, ’( arm - empty ) ’] , deleted =[ ’( holding b3 ) ’, ’( clear b2 ) ’]

work page
[71]

action =( stack b3 b1 ) , h =6 , added =[ ’( clear b3 ) ’, ’( on b3 b1 ) ’, ’( arm - empty ) ’] , deleted =[ ’( holding b3 ) ’, ’( clear b1 ) ’]

work page
[72]

</ previous - heuristic - code > I t e r a t i o n 1: - Prompt Type : Pr ope rt y - Main Idea :

action =( putdown b3 ) , h =4 , added =[ ’( clear b3 ) ’, ’( on - table b3 ) ’, ’( arm - empty ) ’] , deleted =[ ’( holding b3 ) ’] < previous - heuristic - code > ... </ previous - heuristic - code > I t e r a t i o n 1: - Prompt Type : Pr ope rt y - Main Idea : ... - Solved 16 i n s t a n c e s ( Average ex pa nd ed nodes : 7.0) - Direct v a l i d a t e...

work page
[73]

action =( unstack b3 b2 ) , h =11 , added =[ ’( holding b3 ) ’, ’( clear b2 ) ’] , deleted =[ ’( clear b3 ) ’, ’( on b3 b2 ) ’, ’( arm - empty ) ’]

work page
[74]

action =( pickup b4 ) , h =11 , added =[ ’( holding b4 ) ’] , deleted =[ ’( on - table b4 ) ’, ’( clear b4 ) ’, ’( arm - empty ) ’]

work page
[75]

</ previous - heuristic - code > I t e r a t i o n 2: - Prompt Type : Pr ope rt y - Main Idea :

action =( pickup b1 ) , h =11 , added =[ ’( holding b1 ) ’] , deleted =[ ’( clear b1 ) ’, ’( on - table b1 ) ’, ’( arm - empty ) ’] < previous - heuristic - code > ... </ previous - heuristic - code > I t e r a t i o n 2: - Prompt Type : Pr ope rt y - Main Idea : ... - Solved 17 i n s t a n c e s ( Average ex pa nd ed nodes : 7.8) - Direct v a l i d a t e...

work page
[76]

action =( unstack b5 b2 ) , h =9 , added =[ ’( holding b5 ) ’, ’( clear b2 ) ’] , deleted =[ ’( on b5 b2 ) ’, ’( clear b5 ) ’, ’( arm - empty ) ’]

work page
[77]

action =( pickup b1 ) , h =9 , added =[ ’( holding b1 ) ’] , deleted =[ ’( clear b1 ) ’, ’( on - table b1 ) ’, ’( arm - empty ) ’]

work page
[78]

</ previous - heuristic - code > </ previous - iterations > < current - status > Domain P ro gre ss : solving 99 i n s t a n c e s of i n c r e a s i n g d i f f i c u l t y

action =( pickup b3 ) , h =9 , added =[ ’( holding b3 ) ’] , deleted =[ ’( clear b3 ) ’, ’( on - table b3 ) ’, ’( arm - empty ) ’] < previous - heuristic - code > ... </ previous - heuristic - code > </ previous - iterations > < current - status > Domain P ro gre ss : solving 99 i n s t a n c e s of i n c r e a s i n g d i f f i c u l t y . V a l i d a t ...

work page
[79]

action =( unstack b15 b12 ) , h =39 , added =[ ’( holding b15 ) ’, ’( clear b12 ) ’] , deleted =[ ’( clear b15 ) ’, ’( on b15 b12 ) ’, ’( arm - empty ) ’]

work page
[80]

action =( unstack b9 b16 ) , h =39 , added =[ ’( holding b9 ) ’, ’( clear b16 ) ’] , deleted =[ ’( on b9 b16 ) ’, ’( clear b9 ) ’, ’( arm - empty ) ’]

work page

Showing first 80 references.

[1] [1]

Counterexample guided inductive synthesis modulo theories

Alessandro Abate, Cristina David, Pascal Kesseli, Daniel Kroening, and Elizabeth Polgreen. Counterexample guided inductive synthesis modulo theories. InInternational Conference Computer Aided Verification, 2018

work page 2018

[2] [2]

Mohammad Abdulaziz, Florian Pommerening, and Augusto B. Corrêa. Mechanically proving guarantees of generalized heuristics: First results and ongoing work. InICAPS Workshop on Heuristics and Search for Domain-independent Planning, 2022

work page 2022

[3] [3]

Pyperplan.https://doi.org/10.5281/zenodo.3700819, 2020

Yusra Alkhazraji, Matthias Frorath, Markus Grützner, Malte Helmert, Thomas Liebetraut, Robert Mattmüller, Manuela Ortlieb, Jendrik Seipp, Tobias Springenberg, Philip Stahl, and Jan Wülfing. Pyperplan.https://doi.org/10.5281/zenodo.3700819, 2020

work page doi:10.5281/zenodo.3700819 2020

[4] [4]

Arfaee, Sandra Zilles, and Robert C

Shahab J. Arfaee, Sandra Zilles, and Robert C. Holte. Learning heuristic functions for large state spaces.Artificial Intelligence, 175:2075–2098, 2011

work page 2075

[5] [5]

Cambridge University Press, 2017

Franz Baader, Ian Horrocks, Carsten Lutz, and Uli Sattler, editors.An Introduction to Descrip- tion Logic. Cambridge University Press, 2017

work page 2017

[6] [6]

Understanding sample generation strategies for learning heuristic functions in classical planning.Journal of Artificial Intelligence Research, 80:243–271, 2024

Rafael V Bettker, Pedro P Minini, André G Pereira, and Marcus Ritt. Understanding sample generation strategies for learning heuristic functions in classical planning.Journal of Artificial Intelligence Research, 80:243–271, 2024

work page 2024

[7] [7]

Exploring and benchmarking the planning capabilities of large language models

Bernd Bohnet, Azade Nova, Aaron T Parisi, Kevin Swersky, Katayoon Goshvadi, Hanjun Dai, Dale Schuurmans, Noah Fiedel, and Hanie Sedghi. Exploring and benchmarking the planning capabilities of large language models. arXiv:2406.13094 [cs.CL], 2024

work page arXiv 2024

[8] [8]

Planning as heuristic search.Artificial Intelligence, 129(1): 5–33, 2001

Blai Bonet and Héctor Geffner. Planning as heuristic search.Artificial Intelligence, 129(1): 5–33, 2001

work page 2001

[9] [9]

Learning features and abstract actions for computing generalized plans

Blai Bonet, Guillem Francès, and Héctor Geffner. Learning features and abstract actions for computing generalized plans. InProc. AAAI 2019, pages 2703–2710, 2019

work page 2019

[10] [10]

Corrêa, Salomé Eriksson, Patrick Ferber, Jendrik Seipp, and Silvan Sievers

Clemens Büchner, Remo Christen, Augusto B. Corrêa, Salomé Eriksson, Patrick Ferber, Jendrik Seipp, and Silvan Sievers. Fast Downward Stone Soup 2023. InIPC-10 Planner Abstracts, 2023

work page 2023

[11] [11]

Chen, Sylvie Thiébaux, and Felipe Trevizan

Dillon Z. Chen, Sylvie Thiébaux, and Felipe Trevizan. Learning domain-independent heuristics for grounded and lifted planning. InProc. AAAI 2024, pages 20078–20086, 2024

work page 2024

[12] [12]

Chen, Johannes Zenn, Tristan Cinquin, and Sheila A

Dillon Z. Chen, Johannes Zenn, Tristan Cinquin, and Sheila A. McIlraith. Language models for generalised pddl planning: Synthesising sound and programmatic policies. arXiv:2508.18507 [cs.CL], 2025

work page arXiv 2025

[13] [13]

A unified theory of heuristic evaluation functions and its application to learning

Jens Christensen and Richard E Korf. A unified theory of heuristic evaluation functions and its application to learning. In Tom Kehler and Stanley J. Rosenschein, editors,Proc. AAAI 1986, pages 148–152, 1986

work page 1986

[14] [14]

Corrêa and Giuseppe De Giacomo

Augusto B. Corrêa and Giuseppe De Giacomo. Lifted planning: Recent advances in planning using first-order representations. InProc. IJCAI 2024, pages 8010–8019, 2024

work page 2024

[15] [15]

Corrêa, Guillem Francès, Markus Hecher, Davide Mario Longo, and Jendrik Seipp

Augusto B. Corrêa, Guillem Francès, Markus Hecher, Davide Mario Longo, and Jendrik Seipp. Scorpion Maidu: Width search in the Scorpion planning system. InIPC-10 Planner Abstracts, 2023. 10

work page 2023

[16] [16]

Corrêa, André Grahl Pereira, and Jendrik Seipp

Augusto B. Corrêa, André Grahl Pereira, and Jendrik Seipp. Classical planning with LLM- generated heuristics: Challenging the state of the art with Python code. InProc. NeurIPS 2025, 2025

work page 2025

[17] [17]

Frontier Large Language Models Rival State-of-the-Art Planners

Augusto B. Corrêa, André G. Pereira, and Jendrik Seipp. The 2025 planning performance of frontier large language models. arXiv:2511.09378 [cs.CL], 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

Culberson

Joseph C. Culberson. Sokoban is PSPACE-complete. Technical Report TR 97-02, Department of Computing Science, The University of Alberta, Edmonton, Alberta, Canada, 1997

work page 1997

[19] [19]

Novelty vs

Simon Dold and Malte Helmert. Novelty vs. potential heuristics: A comparison of hardness measures for satisficing planning. InProc. AAAI 2024, pages 20692–20699, 2024

work page 2024

[20] [20]

Learning sketches for decomposing planning problems into subproblems of bounded width

Dominik Drexler, Jendrik Seipp, and Hector Geffner. Learning sketches for decomposing planning problems into subproblems of bounded width. InProc. ICAPS 2022, pages 62–70, 2022

work page 2022

[21] [21]

Expressing and exploiting subgoal structure in classical planning using sketches.Journal of Artificial Intelligence Research, 80: 171–208, 2024

Dominik Drexler, Jendrik Seipp, and Hector Geffner. Expressing and exploiting subgoal structure in classical planning using sketches.Journal of Artificial Intelligence Research, 80: 171–208, 2024

work page 2024

[22] [22]

Nau, and V

Kutluhan Erol, Dana S. Nau, and V . S. Subrahmanian. Complexity, decidability and unde- cidability results for domain-independent planning.Artificial Intelligence, 76(1–2):75–88, 1995

work page 1995

[23] [23]

Neural network heuristic functions for classical planning: Bootstrapping and comparison to other methods

Patrick Ferber, Florian Geißer, Felipe Trevizan, Malte Helmert, and Jörg Hoffmann. Neural network heuristic functions for classical planning: Bootstrapping and comparison to other methods. InProc. ICAPS 2022, pages 583–587, 2022

work page 2022

[24] [24]

Corrêa, Cedric Geissmann, and Florian Pommerening

Guillem Francès, Augusto B. Corrêa, Cedric Geissmann, and Florian Pommerening. Generalized potential heuristics for classical planning. InProc. IJCAI 2019, pages 5554–5561, 2019

work page 2019

[25] [25]

Gemini 3.1 Pro model card

Gemini Team Google. Gemini 3.1 Pro model card. Google DeepMind, 2026

work page 2026

[26] [26]

Morgan Kaufmann, 2004

Malik Ghallab, Dana Nau, and Paolo Traverso.Automated Planning: Theory and Practice. Morgan Kaufmann, 2004

work page 2004

[27] [27]

DecStar-2023

Daniel Gnad, Álvaro Torralba, and Alexander Shleyfman. DecStar-2023. InIPC-10 Planner Abstracts, 2023

work page 2023

[28] [28]

Leveraging pre-trained large language models to construct and utilize world models for model-based task planning

Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. InProc. NeurIPS 2023, pages 79081–79094, 2023

work page 2023

[29] [29]

Guiding GBFS through learned pairwise rankings

Mingyu Hao, Felipe Trevizan, Sylvie Thiébaux, Patrick Ferber, and Jörg Hoffmann. Guiding GBFS through learned pairwise rankings. InProc. IJCAI 2024, pages 6724–6732, 2024

work page 2024

[30] [30]

Morgan & Claypool, 2019

Patrik Haslum, Nir Lipovetzky, Daniele Magazzeni, and Christian Muise.An Introduction to the Planning Domain Definition Language, volume 13 ofSynthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2019

work page 2019

[31] [31]

The Fast Downward planning system.Journal of Artificial Intelligence Research, 26:191–246, 2006

Malte Helmert. The Fast Downward planning system.Journal of Artificial Intelligence Research, 26:191–246, 2006

work page 2006

[32] [32]

Malte Helmert, Silvan Sievers, Alexander Rovner, and Augusto B. Corrêa. On the complexity of heuristic synthesis for satisficing classical planning: Potential heuristics and beyond. InProc. ICAPS 2022, pages 124–133, 2022

work page 2022

[33] [33]

The FF planning system: Fast plan generation through heuristic search.Journal of Artificial Intelligence Research, 14:253–302, 2001

Jörg Hoffmann and Bernhard Nebel. The FF planning system: Fast plan generation through heuristic search.Journal of Artificial Intelligence Research, 14:253–302, 2001

work page 2001

[34] [34]

Chasing progress, not perfection: Revisiting strategies for end-to-end LLM plan generation

Sukai Huang, Trevor Cohn, and Nir Lipovetzky. Chasing progress, not perfection: Revisiting strategies for end-to-end LLM plan generation. arXiv:2412.10675 [cs.CL], 2024. 11

work page arXiv 2024

[35] [35]

Seshia, and Ashish Tiwari

Susmit Jha, Sumit Gulwani, Sanjit A. Seshia, and Ashish Tiwari. Oracle-guided component- based program synthesis. InACM/IEEE International Conference on Software, 2010

work page 2010

[36] [36]

Competition-Level Code Generation with AlphaCode

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[37] [37]

Best-first width search: Exploration and exploitation in classical planning

Nir Lipovetzky and Hector Geffner. Best-first width search: Exploration and exploitation in classical planning. InProc. AAAI 2017, pages 3590–3596, 2017

work page 2017

[38] [38]

LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Pe- ter Stone. LLM+P: Empowering large language models with optimal planning proficiency. arXiv:2304.11477 [cs.CL], 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[39] [39]

Evolution of heuristics: Towards efficient automatic algorithm design using large language model

Fei Liu, Xialiang Tong, Mingxuan Yuan, Xi Lin, Fu Luo, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Evolution of heuristics: Towards efficient automatic algorithm design using large language model. InProc. ICML 2024, 2024

work page 2024

[40] [40]

The 1998 AI Planning Systems competition.AI Magazine, 21(2):35–55, 2000

Drew McDermott. The 1998 AI Planning Systems competition.AI Magazine, 21(2):35–55, 2000

work page 1998

[41] [41]

GenePlan: Evolv- ing better generalized PDDL plans using large language models

Andrew Murray, Danial Dervovic, Alberto Pozanco, and Michael Cashmore. GenePlan: Evolv- ing better generalized PDDL plans using large language models. InProc. ICAPS 2026, 2026

work page 2026

[42] [42]

Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and algor...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

Manquinho

Pedro Orvalho, Mikolás Janota, and Vasco M. Manquinho. Counterexample guided program repair using zero-shot learning and maxsat-based fault localization. InAAAI Conference on Artificial Intelligence, 2025

work page 2025

[44] [44]

Large language models as planning domain generators

James Oswald, Kavitha Srinivas, Harsha Kokel, Junkyu Lee, Michael Katz, and Shirin Sohrabi. Large language models as planning domain generators. InProc. ICAPS 2024, pages 423–431, 2024

work page 2024

[45] [45]

Stefan O’Toole, Miquel Ramirez, Nir Lipovetzky, and Adrian R. Pearce. Sampling from pre- images to learn heuristic functions for classical planning (extended abstract). InProc. SoCS 2022, pages 308–310, 2022

work page 2022

[46] [46]

Addison- Wesley, 1984

Judea Pearl.Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison- Wesley, 1984

work page 1984

[47] [47]

First results on the effect of error in heuristic search

Ira Pohl. First results on the effect of error in heuristic search. In Bernard Meltzer and Donald Michie, editors,Machine Intelligence 5, pages 219–236. Edinburgh University Press, 1969

work page 1969

[48] [48]

From non-negative to general operator cost partitioning

Florian Pommerening, Malte Helmert, Gabriele Röger, and Jendrik Seipp. From non-negative to general operator cost partitioning. InProc. AAAI 2015, pages 3335–3341, 2015

work page 2015

[49] [49]

The LAMA planner: Guiding cost-based anytime planning with landmarks.Journal of Artificial Intelligence Research, 39:127–177, 2010

Silvia Richter and Matthias Westphal. The LAMA planner: Guiding cost-based anytime planning with landmarks.Journal of Artificial Intelligence Research, 39:127–177, 2010

work page 2010

[50] [50]

Pawan Kumar, Emilien Dupont, Francisco J

Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models.Nature, 625(7995):468–475, 2024

work page 2024

[51] [51]

Learning general policies for planning through GPT models

Nicholas Rossetti, Massimiliano Tummolo, Alfonso Emilio Gerevini, Luca Putelli, Ivan Serina, Mattia Chiari, and Matteo Olivato. Learning general policies for planning through GPT models. InProc. ICAPS 2024, pages 500–508, 2024. 12

work page 2024

[52] [52]

Learning from multiple heuristics

Mehdi Samadi, Ariel Felner, and Jonathan Schaeffer. Learning from multiple heuristics. In Proc. AAAI 2008, pages 357–362, 2008

work page 2008

[53] [53]

Some studies in machine learning using the game of checkers.IBM Journal of research and development, 1959

Arthur L Samuel. Some studies in machine learning using the game of checkers.IBM Journal of research and development, 1959

work page 1959

[54] [54]

Correlation complex- ity of classical planning domains

Jendrik Seipp, Florian Pommerening, Gabriele Röger, and Malte Helmert. Correlation complex- ity of classical planning domains. InProc. IJCAI 2016, pages 3242–3250, 2016

work page 2016

[55] [55]

Downward Lab

Jendrik Seipp, Florian Pommerening, Silvan Sievers, and Malte Helmert. Downward Lab. https://doi.org/10.5281/zenodo.790461, 2017

work page doi:10.5281/zenodo.790461 2017

[56] [56]

Learning domain-independent planning heuristics with hypergraph networks

William Shen, Felipe Trevizan, and Sylvie Thiébaux. Learning domain-independent planning heuristics with hypergraph networks. InProc. ICAPS 2020, pages 574–584, 2020

work page 2020

[57] [57]

Rabbah, Rastislav Bodík, and Kemal Ebcioglu

Armando Solar-Lezama, Rodric M. Rabbah, Rastislav Bodík, and Kemal Ebcioglu. Program- ming by sketching for bit-streaming programs. In Vivek Sarkar and Mary W. Hall, editors,ACM SIGPLAN 2005 Conference on Programming anguage Design and Implementation, 2005

work page 2005

[58] [58]

Learning general optimal policies with graph neural networks: Expressive power, transparency, and limits

Simon Ståhlberg, Blai Bonet, and Hector Geffner. Learning general optimal policies with graph neural networks: Expressive power, transparency, and limits. InProc. ICAPS 2022, pages 629–637, 2022

work page 2022

[59] [59]

Chain of thoughtlessness? an analysis of CoT in planning

Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. Chain of thoughtlessness? an analysis of CoT in planning. InProc. NeurIPS 2024, pages 29106–29141, 2024

work page 2024

[60] [60]

The 2023 International Planning Competition.AI Magazine, 45(2):280–296,

Ayal Taitler, Ron Alford, Joan Espasa, Gregor Behnke, Daniel Fišer, Michael Gimelfarb, Florian Pommerening, Scott Sanner, Enrico Scala, Dominik Schreiber, Javier Segovia-Aguas, and Jendrik Seipp. The 2023 International Planning Competition.AI Magazine, 45(2):280–296,

work page 2023

[61] [61]

doi: 10.1002/aaai.12169

work page doi:10.1002/aaai.12169

[62] [62]

LLMs as planning formalizers: A survey for leveraging large language models to construct automated planning models

Marcus Tantakoun, Christian Muise, and Xiaodan Zhu. LLMs as planning formalizers: A survey for leveraging large language models to construct automated planning models. InFindings of the Association for Computational Linguistics: ACL 2025, 2025

work page 2025

[63] [63]

Successor-generator planning with LLM-generated heuristics

Alexander Tuisov, Yonatan Vernik, and Alexander Shleyfman. Successor-generator planning with LLM-generated heuristics. InProc. ICAPS 2026, 2026

work page 2026

[64] [64]

PlanBench: An extensible benchmark for evaluating large language models on planning and reasoning about change

Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. PlanBench: An extensible benchmark for evaluating large language models on planning and reasoning about change. InProc. NeurIPS 2023, pages 38975–38987, 2023

work page 2023

[65] [65]

On the planning abilities of large language models - A critical investigation

Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models - A critical investigation. InProc. NeurIPS 2023, pages 75993–76005, 2023. 13 A Benchmark Domains We evaluate on the ten domains of the Learning Track of the IPC 2023 [ 60]. We give a brief description of each domain below, b...

work page 2023

[66] [66]

Alive State : r e a c h a b l e from the initial state , solvable , and not a goal state

work page

[67] [67]

I m p r o v i n g S u c c e s s o r : a s u c c e s s o r s ’ such that h (s ’) < h ( s )

work page

[68] [68]

All states that DFS can reach must contain an i m p r o v i n g s u c c e s s o r

Direct : every alive state r e a c h a b l e from the initial state via a se qu en ce of s tr ict ly d e c r e a s i n g h values has at least one i m p r o v i n g s u c c e s s o r . All states that DFS can reach must contain an i m p r o v i n g s u c c e s s o r

work page

[69] [69]

Dead - End Av oi di ng : no i m p r o v i n g s u c c e s s o r of an alive state r e a c h a b l e by DFS is itself u n s o l v a b l e ( h = i nfi ni ty ) . </ success - criteria > < previous - iterations > I t e r a t i o n 0: - Prompt Type : Init - Solved 5 i n s t a n c e s ( Average ex pan de d nodes : 2.4) - Direct v a l i d a t e d on 5 tasks ( la...

work page

[70] [70]

action =( stack b3 b2 ) , h =6 , added =[ ’( clear b3 ) ’, ’( on b3 b2 ) ’, ’( arm - empty ) ’] , deleted =[ ’( holding b3 ) ’, ’( clear b2 ) ’]

work page

[71] [71]

action =( stack b3 b1 ) , h =6 , added =[ ’( clear b3 ) ’, ’( on b3 b1 ) ’, ’( arm - empty ) ’] , deleted =[ ’( holding b3 ) ’, ’( clear b1 ) ’]

work page

[72] [72]

</ previous - heuristic - code > I t e r a t i o n 1: - Prompt Type : Pr ope rt y - Main Idea :

action =( putdown b3 ) , h =4 , added =[ ’( clear b3 ) ’, ’( on - table b3 ) ’, ’( arm - empty ) ’] , deleted =[ ’( holding b3 ) ’] < previous - heuristic - code > ... </ previous - heuristic - code > I t e r a t i o n 1: - Prompt Type : Pr ope rt y - Main Idea : ... - Solved 16 i n s t a n c e s ( Average ex pa nd ed nodes : 7.0) - Direct v a l i d a t e...

work page

[73] [73]

action =( unstack b3 b2 ) , h =11 , added =[ ’( holding b3 ) ’, ’( clear b2 ) ’] , deleted =[ ’( clear b3 ) ’, ’( on b3 b2 ) ’, ’( arm - empty ) ’]

work page

[74] [74]

action =( pickup b4 ) , h =11 , added =[ ’( holding b4 ) ’] , deleted =[ ’( on - table b4 ) ’, ’( clear b4 ) ’, ’( arm - empty ) ’]

work page

[75] [75]

</ previous - heuristic - code > I t e r a t i o n 2: - Prompt Type : Pr ope rt y - Main Idea :

action =( pickup b1 ) , h =11 , added =[ ’( holding b1 ) ’] , deleted =[ ’( clear b1 ) ’, ’( on - table b1 ) ’, ’( arm - empty ) ’] < previous - heuristic - code > ... </ previous - heuristic - code > I t e r a t i o n 2: - Prompt Type : Pr ope rt y - Main Idea : ... - Solved 17 i n s t a n c e s ( Average ex pa nd ed nodes : 7.8) - Direct v a l i d a t e...

work page

[76] [76]

action =( unstack b5 b2 ) , h =9 , added =[ ’( holding b5 ) ’, ’( clear b2 ) ’] , deleted =[ ’( on b5 b2 ) ’, ’( clear b5 ) ’, ’( arm - empty ) ’]

work page

[77] [77]

action =( pickup b1 ) , h =9 , added =[ ’( holding b1 ) ’] , deleted =[ ’( clear b1 ) ’, ’( on - table b1 ) ’, ’( arm - empty ) ’]

work page

[78] [78]

</ previous - heuristic - code > </ previous - iterations > < current - status > Domain P ro gre ss : solving 99 i n s t a n c e s of i n c r e a s i n g d i f f i c u l t y

action =( pickup b3 ) , h =9 , added =[ ’( holding b3 ) ’] , deleted =[ ’( clear b3 ) ’, ’( on - table b3 ) ’, ’( arm - empty ) ’] < previous - heuristic - code > ... </ previous - heuristic - code > </ previous - iterations > < current - status > Domain P ro gre ss : solving 99 i n s t a n c e s of i n c r e a s i n g d i f f i c u l t y . V a l i d a t ...

work page

[79] [79]

action =( unstack b15 b12 ) , h =39 , added =[ ’( holding b15 ) ’, ’( clear b12 ) ’] , deleted =[ ’( clear b15 ) ’, ’( on b15 b12 ) ’, ’( arm - empty ) ’]

work page

[80] [80]

action =( unstack b9 b16 ) , h =39 , added =[ ’( holding b9 ) ’, ’( clear b16 ) ’] , deleted =[ ’( on b9 b16 ) ’, ’( clear b9 ) ’, ’( arm - empty ) ’]

work page