Zero-Shot Goal Recognition with Large Language Models

Felipe Meneguzzi; Kin Max Piamolini Gusmão; Nathan Gavenski; Nir Oren

arxiv: 2605.15333 · v1 · pith:SQGZQU75new · submitted 2026-05-14 · 💻 cs.AI

Pith reviewed 2026-05-19 16:04 UTC · model grok-4.3

classification 💻 cs.AI

keywords: goal recognition · large language models · zero-shot evaluation · planning domains · PDDL · evidence integration · abductive inference

The pith

Large language models show uneven competence at recognizing goals from action observations, with some improving as evidence grows and others staying fixed to prior knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates frontier large language models as zero-shot goal recognizers on standard classical planning benchmarks expressed in PDDL. Goal recognition asks a system to infer the intended goal from a sequence of observed actions, which aligns with the abductive strengths of LLMs that draw on world knowledge rather than generate new plans. The central finding is that model performance diverges sharply: certain models increase their accuracy in line with the amount of evidence supplied and approach the level of established landmark-based methods once all observations are available, while other models show little change and remain anchored to their pre-trained priors. Qualitative review of the models' reasoning traces indicates that this split arises from genuine differences in how evidence is incorporated rather than from domain familiarity or prompting details. The work therefore treats goal recognition itself as a diagnostic benchmark for the planning-related knowledge that LLMs actually possess.

Core claim

Frontier LLMs can be applied zero-shot to goal recognition on classical PDDL domains; some models scale their accuracy with accumulating observations and approach landmark-based performance at full evidence, whereas others remain anchored to world-knowledge priors regardless of evidence volume, and this split is visible in the structure of their reasoning traces.

What carries the argument

Zero-shot goal recognition evaluation on PDDL benchmarks that measures how LLMs integrate increasing numbers of observations versus reliance on pre-trained priors.

If this is right

Models that scale with evidence reach accuracy comparable to landmark-based recognizers when given complete observation sequences.
Models that do not scale remain limited by their initial world-knowledge priors even when more actions are observed.
Goal recognition serves as a diagnostic task for assessing the planning knowledge already present inside LLMs.
The task complements generative planning benchmarks because it tests consistency checking rather than novel sequence generation.
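
To make the landmark-based reference point in this list more concrete, the sketch below ranks candidate goals by the fraction of each goal's landmarks that appear among the observations so far. It is a minimal illustration under stated assumptions, not the recognizer used in the paper: landmarks are treated as actions and written by hand (a real system would extract them from the PDDL domain), and the names `landmark_score` and `recognize_goal` are hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence

Action = str
Goal = str


def landmark_score(landmarks: FrozenSet[Action], observations: Sequence[Action]) -> float:
    """Fraction of a goal's landmarks that appear among the observations."""
    if not landmarks:
        return 0.0
    observed = set(observations)
    return len(landmarks & observed) / len(landmarks)


def recognize_goal(
    goal_landmarks: Dict[Goal, FrozenSet[Action]],
    observations: Sequence[Action],
) -> List[Goal]:
    """Return every goal hypothesis that attains the highest landmark score."""
    scores = {g: landmark_score(lms, observations) for g, lms in goal_landmarks.items()}
    best = max(scores.values())
    return [g for g, s in scores.items() if s == best]


# Hypothetical Blocks World hypotheses with hand-written action landmarks
# (illustrative only; not taken from the paper's benchmarks).
goal_landmarks = {
    "(on a b)": frozenset({"(pick-up a)", "(stack a b)"}),
    "(on a c)": frozenset({"(pick-up a)", "(stack a c)"}),
}

partial = ["(pick-up a)"]
full = ["(pick-up a)", "(stack a b)"]

print(recognize_goal(goal_landmarks, partial))
print(recognize_goal(goal_landmarks, full))
```

With the single observation in `partial`, both hypotheses tie because each has one of its two landmarks observed; once `(stack a b)` is seen, only `(on a b)` keeps the top score. This is the sense in which a recognizer "scales with evidence".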

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid systems could route goal-recognition subtasks to the LLMs that demonstrably improve with evidence while reserving generative planning for symbolic solvers.
The same evaluation protocol could be extended to non-PDDL domains to test whether the evidence-integration split generalizes beyond classical planning.
Prompt-engineering techniques that explicitly require step-by-step evidence weighing might reduce the performance gap between the two classes of models.

Load-bearing premise

The differences in performance trace to a basic distinction in how models integrate evidence rather than to artifacts of prompting, model size, or prior exposure to the domains.

What would settle it

Re-running the evaluation with every model given identical, minimal prompts and domain descriptions, then checking whether accuracy still diverges with evidence volume and whether reasoning traces continue to show distinct integration patterns.
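
The accuracy side of this check can be sketched as a small analysis script. The records below are made-up placeholders, not results from the paper, and the record layout, the function names, and the `min_gain` threshold are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (model name, observability fraction, whether the predicted goal was correct)
Record = Tuple[str, float, bool]


def accuracy_by_observability(records: Iterable[Record]) -> Dict[str, Dict[float, float]]:
    """Group records by model and observability level and compute accuracy."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [correct, total]
    for model, level, correct in records:
        cell = counts[model][level]
        cell[0] += int(correct)
        cell[1] += 1
    return {
        model: {level: c / n for level, (c, n) in sorted(levels.items())}
        for model, levels in counts.items()
    }


def scales_with_evidence(curve: Dict[float, float], min_gain: float = 0.1) -> bool:
    """True if accuracy at the highest level exceeds the lowest by at least min_gain."""
    levels = sorted(curve)
    return curve[levels[-1]] - curve[levels[0]] >= min_gain


# Placeholder data: "model-a" improves with evidence, "model-b" does not.
records = [
    ("model-a", 0.1, False), ("model-a", 0.1, True),
    ("model-a", 1.0, True), ("model-a", 1.0, True),
    ("model-b", 0.1, True), ("model-b", 0.1, False),
    ("model-b", 1.0, False), ("model-b", 1.0, True),
]

curves = accuracy_by_observability(records)
for model, curve in curves.items():
    print(model, curve, scales_with_evidence(curve))
```

A single endpoint difference is a deliberately crude criterion; a real re-analysis would more plausibly compare whole curves across seeds and domains, and would still need the qualitative trace comparison alongside it.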

Figures

Figures reproduced from arXiv: 2605.15333 by Felipe Meneguzzi, Kin Max Piamolini Gusmão, Nathan Gavenski, Nir Oren.

**Figure 2.** Prompt template used for zero-shot goal recognition. Placeholders …

**Figure 3.** GPT-4o’s reasoning for a problem with 10% observability in the …

**Figure 4.** GPT-4o’s reasoning for a problem with 10% observability in the …

**Figure 5.** GPT-4o’s reasoning for a problem with 10% observability in the …

**Figure 6.** GPT-4o’s reasoning for a problem with 10% observability in the …

**Figure 7.** GPT-4o’s reasoning for a problem with 30% observability in the …

**Figure 8.** GPT-4o’s reasoning for a problem with 70% observability in the …

**Figure 9.** GPT-4o’s reasoning for a problem with 70% observability in the …

**Figure 10.** GPT-4o’s reasoning for a problem with 30% observability in the …

**Figure 11.** GPT-OSS’s reasoning for a problem with 10% observability in the …

**Figure 12.** GPT-OSS’s reasoning for a problem with 10% observability in the …

**Figure 13.** GPT-OSS’s reasoning for a problem with 30% observability in the …

**Figure 14.** GPT-5.4’s reasoning for a problem with 30% observability in the …

**Figure 15.** GPT-5.4’s reasoning for a problem with full observability in the …

**Figure 16.** GPT-5.4’s reasoning for a problem with full observability in the …

**Figure 17.** GPT-5.4’s reasoning for a problem with 30% observability in the Blocks World domain. The model fails to identify the correct goal, overcommitting to a single hypothesis with lacking information.

**Figure 18.** Qwen 3.5’s reasoning for a problem with 10% observability in the Blocks World domain. The model identifies the precondition violation but refuses to continue from there.

**Figure 19.** Qwen 3.5’s reasoning for a problem with 10% observability in the Blocks World domain. The model generates unsound reasoning, generates output outside the required format, and selects a hypothesis that was not even among the provided goal hypotheses, showcasing a complete failure in understanding the task and the provided information.

**Figure 20.** Qwen 3.5’s reasoning for a problem with 30% observability in the Blocks World domain. The model fails completely and generates a nonsensical response.

**Figure 21.** Qwen 3.5’s reasoning for a problem with 30% observability in the Blocks World domain. The model fails completely and generates a nonsensical response.

**Figure 22.** Qwen 3.5’s reasoning for a problem with 50% observability in the Blocks World domain. The model correctly identifies the most likely goal hypothesis based on the observed actions and state changes.

Original abstract

Large language models have recently reached near-parity with classical planners on well-known planning domains, yet this competence relies on world-knowledge exploitation rather than genuine symbolic reasoning. Goal recognition is a complementary abductive task structurally better suited to LLM strengths: it consists of evaluating consistency with world knowledge rather than generating novel action sequences. This paper provides the first systematic zero-shot evaluation of frontier LLMs as goal recognisers on key classical PDDL benchmarks. Our results show that LLM competence on goal recognition is uneven: some models scale with evidence and approach landmark-based accuracy at full observations, while others remain anchored to world-knowledge priors regardless of how much evidence accumulates. Qualitative analysis of model reasoning traces reveals that this divergence reflects a fundamental difference in evidence integration rather than domain familiarity. These findings position goal recognition as a principled benchmark for the foundational planning knowledge of LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper gives the first systematic zero-shot LLM evaluation on goal recognition over PDDL benchmarks and finds an uneven split in how models use accumulating evidence.

The main takeaway is that this is the first systematic zero-shot evaluation of frontier LLMs on goal recognition using standard PDDL domains. Some models improve as observations accumulate and approach landmark-based accuracy, while others stay fixed on their world-knowledge priors no matter how much evidence arrives. That split is the central result and it is presented clearly enough to be testable.

The setup itself is a good fit: goal recognition is an abductive consistency check rather than plan generation, so it aligns better with what LLMs already do well. Anchoring the results to an external landmark method gives an independent reference point instead of self-referential scoring. The paper therefore supplies a concrete benchmark that researchers can pick up and extend.

The soft spot sits in the explanation for the split. The qualitative reading of reasoning traces is offered as evidence of a real difference in evidence integration, but that reading could easily be driven by prompt phrasing, model scale, or incidental domain familiarity instead. The abstract supplies no model names, no exact observation counts, no quantitative metrics, and no statistical details, so it is difficult to judge how stable the pattern is. A reader who wants tighter controls on the qualitative part will notice the gap right away.

This work is aimed at people who study LLM planning competence or who need intention inference in robotics and assistants. Anyone looking for a new diagnostic for planning knowledge will find something usable here. I would send it to referees. The evaluation is new, the claims are specific enough to check, and the task choice is sensible even if the current writeup needs more detail on the analysis and fuller data reporting.

Referee Report

1 major / 1 minor

Summary. This paper conducts the first systematic zero-shot evaluation of frontier large language models on goal recognition tasks using classical PDDL benchmarks. The key results indicate uneven competence among models: some LLMs scale their performance with accumulating evidence and approach landmark-based accuracy with full observations, whereas others stay reliant on world-knowledge priors irrespective of evidence amount. Qualitative review of the models' reasoning traces is used to argue that this stems from differences in evidence integration abilities rather than domain familiarity.

Significance. If substantiated, these findings establish goal recognition as an effective benchmark for probing the planning knowledge embedded in LLMs, particularly their capacity for abductive consistency checking over generative planning. This could guide future work in hybrid planning systems that leverage LLMs where their evidence integration strengths are evident, and the use of reproducible benchmarks and qualitative trace analysis adds transparency to the evaluation.

major comments (1)

[Qualitative Analysis of Reasoning Traces] The central claim that performance divergence reflects a fundamental difference in evidence integration (rather than prompting artifacts, model scale, or domain familiarity) rests on the qualitative analysis of reasoning traces. The manuscript does not include explicit controls such as prompt ablations varying only observation density while holding phrasing fixed, or size-matched model comparisons. This is load-bearing for the main result, as alternative explanations remain possible without such tests.

minor comments (1)

[Abstract] The abstract would benefit from including at least one specific quantitative metric (e.g., accuracy value or model name) to convey the evaluation scale more concretely.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of goal recognition as a benchmark for LLM planning knowledge. We address the major comment below and describe the changes we will make to strengthen the manuscript.

Referee: [Qualitative Analysis of Reasoning Traces] The central claim that performance divergence reflects a fundamental difference in evidence integration (rather than prompting artifacts, model scale, or domain familiarity) rests on the qualitative analysis of reasoning traces. The manuscript does not include explicit controls such as prompt ablations varying only observation density while holding phrasing fixed, or size-matched model comparisons. This is load-bearing for the main result, as alternative explanations remain possible without such tests.

Authors: We agree that the qualitative analysis of reasoning traces is central to our interpretation and that stronger controls would reduce the plausibility of alternative explanations. The current traces were selected to illustrate consistent patterns of evidence use versus prior anchoring across observation levels, but we acknowledge the absence of systematic ablations. In the revised manuscript we will add (i) prompt ablations that vary only the number of observations while holding all other phrasing fixed and (ii) explicit comparisons against size-matched model pairs where such pairs exist in our evaluation set. These additions will be presented in a new subsection and will directly test whether the observed divergence persists under controlled conditions. revision: yes
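
The first promised control, varying only the number of observations while holding all other phrasing fixed, could be set up along the following lines. This is a sketch under assumptions, not the authors' protocol: `TEMPLATE` is a hypothetical stand-in (the paper's actual prompt is its Figure 2 and is not reproduced here), the helper names are invented, and the observability levels simply mirror the percentages named in the figure captions.

```python
import random
from typing import List, Sequence

# Hypothetical fixed template; NOT the paper's prompt.
TEMPLATE = (
    "Domain:\n{domain}\n\n"
    "Candidate goals:\n{goals}\n\n"
    "Observed actions:\n{observations}\n\n"
    "Which candidate goal is the agent pursuing?"
)


def sample_observations(plan: Sequence[str], observability: float, seed: int) -> List[str]:
    """Keep a fraction of the plan's actions, preserving their original order."""
    k = max(1, round(len(plan) * observability))
    rng = random.Random(seed)
    kept = sorted(rng.sample(range(len(plan)), k))
    return [plan[i] for i in kept]


def build_prompt(domain: str, goals: Sequence[str], observations: Sequence[str]) -> str:
    """Fill the fixed template; only the observation list differs across conditions."""
    return TEMPLATE.format(
        domain=domain,
        goals="\n".join(goals),
        observations="\n".join(observations),
    )


# Illustrative Blocks World plan and goal hypotheses (assumed, not from the benchmark).
plan = ["(unstack a c)", "(put-down a)", "(pick-up p)", "(stack p a)"]
goals = ["(on p a)", "(on a p)"]

prompts = {
    level: build_prompt("blocksworld", goals, sample_observations(plan, level, seed=0))
    for level in (0.1, 0.3, 0.5, 0.7, 1.0)
}
```

Because every prompt is produced from the same template and the same seed, any change in a model's answer across levels can be attributed to the observation list rather than to wording.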

Circularity Check

0 steps flagged

Empirical results benchmarked against external landmark accuracy; no reduction to self-defined terms or fitted predictions

The paper conducts a zero-shot empirical evaluation of LLMs on standard PDDL goal recognition benchmarks and directly compares observed accuracies to an independent landmark-based method. The central claims about scaling with evidence versus anchoring to priors are derived from these external comparisons and observed performance trends rather than from any self-referential definition, parameter fitting renamed as prediction, or load-bearing self-citation chain. Qualitative analysis of reasoning traces is presented as interpretive support but does not circularly redefine the measured divergence. Minor background self-citations may exist for context but are not required to sustain the reported findings, leaving the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study relies on standard assumptions from AI planning and LLM evaluation literature with no new free parameters or invented entities.

axioms (1)

Domain assumption: LLMs possess relevant world knowledge from pretraining that can be used for consistency evaluation in goal recognition.
The abstract contrasts this exploitation of world knowledge with genuine symbolic reasoning.

pith-pipeline@v0.9.0 · 5678 in / 1055 out tokens · 59375 ms · 2026-05-19T16:04:41.062824+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Goal recognition is an inverse problem... evaluating consistency with world knowledge rather than generating novel action sequences."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear
Paper passage: "landmark score... fraction of its landmarks observed so far"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

[1]

The computational complexity of propositional strips planning.Artificial Intelligence, 69(1-2):165–204, 1994

Tom Bylander. The computational complexity of propositional strips planning.Artificial Intelligence, 69(1-2):165–204, 1994

work page 1994
[2]

Goal recognition as a deep learning task: The grnet approach

Mattia Chiari, Alfonso Emilio Gerevini, Francesco Percassi, Luca Putelli, Ivan Serina, and Matteo Olivato. Goal recognition as a deep learning task: The grnet approach. In Sven Koenig, Roni Stern, and Mauro Vallati, editors,Proceedings of the Thirty-Third International Conference on Automated Planning and Scheduling, pages 560–568. AAAI Press, 2023. doi: ...

work page doi:10.1609/icaps.v33i1.27237 2023
[3]

Frontier Large Language Models Rival State-of-the-Art Planners

Augusto B. Corrêa, André Grahl Pereira, and Jendrik Seipp. The 2025 planning performance of frontier large language models.CoRR, abs/2511.09378, 2025. doi: 10.48550/ARXIV .2511. 09378

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2025
[4]

Ordered landmarks in planning.Journal of Artificial Intelligence Research, 22(1):215–278, 2004

J Hoffmann, J Porteous, and L Sebastia. Ordered landmarks in planning.Journal of Artificial Intelligence Research, 22(1):215–278, 2004. doi: 10.1613/jair.1492

work page doi:10.1613/jair.1492 2004
[5]

LLMs can’t plan, but can help planning in LLM- modulo frameworks

Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Kaya Stechly, Mudit Verma, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. LLMs can’t plan, but can help planning in LLM- modulo frameworks. InProceedings of the Forty-First International Conference on Machine Learning, 2024. URLhttps://arxiv.org/abs/2402.01817

work page arXiv 2024
[6]

and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024. doi: 10.1162/tacl_a_00638. URLhttps://aclanthology.org/2024.tacl-1.9/

work page doi:10.1162/tacl_a_00638 2024
[7]

A survey on goal recognition as planning

Felipe Meneguzzi and Ramon Fraga Pereira. A survey on goal recognition as planning. In Zhi-Hua Zhou, editor,Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pages 4524–4532, 2021. doi: 10.24963/ijcai.2021/616

work page doi:10.24963/ijcai.2021/616 2021
[8]

Morgan & Claypool Publishers, 2021

Reuth Mirsky, Sarah Keren, and Christopher Geib.Introduction to Symbolic Plan and Goal Recognition, volume 16. Morgan & Claypool Publishers, 2021

work page 2021
[9]

On the compilability and expressive power of propositional planning for- malisms.Journal of Artificial Intelligence Research, 12:271–315, 2000

Bernhard Nebel. On the compilability and expressive power of propositional planning for- malisms.Journal of Artificial Intelligence Research, 12:271–315, 2000

work page 2000
[10]

Goal and plan recognition datasets using classical planning domains, 2017

Ramon Fraga Pereira and Felipe Meneguzzi. Goal and plan recognition datasets using classical planning domains, 2017

work page 2017
[11]

Landmark-based approaches for goal recognition as planning.Artificial Intelligence, 279:103217, 2020

Ramon Fraga Pereira, Nir Oren, and Felipe Meneguzzi. Landmark-based approaches for goal recognition as planning.Artificial Intelligence, 279:103217, 2020. doi: 10. 1016/j.artint.2019.103217. URL https://linkinghub.elsevier.com/retrieve/pii/ S0004370219300013

work page arXiv 2020
[12]

Temporally extended goal recognition in fully observable non-deterministic domain models: Temporally extended goal recognition in fond planning.Applied Intelligence, December 2023

Ramon Fraga Pereira, Francesco Fuggitti, Felipe Meneguzzi, and Giuseppe De Giacomo. Temporally extended goal recognition in fully observable non-deterministic domain models: Temporally extended goal recognition in fond planning.Applied Intelligence, December 2023. ISSN 1573-7497. doi: 10.1007/s10489-023-05087-1

work page doi:10.1007/s10489-023-05087-1 2023
[13]

Plan recognition as planning

Miquel Ramírez and Hector Geffner. Plan recognition as planning. In Craig Boutilier, editor, Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1778– 1783, 2009. URLhttp://ijcai.org/Proceedings/09/Papers/296.pdf. 10

work page 2009
[14]

Probabilistic plan recognition using off-the-shelf classical planners

Miquel Ramírez and Hector Geffner. Probabilistic plan recognition using off-the-shelf classical planners. InAAAI Conference on Artificial Intelligence, pages 1121–1126, 2010. URL http: //dblp.org/rec/conf/aaai/RamirezG10

work page 2010
[15]

Towards efficient online goal recognition through deep learning

Lorenzo Serina, Mattia Chiari, Alfonso Emilio Gerevini, Luca Putelli, and Ivan Serina. Towards efficient online goal recognition through deep learning. InProceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, pages 1895–1903. International Foundation for Autonomous Agents and Multiagent Systems, 2025. ISBN 9798400714...

work page arXiv 1903
[16]

On the self-verification limitations of large language models on reasoning and planning tasks

Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. On the self-verification limitations of large language models on reasoning and planning tasks. InProceedings of the Thirteenth International Conference on Learning Representations, 2025. URL https: //arxiv.org/abs/2402.11686

work page arXiv 2025
[17]

Real-time goal recognition using approximations in Euclidean space

Douglas Tesch, Leonardo Amado, and Felipe Meneguzzi. Real-time goal recognition using approximations in Euclidean space. In27th European Conference on Artificial Intelligence (ECAI), 2024. URL http://www.meneguzzi.eu/felipe/pubs/ecai-vector-gr-2024. pdf

work page 2024
[18]

Large language models still can’t plan (a benchmark for LLMs on planning and reasoning about change)

Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Large language models still can’t plan (a benchmark for LLMs on planning and reasoning about change). InNeurIPS 2022 Workshop on Foundation Models for Decision Making, 2022. URL https://arxiv.org/abs/2206.10498

work page arXiv 2022
[19]

Planbench: an extensible benchmark for evaluating large language models on planning and reasoning about change

Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Planbench: an extensible benchmark for evaluating large language models on planning and reasoning about change. InProceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY , USA, 2023. Curran Associates Inc

work page 2023
[20]

On the planning abilities of large language models – a critical investigation

Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models – a critical investigation. InAdvances in Neural Information Processing Systems, volume 36, 2023. URL https://arxiv.org/abs/2302. 06706

work page 2023
[21]

A system- atic evaluation of the planning and scheduling abilities of the reasoning model o1.Transactions on Machine Learning Research, 2025

Karthik Valmeekam, Kaya Stechly, Atharva Gundawar, and Subbarao Kambhampati. A system- atic evaluation of the planning and scheduling abilities of the reasoning model o1.Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/ forum?id=FkKBxp0FhR

work page 2025
[22]

## Output Format

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InProceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY , USA, 2022. Curran Associates Inc. ISBN 9781713871...

work page arXiv 2022
[23]

The initial unstacking of block B from G and subsequent stacking onto J suggests an effort to create a sub-structure involving (ON A J), reflective of moving towards a goal involving stacks initiated by A/J

work page
[24]

Actions such as stacking K on F and G suggest an intent to form a sequence that can end with (ON F E), (ON E G), corresponding directly with hypothesis patterns where F is involved in the base transitions of the structure

work page
[25]

Moreover, the repeated adjustments involving D and J imply significant reordering involving these blocks, in line with the corresponding hypotheses

work page
[26]

Finally, observed tasks like stacking A on J directly align with a sequence that follows the thematic transition in two of the hypothesized stack orders, indicating a prioritized sequence in these constructs. The two hypotheses that arise as most likely are those with a focus on initial D/J/A to H projects transitioning into wider stack compositions, repr...

work page
[27]

The UNSTACK T W and STACK T M operations suggest T is involved in the final configuration

work page
[28]

Subsequently, UNSTACK R A, PUT-DOWN R, PICK-UP A, and STACK A W indicate that R must be on the table while A is on W

work page
[29]

Given the executed actions, the only hypothesis producing such a state is (CLEAR M),(ONTABLE W),(ON M O),(ON O W)

Finally, STACK T H places T on H, making H a supporting block, implying H must support other blocks. Given the executed actions, the only hypothesis producing such a state is (CLEAR M),(ONTABLE W),(ON M O),(ON O W). This configuration would have M and O stacked together, with M clear and W clear, and the rest of the blocks not violating any constraints fr...

work page
[30]

UNSTACK A C - applicable, agent holds A, C becomes clear

work page
[31]

Still, the two key structural intentions revealed are: - the agent removes A from C, freeing A from below; - the agent then manipulates P and finally

PICK-UP P - not applicable immediately after action 1 in strict Blocks World semantics, because after unstacking A the hand is not empty. Still, the two key structural intentions revealed are: - the agent removes A from C, freeing A from below; - the agent then manipulates P and finally

work page
[32]

Wait ‘(ON S T)‘ is in Hypothesis 1

STACK P A - which places P on A. The final observed relation ON P A is highly diagnostic. Among all hypotheses, only hypothesis 20 contains (ON P A). It also requires (CLEAR P) and (ONTABLE W) and (ON A W). After stacking P on A, P is indeed clear, matching that goal component. This makes hypothesis 20 uniquely aligned with the observation. Figure 14: GPT...

work page
[33]

Guidelines: 24 • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page

[1] [1]

The computational complexity of propositional strips planning.Artificial Intelligence, 69(1-2):165–204, 1994

Tom Bylander. The computational complexity of propositional strips planning.Artificial Intelligence, 69(1-2):165–204, 1994

work page 1994

[2] [2]

Goal recognition as a deep learning task: The grnet approach

Mattia Chiari, Alfonso Emilio Gerevini, Francesco Percassi, Luca Putelli, Ivan Serina, and Matteo Olivato. Goal recognition as a deep learning task: The grnet approach. In Sven Koenig, Roni Stern, and Mauro Vallati, editors,Proceedings of the Thirty-Third International Conference on Automated Planning and Scheduling, pages 560–568. AAAI Press, 2023. doi: ...

work page doi:10.1609/icaps.v33i1.27237 2023

[3] [3]

Frontier Large Language Models Rival State-of-the-Art Planners

Augusto B. Corrêa, André Grahl Pereira, and Jendrik Seipp. The 2025 planning performance of frontier large language models.CoRR, abs/2511.09378, 2025. doi: 10.48550/ARXIV .2511. 09378

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2025

[4] [4]

Ordered landmarks in planning.Journal of Artificial Intelligence Research, 22(1):215–278, 2004

J Hoffmann, J Porteous, and L Sebastia. Ordered landmarks in planning.Journal of Artificial Intelligence Research, 22(1):215–278, 2004. doi: 10.1613/jair.1492

work page doi:10.1613/jair.1492 2004

[5] [5]

LLMs can’t plan, but can help planning in LLM- modulo frameworks

Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Kaya Stechly, Mudit Verma, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. LLMs can’t plan, but can help planning in LLM- modulo frameworks. InProceedings of the Forty-First International Conference on Machine Learning, 2024. URLhttps://arxiv.org/abs/2402.01817

work page arXiv 2024

[6] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024. doi: 10.1162/tacl_a_00638. URL https://aclanthology.org/2024.tacl-1.9/

[7] Felipe Meneguzzi and Ramon Fraga Pereira. A survey on goal recognition as planning. In Zhi-Hua Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pages 4524–4532, 2021. doi: 10.24963/ijcai.2021/616

[8] Reuth Mirsky, Sarah Keren, and Christopher Geib. Introduction to Symbolic Plan and Goal Recognition, volume 16. Morgan & Claypool Publishers, 2021.

[9] Bernhard Nebel. On the compilability and expressive power of propositional planning formalisms. Journal of Artificial Intelligence Research, 12:271–315, 2000.

[10] Ramon Fraga Pereira and Felipe Meneguzzi. Goal and plan recognition datasets using classical planning domains, 2017.

[11] Ramon Fraga Pereira, Nir Oren, and Felipe Meneguzzi. Landmark-based approaches for goal recognition as planning. Artificial Intelligence, 279:103217, 2020. doi: 10.1016/j.artint.2019.103217. URL https://linkinghub.elsevier.com/retrieve/pii/S0004370219300013

[12] Ramon Fraga Pereira, Francesco Fuggitti, Felipe Meneguzzi, and Giuseppe De Giacomo. Temporally extended goal recognition in fully observable non-deterministic domain models: Temporally extended goal recognition in FOND planning. Applied Intelligence, December 2023. ISSN 1573-7497. doi: 10.1007/s10489-023-05087-1

[13] Miquel Ramírez and Hector Geffner. Plan recognition as planning. In Craig Boutilier, editor, Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1778–1783, 2009. URL http://ijcai.org/Proceedings/09/Papers/296.pdf

[14] Miquel Ramírez and Hector Geffner. Probabilistic plan recognition using off-the-shelf classical planners. In AAAI Conference on Artificial Intelligence, pages 1121–1126, 2010. URL http://dblp.org/rec/conf/aaai/RamirezG10

[15] Lorenzo Serina, Mattia Chiari, Alfonso Emilio Gerevini, Luca Putelli, and Ivan Serina. Towards efficient online goal recognition through deep learning. In Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, pages 1895–1903. International Foundation for Autonomous Agents and Multiagent Systems, 2025. ISBN 9798400714...

[16] Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. On the self-verification limitations of large language models on reasoning and planning tasks. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025. URL https://arxiv.org/abs/2402.11686

[17] Douglas Tesch, Leonardo Amado, and Felipe Meneguzzi. Real-time goal recognition using approximations in Euclidean space. In 27th European Conference on Artificial Intelligence (ECAI), 2024. URL http://www.meneguzzi.eu/felipe/pubs/ecai-vector-gr-2024.pdf

[18] Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Large language models still can’t plan (a benchmark for LLMs on planning and reasoning about change). In NeurIPS 2022 Workshop on Foundation Models for Decision Making, 2022. URL https://arxiv.org/abs/2206.10498

[19] Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. PlanBench: an extensible benchmark for evaluating large language models on planning and reasoning about change. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc.

[20] Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models – a critical investigation. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://arxiv.org/abs/2302.06706

[21] Karthik Valmeekam, Kaya Stechly, Atharva Gundawar, and Subbarao Kambhampati. A systematic evaluation of the planning and scheduling abilities of the reasoning model o1. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=FkKBxp0FhR

[22] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871...

The initial unstacking of block B from G and subsequent stacking onto J suggests an effort to create a sub-structure involving (ON A J), reflective of moving towards a goal involving stacks initiated by A/J. Actions such as stacking K on F and G suggest an intent to form a sequence that can end with (ON F E), (ON E G), corresponding directly with hypothesis patterns where F is involved in the base transitions of the structure. Moreover, the repeated adjustments involving D and J imply significant reordering involving these blocks, in line with the corresponding hypotheses. Finally, observed tasks like stacking A on J directly align with a sequence that follows the thematic transition in two of the hypothesized stack orders, indicating a prioritized sequence in these constructs. The two hypotheses that arise as most likely are those with a focus on initial D/J/A to H projects transitioning into wider stack compositions, repr...

The UNSTACK T W and STACK T M operations suggest T is involved in the final configuration. Subsequently, UNSTACK R A, PUT-DOWN R, PICK-UP A, and STACK A W indicate that R must be on the table while A is on W. Finally, STACK T H places T on H, making H a supporting block, implying H must support other blocks. Given the executed actions, the only hypothesis producing such a state is (CLEAR M),(ONTABLE W),(ON M O),(ON O W). This configuration would have M and O stacked together, with M clear and W clear, and the rest of the blocks not violating any constraints fr...

UNSTACK A C - applicable, agent holds A, C becomes clear

PICK-UP P - not applicable immediately after action 1 in strict Blocks World semantics, because after unstacking A the hand is not empty. Still, the two key structural intentions revealed are: - the agent removes A from C, freeing A from below; - the agent then manipulates P and finally

STACK P A - which places P on A. The final observed relation ON P A is highly diagnostic. Among all hypotheses, only hypothesis 20 contains (ON P A). It also requires (CLEAR P) and (ONTABLE W) and (ON A W). After stacking P on A, P is indeed clear, matching that goal component. This makes hypothesis 20 uniquely aligned with the observation. Figure 14: GPT...

Wait ‘(ON S T)‘ is in Hypothesis 1

Guidelines:
• The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects.