arxiv: 2605.15333 · v1 · pith:SQGZQU75new · submitted 2026-05-14 · 💻 cs.AI

Zero-Shot Goal Recognition with Large Language Models

Pith reviewed 2026-05-19 16:04 UTC · model grok-4.3

classification 💻 cs.AI
keywords goal recognition · large language models · zero-shot evaluation · planning domains · PDDL · evidence integration · abductive inference

The pith

Large language models show uneven competence at recognizing goals from action observations, with some improving as evidence grows and others staying fixed to prior knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates frontier large language models as zero-shot goal recognizers on standard classical planning benchmarks expressed in PDDL. Goal recognition asks a system to infer the intended goal from a sequence of observed actions, which aligns with the abductive strengths of LLMs that draw on world knowledge rather than generate new plans. The central finding is that model performance diverges sharply: certain models increase their accuracy in line with the amount of evidence supplied and approach the level of established landmark-based methods once all observations are available, while other models show little change and remain anchored to their pre-trained priors. Qualitative review of the models' reasoning traces indicates that this split arises from genuine differences in how evidence is incorporated rather than from domain familiarity or prompting details. The work therefore treats goal recognition itself as a diagnostic benchmark for the planning-related knowledge that LLMs actually possess.

Core claim

Frontier LLMs can be applied zero-shot to goal recognition on classical PDDL domains; some models scale their accuracy with accumulating observations and approach landmark-based performance at full evidence, whereas others remain anchored to world-knowledge priors regardless of evidence volume, and this split is visible in the structure of their reasoning traces.

What carries the argument

Zero-shot goal recognition evaluation on PDDL benchmarks that measures how LLMs integrate increasing numbers of observations versus reliance on pre-trained priors.
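
As a rough illustration of what this protocol involves, the sketch below computes accuracy at each observability level for an arbitrary recognizer. It is a minimal sketch under stated assumptions, not the authors' harness: `Recognizer`, `subsample`, and `accuracy_by_observability` are hypothetical names, the problem tuples are an assumed format, plans are assumed non-empty, and random order-preserving subsampling is an assumption about how partial observations are produced. The default levels mirror those named in the figure captions (10%, 30%, 50%, 70%, full).

```python
import random
from typing import Callable, Dict, List, Sequence, Set, Tuple

# Hypothetical interface: (observed actions, candidate goals) -> set of chosen goals.
# In the paper's setting this would wrap a zero-shot LLM call; here it is any callable.
Recognizer = Callable[[Sequence[str], Sequence[str]], Set[str]]
# Assumed problem format: (full plan, candidate goals, true goal).
Problem = Tuple[Sequence[str], Sequence[str], str]


def subsample(plan: Sequence[str], fraction: float, rng: random.Random) -> List[str]:
    """Keep about `fraction` of the plan's actions (at least one), preserving order."""
    k = max(1, round(len(plan) * fraction))
    keep = sorted(rng.sample(range(len(plan)), k))
    return [plan[i] for i in keep]


def accuracy_by_observability(
    problems: Sequence[Problem],
    recognizer: Recognizer,
    fractions: Sequence[float] = (0.1, 0.3, 0.5, 0.7, 1.0),
    seed: int = 0,
) -> Dict[float, float]:
    """Share of problems whose true goal is among the returned goals, per level."""
    rng = random.Random(seed)
    results: Dict[float, float] = {}
    for fraction in fractions:
        hits = 0
        for plan, goals, true_goal in problems:
            observations = subsample(plan, fraction, rng)
            if true_goal in recognizer(observations, goals):
                hits += 1
        results[fraction] = hits / len(problems)
    return results
```

Under this framing, a recognizer whose accuracy climbs across the returned levels would correspond to the "scales with evidence" group, and one whose accuracy stays flat to the "anchored to priors" group.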

If this is right

  • Models that scale with evidence reach accuracy comparable to landmark-based recognizers when given complete observation sequences.
  • Models that do not scale remain limited by their initial world-knowledge priors even when more actions are observed.
  • Goal recognition serves as a diagnostic task for assessing the planning knowledge already present inside LLMs.
  • The task complements generative planning benchmarks because it tests consistency checking rather than novel sequence generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid systems could route goal-recognition subtasks to the LLMs that demonstrably improve with evidence while reserving generative planning for symbolic solvers.
  • The same evaluation protocol could be extended to non-PDDL domains to test whether the evidence-integration split generalizes beyond classical planning.
  • Prompt-engineering techniques that explicitly require step-by-step evidence weighing might reduce the performance gap between the two classes of models.

Load-bearing premise

The differences in performance trace to a basic distinction in how models integrate evidence rather than to artifacts of prompting, model size, or prior exposure to the domains.

What would settle it

Re-running the evaluation with every model given identical, minimal prompts and domain descriptions, then checking whether accuracy still diverges with evidence volume and whether reasoning traces continue to show distinct integration patterns.
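
One way to make "still diverges with evidence volume" checkable is to summarise each model's accuracy curve by a single slope. The sketch below is an illustration under assumptions, not a procedure from the paper: `evidence_slope` and `label_models` are hypothetical helpers, and the `min_slope` cut-off is an arbitrary placeholder that would need to be chosen and justified per study.

```python
from typing import Dict, Sequence


def evidence_slope(fractions: Sequence[float], accuracies: Sequence[float]) -> float:
    """Ordinary least-squares slope of accuracy against observability fraction."""
    if len(fractions) != len(accuracies) or len(fractions) < 2:
        raise ValueError("need two or more paired points")
    n = len(fractions)
    mean_x = sum(fractions) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in fractions)
    if sxx == 0:
        raise ValueError("observability fractions must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(fractions, accuracies))
    return sxy / sxx


def label_models(
    curves: Dict[str, Sequence[float]],
    fractions: Sequence[float],
    min_slope: float = 0.2,  # arbitrary illustrative threshold, not from the paper
) -> Dict[str, str]:
    """Label each accuracy curve as 'scales with evidence' or 'anchored to priors'."""
    labels: Dict[str, str] = {}
    for model, accuracies in curves.items():
        slope = evidence_slope(fractions, accuracies)
        labels[model] = (
            "scales with evidence" if slope >= min_slope else "anchored to priors"
        )
    return labels
```

A slope alone cannot distinguish evidence integration from prompting artifacts; it only gives the re-run a quantity to compare before and after the prompts are equalised.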

Figures

Figures reproduced from arXiv: 2605.15333 by Felipe Meneguzzi, Kin Max Piamolini Gusmão, Nathan Gavenski, Nir Oren.

Figure 1: Average acc, prec, and spread for different observation percentages for ALL DOMAINS.
Figure 2: Prompt template used for zero-shot goal recognition. Placeholders …
Figure 3: GPT-4o’s reasoning for a problem with 10% observability in the …
Figure 4: GPT-4o’s reasoning for a problem with 10% observability in the …
Figure 5: GPT-4o’s reasoning for a problem with 10% observability in the …
Figure 6: GPT-4o’s reasoning for a problem with 10% observability in the …
Figure 7: GPT-4o’s reasoning for a problem with 30% observability in the …
Figure 8: GPT-4o’s reasoning for a problem with 70% observability in the …
Figure 9: GPT-4o’s reasoning for a problem with 70% observability in the …
Figure 10: GPT-4o’s reasoning for a problem with 30% observability in the …
Figure 11: GPT-OSS’s reasoning for a problem with 10% observability in the …
Figure 12: GPT-OSS’s reasoning for a problem with 10% observability in the …
Figure 13: GPT-OSS’s reasoning for a problem with 30% observability in the …
Figure 14: GPT-5.4’s reasoning for a problem with 30% observability in the …
Figure 15: GPT-5.4’s reasoning for a problem with full observability in the …
Figure 16: GPT-5.4’s reasoning for a problem with full observability in the …
Figure 17: GPT-5.4’s reasoning for a problem with 30% observability in the Blocks World domain. The model fails to identify the correct goal, overcommitting to a single hypothesis with lacking information.
Figure 18: Qwen 3.5’s reasoning for a problem with 10% observability in the Blocks World domain. The model identifies the precondition violation but refuses to continue from there.
Figure 19: Qwen 3.5’s reasoning for a problem with 10% observability in the Blocks World domain. The model generates unsound reasoning, generates output outside the required format, and selects a hypothesis that was not even among the provided goal hypotheses, showcasing a complete failure in understanding the task and the provided information.
Figure 20: Qwen 3.5’s reasoning for a problem with 30% observability in the Blocks World domain. The model fails completely and generates a nonsensical response.
Figure 21: Qwen 3.5’s reasoning for a problem with 30% observability in the Blocks World domain. The model fails completely and generates a nonsensical response.
Figure 22: Qwen 3.5’s reasoning for a problem with 50% observability in the Blocks World domain. The model correctly identifies the most likely goal hypothesis based on the observed actions and state changes.
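
The Blocks World captions above refer to precondition violations in observed action sequences. As a minimal sketch of what such a check involves, assuming the standard four-operator Blocks World (pick-up, put-down, stack, unstack) and a state encoded as a set of fact tuples, the code below applies actions one at a time and reports the first one whose preconditions do not hold. The names `apply` and `first_violation` and the tuple encoding are hypothetical; none of this is taken from the paper's code.

```python
from typing import FrozenSet, Optional, Sequence, Tuple

Fact = Tuple[str, ...]
State = FrozenSet[Fact]
Action = Tuple[str, ...]


def apply(state: State, action: Action) -> Optional[State]:
    """Return the successor state, or None if a precondition does not hold."""
    name, args = action[0], action[1:]
    if name == "pick-up":
        (x,) = args
        pre = {("clear", x), ("ontable", x), ("handempty",)}
        add = {("holding", x)}
    elif name == "put-down":
        (x,) = args
        pre = {("holding", x)}
        add = {("ontable", x), ("clear", x), ("handempty",)}
    elif name == "stack":
        x, y = args
        pre = {("holding", x), ("clear", y)}
        add = {("on", x, y), ("clear", x), ("handempty",)}
    elif name == "unstack":
        x, y = args
        pre = {("on", x, y), ("clear", x), ("handempty",)}
        add = {("holding", x), ("clear", y)}
    else:
        raise ValueError(f"unknown action: {name}")
    if not pre <= state:
        return None
    # In this four-operator encoding every precondition is also deleted.
    return frozenset((state - pre) | add)


def first_violation(state: State, plan: Sequence[Action]) -> Optional[int]:
    """Index of the first inapplicable action, or None if every action applies."""
    for index, action in enumerate(plan):
        successor = apply(state, action)
        if successor is None:
            return index
        state = successor
    return None
```

Under partial observability, consecutive observations need not be consecutive plan steps, so an apparent violation between two observed actions may simply indicate an unobserved action in between rather than an invalid sequence.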
Original abstract

Large language models have recently reached near-parity with classical planners on well-known planning domains, yet this competence relies on world-knowledge exploitation rather than genuine symbolic reasoning. Goal recognition is a complementary abductive task structurally better suited to LLM strengths: it consists of evaluating consistency with world knowledge rather than generating novel action sequences. This paper provides the first systematic zero-shot evaluation of frontier LLMs as goal recognisers on key classical PDDL benchmarks. Our results show that LLM competence on goal recognition is uneven: some models scale with evidence and approach landmark-based accuracy at full observations, while others remain anchored to world-knowledge priors regardless of how much evidence accumulates. Qualitative analysis of model reasoning traces reveals that this divergence reflects a fundamental difference in evidence integration rather than domain familiarity. These findings position goal recognition as a principled benchmark for the foundational planning knowledge of LLMs.
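
For readers unfamiliar with the landmark-based methods the abstract compares against: broadly, they score each candidate goal by how many of its landmarks (facts that every plan for that goal must make true at some point) the observations have already achieved. The sketch below is a deliberately simplified illustration of that idea, not the baseline used in the paper. Landmark extraction is assumed to have been done elsewhere, and `landmark_completion`, `recognize`, and the `threshold` parameter are hypothetical names.

```python
from typing import Dict, Set


def landmark_completion(
    achieved: Set[str], landmarks_by_goal: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Fraction of each goal's landmarks that the observations have achieved."""
    scores: Dict[str, float] = {}
    for goal, landmarks in landmarks_by_goal.items():
        if landmarks:
            scores[goal] = len(landmarks & achieved) / len(landmarks)
        else:
            scores[goal] = 0.0  # no landmarks known: treat as no evidence
    return scores


def recognize(
    achieved: Set[str],
    landmarks_by_goal: Dict[str, Set[str]],
    threshold: float = 0.0,  # illustrative tolerance around the best score
) -> Set[str]:
    """Return every goal whose score is within `threshold` of the best score."""
    scores = landmark_completion(achieved, landmarks_by_goal)
    if not scores:
        return set()
    best = max(scores.values())
    return {goal for goal, score in scores.items() if score >= best - threshold}
```

Because the function returns a set, a looser `threshold` trades precision for a better chance of including the true goal.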

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. This paper conducts the first systematic zero-shot evaluation of frontier large language models on goal recognition tasks using classical PDDL benchmarks. The key results indicate uneven competence among models: some LLMs scale their performance with accumulating evidence and approach landmark-based accuracy with full observations, whereas others stay reliant on world-knowledge priors irrespective of evidence amount. Qualitative review of the models' reasoning traces is used to argue that this stems from differences in evidence integration abilities rather than domain familiarity.

Significance. If substantiated, these findings establish goal recognition as an effective benchmark for probing the planning knowledge embedded in LLMs, particularly their capacity for abductive consistency checking over generative planning. This could guide future work in hybrid planning systems that leverage LLMs where their evidence integration strengths are evident, and the use of reproducible benchmarks and qualitative trace analysis adds transparency to the evaluation.

major comments (1)
  1. [Qualitative Analysis of Reasoning Traces] The central claim that performance divergence reflects a fundamental difference in evidence integration (rather than prompting artifacts, model scale, or domain familiarity) rests on the qualitative analysis of reasoning traces. The manuscript does not include explicit controls such as prompt ablations varying only observation density while holding phrasing fixed, or size-matched model comparisons. This is load-bearing for the main result, as alternative explanations remain possible without such tests.
minor comments (1)
  1. [Abstract] The abstract would benefit from including at least one specific quantitative metric (e.g., accuracy value or model name) to convey the evaluation scale more concretely.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential of goal recognition as a benchmark for LLM planning knowledge. We address the major comment below and describe the changes we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Qualitative Analysis of Reasoning Traces] The central claim that performance divergence reflects a fundamental difference in evidence integration (rather than prompting artifacts, model scale, or domain familiarity) rests on the qualitative analysis of reasoning traces. The manuscript does not include explicit controls such as prompt ablations varying only observation density while holding phrasing fixed, or size-matched model comparisons. This is load-bearing for the main result, as alternative explanations remain possible without such tests.

    Authors: We agree that the qualitative analysis of reasoning traces is central to our interpretation and that stronger controls would reduce the plausibility of alternative explanations. The current traces were selected to illustrate consistent patterns of evidence use versus prior anchoring across observation levels, but we acknowledge the absence of systematic ablations. In the revised manuscript we will add (i) prompt ablations that vary only the number of observations while holding all other phrasing fixed and (ii) explicit comparisons against size-matched model pairs where such pairs exist in our evaluation set. These additions will be presented in a new subsection and will directly test whether the observed divergence persists under controlled conditions. revision: yes
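
The first promised control, varying only the number of observations while holding all other phrasing fixed, might be set up along the following lines. This is a sketch under assumptions: the template wording is a placeholder rather than the paper's prompt (Figure 2), `nested_observations` and `build_ablation_prompts` are hypothetical helpers, and making the observation sets nested, so that each level only adds actions to the previous one, is a design choice of this example.

```python
import random
from typing import Dict, List, Sequence

# Placeholder wording; the paper's actual template is not reproduced here.
TEMPLATE = (
    "Domain:\n{domain}\n\n"
    "Candidate goals:\n{goals}\n\n"
    "Observed actions:\n{observations}\n\n"
    "Which candidate goal is the agent most likely pursuing?"
)


def nested_observations(
    plan: Sequence[str], fractions: Sequence[float], seed: int = 0
) -> Dict[float, List[str]]:
    """Pick observations so that each level is a superset of every smaller level."""
    rng = random.Random(seed)
    order = list(range(len(plan)))
    rng.shuffle(order)  # one fixed reveal order shared by all levels
    observations: Dict[float, List[str]] = {}
    for fraction in fractions:
        k = max(1, round(len(plan) * fraction))
        observations[fraction] = [plan[i] for i in sorted(order[:k])]
    return observations


def build_ablation_prompts(
    domain: str,
    goals: Sequence[str],
    plan: Sequence[str],
    fractions: Sequence[float] = (0.1, 0.3, 0.5, 0.7, 1.0),
    seed: int = 0,
) -> Dict[float, str]:
    """Render one prompt per level; only the observation block differs."""
    observations = nested_observations(plan, fractions, seed)
    return {
        fraction: TEMPLATE.format(
            domain=domain,
            goals="\n".join(goals),
            observations="\n".join(observations[fraction]),
        )
        for fraction in fractions
    }
```

Nesting the observation sets means a change in a model's answer between two levels can be attributed to the added actions rather than to a different random sample.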

Circularity Check

0 steps flagged

Empirical results benchmarked against external landmark accuracy; no reduction to self-defined terms or fitted predictions

full rationale

The paper conducts a zero-shot empirical evaluation of LLMs on standard PDDL goal recognition benchmarks and directly compares observed accuracies to an independent landmark-based method. The central claims about scaling with evidence versus anchoring to priors are derived from these external comparisons and observed performance trends rather than from any self-referential definition, parameter fitting renamed as prediction, or load-bearing self-citation chain. Qualitative analysis of reasoning traces is presented as interpretive support but does not circularly redefine the measured divergence. Minor background self-citations may exist for context but are not required to sustain the reported findings, leaving the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study relies on standard assumptions from AI planning and LLM evaluation literature with no new free parameters or invented entities.

axioms (1)
  • domain assumption: LLMs possess relevant world knowledge from pretraining that can be used for consistency evaluation in goal recognition.
    The abstract contrasts this exploitation of world knowledge with genuine symbolic reasoning.

pith-pipeline@v0.9.0 · 5678 in / 1055 out tokens · 59375 ms · 2026-05-19T16:04:41.062824+00:00 · methodology

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

  1. [1]

    The computational complexity of propositional STRIPS planning

    Tom Bylander. The computational complexity of propositional STRIPS planning. Artificial Intelligence, 69(1-2):165–204, 1994

  2. [2]

    Goal recognition as a deep learning task: The grnet approach

    Mattia Chiari, Alfonso Emilio Gerevini, Francesco Percassi, Luca Putelli, Ivan Serina, and Matteo Olivato. Goal recognition as a deep learning task: The grnet approach. In Sven Koenig, Roni Stern, and Mauro Vallati, editors, Proceedings of the Thirty-Third International Conference on Automated Planning and Scheduling, pages 560–568. AAAI Press, 2023. doi: ...

  3. [3]

    Frontier Large Language Models Rival State-of-the-Art Planners

    Augusto B. Corrêa, André Grahl Pereira, and Jendrik Seipp. The 2025 planning performance of frontier large language models. CoRR, abs/2511.09378, 2025. doi: 10.48550/ARXIV.2511.09378

  4. [4]

    Ordered landmarks in planning

    J. Hoffmann, J. Porteous, and L. Sebastia. Ordered landmarks in planning. Journal of Artificial Intelligence Research, 22(1):215–278, 2004. doi: 10.1613/jair.1492

  5. [5]

    LLMs can’t plan, but can help planning in LLM-modulo frameworks

    Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Kaya Stechly, Mudit Verma, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. LLMs can’t plan, but can help planning in LLM-modulo frameworks. In Proceedings of the Forty-First International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2402.01817

  6. [6]

    Lost in the middle: How language models use long contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024. doi: 10.1162/tacl_a_00638. URL https://aclanthology.org/2024.tacl-1.9/

  7. [7]

    A survey on goal recognition as planning

    Felipe Meneguzzi and Ramon Fraga Pereira. A survey on goal recognition as planning. In Zhi-Hua Zhou, editor, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pages 4524–4532, 2021. doi: 10.24963/ijcai.2021/616

  8. [8]

    Introduction to Symbolic Plan and Goal Recognition

    Reuth Mirsky, Sarah Keren, and Christopher Geib. Introduction to Symbolic Plan and Goal Recognition, volume 16. Morgan & Claypool Publishers, 2021

  9. [9]

    On the compilability and expressive power of propositional planning formalisms

    Bernhard Nebel. On the compilability and expressive power of propositional planning formalisms. Journal of Artificial Intelligence Research, 12:271–315, 2000

  10. [10]

    Goal and plan recognition datasets using classical planning domains

    Ramon Fraga Pereira and Felipe Meneguzzi. Goal and plan recognition datasets using classical planning domains, 2017

  11. [11]

    Landmark-based approaches for goal recognition as planning

    Ramon Fraga Pereira, Nir Oren, and Felipe Meneguzzi. Landmark-based approaches for goal recognition as planning. Artificial Intelligence, 279:103217, 2020. doi: 10.1016/j.artint.2019.103217. URL https://linkinghub.elsevier.com/retrieve/pii/S0004370219300013

  12. [12]

    Temporally extended goal recognition in fully observable non-deterministic domain models: Temporally extended goal recognition in FOND planning

    Ramon Fraga Pereira, Francesco Fuggitti, Felipe Meneguzzi, and Giuseppe De Giacomo. Temporally extended goal recognition in fully observable non-deterministic domain models: Temporally extended goal recognition in FOND planning. Applied Intelligence, December 2023. ISSN 1573-7497. doi: 10.1007/s10489-023-05087-1

  13. [13]

    Plan recognition as planning

    Miquel Ramírez and Hector Geffner. Plan recognition as planning. In Craig Boutilier, editor, Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1778–1783, 2009. URL http://ijcai.org/Proceedings/09/Papers/296.pdf

  14. [14]

    Probabilistic plan recognition using off-the-shelf classical planners

    Miquel Ramírez and Hector Geffner. Probabilistic plan recognition using off-the-shelf classical planners. In AAAI Conference on Artificial Intelligence, pages 1121–1126, 2010. URL http://dblp.org/rec/conf/aaai/RamirezG10

  15. [15]

    Towards efficient online goal recognition through deep learning

    Lorenzo Serina, Mattia Chiari, Alfonso Emilio Gerevini, Luca Putelli, and Ivan Serina. Towards efficient online goal recognition through deep learning. In Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, pages 1895–1903. International Foundation for Autonomous Agents and Multiagent Systems, 2025. ISBN 9798400714...

  16. [16]

    On the self-verification limitations of large language models on reasoning and planning tasks

    Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. On the self-verification limitations of large language models on reasoning and planning tasks. In Proceedings of the Thirteenth International Conference on Learning Representations, 2025. URL https://arxiv.org/abs/2402.11686

  17. [17]

    Real-time goal recognition using approximations in Euclidean space

    Douglas Tesch, Leonardo Amado, and Felipe Meneguzzi. Real-time goal recognition using approximations in Euclidean space. In 27th European Conference on Artificial Intelligence (ECAI), 2024. URL http://www.meneguzzi.eu/felipe/pubs/ecai-vector-gr-2024.pdf

  18. [18]

    Large language models still can’t plan (a benchmark for LLMs on planning and reasoning about change)

    Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Large language models still can’t plan (a benchmark for LLMs on planning and reasoning about change). In NeurIPS 2022 Workshop on Foundation Models for Decision Making, 2022. URL https://arxiv.org/abs/2206.10498

  19. [19]

    Planbench: an extensible benchmark for evaluating large language models on planning and reasoning about change

    Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Planbench: an extensible benchmark for evaluating large language models on planning and reasoning about change. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc

  20. [20]

    On the planning abilities of large language models – a critical investigation

    Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models – a critical investigation. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://arxiv.org/abs/2302.06706

  21. [21]

    A systematic evaluation of the planning and scheduling abilities of the reasoning model o1

    Karthik Valmeekam, Kaya Stechly, Atharva Gundawar, and Subbarao Kambhampati. A systematic evaluation of the planning and scheduling abilities of the reasoning model o1. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=FkKBxp0FhR

  22. [22]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc. ISBN 9781713871...
