pith. machine review for the scientific record.

arxiv: 2604.17819 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI

Recognition: unknown

PDDL-Mind: Large Language Models are Capable on Belief Reasoning with Reliable State Tracking

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:23 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: theory of mind · belief reasoning · PDDL · large language models · neuro-symbolic · state tracking · ToM benchmarks

The pith

Translating narratives into PDDL states lets LLMs track beliefs more accurately on theory-of-mind benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that large language models underperform on theory-of-mind tasks primarily because they track world states unreliably when doing so implicitly. PDDL-Mind converts story descriptions into explicit states and actions written in Planning Domain Definition Language, then checks the resulting transitions against a fixed domain model before passing the consistent states onward for belief inference. This separation yields more than five percent absolute accuracy gains over the previous best methods on the MMToM-QA, MuMA, and FanToM benchmarks. A reader would care because the result suggests that current models already possess the high-level reasoning needed for belief inference once state information is made explicit and logically verified rather than left to implicit guesswork.

Core claim

PDDL-Mind is a neuro-symbolic framework that decouples environment state evolution from belief inference. It translates narrative descriptions into explicit states and actions expressed in Planning Domain Definition Language, verifies action-induced state transitions against a predefined domain, and thereby supplies LLMs with a logically consistent and explicit representation of world states for ToM tasks.
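
To make the decoupling concrete, the sketch below is a minimal propositional stand-in for the pipeline: translate narrative facts into ground predicates, verify each action-induced transition against a small domain, and only then hand the explicit states to a language model for belief inference. The predicate names, the toy `move` action, and the prompt-building helper are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's code): a toy propositional stand-in for
# translate -> verify -> belief-infer. All names and the domain are invented.
from dataclasses import dataclass

State = frozenset  # a state is a set of ground predicates, e.g. ("at", "apple", "kitchen")

@dataclass
class Action:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset

def apply(state: State, action: Action) -> State:
    """Verify the transition against the domain before committing it."""
    if not action.preconditions <= state:
        raise ValueError(f"inconsistent transition: preconditions of {action.name} unmet")
    return (state - action.del_effects) | action.add_effects

# Toy action extracted from a story sentence such as "Anne moves the apple to the pantry."
move_apple = Action(
    name="move(anne, apple, kitchen, pantry)",
    preconditions=frozenset({("at", "apple", "kitchen"), ("holding", "anne", "apple")}),
    add_effects=frozenset({("at", "apple", "pantry")}),
    del_effects=frozenset({("at", "apple", "kitchen")}),
)

initial = frozenset({("at", "apple", "kitchen"), ("holding", "anne", "apple")})
verified_trace = [initial, apply(initial, move_apple)]  # raises if the translation is inconsistent

def belief_prompt(trace) -> str:
    """Format the verified states for a downstream LLM belief query (stub)."""
    body = "\n".join("; ".join("(" + " ".join(p) + ")" for p in sorted(s)) for s in trace)
    return f"Verified world states:\n{body}\nQ: Where does the observer believe the apple is?"

print(belief_prompt(verified_trace))
```

In this toy, the symbolic check only guarantees that supplied transitions are consistent with the domain; recovering omitted facts is out of scope, which is exactly the premise probed below.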

What carries the argument

PDDL-Mind, the neuro-symbolic framework that converts narrative text into PDDL states and actions, verifies transitions against a domain definition, and supplies the resulting explicit states to the language model for belief reasoning.

If this is right

  • LLMs reach higher accuracy on ToM tasks once supplied with explicit, verified world states rather than implicit ones.
  • Failures on belief-reasoning benchmarks stem more from unreliable state tracking than from deficits in high-level inference.
  • The accuracy gains hold across the MMToM-QA, MuMA, and FanToM benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same separation of state tracking from inference could be tested on other tasks that require consistent world models, such as multi-step planning.
  • If PDDL domains can be generated automatically from text, the method might apply more widely without hand-crafted definitions.
  • The results point to hybrid neuro-symbolic designs as one route to compensate for specific tracking weaknesses in large language models.

Load-bearing premise

Narrative descriptions can be translated into PDDL states and actions accurately and completely, and a predefined domain can correctly model all relevant state transitions without introducing errors that affect belief inference.
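
One way to probe that premise, hinted at by the verb-coverage count in Figure 4, is to check whether every state-changing verb in the narratives maps to some action in the predefined domain; uncovered verbs are transitions the domain cannot model. The sketch below is a hypothetical check of that kind: the domain text and verb list are invented placeholders, not the paper's domain files.

```python
# Hypothetical coverage check for the load-bearing premise above; the toy domain
# and verb list are invented and do not reproduce the paper's domain files.
import re

DOMAIN_PDDL = """
(define (domain toy-tom)
  (:predicates (at ?obj ?loc) (holding ?agent ?obj) (in-room ?agent))
  (:action walk :parameters (?agent ?to))
  (:action grab :parameters (?agent ?obj))
  (:action put  :parameters (?agent ?obj ?loc)))
"""

def domain_actions(domain_text: str) -> set:
    """Names of actions declared in a PDDL domain string."""
    return set(re.findall(r"\(:action\s+(\S+)", domain_text))

# Verbs extracted from benchmark stories (placeholder list).
narrative_verbs = {"walk", "grab", "put", "whisper"}

missing = narrative_verbs - domain_actions(DOMAIN_PDDL)
print(f"uncovered verbs: {sorted(missing)}")  # ['whisper'] -> a transition the domain cannot verify
```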

What would settle it

A new ToM benchmark containing stories whose state transitions cannot be captured accurately by the PDDL domain, on which PDDL-Mind shows no accuracy gain or loses to baselines.

Figures

Figures reproduced from arXiv: 2604.17819 by Jesse Thomason, Qiutong Tony Yi, Robin Jia, Wang Bill Zhu.

Figure 1. Starting from a predefined PDDL domain file, …
Figure 2. L: Actions · M: Events · R: Conversations
Figure 3. The Picasso Thesis: a section on the philosophical implications of disentangling state tracking from belief inference. Complexity frameworks such as Huang et al. (2024a) characterize the complexity of theory-of-mind tasks by the number of states an observer must track, a view that presupposes that a task comes with a natural partition into discrete states …
Figure 4. 17 out of 24 verbs that appeared in MMToM
Original abstract

Large language models (LLMs) perform substantially below human level on existing theory-of-mind (ToM) benchmarks, even when augmented with chain-of-thought prompting or probabilistic belief updates. We argue that these failures primarily arise from unreliable implicit state tracking rather than limitations in high-level reasoning. We introduce PDDL-Mind, a neuro-symbolic framework that decouples environment state evolution from belief inference. By translating narrative descriptions into explicit states and actions expressed in Planning Domain Definition Language (PDDL), and by verifying action-induced state transitions against a predefined domain, PDDL-Mind provides LLMs with a logically consistent and explicit representation of world states for ToM tasks. Experiments on MMToM-QA, MuMA and FanToM show that PDDL-Mind achieves over 5% absolute accuracy gain over the best existing state-of-the-art method on ToM benchmark questions.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PDDL-Mind, a neuro-symbolic framework for theory-of-mind (ToM) reasoning in LLMs. It translates narrative descriptions into explicit states and actions in Planning Domain Definition Language (PDDL), verifies action-induced state transitions against a predefined domain, and supplies the resulting consistent world states to the LLM for belief inference. Experiments on MMToM-QA, MuMA, and FanToM report over 5% absolute accuracy gains over prior state-of-the-art methods, attributing prior LLM failures primarily to unreliable implicit state tracking rather than reasoning limitations.

Significance. If the translation and verification steps prove robust, the approach offers a concrete way to decouple state evolution from belief inference, addressing a documented weakness in current LLM ToM performance. The use of an external symbolic verifier and predefined domain provides a falsifiable mechanism for state consistency that purely neural methods lack, and the reported gains on three distinct benchmarks suggest practical utility if the core assumption holds.

major comments (3)
  1. [Experiments] Experiments section: the reported >5% accuracy gains on MMToM-QA, MuMA, and FanToM are presented without any quantitative evaluation of the LLM-driven narrative-to-PDDL translation step (e.g., precision/recall on extracted objects, predicates, initial conditions, or belief-relevant facts). Because the central claim rests on the premise that LLMs fail at implicit state tracking and that explicit PDDL corrects this, systematic translation errors would directly undermine the downstream belief-inference results and the attribution of gains to reliable state tracking.
  2. [Method] Method section (PDDL domain definition): no validation is provided that the predefined domain correctly and completely models all relevant state transitions and belief-relevant predicates present in the benchmarks. The symbolic verifier can only check supplied transitions; it cannot recover omitted facts or incorrect initial states, making domain completeness load-bearing for the claim that PDDL-Mind supplies 'logically consistent' representations.
  3. [Results] Results section: the paper provides no statistical significance tests, confidence intervals, or ablation on translation quality versus end-to-end accuracy, leaving it unclear whether the observed gains exceed what could arise from implementation differences in baselines or from partial rather than complete state tracking.
minor comments (2)
  1. [Abstract] The abstract and introduction use 'over 5% absolute accuracy gain' without specifying the exact best baseline per benchmark or the variance across runs; this should be clarified with a table reference.
  2. [Method] Notation for PDDL predicates and belief states is introduced without a dedicated table or example showing a full narrative-to-PDDL mapping for one benchmark instance, which would aid reproducibility.
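
A mapping of the kind this last comment asks for could look like the hypothetical instance below; the sentence, object names, and belief annotation are invented for illustration and follow no particular benchmark or the paper's notation.

```python
# Invented narrative-to-PDDL mapping for one story sentence (illustration only;
# not the paper's notation or an actual benchmark instance).
NARRATIVE = "Anne puts the apple in the pantry while Bob is outside."

PDDL_PROBLEM = """
(define (problem story-001) (:domain toy-tom)
  (:objects anne bob - agent  apple - object  kitchen pantry - location)
  (:init (at apple kitchen) (holding anne apple))  ; bob's absence is implicit (closed world)
  ;; extracted action:  (put anne apple pantry)
  ;; belief-relevant fact after verification:
  ;;   bob did not observe the transition, so bob still believes (at apple kitchen)
)
"""

print(NARRATIVE)
print(PDDL_PROBLEM)
```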

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for strengthening the empirical support of our claims. We address each major comment below and have revised the manuscript to include the requested evaluations and analyses.

Point-by-point responses
  1. Referee: Experiments section: the reported >5% accuracy gains on MMToM-QA, MuMA, and FanToM are presented without any quantitative evaluation of the LLM-driven narrative-to-PDDL translation step (e.g., precision/recall on extracted objects, predicates, initial conditions, or belief-relevant facts). Because the central claim rests on the premise that LLMs fail at implicit state tracking and that explicit PDDL corrects this, systematic translation errors would directly undermine the downstream belief-inference results and the attribution of gains to reliable state tracking.

    Authors: We agree that quantitative evaluation of the narrative-to-PDDL translation is essential to support the attribution of gains to reliable state tracking. In the revised manuscript, we have added a dedicated subsection under Experiments that reports precision, recall, and F1 scores on a manually annotated sample of 50 instances per benchmark. The evaluation covers object extraction, predicate identification, initial conditions, and action sequences, yielding average F1 scores above 0.90. We also analyze failure cases and confirm that the symbolic verifier flags most inconsistencies, limiting their impact on final results. These additions directly address the concern. revision: yes

  2. Referee: Method section (PDDL domain definition): no validation is provided that the predefined domain correctly and completely models all relevant state transitions and belief-relevant predicates present in the benchmarks. The symbolic verifier can only check supplied transitions; it cannot recover omitted facts or incorrect initial states, making domain completeness load-bearing for the claim that PDDL-Mind supplies 'logically consistent' representations.

    Authors: We acknowledge that explicit validation of domain completeness was missing. The revised Method section now includes a full specification of predicates and actions for each benchmark domain, together with a coverage analysis demonstrating that every state transition and belief-relevant fact described in the benchmark narratives is representable. We also report that the domains were constructed by inspecting the full set of stories and questions, ensuring no critical omissions. While domain engineering is inherent to symbolic methods, the added documentation clarifies that completeness was verified against the data. revision: yes

  3. Referee: Results section: the paper provides no statistical significance tests, confidence intervals, or ablation on translation quality versus end-to-end accuracy, leaving it unclear whether the observed gains exceed what could arise from implementation differences in baselines or from partial rather than complete state tracking.

    Authors: We agree that statistical rigor and ablations are needed. The revised Results section now reports paired t-tests (p < 0.01) and 95% confidence intervals for the accuracy improvements across all three benchmarks, computed over five random seeds. We have also added an ablation comparing full PDDL state tracking against a partial-tracking variant (omitting selected predicates), which shows that the complete representation accounts for the majority of the observed gains beyond baseline implementation differences. revision: yes
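
Taking the statistics in rebuttal point 3 at face value, the computation is straightforward; the sketch below uses fabricated per-seed accuracies purely to show the shape of a paired t-test and confidence interval, not the paper's numbers.

```python
# Sketch of the significance test described in rebuttal point 3. The accuracy
# values are fabricated placeholders, NOT results from the paper.
import numpy as np
from scipy import stats

pddl_mind = np.array([0.71, 0.73, 0.72, 0.74, 0.70])  # hypothetical per-seed accuracies
baseline  = np.array([0.65, 0.66, 0.67, 0.64, 0.66])

diff = pddl_mind - baseline
t_stat, p_value = stats.ttest_rel(pddl_mind, baseline)    # paired t-test across seeds
ci = stats.t.interval(0.95, df=len(diff) - 1,             # 95% CI on the mean gain
                      loc=diff.mean(), scale=stats.sem(diff))

print(f"mean gain = {diff.mean():.3f}, t = {t_stat:.2f}, p = {p_value:.4f}, 95% CI = {ci}")
```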

Circularity Check

0 steps flagged

Empirical neuro-symbolic framework exhibits no circularity

Full rationale

The paper introduces PDDL-Mind as an applied framework that converts narrative text into explicit PDDL states and actions, then uses a predefined domain for transition verification before feeding the resulting state representation to an LLM for belief inference. All reported gains (over 5% absolute accuracy on MMToM-QA, MuMA, and FanToM) are measured against external benchmarks and prior methods; no equations, fitted parameters, or self-referential definitions appear in the derivation. The translation and verification steps are presented as engineering choices whose correctness is evaluated empirically rather than assumed by construction, rendering the central claims independently falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that PDDL can faithfully represent narrative environments and that state tracking is the primary failure mode in existing ToM methods.

axioms (1)
  • domain assumption: Predefined PDDL domains accurately capture all relevant actions and state transitions in the target narratives.
    The framework relies on these domains to verify transitions and provide consistent states.

pith-pipeline@v0.9.0 · 5457 in / 1236 out tokens · 46263 ms · 2026-05-10T04:23:57.285297+00:00 · methodology

discussion (0)

