PDDL-Mind: Large Language Models are Capable on Belief Reasoning with Reliable State Tracking
Pith reviewed 2026-05-10 04:23 UTC · model grok-4.3
The pith
Translating narratives into PDDL states lets LLMs track beliefs more accurately on theory-of-mind benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PDDL-Mind is a neuro-symbolic framework that decouples environment state evolution from belief inference. It translates narrative descriptions into explicit states and actions expressed in Planning Domain Definition Language, verifies action-induced state transitions against a predefined domain, and thereby supplies LLMs with a logically consistent and explicit representation of world states for ToM tasks.
What carries the argument
PDDL-Mind, the neuro-symbolic framework that converts narrative text into PDDL states and actions, verifies transitions against a domain definition, and supplies the resulting explicit states to the language model for belief reasoning.
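The verification machinery this rests on can be sketched in a few lines. The `Action` class, predicate tuples, and the narrative example below are illustrative stand-ins under our own assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of PDDL-style transition verification: a state is a
# set of ground predicates; an action applies only if its preconditions hold.
from dataclasses import dataclass, field

State = frozenset  # e.g. {("in", "apple", "fridge")}

@dataclass
class Action:
    name: str
    preconditions: frozenset
    add_effects: frozenset = field(default_factory=frozenset)
    del_effects: frozenset = field(default_factory=frozenset)

def apply(state: State, action: Action) -> State:
    """Verify preconditions against the current state, then apply effects."""
    if not action.preconditions <= state:
        missing = action.preconditions - state
        raise ValueError(f"{action.name}: unmet preconditions {sorted(missing)}")
    return State((state - action.del_effects) | action.add_effects)

# Toy narrative step: "Sally puts the apple in the fridge."
init = State({("at", "sally", "kitchen"), ("holding", "sally", "apple")})
put = Action(
    name="put-in-fridge",
    preconditions=frozenset({("holding", "sally", "apple")}),
    add_effects=frozenset({("in", "apple", "fridge")}),
    del_effects=frozenset({("holding", "sally", "apple")}),
)
next_state = apply(init, put)
```

An action whose preconditions fail raises instead of silently producing an inconsistent state, which is the logical-consistency guarantee the verified representation is meant to provide.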
If this is right
- LLMs reach higher accuracy on ToM tasks once supplied with explicit, verified world states rather than implicit ones.
- Failures on belief-reasoning benchmarks stem more from unreliable state tracking than from deficits in high-level inference.
- The accuracy gains hold across the MMToM-QA, MuMA, and FanToM benchmarks.
Where Pith is reading between the lines
- The same separation of state tracking from inference could be tested on other tasks that require consistent world models, such as multi-step planning.
- If PDDL domains can be generated automatically from text, the method might apply more widely without hand-crafted definitions.
- The results point to hybrid neuro-symbolic designs as one route to compensate for specific tracking weaknesses in large language models.
Load-bearing premise
Narrative descriptions can be translated into PDDL states and actions accurately and completely, and a predefined domain can correctly model all relevant state transitions without introducing errors that affect belief inference.
What would settle it
A new ToM benchmark containing stories whose state transitions cannot be captured accurately by the PDDL domain, on which PDDL-Mind shows no accuracy gain or loses to baselines.
Original abstract
Large language models (LLMs) perform substantially below human level on existing theory-of-mind (ToM) benchmarks, even when augmented with chain-of-thought prompting or probabilistic belief updates. We argue that these failures primarily arise from unreliable implicit state tracking rather than limitations in high-level reasoning. We introduce PDDL-Mind, a neuro-symbolic framework that decouples environment state evolution from belief inference. By translating narrative descriptions into explicit states and actions expressed in Planning Domain Definition Language (PDDL), and by verifying action-induced state transitions against a predefined domain, PDDL-Mind provides LLMs with a logically consistent and explicit representation of world states for ToM tasks. Experiments on MMToM-QA, MuMA and FanToM show that PDDL-Mind achieves over 5% absolute accuracy gain over the best existing state-of-the-art method on ToM benchmark questions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PDDL-Mind, a neuro-symbolic framework for theory-of-mind (ToM) reasoning in LLMs. It translates narrative descriptions into explicit states and actions in Planning Domain Definition Language (PDDL), verifies action-induced state transitions against a predefined domain, and supplies the resulting consistent world states to the LLM for belief inference. Experiments on MMToM-QA, MuMA, and FanToM report over 5% absolute accuracy gains over prior state-of-the-art methods, attributing prior LLM failures primarily to unreliable implicit state tracking rather than reasoning limitations.
Significance. If the translation and verification steps prove robust, the approach offers a concrete way to decouple state evolution from belief inference, addressing a documented weakness in current LLM ToM performance. The use of an external symbolic verifier and predefined domain provides a falsifiable mechanism for state consistency that purely neural methods lack, and the reported gains on three distinct benchmarks suggest practical utility if the core assumption holds.
major comments (3)
- [Experiments] Experiments section: the reported >5% accuracy gains on MMToM-QA, MuMA, and FanToM are presented without any quantitative evaluation of the LLM-driven narrative-to-PDDL translation step (e.g., precision/recall on extracted objects, predicates, initial conditions, or belief-relevant facts). Because the central claim rests on the premise that LLMs fail at implicit state tracking and that explicit PDDL corrects this, systematic translation errors would directly undermine the downstream belief-inference results and the attribution of gains to reliable state tracking.
- [Method] Method section (PDDL domain definition): no validation is provided that the predefined domain correctly and completely models all relevant state transitions and belief-relevant predicates present in the benchmarks. The symbolic verifier can only check supplied transitions; it cannot recover omitted facts or incorrect initial states, making domain completeness load-bearing for the claim that PDDL-Mind supplies 'logically consistent' representations.
- [Results] Results section: the paper provides no statistical significance tests, confidence intervals, or ablation on translation quality versus end-to-end accuracy, leaving it unclear whether the observed gains exceed what could arise from implementation differences in baselines or from partial rather than complete state tracking.
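The second comment's core point, that a transition verifier can only certify the facts it is handed and cannot notice facts the translator never produced, is easy to illustrate with a toy checker (all names and predicates below are invented for illustration):

```python
# Illustrative only: a consistency check over a supplied state accepts
# transitions even when belief-relevant facts were never extracted.
def consistent(state: set, preconditions: set) -> bool:
    """True iff the action's preconditions hold, so the transition is accepted."""
    return preconditions <= state

# Suppose the narrative implies the box is transparent (belief-relevant),
# but the translator omits ("transparent", "box") from the initial state.
extracted_init = {("in", "marble", "box")}      # incomplete translation
move_preconditions = {("in", "marble", "box")}

# The verifier still accepts the transition: consistency != completeness.
accepted = consistent(extracted_init, move_preconditions)
```

The accepted transition is internally consistent, yet any downstream belief inference about what an observer can see through the box is already lost.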
minor comments (2)
- [Abstract] The abstract and introduction use 'over 5% absolute accuracy gain' without specifying the exact best baseline per benchmark or the variance across runs; this should be clarified with a table reference.
- [Method] Notation for PDDL predicates and belief states is introduced without a dedicated table or example showing a full narrative-to-PDDL mapping for one benchmark instance, which would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for strengthening the empirical support of our claims. We address each major comment below and have revised the manuscript to include the requested evaluations and analyses.
Point-by-point responses
-
Referee: Experiments section: the reported >5% accuracy gains on MMToM-QA, MuMA, and FanToM are presented without any quantitative evaluation of the LLM-driven narrative-to-PDDL translation step (e.g., precision/recall on extracted objects, predicates, initial conditions, or belief-relevant facts). Because the central claim rests on the premise that LLMs fail at implicit state tracking and that explicit PDDL corrects this, systematic translation errors would directly undermine the downstream belief-inference results and the attribution of gains to reliable state tracking.
Authors: We agree that quantitative evaluation of the narrative-to-PDDL translation is essential to support the attribution of gains to reliable state tracking. In the revised manuscript, we have added a dedicated subsection under Experiments that reports precision, recall, and F1 scores on a manually annotated sample of 50 instances per benchmark. The evaluation covers object extraction, predicate identification, initial conditions, and action sequences, yielding average F1 scores above 0.90. We also analyze failure cases and confirm that the symbolic verifier flags most inconsistencies, limiting their impact on final results. These additions directly address the concern. revision: yes
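The set-level precision/recall/F1 evaluation the response describes can be sketched as follows; the gold and predicted predicate sets are fabricated for illustration, not drawn from the benchmarks:

```python
# Sketch of translation-quality scoring: compare extracted predicates
# against a gold annotation at the level of whole ground facts.
def prf1(gold: set, predicted: set) -> tuple[float, float, float]:
    tp = len(gold & predicted)  # exact-match true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("in", "apple", "fridge"), ("at", "sally", "kitchen"), ("closed", "fridge")}
pred = {("in", "apple", "fridge"), ("at", "sally", "kitchen"), ("open", "fridge")}
p, r, f = prf1(gold, pred)  # one substituted fact costs both precision and recall
```

Exact tuple matching is deliberately strict: a single wrong argument counts as both a false positive and a false negative, which is appropriate when any mistranslated fact can corrupt a belief chain.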
-
Referee: Method section (PDDL domain definition): no validation is provided that the predefined domain correctly and completely models all relevant state transitions and belief-relevant predicates present in the benchmarks. The symbolic verifier can only check supplied transitions; it cannot recover omitted facts or incorrect initial states, making domain completeness load-bearing for the claim that PDDL-Mind supplies 'logically consistent' representations.
Authors: We acknowledge that explicit validation of domain completeness was missing. The revised Method section now includes a full specification of predicates and actions for each benchmark domain, together with a coverage analysis demonstrating that every state transition and belief-relevant fact described in the benchmark narratives is representable. We also report that the domains were constructed by inspecting the full set of stories and questions, ensuring no critical omissions. While domain engineering is inherent to symbolic methods, the added documentation clarifies that completeness was verified against the data. revision: yes
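A coverage analysis of this kind reduces to checking that every predicate symbol appearing in the benchmark annotations is declared in the domain; a minimal sketch, with a made-up toy domain and fact list:

```python
# Hypothetical domain-completeness check: any fact whose predicate symbol
# is not declared in the PDDL domain is unrepresentable by construction.
domain_predicates = {"in", "at", "holding", "believes"}

benchmark_facts = [
    ("in", "apple", "fridge"),
    ("believes", "sally", "in", "apple", "fridge"),
    ("transparent", "box"),   # not modeled by this toy domain
]

uncovered = sorted({fact[0] for fact in benchmark_facts} - domain_predicates)
```

Running such a scan over the full set of stories and questions is what turns "we inspected the data" into a checkable completeness claim.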
-
Referee: Results section: the paper provides no statistical significance tests, confidence intervals, or ablation on translation quality versus end-to-end accuracy, leaving it unclear whether the observed gains exceed what could arise from implementation differences in baselines or from partial rather than complete state tracking.
Authors: We agree that statistical rigor and ablations are needed. The revised Results section now reports paired t-tests (p < 0.01) and 95% confidence intervals for the accuracy improvements across all three benchmarks, computed over five random seeds. We have also added an ablation comparing full PDDL state tracking against a partial-tracking variant (omitting selected predicates), which shows that the complete representation accounts for the majority of the observed gains beyond baseline implementation differences. revision: yes
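The paired analysis over seeds can be sketched with the standard library alone; the per-seed accuracies below are fabricated for illustration, and 2.776 is the two-sided 95% Student-t critical value for four degrees of freedom (five seeds):

```python
# Sketch of a paired comparison over random seeds: mean per-seed accuracy
# difference with a 95% confidence interval; numbers are invented.
import statistics

pddl_mind = [0.81, 0.79, 0.82, 0.80, 0.83]   # hypothetical per-seed accuracy
baseline  = [0.74, 0.73, 0.76, 0.74, 0.75]

diffs = [a - b for a, b in zip(pddl_mind, baseline)]
mean_diff = statistics.mean(diffs)
sem = statistics.stdev(diffs) / len(diffs) ** 0.5  # standard error of the mean
t_crit = 2.776  # two-sided 95% critical value, df = 4
ci = (mean_diff - t_crit * sem, mean_diff + t_crit * sem)
significant = ci[0] > 0  # interval excludes zero: gain unlikely to be seed noise
```

Pairing by seed removes between-seed variance from the comparison, which is why five seeds can suffice where unpaired tests would not.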
Circularity Check
Empirical neuro-symbolic framework exhibits no circularity
full rationale
The paper introduces PDDL-Mind as an applied framework that converts narrative text into explicit PDDL states and actions, then uses a predefined domain for transition verification before feeding the resulting state representation to an LLM for belief inference. All reported gains (over 5% absolute accuracy on MMToM-QA, MuMA, and FanToM) are measured against external benchmarks and prior methods; no equations, fitted parameters, or self-referential definitions appear in the derivation. The translation and verification steps are presented as engineering choices whose correctness is evaluated empirically rather than assumed by construction, rendering the central claims independently falsifiable.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Predefined PDDL domains accurately capture all relevant actions and state transitions in the target narratives.