Introduces the NCP-ExploreToM framework for evaluating LLMs on inducing belief states through planning and action; GPT-5 succeeds on ~80% of tasks and outperforms humans.
arXiv preprint arXiv:2504.10839
5 Pith papers cite this work. Polarity classification is still indexing.
Citations by year: 2026 (5)
Citation roles: background (1)
Citation polarities: background (1)
Representative citing papers:
GroupToM-Bench is presented as the first multimodal benchmark for group-level Theory of Mind spanning micro BDI states to macro outcome prediction, with experiments showing current MLLMs lag human baselines on nonlinear social dynamics.
LLM outputs are meaningful according to standard theories of human language, without requiring anthropomorphic assumptions about the models.
The MedMSA framework retrieves knowledge via language models, then builds formal probabilistic models to produce uncertainty-weighted differential diagnoses from symptoms.