
arxiv: 2506.17788 · v2 · submitted 2025-06-21 · 💻 cs.AI · cs.CL · cs.LG · cs.MA

Bayesian Social Deduction with Graph-Informed Language Models

Pith reviewed 2026-05-19 07:24 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG · cs.MA
keywords social reasoning · language agents · probabilistic models · graph structures · Avalon · belief inference · hybrid AI systems

The pith

Externalizing belief inference to a graph-informed probabilistic model lets smaller language agents match large models and defeat humans in Avalon.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Social reasoning requires inferring hidden beliefs and intentions from limited observations, a task where even large language models struggle without heavy computation. This paper shows that splitting the work—using a structured graph-based probabilistic model for belief tracking and an LLM only for language—lets compact agents perform at the level of much larger systems. In tests on the social deduction game Avalon, the hybrid approach matches big models in agent play and beats human players with a 67 percent win rate while earning better ratings than human teammates. The result suggests that explicit external models for unobservables can overcome internal reasoning limits in LLMs.

Core claim

The authors demonstrate that a hybrid system, in which a graph-informed structured probabilistic model handles externalized belief inference about other players' hidden states while the language model manages only language understanding and communication, achieves competitive results against much larger pure language models in agent-versus-agent Avalon and secures a 67 percent win rate against human opponents in controlled experiments.

What carries the argument

The hybrid reasoning framework that externalizes belief inference to a graph-informed probabilistic model for tracking unobservable player intentions and beliefs.
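
The idea of keeping belief inference outside the language model can be illustrated with a toy Bayesian update over hidden roles. This is a minimal sketch, not the paper's model: it swaps in brute-force enumeration of role assignments for whatever inference the paper's factor graph performs, and the table size, number of Evil players, sabotage probability, and all function names are assumptions made up for illustration.

```python
from itertools import combinations

N_PLAYERS = 5      # assumed table size (hypothetical)
N_EVIL = 2         # assumed number of Evil players (hypothetical)
P_SABOTAGE = 0.8   # assumed chance that a team containing Evil fails (hypothetical)

def uniform_prior():
    """Uniform distribution over every possible set of Evil players."""
    hyps = [frozenset(c) for c in combinations(range(N_PLAYERS), N_EVIL)]
    return {h: 1.0 / len(hyps) for h in hyps}

def mission_likelihood(evil, team, failed):
    """P(outcome | evil set); Good-only teams never fail in this toy model."""
    p_fail = P_SABOTAGE if evil & set(team) else 0.0
    return p_fail if failed else 1.0 - p_fail

def update(belief, team, failed):
    """One Bayes step: posterior is proportional to prior times likelihood."""
    unnorm = {h: p * mission_likelihood(h, team, failed) for h, p in belief.items()}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

def marginals(belief):
    """Per-player probability of being Evil, summed over role assignments."""
    return [sum(p for h, p in belief.items() if j in h) for j in range(N_PLAYERS)]

belief = uniform_prior()
belief = update(belief, team=(0, 1), failed=True)      # a two-player mission fails
belief = update(belief, team=(1, 2, 3), failed=False)  # a three-player mission succeeds
print([round(m, 3) for m in marginals(belief)])
```

In a hybrid design of the kind described above, a language model would only need to read these per-player marginals and turn them into dialogue or votes; the probabilities themselves come from the explicit update rather than from the model's internal reasoning.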

If this is right

  • Agents can operate in real time without needing extensive test-time inference.
  • Performance holds across different model sizes when the belief model is kept separate.
  • Qualitative ratings from humans favor the hybrid agent over both reasoning baselines and fellow humans.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to other domains requiring theory-of-mind reasoning, such as negotiation or collaborative planning.
  • Success against humans points to uses in social simulation and AI training environments.

Load-bearing premise

The graph-informed probabilistic model correctly extracts and represents the crucial unobservable beliefs and intentions of other agents from partial game observations.

What would settle it

A replication study in which the hybrid agent plays against humans and fails to exceed a 50 percent win rate, or in which swapping the graph model for pure LLM-based belief inference restores the performance gap seen in distilled models.
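
The 50 percent threshold above can be checked against the game counts quoted under Figure 7 (10 wins in 15 games for GRAIL, 4 wins in 15 for the reasoning-based agent). The sketch below is illustrative only: the test the authors actually ran is cut off in this excerpt, so an exact binomial tail and Fisher's exact test are swapped in as two common choices, and the function names are hypothetical.

```python
from math import comb

def binomial_upper_tail(wins, games, p0=0.5):
    """Exact P(X >= wins) for X ~ Binomial(games, p0)."""
    return sum(
        comb(games, k) * p0**k * (1 - p0) ** (games - k)
        for k in range(wins, games + 1)
    )

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(k):
        # Unnormalised hypergeometric weight, kept as an exact integer.
        return comb(row1, k) * comb(row2, col1 - k)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    # Sum the weights of all tables at least as unlikely as the observed one.
    extreme = sum(weight(k) for k in range(lo, hi + 1) if weight(k) <= observed)
    return extreme / comb(row1 + row2, col1)

# Counts as quoted under Figure 7.
grail_wins, grail_losses = 10, 5
baseline_wins, baseline_losses = 4, 11

games = grail_wins + grail_losses
print(f"GRAIL win rate: {grail_wins / games:.1%}")
print(f"Baseline win rate: {baseline_wins / (baseline_wins + baseline_losses):.1%}")
print(f"P(at least {grail_wins} wins in {games} games at 50%): "
      f"{binomial_upper_tail(grail_wins, games):.4f}")
print("Fisher exact (two-sided), GRAIL vs baseline: "
      f"{fisher_exact_two_sided(grail_wins, grail_losses, baseline_wins, baseline_losses):.4f}")
```

Exact tests are used here because 15 games per condition is too few for a normal approximation to be reliable; a replication with more games would tighten both comparisons.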

Figures

Figures reproduced from arXiv: 2506.17788 by Angela Qian, Guven Gergerli, Joseph Campbell, Lucia Romero, Matthew Lyle Olson, Shahab Rahimirad, Simon Stepputtis.

Figure 1: Overview of GRAIL’s architecture and inter…
Figure 2: F1 scores of agents’ voting predictions of team composition per round (error bars indicate …
Figure 3: Probability density of GRAIL beliefs about Good and Evil players, with and without …
Figure 4: Average per-round token usage for GRAIL, LRM-based reasoning agents, and ReCon in Agent-Agent games.
Adjacent text from the paper: Belief Distribution: To analyze the effect of language priors, we visualize the evolution of GRAIL’s belief over the course of 13 games that end in 5 rounds. Fig. 3a shows the kernel density estimations $\mathrm{KDE}(b_j^t \mid r_j = 1)$ (Evil player) and $\mathrm{KDE}(b_j^t \mid r_j = 0)$ (Good player), computed both with and without …
Figure 5: Combined ablation results across agent components, model size, and reasoning types.
Figure 6: Hallucination rates for GRAIL (Llama 3.1) and the reasoning agent (DS-R1) for different model sizes over 40 games.
Adjacent text from the paper: During experiments, we observed that agent messages sometimes included hallucinations and references to non-existent game events. To evaluate how well agents align their messages with the ground truth game state, we analyze hallucination rates in both GRAIL and the reasoning agent across var…
Figure 7: Average scores given to agents by humans across two questions assessing contribution and helpfulness. Human ratings (Evil players’ votes for Good human players) are included for baseline comparison.
Adjacent text from the paper: H1, H2: Across 15 games, GRAIL won 10 and lost 5 (67% win rate), whereas the reasoning-based agent won 4 and lost 11 (27% win rate). To assess the statistical significance of this performance difference, we e…
Figure 8: The Factor Graph Structure
Figure 9: The relationship between accuracy and confidence of the model before and after calibration.
Figure 10: The game interface as seen in Spectator Mode
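
The belief-density comparison described in the text attached to Figures 3 and 4 (one kernel density estimate for players whose true role is Evil, one for Good) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the belief values below are invented placeholders rather than data from the paper, and the Gaussian kernel, the bandwidth, and the function name are choices made here, not details confirmed by the source.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` with Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical belief values b_j^t (estimated probability that player j is Evil),
# grouped by the player's true role r_j. These numbers are placeholders.
beliefs_evil = [0.55, 0.62, 0.70, 0.81, 0.90]   # r_j = 1
beliefs_good = [0.10, 0.18, 0.25, 0.33, 0.45]   # r_j = 0

kde_evil = gaussian_kde(beliefs_evil, bandwidth=0.1)
kde_good = gaussian_kde(beliefs_good, bandwidth=0.1)

# Evaluate both densities on a grid over the belief range [0, 1].
for i in range(0, 21, 4):
    x = i / 20
    print(f"b={x:.2f}  evil={kde_evil(x):.3f}  good={kde_good(x):.3f}")
```

The further apart the two curves sit, the better the beliefs separate Evil from Good players, which is the kind of effect such a plot is meant to make visible.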
Original abstract

Social reasoning - inferring unobservable beliefs and intentions from partial observations of other agents - remains a challenging task for large language models (LLMs). We evaluate the limits of current reasoning language models in the social deduction game Avalon and find that while the largest models demonstrate strong performance, they require extensive test-time inference and degrade sharply when distilled to smaller, real-time-capable variants. To address this, we introduce a hybrid reasoning framework that externalizes belief inference to a structured probabilistic model, while using an LLM for language understanding and interaction. Our approach achieves competitive performance with much larger models in Agent-Agent play and, notably, is the first language agent to defeat human players in a controlled study - achieving a 67% win rate and receiving higher qualitative ratings than both reasoning baselines and human teammates. We release code, models, and a dataset to support future work on social reasoning in LLM agents, which can be found at https://camp-lab-purdue.github.io/bayesian-social-deduction/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a hybrid reasoning framework for the social deduction game Avalon that externalizes belief inference about hidden roles, intentions, and knowledge to a graph-informed structured probabilistic model while delegating language understanding and interaction to an LLM. It reports competitive performance against much larger models in agent-agent play and, in a controlled human study, a 67% win rate for the hybrid agent along with higher qualitative ratings than both reasoning baselines and human teammates. The authors release code, models, and a dataset.

Significance. If the results are robust, the work provides evidence that hybrid systems can achieve strong social reasoning with smaller, real-time LLMs by offloading complex belief tracking to an external probabilistic structure, addressing a known limitation of pure end-to-end LLM reasoning. The human-study result would constitute a notable first for language agents if the controls and statistical reporting hold. Open release of artifacts is a clear strength that supports reproducibility and follow-on research.

major comments (2)
  1. [§3] §3 (Model): The description of the graph-informed probabilistic belief model lacks explicit equations for the update rules and the representation of higher-order beliefs (e.g., nested beliefs about other agents' beliefs). Without these, it is impossible to verify whether the structure captures the deceptive or nested intentions that arise in human play, which is the load-bearing assumption for the 67% win-rate claim.
  2. [§4.3] §4.3 (Human Study): The results section reports a 67% win rate and higher qualitative ratings but provides no details on participant count, experience matching, statistical tests, or exclusion criteria. This omission prevents assessment of whether the hybrid split introduces new failure modes precisely in the human regime where superiority is claimed.
minor comments (2)
  1. [Abstract] The abstract and introduction use the term 'parameter-free' for the probabilistic component; clarify whether any hyperparameters in the graph construction or prior specification are tuned on the evaluation data.
  2. [Figure 1] Figure 1 (architecture diagram) would benefit from an explicit legend distinguishing LLM-generated text from probabilistic belief outputs to improve readability for readers unfamiliar with hybrid agent designs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We provide point-by-point responses to the major comments below. We agree that additional mathematical detail and experimental reporting are needed to strengthen the manuscript and will revise accordingly.

Point-by-point responses
  1. Referee: [§3] §3 (Model): The description of the graph-informed probabilistic belief model lacks explicit equations for the update rules and the representation of higher-order beliefs (e.g., nested beliefs about other agents' beliefs). Without these, it is impossible to verify whether the structure captures the deceptive or nested intentions that arise in human play, which is the load-bearing assumption for the 67% win-rate claim.

    Authors: We acknowledge that the model section provides a high-level description of the graph-informed probabilistic belief model without explicit equations. This is a valid point, as the update rules for beliefs and the mechanism for representing higher-order beliefs are crucial for understanding how the model handles deception and nested intentions in Avalon. In the revised manuscript, we will add the formal equations for the belief update process, including how the graph structure encodes and propagates higher-order beliefs. This revision will make it possible to verify the model's capacity to support the reported performance.

    revision: yes

  2. Referee: [§4.3] §4.3 (Human Study): The results section reports a 67% win rate and higher qualitative ratings but provides no details on participant count, experience matching, statistical tests, or exclusion criteria. This omission prevents assessment of whether the hybrid split introduces new failure modes precisely in the human regime where superiority is claimed.

    Authors: We appreciate the referee drawing attention to the reporting standards for the human study. While the manuscript states the 67% win rate and qualitative ratings, we agree that more details are necessary for full evaluation. We will revise the section to include the number of participants, how they were matched for experience, the statistical tests conducted (including p-values), and the exclusion criteria used. These additions will allow for a better assessment of the results and any potential failure modes in human interactions.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity in the hybrid model or performance claims

full rationale

The paper describes a hybrid framework that externalizes belief inference to an independent structured probabilistic model over a graph of agents/roles/observations, while delegating language tasks to the LLM. No equations, derivations, or self-citations are shown that reduce the claimed win rates or competitive performance to a fitted parameter defined by the result itself, a self-referential definition, or a load-bearing uniqueness theorem from prior author work. The results are presented as empirical outcomes from agent-agent and controlled human-play experiments, with the probabilistic component described as external and structured rather than fitted to the target metrics, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the hybrid framework implicitly relies on standard Bayesian updating assumptions and LLM language capabilities, but none are detailed enough to enumerate.

pith-pipeline@v0.9.0 · 5724 in / 1265 out tokens · 27477 ms · 2026-05-19T07:24:16.591106+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

    cs.AI 2026-04 unverdicted novelty 7.0

    COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.

  2. Evaluating Large Language Models in a Complex Hidden Role Game

    cs.CL 2026-04 unverdicted novelty 5.0

    LLMs achieve only 59.7% role identification accuracy in Secret Hitler versus 86.7% for rule-based agents, show negative impact as fascists, and produce 40% shorter games due to failed deception.

Reference graph

Works this paper leans on

98 extracted references · 98 canonical work pages · cited by 2 Pith papers · 8 internal anchors

  1. [1]

    Large language models for mathematical reasoning: Progresses and challenges

    Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. In Neele Falk, Sara Papi, and Mike Zhang, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 225–237, St. Ju...

  2. [2]

    A (dis-)information theory of revealed and unrevealed preferences: Emerging deception and skepticism via theory of mind

    Nitay Alon, Lion Schulz, Jeffrey Rosenschein, and Peter Dayan. A (dis-)information theory of revealed and unrevealed preferences: Emerging deception and skepticism via theory of mind. Open Mind, 7:1–17, 08 2023. doi: 10.1162/opmi_a_00097

  3. [3]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sashank Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bra...

  4. [4]

    Exploring and controlling diversity in llm-agent conversation, 2025

    KuanChao Chu, Yi-Pei Chen, and Hideki Nakayama. Exploring and controlling diversity in llm-agent conversation, 2025. URL https://arxiv.org/abs/2412.21102

  5. [5]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948

  6. [6]

    Wellman, Yu Wang, Genyue Fu, and Kang Lee

    Xiao Pan Ding, Henry M. Wellman, Yu Wang, Genyue Fu, and Kang Lee. Theory-of-mind training causes honest young children to lie. Psychological Science, 26(11):1812–1821, 2015. doi: 10.1177/0956797615604628. URL https://doi.org/10.1177/0956797615604628. PMID: 26431737

  7. [7]

    Gtbench: Uncovering the strategic reasoning capabilities of llms via game-theoretic evaluations

    Jinhao Duan, Renming Zhang, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, and Kaidi Xu. Gtbench: Uncovering the strategic reasoning capabilities of llms via game-theoretic evaluations. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Info...

  8. [8]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, 10 Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Beth...

  9. [9]

    The resistance: Avalon

    Don Eskridge. The resistance: Avalon. Board game, 2012

  10. [10]

    Human-level play in the game of diplomacy by combining language models with strategic reasoning

    Meta Fundamental AI Research Diplomacy Team (FAIR) †, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiy...

  11. [11]

    Tracking beliefs and intentions in the werewolf game

    Codruta Gîrlea, Eyal Amir, and Roxana Girju. Tracking beliefs and intentions in the werewolf game. In International Conference on Principles of Knowledge Representation and Reasoning,

  12. [12]

    URL https://api.semanticscholar.org/CorpusID:11838

  13. [13]

    A Survey on LLM-as-a-Judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. A survey on llm-as-a-judge, 2025. URL https://arxiv.org/abs/ 2411.15594

  14. [14]

    On Calibration of Modern Neural Networks

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks, 2017. URL https://arxiv.org/abs/1706.04599

  15. [15]

    secret hitler

    Jiaxian Guo, Bo Yang, Paul Yoo, Bill Yuchen Lin, Yusuke Iwasawa, and Yutaka Matsuo. Suspicion-agent: Playing imperfect information games with theory of mind aware gpt-4, 2024. URL https://arxiv.org/abs/2309.17277

  16. [16]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum? id=d7KBjmI3GmQ

  17. [17]

    Ho, Rebecca Saxe, and Fiery Cushman

    Mark K. Ho, Rebecca Saxe, and Fiery Cushman. Planning with theory of mind. Trends in Cognitive Sciences , 26(11):959–971, 2022. ISSN 1364-6613. doi: https://doi.org/ 10.1016/j.tics.2022.08.003. URL https://www.sciencedirect.com/science/article/ pii/S1364661322001851

  18. [18]

    Putting the con in context: Identifying deceptive actors in the game of mafia

    Samee Ibraheem, Gaoyue Zhou, and John DeNero. Putting the con in context: Identifying deceptive actors in the game of mafia. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies, ...

  19. [19]

    Overcoming catastrophic forgetting in neural networks

    Michal Kosinski. Evaluating large language models in theory of mind tasks. Proceedings of the National Academy of Sciences, 121(45), October 2024. ISSN 1091-6490. doi: 10.1073/pnas. 2405460121. URL http://dx.doi.org/10.1073/pnas.2405460121

  20. [20]

    Kschischang, B.J

    F.R. Kschischang, B.J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001. doi: 10.1109/18.910572

  21. [21]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large lan- guage model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  22. [22]

    Werewolf among us: Multimodal resources for modeling persuasion behaviors in social deduction games

    Bolin Lai, Hongxin Zhang, Miao Liu, Aryan Pariani, Fiona Ryan, Wenqi Jia, Shirley Anugrah Hayati, James Rehg, and Diyi Yang. Werewolf among us: Multimodal resources for modeling persuasion behaviors in social deduction games. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 202...

  23. [23]

    Llm-based agent society investigation: Collaboration and confrontation in avalon gameplay, 2024

    Yihuai Lan, Zhiqiang Hu, Lei Wang, Yang Wang, Deheng Ye, Peilin Zhao, Ee-Peng Lim, Hui Xiong, and Hao Wang. Llm-based agent society investigation: Collaboration and confrontation in avalon gameplay, 2024. URL https://arxiv.org/abs/2310.14985

  24. [24]

    Theory of mind for multi-agent collaboration via large language models

    Huao Li, Yu Chong, Simon Stepputtis, Joseph Campbell, Dana Hughes, Charles Lewis, and Katia Sycara. Theory of mind for multi-agent collaboration via large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 180–192, Singapore, December

  25. [25]

    doi: 10.18653/v1/2023.emnlp-main.13

    Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.13. URL https://aclanthology.org/2023.emnlp-main.13/

  26. [26]

    AvalonBench: Evaluating LLMs Playing the Game of Avalon, 2023

    Jonathan Light, Min Cai, Sheng Shen, and Ziniu Hu. Avalonbench: Evaluating llms playing the game of avalon, 2023. URL https://arxiv.org/abs/2310.05036

  27. [27]

    Enhancing language model agents using diversity of thoughts

    Vijay Lingam, Behrooz Omidvar Tehrani, Sujay Sanghavi, Gaurav Gupta, Sayan Ghosh, Linbo Liu, Jun Huan, and Anoop Deoras. Enhancing language model agents using diversity of thoughts. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=ZsP3YbYeE9

  28. [29]

    Manakul, A

    Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-resource black- box hallucination detection for generative large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore, December 2023. Association for Comput...

  29. [30]

    Mann and Douglas R

    Henry B. Mann and Douglas R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18:50–60, 1947. URL https://api.semanticscholar.org/CorpusID:14328772

  30. [31]

    Situations, actions, and causal laws

    John McCarthy. Situations, actions, and causal laws. 1963. URL https://api. semanticscholar.org/CorpusID:118922379

  31. [32]

    Can LLMs keep a secret? testing privacy implications of language models via contextual integrity theory

    Niloofar Mireshghallah, Hyunwoo Kim, Xuhui Zhou, Yulia Tsvetkov, Maarten Sap, Reza Shokri, and Yejin Choi. Can LLMs keep a secret? testing privacy implications of language models via contextual integrity theory. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gmg7t8b4s0. 12

  32. [33]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393

  33. [34]

    Murphy, Yair Weiss, and Michael I

    Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: an empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI’99, page 467–475, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. ISBN 1558606149

  34. [35]

    Reasoning over uncertain text by generative large language models, 2024

    Aliakbar Nafar, Kristen Brent Venable, and Parisa Kordjamshidi. Reasoning over uncertain text by generative large language models, 2024. URL https://arxiv.org/abs/2402.09614

  35. [36]

    Précis of bayesian rationality: The probabilistic approach to human reasoning

    Mike Oaksford and Nick Chater. Précis of bayesian rationality: The probabilistic approach to human reasoning. Behavioral and Brain Sciences , 32(1):69–84, 2009. doi: 10.1017/ S0140525X09000284

  36. [37]

    Introducing openai o3 and o4-mini, April 2025

    OpenAI. Introducing openai o3 and o4-mini, April 2025. URL https://openai.com/index/ introducing-o3-and-o4-mini/ . Accessed 2025-05-09

  37. [38]

    What are the odds? language models are capable of probabilistic reasoning

    Akshay Paruchuri, Jake Garrison, Shun Liao, John Hernandez, Jacob Sunshine, Tim Althoff, Xin Liu, and Daniel McDuff. What are the odds? language models are capable of probabilistic reasoning. In Conference on Empirical Methods in Natural Language Processing, 2024. URL https://api.semanticscholar.org/CorpusID:270562235

  38. [39]

    Reasoning with language model prompting: A survey

    Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 5368–...

  39. [40]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022

  40. [41]

    Richard and Richard P

    Michael D. Richard and Richard P. Lippmann. Neural network classifiers estimate bayesian a posteriori probabilities. Neural Computation, 3(4):461–483, 1991. doi: 10.1162/neco.1991.3.4. 461

  41. [42]

    arXiv preprint arXiv:2412.19726 , year=

    Matthew Riemer, Zahra Ashktorab, Djallel Bouneffouf, Payel Das, Miao Liu, Justin D. Weisz, and Murray Campbell. Position: Theory of mind benchmarks are broken for large language models, 2025. URL https://arxiv.org/abs/2412.19726

  42. [43]

    Karen Liu, and Dorsa Sadigh

    Bidipta Sarkar, Warren Xia, C. Karen Liu, and Dorsa Sadigh. Training language models for social deduction with multi-agent reinforcement learning, 2025. URL https://arxiv.org/ abs/2502.06060

  43. [44]

    Are emergent abilities of large language models a mirage? In Thirty-seventh Conference on Neural Information Processing Systems,

    Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? In Thirty-seventh Conference on Neural Information Processing Systems,

  44. [45]

    URL https://openreview.net/forum?id=ITw9edRDlD

  45. [46]

    Pomegranate: fast and flexible probabilistic modeling in python

    Jacob Schreiber. Pomegranate: fast and flexible probabilistic modeling in python. Journal of Machine Learning Research, 18(164):1–6, 2018

  46. [47]

    Knowledge-Centric Hallucination Detection

    Eli Schwartz, Leshem Choshen, Joseph Shtok, Sivan Doveh, Leonid Karlinsky, and Assaf Arbelle. NumeroLogic: Number encoding for enhanced LLMs’ numerical reasoning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 206–212, Miami, Florida, USA, Novemb...

  47. [48]

    Minding language models’ (lack of) theory of mind: A plug-and-play multi-character belief tracker

    Melanie Sclar, Sachin Kumar, Peter West, Alane Suhr, Yejin Choi, and Yulia Tsvetkov. Minding language models’ (lack of) theory of mind: A plug-and-play multi-character belief tracker. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

  48. [49]

    Finding friend and foe in multi-agent games

    Jack Serrino, Max Kleiman-Weiner, David C Parkes, and Josh Tenenbaum. Finding friend and foe in multi-agent games. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/ paper/...

  49. [50]

    Clever hans or neural theory of mind? stress testing social reasoning in large language models

    Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, and Vered Shwartz. Clever hans or neural theory of mind? stress testing social reasoning in large language models. In Yvette Graham and Matthew Purver, editors,Proceedings of the 18th Conference of the European Chapter of the Association for Computational ...

  50. [51]

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Esin DURMUS, Zac Hatfield-Dodds, Scott R Johnston, Shauna M Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models. In The Twelfth Internati...

  51. [52]

    Loopy Belief Propagation in the Presence of Determinism

    David Smith and Vibhav Gogate. Loopy Belief Propagation in the Presence of Determinism. In Samuel Kaski and Jukka Corander, editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics , volume 33 of Proceedings of Machine Learning Research, pages 895–903, Reykjavik, Iceland, 22–25 Apr 2014. PMLR. URL https: /...

  52. [53]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv. org/abs/2408.03314

  53. [54]

    Long-horizon dialogue understanding for role identification in the game of Avalon with large language models

    Simon Stepputtis, Joseph Campbell, Yaqi Xie, Zhengyang Qi, Wenxin Zhang, Ruiyi Wang, Sanketh Rangreji, Charles Lewis, and Katia Sycara. Long-horizon dialogue understanding for role identification in the game of Avalon with large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistic...

  54. [55]

    James W. A. Strachan, Dalila Albergo, Giulia Borghini, Oriana Pansardi, Eugenio Scaliti, Saurabh Gupta, Krati Saxena, Alessandro Rufo, Stefano Panzeri, Guido Manzi, Michael S. A. Graziano, and Cristina Becchio. Testing theory of mind in large language models and humans. Nature Human Behaviour, 8(7):1285–1295, May 2024. ISSN 2397-3374. doi: 10.1038/s4156...

  55. [56]

    QwQ-32B: Embracing the power of reinforcement learning

    Qwen Team. QwQ-32B: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/

  56. [57]

    Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks

    Tomer Ullman. Large language models fail on trivial alterations to theory-of-mind tasks, 2023. URL https://arxiv.org/abs/2302.08399

  57. [58]

    Graphical models, exponential families, and variational inference

    Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):25–34, 2008. ISSN 1935-8237. doi: 10.1561/2200000001. URL http://dx.doi.org/10.1561/2200000001.

  58. [59]

    Avalon’s game of thoughts: Battle against deception through recursive contemplation

    Shenzhi Wang, Chang Liu, Zilong Zheng, Siyuan Qi, Shuo Chen, Qisen Yang, Andrew Zhao, Chaofei Wang, Shiji Song, and Gao Huang. Avalon’s game of thoughts: Battle against deception through recursive contemplation, 2023. URL https://arxiv.org/abs/2310.01320

  59. [60]

    Chain of thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?...

  60. [61]

    Correctness of belief propagation in Gaussian graphical models of arbitrary topology

    Yair Weiss and William Freeman. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. In S. Solla, T. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999. URL https://proceedings.neurips.cc/paper_files/paper/1999/file/10c272d06794d3e5785d5e7c5356e9ff-Paper.pdf

  62. [63]

    Individual comparisons by ranking methods

    Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945. ISSN 00994987. URL http://www.jstor.org/stable/3001968

  64. [65]

    Enhance reasoning for large language models in the game Werewolf

    Shuang Wu, Liwen Zhu, Tao Yang, Shiwei Xu, Qiang Fu, Yang Wei, and Haobo Fu. Enhance reasoning for large language models in the game Werewolf, 2024. URL https://arxiv.org/abs/2402.02330

  65. [66]

    Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

    Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, and Yong Li. Towards large reasoning models: A survey of reinforced reasoning with large language models, 2025. URL ht...

  66. [67]

    MAgIC: Investigation of large language model powered multi-agent in cognition, adaptability, rationality and collaboration

    Lin Xu, Zhiyuan Hu, Daquan Zhou, Hongyu Ren, Zhen Dong, Kurt Keutzer, See-Kiong Ng, and Jiashi Feng. MAgIC: Investigation of large language model powered multi-agent in cognition, adaptability, rationality and collaboration. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural La...

  67. [68]

    Exploring large language models for communication games: An empirical study on Werewolf

    Yuzhuang Xu, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, and Yang Liu. Exploring large language models for communication games: An empirical study on Werewolf, 2023. URL https://arxiv.org/abs/2309.04658

  69. [70]

    Language agents with reinforcement learning for strategic play in the Werewolf game

    Zelai Xu, Chao Yu, Fei Fang, Yu Wang, and Yi Wu. Language agents with reinforcement learning for strategic play in the Werewolf game, 2024. URL https://arxiv.org/abs/2310.18940

  70. [71]

    Number cookbook: Number understanding of language models and how to improve it

    Haotong Yang, Yi Hu, Shijia Kang, Zhouchen Lin, and Muhan Zhang. Number cookbook: Number understanding of language models and how to improve it. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=BWS5gVjgeY

  71. [72]

    Finding the m most probable configurations using loopy belief propagation

    Chen Yanover and Yair Weiss. Finding the m most probable configurations using loopy belief propagation. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems, volume 16. MIT Press, 2003. URL https://proceedings.neurips.cc/paper_files/paper/2003/file/70fcb77e6349f4467edd7227baa73222-Paper.pdf

  73. [74]

    Benchmarking reasoning robustness in large language models

    Tong Yu, Yongcheng Jing, Xikun Zhang, Wentao Jiang, Wenjie Wu, Yingjie Wang, Wenbin Hu, Bo Du, and Dacheng Tao. Benchmarking reasoning robustness in large language models, 2025. URL https://arxiv.org/abs/2503.04550

  74. [75]

    Making small language models efficient reasoners: Intervention, supervision, reinforcement

    Xuechen Zhang, Zijian Huang, Chenshun Ni, Ziyang Xiong, Jiasi Chen, and Samet Oymak. Making small language models efficient reasoners: Intervention, supervision, reinforcement, 2025. URL https://arxiv.org/abs/2505.07961.

  76. [77]

    I cast detect thoughts: Learning to converse and guide with intents and theory-of-mind in Dungeons and Dragons

    Pei Zhou, Andrew Zhu, Jennifer Hu, Jay Pujara, Xiang Ren, Chris Callison-Burch, Yejin Choi, and Prithviraj Ammanabrolu. I cast detect thoughts: Learning to converse and guide with intents and theory-of-mind in Dungeons and Dragons. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association fo...

  77. [78]

    Large reasoning models in agent scenarios: Exploring the necessity of reasoning capabilities

    Xueyang Zhou, Guiyao Tie, Guowen Zhang, Weidong Wang, Zhigang Zuo, Di Wu, Duanfeng Chu, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. Large reasoning models in agent scenarios: Exploring the necessity of reasoning capabilities, 2025. URL https://arxiv.org/abs/2503.11074.

    Appendix for Bayesian Social Deduction with Graph-Informed Language Models

    A The...

  78. [82]

    Suspicious or trustworthy behaviors: When looking for suspicious behavior, consider whether players are behaving suspiciously or illogically with respect to chat messages or party votes, e.g. always rejecting party votes unless they or another specific player is in it, or making assertions without evidence (especially early in the game such as on Que...

  79. [85]

    Remember that you are on the evil side - if it is possible, aim to include at least one evil player while maintaining your cover

    Suspicious or trustworthy behaviors: Present your team selection to the other players with a detailed rationale based on past events and player behaviors. Remember that you are on the evil side - if it is possible, aim to include at least one evil player while maintaining your cover. However, you will have to justify the proposed team to the other play...

  80. [90]

    The team must consist of <team size> players. When looking for suspicious behavior, consider whether players are behaving suspiciously or illogically with respect to chat messages or party votes, e.g. always rejecting party votes unless they or another specific player is in it, or making assertions without evidence (especially early in the game such ...

Showing first 80 references.