Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia
Pith reviewed 2026-05-18 13:21 UTC · model grok-4.3
The pith
An analytical formula with three parameters per model predicts mafia win rates across all language model combinations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the Mini-Mafia setting the mafia win-rate p is given by the analytical expression logit(p) = v × (m - d), where the parameters m, d, and v quantify the mafioso's deception, the detective's disclosure, and the villager's detection. Bayesian inference from observed gameplay yields these parameters for each model, allowing accurate prediction of all tournament outcomes using only 3I parameters for I models and yielding a 76.6 percent reduction in Brier score relative to a random baseline in cross-validation.
What carries the argument
The logit-linear formula logit(p) = v × (m - d) that collapses the multi-turn game to the outcome probability of one critical exchange among mafioso, detective, and villager.
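As a minimal sketch of how the formula turns three parameters into a win probability, the snippet below inverts the logit with a numerically stable sigmoid. The function names and the parameter values are illustrative assumptions, not taken from the paper or its code.

```python
import math

def sigmoid(x: float) -> float:
    """Inverse of the logit, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def mafia_win_prob(m: float, d: float, v: float) -> float:
    """Mafia win-rate p implied by logit(p) = v * (m - d)."""
    return sigmoid(v * (m - d))

# Hypothetical parameter values, chosen only for illustration.
p = mafia_win_prob(m=0.8, d=0.3, v=1.5)
print(round(logit(p), 6))  # 0.75, i.e. v * (m - d)
```

Two properties follow directly from the form: when m equals d the prediction is 0.5 regardless of v, and when v is 0 the villager's choice carries no information, so the prediction is again 0.5.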
If this is right
- For any collection of I models, all I^3 possible three-player tournaments can be predicted from only 3I parameters.
- Models can be ranked by role-specific strengths, such as Grok 3 Mini as the strongest detector and Claude Sonnet 4 as near-random in detection.
- The Mini-Mafia Benchmark supplies a data-efficient method to evaluate language-model interactive capabilities without exhaustive matchup simulation.
- The analytical description supports principled comparisons that isolate deception, disclosure, and detection contributions to collective results.
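The parameter-count claim in the first bullet can be made concrete with a short sketch. It assumes each tournament is indexed by a (mafioso, detective, villager) assignment of models; the model names and parameter values below are hypothetical placeholders, not fitted estimates.

```python
import math
from itertools import product

# Hypothetical per-model parameters: 3 scalars for each of I = 3 models.
params = {
    "model_a": {"m": 0.4, "d": 0.1, "v": 1.2},
    "model_b": {"m": 0.9, "d": 0.6, "v": 0.3},
    "model_c": {"m": 0.2, "d": 0.8, "v": 2.0},
}

def predict_all(params: dict) -> dict:
    """Predict the mafia win-rate for every (mafioso, detective, villager) triple."""
    table = {}
    for maf, det, vil in product(params, repeat=3):
        x = params[vil]["v"] * (params[maf]["m"] - params[det]["d"])
        table[(maf, det, vil)] = 1.0 / (1.0 + math.exp(-x))
    return table

table = predict_all(params)
n_scalars = sum(len(p) for p in params.values())
print(n_scalars, len(table))  # 9 27
```

Only the mafioso's m, the detective's d, and the villager's v enter each prediction, which is why the table grows as I^3 while the parameter count grows as 3I.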
Where Pith is reading between the lines
- If the single-exchange reduction holds, analogous critical-point analyses could be applied to other multi-agent tasks such as negotiation or debate.
- The three parameters might guide selection of model teams for tasks that require complementary social skills.
- Extending the same fitting procedure to larger player counts or altered game rules would test how far the low-dimensional description remains valid.
- The framework implies that many collective outcomes in language-model social games may be governed by simple additive effects rather than higher-order emergent interactions.
Load-bearing premise
The full dynamics of the game reduce without loss of accuracy to a single critical exchange whose outcome probability is exactly captured by the linear form in logit space.
What would settle it
A new collection of repeated games among previously unseen model triples in which the observed mafia win frequencies deviate substantially from the probabilities predicted by the fitted parameters m, d, and v would falsify the claim.
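One way to operationalize "deviate substantially" is a standardized gap between observed and predicted wins for each held-out triple. The sketch below uses a normal approximation to the binomial; that choice, the function name, and the example counts are assumptions of this note and are not the paper's stated procedure.

```python
import math

def deviation_z(wins: int, games: int, p_pred: float) -> float:
    """Standardized gap between observed mafia wins and the predicted count.

    Uses the binomial mean games * p and variance games * p * (1 - p).
    """
    expected = games * p_pred
    std = math.sqrt(games * p_pred * (1.0 - p_pred))
    return (wins - expected) / std

# Hypothetical held-out triple: 100 games, predicted mafia win-rate 0.5.
print(deviation_z(70, 100, 0.5))  # 4.0
```

Many triples with large absolute values would count against the formula, while values scattered around zero would be consistent with it. The approximation is poor when games is small or p_pred is close to 0 or 1.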
Original abstract
Large language models are increasingly deployed in multi-agent settings whose outcomes hinge on social intelligence, motivating evaluations of their interactive capabilities; yet existing studies remain overwhelmingly empirical, leaving us without a theoretical understanding of how agent interactions determine collective outcomes. To address this, we introduce Mini-Mafia, a four-player simplification of the social deduction game Mafia in which a fixed night phase reduces the game to a single critical exchange among a mafioso, a detective, and a villager. In this setting, we show that the mafia win-rate p is predicted by the analytical formula logit(p) = v × (m - d), where m, d, and v represent the mafioso's deception, the detective's disclosure, and the villager's detection capabilities. We turn this analytical framework into the Mini-Mafia Benchmark, where Bayesian inference over gameplay data yields per-model estimates of the intrinsic parameters m, d, and v. For I models, only 3I parameters suffice to predict the outcomes of all I^3 tournament combinations; and in 5-fold cross-validation the formula achieves a 76.6% Brier-score reduction over a random baseline. The benchmark also reveals counterintuitive results: Grok 3 Mini is the strongest detector and GPT-5 Mini the strongest discloser, both ahead of DeepSeek V3.1, Claude Opus 4, and Claude Sonnet 4; while Claude Sonnet 4 is the weakest detector, near random chance. Together, these results show that Mini-Mafia, a simple but nontrivial multi-agent system, admits an analytical description and serves as a principled benchmark for language model interactions.
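For readers unfamiliar with the headline metric, the sketch below shows one common way to compute a Brier-score reduction. It assumes the random baseline is a constant forecast of 0.5 and that the reduction is 1 minus the ratio of model score to baseline score; the abstract does not spell out either definition, and the forecasts and outcomes here are made up.

```python
def brier(preds: list, outcomes: list) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(preds, outcomes)) / len(outcomes)

def brier_reduction(preds: list, outcomes: list, baseline: float = 0.5) -> float:
    """Relative improvement over a constant forecast (assumed baseline: 0.5)."""
    base = brier([baseline] * len(outcomes), outcomes)
    return 1.0 - brier(preds, outcomes) / base

# Hypothetical forecasts and outcomes (1 = mafia win).
outcomes = [1, 0, 1, 1]
preds = [0.9, 0.1, 0.8, 0.7]
print(round(brier_reduction(preds, outcomes), 4))  # 0.85
```

Under these assumptions a reduction of 1 means perfect forecasts and 0 means no better than the constant baseline, so the reported 76.6% would sit between those two extremes.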
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Mini-Mafia, a four-player simplification of the Mafia game reduced to a single critical exchange among a mafioso, detective, and villager. It claims that the mafia win-rate p follows the analytical formula logit(p) = v × (m - d), where m, d, and v are per-model parameters for deception, disclosure, and detection. Bayesian inference on gameplay data yields these parameters; for I models only 3I scalars suffice to predict all I^3 tournament outcomes. Five-fold cross-validation shows a 76.6% Brier-score reduction over a random baseline, and the benchmark produces model rankings (e.g., Grok 3 Mini strongest detector, GPT-5 Mini strongest discloser, Claude Sonnet 4 weakest detector).
Significance. If the result holds, the work supplies a rare analytical, parameter-efficient account of LLM social intelligence in a multi-agent setting, moving beyond purely empirical evaluations. The cross-validation evidence, the reduction from I^3 to 3I parameters, and the falsifiable predictions constitute clear strengths. The counterintuitive capability rankings further illustrate the benchmark's potential utility for the field.
major comments (2)
- [Abstract / analytical framework] Abstract and the section presenting the analytical framework: the formula logit(p) = v × (m - d) is asserted to capture the outcome of the full game via a single critical exchange, yet no derivation steps or explicit checks confirming the absence of residual pairwise interactions or higher-order terms across model triples are provided. This assumption is load-bearing for the claim that 3I parameters predict every I^3 combination without loss of accuracy.
- [Cross-validation results] Cross-validation results (5-fold): while the 76.6% Brier reduction is reported, the manuscript does not include residual diagnostics or comparisons against models that add interaction terms to test whether the exact multiplicative logit form is misspecified versus merely adequate within the observed range. Such a check is required to substantiate that no model-pair-specific deviations exist.
minor comments (2)
- Notation for the parameters m, d, v would benefit from an early explicit table or equation block defining each quantity and its estimation procedure.
- The discussion of counterintuitive rankings (Grok 3 Mini, GPT-5 Mini, Claude variants) could be expanded with brief qualitative examples of the observed behaviors to aid interpretability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and describe the revisions we will make to strengthen the manuscript.
- Referee: [Abstract / analytical framework] Abstract and the section presenting the analytical framework: the formula logit(p) = v × (m - d) is asserted to capture the outcome of the full game via a single critical exchange, yet no derivation steps or explicit checks confirming the absence of residual pairwise interactions or higher-order terms across model triples are provided. This assumption is load-bearing for the claim that 3I parameters predict every I^3 combination without loss of accuracy.
Authors: We agree that the current presentation would benefit from explicit derivation steps. The formula is derived from the game's reduction to a single decisive exchange in which the mafioso's deception advantage is modulated by the detective's disclosure and the villager's detection; under the assumption that other interactions are negligible due to the fixed night phase and role assignments, the net logit effect takes the multiplicative form v × (m - d). In the revision we will add a dedicated derivation subsection that walks through this reasoning from the game rules, together with post-hoc residual analyses across held-out model triples to verify that higher-order terms do not materially improve fit or predictive accuracy. revision: yes
- Referee: [Cross-validation results] Cross-validation results (5-fold): while the 76.6% Brier reduction is reported, the manuscript does not include residual diagnostics or comparisons against models that add interaction terms to test whether the exact multiplicative logit form is misspecified versus merely adequate within the observed range. Such a check is required to substantiate that no model-pair-specific deviations exist.
Authors: We accept that residual diagnostics and explicit misspecification tests against richer models are necessary to substantiate the claim. In the revised manuscript we will include (i) residual plots and summary statistics from the Bayesian posterior predictive checks and (ii) a direct comparison of the base model against versions augmented with pairwise interaction terms (e.g., m·d, m·v). These additions will quantify whether the simple multiplicative form is adequate or whether systematic deviations appear for particular model combinations. revision: yes
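The proposed comparison against interaction-augmented models can be sketched as follows. The augmented form, with hypothetical coefficients a and b on the m·d and m·v terms named in the rebuttal, is an assumption of this note; the rebuttal does not specify how the terms would enter. The Bernoulli log-likelihood gives one scale on which the two forms could be compared.

```python
import math

def logit_base(m: float, d: float, v: float) -> float:
    """Base form: logit(p) = v * (m - d)."""
    return v * (m - d)

def logit_augmented(m: float, d: float, v: float,
                    a: float = 0.0, b: float = 0.0) -> float:
    """Base form plus hypothetical pairwise terms a*m*d and b*m*v."""
    return v * (m - d) + a * m * d + b * m * v

def log_sigmoid(x: float) -> float:
    """Numerically stable log of the sigmoid function."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def log_likelihood(logits: list, outcomes: list) -> float:
    """Bernoulli log-likelihood; outcome 1 means the mafia won."""
    return sum(
        log_sigmoid(x) if y == 1 else log_sigmoid(-x)
        for x, y in zip(logits, outcomes)
    )
```

With a = b = 0 the augmented form collapses to the base form, so the base model is nested inside it. If fitted values of a and b stay near zero and the held-out log-likelihood barely improves, that would support the simple form over the observed range.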
Circularity Check
No circularity: functional form is a modeling assumption validated by generalization on held-out data rather than a tautological fit.
The paper presents logit(p) = v × (m - d) as an analytical formula that collapses the multi-turn game to a single critical exchange and estimates the three per-model scalars via Bayesian inference on observed gameplay. It then tests predictive accuracy on held-out tournament combinations through 5-fold cross-validation, reporting a 76.6% Brier-score reduction. Because the evaluation explicitly withholds data from parameter estimation and measures out-of-sample performance, the reported predictions are not equivalent to the inputs by construction. No self-citation chain, uniqueness theorem, or ansatz imported from prior author work is invoked to justify the functional form; the derivation chain therefore remains self-contained against the external empirical benchmark.
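The held-out logic described above rests on a k-fold split. The sketch below partitions record indices into five folds; whether the paper splits individual games or whole model triples is not stated here, so the unit of splitting, the function names, and the seed are assumptions.

```python
import random

def k_fold_indices(n: int, k: int = 5, seed: int = 0) -> list:
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_test_splits(n: int, k: int = 5, seed: int = 0):
    """Yield (train, test) index lists, holding out each fold exactly once."""
    folds = k_fold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Hypothetical dataset of 23 records.
for train, test in train_test_splits(23):
    assert not set(train) & set(test)  # no record is both fitted and scored
```

Parameters would be estimated on each train list and scored on the matching test list, which is what keeps the reported Brier reduction from being an in-sample fit.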
Axiom & Free-Parameter Ledger
free parameters (3)
- m (mafioso deception)
- d (detective disclosure)
- v (villager detection)
axioms (2)
- domain assumption: The four-player game with fixed night phase reduces to a single critical exchange among mafioso, detective, and villager.
- ad hoc to paper: Win probability follows exactly logit(p) = v × (m - d) with no additional interaction terms.
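A property of the functional form that bears on the parameter count: v × (m - d) is unchanged if every m and d is shifted by the same constant, or if every v is multiplied by a factor while every m and d is divided by it. The sketch below checks both with arbitrary illustrative numbers. How the benchmark fixes this freedom (for example through priors or a reference model) is not stated on this page.

```python
def logit_p(m: float, d: float, v: float) -> float:
    """Right-hand side of logit(p) = v * (m - d)."""
    return v * (m - d)

# Arbitrary illustrative values, not fitted estimates.
m, d, v = 0.7, 0.2, 1.6
c, k = 3.0, 4.0

shifted = logit_p(m + c, d + c, v)   # common shift of m and d
scaled = logit_p(m / k, d / k, v * k)  # rescale v against m and d

print(abs(shifted - logit_p(m, d, v)) < 1e-12)  # True
print(abs(scaled - logit_p(m, d, v)) < 1e-12)   # True
```

Read this way, the 3I scalars are pinned down by the data only up to one global shift and one global scale, so comparisons of m, d, or v across models are meaningful relative to whatever convention fixes those two degrees of freedom.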
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Paper passage: "the mafia win-rate p is predicted by the analytical formula logit(p) = v × (m - d)"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Paper passage: "For I models, only 3I parameters suffice to predict the outcomes of all I^3 tournament combinations"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models
  LLM moral robustness under persona role-play is largely determined by model family, with Claude models most consistent, while susceptibility shows little family dependence.