arxiv: 2509.23023 · v3 · pith:43ALCLR2new · submitted 2025-09-27 · 💻 cs.AI

Deceive, Detect, and Disclose: Large Language Models Play Mini-Mafia

Pith reviewed 2026-05-18 13:21 UTC · model grok-4.3

classification 💻 cs.AI
keywords Mini-Mafia · large language models · social deduction · multi-agent interaction · deception detection · analytical prediction · win-rate model · benchmark

The pith

An analytical formula with three parameters per model predicts mafia win rates across all language model combinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Mini-Mafia, a four-player simplification of the social deduction game Mafia in which a fixed night phase reduces multi-turn play to one critical exchange. It shows that the mafia win probability p follows the formula logit(p) = v × (m - d), where m, d, and v are model-specific measures of the mafioso's deception, the detective's disclosure, and the villager's detection. Bayesian inference from gameplay data estimates these three numbers per model, so every possible three-model matchup can be predicted without testing each one directly. A sympathetic reader would care because the result replaces exhaustive empirical trials with a compact theoretical account of how agent capabilities shape group outcomes in interactive settings.

Core claim

In the Mini-Mafia setting the mafia win-rate p is given by the analytical expression logit(p) = v × (m - d), where the parameters m, d, and v quantify the mafioso's deception, the detective's disclosure, and the villager's detection. Bayesian inference from observed gameplay yields these parameters for each model, allowing accurate prediction of all tournament outcomes using only 3I parameters for I models and yielding a 76.6 percent reduction in Brier score relative to a random baseline in cross-validation.

What carries the argument

The logit-linear formula logit(p) = v × (m - d) that collapses the multi-turn game to the outcome probability of one critical exchange among mafioso, detective, and villager.

If this is right

  • For any collection of I models, all I^3 possible three-player tournaments can be predicted from only 3I parameters.
  • Models can be ranked by role-specific strengths, such as Grok 3 Mini as the strongest detector and Claude Sonnet 4 as near-random in detection.
  • The Mini-Mafia Benchmark supplies a data-efficient method to evaluate language-model interactive capabilities without exhaustive matchup simulation.
  • The analytical description supports principled comparisons that isolate deception, disclosure, and detection contributions to collective results.
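
The parameter-count claim in the first bullet can be sketched as follows. The model names and parameter values are made up, and the snippet assumes each matchup is an ordered triple (mafioso model, detective model, villager model) with one model per role.

```python
import itertools
import math

# Hypothetical per-model parameters; not estimates from the paper.
params = {
    "model_a": {"m": 0.8, "d": 0.2, "v": 1.5},
    "model_b": {"m": -0.3, "d": 0.9, "v": 0.7},
    "model_c": {"m": 0.1, "d": -0.5, "v": 2.0},
}

def win_prob(m, d, v):
    """Mafia win probability implied by logit(p) = v * (m - d)."""
    return 1.0 / (1.0 + math.exp(-v * (m - d)))

# One prediction per ordered (mafioso, detective, villager) assignment.
predictions = {
    (maf, det, vil): win_prob(params[maf]["m"], params[det]["d"], params[vil]["v"])
    for maf, det, vil in itertools.product(params, repeat=3)
}

n_models = len(params)
n_parameters = 3 * n_models   # 3I scalars
n_matchups = len(predictions)  # I^3 matchups
```

With three models this gives 9 parameters covering 27 matchups; the gap widens quickly, since the parameter count grows linearly in I while the matchup count grows cubically.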

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the single-exchange reduction holds, analogous critical-point analyses could be applied to other multi-agent tasks such as negotiation or debate.
  • The three parameters might guide selection of model teams for tasks that require complementary social skills.
  • Extending the same fitting procedure to larger player counts or altered game rules would test how far the low-dimensional description remains valid.
  • The framework implies that many collective outcomes in language-model social games may be governed by simple additive effects rather than higher-order emergent interactions.

Load-bearing premise

The full dynamics of the game reduce without loss of accuracy to a single critical exchange whose outcome probability is exactly captured by the linear form in logit space.

What would settle it

A new collection of repeated games among previously unseen model triples in which the observed mafia win frequencies deviate substantially from the probabilities predicted by the fitted parameters m, d, and v would falsify the claim.
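
A rough sketch of such a check, assuming fitted point estimates are available for each held-out triple: compare the observed mafia win frequency with the predicted probability, scaled by the binomial standard error. The data, the threshold of 3, and the function names are hypothetical, and the sketch ignores posterior uncertainty in m, d, and v, which a full analysis would need to propagate.

```python
import math

def predicted_p(m, d, v):
    """Mafia win probability implied by logit(p) = v * (m - d)."""
    return 1.0 / (1.0 + math.exp(-v * (m - d)))

def z_score(wins, games, p):
    """Standardized gap between the observed win frequency and prediction p."""
    se = math.sqrt(p * (1.0 - p) / games)
    return (wins / games - p) / se

# Hypothetical held-out triples: (m, d, v, mafia wins, games played).
held_out = [
    (0.8, 0.2, 1.5, 71, 100),
    (-0.3, 0.9, 0.7, 30, 100),
    (0.1, -0.5, 2.0, 40, 100),
]

flagged = []
for m, d, v, wins, games in held_out:
    z = z_score(wins, games, predicted_p(m, d, v))
    if abs(z) > 3.0:  # illustrative threshold, not taken from the paper
        flagged.append((m, d, v, round(z, 2)))
```

In this toy data only the third triple is flagged: its prediction is roughly 0.77 against an observed frequency of 0.40. Many such flags across unseen triples would count against the formula; isolated ones could also reflect noisy parameter estimates.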

Figures

Figures reproduced from arXiv: 2509.23023 by Davi Bastos Costa, Renato Vicente.

Figure 1. Deceive performance: (a) Aggregated scores across all backgrounds, Eq. …
Figure 2. Detect performance: (a) aggregated scores across all backgrounds, Eq. …
Figure 3. Disclose performance: (a) aggregated scores across all backgrounds, Eq. …
Figure 4. Complete mafioso performance results across all detective and villager backgrounds.
Figure 5. Complete villager performance results across all mafioso and detective backgrounds.
Figure 6. Complete detective performance results across all mafioso and villager backgrounds.
Figure 7. Comparison of methodological approaches for deceive, detect and disclose capability …

Original abstract

Large language models are increasingly deployed in multi-agent settings whose outcomes hinge on social intelligence, motivating evaluations of their interactive capabilities; yet existing studies remain overwhelmingly empirical, leaving us without a theoretical understanding of how agent interactions determine collective outcomes. To address this, we introduce Mini-Mafia, a four-player simplification of the social deduction game Mafia in which a fixed night phase reduces the game to a single critical exchange among a mafioso, a detective, and a villager. In this setting, we show that the mafia win-rate p is predicted by the analytical formula logit(p) = v × (m - d), where m, d, and v represent the mafioso's deception, the detective's disclosure, and the villager's detection capabilities. We turn this analytical framework into the Mini-Mafia Benchmark, where Bayesian inference over gameplay data yields per-model estimates of the intrinsic parameters m, d, and v. For I models, only 3I parameters suffice to predict the outcomes of all I^3 tournament combinations; and in 5-fold cross-validation the formula achieves a 76.6% Brier-score reduction over a random baseline. The benchmark also reveals counterintuitive results: Grok 3 Mini is the strongest detector and GPT-5 Mini the strongest discloser, both ahead of DeepSeek V3.1, Claude Opus 4, and Claude Sonnet 4; while Claude Sonnet 4 is the weakest detector, near random chance. Together, these results show that Mini-Mafia, a simple but nontrivial multi-agent system, admits an analytical description and serves as a principled benchmark for language model interactions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Mini-Mafia, a four-player simplification of the Mafia game reduced to a single critical exchange among a mafioso, a detective, and a villager. It claims that the mafia win-rate p follows the analytical formula logit(p) = v × (m - d), where m, d, and v are per-model parameters for deception, disclosure, and detection. Bayesian inference on gameplay data yields these parameters; for I models, only 3I scalars suffice to predict all I^3 tournament outcomes. 5-fold cross-validation shows a 76.6% Brier-score reduction over a random baseline, and the benchmark produces model rankings (e.g., Grok 3 Mini strongest detector, GPT-5 Mini strongest discloser, Claude Sonnet 4 weakest detector).

Significance. If the result holds, the work supplies a rare analytical, parameter-efficient account of LLM social intelligence in a multi-agent setting, moving beyond purely empirical evaluations. The cross-validation evidence, the reduction from I^3 to 3I parameters, and the falsifiable predictions constitute clear strengths. The counterintuitive capability rankings further illustrate the benchmark's potential utility for the field.

major comments (2)
  1. [Abstract / analytical framework] Abstract and the section presenting the analytical framework: the formula logit(p) = v × (m - d) is asserted to capture the outcome of the full game via a single critical exchange, yet no derivation steps or explicit checks confirming the absence of residual pairwise interactions or higher-order terms across model triples are provided. This assumption is load-bearing for the claim that 3I parameters predict every I^3 combination without loss of accuracy.
  2. [Cross-validation results] Cross-validation results (5-fold): while the 76.6% Brier reduction is reported, the manuscript does not include residual diagnostics or comparisons against models that add interaction terms to test whether the exact multiplicative logit form is misspecified versus merely adequate within the observed range. Such a check is required to substantiate that no model-pair-specific deviations exist.
minor comments (2)
  1. Notation for the parameters m, d, v would benefit from an early explicit table or equation block defining each quantity and its estimation procedure.
  2. The discussion of counterintuitive rankings (Grok 3 Mini, GPT-5 Mini, Claude variants) could be expanded with brief qualitative examples of the observed behaviors to aid interpretability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and describe the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract / analytical framework] Abstract and the section presenting the analytical framework: the formula logit(p) = v × (m - d) is asserted to capture the outcome of the full game via a single critical exchange, yet no derivation steps or explicit checks confirming the absence of residual pairwise interactions or higher-order terms across model triples are provided. This assumption is load-bearing for the claim that 3I parameters predict every I^3 combination without loss of accuracy.

    Authors: We agree that the current presentation would benefit from explicit derivation steps. The formula is derived from the game's reduction to a single decisive exchange in which the mafioso's deception advantage is modulated by the detective's disclosure and the villager's detection; under the assumption that other interactions are negligible due to the fixed night phase and role assignments, the net logit effect takes the multiplicative form v × (m - d). In the revision we will add a dedicated derivation subsection that walks through this reasoning from the game rules, together with post-hoc residual analyses across held-out model triples to verify that higher-order terms do not materially improve fit or predictive accuracy. revision: yes

  2. Referee: [Cross-validation results] Cross-validation results (5-fold): while the 76.6% Brier reduction is reported, the manuscript does not include residual diagnostics or comparisons against models that add interaction terms to test whether the exact multiplicative logit form is misspecified versus merely adequate within the observed range. Such a check is required to substantiate that no model-pair-specific deviations exist.

    Authors: We accept that residual diagnostics and explicit misspecification tests against richer models are necessary to substantiate the claim. In the revised manuscript we will include (i) residual plots and summary statistics from the Bayesian posterior predictive checks and (ii) a direct comparison of the base model against versions augmented with pairwise interaction terms (e.g., m·d, m·v). These additions will quantify whether the simple multiplicative form is adequate or whether systematic deviations appear for particular model combinations. revision: yes
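
A minimal sketch of the kind of comparison described in (ii), under strong simplifying assumptions: m, d, and v are held fixed at hypothetical values rather than re-fit jointly, a single m·d interaction with coefficient beta is added to the logit, and beta is chosen by grid search. The data and all names are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_lik(data, beta):
    """Binomial log-likelihood (up to a constant) of
    logit(p) = v*(m - d) + beta*m*d; beta = 0 recovers the base model."""
    total = 0.0
    for m, d, v, wins, games in data:
        p = sigmoid(v * (m - d) + beta * m * d)
        total += wins * math.log(p) + (games - wins) * math.log(1.0 - p)
    return total

# Hypothetical counts: (m, d, v, mafia wins, games played).
data = [
    (0.8, 0.2, 1.5, 71, 100),
    (-0.3, 0.9, 0.7, 30, 100),
    (0.1, -0.5, 2.0, 77, 100),
    (0.8, 0.9, 0.7, 48, 100),
]

base_ll = log_lik(data, beta=0.0)
grid = [i / 100.0 for i in range(-200, 201)]  # includes beta = 0 exactly
best_beta = max(grid, key=lambda b: log_lik(data, b))
best_ll = log_lik(data, best_beta)
# Likelihood-ratio statistic for the one extra parameter.
lr_stat = 2.0 * (best_ll - base_ll)
```

A small `lr_stat` would suggest the interaction term adds little on this data, while a large one would point to misspecification. Held-out predictive scores, as in the cross-validation setup, would be a more direct test than in-sample likelihood.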

Circularity Check

0 steps flagged

No circularity: functional form is a modeling assumption validated by generalization on held-out data rather than a tautological fit.

full rationale

The paper presents logit(p) = v × (m - d) as an analytical formula that collapses the multi-turn game to a single critical exchange and estimates the three per-model scalars via Bayesian inference on observed gameplay. It then tests predictive accuracy on held-out tournament combinations through 5-fold cross-validation, reporting a 76.6% Brier-score reduction. Because the evaluation explicitly withholds data from parameter estimation and measures out-of-sample performance, the reported predictions are not equivalent to the inputs by construction. No self-citation chain, uniqueness theorem, or ansatz imported from prior author work is invoked to justify the functional form; the derivation chain therefore remains self-contained against the external empirical benchmark.
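
The evaluation metric can be sketched as below. Two assumptions are made here that the source does not spell out: the "random baseline" is taken to be a constant prediction of 0.5, and the folds are formed by shuffling record indices. Function names are hypothetical.

```python
import random

def brier(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)

def brier_reduction(probs, outcomes, baseline_p=0.5):
    """Fractional Brier-score reduction relative to a constant baseline."""
    base = brier([baseline_p] * len(outcomes), outcomes)
    return 1.0 - brier(probs, outcomes) / base

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint folds after a seeded shuffle."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Toy outcomes: 1 = mafia win, 0 = town win.
outcomes = [1, 0, 1, 1, 0]
perfect = brier_reduction([1.0, 0.0, 1.0, 1.0, 0.0], outcomes)
uninformed = brier_reduction([0.5] * 5, outcomes)
folds = k_fold_indices(n=23, k=5)
```

On this scale a reduction of 0 means no better than the constant baseline and 1 means perfect prediction, so the reported 76.6% sits between the two. In a full pipeline, parameters would be fit on four folds and `brier_reduction` evaluated on the fifth, rotating through all five.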

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The framework rests on three fitted parameters per model plus the structural assumption that game outcomes reduce to the stated logit form; no new physical entities are postulated.

free parameters (3)
  • m (mafioso deception)
    Intrinsic parameter estimated via Bayesian inference from gameplay data for each model's performance in the mafioso role.
  • d (detective disclosure)
    Intrinsic parameter estimated via Bayesian inference from gameplay data for each model's performance in the detective role.
  • v (villager detection)
    Intrinsic parameter estimated via Bayesian inference from gameplay data for each model's performance in the villager role.
axioms (2)
  • [domain assumption] The four-player game with fixed night phase reduces to a single critical exchange among mafioso, detective, and villager.
    This reduction is invoked to derive the analytical formula for win probability.
  • [ad hoc to paper] Win probability follows exactly logit(p) = v × (m - d) with no additional interaction terms.
    Presented as the predictive analytical formula without further derivation shown in the abstract.
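
One consequence of the stated form is worth writing out. It is an observation that follows from the formula alone, not a claim taken from the paper: the win probability is unchanged by a common shift of m and d, and by a rescaling of (m, d) that is compensated in v.

```latex
\[
\operatorname{logit}(p) = v\,(m - d)
\quad\text{is invariant under}\quad
(m, d, v) \mapsto (m + c,\; d + c,\; v)
\quad\text{and}\quad
(m, d, v) \mapsto \bigl(s\,m,\; s\,d,\; v/s\bigr),
\qquad c \in \mathbb{R},\; s \neq 0 .
\]
```

Win rates alone therefore pin down the parameters only up to an origin for m and d and an overall scale, so any estimation procedure has to fix these, for example through priors or normalization constraints. How the paper does so is not stated in the material summarized here.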

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Moral Susceptibility and Robustness under Persona Role-Play in Large Language Models

    cs.CL · 2025-11 · unverdicted · novelty 6.0

    LLM moral robustness under persona role-play is largely determined by model family with Claude models most consistent, while susceptibility shows little family dependence.
