
arXiv: 2606.27909 · v1 · pith:RZZ6SOU7new · submitted 2026-06-26 · 💻 cs.CL · cs.AI · cs.GT · cs.MA

Triadic Werewolf: A Jester Role for Multi-Hop Theory of Mind in LLMs

Pith reviewed 2026-06-29 04:46 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.GT · cs.MA
keywords theory of mind · large language models · social deduction · multi-agent reasoning · Werewolf game · incentive structure · Jester role

The pith

A Jester role that wins by being voted out forces language models to track three opposing incentives at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard theory-of-mind tests for language models use two-sided deduction games where surface cues often suffice. This paper adds a third faction whose payoff is inverted: the Jester gains when others treat it as suspicious. In the resulting triadic game, models must simulate not only what each side believes but how each side's utility differs. Experiments across GPT-4.1, DeepSeek-V3.1 and Llama-3.3-70B show the Jester winning 60-70 percent of games while werewolves never exceed 20 percent, with GPT-4.1 wolves voting the Jester out on day 1 in 60-70 percent of games. The triadic structure therefore surfaces reasoning steps that remain hidden when only two utility functions are present.

Core claim

The Jester-augmented Werewolf game creates a triadic incentive structure in which optimal play requires simultaneous modeling of three mutually incompatible utility functions. Across sixty games the Jester wins 60-70 percent of the time, werewolves never exceed 20 percent, and GPT-4.1 wolves cast self-defeating day-one votes against the Jester in 60-70 percent of matches. Self-learning improves DeepSeek and Llama but harms GPT-4.1, with the performance cost borne by villagers rather than werewolves. Only DeepSeek acquires the strategy of appearing suspicious without appearing deliberately so.

What carries the argument

The Jester role, whose win condition is elimination by vote and therefore inverts the value of peer suspicion relative to both villagers and werewolves.
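
The inversion can be made concrete with a toy payoff table. The sketch below is illustrative only and is not taken from the paper: the `Role` enum, the `exile_payoff` function, and the +1/0/-1 values are hypothetical, and it assumes that a voted-out Jester wins outright (so both other factions lose) and that the Jester is indifferent to any other exile.

```python
from enum import Enum


class Role(Enum):
    VILLAGER = "villager"
    WEREWOLF = "werewolf"
    JESTER = "jester"


def exile_payoff(voter: Role, exiled: Role) -> int:
    """Toy payoff to the voter's faction when a player with role `exiled` is voted out.

    +1 = good for the voter's faction, -1 = bad, 0 = neutral (illustrative values only).
    """
    if exiled is Role.JESTER:
        # Assumption: a voted-out Jester wins outright, so both other factions lose.
        return 1 if voter is Role.JESTER else -1
    if voter is Role.JESTER:
        # Assumption: exiling anyone else neither wins nor loses the game for the Jester.
        return 0
    if exiled is Role.WEREWOLF:
        return 1 if voter is Role.VILLAGER else -1
    # Remaining case: a Villager is exiled.
    return 1 if voter is Role.WEREWOLF else -1


# Payoff table: rows are the voter's role, columns are the exiled role.
for voter in Role:
    row = {exiled.value: exile_payoff(voter, exiled) for exiled in Role}
    print(f"{voter.value:>9}: {row}")
```

Under these assumptions the Jester column is the only one where villagers and werewolves share a sign, which is why a werewolf vote against the Jester is self-defeating even though it looks like a safe vote against a suspicious player.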

If this is right

  • Self-learning on the triadic game improves some models but degrades others, with the degradation falling on the villager faction.
  • Only DeepSeek, of the three tested models, acquires the strategy of generating suspicion without obvious intent.
  • Dyadic deduction tasks systematically understate the depth of reasoning required once three utility functions must be tracked simultaneously.
  • Werewolf win rates never exceed 20 percent for any of the three models once the Jester is present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks limited to two-player hidden-role games may systematically overestimate models' readiness for real multi-party interactions.
  • Training loops that reward only pairwise deception may leave models unprepared for roles whose payoffs are deliberately misaligned with both other factions.
  • If the pattern holds under varied prompt phrasings, future multi-agent evaluations should routinely include at least one inverted-utility agent.

Load-bearing premise

The observed voting patterns and win rates reflect genuine limits in multi-hop theory-of-mind simulation rather than prompt wording, rule misunderstanding, or the particular self-learning procedure.

What would settle it

A controlled run in which the same models, given identical rules but with every mention of the Jester's win condition removed or reversed, produce near-zero early self-defeating votes and near-zero Jester wins would falsify the claim that the triadic structure itself exposes the deficit.

Figures

Figures reproduced from arXiv: 2606.27909 by Avni Mittal.

Figure 1: Overview of the WOLF triadic-incentive pipeline. A 10-player Werewolf game is instantiated with three …
Figure 2: Faction win-rate by model under Jester-learning OFF …
Figure 3: Per-role deception-type distribution. Each row is one (model, condition) pair. Bars are 100%-stacked …
Figure 4: (a) Theory-of-mind order of accumulated Jester lessons across the three models (141 entries total): …
Figure 5: ON − OFF difference in average end-of-game cross-perception suspicion. Rows are observer roles, columns are target roles. Red indicates ON higher. Llama-3.3-70B (right) shows the largest second-order effect: learned Jester behavior reshapes how all other roles perceive the Doctor.
Figure 6: Rounds-to-resolution by winner and jester-learning condition, faceted by model. Jester wins are nearly …
Figure 7: Six-axis behavioral fingerprint of each model under Jester-learning OFF (grey) and ON (orange). Axes (all …
Figure 8: Night-action target distributions, paired OFF/ON per model. The Doctor self-protects in 80–100% of …
Figure 9: Mechanistic findings derived from per-statement deception logs and per-game outcome traces. (a) Mean …
Figure 10: Auxiliary per-statement and per-game findings. (a) Per-game Jester average bid versus eventual …
Figure 11: Temporal dynamics across three time axes. (a) Mean observer-attracted suspicion by role over rounds. …
Figure 12: Per-model OFF → ON slope charts of two voting-failure metrics. (a) Fraction of games in which at least one Werewolf voted the Jester on day 1 (Wilson 95% CI bands). (b) Mean rank of each voter’s actual vote target within their own suspicion ranking (1 = top suspect; lower = tighter alignment, ± SEM).
Figure 13: (a) Substitution effect: who pays for Jester wins under the self-learning intervention. (b) Jester fate …
Figure 14: Dumbbell plots of OFF → ON for peer-detected deceptions per statement (left) and mean suspicion attracted (right), broken down by role and model.
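
The vote-alignment metric named in the Figure 12(b) caption (rank of a voter's actual target within that voter's own suspicion ranking, 1 = top suspect) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function names, the dictionary layout of suspicion scores, and the tie-breaking rule (ties keep dictionary order) are all hypothetical.

```python
def vote_target_rank(suspicion: dict[str, float], vote_target: str) -> int:
    """1-based rank of `vote_target` when candidates are sorted by descending suspicion.

    Assumption: ties are broken by dictionary insertion order (Python's sort is stable).
    """
    ranked = sorted(suspicion, key=lambda name: suspicion[name], reverse=True)
    return ranked.index(vote_target) + 1


def mean_vote_rank(ballots: list[tuple[dict[str, float], str]]) -> float:
    """Average rank over (suspicion scores, actual vote target) pairs, one per voter."""
    ranks = [vote_target_rank(scores, target) for scores, target in ballots]
    return sum(ranks) / len(ranks)


# Hypothetical ballots: the first voter targets their top suspect,
# the second voter targets their second-ranked suspect.
ballots = [
    ({"A": 0.9, "B": 0.4, "C": 0.1}, "A"),
    ({"A": 0.2, "B": 0.7, "C": 0.5}, "C"),
]
print(mean_vote_rank(ballots))  # 1.5
```

A mean of 1.0 would mean every voter voted for the player they rated most suspicious; larger values indicate votes that diverge from the voter's own stated suspicions.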
Original abstract

Theory-of-mind evaluations of large language models typically use dyadic social-deduction games, where every observable cue points to a single hidden side, so a model with strong language priors can score well without ever simulating opponents' incentives. We extend the Werewolf game with a Jester, a third faction whose utility on peer suspicion is inverted because it wins by being voted out, so optimal play requires reasoning across three opposing utility functions. Across 60 games on GPT-4.1, DeepSeek-V3.1, and Llama-3.3-70B with Jester self-learning on and off, the Jester wins 60-70% of games while Werewolves never exceed 20%, and GPT-4.1 wolves vote the Jester out on day 1 in 60-70% of games, a strictly self-defeating action. Self-learning helps DeepSeek and Llama but hurts GPT-4.1, with the cost landing on Villagers rather than Werewolves. Only DeepSeek learns the subtle strategy of looking suspicious without looking intentionally suspicious, and it gains the most from the loop. Triadic incentive structure exposes a layer of multi-agent reasoning that dyadic deduction games leave invisible.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper extends the Werewolf social-deduction game by adding a Jester role whose win condition is inverted (wins by being voted out), creating a triadic incentive structure that requires LLMs to simulate three opposing utility functions. Experiments across 60 games with GPT-4.1, DeepSeek-V3.1, and Llama-3.3-70B (self-learning on/off) report Jester win rates of 60-70%, Werewolf win rates never exceeding 20%, GPT-4.1 wolves voting the Jester out on day 1 in 60-70% of games, and differential effects of self-learning across models, with only DeepSeek learning to appear suspicious without seeming intentional.

Significance. If the empirical patterns hold after controls, the triadic setup offers a useful probe for multi-hop ToM that standard dyadic games may miss, and the self-learning results provide a concrete signal about which models can acquire subtle multi-agent strategies. The work is empirical rather than axiomatic, so its value rests on the robustness of the game transcripts and statistical reporting.

major comments (3)
  1. [Abstract/Methods] Abstract and Methods: the reported 60-70% Jester win rates and 60-70% early-vote statistics are given without error bars, number of independent runs, or any description of how the self-learning loop was implemented (prompts, update rule, or averaging procedure). These omissions are load-bearing because the central claim attributes the patterns to a multi-hop ToM deficit rather than implementation artifacts.
  2. [Results/Discussion] Results/Discussion: no controls are described for rule comprehension (e.g., utility-function quizzes, paraphrased rule variants, or a dyadic baseline using identical phrasing). Without such checks, the self-defeating wolf votes cannot be unambiguously linked to failure to simulate three opposing utilities rather than incomplete parsing of the Jester's inverted win condition.
  3. [Results] Results: the claim that 'only DeepSeek learns the subtle strategy of looking suspicious without looking intentionally suspicious' is presented without quantitative metrics or transcript examples that would allow readers to verify the distinction from other models' behavior.
minor comments (1)
  1. [Abstract] Abstract: the total of 60 games is stated but the per-model and per-condition breakdown is not given, making it hard to interpret the aggregate percentages.
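
To make the first major comment concrete, the sketch below computes a Wilson 95% score interval for a single win rate. It is illustrative only: the per-cell sample size of 10 games is an assumption (60 games split evenly over three models and two conditions), since the text above does not give the breakdown, and `wilson_interval` is a hypothetical helper rather than anything from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed cell: 7 Jester wins out of 10 games (a 70% win rate).
low, high = wilson_interval(7, 10)
print(f"{low:.3f} to {high:.3f}")  # 0.397 to 0.892
```

With only 10 games in a cell, a 70% win rate is compatible with a true rate anywhere from roughly 40% to 89%, which is why the referee treats the missing intervals and run counts as load-bearing.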

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. The comments highlight important gaps in statistical reporting, experimental controls, and evidentiary support for specific claims. We address each point below and commit to revisions that strengthen the manuscript without altering its core findings.

Point-by-point responses
  1. Referee: [Abstract/Methods] Abstract and Methods: the reported 60-70% Jester win rates and 60-70% early-vote statistics are given without error bars, number of independent runs, or any description of how the self-learning loop was implemented (prompts, update rule, or averaging procedure). These omissions are load-bearing because the central claim attributes the patterns to a multi-hop ToM deficit rather than implementation artifacts.

    Authors: We agree these details are necessary for reproducibility. The 60 games were run as independent trials; we will add error bars (standard error across games), explicitly state the run count, and insert a new Methods subsection detailing the self-learning prompts, update rule (e.g., post-game reflection and prompt revision), and averaging procedure. These additions will appear in both the main text and appendix. revision: yes

  2. Referee: [Results/Discussion] Results/Discussion: no controls are described for rule comprehension (e.g., utility-function quizzes, paraphrased rule variants, or a dyadic baseline using identical phrasing). Without such checks, the self-defeating wolf votes cannot be unambiguously linked to failure to simulate three opposing utilities rather than incomplete parsing of the Jester's inverted win condition.

    Authors: The concern is valid; explicit comprehension checks were not reported. We will add a dyadic baseline condition (identical prompt phrasing but without the Jester) and report results from paraphrased rule variants to confirm that models correctly parse the inverted win condition. These controls will be included in the revised Results section to better isolate the triadic reasoning deficit. revision: yes

  3. Referee: [Results] Results: the claim that 'only DeepSeek learns the subtle strategy of looking suspicious without looking intentionally suspicious' is presented without quantitative metrics or transcript examples that would allow readers to verify the distinction from other models' behavior.

    Authors: We accept that the claim requires stronger substantiation. The revision will include quantitative metrics (e.g., average suspicion scores and linguistic markers of intentionality extracted from model outputs) and selected transcript excerpts placed in an appendix, allowing readers to directly compare the behavioral patterns across models. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical game-play results with no derivation chain

Full rationale

The paper presents win-rate and voting statistics from LLM gameplay in an extended Werewolf setup. No equations, fitted parameters, predictions, or first-principles derivations appear; the reported 60-70% Jester win rates and day-1 voting patterns are direct experimental outputs, not quantities that reduce by construction to inputs defined inside the paper. Self-citations are absent from the provided text, and the central claim rests on observable game outcomes rather than any self-referential loop or renamed known result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that game win rates measure multi-hop ToM and on the invented Jester entity to create triadic incentives.

axioms (1)
  • [domain assumption] LLM performance in social deduction games measures theory-of-mind capabilities.
    Invoked to interpret win rates and voting patterns as evidence of reasoning deficits.
invented entities (1)
  • Jester role (no independent evidence)
    Purpose: to create a third faction with inverted utility on peer suspicion.
    New entity introduced to force triadic reasoning; no independent evidence provided beyond the game results.

pith-pipeline@v0.9.1-grok · 6472 in / 1186 out tokens · 44250 ms · 2026-06-29T04:46:28.695138+00:00 · methodology

