pith. machine review for the scientific record

arxiv: 2605.01986 · v1 · submitted 2026-05-03 · 💻 cs.AI

Recognition: unknown

12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 17:03 UTC · model grok-4.3

classification 💻 cs.AI
keywords: multi-agent LLMs · deliberation · alignment training · jury simulation · decision-making · anchoring · vote changes · 12 Angry Men

The pith

Lightly aligned LLMs change votes and reach minority outcomes in a 12 Angry Men jury simulation while heavily aligned ones remain anchored.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets up twelve LLM agents as jurors in a recreation of the 12 Angry Men deliberation, each assigned a distinct film persona, to test whether they can persuade one another toward a unanimous verdict. It runs the scenario across three conditions (a baseline, an open-mindedness instruction, and a no-initial-vote variant) with two models that differ mainly in how much alignment training they received. Nearly every run ends without a verdict because agents stick to their starting positions. The less intensively aligned model shifts its stance several times per run and occasionally arrives at the not-guilty verdict the film reaches, while the more heavily aligned model almost never changes its vote. The pattern points to alignment intensity as the main control on whether models can engage in the back-and-forth reasoning that human groups use.
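To make the setup concrete, here is a minimal sketch of the kind of round-robin deliberation loop such a benchmark implies; the `Juror` class, the `query_model` placeholder, the vote parser, and the termination rule are illustrative assumptions, not the paper's actual framework.

```python
# Minimal sketch of a 12-juror deliberation loop (illustrative only).
# `query_model` stands in for a call to GPT-4o or Llama-4-Scout; the
# personas, vote parser, and termination rule are assumptions, not the
# paper's implementation.
from dataclasses import dataclass, field

@dataclass
class Juror:
    name: str                      # e.g. "Juror 8"
    persona: str                   # film-faithful persona description
    vote: str | None               # "GUILTY", "NOT_GUILTY", or None (no initial vote)
    history: list = field(default_factory=list)

def query_model(model: str, persona: str, transcript: list[str]) -> str:
    """Placeholder for an LLM API call returning the juror's next statement."""
    raise NotImplementedError

def parse_vote(text: str) -> str | None:
    """Extract a verdict from free text; checks the longer phrase first."""
    upper = text.upper()
    if "NOT GUILTY" in upper or "NOT_GUILTY" in upper:
        return "NOT_GUILTY"
    if "GUILTY" in upper:
        return "GUILTY"
    return None

def deliberate(jurors: list[Juror], model: str, max_turns: int = 60) -> tuple[str, int]:
    transcript: list[str] = []
    vote_changes = 0
    for turn in range(max_turns):
        juror = jurors[turn % len(jurors)]            # simple round-robin speaking order
        reply = query_model(model, juror.persona, transcript)
        transcript.append(f"{juror.name}: {reply}")
        new_vote = parse_vote(reply)
        if new_vote is not None and new_vote != juror.vote:
            vote_changes += 1                         # the internal metric the paper tracks
            juror.vote = new_vote
        votes = {j.vote for j in jurors}
        if len(votes) == 1 and None not in votes:     # unanimity reached
            return votes.pop(), vote_changes
    return "HUNG_JURY", vote_changes                  # no unanimous verdict within budget
```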

Core claim

The central claim is that the intensity of RLHF alignment training, rather than raw model capability, determines deliberative flexibility in multi-agent settings. In the 12-agent jury benchmark, the lighter-aligned model (Llama-4-Scout) averaged between 2.0 and 6.0 vote changes per run depending on condition and reached a not-guilty verdict in one run of the no-initial-vote condition, whereas the heavier-aligned model (GPT-4o) averaged 1.0 vote changes across all conditions and never reached that outcome. Seventeen of the eighteen runs ended in a hung jury, showing that current models anchor to initial positions instead of performing the gradual minority-to-majority persuasion that defines the original story.

What carries the argument

The multi-agent jury deliberation benchmark that assigns twelve agents film-faithful personas from 12 Angry Men and tracks vote changes under different initial conditions and prompt instructions.
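A compact way to picture the grid the benchmark sweeps is below; the condition names and 18-run count come from the abstract, while the flag names and dictionary structure are assumptions for illustration.

```python
# Hypothetical sketch of the experiment grid implied by the abstract:
# 2 models x 3 conditions x 3 replications = 18 runs.
MODELS = ["gpt-4o", "llama-4-scout"]
CONDITIONS = {
    "baseline":        {"open_minded_prompt": False, "initial_votes": True},
    "open_minded":     {"open_minded_prompt": True,  "initial_votes": True},
    "no_initial_vote": {"open_minded_prompt": False, "initial_votes": False},
}
N_REPS = 3

runs = [(m, c, r) for m in MODELS for c in CONDITIONS for r in range(N_REPS)]
assert len(runs) == 18  # matches the 18 runs reported in the abstract
```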

If this is right

  • Most current LLMs will anchor to an initial majority vote and fail to replicate the gradual persuasion that occurs in human deliberation groups.
  • Instructions that encourage open-mindedness are followed by lighter-aligned models but ignored by heavier-aligned ones.
  • Benchmarks for multi-agent debate should measure internal dynamics such as vote changes, not only final consensus.
  • Achieving human-like flexibility in AI decision panels may require deliberate choices about alignment intensity rather than simply scaling capability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could tune alignment strength specifically for collaborative tasks where persuasion and compromise are desired.
  • The same benchmark could be applied to other group decision settings, such as policy panels or medical boards, to test whether anchoring is a general limitation.
  • If flexibility tracks alignment intensity, then safety training may need to be paired with techniques that preserve openness to counter-arguments.
  • Extending the setup to more than two models would help isolate whether the pattern is driven by alignment level across a wider range of systems.

Load-bearing premise

Observed differences in vote-changing behavior are caused by the differing intensities of alignment training rather than by other unmeasured differences between the two models.

What would settle it

Repeating the experiment with a third model whose alignment intensity is known and intermediate between the two tested models, then checking whether its vote-change rate falls between the observed values of 1.0 and 2–6.
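As a sketch of how that check could be scored, the snippet below compares a hypothetical intermediate-alignment model against the two observed regimes per condition; the two reported means come from the abstract, and the third model's numbers are placeholders to be filled in after running the experiment.

```python
# Illustrative consistency check for the proposed follow-up experiment.
# Under the alignment-intensity hypothesis, an intermediate-alignment model's
# mean vote changes should land between the two observed values per condition.
reported = {
    "baseline":    {"gpt-4o": 1.0, "llama-4-scout": 2.0},  # means from the abstract
    "open_minded": {"gpt-4o": 1.0, "llama-4-scout": 6.0},
}
intermediate_model = {"baseline": None, "open_minded": None}  # placeholders, to be measured

for condition, means in reported.items():
    x = intermediate_model[condition]
    if x is not None:
        lo, hi = sorted(means.values())
        print(condition, "consistent with hypothesis:", lo <= x <= hi)
```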

Figures

Figures reproduced from arXiv: 2605.01986 by Ahmet Bahaddin Ersoz.

Figure 1. Verdict distribution by model and condition.
Figure 2. Mean vote changes per run by model and condition.
Figure 3. Asymmetric effect of the “open-minded” prompt instruction.
Figure 4. Effect of removing initial vote conditioning.
Figure 5. Mean turns to termination by model and condition; early stopping (3 consecutive …).
Original abstract

What if the twelve jurors of Sidney Lumet's 12 Angry Men (1957) were not men, but large language models? Would the one juror who disagrees still be able to change everyone's mind? This paper instantiates that scenario as a multi-agent benchmark for LLM deliberation: twelve agents, each conditioned on a film-faithful persona, debate the film's murder case using a multi-agent framework. Two models representing opposite ends of the RLHF spectrum are tested: GPT-4o (closed-source, heavy alignment) and Llama-4-Scout (open-weight, lighter alignment), across three conditions (baseline, open-minded prompt, no initial vote), with N = 3 replications per cell (18 runs total). Three findings emerge. (i) Seventeen of eighteen runs end in a hung jury (a state where the jury fails to reach a unanimous verdict); the film's central event, gradual minority-to-majority persuasion, almost never occurs, indicating that anchoring is the dominant failure mode of current LLMs in this setting. (ii) The two models exhibit sharply different internal dynamics: GPT-4o produces a mean of 1.0 vote changes per run across all conditions, while Llama-4-Scout ranges from 2.0 (baseline) to 6.0 (open-minded prompt), and is the only model to reach a NOT_GUILTY verdict (1 of 3 runs in the no-initial-vote condition). The same "open-minded" instruction is internalized by Llama and ignored by GPT-4o. (iii) This asymmetry suggests that the intensity of RLHF alignment training, not model capability, is the primary determinant of deliberative flexibility in multi-agent settings. Flexibility, not capability, tracks human deliberation. The work is framed as an exploratory study and discusses implications for jury-of-LLMs evaluation and multi-agent debate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an exploratory multi-agent simulation benchmark based on the plot of 12 Angry Men, instantiating 12 LLM agents with distinct personas to deliberate a murder case. Using two models (GPT-4o and Llama-4-Scout) across three prompt conditions with three replications each, the authors find that 17 out of 18 simulations result in hung juries, with GPT-4o showing minimal vote changes (mean 1.0) compared to Llama-4-Scout (2-6), and attribute the difference in deliberative flexibility to RLHF intensity.

Significance. This work offers a novel, interpretable benchmark for multi-agent LLM deliberation using a well-known cinematic scenario. If the central observations are robust, it suggests that alignment training may constrain flexibility in group decision-making more than raw capability, which could inform the design of multi-agent systems for tasks requiring persuasion and consensus-building. The paper is positioned as exploratory and highlights the dominance of anchoring effects.

major comments (2)
  1. [Abstract and Results] The attribution in finding (iii) that RLHF intensity is the primary determinant of deliberative flexibility is underdetermined. The comparison is between two architecturally dissimilar models (closed-source GPT-4o vs. open-weight Llama-4-Scout) without matched controls for parameter scale, pre-training data, or other factors, making it impossible to isolate RLHF as the causal variable.
  2. [Experimental Design] With only N=3 replications per cell and no reported statistical tests, error bars, or variance measures, the reported means (e.g., 1.0 vs 2-6 vote changes) lack robustness. A single stochastic run could dominate the averages, undermining claims about dominant failure modes and model differences.
minor comments (2)
  1. [Abstract] The abstract states 'Llama-4-Scout ranges from 2.0 (baseline) to 6.0 (open-minded prompt)', but it would be clearer to specify if these are means or ranges across the three runs.
  2. Consider adding a table summarizing vote changes per condition and model for better readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our exploratory multi-agent benchmark. We address each major point below, agreeing where revisions are needed to strengthen the claims and acknowledging the preliminary nature of the study.

Point-by-point responses
  1. Referee: [Abstract and Results] The attribution in finding (iii) that RLHF intensity is the primary determinant of deliberative flexibility is underdetermined. The comparison is between two architecturally dissimilar models (closed-source GPT-4o vs. open-weight Llama-4-Scout) without matched controls for parameter scale, pre-training data, or other factors, making it impossible to isolate RLHF as the causal variable.

    Authors: We agree that the comparison does not isolate RLHF intensity as a causal variable, given the multiple differences between the models. The manuscript frames the work as exploratory and presents finding (iii) as a suggestion based on the observed asymmetry rather than a definitive causal conclusion. We will revise the abstract, results, and discussion to soften the language, explicitly noting that this is a hypothesis and that controlled follow-up experiments with matched models would be required to test the role of alignment training. A new limitations section will discuss potential confounds including architecture and pre-training data. revision: partial

  2. Referee: [Experimental Design] With only N=3 replications per cell and no reported statistical tests, error bars, or variance measures, the reported means (e.g., 1.0 vs 2-6 vote changes) lack robustness. A single stochastic run could dominate the averages, undermining claims about dominant failure modes and model differences.

    Authors: We acknowledge that N=3 per cell is small and that the lack of variance reporting and statistical tests limits the strength of the quantitative comparisons. As an exploratory study, the design focused on observing the phenomenon across conditions. We will revise the experimental design and results sections to report full per-run outcomes, include standard deviations or ranges for vote changes, add error bars to any summary figures, and increase replications to N=5 per cell. We will also add a note on stochasticity and temper claims about model differences to reflect the preliminary sample size. The consistent pattern of 17/18 hung juries provides qualitative support for anchoring even with the current N. revision: yes
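As a sketch of the per-cell summary the rebuttal proposes, computed from per-run vote-change counts, see below; the counts are placeholders, not data from the paper.

```python
# Sketch of per-cell reporting: per-run vote-change counts with mean,
# standard deviation, and range. The counts are placeholders, not the
# paper's data.
from statistics import mean, stdev

vote_changes = {
    ("gpt-4o",        "baseline"): [1, 1, 1],   # hypothetical per-run counts
    ("llama-4-scout", "baseline"): [1, 2, 3],
}

for (model, condition), runs in vote_changes.items():
    print(f"{model:15s} {condition:10s} mean={mean(runs):.1f} "
          f"sd={stdev(runs):.1f} range={min(runs)}-{max(runs)}")
```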

Circularity Check

0 steps flagged

No significant circularity: purely empirical simulation with direct reporting of run outcomes

full rationale

The paper contains no derivations, equations, fitted parameters, or mathematical claims. All reported findings (vote-change counts, hung-jury rates, model asymmetries) are direct empirical observations from 18 simulation runs. The central interpretive claim about RLHF intensity is presented as a post-hoc suggestion rather than a derived result, with no self-citation chains, uniqueness theorems, or ansatzes invoked to support it. The study is self-contained against its own experimental data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study rests on domain assumptions about persona fidelity and prompt interpretation rather than any free parameters or new entities.

axioms (1)
  • domain assumption: LLM agents conditioned on film-faithful personas will produce deliberation dynamics comparable to human jurors when given the same case facts.
    Invoked in the setup description to justify the benchmark construction.

pith-pipeline@v0.9.0 · 5646 in / 1252 out tokens · 38534 ms · 2026-05-09T17:03:40.375310+00:00 · methodology

