12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation
Pith reviewed 2026-05-09 17:03 UTC · model grok-4.3
The pith
Lightly aligned LLMs change votes and occasionally reach the minority verdict in a 12 Angry Men jury simulation, while heavily aligned ones remain anchored to their initial positions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the intensity of RLHF alignment training, rather than raw model capability, determines deliberative flexibility in multi-agent settings. In the 12-agent jury benchmark, the lighter-aligned model averaged between 2.0 and 6.0 vote changes per run depending on condition and reached a not-guilty verdict in one of three runs of the no-initial-vote condition, whereas the heavier-aligned model averaged only one vote change across all conditions and never reached that outcome. Seventeen of the eighteen total runs (2 models × 3 conditions × 3 replications) ended in a hung jury, showing that current models anchor to initial positions instead of performing the gradual minority-to-majority persuasion that defines the original story.
What carries the argument
The multi-agent jury deliberation benchmark that assigns twelve agents film-faithful personas from 12 Angry Men and tracks vote changes under different initial conditions and prompt instructions.
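The paper's implementation is not reproduced here; the following is a minimal sketch of what such a deliberation loop could look like, with a stubbed `chat` call and simplified vote parsing standing in for the real multi-agent framework (all names and the vote format are illustrative, not the authors'):

```python
# Minimal sketch of a 12-juror deliberation loop, assuming each reply
# declares a vote as "VOTE: GUILTY" or "VOTE: NOT_GUILTY".
import re

PERSONAS = [f"Juror {i}" for i in range(1, 13)]  # film-faithful persona prompts in practice

def chat(persona: str, transcript: list[str]) -> str:
    """Stand-in for a real LLM call; deterministic stub so the sketch runs."""
    return "I keep my current position. VOTE: GUILTY"

def run_deliberation(condition: str, max_rounds: int = 10) -> dict:
    transcript: list[str] = []
    votes: dict[str, str] = {}
    if condition != "no_initial_vote":            # the film opens 11-1 guilty
        votes = {p: "GUILTY" for p in PERSONAS}
        votes["Juror 8"] = "NOT_GUILTY"           # the lone dissenter
    vote_changes = 0
    for _ in range(max_rounds):
        for persona in PERSONAS:
            reply = chat(persona, transcript)
            transcript.append(f"{persona}: {reply}")
            m = re.search(r"VOTE:\s*(GUILTY|NOT_GUILTY)", reply)
            if m:
                if persona in votes and votes[persona] != m.group(1):
                    vote_changes += 1             # the benchmark's key internal metric
                votes[persona] = m.group(1)
        if len(votes) == len(PERSONAS) and len(set(votes.values())) == 1:
            return {"verdict": votes["Juror 1"], "vote_changes": vote_changes}
    return {"verdict": "HUNG", "vote_changes": vote_changes}
```

The returned record is the unit everything else in the review aggregates over: verdicts give the hung-jury rate, and per-run vote-change counts give the condition means.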
If this is right
- Most current LLMs will anchor to an initial majority vote and fail to replicate the gradual persuasion that occurs in human deliberation groups.
- Instructions that encourage open-mindedness are followed by lighter-aligned models but ignored by heavier-aligned ones.
- Benchmarks for multi-agent debate should measure internal dynamics such as vote changes, not only final consensus.
- Achieving human-like flexibility in AI decision panels may require deliberate choices about alignment intensity rather than simply scaling capability.
Where Pith is reading between the lines
- Developers could tune alignment strength specifically for collaborative tasks where persuasion and compromise are desired.
- The same benchmark could be applied to other group decision settings, such as policy panels or medical boards, to test whether anchoring is a general limitation.
- If flexibility tracks alignment intensity, then safety training may need to be paired with techniques that preserve openness to counter-arguments.
- Extending the setup to more than two models would help test whether the pattern tracks alignment intensity across a wider range of systems.
Load-bearing premise
Observed differences in vote-changing behavior are caused by the differing intensities of alignment training rather than by other unmeasured differences between the two models.
What would settle it
Repeating the experiment with a third model whose alignment intensity is known to lie between the two tested models, then checking whether its vote-change rate falls between GPT-4o's observed 1.0 and Llama-4-Scout's 2.0–6.0.
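As a sketch, assuming per-run vote-change counts can be extracted as in the loop above, the check reduces to a few lines; the interval bounds come from the reported results, while `run_fn` is hypothetical:

```python
# Sketch of the settling experiment: does an intermediate-alignment model's
# mean vote-change rate land between the two reported extremes?
from statistics import mean

def settle(run_fn, n_reps: int = 3) -> str:
    """`run_fn()` is assumed to return the vote-change count of one run."""
    m = mean(run_fn() for _ in range(n_reps))
    low, high = 1.0, 6.0  # GPT-4o's mean and Llama-4-Scout's highest condition mean
    if low < m < high:
        return f"mean {m:.1f} is intermediate: consistent with the alignment hypothesis"
    return f"mean {m:.1f} falls outside ({low}, {high}): evidence against it"
```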
Original abstract
What if the twelve jurors of Sidney Lumet's 12 Angry Men (1957) were not men, but large language models? Would the one juror who disagrees still be able to change everyone's mind? This paper instantiates that scenario as a multi-agent benchmark for LLM deliberation: twelve agents, each conditioned on a film-faithful persona, debate the film's murder case using a multi-agent framework. Two models representing opposite ends of the RLHF spectrum are tested: GPT-4o (closed-source, heavy alignment) and Llama-4-Scout (open-weight, lighter alignment), across three conditions (baseline, open-minded prompt, no initial vote), with N = 3 replications per cell (18 runs total). Three findings emerge. (i) Seventeen of eighteen runs end in a hung jury (a state where the jury fails to reach a unanimous verdict); the film's central event, gradual minority-to-majority persuasion, almost never occurs, indicating that anchoring is the dominant failure mode of current LLMs in this setting. (ii) The two models exhibit sharply different internal dynamics: GPT-4o produces a mean of 1.0 vote changes per run across all conditions, while Llama-4-Scout ranges from 2.0 (baseline) to 6.0 (open-minded prompt), and is the only model to reach a NOT_GUILTY verdict (1 of 3 runs in the no-initial-vote condition). The same "open-minded" instruction is internalized by Llama and ignored by GPT-4o. (iii) This asymmetry suggests that the intensity of RLHF alignment training, not model capability, is the primary determinant of deliberative flexibility in multi-agent settings. Flexibility, not capability, tracks human deliberation. The work is framed as an exploratory study and discusses implications for jury-of-LLMs evaluation and multi-agent debate.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an exploratory multi-agent simulation benchmark based on the plot of 12 Angry Men, instantiating 12 LLM agents with distinct personas to deliberate a murder case. Using two models (GPT-4o and Llama-4-Scout) across three prompt conditions with three replications each, the authors find that 17 out of 18 simulations result in hung juries, with GPT-4o showing minimal vote changes (mean 1.0 per run) compared to Llama-4-Scout (condition means of 2.0–6.0), and attribute the difference in deliberative flexibility to RLHF intensity.
Significance. This work offers a novel, interpretable benchmark for multi-agent LLM deliberation using a well-known cinematic scenario. If the central observations are robust, it suggests that alignment training may constrain flexibility in group decision-making more than raw capability, which could inform the design of multi-agent systems for tasks requiring persuasion and consensus-building. The paper is positioned as exploratory and highlights the dominance of anchoring effects.
Major comments (2)
- [Abstract and Results] The attribution in finding (iii) that RLHF intensity is the primary determinant of deliberative flexibility is underdetermined. The comparison is between two architecturally dissimilar models (closed-source GPT-4o vs. open-weight Llama-4-Scout) without matched controls for parameter scale, pre-training data, or other factors, making it impossible to isolate RLHF as the causal variable.
- [Experimental Design] With only N=3 replications per cell and no reported statistical tests, error bars, or variance measures, the reported means (e.g., 1.0 vs 2.0–6.0 vote changes) lack robustness. A single stochastic run could dominate the averages, undermining claims about dominant failure modes and model differences; see the sketch after this list for how little an exact test can resolve at this sample size.
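To make the sample-size point concrete: with three runs per model in a condition, an exact permutation test over the six per-run counts has only C(6,3) = 20 relabelings, so the smallest attainable one-sided p-value is 1/20 = 0.05. A minimal sketch, using hypothetical counts consistent with the reported means rather than the paper's actual per-run data:

```python
# Exact permutation test on per-run vote-change counts from two models.
from itertools import combinations
from statistics import mean

def perm_test(a: list[float], b: list[float]) -> float:
    """One-sided exact permutation p-value for mean(b) - mean(a)."""
    observed = mean(b) - mean(a)
    pooled = a + b
    n_a = len(a)
    count = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        grp_a = [pooled[i] for i in idx]
        grp_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        total += 1
        if mean(grp_b) - mean(grp_a) >= observed:
            count += 1
    return count / total

# Hypothetical counts matching the reported means (GPT-4o 1.0, Llama 6.0):
print(perm_test([1, 1, 1], [5, 6, 7]))  # -> 0.05, the floor at N = 3 per cell
```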
Minor comments (2)
- [Abstract] The abstract states 'Llama-4-Scout ranges from 2.0 (baseline) to 6.0 (open-minded prompt)', but it would be clearer to specify whether these are means or ranges across the three runs.
- Consider adding a table summarizing vote changes per condition and model for better readability; the partial version sketched after this list shows what the abstract alone already supports.
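For illustration, the numbers stated in the abstract fill in this much of such a table; cells the abstract does not report are marked as such:

| Model | Condition | Mean vote changes | Verdicts (3 runs) |
|---|---|---|---|
| GPT-4o | all three | 1.0 (pooled across conditions) | all hung |
| Llama-4-Scout | baseline | 2.0 | all hung |
| Llama-4-Scout | open-minded prompt | 6.0 | all hung |
| Llama-4-Scout | no initial vote | not reported | 2 hung, 1 NOT_GUILTY |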
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our exploratory multi-agent benchmark. We address each major point below, agreeing where revisions are needed to strengthen the claims and acknowledging the preliminary nature of the study.
Point-by-point responses
Referee: [Abstract and Results] The attribution in finding (iii) that RLHF intensity is the primary determinant of deliberative flexibility is underdetermined. The comparison is between two architecturally dissimilar models (closed-source GPT-4o vs. open-weight Llama-4-Scout) without matched controls for parameter scale, pre-training data, or other factors, making it impossible to isolate RLHF as the causal variable.
Authors: We agree that the comparison does not isolate RLHF intensity as a causal variable, given the multiple differences between the models. The manuscript frames the work as exploratory and presents finding (iii) as a suggestion based on the observed asymmetry rather than a definitive causal conclusion. We will revise the abstract, results, and discussion to soften the language, explicitly noting that this is a hypothesis and that controlled follow-up experiments with matched models would be required to test the role of alignment training. A new limitations section will discuss potential confounds, including architecture and pre-training data.
Revision: partial
Referee: [Experimental Design] With only N=3 replications per cell and no reported statistical tests, error bars, or variance measures, the reported means (e.g., 1.0 vs 2.0–6.0 vote changes) lack robustness. A single stochastic run could dominate the averages, undermining claims about dominant failure modes and model differences.
Authors: We acknowledge that N=3 per cell is small and that the lack of variance reporting and statistical tests limits the strength of the quantitative comparisons. As an exploratory study, the design focused on observing the phenomenon across conditions. We will revise the experimental design and results sections to report full per-run outcomes, include standard deviations or ranges for vote changes, add error bars to any summary figures, and increase replications to N=5 per cell (a sketch of such per-run reporting follows this response). We will also add a note on stochasticity and temper claims about model differences to reflect the preliminary sample size. The consistent pattern of 17/18 hung juries provides qualitative support for anchoring even with the current N.
Revision: yes
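A minimal sketch of the per-run reporting committed to above, assuming one CSV row per run; the field names are illustrative, not the paper's:

```python
# One record per simulation run, then per-(model, condition) summary stats.
import csv
from collections import defaultdict
from statistics import mean, stdev

FIELDS = ["model", "condition", "replication", "verdict", "vote_changes"]

def load(path: str) -> list[dict]:
    """Read per-run records from a CSV whose header matches FIELDS."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def summarize(rows: list[dict]) -> dict:
    """Per-(model, condition) mean, SD, and range of vote changes."""
    cells = defaultdict(list)
    for r in rows:
        cells[(r["model"], r["condition"])].append(int(r["vote_changes"]))
    return {
        cell: {"mean": mean(v),
               "sd": stdev(v) if len(v) > 1 else 0.0,
               "range": (min(v), max(v))}
        for cell, v in cells.items()
    }
```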
Circularity Check
No significant circularity: purely empirical simulation with direct reporting of run outcomes
Full rationale
The paper contains no derivations, equations, fitted parameters, or mathematical claims. All reported findings (vote-change counts, hung-jury rates, model asymmetries) are direct empirical observations from 18 simulation runs. The central interpretive claim about RLHF intensity is presented as a post-hoc suggestion rather than a derived result, with no self-citation chains, uniqueness theorems, or ansatzes invoked to support it. The study is self-contained against its own experimental data.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: LLM agents conditioned on film-faithful personas will produce deliberation dynamics comparable to human jurors when given the same case facts.