
arxiv: 2510.15746 · v2 · submitted 2025-10-17 · 💻 cs.CL · cs.AI

LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation

Pith reviewed 2026-05-18 06:13 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM evaluation · mutual evaluation · game theory · human alignment · peer review · voting algorithms · self-play · open-ended tasks

The pith

LLMs can evaluate one another through mutual peer review aggregated by game-theoretic voting rules, and the resulting rankings sometimes align with human preferences on open-ended tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to test whether game theory can guide LLMs in judging each other's outputs on subjective, open-ended problems where fixed benchmarks fall short. It proposes that models review one another's answers in a self-play setup, then applies voting algorithms drawn from game theory to turn those reviews into overall rankings. These machine-generated rankings are placed side by side with rankings produced by human voters to measure agreement and disagreement. A reader would care because the approach could scale evaluation without needing ever-larger numbers of human annotators while still tracking what people actually prefer. The empirical comparison shows both points of convergence and points of divergence, which the authors treat as evidence of both the promise and the current limits of the method.

Core claim

The central claim is that automatic mutual evaluation, in which LLMs assess one another's responses via self-play and peer review, can be aggregated with game-theoretic voting algorithms to produce rankings that can be directly compared with human voting behavior, and that such comparisons reveal both convergences and divergences between the theoretical predictions and actual human judgments on open-ended tasks.

What carries the argument

Game-theoretic voting algorithms that aggregate peer assessments generated by LLMs into coherent rankings for comparison with human preferences.
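
To make the aggregation step concrete, here is a minimal sketch of one such rule, Kemeny-Young (the method named in the Figure 4 caption), implemented by brute force over all candidate orderings. The function names and the three-reviewer example are assumptions made for illustration and are not taken from the paper; the authors' implementation may differ, and brute force is only practical for a handful of models.

```python
from itertools import combinations, permutations


def kendall_tau_distance(ranking_a, ranking_b):
    """Count the item pairs that the two rankings place in opposite order."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    return sum(
        1
        for x, y in combinations(ranking_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def kemeny_young(rankings):
    """Return the ordering with the smallest total Kendall tau distance to all inputs.

    Ties are broken by taking the first minimizer in lexicographic order.
    """
    items = sorted(rankings[0])
    best = min(
        permutations(items),
        key=lambda cand: sum(kendall_tau_distance(cand, r) for r in rankings),
    )
    return list(best)


# Hypothetical peer reviews: each inner list is one reviewer's ranking, best first.
reviews = [
    ["A", "B", "C"],
    ["A", "B", "C"],
    ["B", "A", "C"],
]
consensus = kemeny_young(reviews)
print(consensus)
```

The search enumerates every permutation, so its cost grows factorially with the number of ranked models; larger pools would need an approximate or integer-programming solver.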

If this is right

  • Rankings derived from LLM mutual evaluation can serve as a partial substitute for human voting on certain open-ended tasks.
  • The method supplies a concrete way to measure how well model-generated judgments track human preferences.
  • Both agreements and disagreements between the aggregated rankings and human votes provide direct evidence of the strengths and weaknesses of mutual evaluation.
  • The framework makes it possible to test game-theoretic aggregation rules against real human data rather than against abstract theory alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the partial alignment holds across more tasks, the same voting rules could be reused to evaluate new models without fresh human data each time.
  • Divergences might point to specific biases in current LLMs that prevent full human alignment and could be targeted for improvement.
  • The approach suggests a route to self-consistent evaluation loops that grow more reliable as the underlying models improve.

Load-bearing premise

That peer assessments produced by LLMs can be combined through game-theoretic voting in a way that yields rankings meaningfully comparable to how humans would vote on the same open-ended outputs.

What would settle it

Collect a fresh set of open-ended questions, obtain both human rankings and LLM-generated peer-review rankings aggregated by the same voting rules, and find no statistically significant correlation between the two ranking orders.
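
One way such a check could be run is a permutation test on the correlation between the two ranking orders. The sketch below is an assumption about how to operationalize "statistically significant", not the paper's procedure; the helper names, the one-sided test, and the default of 10,000 shuffles are all illustrative choices.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def rank_positions(ranking, items):
    """Map each item to its position (0 = best) in the given ranking."""
    pos = {item: i for i, item in enumerate(ranking)}
    return [pos[item] for item in items]


def permutation_p_value(human_ranking, llm_ranking, n_permutations=10_000, seed=0):
    """One-sided permutation test: is the observed correlation unusually high?"""
    items = sorted(human_ranking)
    human = rank_positions(human_ranking, items)
    llm = rank_positions(llm_ranking, items)
    observed = pearson(human, llm)

    rng = random.Random(seed)
    shuffled = llm[:]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if pearson(human, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_permutations + 1)
```

Because the inputs are rank positions, the Pearson coefficient computed here coincides with Spearman's rank correlation. With only a few models per ranking the test has little power, so a non-significant result on a small pool would not by itself settle the question.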

Figures

Figures reproduced from arXiv: 2510.15746 by Gao Yang, Heyan Huang, Siyu Miao, Xinyue Liang, Yuhang Liu, Zhengyang Liu.

Figure 1: Illustration of game-theoretic peer evaluation.
Figure 2: The proposed framework for game-theoretic evaluation of LLMs.
Figure 3: Alignment with Human Judgments under Different Evaluation Protocols. This figure reports the distribution of Pearson correlation coefficients between model-generated rankings and human preferences from Chatbot Arena under four evaluation protocols: SE, PE, SIE, and SFE. In each boxplot, a higher box indicates stronger alignment with human rankings, while a shorter box implies lower variance and thus more s…
Figure 4: Alignment with Human Judgments Across Benchmarks. Each boxplot shows the distribution of Pearson correlations between human rankings and rankings aggregated by the Kemeny-Young method under the micro-level setting. Higher values indicate stronger alignment. Purple dots represent macro-level correlations for each benchmark. “CreWrite” denotes the creative writing benchmark, and “InstFol” refers to instruct…
Figure 5: Alignment with Human Judgments under Different Evaluation Protocols. This figure reports the distribution of Pearson correlation coefficients between model-generated rankings and human preferences (from Chatbot Arena) across seven benchmarks under four evaluation protocols: SE (Self-Evaluation), PE (Peer Evaluation), SIE (Self-Inclusive Evaluation), and SFE (Self-Free Evaluation). In each boxplot, a higher…
Original abstract

Ideal or real - that is the question. In this work, we explore whether principles from game theory can be effectively applied to the evaluation of large language models (LLMs). This inquiry is motivated by the growing inadequacy of conventional evaluation practices, which often rely on fixed-format tasks with reference answers and struggle to capture the nuanced, subjective, and open-ended nature of modern LLM behavior. To address these challenges, we propose a novel alternative: automatic mutual evaluation, where LLMs assess each other's output through self-play and peer review. These peer assessments are then systematically compared with human voting behavior to evaluate their alignment with human judgment. Our framework incorporates game-theoretic voting algorithms to aggregate peer reviews, enabling a principled investigation into whether model-generated rankings reflect human preferences. Empirical results reveal both convergences and divergences between theoretical predictions and human evaluations, offering valuable insights into the promises and limitations of mutual evaluation. To the best of our knowledge, this is the first work to jointly integrate mutual evaluation, game-theoretic aggregation, and human-grounded validation for evaluating the capabilities of LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a framework for automatic mutual evaluation of LLMs via self-play and peer review. Peer assessments are aggregated using game-theoretic voting algorithms and compared to human voting on open-ended tasks. The central claim is that empirical results reveal both convergences and divergences between the resulting rankings and human preferences, yielding insights into the promises and limitations of LLM mutual evaluation.

Significance. If the empirical comparisons can be shown to be robust to known LLM judgment artifacts, the approach could offer a reference-free method for evaluating subjective LLM outputs that is more aligned with human preferences than fixed benchmarks. The integration of game-theoretic aggregation is a distinctive element, but its added value over simpler averaging remains to be demonstrated.
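
For reference, the "simpler averaging" baseline mentioned above could look like the following mean-rank aggregator. This is a sketch under stated assumptions: the function name, the alphabetical tie-break, and the example reviews are hypothetical and do not come from the paper.

```python
def mean_rank_order(rankings):
    """Order items by their average position across all input rankings.

    Positions start at 0 for the best item; ties in the mean are broken
    alphabetically so that the output is deterministic.
    """
    items = sorted(rankings[0])
    total = {item: 0 for item in items}
    for ranking in rankings:
        for position, item in enumerate(ranking):
            total[item] += position
    return sorted(items, key=lambda item: (total[item] / len(rankings), item))


# Hypothetical peer reviews: each inner list is one reviewer's ranking, best first.
reviews = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "C"],
]
baseline = mean_rank_order(reviews)
print(baseline)
```

Comparing the correlation with human rankings obtained from this baseline against that obtained from a game-theoretic rule on the same judgments would be one direct way to quantify the added value the report asks about.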

major comments (2)
  1. [Abstract] Abstract and empirical results section: The claims of convergences and divergences between theoretical predictions and human evaluations are presented without any reported dataset size, number of models or tasks, statistical controls, error bars, or analysis of how post-hoc choices affected the findings. This absence makes it impossible to assess whether the reported alignment is reliable or artifact-driven.
  2. [Framework] Framework description: The game-theoretic aggregation step presupposes that LLM peer judgments are sufficiently stable and satisfy core voting axioms (transitivity, consistency). No diagnostic is reported on whether the generated assessments exhibit intransitivities, position bias, or prompt-dependent variance, which would make the aggregation an arbitrary transformation rather than a principled bridge to human preferences.
minor comments (2)
  1. [Methodology] Provide the exact prompts used for peer assessment and the precise definitions or equations for the game-theoretic voting rules applied.
  2. [Experiments] Clarify whether the human comparison data was collected on the same open-ended tasks and outputs as the LLM evaluations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We appreciate the opportunity to clarify the presentation of our empirical results and framework. Below we respond point-by-point to the major comments, indicating where we will revise the manuscript to address the concerns.

  1. Referee: [Abstract] Abstract and empirical results section: The claims of convergences and divergences between theoretical predictions and human evaluations are presented without any reported dataset size, number of models or tasks, statistical controls, error bars, or analysis of how post-hoc choices affected the findings. This absence makes it impossible to assess whether the reported alignment is reliable or artifact-driven.

    Authors: We agree that the absence of these specifics in the abstract and results section limits the ability to evaluate the reliability of the reported alignments. The current version of the manuscript contains the underlying experimental data but does not foreground these details or include the requested statistical elements. In the revised manuscript we will update the abstract to state the number of models, tasks, and human-voting dataset size, add error bars and statistical significance tests to the results figures and tables, and include a brief discussion of sensitivity to post-hoc choices such as prompt variations. These additions will allow readers to better assess whether the observed convergences and divergences are robust.

    revision: yes

  2. Referee: [Framework] Framework description: The game-theoretic aggregation step presupposes that LLM peer judgments are sufficiently stable and satisfy core voting axioms (transitivity, consistency). No diagnostic is reported on whether the generated assessments exhibit intransitivities, position bias, or prompt-dependent variance, which would make the aggregation an arbitrary transformation rather than a principled bridge to human preferences.

    Authors: The referee rightly notes that the framework relies on the stability of LLM judgments without providing supporting diagnostics. The initial submission focused on the aggregation method and its comparison to human votes but did not include explicit checks for intransitivity, position bias, or prompt variance. In the revised version we will add a dedicated subsection (or appendix) that reports these diagnostics on the collected judgments, including measured rates of intransitive cycles, results from position-swapping experiments, and variance across prompt paraphrases. We will also discuss any observed violations and their implications for the validity of the game-theoretic aggregation.

    revision: yes
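
Two of the diagnostics discussed in this exchange can be sketched in a few lines. The code below assumes pairwise verdicts are available as (winner, loser) pairs and that each pair was judged twice with the presentation order swapped; both data layouts and both function names are hypothetical and are not described in the paper.

```python
from itertools import combinations


def count_intransitive_triples(items, prefers):
    """Count item triples whose pairwise verdicts form a cycle.

    `prefers` is a set of (winner, loser) tuples, one per unordered pair.
    """
    cycles = 0
    for a, b, c in combinations(items, 3):
        forward = (a, b) in prefers and (b, c) in prefers and (c, a) in prefers
        backward = (b, a) in prefers and (c, b) in prefers and (a, c) in prefers
        if forward or backward:
            cycles += 1
    return cycles


def position_consistency(verdicts_original, verdicts_swapped):
    """Fraction of pairs with the same winner before and after swapping order.

    Each argument maps a frozenset pair of items to the winner the judge chose.
    Returns None when the two runs share no pairs.
    """
    shared = verdicts_original.keys() & verdicts_swapped.keys()
    if not shared:
        return None
    agree = sum(
        1 for pair in shared if verdicts_original[pair] == verdicts_swapped[pair]
    )
    return agree / len(shared)


# Hypothetical verdicts from a single judge model.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
print(count_intransitive_triples(["A", "B", "C"], cyclic))

original = {frozenset({"A", "B"}): "A", frozenset({"B", "C"}): "B"}
swapped = {frozenset({"A", "B"}): "B", frozenset({"B", "C"}): "B"}
print(position_consistency(original, swapped))
```

A nonzero cycle count means no single ranking can agree with every pairwise verdict, and a position consistency well below 1 would suggest that presentation order, rather than content, is driving some judgments.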

Circularity Check

0 steps flagged

No significant circularity; framework validated against external human benchmarks


The paper proposes applying game-theoretic aggregation to LLM-generated peer assessments and then compares the resulting rankings directly to independent human voting data on open-ended tasks. This comparison to external human judgments serves as an independent validation step rather than deriving the target result from fitted parameters or self-referential definitions. No load-bearing steps reduce by construction to the inputs: the abstract explicitly frames the work as revealing convergences and divergences between the framework's outputs and human evaluations, without evidence of self-definition, fitted-input predictions, or uniqueness theorems imported from the authors' prior work. The derivation remains self-contained against the external human benchmark.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claim rests on domain assumptions about LLM judgment quality and the applicability of game-theoretic voting to human alignment; no new physical entities are postulated and free parameters appear limited to voting algorithm settings.

free parameters (1)
  • Voting algorithm parameters
    Parameters controlling how peer reviews are aggregated in the game-theoretic step; exact values or fitting procedure not stated in abstract.
axioms (1)
  • Domain assumption: LLM-generated peer reviews can be aggregated to approximate human preferences on subjective tasks.
    Invoked as the basis for using mutual evaluation as a human-aligned proxy.

pith-pipeline@v0.9.0 · 5730 in / 1283 out tokens · 42277 ms · 2026-05-18T06:13:17.852769+00:00 · methodology


