Evaluating Language Models' Evaluations of Games

Adrian Weller; Cedegao E. Zhang; Graham Todd; Ionatan Kuperwajs; Joshua B. Tenenbaum; Katherine M. Collins; Lance Ying; Lionel Wong; Mauricio Barba da Costa; Prafull Sharma

arxiv: 2510.10930 · v3 · pith:GWLF3Z5Inew · submitted 2025-10-13 · 💻 cs.CL · cs.AI

Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths

Pith reviewed 2026-05-21 20:52 UTC · model grok-4.3

keywords: language models, reasoning models, board games, human alignment, game evaluation, payoff assessment, funness, AI meta-evaluation

The pith

Reasoning models produce game evaluations that align more closely with human judgments than non-reasoning language models, though this alignment decreases as models approach optimal performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes a new way to assess artificial intelligence: examining how well models evaluate games rather than how well they play them. Using a dataset of more than 100 novel board games and more than 450 human judgments, it tests model assessments of game fairness and enjoyment against both people and symbolic computational agents. A reader might care because this reveals whether AI can make the kind of judgments humans make when deciding which challenges are worthwhile. The work finds that reasoning capabilities improve the match to human views on these dimensions. Yet it also finds that better-performing models sometimes diverge more from human judgments, especially on harder-to-quantify qualities like fun.

Core claim

The central discovery is that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. This alignment follows a non-monotonic pattern where models closer to game-theoretic optimal show weaker fit to the human data. Assessments of funness display more jaggedness across different models than assessments of payoff fairness. Reasoning models also show highly variable and unpredictable resource usage when making these evaluations.

What carries the argument

A formalism that evaluates AI evaluations of games along two axes: the computational complexity of the query and the difficulty of quantifying the target property, tested via payoff and funness queries on novel board games.
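
As a rough illustration of what measuring alignment to human judgments could look like, the sketch below correlates a model's per-game scores with the mean human rating for each game. This summary does not specify the paper's actual alignment measure, so the use of Pearson correlation, the function names, and the toy numbers are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def alignment(human_ratings, model_scores):
    """Correlate a model's per-game scores with mean human ratings.

    human_ratings: dict game_id -> list of individual human ratings
    model_scores:  dict game_id -> one model score for the same query
    Only games present in both dicts are compared.
    """
    games = sorted(set(human_ratings) & set(model_scores))
    human_means = [mean(human_ratings[g]) for g in games]
    model_vals = [model_scores[g] for g in games]
    return pearson(human_means, model_vals)


# Made-up toy numbers, purely illustrative (not data from the paper).
toy_humans = {"g1": [0.2, 0.4], "g2": [0.5, 0.7], "g3": [0.9, 0.7]}
toy_model = {"g1": 0.1, "g2": 0.5, "g3": 0.9}
print(round(alignment(toy_humans, toy_model), 3))
```

The same function could be applied separately to payoff and funness queries, which is one way to compare how alignment differs along the "difficulty of quantifying" axis.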

If this is right

Models with reasoning show stronger correspondence to human payoff and funness judgments.
Closer approximation to optimal play reduces the match with human evaluations.
Funness proves harder to quantify consistently, leading to greater model-to-model differences.
Resource consumption varies unpredictably for reasoning models during evaluation tasks.
The approach suggests value in incorporating human-like evaluation capabilities into AI systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method of evaluating evaluations could apply to other areas where AI must select or design problems, such as in automated research or game design tools.
The observed non-monotonic trend implies that pushing models toward pure optimality might reduce their usefulness as proxies for human preference in creative tasks.
Future work might explore training objectives that balance game-theoretic performance with alignment to human funness ratings.
Variable resource use highlights a potential need for models that can decide how much computation to allocate to different evaluation queries.

Load-bearing premise

Human judgments collected for the novel board games provide a consistent and representative standard for what counts as fair payoff or fun gameplay.

What would settle it

Two outcomes would challenge the results: the comparisons failing to replicate on a new set of games chosen without the original selection criteria, or repeated ratings from the same people showing high disagreement.

Original abstract

Reasoning is not just about solving problems -- it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems have primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to game-theoretic optimal, their fit to human data weakens. We also observe more "jaggedness" across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a formalism for assessing AI systems' evaluations of games (rather than their ability to play them) and reports an empirical comparison of language models and reasoning models against human judgments and symbolic agents. Using a dataset of over 100 novel board games and over 450 human judgments, it evaluates models on two query types—payoff/fairness and funness—finding that reasoning models are generally more aligned with human evaluations than non-reasoning models, but with a non-monotonic relationship to game-theoretic optimality, greater jaggedness for funness, and highly variable resource usage.

Significance. If the central claims hold, the work is significant for shifting AI evaluation paradigms toward meta-evaluation of problem worthiness, which has implications for designing more capable and aligned AI systems. Strengths include the scale of the novel-games dataset (reducing reliance on memorized benchmarks like chess) and the explicit comparison to symbolic computational agents, which provides a parameter-free baseline. The non-monotonic alignment finding and resource-usage observations, if robust, could motivate new research on resource-rational meta-reasoning in language models.

major comments (2)

[Abstract and Dataset section] Human Judgments collection (referenced in the abstract): the alignment metrics and non-monotonic trend rest on treating the 450+ human judgments as stable ground truth for both payoff and funness, yet no inter-rater reliability statistics, test-retest measures, rater screening protocols, or bias controls are described. This is load-bearing because every quantitative comparison flows through these judgments; unaccounted noise or systematic biases could artifactually produce the reported model differences and jaggedness patterns.
[Results] Results on non-monotonic relationship (abstract): the claim that fit to human data weakens as models approach game-theoretic optimal requires explicit definition of the optimality metric (e.g., via specific equilibrium computation or simulation) and statistical support (e.g., quadratic regression or breakpoint analysis) rather than visual inspection alone, as this underpins the key challenge to scaling assumptions.
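
The inter-rater reliability check requested in the first major comment could be sketched as follows. Krippendorff's alpha is one common reliability statistic that tolerates a varying number of raters per item; this is a minimal interval-scale version written for illustration, and neither the function name nor the choice of the interval metric comes from the paper.

```python
def krippendorff_alpha_interval(units):
    """Krippendorff's alpha for interval-scale ratings.

    units: list of lists; each inner list holds the ratings that different
    raters gave to one item (for example, one game). Items with fewer than
    two ratings cannot be paired and are ignored.
    """
    pairable = [u for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")

    # Observed disagreement: squared differences between ratings of the same item.
    d_obs = 0.0
    for u in pairable:
        within = sum(
            (a - b) ** 2
            for i, a in enumerate(u)
            for j, b in enumerate(u)
            if i != j
        )
        d_obs += within / (len(u) - 1)
    d_obs /= n

    # Expected disagreement: squared differences across all pooled ratings.
    total = sum(
        (a - b) ** 2
        for i, a in enumerate(values)
        for j, b in enumerate(values)
        if i != j
    )
    d_exp = total / (n * (n - 1))
    if d_exp == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_obs / d_exp


# Made-up ratings for three items, purely illustrative.
print(krippendorff_alpha_interval([[1, 1], [2, 2], [3, 3]]))
```

Values near 1 indicate that raters agree far more than chance would predict, values near 0 indicate chance-level agreement, and negative values indicate systematic disagreement. Computing the statistic separately for payoff and funness ratings would show whether one query is noisier than the other.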

minor comments (2)

[Abstract] The term 'jaggedness' for funness assessments is used without a precise operational definition or quantitative measure, which reduces clarity when comparing across queries.
[Results] Ensure all reported comparisons include error bars, sample sizes per condition, and any data exclusion criteria to allow verification of the alignment results.
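
One way to produce the error bars requested above is a percentile bootstrap over games. The sketch below assumes alignment is summarized as a Pearson correlation between per-game human means and model scores; that choice, the function names, and the toy data are assumptions for illustration rather than the paper's procedure.

```python
import random
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; raises ValueError if either sequence is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def bootstrap_ci(xs, ys, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the correlation, resampling games."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need two equal-length sequences with at least 3 items")
    pearson(xs, ys)  # fail early if the full data cannot be correlated
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(xs)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(pearson([xs[i] for i in idx], [ys[i] for i in idx]))
        except ValueError:
            continue  # degenerate resample (constant values); draw again
    stats.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = max(lo_idx, int((1 + level) / 2 * n_resamples) - 1)
    return stats[lo_idx], stats[hi_idx]


# Made-up per-game values, purely illustrative (not data from the paper).
human_means = [0.1, 0.3, 0.35, 0.5, 0.55, 0.7, 0.8, 0.9]
model_scores = [0.2, 0.25, 0.5, 0.4, 0.7, 0.6, 0.9, 0.85]
print(bootstrap_ci(human_means, model_scores))
```

Resampling whole games, rather than individual ratings, keeps each game's human and model values paired, so the interval reflects uncertainty from the particular set of games used.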

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, outlining revisions that will strengthen the transparency and statistical rigor of our claims without altering the core findings.

Referee: [Abstract and Dataset section] Human Judgments collection (referenced in the abstract): the alignment metrics and non-monotonic trend rest on treating the 450+ human judgments as stable ground truth for both payoff and funness, yet no inter-rater reliability statistics, test-retest measures, rater screening protocols, or bias controls are described. This is load-bearing because every quantitative comparison flows through these judgments; unaccounted noise or systematic biases could artifactually produce the reported model differences and jaggedness patterns.

Authors: We agree that greater methodological transparency is needed to support the human judgments as ground truth. The current manuscript reports the collection of over 450 judgments but does not include reliability metrics. In the revised version, we will add inter-rater reliability statistics (such as Krippendorff's alpha) separately for payoff and funness queries, along with details on rater screening protocols, test-retest procedures where available, and steps taken to mitigate bias. These additions will be placed in the Dataset section to directly address concerns about noise or systematic artifacts in the alignment results.
revision: yes
Referee: [Results] Results on non-monotonic relationship (abstract): the claim that fit to human data weakens as models approach game-theoretic optimal requires explicit definition of the optimality metric (e.g., via specific equilibrium computation or simulation) and statistical support (e.g., quadratic regression or breakpoint analysis) rather than visual inspection alone, as this underpins the key challenge to scaling assumptions.

Authors: We concur that the non-monotonic relationship merits more formal definition and statistical validation. The optimality metric is derived from comparisons against symbolic computational agents that compute equilibria, as described in the methods for establishing game-theoretic baselines. In the revised Results section, we will explicitly define this metric (including the equilibrium computation approach) and supplement the visual observation with a quadratic regression analysis (or equivalent breakpoint test) on the alignment scores versus optimality distance, reporting coefficients, p-values, and goodness-of-fit measures. This will provide quantitative support for the non-monotonic pattern and its implications for scaling.
revision: yes
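
The quadratic regression discussed in this exchange can be sketched with ordinary least squares. In the version below, `x` stands for a hypothetical per-model optimality score and `y` for that model's alignment with human data; both variable meanings, the function name, and the synthetic numbers are assumptions for illustration. It reports coefficients and R² only; p-values would need a statistics package and are left out.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x**2.

    Returns (b0, b1, b2, r_squared).
    """
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least 3 paired observations")
    # Normal equations A b = c, built from power sums of x.
    s = [sum(x ** k for x in xs) for k in range(5)]
    c = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    m = [[s[i + j] for j in range(3)] + [c[i]] for i in range(3)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("x values do not support a quadratic fit")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    # Back substitution.
    b = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        tail = sum(m[i][j] * b[j] for j in range(i + 1, 3))
        b[i] = (m[i][3] - tail) / m[i][i]
    mean_y = sum(ys) / n
    ss_res = sum((y - (b[0] + b[1] * x + b[2] * x * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return b[0], b[1], b[2], r2


# Synthetic inverted-U data generated from a known curve, not from the paper.
opt = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
align = [0.2 + 1.6 * x - 1.5 * x * x for x in opt]
b0, b1, b2, r2 = fit_quadratic(opt, align)
print(b0, b1, b2, r2)
```

A negative `b2` with a turning point `-b1 / (2 * b2)` inside the observed range of `x` is the signature of an inverted-U pattern. A fit like this is only as informative as the number of models behind it, so the sample size per condition matters as much as the coefficient.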

Circularity Check

0 steps flagged

No circularity: results rest on external human judgments and symbolic agents as independent benchmarks.

The paper introduces a formalism for evaluating AI evaluations of games and then performs empirical comparisons of model outputs (on payoff/fairness and funness queries) against a dataset of over 100 novel board games and 450+ human judgments, plus symbolic computational agents. No equations or derivations reduce by construction to fitted inputs or self-referential definitions; the central claims about alignment and non-monotonicity are measured against these external data sources rather than being forced by internal parameter fits or self-citations. The derivation chain is self-contained against the provided benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on treating human ratings as the reference standard and assuming the novel game set adequately samples the space of evaluable games; no free parameters or invented entities are described.

axioms (1)

Domain assumption: Human judgments on payoff and funness provide a reliable external benchmark for model evaluations.
The paper uses these judgments to measure alignment and reports deviations from them.

pith-pipeline@v0.9.0 · 5832 in / 1295 out tokens · 68635 ms · 2026-05-21T20:52:03.862431+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Post-training makes large language models less human-like
cs.CL · 2026-05 · unverdicted · novelty 6.0

Post-training reduces LLMs' behavioral alignment with humans across families and sizes, with the misalignment increasing in newer generations while persona induction fails to improve individual-level predictions.

Statistical mechanics for Scrabble predicts strategy, entropy and language
physics.bio-ph · 2026-05 · unverdicted · novelty 6.0

A pairwise maximum-entropy model fitted to Scrabble tile graphs reproduces observed statistics, predicts word-length and geometric features, and classifies languages by entropy and structure.