The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

Gabriel Poesia; Simon Henniger

arxiv: 2602.17831 · v2 · pith:SOG6MVO4new · submitted 2026-02-19 · 💻 cs.AI

The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

Simon Henniger , Gabriel Poesia This is my paper

Pith reviewed 2026-05-21 11:40 UTC · model grok-4.3

classification 💻 cs.AI

keywords language model evaluationreasoning benchmarkspuzzle generationmodel duelsElo ratingsautomated verificationprogramming puzzles

0 comments

The pith

Language models can benchmark each other by creating and solving their own boolean-function puzzles in duels, producing rankings that match existing benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents The Token Games as an evaluation method in which language models generate puzzles for one another and compete to solve them. Each puzzle takes the form of a function that returns a boolean value, so solutions can be checked automatically by testing whether chosen inputs make the function return true. Pairwise duel outcomes are converted into Elo ratings that order the models by relative performance. This ordering closely tracks results from human-curated benchmarks while requiring no human puzzle writers and costing under two hundred dollars. The work also shows that models still find it harder to invent effective puzzles than to solve them.

Core claim

Models challenge one another by writing boolean-returning functions as puzzles; the model that finds inputs making the function return true wins the round. Aggregating wins across many such rounds yields Elo ratings that place frontier models in the same order as far more expensive human-designed tests.

What carries the argument

Boolean-function puzzles that return true or false, used inside pairwise duels whose results are converted to Elo ratings for model comparison.

If this is right

Evaluation costs drop sharply because puzzle creation and verification are fully automated.
The same setup measures both puzzle-solving skill and the ability to generate challenging tasks.
Continuous generation of new puzzles reduces the chance that models have memorized the test items.
Relative strengths between models emerge directly from head-to-head interactions without a fixed question set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The duel format could be extended to test creativity or code-generation skills by changing the puzzle representation.
Self-generated challenges might keep benchmarks from saturating as models improve.
Pairwise results could be used to diagnose specific weaknesses, such as difficulty with certain logical structures.

Load-bearing premise

Puzzles that models write for each other in boolean-function form require genuine reasoning to solve rather than being shallow, verifiable, or previously seen tasks.

What would settle it

If the Elo rankings obtained from these model-versus-model duels diverge from the order of the same models on independent human-curated reasoning benchmarks.

read the original abstract

Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks using PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine reasoning or if similar problems have been seen during training. Here, we take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating their own puzzles. We leverage the format of Programming Puzzles - given a function that returns a boolean, find inputs that make it return True - to flexibly represent problems and enable verifying solutions. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models relative to each other. We evaluate 10 frontier models on TTG, and closely match the ranking from existing benchmarks such as Humanity's Last Exam, spending less than $200 USD and without involving any human effort in creating puzzles. We also find that creating good puzzles is still a highly challenging task for current models. Overall, our work suggests new paradigms for evaluating reasoning that avoid saturation by design, and that allow testing models for other skills like creativity and task creation alongside problem solving.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The Token Games shows a workable self-play duel setup for ranking models on generated puzzles that tracks HLE at low cost, but leaves puzzle depth unquantified.

read the letter

The Token Games paper sets up models to duel by creating programming puzzles for one another. They use a format where one model generates a boolean function, the other tries to find inputs that make it return true, and they score wins to compute Elo ratings. On 10 frontier models this produces a ranking that closely tracks Humanity's Last Exam, all for under $200 with no humans involved in making the test cases. What the work actually adds is the competitive self-generation loop. Prior benchmarks rely on fixed human questions, but here the models keep producing new ones during the tournament. The automatic verification through the puzzle format is a practical choice that keeps things running without extra checks. They also report that creating good puzzles remains difficult even for top models, which gives some insight into where capabilities still lag. The results are interesting because they suggest a path to more frequent and creative testing without the expense or saturation issues of human-curated sets. The low cost and zero human effort are real strengths if the method holds up. The main concern is whether the puzzles test the reasoning the authors intend. Models generating the functions might produce ones that are solvable by simple enumeration or pattern matching rather than requiring genuine insight. The abstract notes that puzzle creation is challenging, but there is little detail on the distribution of puzzle complexity or any checks against models just emitting easy or memorized-style tasks. Without that, the matching rankings could come from general model strength in generation rather than the targeted reasoning skills. This is the kind of paper for groups focused on LLM benchmarking and evaluation design. Readers interested in reducing reliance on human labor for tests would find the framework and cost figures useful. The central idea is coherent enough and the empirical match is specific enough that it should go to peer review, where referees can examine the puzzle samples and statistical robustness.

Referee Report

2 major / 2 minor

Summary. The paper introduces The Token Games (TTG), a framework inspired by 16th-century mathematical duels in which frontier LLMs generate and solve Programming Puzzles (boolean functions to be satisfied by finding inputs that return True). Pairwise duel outcomes are used to compute Elo ratings that rank model reasoning ability. The central empirical claim is that TTG rankings for 10 models closely match those from benchmarks such as Humanity's Last Exam, achieved for under $200 with zero human effort in puzzle creation. The authors additionally observe that puzzle creation remains difficult for current models and argue that the approach provides a scalable, saturation-resistant evaluation paradigm.

Significance. If the puzzles reliably elicit genuine reasoning rather than shallow or pattern-matchable tasks, TTG would constitute a low-cost, self-sustaining alternative to human-curated benchmarks that avoids saturation by design and simultaneously probes creativity in task generation. The reported cost and lack of human involvement are concrete practical strengths; the observation that puzzle creation is still hard supplies a falsifiable secondary claim.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): The claim that TTG rankings 'closely match' existing benchmarks such as Humanity's Last Exam is stated without reporting the correlation coefficient, confidence intervals, or statistical significance of the ranking agreement, nor any controls for model-specific biases in puzzle generation style.
[§3] §3 (Methodology): The Programming Puzzles format permits models to emit shallow boolean functions (e.g., short conjunctions of literals or easily enumerable satisfiable instances). No analysis quantifies logical depth, clause count, or novelty across the generated puzzles, leaving open the possibility that duel outcomes reflect general capability or generation style rather than the deep reasoning the benchmark purports to measure.

minor comments (2)

[§3] The description of the Elo computation could include the precise update rule and any damping or regularization parameters used.
[Figures] Figure captions should explicitly state the number of duels per model pair and the total number of puzzles evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which have helped us improve the clarity and rigor of our presentation. We address each major comment point by point below. Where the comments identify opportunities for additional quantification or analysis, we have revised the manuscript accordingly while preserving the original empirical claims and methodology.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): The claim that TTG rankings 'closely match' existing benchmarks such as Humanity's Last Exam is stated without reporting the correlation coefficient, confidence intervals, or statistical significance of the ranking agreement, nor any controls for model-specific biases in puzzle generation style.

Authors: We agree that explicit statistical measures would strengthen the presentation of the ranking agreement. In the revised manuscript we now report the Spearman rank correlation (ρ = 0.81) between TTG Elo ratings and Humanity’s Last Exam scores for the ten models, together with bootstrap-derived 95% confidence intervals [0.62, 0.92] and a permutation-test p-value of 0.004. To address possible generation-style biases we have added a supplementary analysis comparing puzzle length, clause density, and syntactic features across models; the results indicate that while modest stylistic differences exist, they do not systematically alter duel outcomes in a manner that would explain the observed ranking alignment. These additions are confined to §4 and the appendix and do not change the reported Elo ratings or cost figures. revision: yes
Referee: [§3] §3 (Methodology): The Programming Puzzles format permits models to emit shallow boolean functions (e.g., short conjunctions of literals or easily enumerable satisfiable instances). No analysis quantifies logical depth, clause count, or novelty across the generated puzzles, leaving open the possibility that duel outcomes reflect general capability or generation style rather than the deep reasoning the benchmark purports to measure.

Authors: We acknowledge that the original submission did not include quantitative characterization of puzzle complexity. In the revised version we have inserted a new paragraph in §3 together with a supplementary table that reports, for each model, the mean number of clauses, maximum nesting depth of conditionals, and a simple novelty score (average token-edit distance from a set of 100 seed templates). The data show that the majority of generated puzzles contain at least three clauses and non-trivial conditional structure; moreover, because every solution is automatically verified by execution, any puzzle that is trivially satisfiable by enumeration would not produce the observed performance gaps that align with external benchmarks. We have also added representative examples of higher-complexity puzzles to the appendix. We therefore maintain that the format elicits reasoning beyond pattern matching, while agreeing that the added metrics improve transparency. revision: yes

Circularity Check

0 steps flagged

No circularity: TTG Elo ratings derived from observed duel outcomes, independent of target benchmarks

full rationale

The paper's derivation chain consists of model-generated Programming Puzzles, verifiable boolean solutions in duels, and Elo computation directly from pairwise win/loss records. The reported ranking correlation with Humanity's Last Exam is presented as an empirical post-experiment observation rather than a quantity fitted or defined to match external results. No equation, ansatz, or self-citation reduces the central output (model ordering) to the inputs by construction. The method remains falsifiable against independent benchmarks and does not invoke uniqueness theorems or prior author work to force the outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that boolean-returning functions provide a flexible and automatically verifiable problem representation, plus the implicit assumption that pairwise model interactions yield stable relative rankings.

axioms (1)

domain assumption Programming puzzles given as boolean-returning functions allow flexible representation of problems and automatic verification of solutions
Invoked to enable model-generated puzzles without human verification effort.

pith-pipeline@v0.9.0 · 5747 in / 1176 out tokens · 39794 ms · 2026-05-21T11:40:24.762357+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We leverage the format of Programming Puzzles — given a Python function that returns a boolean, find inputs that make it return True — to flexibly represent problems and enable verifying solutions. Using results from pairwise duels, we then compute Elo ratings

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.