ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models
Pith reviewed 2026-05-18 12:21 UTC · model grok-4.3
The pith
None of the 13 evaluated large language models beats an amateur-level chess engine (Maia-1100) in competitive tests, which the paper reads as evidence that they lack genuine strategic reasoning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ChessArena provides a competitive framework in which LLMs play chess under four play modes to assess basic understanding, move selection, and puzzle solving. Testing thirteen models across more than eight hundred games reveals that no evaluated LLM beats the Maia-1100 engine, which corresponds to human amateur play, and that some models are outperformed by random play. A fine-tuned Qwen3-8B model delivers substantial performance gains and approaches the results of much larger state-of-the-art reasoning models.
What carries the argument
ChessArena, a chess-based competitive testbed using four play modes to evaluate LLMs on rule following and game-state tracking in both full games and puzzles.
Load-bearing premise
Success or failure in chess games and puzzles under the four play modes directly measures genuine strategic reasoning rather than pattern recognition or rule memorization.
What would settle it
An LLM winning a majority of its games against Maia-1100 across the four play modes would count as evidence of strategic reasoning.
Original abstract
Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine strategic reasoning, or do they primarily excel at pattern recognition? To address this, we present ChessArena, a chess-based testbed for evaluating LLMs. Chess demands strategic reasoning, precise rule adherence, and the ability to track complex game states. ChessArena is a competitive framework where LLMs play against each other under four play modes. We evaluate 13 LLMs across over 800 games, testing basic understanding, move selection, and puzzle solving. Results reveal significant shortcomings: no model beats Maia-1100 (human amateur level), and some lose to random play. We also present a strong baseline: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ChessArena, a competitive chess testbed for assessing whether LLMs possess genuine strategic reasoning or primarily rely on pattern recognition. It evaluates 13 LLMs across more than 800 games in four play modes (testing basic understanding, move selection, and puzzle solving), comparing them to random play and the Maia-1100 engine (human amateur level). The central findings are that no evaluated LLM beats Maia-1100 and some lose to random play, while a fine-tuned Qwen3-8B model shows substantial gains approaching larger state-of-the-art reasoning models.
Significance. If the results hold after addressing reporting gaps, the work offers a concrete, falsifiable benchmark that highlights limitations in LLMs' strategic capabilities beyond surface-level chess knowledge. The fine-tuned Qwen3-8B baseline is a constructive strength, demonstrating that targeted adaptation can narrow the gap to larger models and providing a reproducible starting point for future research. The use of external engines (Maia-1100) and multiple play modes strengthens the empirical framing compared to purely internal evaluations.
major comments (2)
- [Evaluation and Results] The central claim that performance differences reflect deficits in strategic reasoning (rather than rule-following or state-tracking failures) is load-bearing but insecure without details on illegal move handling. The abstract and evaluation sections report aggregate outcomes over 800+ games but omit illegal move rates, enforcement mechanisms, and how invalid outputs are resolved (e.g., default to random, early termination, or penalty). This directly affects interpretation of losses to random play and comparisons to Maia-1100.
- [Results and Discussion] No error bars, confidence intervals, or statistical tests are provided for the aggregate win rates or comparisons across models and modes. This weakens the strength of the claim that 'no model beats Maia-1100' and that the fine-tuned Qwen3-8B 'substantially improves' performance, as variability across games or seeds cannot be assessed.
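One way the uncertainty estimates requested in the second comment could be produced from per-game outcomes is sketched below: a percentile bootstrap interval for a win rate, and an exact one-sided binomial test against a reference win probability. This is an illustration, not the paper's analysis code. The function names, the choice to count draws as non-wins, the 0.5 reference probability, and the example record are all assumptions made for the sketch.

```python
import math
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes.

    `outcomes` holds one entry per game: 1 for a win, 0 otherwise
    (draws counted as non-wins, which is an assumption of this sketch).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the games with replacement and record each resampled win rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def binomial_p_greater(wins, games, p0=0.5):
    """Exact one-sided p-value P(X >= wins) for X ~ Binomial(games, p0)."""
    return sum(
        math.comb(games, k) * p0**k * (1 - p0) ** (games - k)
        for k in range(wins, games + 1)
    )


# Hypothetical record, NOT data from the paper: 12 wins in 40 games.
outcomes = [1] * 12 + [0] * 28
low, high = bootstrap_ci(outcomes)
p_value = binomial_p_greater(sum(outcomes), len(outcomes))
print(f"win rate {sum(outcomes) / len(outcomes):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
print(f"one-sided p-value for 'win probability exceeds 0.5': {p_value:.3f}")
```

A large p-value here means the record gives no evidence that the model wins more than half its games against the baseline. It does not by itself show that the model is worse than the baseline.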
minor comments (2)
- [Play Modes] The description of the four play modes would benefit from a short example trace showing a full game or puzzle interaction, including how the LLM prompt is constructed and how the opponent responds.
- [Experimental Setup] Clarify the implementation details of the random baseline and Maia-1100 (e.g., move selection policy, time controls, or opening book usage) to ensure reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.
Point-by-point responses
-
Referee: [Evaluation and Results] The central claim that performance differences reflect deficits in strategic reasoning (rather than rule-following or state-tracking failures) is load-bearing but insecure without details on illegal move handling. The abstract and evaluation sections report aggregate outcomes over 800+ games but omit illegal move rates, enforcement mechanisms, and how invalid outputs are resolved (e.g., default to random, early termination, or penalty). This directly affects interpretation of losses to random play and comparisons to Maia-1100.
Authors: We agree that explicit details on illegal-move handling are required to support interpretation of the results. We have added a dedicated paragraph to the Evaluation section describing the move-generation protocol (standard algebraic notation with up to three resampling attempts for invalid outputs), the observed illegal-move rates per model and mode, and the resolution rule (unresolved invalid moves count as a loss). These additions are also summarized briefly in the abstract. The new information shows that illegal-move rates are low for most models and do not account for the majority of losses against random play or Maia-1100, thereby reinforcing rather than undermining the claim that performance gaps reflect strategic-reasoning limitations. revision: yes
-
Referee: [Results and Discussion] No error bars, confidence intervals, or statistical tests are provided for the aggregate win rates or comparisons across models and modes. This weakens the strength of the claim that 'no model beats Maia-1100' and that the fine-tuned Qwen3-8B 'substantially improves' performance, as variability across games or seeds cannot be assessed.
Authors: We accept that the original presentation lacked quantitative measures of uncertainty. In the revised Results section we now report 95% bootstrap confidence intervals for every win rate and include binomial tests comparing each model against the Maia-1100 and random baselines. These additions confirm that no evaluated model significantly exceeds Maia-1100 and that the fine-tuned Qwen3-8B improvement is statistically significant, thereby strengthening the evidential basis for our conclusions.
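The illegal-move protocol described in the first response (resample invalid outputs a bounded number of times, then score an unresolved move as a loss) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `choose_move`, `scripted`, and the `propose` callable are hypothetical names, the legal-move set is passed in rather than computed by a chess library, and "up to three resampling attempts" is read here as three attempts in total.

```python
MAX_ATTEMPTS = 3  # assumption: three attempts in total per move


def choose_move(propose, legal_moves, max_attempts=MAX_ATTEMPTS):
    """Ask `propose` for a move until it returns a legal one.

    `propose` is any zero-argument callable returning a move string
    (a stand-in for one LLM query). Returns (move, n_invalid), where
    `move` is None if every attempt was illegal; the caller then
    scores the game as a loss for that player.
    """
    n_invalid = 0
    for _ in range(max_attempts):
        candidate = propose().strip()
        if candidate in legal_moves:
            return candidate, n_invalid
        n_invalid += 1  # tracked so illegal-move rates can be reported
    return None, n_invalid


def scripted(outputs):
    """Hypothetical stub that replays a fixed list of model outputs."""
    iterator = iter(outputs)
    return lambda: next(iterator)


# Toy legal-move set for illustration only, not a real position.
legal = {"e4", "d4", "Nf3"}

move, invalid = choose_move(scripted(["Ke2", "e5", "Nf3"]), legal)
forfeit, forfeit_invalid = choose_move(scripted(["Ke2", "e5", "Qh5"]), legal)

print("first case:", "forfeit" if move is None else f"play {move}", invalid)
print("second case:", "forfeit" if forfeit is None else f"play {forfeit}", forfeit_invalid)
```

Returning the invalid-attempt count alongside the move keeps forfeits separable from ordinary losses, which is the distinction the referee's comment asks the paper to report.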
Circularity Check
Empirical evaluation with no derivation chain or self-referential reductions
full rationale
The paper introduces ChessArena as an experimental testbed and reports direct game outcomes from LLM play against Maia-1100, random baselines, and internal fine-tuning. No equations, first-principles derivations, or predictions are presented that could reduce to fitted inputs or self-citations by construction. All claims rest on observable win rates and puzzle accuracy measured against external engines, satisfying the criterion for self-contained empirical work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Chess requires and therefore measures genuine strategic reasoning beyond pattern matching.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear?)
unclear: Relation between the paper passage and the cited Recognition theorem.
ChessArena is a competitive framework where LLMs play against each other under four play modes... Glicko rating system... fine-grained evaluation tasks: Basic Understanding, Move Selection, Puzzle Solving.
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear?)
unclear: Relation between the paper passage and the cited Recognition theorem.
We define three types of rewards: format reward, legal move reward, and top move reward.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
-
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in reasoning traces, not the deep lookahead they generate, unlike humans where depth drives performance.
-
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
LLM move selection in four-in-a-row is best explained by myopic models that ignore deep nodes in their own reasoning traces, while performance correlates with search breadth rather than depth.
-
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
LLMs display myopic planning in games: move selection is driven by shallow nodes in reasoning traces despite generating deep lookahead, with performance tied to search breadth rather than depth.
-
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
LLMs exhibit myopic planning in games, with move choices driven by shallow nodes despite deep reasoning traces, in contrast to human deep-search reliance.
-
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
LLM planning in four-in-a-row is myopic: move choices match a shallow model that ignores deep nodes expanded in reasoning traces.