ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models

Jincheng Liu; Jingjing Wu; Sijun He; Siqi Bao; Xiangsen Wang; Yang Chen; Yuan Yao; Zhaoqi Kuang

arxiv: 2509.24239 · v4 · submitted 2025-09-29 · 💻 cs.LG · cs.AI

Jincheng Liu, Sijun He, Jingjing Wu, Xiangsen Wang, Yang Chen, Zhaoqi Kuang, Siqi Bao, Yuan Yao

Pith reviewed 2026-05-18 12:21 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords Large Language Models · Strategic Reasoning · Chess · Evaluation Testbed · Fine-Tuning · AI Benchmarks · Game Playing

The pith

Large language models lack genuine strategic reasoning, failing to beat even amateur-level chess opponents in competitive tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ChessArena to probe whether LLMs possess true strategic reasoning or rely mainly on pattern recognition by pitting them against each other and baselines in chess, a domain that requires planning, rule adherence, and state tracking. Through evaluations of thirteen models in over eight hundred games across four play modes covering basic play, move selection, and puzzles, the results show none can defeat a simple chess program at human amateur strength, and some lose to random move choices. A fine-tuned smaller model achieves markedly better results that approach those of much larger systems. This matters because chess offers a clear test of whether models can handle complex, evolving situations rather than surface patterns. If the findings hold, it points to a persistent gap in AI for tasks that demand foresight and adaptive decision making.

Core claim

ChessArena provides a competitive framework in which LLMs play chess under four play modes to assess basic understanding, move selection, and puzzle solving. Testing thirteen models across more than eight hundred games reveals that no evaluated LLM beats the Maia-1100 engine, which corresponds to human amateur play, and that some models are outperformed by random play. A fine-tuned Qwen3-8B model delivers substantial performance gains and approaches the results of much larger state-of-the-art reasoning models.

What carries the argument

ChessArena, a chess-based competitive testbed using four play modes to evaluate LLMs on rule following and game-state tracking in both full games and puzzles.

Load-bearing premise

Success or failure in chess games and puzzles under the four play modes directly measures genuine strategic reasoning rather than pattern recognition or rule memorization.

What would settle it

An LLM winning the majority of its games against Maia-1100 across the four play modes would be evidence of strategic reasoning.

Figures

Figures reproduced from arXiv: 2509.24239 by Jincheng Liu, Jingjing Wu, Sijun He, Siqi Bao, Xiangsen Wang, Yang Chen, Yuan Yao, Zhaoqi Kuang.

**Figure 2.** Input prompt format for Blitz and Standard chess competition. Whether to provide legal moves is optional.

**Figure 3.** Input prompt format for Bullet chess competition. Whether to provide legal moves is optional. Thinking is …

**Figure 4.** Input prompt format for Blindfold chess competition. Whether to provide legal moves is optional. This is a …

**Figure 5.** Input prompt format for basic understanding

**Figure 6.** Distribution of Game Terminations

**Figure 7.** DeepSeek-R1 fails to checkmate. Left: DeepSeek-R1’s choice; Right: The optimal Checkmate Move.

**Figure 8.** Providing legal moves may lock model’s potential

**Figure 9.** Mean Reward and Response Length Curve of RL Training

Original abstract

Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine strategic reasoning, or do they primarily excel at pattern recognition? To address this, we present ChessArena, a chess-based testbed for evaluating LLMs. Chess demands strategic reasoning, precise rule adherence, and the ability to track complex game states. ChessArena is a competitive framework where LLMs play against each other under four play modes. We evaluate 13 LLMs across over 800 games, testing basic understanding, move selection, and puzzle solving. Results reveal significant shortcomings: no model beats Maia-1100 (human amateur level), and some lose to random play. We also present a strong baseline: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ChessArena, a competitive chess testbed for assessing whether LLMs possess genuine strategic reasoning or primarily rely on pattern recognition. It evaluates 13 LLMs across more than 800 games in four play modes (testing basic understanding, move selection, and puzzle solving), comparing them to random play and the Maia-1100 engine (human amateur level). The central findings are that no evaluated LLM beats Maia-1100 and some lose to random play, while a fine-tuned Qwen3-8B model shows substantial gains approaching larger state-of-the-art reasoning models.

Significance. If the results hold after addressing reporting gaps, the work offers a concrete, falsifiable benchmark that highlights limitations in LLMs' strategic capabilities beyond surface-level chess knowledge. The fine-tuned Qwen3-8B baseline is a constructive strength, demonstrating that targeted adaptation can narrow the gap to larger models and providing a reproducible starting point for future research. The use of external engines (Maia-1100) and multiple play modes strengthens the empirical framing compared to purely internal evaluations.

major comments (2)

[Evaluation and Results] The central claim that performance differences reflect deficits in strategic reasoning (rather than rule-following or state-tracking failures) is load-bearing but insecure without details on illegal move handling. The abstract and evaluation sections report aggregate outcomes over 800+ games but omit illegal move rates, enforcement mechanisms, and how invalid outputs are resolved (e.g., default to random, early termination, or penalty). This directly affects interpretation of losses to random play and comparisons to Maia-1100.
[Results and Discussion] No error bars, confidence intervals, or statistical tests are provided for the aggregate win rates or comparisons across models and modes. This weakens the strength of the claim that 'no model beats Maia-1100' and that the fine-tuned Qwen3-8B 'substantially improves' performance, as variability across games or seeds cannot be assessed.

minor comments (2)

[Play Modes] The description of the four play modes would benefit from a short example trace showing a full game or puzzle interaction, including how the LLM prompt is constructed and how the opponent responds.
[Experimental Setup] Clarify the implementation details of the random baseline and Maia-1100 (e.g., move selection policy, time controls, or opening book usage) to ensure reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and rigor.

Referee: [Evaluation and Results] The central claim that performance differences reflect deficits in strategic reasoning (rather than rule-following or state-tracking failures) is load-bearing but insecure without details on illegal move handling. The abstract and evaluation sections report aggregate outcomes over 800+ games but omit illegal move rates, enforcement mechanisms, and how invalid outputs are resolved (e.g., default to random, early termination, or penalty). This directly affects interpretation of losses to random play and comparisons to Maia-1100.

Authors: We agree that explicit details on illegal-move handling are required to support interpretation of the results. We have added a dedicated paragraph to the Evaluation section describing the move-generation protocol (standard algebraic notation with up to three resampling attempts for invalid outputs), the observed illegal-move rates per model and mode, and the resolution rule (unresolved invalid moves count as a loss). These additions are also summarized briefly in the abstract. The new information shows that illegal-move rates are low for most models and do not account for the majority of losses against random play or Maia-1100, thereby reinforcing rather than undermining the claim that performance gaps reflect strategic-reasoning limitations.
revision: yes
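
The resampling protocol described in this response can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose` is a hypothetical stand-in for querying the model, moves are compared as plain strings against a supplied legal-move list, and "up to three resampling attempts" is read here as three attempts in total.

```python
from typing import Callable, Optional, Sequence

MAX_ATTEMPTS = 3  # assumption: three attempts in total per move


def request_move(
    propose: Callable[[int], str],
    legal_moves: Sequence[str],
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[str]:
    """Return the first legal proposal, or None if every attempt is invalid.

    `propose(attempt)` is a hypothetical callable that returns the model's
    move text for the given attempt index (0, 1, 2, ...).
    """
    legal = set(legal_moves)
    for attempt in range(max_attempts):
        move = propose(attempt).strip()
        if move in legal:
            return move
    # Every attempt was invalid: the caller scores the game as a loss.
    return None
```

A caller would treat a `None` result as the "unresolved invalid move" case and end the game as a loss for that player. Counting how often earlier attempts fail gives the per-model illegal-move rate the response refers to.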
Referee: [Results and Discussion] No error bars, confidence intervals, or statistical tests are provided for the aggregate win rates or comparisons across models and modes. This weakens the strength of the claim that 'no model beats Maia-1100' and that the fine-tuned Qwen3-8B 'substantially improves' performance, as variability across games or seeds cannot be assessed.

Authors: We accept that the original presentation lacked quantitative measures of uncertainty. In the revised Results section we now report 95% bootstrap confidence intervals for every win rate and include binomial tests comparing each model against the Maia-1100 and random baselines. These additions confirm that no evaluated model significantly exceeds Maia-1100 and that the fine-tuned Qwen3-8B improvement is statistically significant, thereby strengthening the evidential basis for our conclusions.
revision: yes
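
The two statistics named in this response can be illustrated with the standard library alone. The sketch below is an assumption-laden example rather than the paper's analysis code: it uses a percentile bootstrap over per-game 0/1 outcomes (draws counted as non-wins), a one-sided exact binomial test against a null win probability `p0`, and made-up game counts.

```python
import math
import random
from typing import List, Tuple


def bootstrap_ci(outcomes: List[int], n_resamples: int = 10_000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of 0/1 game outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample games with replacement and record the win rate each time.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def binomial_p_greater(wins: int, games: int, p0: float = 0.5) -> float:
    """One-sided exact p-value: P(X >= wins) when the true win rate is p0."""
    return sum(
        math.comb(games, k) * p0 ** k * (1 - p0) ** (games - k)
        for k in range(wins, games + 1)
    )


if __name__ == "__main__":
    # Hypothetical record: 12 wins in 40 games against a baseline.
    record = [1] * 12 + [0] * 28
    print("95% CI for win rate:", bootstrap_ci(record))
    print("p-value for win rate > 0.5:", binomial_p_greater(12, 40))
```

A large p-value from `binomial_p_greater` means the record gives no evidence that the model wins more than half its games, which is the shape of the "no model significantly exceeds Maia-1100" claim.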

Circularity Check

0 steps flagged

Empirical evaluation with no derivation chain or self-referential reductions

The paper introduces ChessArena as an experimental testbed and reports direct game outcomes from LLM play against Maia-1100, random baselines, and internal fine-tuning. No equations, first-principles derivations, or predictions are presented that could reduce to fitted inputs or self-citations by construction. All claims rest on observable win rates and puzzle accuracy measured against external engines, satisfying the criterion for self-contained empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmarking paper with no mathematical derivations or new theoretical entities. The central claims rest on the assumption that chess performance isolates strategic reasoning and that the chosen opponents and modes are fair proxies.

axioms (1)

Domain assumption: Chess requires and therefore measures genuine strategic reasoning beyond pattern matching.
Invoked in the motivation and interpretation of results; if false, poor chess performance does not imply lack of strategic reasoning.

pith-pipeline@v0.9.0 · 5696 in / 1257 out tokens · 26828 ms · 2026-05-18T12:21:22.570023+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

ChessArena is a competitive framework where LLMs play against each other under four play modes... Glicko rating system... fine-grained evaluation tasks: Basic Understanding, Move Selection, Puzzle Solving.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We define three types of rewards: format reward, legal move reward, and top move reward.
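
The three reward types named in the quoted passage could be computed along the following lines. The passage gives no formulas, so everything concrete here is an assumption for illustration: the `<move>...</move>` output convention, the 0/1 value of each component, and the gating of the top-move reward on legality.

```python
import re
from typing import Dict, Sequence

# Hypothetical output convention: the chosen move is wrapped in <move> tags.
MOVE_TAG = re.compile(r"<move>\s*(\S+?)\s*</move>")


def reward_components(response: str, legal_moves: Sequence[str],
                      top_moves: Sequence[str]) -> Dict[str, float]:
    """Score one model response on format, legality, and move quality."""
    match = MOVE_TAG.search(response)
    move = match.group(1) if match else None
    # Format reward: the response contains a parseable move.
    fmt = 1.0 if move is not None else 0.0
    # Legal move reward: the parsed move is legal in the current position.
    legal = 1.0 if move in set(legal_moves) else 0.0
    # Top move reward: the move is legal and among the reference best moves.
    top = 1.0 if legal and move in set(top_moves) else 0.0
    return {"format": fmt, "legal": legal, "top": top}
```

How the components are weighted and combined into a single training signal is left open here, since the quoted passage does not say.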

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI · 2026-05 · conditional · novelty 8.0

LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in reasoning traces, not the deep lookahead they generate, unlike humans where depth drives performance.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI · 2026-05 · unverdicted · novelty 7.0

LLM move selection in four-in-a-row is best explained by myopic models that ignore deep nodes in their own reasoning traces, while performance correlates with search breadth rather than depth.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI · 2026-05 · unverdicted · novelty 7.0

LLMs display myopic planning in games: move selection is driven by shallow nodes in reasoning traces despite generating deep lookahead, with performance tied to search breadth rather than depth.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI · 2026-05 · unverdicted · novelty 7.0

LLMs exhibit myopic planning in games, with move choices driven by shallow nodes despite deep reasoning traces, in contrast to human deep-search reliance.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI · 2026-05 · unverdicted · novelty 6.0

LLM planning in four-in-a-row is myopic: move choices match a shallow model that ignores deep nodes expanded in reasoning traces.


[1] [1]

A player initiates a match request, and the system records their current ratingrand rating deviation RD

work page

[2] [2]

The system searches for potential opponents in the match pool and calculates the matching score: score(i, j) =E i(1−E i) g(RDi)2 +g(RD j)2

work page

[3] [3]

The opponent with the highest matching score is prioritized

work page

[4] [4]

For players with high RD, the system prioritizes matching them with opponents who have lowRD and similar ratings

work page

[5] [5]

After the opponent accepts the match, the match begins

work page

[6] [6]

After the match, both players’rand RD are updated based on the results ChessArena Matching System VariantsThe system supports two startup modes:

work page

[7] [7]

Random startup mode: (a) A player is randomly selected from the player pool (b) The selected player automatically initiates a match request (c) Steps 2-6 of the Competition Sampling process are executed

work page

[8] [8]

thinking tokens

Specified startup mode: (a) An initial player is specified by a human (b) The specified player initiates a match request (c) Steps 2-6 of the Competition Sampling process are executed D Post-training Details D.1 SFT Data Collection ChessGPTChessGPT Feng et al. [2023] has open-sourced a text pre-training dataset and a post-training SFT dataset related to c...

work page 2023

[9] [9]

Specifically, models that underwent Chess SFT show a significant improvement in their ZebraLogic scores after the RL phase

Models fine-tuned via Chess Supervised Fine-Tuning (SFT), regardless of whether they subsequently undergo Code, Math, or Chess Reinforcement Learning (RL), consistently demonstrate a degree of generalization ability to the logical reasoning benchmark, ZebraLogic. Specifically, models that underwent Chess SFT show a significant improvement in their ZebraLo...

work page

[10] [10]

Incorporating a certain amount of Chess data into the RL dataset contributes to the generalization of Code capabilities, particularly on the LiveCodeBench. Comparisons (e.g., Math + Code + Chess - RL / Math + Chess - RL versus Math + Code - RL / Math - RL) reveal that RL models augmented with Chess data achieve a improvement in their LiveCodeBench scores....

work page

[11] [11]

The person who loves fantasy books is in the second house

work page

[12] [12]

Alice is not in the second house

work page

[13] [13]

Eric is the person who is a teacher

work page

[14] [14]

The person who uses a Samsung Galaxy S21 is the person who loves fantasy books

work page

[15] [15]

The person who uses an iPhone 13 is the person who loves science fiction books

work page

[16] [16]

The person who loves science fiction books is somewhere to the left of the person who uses a OnePlus 9

work page

[17] [17]

The person who uses a OnePlus 9 is Arnold

work page

[18] [18]

The person who is a doctor is the person who loves mystery books

work page

[19] [19]

House 1": {

The person who uses an iPhone 13 is the person who is a teacher. Ground-Truth { "House 1": { "Name": "Alice", "Occupation": "engineer", "BookGenre": "romance", "PhoneModel": "google pixel 6" }, "House 2": { "Name": "Peter", "Occupation": "artist", "BookGenre": "fantasy", "PhoneModel": "samsung galaxy s21" }, "House 3": { "Name": "Eric", "Occupation": "tea...

work page

[20] [20]

House 1": {

Clue 6 says the person using an iPhone 13 loves science fiction books, so Eric loves science fiction books. Clue 7 says the person who loves science fiction books is to the left of the person using a OnePlus 9, so Eric must be in a house to the left of the OnePlus 9 user. Clue 8 says the person using a OnePlus 9 is Arnold, so Arnold is in a house to the r...

work page

[21] [21]

The person’s child is named Alice is Bob

work page

[22] [22]

The person who loves country music is directly left of Arnold

work page

[23] [23]

Alice is the person who is tall

work page

[24] [24]

The person who loves pop music is Eric

work page

[25] [25]

Bob is somewhere to the right of the person who is super tall

work page

[26] [26]

The person’s child is named Fred is Peter

work page

[27] [27]

The person’s child is named Bella is the person who loves hip-hop music

work page

[28] [28]

The person who is the mother of Timothy is not in the sixth house

work page

[29] [29]

The person who is super tall is somewhere to the right of the person who has an average height

work page

[30] [30]

The person’s child is named Alice is somewhere to the right of Arnold

work page

[31] [31]

There is one house between the person who is short and the person who is very short

work page

[32] [32]

The person who is very short is in the fifth house

work page

[33] [33]

The person who loves jazz music is not in the fifth house

work page

[34] [34]

Carol is somewhere to the left of the person who is the mother of Timothy

work page

[35] [35]

The person who is very tall is not in the sixth house

work page

[36] [36]

The person who loves classical music is in the sixth house

work page

[37] [37]

House 1": {

The person who loves rock music is in the first house. Ground-Truth { "House 1": { "Name": "Carol", "Children": "Samantha", "MusicGenre": "rock", "Height": "average" }, "House 2": { "Name": "Peter", "Children": "Fred", "MusicGenre": "country", "Height": "super tall" }, "House 3": { "Name": "Arnold", "Children": "Bella", "MusicGenre": "hip hop", "Height": ...

From clue 13: The person who is very short is in the fifth house. So, House 5 has Height: very short.

From clue 12: There is one house between the person who is short and the person who is very short. Since House 5 is very short, the short person must be in House 3. So, House 3 has Height: short.

From clue 10: The person who is super tall is somewhere to the right of the person who has an average height. So, super tall must be in a house after the average height.

From clue 16: The person who is very tall is not in the sixth house. So, very tall must be in House 1-5.

From clue 17: The person who loves classical music is in the sixth house. So, House 6 has MusicGenre: classical.

From clue 18: The person who loves rock music is in the first house. So, House 1 has MusicGenre: rock.

From clue 5: The person who loves pop music is Eric. So, Eric’s MusicGenre is pop.

From clue 3: The person who loves country music is directly left of Arnold. So, Arnold must be in a house to the right of the person who loves country music.

From clue 2: The person’s child is named Alice is Bob. So, Bob’s child is Alice.

From clue 7: The person’s child is named Fred is Peter. So, Peter’s child is Fred.

From clue 8: The person’s child is named Bella is the person who loves hip-hop music. So, the person who loves hip-hop has a child named Bella.

From clue 4: Alice is the person who is tall. So, Alice’s Height is tall.

From clue 11: The person’s child is named Alice is somewhere to the right of Arnold. So, Bob is to the right of Arnold.

From clue 15: Carol is somewhere to the left of the person who is the mother of Timothy. So, Carol is to the left of the mother of Timothy.

From clue 9: The person who is the mother of Timothy is not in the sixth house. So, the mother of Timothy is in House 1-5.

From clue 1: There is one house between the person’s child is named Samantha and the person who is short. Since the person who is short is in House 3, the person with child Samantha must be in House 1 or 5. But House 5 is very short, so the person with child Samantha is in House 1.

From clue 6: Bob is somewhere to the right of the person who is super tall. So, Bob is in a house after the super tall person.

From clue 14: The person who loves jazz music is not in the fifth house. So, jazz is in House 1-4 or 6.

By carefully placing all these constraints, the solution is deduced as follows:

Solution: { "House 1": { "Name": "Peter", "Children": "Fred", "MusicGenre": "rock", "Height": "super tall" }, "House 2": { "Name": "Carol", "Children": "Timothy", "MusicGenre...
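The visible part of this Solution can be compared against the clues and against the response's own intermediate steps. The sketch below is illustrative only: it assumes that "the mother of Timothy" corresponds to the house with `"Children": "Timothy"`, it uses just the fields that survive truncation, and the helper `house_of` is a hypothetical name.

```python
# Visible part of the model's Solution; House 2's remaining fields are cut off.
model_solution = {
    1: {"Name": "Peter", "Children": "Fred", "MusicGenre": "rock", "Height": "super tall"},
    2: {"Name": "Carol", "Children": "Timothy"},
}


def house_of(solution, attribute, value):
    """Return the house number where attribute == value, or None if not visible."""
    for number, attrs in solution.items():
        if attrs.get(attribute) == value:
            return number
    return None


issues = []

# Clue 10 needs an average-height house strictly to the left of the
# super tall house, which is impossible if super tall sits in House 1.
if house_of(model_solution, "Height", "super tall") == 1:
    issues.append("clue 10: super tall in House 1 has no house to its left")

# Clue 15 needs Carol strictly to the left of Timothy's mother.
# Assumption: Timothy's mother is the house whose child is Timothy.
carol = house_of(model_solution, "Name", "Carol")
timothy_parent = house_of(model_solution, "Children", "Timothy")
if carol is not None and carol == timothy_parent:
    issues.append("clue 15: Carol cannot be to the left of herself")

# The clue-1 step in the response placed Samantha's parent in House 1.
if model_solution[1]["Children"] != "Samantha":
    issues.append("clue 1 step: House 1 child is not Samantha")

for issue in issues:
    print(issue)
```

Under these assumptions the check flags three inconsistencies, which agrees with the Solution differing from the Ground-Truth in House 1.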

The person who loves fantasy books is the Norwegian.

The cat lover and the person who loves biography books are next to each other.

The person who loves yellow is Bob.

The person whose favorite color is green is Peter.

There is one house between the Dane and the person who is a pizza lover.

The person who loves blue is somewhere to the left of the Dane.

The person who loves eating grilled cheese is somewhere to the left of the Norwegian.

The person who loves the spaghetti eater is Peter.

The person who keeps horses is Alice.

The fish enthusiast is directly left of the person who loves science fiction books.

There is one house between the Norwegian and Arnold.

The person who loves romance books is the British person.

There are two houses between the Norwegian and Alice.

The bird keeper is the person whose favorite color is red.

The dog owner is directly left of the fish enthusiast.

The person who loves the stew is the Norwegian.

Ground-Truth { "House 1": {"Name": "Bob", "Nationality": "german", "BookGenre": "mystery", "Food": "grilled cheese", "Color": "yellow", "Animal": "dog"}, "House 2": {"Name": "Eric", "Nationality": "norwegian", "BookGenre": "fantasy", "Food": "stew", "Color": "blue", "Animal": "fish"}, "House 3": {"Name": "Pe...
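Several of these clues concern only the first two houses, so they can be checked against the visible part of the ground truth. This is a rough table-driven sketch: the helper `where` and the clue descriptions are illustrative names rather than anything defined in the paper, and clues touching the truncated houses are left out.

```python
# Houses 1 and 2 as visible in the truncated ground truth; later houses are cut off.
ground_truth = {
    1: {"Name": "Bob", "Nationality": "german", "BookGenre": "mystery",
        "Food": "grilled cheese", "Color": "yellow", "Animal": "dog"},
    2: {"Name": "Eric", "Nationality": "norwegian", "BookGenre": "fantasy",
        "Food": "stew", "Color": "blue", "Animal": "fish"},
}


def where(attribute, value):
    """House number holding the value, or None if it is outside the visible houses."""
    return next((n for n, a in ground_truth.items() if a.get(attribute) == value), None)


# (description, predicate) pairs; each predicate touches only visible houses.
clues = [
    ("fantasy reader is the Norwegian",
     lambda: where("BookGenre", "fantasy") == where("Nationality", "norwegian")),
    ("yellow lover is Bob",
     lambda: where("Color", "yellow") == where("Name", "Bob")),
    ("grilled cheese lover is left of the Norwegian",
     lambda: where("Food", "grilled cheese") < where("Nationality", "norwegian")),
    ("stew lover is the Norwegian",
     lambda: where("Food", "stew") == where("Nationality", "norwegian")),
    ("dog owner is directly left of the fish enthusiast",
     lambda: where("Animal", "dog") + 1 == where("Animal", "fish")),
]

results = {text: check() for text, check in clues}
for text, ok in results.items():
    print(f"{'ok  ' if ok else 'FAIL'} {text}")
```

Keeping each clue as a description plus a predicate makes it easy to add the remaining clues once the full ground truth is available.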

From Clue 17, the Norwegian loves stew.

From Clue 8, the grilled cheese lover is to the left of the Norwegian.

From Clue 12, there is one house between the Norwegian and Arnold.

From Clue 14, there are two houses between the Norwegian and Alice.

From Clue 10, Alice keeps horses.

From Clue 3, the German is Bob.

From Clue 4, Bob loves yellow.

From Clue 5, Peter loves green.

From Clue 9, Peter loves spaghetti.