pith. machine review for the scientific record.

arxiv: 2509.24239 · v4 · submitted 2025-09-29 · 💻 cs.LG · cs.AI

Recognition: unknown

ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models

Authors on Pith: no claims yet
classification: 💻 cs.LG, cs.AI
keywords: reasoning, LLMs, models, ChessArena, play, strategic, capabilities, chess
Original abstract

Recent large language models (LLMs) have shown strong reasoning capabilities. However, a critical question remains: do these models possess genuine strategic reasoning, or do they primarily excel at pattern recognition? To address this, we present ChessArena, a chess-based testbed for evaluating LLMs. Chess demands strategic reasoning, precise rule adherence, and the ability to track complex game states. ChessArena is a competitive framework where LLMs play against each other under four play modes. We evaluate 13 LLMs across over 800 games, testing basic understanding, move selection, and puzzle solving. Results reveal significant shortcomings: no model beats Maia-1100 (human amateur level), and some lose to random play. We also present a strong baseline: our fine-tuned Qwen3-8B substantially improves performance, approaching much larger state-of-the-art reasoning models.
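The abstract describes ChessArena as a competitive framework in which models play against each other and are scored on the outcomes. As a minimal sketch of what such a head-to-head harness could look like (assumptions: the real ChessArena prompts LLMs for chess moves via a chess engine adapter; here the players are plain callables and the game is a toy stand-in, since the paper's actual interface is not shown):

```python
import random

class StubGame:
    """Toy stand-in for a chess adapter: the first player to reach
    10 points wins; each turn offers moves worth 1-3 points."""
    def __init__(self):
        self.scores = [0, 0]
        self.turn = 0  # index of the player to move
    def legal_moves(self):
        return [1, 2, 3]
    def push(self, move):
        self.scores[self.turn] += move
        self.turn = 1 - self.turn
    def over(self):
        return max(self.scores) >= 10
    def result(self):
        # Index of the winner (0 = first player, 1 = second).
        return 0 if self.scores[0] > self.scores[1] else 1

def random_player(game):
    # Analogous to the random-play baseline some models lose to.
    return random.choice(game.legal_moves())

def greedy_player(game):
    # A trivially stronger policy, standing in for a tuned model.
    return max(game.legal_moves())

def play(first, second):
    """Run one game between two move-selection policies."""
    game, players = StubGame(), [first, second]
    while not game.over():
        game.push(players[game.turn](game))
    return game.result()

random.seed(0)
wins = sum(play(greedy_player, random_player) == 0 for _ in range(100))
print(wins)
```

In this toy setting the greedy policy wins every game, illustrating how a round-robin of such `play` calls yields the win/loss tallies an arena ranking is built from; the real testbed would swap in legal chess moves and LLM-generated choices.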

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 conditional novelty 8.0

LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in their reasoning traces, not the deep lookahead they generate, unlike humans, for whom depth drives performance.

  2. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLM move selection in four-in-a-row is best explained by myopic models that ignore deep nodes in their own reasoning traces, while performance correlates with search breadth rather than depth.

  3. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLMs display myopic planning in games: move selection is driven by shallow nodes in reasoning traces despite generating deep lookahead, with performance tied to search breadth rather than depth.

  4. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

    cs.AI 2026-05 unverdicted novelty 7.0

    LLMs exhibit myopic planning in games, with move choices driven by shallow nodes despite deep reasoning traces, in contrast to human deep-search reliance.