Disentangling generalization and memorization in large language models using chess

Leonard S. Pleiss; Maximilian Schiffer; Robert K. von Weizsaecker

arxiv: 2601.16823 · v2 · pith:NC3BHYBSnew · submitted 2026-01-23 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-21 14:19 UTC · model grok-4.3

keywords large language models · memorization · generalization · chess · priors · reasoning

The pith

LLM chess performance degrades with fewer relevant priors and reaches random baseline for novel positions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces chess as a testbed to separate memorization from genuine generalization in large language models. It builds a taxonomy of positions based on how many relevant prior patterns they contain, from everyday ones to completely new configurations. Models are tested across this spectrum, revealing that their success rate falls steadily as familiar patterns become rarer. In the sparsest cases, performance matches that of random moves, even in advanced models. This suggests that current scaling and reasoning methods have limited ability to handle situations without memorized priors.

Core claim

By classifying chess positions according to the density of relevant priors they share with likely training data, the analysis shows that LLM move quality degrades consistently with decreasing prior density and reaches random baseline levels for novel positions.

What carries the argument

A taxonomy of chess positions that varies the density of relevant priors, using engine evaluations as ground truth for position quality.
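
As a rough illustration of such a taxonomy, the sketch below bins positions into three tiers by how often they occur in a reference game database. The `Position` class, the `db_count` field, the tier names, and both thresholds are hypothetical; the paper's actual selection criteria are not given on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    fen: str       # position in Forsyth-Edwards Notation
    db_count: int  # hypothetical: occurrences in a reference game database


def prior_tier(pos, common_min=1000, rare_min=1):
    """Map a position to a coarse prior-density tier (illustrative thresholds)."""
    if pos.db_count >= common_min:
        return "high"
    if pos.db_count >= rare_min:
        return "medium"
    return "low"


def build_taxonomy(positions):
    """Group positions into disjoint tiers keyed by prior density."""
    tiers = {"high": [], "medium": [], "low": []}
    for pos in positions:
        tiers[prior_tier(pos)].append(pos)
    return tiers
```

Because each position lands in exactly one tier, the tiers are disjoint by construction. Whether raw database counts are a fair proxy for what a model saw in training is the open question the referee report raises.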

If this is right

Models show a steep performance gradient based on prior density.
For low prior density tasks, performance matches random baseline.
Improvements in newer models are smaller for sparse prior tasks.
Reasoning methods provide less benefit per token when priors are absent.
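
The comparison against a random baseline in these claims can be made concrete with a centipawn-loss calculation. The sketch below assumes each position comes with a dictionary `move_evals` that maps every legal move to an engine evaluation in centipawns from the mover's point of view, and it takes the random baseline to be a uniformly random legal move. Both are assumptions for illustration, not details confirmed by the paper.

```python
def centipawn_loss(move_evals, chosen):
    """Loss of `chosen` relative to the engine's best move, in centipawns."""
    return max(move_evals.values()) - move_evals[chosen]


def random_baseline_loss(move_evals):
    """Expected loss of a uniformly random legal move in this position."""
    best = max(move_evals.values())
    return sum(best - value for value in move_evals.values()) / len(move_evals)


def average_losses(records):
    """records: iterable of (move_evals, chosen_move) pairs.

    Returns (model average loss, random-play average loss) over all records.
    """
    records = list(records)
    model = sum(centipawn_loss(evals, move) for evals, move in records) / len(records)
    rand = sum(random_baseline_loss(evals) for evals, _ in records) / len(records)
    return model, rand
```

A model whose average loss is close to the random-play figure on a given tier is, by this measure, doing no better than picking legal moves at random.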

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures may need explicit mechanisms for handling low-prior scenarios beyond increasing model size.
Similar taxonomies could be applied to other domains like code or math problems to test generalization.
Future training might benefit from synthetic data that deliberately reduces prior density to force generalization learning.

Load-bearing premise

The taxonomy correctly captures the density of relevant priors in a way that is independent of any particular model's training data, with engine evaluations serving as objective ground truth.

What would settle it

A model that achieves strong performance on low-prior-density chess positions without any corresponding increase in training exposure to similar patterns would challenge the claim that generalization is limited.

Original abstract

Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall or genuine reasoning ability. We introduce chess as a controlled testbed aimed at disentangling these faculties. Leveraging the game's structure and scalable engine evaluations, we construct a taxonomy of positions varying in density of relevant priors - ranging from common states solvable by memorization to completely novel ones requiring generalization. Crucially, our approach achieves this distinction without requiring explicit knowledge of the models' training data. Applying this taxonomy, we combine a longitudinal analysis of the GPT lineage with a rigorous evaluation of contemporary models, including Claude Opus and Gemini. Our analysis reveals a steep gradient: performance consistently degrades as the density of relevant priors decreases. Notably, for tasks with few relevant priors, base model performance regresses to the random-play baseline. While newer models improve, progress slows significantly for tasks with sparse priors. Furthermore, while reasoning-augmented inference improves performance, its relative marginal benefit per token decreases in the absence of relevant priors. These results suggest limitations in systematic generalization, highlighting the need for mechanisms beyond scale to achieve robust performance when deprived of relevant priors.
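
The abstract's "relative marginal benefit per token" is not defined on this page. One plausible reading, sketched below, is the relative reduction in loss that reasoning buys, divided by the number of reasoning tokens spent. The function name, the per-1,000-token scaling, and the use of loss as the performance measure are all assumptions.

```python
def gain_per_1k_tokens(base_loss, reasoning_loss, reasoning_tokens):
    """Relative loss reduction per 1,000 reasoning tokens (hypothetical metric)."""
    if base_loss <= 0 or reasoning_tokens <= 0:
        raise ValueError("base_loss and reasoning_tokens must be positive")
    relative_gain = (base_loss - reasoning_loss) / base_loss
    return relative_gain / (reasoning_tokens / 1000)
```

Under this reading, a tier where reasoning halves the loss with 2,000 tokens scores higher than a tier where 4,000 tokens buy only a 10% reduction, which is the kind of contrast the abstract describes between dense-prior and sparse-prior tasks.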

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The chess taxonomy reveals a performance gradient in LLMs but risks conflating memorization with generalization due to potential training data overlap.

The main thing to know is that this paper uses a chess position taxonomy based on how many relevant priors are available to test whether LLMs are generalizing or just recalling patterns. They find performance falls off as prior density drops, hitting random baseline on the lowest ones, and that scaling helps less there. They do a longitudinal look at GPT models and compare to Claude and Gemini.

The approach stands out for trying to separate memorization from generalization without inspecting training data, instead relying on position features and engine assessments to build the taxonomy from common to novel states. This is useful because it gives a scalable way to probe these limits in a domain with clear ground truth via chess engines. The trend across models is consistent in what they report, and the note on reasoning methods providing diminishing returns on sparse tasks adds a practical angle.

The main concern is the taxonomy's claim to independence. Positions labeled low-prior might still show up in the vast chess literature and databases that likely fed into pretraining, even if not explicitly in the model's data. Without decontamination steps or model-specific tests for whether those exact positions were encountered, the drop could mix in some memorization effects. The abstract doesn't detail the position selection or stats enough to judge robustness fully.

Overall this is aimed at researchers studying LLM capabilities in structured domains and those looking for better ways to measure true reasoning. It has enough of a fresh application and clear results to merit peer review, though revisions would likely focus on tightening the prior density measurement and ruling out data overlap. I'd recommend sending it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces chess as a controlled testbed to disentangle memorization from generalization in LLMs. It constructs a taxonomy of positions varying in density of relevant priors (from common states to novel ones) using engine evaluations and game database frequencies, without direct training-data inspection. Longitudinal evaluation of the GPT lineage plus tests on Claude Opus and Gemini show performance degrading as prior density decreases, regressing to random baseline for low-density tasks; reasoning-augmented inference helps but with diminishing returns absent priors.

Significance. If the taxonomy isolates low-prior positions independently of training data, the results would demonstrate systematic limits to LLM generalization beyond memorized patterns and indicate that scale alone is insufficient for robust out-of-distribution reasoning. The use of scalable engine evaluations for objective position quality and the cross-model comparison provide a structured, falsifiable probe in a domain with clear ground truth.

major comments (2)

[§3] §3 (Taxonomy Construction): The central claim that performance regresses to the random baseline as relevant-prior density falls requires the taxonomy to partition positions by prior exposure independently of any specific model's training set. Reliance on frequency in game databases and chess theory sources, which overlap substantially with web-scale pretraining corpora, leaves open the possibility that 'low-density' positions remain partially memorized. No decontamination step or model-specific probing is described to confirm absence from training, so the degradation could reflect residual memorization of rare documented positions rather than a true generalization limit.
[§4] §4 (Evaluation and Results): The reported steep gradient and regression to random baseline for sparse-prior tasks is load-bearing for the generalization conclusion. Without explicit details on position-selection criteria, statistical controls for multiple comparisons, or how 'density of relevant priors' is quantified (e.g., exact thresholds or feature combinations), it is unclear whether post-hoc choices in taxonomy construction influence the observed trends across models.
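
One way to supply the statistical controls requested in the second major comment is a bootstrap confidence interval for the difference between two density bins, plus a correction for multiple comparisons across models. The sketch below uses a percentile bootstrap and the Holm-Bonferroni procedure. Both are standard choices swapped in for illustration; nothing on this page says which methods the authors use.

```python
import random


def bootstrap_mean_diff(a, b, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each bin with replacement, keeping its original size.
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(sum(resampled_a) / len(a) - sum(resampled_b) / len(b))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Scale the rank-th smallest p-value by (m - rank) and keep monotone.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted
```

Here `a` and `b` would be per-position losses for two density bins of one model, and `holm_adjust` would be applied to the set of p-values from all model-by-bin comparisons reported together.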

minor comments (2)

[Abstract] The abstract introduces 'relevant priors' without a concise operational definition; adding one sentence linking it to the engine features and database frequencies would improve accessibility.
[Results figures] Figure captions and axis labels in the results section could more explicitly state the random baseline value and number of positions per density bin to aid quick interpretation.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments on our work. We address each major comment below and have revised the manuscript accordingly where appropriate.

Referee: [§3] §3 (Taxonomy Construction): The central claim that performance regresses to the random baseline as relevant-prior density falls requires the taxonomy to partition positions by prior exposure independently of any specific model's training set. Reliance on frequency in game databases and chess theory sources, which overlap substantially with web-scale pretraining corpora, leaves open the possibility that 'low-density' positions remain partially memorized. No decontamination step or model-specific probing is described to confirm absence from training, so the degradation could reflect residual memorization of rare documented positions rather than a true generalization limit.

Authors: We acknowledge that game databases and chess theory sources likely overlap with the pretraining data of LLMs. Our taxonomy is designed to use publicly available, objective metrics, such as move frequencies from large game databases and engine evaluations, to categorize positions by the density of relevant priors, without needing direct access to any model's training corpus. This approach allows for a model-agnostic taxonomy. However, we agree that residual memorization of rare positions cannot be entirely ruled out. In the revised manuscript, we will expand the discussion in Section 3 to explicitly address this limitation and include additional analysis of position novelty based on combinatorial features not commonly documented.
revision: partial

Referee: [§4] §4 (Evaluation and Results): The reported steep gradient and regression to random baseline for sparse-prior tasks is load-bearing for the generalization conclusion. Without explicit details on position-selection criteria, statistical controls for multiple comparisons, or how 'density of relevant priors' is quantified (e.g., exact thresholds or feature combinations), it is unclear whether post-hoc choices in taxonomy construction influence the observed trends across models.

Authors: We appreciate this point and will include more precise details on the position selection criteria and the quantification of prior density in the revised Section 4. Specifically, we will define the density metric as a weighted combination of database frequency (log-transformed) and the number of relevant theoretical concepts identified via engine analysis. We have added bootstrapped statistical tests and corrections for multiple comparisons to ensure the robustness of the observed gradients. These additions clarify that the trends are not due to post-hoc selections.
revision: yes
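
The density metric described in this response can be sketched as a simple weighted sum. The weights, the function names, and the choice of `log1p` (so that a database count of zero maps to zero) are assumptions; the simulated rebuttal does not specify them.

```python
import math


def prior_density(db_count, n_concepts, w_freq=0.5, w_concepts=0.5):
    """Weighted sum of log-transformed database frequency and concept count.

    The equal weights are placeholders, not values from the paper.
    """
    return w_freq * math.log1p(db_count) + w_concepts * n_concepts


def rank_by_density(positions):
    """positions: dict of name -> (db_count, n_concepts).

    Returns the names ordered from sparsest to densest priors.
    """
    return sorted(positions, key=lambda name: prior_density(*positions[name]))
```

For example, `rank_by_density({"opening": (100000, 4), "odd": (0, 1), "mid": (50, 2)})` orders the hypothetical positions as `["odd", "mid", "opening"]`. Fixing the weights before looking at model results is what would address the referee's worry about post-hoc choices.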

standing simulated objections not resolved

Full verification of absence from training data requires access to the proprietary training sets of the evaluated models, which is not available.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper defines a taxonomy of chess positions by combining engine evaluations (as objective ground truth) with position features and frequencies drawn from game databases or theory, then measures LLM performance degradation as prior density decreases. This chain does not reduce any prediction or central claim to its own inputs by construction, nor does it rely on load-bearing self-citation, fitted parameters renamed as predictions, or ansatzes smuggled in via prior work. The explicit claim that the distinction is achieved without inspecting training data keeps the method independent of the target models' memorization, making the empirical gradient a genuine test rather than a definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that prior density can be quantified independently of model training data and that chess provides a clean gradient from memorization to generalization.

axioms (1)

domain assumption: Chess positions can be reliably ordered by density of relevant priors without access to any model's training corpus.
This premise is required for the taxonomy to serve as a training-data-free probe.

pith-pipeline@v0.9.0 · 5738 in / 1279 out tokens · 52833 ms · 2026-05-21T14:19:10.016230+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

We construct three disjoint subsets … $X_{\mathrm{WD}}$, $X_{\mathrm{ND}}$, $X_{\mathrm{OOD}}$ … $p_{\mathrm{train}}(x_{\mathrm{WD}}) \gg p_{\mathrm{train}}(x_{\mathrm{ND}}) \gg p_{\mathrm{train}}(x_{\mathrm{OOD}})$

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Average Centipawn Loss … performance collapses to random levels … in the absence of relevant priors

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI 2026-05 conditional novelty 8.0

LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in reasoning traces, not the deep lookahead they generate, unlike humans where depth drives performance.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI 2026-05 unverdicted novelty 7.0

LLM move selection in four-in-a-row is best explained by myopic models that ignore deep nodes in their own reasoning traces, while performance correlates with search breadth rather than depth.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI 2026-05 unverdicted novelty 7.0

LLMs display myopic planning in games: move selection is driven by shallow nodes in reasoning traces despite generating deep lookahead, with performance tied to search breadth rather than depth.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI 2026-05 unverdicted novelty 7.0

LLMs exhibit myopic planning in games, with move choices driven by shallow nodes despite deep reasoning traces, in contrast to human deep-search reliance.

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI 2026-05 unverdicted novelty 6.0

LLM planning in four-in-a-row is myopic: move choices match a shallow model that ignores deep nodes expanded in reasoning traces.