Anyone for chess? Analysing chess ratings above high thresholds

Nils Lid Hjort

arxiv: 2602.04353 · v1 · pith:Y2IJS2YWnew · submitted 2026-02-04 · 📊 stat.OT

Pith reviewed 2026-05-21 14:01 UTC · model grok-4.3

classification 📊 stat.OT

keywords chess ratings · tail analysis · extreme values · FIDE data · variance differences · high thresholds · elite performance

The pith

Differences in variance can create large gaps among the very top chess players even when averages are similar.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops statistical models specifically for the upper tails of distributions like chess ratings, where interest centers on players above high thresholds such as 2100 points or the top 100. These models rely only on the listed elite scores rather than the full population distribution below the threshold. When applied to FIDE data for 14,671 active men and 753 women, the analysis shows that small variance differences can produce substantial separations at the extreme top even if means or medians are nearly identical. This matters for understanding why certain groups or strata dominate the absolute highest ranks without needing complete data on all players.

Core claim

The author develops models and tools for analyzing chess ratings above high thresholds using only the listed top scores, and applies them to the FIDE top-100 and above-2100 lists for active players. The central argument is that even when two or more distributions have close to identical expected values or medians, smaller differences in variance may explain gaps for the few very best ones.
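The variance mechanism in this claim can be illustrated with a small calculation. The sketch below assumes two Gaussian populations with the same mean and slightly different standard deviations; the numbers (`mean`, `sd_a`, `sd_b`) are hypothetical and are not estimates from the FIDE lists or from the paper.

```python
from statistics import NormalDist


def tail_prob(mean, sd, threshold):
    """P(X > threshold) for X ~ Normal(mean, sd)."""
    return 1.0 - NormalDist(mean, sd).cdf(threshold)


# Hypothetical parameters chosen only for illustration (not FIDE estimates).
mean = 1500.0
sd_a, sd_b = 300.0, 270.0  # same mean, group A has the larger spread


def tail_ratio(threshold):
    """How many times more likely group A is than group B to exceed threshold."""
    return tail_prob(mean, sd_a, threshold) / tail_prob(mean, sd_b, threshold)


# The ratio is 1 at the common mean and grows as the threshold moves up.
for t in (1500, 2100, 2400, 2700):
    print(t, round(tail_ratio(t), 2))
```

Under these assumed values the two groups are indistinguishable at the centre, while the ratio of exceedance probabilities keeps increasing with the threshold, which is the qualitative pattern the core claim describes.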

What carries the argument

Tail models for ratings exceeding high thresholds, fitted using only the listed top scores from FIDE lists.
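As a rough sketch of what fitting from listed top scores alone can look like, the example below assumes that exceedances over the threshold follow an exponential distribution, the simplest member of the generalized Pareto family. This is an assumption made here for illustration; the abstract does not name the paper's functional forms, and the data are synthetic. The helper name `fit_exponential_tail` is hypothetical.

```python
import random


def fit_exponential_tail(scores, threshold):
    """Maximum-likelihood scale for exceedances Y = X - threshold,
    assuming Y ~ Exponential(scale). Uses only scores above the threshold."""
    excesses = [x - threshold for x in scores if x > threshold]
    if not excesses:
        raise ValueError("no scores above the threshold")
    # For the exponential model the MLE of the scale is the mean excess.
    scale = sum(excesses) / len(excesses)
    return scale, len(excesses)


# Synthetic ratings above a 2100 threshold (not real FIDE data).
rng = random.Random(0)
threshold = 2100.0
true_scale = 120.0
synthetic = [threshold + rng.expovariate(1.0 / true_scale) for _ in range(5000)]

scale_hat, n_exc = fit_exponential_tail(synthetic, threshold)
print(n_exc, round(scale_hat, 1))
```

Nothing below the threshold enters the estimate, which is the sense in which such a fit is silent about the bulk of the distribution.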

If this is right

Gaps among the absolute top ranks can be attributed primarily to variance differences rather than shifts in central tendency.
Comparisons between groups such as men and women in chess become possible through tail analysis alone.
Similar tail models can be applied to other skill or performance measures where only elite scores are readily available.
Predictions for the distribution of even rarer extreme ratings follow directly from the fitted tail parameters.
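The last point, extrapolating to rarer ratings from fitted tail parameters, can be sketched as follows. The example assumes exponential exceedances above the threshold, which is an illustrative choice and not necessarily the paper's model; the function name `expected_count_above` and all numbers are hypothetical.

```python
import math


def expected_count_above(n_above_u, u, v, scale):
    """Expected number of the n_above_u players listed above u who also
    exceed a higher threshold v, assuming exponential exceedances."""
    if v < u:
        raise ValueError("v must be at least u")
    # P(X > v | X > u) = exp(-(v - u) / scale) under the exponential tail.
    return n_above_u * math.exp(-(v - u) / scale)


# Hypothetical inputs: 1000 listed players above 2100, fitted scale 100.
for v in (2100.0, 2300.0, 2500.0):
    print(v, round(expected_count_above(1000, 2100.0, v, 100.0), 1))
```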

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same variance-driven mechanism could appear in other competitive domains such as scientific output or athletic records where full population data is unavailable.
Efforts to close performance gaps at the elite level may need to target spread in addition to average ability.
Historical FIDE rating lists could be reanalyzed with these models to test whether variance differences have changed over time.

Load-bearing premise

That tail behavior above high thresholds can be usefully modeled from the listed top scores alone without reference to the shape or parameters of the bulk distribution below the threshold.

What would settle it

The tail model would be called into question if the number of players predicted to exceed a yet higher rating threshold like 2500, based on fitting the tail model to current top-100 data, deviated markedly from the actual observed count in updated FIDE lists.
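One way such a comparison could be made concrete is a simple count check. The sketch below treats the number of players above the higher threshold as approximately Poisson with mean equal to the model's predicted count, and reports a two-sided p-value. The Poisson approximation, the function names, and the example numbers are assumptions made here, not part of the paper.

```python
import math


def poisson_pmf(k, mu):
    """P(N = k) for N ~ Poisson(mu), computed in log space for stability."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def poisson_two_sided_p(observed, expected):
    """Two-sided p-value: twice the smaller tail probability, capped at 1."""
    lower = sum(poisson_pmf(j, expected) for j in range(observed + 1))  # P(N <= observed)
    upper = max(0.0, 1.0 - sum(poisson_pmf(j, expected) for j in range(observed)))  # P(N >= observed)
    return min(1.0, 2.0 * min(lower, upper))


# Hypothetical counts: the model predicts 10 players above the new threshold.
print(poisson_two_sided_p(10, 10.0))  # observed count matches the prediction
print(poisson_two_sided_p(30, 10.0))  # observed count far above the prediction
```

A small p-value would indicate a marked deviation between prediction and observation; the check ignores uncertainty in the fitted tail parameters, so it is at best a first screen.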

Original abstract

Suppose some cleverness score parameter is sufficiently interesting to be defined and then measured, perhaps for different strata of specialists or for the broader population. Such phenomena could have Gaussian distributions, when it comes to all players in a stratum, but when interest focuses on the very tails, for the top few percent, those above certain high thresholds, different models are called for, along with the need to analyse such based on the listed top scores only. In this note I develop such models and tools, and apply them to the top-100 and above 2100 points lists for regular chess ratings, for the currently active 14671 men and 753 women, as given by the FIDE, January 2026. It is argued that even when two or more distributions have close to identical expected values, or medians, even smaller differences in variance may explain gaps for the few very best ones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Hjort applies tail models to chess ratings from top scores alone and argues variance differences can explain elite gaps even with similar means, but the bulk distribution is needed to check that similarity.

Hi colleague,

The main takeaway is that this note takes standard extreme-value tools and adapts them for analyzing rating tails when only the top-listed scores are available. Hjort fits models to the FIDE January 2026 lists above 2100 for 14671 men and 753 women, then uses the results to argue that modest variance differences between groups can produce large gaps at the very top even if the full distributions have nearly identical means or medians.

What the paper does well is show a workable approach for truncated data. The chess example is concrete, the gender comparison adds a clear application, and the focus on practical modeling from listed tops only is useful for anyone who faces similar data constraints in performance or ranking settings. The models stay close to established tail techniques without unnecessary invention.

The soft spot is exactly the one the stress-test flags. The central claim needs the means or medians to be close, yet the overall mean is driven by the mass of ratings below 2100. With no data or modeling of that bulk, you cannot confirm the means are similar; a shift in the lower part of the distribution could generate the same top gaps through mean differences instead. This makes the variance attribution plausible but not unique or strongly supported by the evidence at hand. The abstract also leaves out fitting details and validation steps, which would normally be checked in review.

This is the sort of short applied note that would interest statisticians working on extremes in sports, competitions, or other truncated performance data. A reader who wants a worked example of tail modeling on real ranking lists would get value from it, though it is not a methodological advance.

I would send it for peer review as a concise note. The modeling framework is solid enough to deserve referee time, with the expectation that the mean-variance identifiability issue gets clarified or qualified.

Referee Report

2 major / 1 minor

Summary. The manuscript develops statistical models and tools for analyzing distributions above high thresholds using only the listed top scores. It applies these to FIDE chess ratings above 2100 for 14671 active men and 753 women as of January 2026, arguing that small differences in variance can explain gaps among the very top performers even when expected values or medians are nearly identical.

Significance. If the tail models are valid and the similarity of means can be substantiated, the work would provide useful methods for extreme-value analysis in rating systems without requiring the full distribution. The empirical focus on chess data offers a concrete illustration of how variance influences tail disparities, with potential relevance to performance gaps in other specialist domains.

major comments (2)

[Abstract] The claim that 'two or more distributions have close to identical expected values, or medians' is not supported by the top-scores analysis alone. The overall mean is dominated by the bulk below the 2100 threshold, which is neither modeled nor estimated from the given data for men and women; without this, it is impossible to verify mean similarity and the attribution of top gaps to variance differences is non-unique.
[Application to chess ratings] The tail models fitted directly to the listed scores above 2100 do not reference the shape or parameters of the distribution below the threshold. This leaves open the possibility that shifts in the unmodeled bulk below the threshold could generate equivalent top-end gaps through mean differences rather than variance, undermining the central variance-gap explanation.
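The non-uniqueness raised in these comments can be made concrete with a small Gaussian example. The sketch below uses hypothetical parameters (not FIDE estimates) and shows that the fraction of a group above a single threshold can be reproduced either by a smaller standard deviation with the same mean or by a lower mean with the same standard deviation.

```python
from statistics import NormalDist


def tail_prob(mean, sd, threshold):
    """P(X > threshold) for X ~ Normal(mean, sd)."""
    return 1.0 - NormalDist(mean, sd).cdf(threshold)


threshold = 2100.0

# Reference group A (hypothetical values).
p_a = tail_prob(1500.0, 300.0, threshold)

# Explanation 1: group B has the same mean as A but a smaller spread.
p_b_variance = tail_prob(1500.0, 270.0, threshold)

# Explanation 2: group B has the same spread as A but a lower mean,
# chosen so that its exceedance probability matches explanation 1.
z = NormalDist().inv_cdf(1.0 - p_b_variance)
mean_shifted = threshold - 300.0 * z
p_b_mean = tail_prob(mean_shifted, 300.0, threshold)

print(round(mean_shifted, 1), p_b_variance, p_b_mean)
```

The two scenarios agree on the share of players above this one threshold. They differ in how the tail behaves further up and in what happens below the threshold, so separating them requires either the tail shape under an assumed parametric form or information about the bulk.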

minor comments (1)

[Abstract] The data reference 'January 2026' appears inconsistent with present timelines; confirm the exact FIDE list date used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which correctly identify limitations in how the manuscript frames its claims given the tail-only data. We have revised the abstract, introduction, and application section to clarify the conditional nature of the variance explanation and to avoid implying empirical verification of overall mean or median similarity.

Referee: [Abstract] The claim that 'two or more distributions have close to identical expected values, or medians' is not supported by the top-scores analysis alone. The overall mean is dominated by the bulk below the 2100 threshold, which is neither modeled nor estimated from the given data for men and women; without this, it is impossible to verify mean similarity and the attribution of top gaps to variance differences is non-unique.

Authors: We agree that the analysis relies exclusively on ratings above the 2100 threshold and provides no information on the distribution below it, so overall means or medians cannot be verified or compared from the available data. The manuscript develops tail-specific models for extremes using only listed top scores and illustrates that, within such conditional tail distributions, modest variance differences can produce large gaps at the highest quantiles. We did not claim to have empirically established mean equality from the tail data alone. In revision we have updated the abstract to present the argument as conditional ('even when two or more distributions have close to identical expected values or medians, small variance differences may explain...') and added an explicit limitations paragraph in the application section noting that mean shifts arising from the unmodeled bulk remain a possible alternative explanation.
Revision: yes

Referee: [Application to chess ratings] The tail models fitted directly to the listed scores above 2100 do not reference the shape or parameters of the distribution below the threshold. This leaves open the possibility that shifts in the unmodeled bulk below the threshold could generate equivalent top-end gaps through mean differences rather than variance, undermining the central variance-gap explanation.

Authors: This observation is accurate: the models are fitted only to the observed scores above 2100 and are therefore silent on the form or location of the distribution below the threshold. Consequently, differences in the lower bulk could alter overall means and thereby affect the upper tail without any change in conditional variance. Our contribution is the development of tail-specific tools that permit analysis of extremes from top-score lists alone; the variance parameter in these models governs spread within the conditional tail. We have added a clarifying subsection that states the variance-gap account is offered under the maintained assumption of similar central tendencies and acknowledges that unmodeled mean shifts constitute a competing explanation. The language in the application section has been revised to present the variance mechanism as one plausible account supported by the tail analysis rather than the sole or definitive cause.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper develops tail models for chess ratings using only the provided top scores above the 2100 threshold for the FIDE lists of active players. The central claim that small variance differences can account for gaps among the very top players even when means or medians are nearly identical is framed as an interpretive modeling result applied to the external data. No equations, self-citations, or derivations in the abstract reduce any prediction or uniqueness result to a fitted input or prior self-referential step by construction. The analysis remains self-contained against the listed scores without tautological redefinition of inputs as outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the models are described only at the level of 'different models are called for' without naming functional forms or assumptions.

pith-pipeline@v0.9.0 · 5698 in / 1075 out tokens · 48532 ms · 2026-05-21T14:01:04.758719+00:00 · methodology
