When Mean CE Fails: Median CE Can Better Track Language Model Quality

Hao Guo; Kevin Shabahang; Rivaan Patil; Simon Dennis

arxiv: 2605.24667 · v1 · pith:D4GDQHOEnew · submitted 2026-05-23 · 💻 cs.AI · cs.LG

Pith reviewed 2026-06-30 13:21 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords: cross-entropy, language model evaluation, knowledge distillation, supervised fine-tuning, validation metrics, per-token loss distribution

The pith

Median cross-entropy tracks language model task performance more closely than mean cross-entropy in two common training scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mean cross-entropy is the standard validation metric for language models, yet it can worsen while actual task performance holds steady or improves. In supervised fine-tuning of Qwen2.5-1.5B on synthetic facts, mean CE increases substantially after the model has already reached peak fact-recall accuracy. In top-K distillation on TinyStories, smaller K values produce students with worse mean CE but better median CE, and the Top-5 student crosses below its teacher on median CE while receiving the highest LLM-judge score. The paper traces the mismatch to training-induced changes in the empirical per-token CE distribution, where bulk and tail percentiles move independently and task metrics appear more sensitive to the bulk. The authors therefore recommend reporting several percentile CE summaries alongside the mean as a low-cost way to detect when the loss distribution is being reshaped.

Core claim

In both Qwen2.5-1.5B SFT on synthetic facts and top-K distillation on TinyStories, median cross-entropy correlates much more closely with task performance than mean cross-entropy does, because training reshapes the empirical per-token CE distribution so that bulk and tail percentiles diverge.
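For readers unfamiliar with top-K distillation, the sketch below shows one common variant, which is not necessarily the loss used in the paper: keep the K tokens the teacher rates most likely, renormalize both distributions over those tokens, and take the KL divergence. The function name `top_k_kl` and the toy distributions are hypothetical.

```python
import math

def top_k_kl(teacher_probs, student_probs, k):
    """KL(teacher || student) restricted to the teacher's top-k tokens.

    Assumption: both distributions are renormalized over the kept tokens.
    The paper's exact objective may differ.
    """
    if not 1 <= k <= len(teacher_probs):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Indices of the k tokens the teacher considers most likely.
    top = sorted(range(len(teacher_probs)),
                 key=lambda i: teacher_probs[i], reverse=True)[:k]
    t_mass = sum(teacher_probs[i] for i in top)
    s_mass = sum(student_probs[i] for i in top)
    kl = 0.0
    for i in top:
        t = teacher_probs[i] / t_mass
        s = student_probs[i] / s_mass  # assumes strictly positive student probs
        if t > 0.0:
            kl += t * math.log(t / s)
    return kl

# Hypothetical next-token distributions over a 5-token vocabulary.
teacher = [0.70, 0.20, 0.05, 0.03, 0.02]
student = [0.50, 0.30, 0.10, 0.05, 0.05]
loss_k2 = top_k_kl(teacher, student, k=2)
loss_k5 = top_k_kl(teacher, student, k=5)
```

In this variant, tokens outside the top K contribute nothing to the loss, so the student is unconstrained on them. That is one plausible route by which a small K could reshape the tail of the per-token CE distribution.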

What carries the argument

The empirical distribution of per-token cross-entropies, whose median and mean respond differently when training saturates the bulk or extends the tail.
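A minimal sketch of the quantities involved, assuming per-token CE is measured in nats as the negative log-probability of the correct token. The helper names (`per_token_ce`, `percentile`, `ce_summary`) and the toy probabilities are illustrative, not taken from the paper.

```python
import math
import statistics

def per_token_ce(correct_token_probs):
    """Per-token cross-entropy in nats: -log p(correct token)."""
    return [-math.log(p) for p in correct_token_probs]

def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    xs = sorted(values)
    if not xs:
        raise ValueError("values must be non-empty")
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return xs[lo] * (1.0 - frac) + xs[hi] * frac

def ce_summary(ce_values, percentiles=(25, 50, 75, 95)):
    """Mean plus a small set of percentile summaries."""
    summary = {"mean": statistics.fmean(ce_values)}
    for q in percentiles:
        summary[f"p{q}"] = percentile(ce_values, q)
    return summary

# Hypothetical validation tokens: most are predicted well, a few very badly.
probs = [0.9] * 8 + [0.001] * 2
summary = ce_summary(per_token_ce(probs))
```

Here the two badly predicted tokens dominate the mean, while the median reflects only the well-predicted bulk.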

If this is right

Model selection or early stopping based on mean CE alone can pick suboptimal checkpoints or student models.
In top-K distillation, smaller K can produce a student that looks worse on mean CE yet better on actual quality metrics.
Task performance appears more sensitive to the bulk of the CE distribution than to its tail.
Concordance among multiple percentile CE summaries provides a diagnostic for when the loss distribution is being reshaped.
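The concordance diagnostic in the last point can be sketched as a check on whether each summary statistic would select the same checkpoint. The step numbers and CE values below are hypothetical, and `selection_concordance` is an illustrative name rather than anything defined in the paper.

```python
def best_checkpoint(summaries, key):
    """Return the step whose summary statistic `key` is lowest."""
    return min(summaries, key=lambda step: summaries[step][key])

def selection_concordance(summaries, keys=("mean", "p50", "p75", "p95")):
    """Report the checkpoint each statistic would pick and whether all agree."""
    picks = {key: best_checkpoint(summaries, key) for key in keys}
    agree = len(set(picks.values())) == 1
    return picks, agree

# Hypothetical validation summaries keyed by training step.
summaries = {
    50_000:  {"mean": 2.10, "p50": 1.60, "p75": 2.90, "p95": 6.80},
    100_000: {"mean": 2.18, "p50": 1.42, "p75": 2.75, "p95": 7.40},
    150_000: {"mean": 2.31, "p50": 1.35, "p75": 2.70, "p95": 8.10},
}
picks, agree = selection_concordance(summaries)
```

When `agree` is false, as in this toy data, the disagreement itself is the signal: the bulk and the tail are moving in different directions, and a held-out task metric is needed to decide between checkpoints.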

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

When loss distributions become heavy-tailed during extended training, the mean may cease to be a reliable single summary statistic.
Similar reshaping of the CE distribution may occur in other autoregressive training regimes beyond the two cases studied here.
Practitioners could test whether monitoring the interquartile range or other quantiles improves checkpoint selection in long runs.
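The interquartile-range idea in the last point could be tried with something like the following sketch. It uses Python's `statistics.quantiles`; the `spread_report` helper and both toy CE samples are assumptions made for illustration.

```python
import statistics

def spread_report(ce_values):
    """Median, interquartile range, and maximum of per-token CE values."""
    q1, median, q3 = statistics.quantiles(ce_values, n=4)
    return {"median": median, "iqr": q3 - q1, "max": max(ce_values)}

# Hypothetical per-token CE samples from an early and a late checkpoint.
early = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 9.0]
late = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 11.0]

early_report = spread_report(early)
late_report = spread_report(late)

# Bulk tightening and tail extension can happen at the same time.
bulk_tightened = late_report["iqr"] < early_report["iqr"]
tail_extended = late_report["max"] > early_report["max"]
```

Logging both flags per checkpoint would make it visible when the bulk has saturated but the tail is still drifting, which is the pattern the paper describes for the Qwen SFT run.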

Load-bearing premise

The two examined scenarios are representative of the cases where mean CE commonly fails to track quality during language model training.

What would settle it

A third training regime, such as continued pretraining or RLHF on a new model scale, in which mean CE and held-out task performance select the same checkpoint while median CE selects a different one.

Figures

Figures reproduced from arXiv: 2605.24667 by Hao Guo, Kevin Shabahang, Rivaan Patil, Simon Dennis.

**Figure 2.** Judge ranking (left) and median-CE trajectory during training (right, with checkpoints every 25K steps). The right panel shows a surprising pattern: a student with the same architecture and parameter count can drop below its teacher's eventual median CE. Top-5 KL crosses below at ∼75K steps, while Top-15 KL crosses near the end of training (∼175K).

**Figure 3.** Per-token CE distributions at 250K steps for the four key variants.

**Figure 4.** Per-checkpoint scatter of mean and median CE against LLM judge score.

**Figure 5.** Correlation between CE percentile and LLM judge score within the 15-checkpoint distilled…

**Figure 6.** Left: (same as…

**Figure 7.** FineWebEdu: 150M-parameter students trained for 500K steps against a 355M-parameter…

**Figure 8.** Per-token CE density evolution during Top-5 KL training, evaluated on the full 6.2M-token TinyStories validation set at every 25K-step checkpoint (log y, full range CE ∈ [0, 14]). The moderate-loss region (CE ∼ 1–8) thins as training proceeds. The upper tail (CE ≳ 10) moves much less, drifting slightly upward in the last ∼75K steps after the bulk has largely saturated. Within-training trajectory for Top-5…

**Figure 9.** Left: standardized upper-tail percentile $\tilde{p}^{(m)}_{95}$ per model family across all 30 checkpoints (error bars: within-family std). Right: standardized percentile profiles $\tilde{p}^{(m)}_{k}$ averaged within family. They tell the same qualitative story (…

**Figure 10.** Linear attention: mean judge scores at 250K over 200 prompts. The same qualitative…

**Figure 11.** Linear attention: percentile-vs-judge correlation within the 15-checkpoint distilled subset…

Original abstract

Mean cross-entropy is the standard validation metric for language models, but it can fail to track model quality during training. We examine this in two common scenarios. First, in Qwen2.5-1.5B SFT on synthetic fact-learning, we find that mean CE rises substantially after the initial learning phase while held-out fact-recall accuracy remains near its peak. Second, we find that in top-K distillation on TinyStories, decreasing K improves median CE while worsening mean CE; the Top-5 student attains the highest LLM-judge score and crosses below its teacher on median CE, despite having the worst mean CE. In both cases, median CE correlates much more closely with task performance than does mean CE. Analyzing how bulk and tail percentile CE move during training reveals that training reshapes the empirical per-token CE distribution. In top-K distillation, smaller K yields a distribution with more mass at both extremes, decreasing the median and increasing the mean. In Qwen SFT, the bulk saturates quickly while the tail extends in the latter half of training. In both, the task-evaluation metric appears more sensitive to the bulk than to the tail. Practically, we recommend reporting a small set of percentile CE summaries alongside the mean, and using concordance among them as a tool to keep track of distribution reshaping, as well as a low-cost diagnostic for when mean and median CE disagree on model selection.
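A small worked example, using made-up per-token CE values rather than numbers from the paper, shows how moving mass to both extremes can lower the median while raising the mean. Sample $A$ is concentrated around its center; sample $B$ has more very easy tokens and more very hard ones.

```latex
\begin{align*}
A &= \{1,\ 2,\ 2,\ 2,\ 3\}, &
\bar{x}_A &= \tfrac{1 + 2 + 2 + 2 + 3}{5} = 2.0, &
\operatorname{median}(A) &= 2.0, \\
B &= \{0.2,\ 0.2,\ 1.0,\ 4.0,\ 7.0\}, &
\bar{x}_B &= \tfrac{0.2 + 0.2 + 1.0 + 4.0 + 7.0}{5} = 2.48, &
\operatorname{median}(B) &= 1.0.
\end{align*}
```

Moving from $A$ to $B$, the median falls from 2.0 to 1.0 while the mean rises from 2.0 to 2.48, so the two statistics rank the samples in opposite order.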

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Median CE tracks task performance better than mean CE in the two scoped training setups, with a useful note on watching distribution shape.

The key point is that in these two cases, median cross-entropy follows model quality where the mean does not. During SFT on synthetic facts with Qwen2.5-1.5B, mean CE rises after accuracy has leveled off. In top-K distillation on TinyStories, the top student by LLM judge has the highest mean CE but the lowest median, and it beats the teacher on median.

What stands out is the analysis of how the CE distribution shifts. Training can fatten the tail or move mass to extremes, which affects mean and median differently. The paper shows the task metric cares more about the bulk of the distribution.

This is a practical observation worth noting for anyone watching validation loss during fine-tuning or distillation. Reporting a couple of percentiles alongside the mean is an easy change that could flag when the mean is being pulled by outliers.

The limitation is the narrow scope. Both experiments are small-scale and specific—one on fact memorization, one on story generation with distillation. Without more runs or larger models, it's hard to know how often this mismatch happens in typical pretraining or other fine-tuning. The correlations are described qualitatively rather than with specific coefficients or significance tests.

Still, the work is honest about what it shows and doesn't claim broader generality. For readers who train models and have noticed mean CE behaving strangely, this gives a concrete diagnostic and a fix.

I would bring this to a reading group if the group works on evaluation metrics. It is not something I would cite in a methods section, but the idea of checking distribution shape when mean and median disagree is worth remembering. It deserves peer review because the experiments are clear and the suggestion is useful even if incremental.

Referee Report

2 major / 2 minor

Summary. The paper claims that mean cross-entropy (CE) can fail to track language model quality in two specific scenarios: (1) Qwen2.5-1.5B SFT on synthetic facts, where mean CE rises after initial learning while held-out fact-recall accuracy plateaus; (2) top-K distillation on TinyStories, where smaller K worsens mean CE but improves median CE and LLM-judge scores (with Top-5 student best on judge score and crossing teacher on median CE). It shows median CE correlates more closely with task performance, attributes this to reshaping of the per-token CE distribution (bulk saturation with tail extension in SFT; more mass at extremes in distillation), and recommends reporting percentile CE summaries alongside mean as a diagnostic.

Significance. If the scoped empirical observations hold, the work usefully demonstrates a concrete limitation of the dominant mean CE validation metric during LM training and offers a low-cost, practical alternative via median and other percentiles to detect distribution reshaping. The analysis of bulk vs. tail behavior provides mechanistic insight into the divergence. Credit is due for the targeted, falsifiable examples and the actionable recommendation without reliance on new parameters or derivations.

major comments (2)

[Abstract and analysis of distribution reshaping] The support for the central claim that median CE 'correlates much more closely' with task performance rests on the two scenarios, but the manuscript provides no quantitative correlation measures (e.g., Pearson or Spearman coefficients) or statistical comparison between mean and median CE alignments with the evaluation metrics; this weakens the ability to assess the magnitude of improvement.
[Experimental scenarios] § on experimental details: the descriptions of the SFT and distillation runs omit key reproducibility elements such as number of random seeds, error bars on the reported CE and accuracy curves, exact dataset sizes/splits, and hyperparameter values, making it difficult to evaluate robustness of the observed mean/median divergence.

minor comments (2)

[Conclusion] The practical recommendation to report 'a small set of percentile CE summaries' would be clearer if the manuscript specified which percentiles (e.g., 50th, 90th) are suggested and why.
[Figures] Figure captions or legends should explicitly label which curves correspond to mean vs. median CE to avoid reader confusion when comparing to task metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address the two major comments point by point below.

Referee: [Abstract and analysis of distribution reshaping] The support for the central claim that median CE 'correlates much more closely' with task performance rests on the two scenarios, but the manuscript provides no quantitative correlation measures (e.g., Pearson or Spearman coefficients) or statistical comparison between mean and median CE alignments with the evaluation metrics; this weakens the ability to assess the magnitude of improvement.

Authors: We agree that quantitative measures would allow a clearer assessment of the improvement in alignment. In the revised manuscript we will compute and report Pearson and Spearman correlation coefficients between mean CE, median CE, and the task metrics (fact-recall accuracy and LLM-judge scores) in both experimental scenarios. revision: yes
Referee: [Experimental scenarios] § on experimental details: the descriptions of the SFT and distillation runs omit key reproducibility elements such as number of random seeds, error bars on the reported CE and accuracy curves, exact dataset sizes/splits, and hyperparameter values, making it difficult to evaluate robustness of the observed mean/median divergence.

Authors: We acknowledge the omission of these details. The revised manuscript will specify the number of random seeds, include error bars on the relevant curves, state the exact dataset sizes and splits, and list all hyperparameter values used in the SFT and distillation runs. revision: yes
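The correlation measures discussed in the first exchange can be computed with a few lines of standard-library Python. The sketch below is illustrative only: the per-checkpoint values are invented, and a real analysis would more likely call an established statistics library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks, with ties assigned the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical per-checkpoint values (not from the paper).
judge_score = [5.0, 5.5, 6.1, 6.4, 6.9]
median_ce = [1.9, 1.7, 1.5, 1.4, 1.2]
mean_ce = [2.4, 2.3, 2.5, 2.6, 2.8]

rho_median = spearman(median_ce, judge_score)
rho_mean = spearman(mean_ce, judge_score)
```

In this toy data, median CE falls monotonically as judge score rises, giving a rank correlation of -1, while mean CE mostly rises with judge score. Since lower CE is nominally better, a positive sign for the mean is the kind of mismatch the referee asks the authors to quantify.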

Circularity Check

0 steps flagged

No significant circularity: purely empirical observations

The paper contains no derivation chain, mathematical claims, or first-principles results. Its central assertions are direct empirical reports of how mean vs. median cross-entropy behave in two specific training runs (Qwen2.5-1.5B SFT and top-K distillation on TinyStories), with no equations, fitted parameters renamed as predictions, self-citations invoked as uniqueness theorems, or ansatzes smuggled in. The observations are scoped to the examined scenarios and do not reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard domain assumptions about cross-entropy as a training objective and the validity of the chosen task metrics; no free parameters or invented entities are introduced.

axioms (2)

Domain assumption: Cross-entropy loss is an appropriate base measure for tracking language model quality. Invoked throughout as the metric under study.
Domain assumption: Held-out fact-recall accuracy and LLM-judge scores are valid proxies for model quality. Used to compare against CE metrics.

Reference graph

Works this paper leans on

13 extracted references · 11 canonical work pages · 8 internal anchors

[1] Sayantan Dasgupta et al. Don't ignore the tail: Decoupling top-K probabilities for efficient language model distillation. arXiv preprint arXiv:2602.20816.
[2] Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English? arXiv preprint arXiv:2305.07759.
[3] Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543.
[4] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[5] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351.
[6] Guilherme Penedo, Hynek Kydlíček, Alessandro Cappelli, Haitz Saez de Ocáriz Borde, et al. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557.
[7] Yangjun Ruan, Chris J Maddison, and Tatsunori Hashimoto. Observational scaling laws and the predictability of language model performance. arXiv preprint arXiv:2405.10938.
[8] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
[9] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621.
[10] Qwen Team. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
[11] ?”. Examples include “Complete: Above Nimbusglade floats ___
Correct vs. incorrect CE split. For each epoch we compute mean CE separately on examples answered correctly vs. incorrectly under greedy decoding (Table 6). After the early-training improvement (epoch 1→4–5), mean CE rises on both subsets, while the ratio CE(incorrect)/CE(correct) stays within ∼2.5–2.7. Incorrectly decoded examples therefore consistently ...
[12] Large”; 4-layer, 256-dim, 4 heads, ∼8M params for “Small
Pass@5 (any-of-5 sampled correctness) sits in 0.82–0.93 and shows the same shape. Error bars on pass@1 are within-prompt 95% CIs from the N=5 sampling variance (≤ ±0.7%). Appendix C, TinyStories Protocol and Checkpoints. LLM-as-judge protocol. Prompts and generation. We draw 200 prompts once from the TinyStories validation set (seed 42), using 64-token prefixes. This...
[13] and the percentile-correlation analysis (Section 4.4) on a linear-attention architecture. Each transformer block replaces softmax self-attention with RetNet-style retention [Sun et al., 2023]: identity feature map with fixed per-head multi-scale decay $\gamma_h = 1 - 2^{-5-h}$, and a per-head LayerNorm before the output projection. For a given head h, the state recur...
