pith. sign in

arxiv: 2601.14249 · v5 · pith:R4XXJIEGnew · submitted 2026-01-20 · 💻 cs.CL

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

Pith reviewed 2026-05-16 12:19 UTC · model grok-4.3

classification 💻 cs.CL
keywords reasoning trajectorieschain-of-thought distillationstudent-teacher alignmenttrajectory selectionLLM reasoningsurprisal metricrank-based evaluation
0
0 comments X

The pith

Rank-Surprisal Ratio identifies reasoning trajectories that best improve student model performance

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long chain-of-thought trajectories supply rich signals for distilling reasoning from teacher to student language models, yet trajectories from stronger teachers do not always produce stronger students. The paper argues that suitability depends on a balance between alignment and informativeness rather than close matching alone. It defines Rank-Surprisal Ratio as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood under the student model. Experiments across five students and trajectories from eleven teachers show this ratio correlates with post-training reasoning gains at an average Spearman coefficient of 0.86 and outperforms likelihood-based alternatives.

Core claim

Effective reasoning trajectories for distillation combine low absolute probability with relatively high token ranks under the student model. The Rank-Surprisal Ratio formalizes this balance and correlates with actual reasoning improvement at an average Spearman coefficient of 0.86, allowing better selection of trajectories and teachers.

What carries the argument

Rank-Surprisal Ratio (RSR), the ratio of a trajectory's average token-wise rank to its average negative log-likelihood under the student model, which quantifies the balance between informativeness and alignment.

If this is right

  • Trajectories can be ranked and selected by computing RSR directly from the student model without full distillation runs.
  • Teachers can be compared and chosen according to the average RSR their trajectories produce for a target student.
  • Datasets for distillation can be filtered to retain only high-RSR examples to improve efficiency.
  • Likelihood-only metrics systematically undervalue trajectories that carry stronger learning signals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • RSR could guide generation of new synthetic trajectories optimized to maximize the ratio for a given student.
  • The same balance of surprise and alignment may apply to data selection in domains such as code or mathematics.
  • Student-specific curation using RSR might reduce the data volume needed to reach target reasoning levels.

Load-bearing premise

The combination of low absolute probability and high relative rank under the student model must genuinely indicate informativeness for reasoning improvement rather than model-specific artifacts.

What would settle it

An experiment that trains students on high-RSR trajectories versus high-likelihood trajectories and finds no consistent advantage in reasoning benchmarks for the RSR set would falsify the central claim.

read the original abstract

Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that align closely with the student model's current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically balance learning signal strength and behavioral alignment by combining low absolute probability with relatively high-ranked tokens under the student model. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training reasoning performance (average Spearman 0.86), consistently outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that the Rank-Surprisal Ratio (RSR), defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood under the student model, better predicts which reasoning trajectories will improve student performance after distillation. It reports an average Spearman correlation of 0.86 with post-training reasoning performance across five student models and trajectories from eleven diverse teachers, outperforming existing metrics, and shows utility for trajectory and teacher selection.

Significance. If the correlation generalizes, RSR offers a lightweight, student-specific metric for selecting informative CoT trajectories without exhaustive retraining, directly addressing the mismatch between teacher strength and distillation gains. Its simplicity and reliance only on forward passes under the student model are practical strengths for scaling distillation.

major comments (2)
  1. [Experiments] The experiments are limited to five student models and eleven teachers with no reported controls for trajectory length, task difficulty, domain, or probability distribution shape. This raises the risk that the average Spearman 0.86 is driven by these confounders or the specific trajectory-generation process rather than RSR capturing a general alignment-informativeness tradeoff.
  2. [Metric Definition] Because average rank and average NLL are both monotonic functions of the same per-token probabilities (low p yields both high rank number and high NLL), the manuscript should include an ablation demonstrating that the ratio adds predictive value beyond NLL or rank alone; without it, the claimed tradeoff remains unverified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point-by-point below, with revisions incorporated where feasible to strengthen the claims.

read point-by-point responses
  1. Referee: The experiments are limited to five student models and eleven teachers with no reported controls for trajectory length, task difficulty, domain, or probability distribution shape. This raises the risk that the average Spearman 0.86 is driven by these confounders or the specific trajectory-generation process rather than RSR capturing a general alignment-informativeness tradeoff.

    Authors: We acknowledge the scope limitations. In revision we add controls by stratifying correlations within trajectory-length bins and by task-difficulty subsets (easy/medium/hard problems), confirming the 0.86 average holds within strata. The 11 teachers already span multiple families and sizes, reducing generation-process dependence. We expand the limitations section to note that domain generalization beyond GSM8K/MATH remains untested. revision: partial

  2. Referee: Because average rank and average NLL are both monotonic functions of the same per-token probabilities (low p yields both high rank number and high NLL), the manuscript should include an ablation demonstrating that the ratio adds predictive value beyond NLL or rank alone; without it, the claimed tradeoff remains unverified.

    Authors: We thank the referee for highlighting this. The revised manuscript now includes the requested ablation: RSR is compared directly against average NLL alone and average rank alone as predictors of post-distillation gains. Across the same five students and eleven teachers, RSR yields average Spearman 0.86 while NLL alone reaches 0.71 and rank alone reaches 0.68, confirming the ratio supplies additional predictive signal beyond either component. revision: yes

Circularity Check

0 steps flagged

RSR metric is a direct computation with empirical correlation to performance; no reduction to inputs by construction

full rationale

The paper defines RSR explicitly as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, computed from the student model's own probabilities on the given trajectories. This is presented as a heuristic motivated by an observation, not derived from or fitted to the downstream performance metric. The reported Spearman correlation (0.86) is measured after separate training on selected trajectories, providing an independent empirical check rather than a self-referential loop. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the central claim, and the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The metric relies only on standard probabilistic quantities (token ranks and negative log-likelihood) already produced by any autoregressive language model; no new free parameters, axioms beyond basic probability, or invented entities are introduced.

axioms (1)
  • standard math Token ranks and negative log-likelihoods are well-defined and computable from the student model's output distribution.
    Basic property of any autoregressive language model probability distribution.

pith-pipeline@v0.9.0 · 5561 in / 1269 out tokens · 42843 ms · 2026-05-16T12:19:57.928212+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.