Training-Trajectory-Aware Token Selection

Guoshan Lu; Hao Chen; Jiaqi Hu; Junbo Zhao; Junlin Zhou; Wentao Ye; Yihong Zhuang; Zenan Huang; Zeyu Qin; Zhanming Shen

arXiv: 2601.10348 · v2 · pith:AECZLY3Knew · submitted 2026-01-15 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-22 11:36 UTC · model grok-4.3

keywords: continual distillation · token selection · reasoning models · training trajectory · imitation anchor tokens · autoregressive models · diffusion LLMs · model compression

The pith

Training-trajectory-aware token selection clears the optimization path for suppressed tokens and fixes continual distillation of reasoning ability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that continual distillation of strong reasoning models often fails with a sharp performance drop even while the training loss keeps falling. The authors trace this drop to a token-level split: Imitation-Anchor Tokens quickly gain confidence and anchor optimization, while the remaining yet-to-learn tokens stay suppressed until after the bottleneck. They introduce Training-Trajectory-Aware Token Selection (T3S), which reconstructs the training objective at the token level so that the anchors no longer block the suppressed tokens. With only hundreds of examples, the abstract reports that Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3S-trained LLaDA-2.0-Mini exceeds its autoregressive baseline.

Core claim

Even as loss decreases, performance drops at a bottleneck because token confidence splits into steadily rising Imitation-Anchor Tokens that lock in the optimization and yet-to-learn tokens whose confidence remains suppressed. These two token types cannot coexist, which is the root cause of failure in continual distillation. T3S identifies tokens by their training trajectories and rebuilds the objective to clear the path for the suppressed tokens.

What carries the argument

Training-Trajectory-Aware Token Selection (T3S), which distinguishes Imitation-Anchor Tokens from yet-to-learn tokens along the training trajectory and adjusts the loss to favor the latter.
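
The selection step can be illustrated with a minimal sketch. It assumes, following the figure captions on this page, that a token counts as an Imitation-Anchor Token when its confidence change relative to the base model is positive, and that anchors are simply masked out of the token-level loss. The function names, the choice of reference checkpoint, and all numbers are hypothetical; the paper's actual procedure may differ.

```python
import math


def anchor_mask(base_conf, ref_conf):
    """Mark a token as an anchor when its confidence rose relative to the
    base model (delta_c > 0). Which later checkpoint supplies ref_conf is
    an assumption of this sketch."""
    return [r - b > 0 for b, r in zip(base_conf, ref_conf)]


def masked_nll(logprobs, is_anchor):
    """Mean negative log-likelihood over non-anchor tokens only."""
    kept = [-lp for lp, a in zip(logprobs, is_anchor) if not a]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical per-token confidences p(y_t | y_<t, x) for one target sequence.
base_conf = [0.20, 0.50, 0.90, 0.40]  # base (pre-distillation) model
ref_conf = [0.60, 0.30, 0.95, 0.10]   # a later checkpoint on the trajectory

mask = anchor_mask(base_conf, ref_conf)

# Hypothetical log-probabilities from the model currently being trained.
current_logprobs = [math.log(0.5), math.log(0.25), math.log(0.8), math.log(0.125)]
loss = masked_nll(current_logprobs, mask)
```

Only the second and fourth tokens contribute to `loss` here, because their confidence fell between the two checkpoints. A softer variant could down-weight anchors instead of dropping them, which would again be a design choice outside what this page states.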

Load-bearing premise

The non-coexistence of imitation-anchor tokens and suppressed yet-to-learn tokens is the root cause of the observed distillation failure.

What would settle it

Run the same distillation with and without T3S and check whether the sharp performance drop still appears exactly when the token-confidence bifurcation occurs.
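
One way to make this check concrete is sketched below. It flags a run as showing the crash-then-recover pattern when the loss never increases across checkpoints while the evaluation score has its minimum at an interior checkpoint. The helper names and the curves are made up for illustration and are not results from the paper.

```python
def is_monotone_decreasing(values):
    """True if no value is larger than the one before it."""
    return all(b <= a for a, b in zip(values, values[1:]))


def find_shock(losses, scores):
    """Return the checkpoint index of an interior score minimum that occurs
    while the loss keeps falling, or None if no such dip exists."""
    if len(losses) != len(scores) or len(scores) < 3:
        return None
    if not is_monotone_decreasing(losses):
        return None
    i = min(range(len(scores)), key=lambda k: scores[k])
    if i == 0 or i == len(scores) - 1:
        return None
    return i


# Made-up curves for illustration only.
losses = [0.60, 0.50, 0.42, 0.37, 0.33]
sft_scores = [60.0, 45.0, 30.0, 50.0, 62.0]
t3s_scores = [60.0, 61.0, 63.0, 64.0, 66.0]

sft_shock = find_shock(losses, sft_scores)  # interior dip at checkpoint 2
t3s_shock = find_shock(losses, t3s_scores)  # no interior dip
```

Comparing the flagged checkpoint against the checkpoint where token confidence bifurcates would then be the actual test; this sketch only covers the detection half.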

Figures

Figures reproduced from arXiv: 2601.10348 by Guoshan Lu, Hao Chen, Jiaqi Hu, Junbo Zhao, Junlin Zhou, Wentao Ye, Yihong Zhuang, Zenan Huang, Zeyu Qin, Zhanming Shen.

**Figure 1.** Imitation Shock under DeepSeek-R1 distillation on BOBA-200. Although training loss decreases monotonically (a), training-set answer accuracy and multiple benchmarks (AIME24/25, MMLU-Pro) drop sharply to a shared minimum stage and then recover, revealing an Imitation Shock.

**Figure 2.** Imitation Shock is universal across settings. We observe the same "crash then recover" trajectory across (a) different teachers, (b) different datasets, (c) larger-scale datasets, (d) different student backbones, and (e) different training domains. See Appendix B for detailed setups and metrics.

**Figure 3.** Word clouds at the Imitation Bottleneck (identified by training-set accuracy). Token sizes are proportional to the magnitude of confidence change relative to the base model. Accompanying table, tokens with confidence drop (%): BOBA-200, 68.51; S1K-200, 53.03.

**Figure 5.** Checkpoint-wise one-step intervention on Imitation-Anchor Tokens. Panel (a) reports $\Delta L_{\text{other}}$ after one gradient step that optimizes only anchor tokens at each checkpoint. Panel (b) shows the corresponding anchor-token loss $L_{\text{anchor}}$ at that checkpoint.

**Figure 6.** Workflow of our method. From the adjacent method text: let $D = \{(x, y)\}$ be a dataset of inputs $x$ (e.g., prompts) and target sequences $y = (y_1, \dots, y_T)$ of length $T$. An autoregressive (AR) model factorizes $p_\theta(y \mid x) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{<t}, x)$ (Eq. 1).

**Figure 7.** Training dynamics comparison between T3S and SFT on BOBA-200.

**Figure 8.** Imitation Shock under DeepSeek-R1 distillation on S1K-200. The same sharp performance drop and recovery pattern observed on BOBA-200 also appears on S1K-200.

**Figure 9.** Imitation Shock persists under an in-distribution teacher (QWQ) on BOBA-200.

**Figure 10.** Token-gradient sketch visualization does not cleanly separate anchors. We project token-level LoRA gradient sketches into 2D (t-SNE/PCA-style embedding) and color tokens by $\operatorname{sign}(\Delta c_t)$ (anchor vs. non-anchor). The two groups heavily overlap, indicating that local gradient geometry alone is insufficient for reliably identifying imitation anchors.

**Figure 11.** Base-model confidence is insufficient to separate anchors. We plot token groups using their base-model confidence and color by the sign of trajectory-based confidence change $\Delta c_t$ (anchor if $\Delta c_t > 0$, otherwise non-anchor). The two groups overlap substantially, showing that the anchor/non-anchor partition cannot be recovered by initial confidence alone.

**Figure 12.** Imitation Shock persists under initial-confidence masking. Even when masking the same token fraction as T3S, selecting tokens by base-model confidence (highest or lowest) still exhibits the characteristic crash-then-recover pattern in AIME25 across checkpoints. From the adjacent text: the percentage loss change is $\Delta_{S \to T} = 100 \times \frac{\ell_T(\theta_S) - \ell_T(\theta_0)}{\ell_T(\theta_0)}$, where $\theta_S$ is the checkpoint obtained by training only on $S$.
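
The percentage loss change quoted next to Figure 12, 100 × (ℓ_T(θ_S) − ℓ_T(θ_0)) / ℓ_T(θ_0), is a relative change of the loss on token group T after training only on group S. A small sketch of the arithmetic follows; the function name and the numbers are hypothetical.

```python
def percent_loss_change(loss_after, loss_base):
    """Relative change (in percent) of the loss on token group T.

    loss_after: loss on T at the checkpoint trained only on group S
    loss_base:  loss on T at the base checkpoint (theta_0)
    """
    if loss_base == 0:
        raise ValueError("base loss must be non-zero")
    return 100.0 * (loss_after - loss_base) / loss_base


# Hypothetical losses: training on S lowered the loss on T from 4.0 to 3.0.
change = percent_loss_change(3.0, 4.0)  # -25.0
```

By construction, a negative value means the loss on T went down after training on S, and a positive value means it went up.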

Original abstract

Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. And the characteristic that these two types of tokens cannot coexist is the root cause of the failure in continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3S yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all of 16B-scale no-think models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The token bifurcation observation during distillation is a useful empirical note, but the claim that coexistence failure is the root cause remains correlational without isolating tests.

The main point is that this paper spots a training bottleneck in continual distillation of reasoning models: performance drops sharply even while loss keeps falling, and they trace it to token confidence splitting into early-anchoring imitation tokens and suppressed yet-to-learn ones. They argue these two groups cannot coexist and that this incompatibility drives the failure. From there they introduce T3S, which rebuilds the training objective at the token level to clear space for the lagging tokens.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Training-Trajectory-Aware Token Selection (T3S) to address failures in continual distillation of reasoning capabilities from large teacher models to smaller students. It reports an empirical observation that loss decreases monotonically while performance metrics drop sharply at a bottleneck before recovering, attributes this to a bifurcation where some tokens become Imitation-Anchor Tokens with steadily increasing confidence while others (yet-to-learn tokens) have suppressed confidence, and claims that the inability of these two token types to coexist is the root cause. T3S reconstructs the token-level training objective to clear the optimization path for yet-to-learn tokens. Experiments in both autoregressive and diffusion LLM settings report large gains, including Qwen3-8B surpassing DeepSeek-R1 on reasoning benchmarks with only hundreds of examples, Qwen3-32B approaching Qwen3-235B, and T3-trained LLaDA-2.0-Mini exceeding its AR baseline to achieve SOTA among 16B-scale no-think models.

Significance. If the causal mechanism is confirmed and the gains prove robust across settings, the work would be significant for efficient reasoning distillation, offering a trajectory-aware alternative to standard imitation that could reduce data and compute requirements for deploying strong reasoning models. The empirical identification of the bifurcation phenomenon during training is a useful observation that merits further study. The application to both AR and dLLM architectures is a strength, as is the focus on the frontier regime where the student already possesses strong reasoning ability.

major comments (2)

[Abstract and the section describing the token-level mechanism] The central claim that 'the characteristic that these two types of tokens cannot coexist is the root cause of the failure in continual distillation' (abstract) is not supported by a direct causal test. The reported correlation between token bifurcation and the performance bottleneck does not isolate coexistence incompatibility from alternative explanations such as optimization dynamics, data ordering, or capacity limits. A controlled intervention (e.g., a modified loss or sampling procedure that forces coexistence while holding other factors fixed) is needed to establish that clearing this specific path, rather than generic token reweighting, produces the observed recovery and T3S gains.
[Experimental results section] The performance claims (e.g., Qwen3-8B surpassing DeepSeek-R1 and T3-trained LLaDA-2.0-Mini achieving SOTA among 16B-scale no-think models) require fuller experimental details on baselines, number of runs, variance, and exact evaluation protocols to be load-bearing. Without these, it is difficult to assess whether the gains are attributable to T3S specifically addressing the bifurcation or to other factors in the token selection.
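
One control in the spirit of the first major comment would mask the same number of tokens as the anchor mask but choose them at random, so that any gain specific to trajectory-based selection can be separated from the effect of masking alone. The sketch below is a hypothetical baseline, not something this page attributes to the paper, and the function name is illustrative.

```python
import random


def matched_random_mask(is_anchor, seed=0):
    """Random mask with the same number of masked tokens as is_anchor."""
    n = len(is_anchor)
    k = sum(is_anchor)              # number of tokens the anchor mask removes
    rng = random.Random(seed)       # seeded for reproducibility
    chosen = set(rng.sample(range(n), k))
    return [i in chosen for i in range(n)]


# Hypothetical anchor mask over eight target tokens.
anchor = [True, False, True, False, False, True, False, False]
control = matched_random_mask(anchor, seed=0)
```

Training once with `anchor` and once with `control`, with data and hyperparameters held fixed, would compare the two masks at an identical masked fraction.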

minor comments (2)

[Method section] Clarify the precise definition and detection criteria for Imitation-Anchor Tokens versus yet-to-learn tokens, ideally with a formal equation or algorithm box.
[Related work] Add a brief discussion of how T3S relates to prior token-level importance or curriculum learning methods in the related work section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. Their comments on establishing stronger causal evidence for the token bifurcation mechanism and on improving experimental reporting are well taken. We address each major comment below and indicate the revisions we will make.

Referee: [Abstract and the section describing the token-level mechanism] The central claim that 'the characteristic that these two types of tokens cannot coexist is the root cause of the failure in continual distillation' (abstract) is not supported by a direct causal test. The reported correlation between token bifurcation and the performance bottleneck does not isolate coexistence incompatibility from alternative explanations such as optimization dynamics, data ordering, or capacity limits. A controlled intervention (e.g., a modified loss or sampling procedure that forces coexistence while holding other factors fixed) is needed to establish that clearing this specific path, rather than generic token reweighting, produces the observed recovery and T3S gains.

Authors: We appreciate the referee's emphasis on the need for a direct causal test. T3S was explicitly developed as the controlled intervention that reconstructs the token-level training objective to clear the optimization path for yet-to-learn tokens, thereby mitigating the suppression caused by Imitation-Anchor Tokens. This change is applied while holding the training data, model architecture, and optimization hyperparameters fixed, allowing us to attribute the observed recovery and downstream gains to addressing the specific incompatibility rather than generic reweighting. To further isolate this effect, we will add a targeted comparison in the revised manuscript between T3S and standard token reweighting baselines.
Revision: partial

Referee: [Experimental results section] The performance claims (e.g., Qwen3-8B surpassing DeepSeek-R1 and T3-trained LLaDA-2.0-Mini achieving SOTA among 16B-scale no-think models) require fuller experimental details on baselines, number of runs, variance, and exact evaluation protocols to be load-bearing. Without these, it is difficult to assess whether the gains are attributable to T3S specifically addressing the bifurcation or to other factors in the token selection.

Authors: We agree that fuller experimental details are essential for reproducibility and for confirming that gains stem from T3S. In the revised manuscript we will expand the experimental results section to include complete descriptions of all baselines, the number of independent runs, variance or standard deviation measures, and the precise evaluation protocols for each benchmark.
Revision: yes
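
The run-level reporting requested here amounts to a mean and a spread per benchmark. A minimal sketch using Python's standard library follows; the scores are invented for illustration and the function name is hypothetical.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over independent runs."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


# Invented benchmark scores from three independent runs.
mean, std = summarize_runs([70.0, 72.0, 74.0])
```

Reporting both numbers for each method and benchmark lets a reader judge whether a gap between two methods exceeds run-to-run variation.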

Circularity Check

0 steps flagged

No significant circularity; derivation grounded in empirical observations

The paper identifies a training bottleneck and token-level confidence bifurcation through direct observation of loss curves and per-token dynamics during continual distillation. The statement that non-coexistence of Imitation-Anchor Tokens and yet-to-learn tokens is the root cause is presented as an empirical finding from these trajectories rather than a definitional equivalence or fitted parameter. T3S is then introduced as a token-level objective modification motivated by this observation, with gains demonstrated on external benchmarks (Qwen3-8B vs DeepSeek-R1, LLaDA-2.0-Mini). No equation reduces a claimed result to its inputs by construction, no self-citation chain carries the uniqueness or derivation load, and the work remains self-contained against the reported empirical evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical observation of a training bottleneck and the assumption that the described token mechanism is causal; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: The characteristic that these two types of tokens cannot coexist is the root cause of the failure in continual distillation.
Directly stated in the abstract as the explanation for distillation failure.

pith-pipeline@v0.9.0 · 5788 in / 1228 out tokens · 47557 ms · 2026-05-22T11:36:40.307975+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "the characteristic that these two types of tokens cannot coexist is the root cause of the failure in continual distillation... T3S... masking Imitation-Anchor Tokens"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "token confidence bifurcates into steadily increasing Imitation-Anchor Tokens... suppressive effect on other yet-to-learn tokens"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FedSDR: Federated Self-Distillation with Rectification
cs.LG 2026-05 unverdicted novelty 5.0

FedSDR augments federated self-distillation with dual LoRA streams (local smoothing and global rectification) to produce globally aligned, factually faithful models under statistical heterogeneity.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 1 Pith paper · 15 internal anchors

[1] Arriola, M., Gokaslan, A., Chiu, J. T., Yang, Z., Qi, Z., Han, J., Sahoo, S. S., and Kuleshov, V. Block diffusion: Interpolating between autoregressive and diffusion language models. arXiv preprint arXiv:2503.09573.

[2] Bie, T., Cao, M., Chen, K., Du, L., Gong, M., Gong, Z., Gu, Y., Hu, J., Huang, Z., Lan, Z., et al. Llada2.0: Scaling up diffusion language models to 100b. arXiv preprint arXiv:2512.15745.

[3] Chen, L., Li, S., Yan, J., Wang, H., Gunaratna, K., Yadav, V., Tang, Z., Srinivasan, V., Zhou, T., Huang, H., et al. Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701.

[4] Chen, P., Li, X., Li, Z., Chen, X., and Lin, T. Spectral policy optimization: Coloring your incorrect reasoning in grpo. arXiv preprint arXiv:2505.11595.

[5] URL https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B. Du, X., Yao, Y., Ma, K., Wang, B., Zheng, T., Zhu, K., Liu, M., Liang, Y., Jin, X., Wei, Z., et al. Supergpqa: Scaling llm evaluation across 285 graduate disciplines. arXiv preprint arXiv:2502.14739.
[6] Guha, E., Marten, R., Keh, S., Raoof, N., Smyrnis, G., Bansal, H., Nezhurina, M., Mercat, J., Vu, T., Sprague, Z., et al. Openthoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178.

[7] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

[8] Hammoud, H. A. A. K., Alhamoud, K., Hammoud, A., Bou-Zeid, E., Ghassemi, M., and Ghanem, B. Train long, think short: Curriculum learning for efficient reasoning. arXiv preprint arXiv:2508.08940.

[9] URL https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data. Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. Openai o1 system card. arXiv preprint arXiv:2412.16720.

[10] Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
[11] Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA.

[12] Lin, Z., Gou, Z., Gong, Y., Liu, X., Shen, Y., Xu, R., Lin, C., Yang, Y., Jiao, J., Duan, N., et al. Rho-1: Not all tokens are what you need. arXiv preprint arXiv:2404.07965, 2024a. Lin, Z., Liang, T., Xu, J., Lin, Q., Wang, X., Luo, R., Shi, C., Li, S., Yang, Y., and Tu, Z. Critical tokens matter: Token-level contrastive estimation enhances llm's reas...

[13] Luo, Y., Song, Y., Zhang, X., Liu, J., Wang, W., Chen, G., Su, W., and Zheng, B. Deconstructing long chain-of-thought: A structured reasoning optimization framework for long cot distillation. arXiv preprint arXiv:2503.16385.

[14] Meng, F., Shao, W., Luo, L., Wang, Y., Chen, Y., Lu, Q., Yang, Y., Yang, T., Zhang, K., Qiao, Y., et al. Phybench: A physical commonsense benchmark for evaluating text-to-image models. arXiv preprint arXiv:2406.11802.

[15] Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393.
[16] Ni, J., Liu, Q., Dou, L., Du, C., Wang, Z., Yan, H., Pang, T., and Shieh, M. Q. Diffusion language models are super data learners. arXiv preprint arXiv:2511.03276.

[17] Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. arXiv preprint arXiv:2502.09992.

[18] Qin, Z., Dong, Q., Zhang, X., Dong, L., Huang, X., Yang, Z., Khademi, M., Zhang, D., Awadalla, H. H., Fung, Y. R., et al. Scaling laws of synthetic data for language models. arXiv preprint arXiv:2503.19551.

[19] Shen, Z., Chen, H., Tang, Y., Zhu, S., Ye, W., Hu, X., Wang, H., Chen, G., and Zhao, J. Cycle-instruct: Fully seed-free instruction tuning via dual self-training and cycle consistency. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 5123–5137, 2025a. Shen, Z., Qin, Z., Huang, Z., Chen, H., Hu, J., Zhuang, Y ...

[20] Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., Von Werra, L., Fourrier, C., Habib, N., et al. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944.
[21] URL https://huggingface.co/datasets/typhoon-ai/rl-code-math-v5. Vassoyan, J., Beau, N., and Plaud, R. Ignore the kl penalty! boosting exploration on critical tokens to enhance rl fine-tuning. arXiv preprint arXiv:2502.06533.

[22] Wang, S., Yu, L., Gao, C., Zheng, C., Liu, S., Lu, R., Dang, K., Chen, X., Yang, J., Zhang, Z., et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939.

[23] Wu, X., Jiang, X., Li, H., Zhai, J., Liu, D., Hao, Q., Liu, H., Yang, Z., Xie, J., Gu, N., et al. Beyond scaling law: A data-efficient distillation framework for reasoning. arXiv preprint arXiv:2508.09883.

[24] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.

[25] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
[26] Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025a. Ye, Y., Huang, Z., Xiao, Y., Chern, E., Xia, S., and Liu, P. Limo: Less is more for reasoning. arXiv preprint arXiv:2502.03387, 2025b. Zhang, Q., Zhang,...

[27] Zhong, H., Shan, Z., Feng, G., Xiong, W., Cheng, X., Zhao, L., He, D., Bian, J., and Wang, L. Dpo meets ppo: Reinforced token optimization for rlhf. In Forty-second International Conference on Machine Learning. Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al. Lima: Less is more for alignment. Advances in Ne... (linked work page: Instruction-Following Evaluation for Large Language Models)
[28] as the primary performance metric; for the code reasoning setting, we use LiveCodeBench (Jain et al., 2024). (1) Different teacher model. We distill Qwen3-8B from QWQ, a widely regarded in-distribution teacher for small Qwen-series models (Shen et al., 2025b; Wu et al., 2025), using the same BOBA-200 prompts and evaluation protocol. We observe the same performa...

[29] (math reasoning) with the same training configuration. Imitation Shock persists (Figure 2(b)), indicating the phenomenon is not specific to a single dataset. (3) Larger dataset scale. To go beyond the efficient-distillation regime, we distill Qwen3-8B from DeepSeek-R1 on OpenThought3 (Guha et al., 2025), which has ∼65K samples. Even though training does not...

[30] and distill Qwen3-8B from QWQ. Using LiveCodeBench as the metric, we again observe the same crash-and-recovery dynamics (Figure 2(e)), suggesting that Imitation Shock is not confined to math reasoning but extends to code reasoning domains as well. C. Additional Distillation Dynamics. [Plot: training loss vs. step] ...

[31] and S1K (Muennighoff et al., 2025). Following previous work (Shen et al., 2025b), from each dataset, we sample 200 prompts and

| Dataset | Teacher (Method) | PHYBENCH | GPQA-D | LCB | MMLU-Pro | ∆Avg |
| --- | --- | --- | --- | --- | --- | --- |
| – | Base | 20.47 | 57.77 | 55.76 | 71.42 | – |
| BOBA-200 | R1 (SFT) | 21.79 | 56.34 | 53.41 | 69.92 | ↓ -0.99 |
| BOBA-200 | R1 (T3S) | 23.95 | 59.47 | 59.32 | 72.65 | ↑ 2.49 |
| BOBA-200 | QWQ (SFT) | 1... | | | | |

[32] Forgetting mitigation of T3S. We evaluate on SUPER-GPQA, CEval, and IFEval. ∆Avg is computed as the average score change relative to the Base model across these retention benchmarks (higher is better). denote the resulting subsets as BOBA-200 and S1K-200. For BOBA-200, we directly use the default 200 problems released with the BOBA benchmark. For S1K-200, we ...

[33] and AIME25 (Math-AI, 2025), with each score reported as the average over 16 runs. F. Why Static Signals are Insufficient for Targeting. This appendix section studies whether the anchor/non-anchor partition discovered by T3S can be recovered from static signals available at a single checkpoint, without using train...

[1] [1]

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Arriola, M., Gokaslan, A., Chiu, J. T., Yang, Z., Qi, Z., Han, J., Sahoo, S. S., and Kuleshov, V . Block diffusion: Inter- polating between autoregressive and diffusion language models.arXiv preprint arXiv:2503.09573,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Bie, T., Cao, M., Chen, K., Du, L., Gong, M., Gong, Z., Gu, Y ., Hu, J., Huang, Z., Lan, Z., et al. Llada2. 0: Scaling up diffusion language models to 100b.arXiv preprint arXiv:2512.15745,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Alpagasus: Training a better alpaca with fewer data.arXiv preprint arXiv:2307.08701,

Chen, L., Li, S., Yan, J., Wang, H., Gunaratna, K., Yadav, V ., Tang, Z., Srinivasan, V ., Zhou, T., Huang, H., et al. Alpagasus: Training a better alpaca with fewer data.arXiv preprint arXiv:2307.08701,

work page arXiv

[4] Chen, P., Li, X., Li, Z., Chen, X., and Lin, T. Spectral policy optimization: Coloring your incorrect reasoning in grpo. arXiv preprint arXiv:2505.11595.

URL https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B.

[5] Du, X., Yao, Y., Ma, K., Wang, B., Zheng, T., Zhu, K., Liu, M., Liang, Y., Jin, X., Wei, Z., et al. Supergpqa: Scaling llm evaluation across 285 graduate disciplines. arXiv preprint arXiv:2502.14739.

[6] Guha, E., Marten, R., Keh, S., Raoof, N., Smyrnis, G., Bansal, H., Nezhurina, M., Mercat, J., Vu, T., Sprague, Z., et al. Openthoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178.

[7] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

[8] Hammoud, H. A. A. K., Alhamoud, K., Hammoud, A., Bou-Zeid, E., Ghassemi, M., and Ghanem, B. Train long, think short: Curriculum learning for efficient reasoning. arXiv preprint arXiv:2508.08940.

URL https://huggingface.co/datasets/inclusionAI/AReaL-boba-Data.

[9] Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. Openai o1 system card. arXiv preprint arXiv:2412.16720.

[10] Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.

[11] Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000.

[12] Lin, Z., Gou, Z., Gong, Y., Liu, X., Shen, Y., Xu, R., Lin, C., Yang, Y., Jiao, J., Duan, N., et al. Rho-1: Not all tokens are what you need. arXiv preprint arXiv:2404.07965, 2024a.

Lin, Z., Liang, T., Xu, J., Lin, Q., Wang, X., Luo, R., Shi, C., Li, S., Yang, Y., and Tu, Z. Critical tokens matter: Token-level contrastive estimation enhances llm’s reas...

[13] Luo, Y., Song, Y., Zhang, X., Liu, J., Wang, W., Chen, G., Su, W., and Zheng, B. Deconstructing long chain-of-thought: A structured reasoning optimization framework for long cot distillation. arXiv preprint arXiv:2503.16385.

[14] Meng, F., Shao, W., Luo, L., Wang, Y., Chen, Y., Lu, Q., Yang, Y., Yang, T., Zhang, K., Qiao, Y., et al. Phybench: A physical commonsense benchmark for evaluating text-to-image models. arXiv preprint arXiv:2406.11802.

[15] Muennighoff, N., Yang, Z., Shi, W., Li, X. L., Fei-Fei, L., Hajishirzi, H., Zettlemoyer, L., Liang, P., Candès, E., and Hashimoto, T. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393.

[16] Ni, J., Liu, Q., Dou, L., Du, C., Wang, Z., Yan, H., Pang, T., and Shieh, M. Q. Diffusion language models are super data learners. arXiv preprint arXiv:2511.03276.

[17] Nie, S., Zhu, F., You, Z., Zhang, X., Ou, J., Hu, J., Zhou, J., Lin, Y., Wen, J.-R., and Li, C. Large language diffusion models. arXiv preprint arXiv:2502.09992.

[18] Qin, Z., Dong, Q., Zhang, X., Dong, L., Huang, X., Yang, Z., Khademi, M., Zhang, D., Awadalla, H. H., Fung, Y. R., et al. Scaling laws of synthetic data for language models. arXiv preprint arXiv:2503.19551.

[19] Shen, Z., Chen, H., Tang, Y., Zhu, S., Ye, W., Hu, X., Wang, H., Chen, G., and Zhao, J. Cycle-instruct: Fully seed-free instruction tuning via dual self-training and cycle consistency. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 5123–5137, 2025a.

Shen, Z., Qin, Z., Huang, Z., Chen, H., Hu, J., Zhuang, Y ...

[20] Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., Von Werra, L., Fourrier, C., Habib, N., et al. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944.

URL https://huggingface.co/datasets/typhoon-ai/rl-code-math-v5.

[21] Vassoyan, J., Beau, N., and Plaud, R. Ignore the kl penalty! boosting exploration on critical tokens to enhance rl fine-tuning. arXiv preprint arXiv:2502.06533.

[22] Wang, S., Yu, L., Gao, C., Zheng, C., Liu, S., Lu, R., Dang, K., Chen, X., Yang, J., Zhang, Z., et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939.

[23] Wu, X., Jiang, X., Li, H., Zhai, J., Liu, D., Hao, Q., Liu, H., Yang, Z., Xie, J., Gu, N., et al. Beyond scaling law: A data-efficient distillation framework for reasoning. arXiv preprint arXiv:2508.09883.

[24] Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.

[25] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[26] Ye, J., Xie, Z., Zheng, L., Gao, J., Wu, Z., Jiang, X., Li, Z., and Kong, L. Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025a.

Ye, Y., Huang, Z., Xiao, Y., Chern, E., Xia, S., and Liu, P. Limo: Less is more for reasoning. arXiv preprint arXiv:2502.03387, 2025b.

Zhang, Q., Zhang,...

[27] Instruction-Following Evaluation for Large Language Models

Zhong, H., Shan, Z., Feng, G., Xiong, W., Cheng, X., Zhao, L., He, D., Bian, J., and Wang, L. Dpo meets ppo: Reinforced token optimization for rlhf. In Forty-second International Conference on Machine Learning.

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., et al. Lima: Less is more for alignment. Advances in Ne...

[28] as the primary performance metric; for the code reasoning setting, we use LiveCodeBench (Jain et al., 2024). (1) Different teacher model. We distill Qwen3-8B from QWQ, a widely regarded in-distribution teacher for small Qwen-series models (Shen et al., 2025b; Wu et al., 2025), using the same BOBA-200 prompts and evaluation protocol. We observe the same performa...

[29] (math reasoning) with the same training configuration. Imitation Shock persists (Figure 2(b)), indicating the phenomenon is not specific to a single dataset. (3) Larger dataset scale. To go beyond the efficient-distillation regime, we distill Qwen3-8B from DeepSeek-R1 on OpenThought3 (Guha et al., 2025), which has ∼65K samples. Even though training does not...

[30] and distill Qwen3-8B from QWQ. Using LiveCodeBench as the metric, we again observe the same crash-and-recovery dynamics (Figure 2(e)), suggesting that Imitation Shock is not confined to math reasoning but extends to code reasoning domains as well.

C. Additional Distillation Dynamics

[Figure: training loss versus step; x-axis: Step, y-axis: Loss; panel title "Training Loss (..."]
