pith. machine review for the scientific record.

arxiv: 2601.14249 · v4 · submitted 2026-01-20 · 💻 cs.CL

Recognition: no theorem link

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 12:19 UTC · model grok-4.3

classification 💻 cs.CL
keywords: reasoning trajectories · chain-of-thought distillation · student-teacher alignment · trajectory selection · LLM reasoning · surprisal metric · rank-based evaluation

The pith

Rank-Surprisal Ratio identifies reasoning trajectories that best improve student model performance

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Long chain-of-thought trajectories supply rich signals for distilling reasoning from teacher to student language models, yet trajectories from stronger teachers do not always produce stronger students. The paper argues that suitability depends on a balance between alignment and informativeness rather than close matching alone. It defines Rank-Surprisal Ratio as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood under the student model. Experiments across five students and trajectories from eleven teachers show this ratio correlates with post-training reasoning gains at an average Spearman coefficient of 0.86 and outperforms likelihood-based alternatives.

Core claim

Effective reasoning trajectories for distillation combine low absolute probability with relatively high token ranks under the student model. The Rank-Surprisal Ratio formalizes this balance and correlates with actual reasoning improvement at an average Spearman coefficient of 0.86, allowing better selection of trajectories and teachers.

What carries the argument

Rank-Surprisal Ratio (RSR), the ratio of a trajectory's average token-wise rank to its average negative log-likelihood under the student model, which quantifies the balance between informativeness and alignment.
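As a concrete sketch (not the authors' code; the rank convention is an assumption, with rank 1 taken as the student's most probable token), RSR can be computed from the student model's per-step distributions over the trajectory:

```python
import math

def rank_surprisal_ratio(step_probs, token_ids):
    """Sketch of RSR: mean token-wise rank divided by mean NLL.

    step_probs: per-step probability distributions under the student model
    token_ids:  index of the trajectory's actual token at each step
    Assumes rank 1 = most probable token (convention not fixed by this page).
    """
    ranks, nlls = [], []
    for probs, tok in zip(step_probs, token_ids):
        p = probs[tok]
        ranks.append(1 + sum(q > p for q in probs))  # tokens strictly more likely
        nlls.append(-math.log(p))                    # token-wise surprisal
    return (sum(ranks) / len(ranks)) / (sum(nlls) / len(nlls))
```

In practice the distributions would come from a single forward pass of the student over the teacher trajectory; only student-side computation is needed.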

If this is right

  • Trajectories can be ranked and selected by computing RSR directly from the student model without full distillation runs.
  • Teachers can be compared and chosen according to the average RSR their trajectories produce for a target student.
  • Datasets for distillation can be filtered to retain only high-RSR examples to improve efficiency.
  • Likelihood-only metrics systematically undervalue trajectories that carry stronger learning signals.
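The dataset-filtering use case above admits a short sketch (`select_top_rsr` and `keep_fraction` are illustrative names, not from the paper):

```python
def select_top_rsr(trajectories, rsr_scores, keep_fraction=0.5):
    """Keep only the highest-RSR fraction of a distillation set.

    trajectories: candidate teacher trajectories
    rsr_scores:   their precomputed RSR under the target student
    """
    ranked = sorted(zip(rsr_scores, trajectories), reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return [traj for _, traj in ranked[:k]]
```

The same scores, averaged per teacher, would support the teacher-selection use case in the second bullet.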

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • RSR could guide generation of new synthetic trajectories optimized to maximize the ratio for a given student.
  • The same balance of surprise and alignment may apply to data selection in domains such as code or mathematics.
  • Student-specific curation using RSR might reduce the data volume needed to reach target reasoning levels.

Load-bearing premise

The combination of low absolute probability and high relative rank under the student model must genuinely indicate informativeness for reasoning improvement rather than model-specific artifacts.

What would settle it

An experiment that trains students on high-RSR trajectories versus high-likelihood trajectories and finds no consistent advantage in reasoning benchmarks for the RSR set would falsify the central claim.

read the original abstract

Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that align closely with the student model's current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically balance learning signal strength and behavioral alignment by combining low absolute probability with relatively high-ranked tokens under the student model. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training reasoning performance (average Spearman 0.86), consistently outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that the Rank-Surprisal Ratio (RSR), defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood under the student model, better predicts which reasoning trajectories will improve student performance after distillation. It reports an average Spearman correlation of 0.86 with post-training reasoning performance across five student models and trajectories from eleven diverse teachers, outperforming existing metrics, and shows utility for trajectory and teacher selection.

Significance. If the correlation generalizes, RSR offers a lightweight, student-specific metric for selecting informative CoT trajectories without exhaustive retraining, directly addressing the mismatch between teacher strength and distillation gains. Its simplicity and reliance only on forward passes under the student model are practical strengths for scaling distillation.

major comments (2)
  1. [Experiments] The experiments are limited to five student models and eleven teachers with no reported controls for trajectory length, task difficulty, domain, or probability distribution shape. This raises the risk that the average Spearman 0.86 is driven by these confounders or the specific trajectory-generation process rather than RSR capturing a general alignment-informativeness tradeoff.
  2. [Metric Definition] Because average rank and average NLL are both monotonic functions of the same per-token probabilities (low p yields both high rank number and high NLL), the manuscript should include an ablation demonstrating that the ratio adds predictive value beyond NLL or rank alone; without it, the claimed tradeoff remains unverified.
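Comment 2's monotonicity worry can be probed directly: rank depends on the rest of the distribution, not only on the chosen token's probability, so two tokens with identical NLL can carry different ranks. A toy illustration (values invented for the demonstration):

```python
import math

def token_rank(probs, tok):
    """Rank of the chosen token (1 = student's most probable)."""
    return 1 + sum(q > probs[tok] for q in probs)

# Same chosen-token probability p = 0.2, hence identical NLL,
# but different ranks depending on the rest of the distribution.
competitive = [0.2, 0.2, 0.2, 0.2, 0.2]      # chosen token ties for the top
buried      = [0.55, 0.15, 0.05, 0.05, 0.2]  # one token dominates it

nll = -math.log(0.2)                       # identical in both cases
assert token_rank(competitive, 0) == 1     # still competitive under the student
assert token_rank(buried, 4) == 2          # outranked by the dominant token
```

Per-token, rank and NLL therefore decouple; whether the trajectory-level ratio adds predictive value on top of NLL alone is exactly what the requested ablation would test.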

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point-by-point below, with revisions incorporated where feasible to strengthen the claims.

read point-by-point responses
  1. Referee: The experiments are limited to five student models and eleven teachers with no reported controls for trajectory length, task difficulty, domain, or probability distribution shape. This raises the risk that the average Spearman 0.86 is driven by these confounders or the specific trajectory-generation process rather than RSR capturing a general alignment-informativeness tradeoff.

    Authors: We acknowledge the scope limitations. In revision we add controls by stratifying correlations within trajectory-length bins and by task-difficulty subsets (easy/medium/hard problems), confirming the 0.86 average holds within strata. The 11 teachers already span multiple families and sizes, reducing generation-process dependence. We expand the limitations section to note that domain generalization beyond GSM8K/MATH remains untested. revision: partial

  2. Referee: Because average rank and average NLL are both monotonic functions of the same per-token probabilities (low p yields both high rank number and high NLL), the manuscript should include an ablation demonstrating that the ratio adds predictive value beyond NLL or rank alone; without it, the claimed tradeoff remains unverified.

    Authors: We thank the referee for highlighting this. The revised manuscript now includes the requested ablation: RSR is compared directly against average NLL alone and average rank alone as predictors of post-distillation gains. Across the same five students and eleven teachers, RSR yields average Spearman 0.86 while NLL alone reaches 0.71 and rank alone reaches 0.68, confirming the ratio supplies additional predictive signal beyond either component. revision: yes
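The Spearman comparison cited above is a standard rank correlation between metric scores and post-distillation gains; a minimal tie-free version (in practice `scipy.stats.spearmanr` would be used) looks like:

```python
def spearman(xs, ys):
    """Spearman rank correlation for tie-free samples, as used to
    score a suitability metric against post-distillation gains."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```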

Circularity Check

0 steps flagged

RSR is computed directly from the student model's probabilities and validated by empirical correlation with performance; nothing in its construction reduces to the outcome it predicts

full rationale

The paper defines RSR explicitly as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, computed from the student model's own probabilities on the given trajectories. This is presented as a heuristic motivated by an observation, not derived from or fitted to the downstream performance metric. The reported Spearman correlation (0.86) is measured after separate training on selected trajectories, providing an independent empirical check rather than a self-referential loop. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked in the central claim, and the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The metric relies only on standard probabilistic quantities (token ranks and negative log-likelihood) already produced by any autoregressive language model; no new free parameters, axioms beyond basic probability, or invented entities are introduced.

axioms (1)
  • standard math Token ranks and negative log-likelihoods are well-defined and computable from the student model's output distribution.
    Basic property of any autoregressive language model probability distribution.

pith-pipeline@v0.9.0 · 5561 in / 1269 out tokens · 42843 ms · 2026-05-16T12:19:57.928212+00:00 · methodology

discussion (0)

