pith. machine review for the scientific record.

arxiv: 2604.14164 · v2 · submitted 2026-03-23 · 💻 cs.CL

Recognition: 2 theorem links


How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 00:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords supervised fine-tuning · synthetic data · reasoning models · teacher-student cooperation · stylistic divergence · code generation · TESSY framework · model fine-tuning
0 comments

The pith

Interleaving teacher and student token generation creates synthetic data that improves reasoning-model fine-tuning where pure teacher data causes performance drops.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Using data from a stronger model for supervised fine-tuning often fails to boost, and can even harm, reasoning performance in models like Qwen3-8B. The authors trace the problem to stylistic divergence between the teacher's output and the student's natural distribution. Their TESSY framework has the two models alternate token generation: the student supplies stylistic tokens while the teacher supplies reasoning tokens. This produces training sequences that transfer advanced capabilities without the mismatch. Experiments on code generation show that TESSY data yields gains where pure teacher data produces losses.

Core claim

The paper claims that stylistic divergence between teacher-generated data and the student's distribution is a major cause of SFT failure in reasoning models. TESSY addresses this by interleaving the teacher and student models to generate style tokens from the student and non-style tokens from the teacher alternately, producing synthetic sequences that inherit the teacher's advanced reasoning capabilities while maintaining stylistic consistency with the student.

What carries the argument

TESSY, the Teacher-Student Cooperation Data Synthesis framework that interleaves teacher and student models to alternately generate style and non-style tokens.
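The interleaving loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper does not publish its decoding code, so the `is_style_position` router, the toy generators, and the token budget here are all hypothetical stand-ins.

```python
from typing import Callable, List

def interleaved_synthesis(
    prompt: List[str],
    teacher_next: Callable[[List[str]], str],       # stronger model proposes the next token
    student_next: Callable[[List[str]], str],       # model being fine-tuned proposes the next token
    is_style_position: Callable[[List[str]], bool], # hypothetical style/non-style router
    max_tokens: int = 50,
) -> List[str]:
    """Alternate token sources: the student supplies style tokens,
    the teacher supplies reasoning (non-style) tokens."""
    seq = list(prompt)
    for _ in range(max_tokens):
        source = student_next if is_style_position(seq) else teacher_next
        tok = source(seq)
        if tok == "<eos>":
            break
        seq.append(tok)
    return seq

# Toy stand-ins: the "teacher" walks through a fixed reasoning script,
# the "student" emits a connective whenever the router asks for style.
def toy_teacher(seq):
    # position-based indexing makes the script approximate once style
    # tokens are mixed in -- fine for a toy
    return ["compute", "x", "=", "2", "<eos>"][min(len(seq) - 1, 4)]

def toy_student(seq):
    return "so"

def toy_router(seq):
    return len(seq) % 3 == 0  # every third position is treated as "style"

out = interleaved_synthesis(["Q:"], toy_teacher, toy_student, toy_router, max_tokens=6)
print(out)  # → ['Q:', 'compute', 'x', 'so', '2']
```

In an actual pipeline the two callables would wrap real model forward passes and the router would be whatever style/non-style criterion the paper uses; only the alternation structure is the point here.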

If this is right

  • Fine-tuning Qwen3-8B on TESSY data improves performance by 11.25% on LiveCodeBench-Pro.
  • The same data yields 6.68% gains on OJBench.
  • Pure teacher-generated data causes drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench.
  • The method enables effective transfer of reasoning capabilities through style-consistent synthetic data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The alternation approach could generalize to non-code reasoning domains such as mathematics or general problem solving.
  • Controlling stylistic consistency may prove more important than content alignment alone in making synthetic data useful for fine-tuning.
  • Similar teacher-student cooperation might reduce error accumulation in long generated sequences for other tasks.
  • The framework suggests hybrid data pipelines could combine model strengths more reliably than single-model generation.

Load-bearing premise

Stylistic divergence is the primary cause of SFT performance drops, and alternating token generation preserves reasoning quality without introducing new inconsistencies.

What would settle it

Experiments showing that TESSY data produces no improvement or continued performance drops on LiveCodeBench-Pro and OJBench would falsify the claim that the alternation method effectively bridges the stylistic gap.

read the original abstract

A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even lead to a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher generated data and the distribution of student as a major factor impacting SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while maintaining stylistic consistency with the distribution of the student. In experiments on code generation using GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that supervised fine-tuning of reasoning models such as Qwen3-8B on synthetic data from stronger teachers (e.g., GPT-OSS-120B) often degrades performance due to stylistic divergence between teacher-generated sequences and the student's distribution. To address this, the authors introduce TESSY, a Teacher-Student Cooperation framework that interleaves teacher-generated non-style (reasoning) tokens with student-generated style tokens. On code generation benchmarks, pure teacher data causes drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, while TESSY yields gains of 11.25% and 6.68%.

Significance. If the central empirical results hold under further verification, the work provides a practical, low-overhead method for synthesizing SFT data that retains teacher-level reasoning while aligning stylistically with the student model. This could meaningfully improve fine-tuning outcomes for reasoning tasks where direct teacher data transfer fails, and it draws attention to distribution mismatch as a controllable factor in SFT.

major comments (2)
  1. [TESSY Framework] TESSY Framework (method description): The claim that alternating teacher non-style tokens and student style tokens preserves the teacher's advanced reasoning steps without introducing inconsistencies at token boundaries is load-bearing for attributing the 11.25% and 6.68% gains to stylistic consistency. No ablation is reported that isolates this mechanism (e.g., comparing interleaved sequences against full teacher traces, style-transferred data, or random interleaving) or measures logical coherence of the resulting traces.
  2. [Experiments] Experiments section: The reported benchmark improvements lack error bars, standard deviations across multiple runs, or explicit controls for data volume and diversity. Without these, it is difficult to determine whether the gains over the teacher-data baseline are statistically reliable or could arise from incidental factors unrelated to the style/non-style split.
minor comments (1)
  1. [Abstract and Method] The abstract and method sections use the terms 'style tokens' and 'non-style tokens' without an explicit operational definition or example of how the split is performed at inference time.
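One way the split could be operationalized is a surface-level lexicon of discourse markers. The sketch below is only a guess at such a rule — the paper, per this report, does not define one — and `STYLE_MARKERS` plus the whitespace tokenizer are illustrative assumptions, not the authors' method.

```python
import re

# Hypothetical heuristic: treat discourse connectives and hedges as
# "style" tokens and everything else as "non-style" (reasoning) tokens.
STYLE_MARKERS = {
    "okay", "so", "hmm", "wait", "alternatively", "well",
    "actually", "now", "then", "first", "next", "finally",
}

def split_style_tokens(text):
    """Partition whitespace tokens into (style, non_style) buckets."""
    tokens = re.findall(r"\S+", text)
    style, non_style = [], []
    for tok in tokens:
        bucket = style if tok.lower().strip(",.") in STYLE_MARKERS else non_style
        bucket.append(tok)
    return style, non_style

style, reasoning = split_style_tokens(
    "Okay, so the loop runs n times, then the total cost is O(n)."
)
print(style)      # → ['Okay,', 'so', 'then']
print(reasoning)  # the remaining reasoning tokens
```

A real split would more plausibly operate on model tokenizer IDs or on per-token disagreement between teacher and student distributions; a fixed lexicon is just the simplest testable version of the idea.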

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript accordingly to strengthen the empirical support for TESSY.

read point-by-point responses
  1. Referee: [TESSY Framework] TESSY Framework (method description): The claim that alternating teacher non-style tokens and student style tokens preserves the teacher's advanced reasoning steps without introducing inconsistencies at token boundaries is load-bearing for attributing the 11.25% and 6.68% gains to stylistic consistency. No ablation is reported that isolates this mechanism (e.g., comparing interleaved sequences against full teacher traces, style-transferred data, or random interleaving) or measures logical coherence of the resulting traces.

    Authors: We agree that additional ablations are needed to isolate the interleaving mechanism. In the revised manuscript we will add three controlled comparisons: (1) full teacher traces, (2) style-transferred teacher data (student model used only for style adaptation), and (3) random token interleaving. We will also report a coherence metric that checks logical step consistency across token boundaries using an automated verifier on the generated reasoning traces. revision: yes

  2. Referee: [Experiments] Experiments section: The reported benchmark improvements lack error bars, standard deviations across multiple runs, or explicit controls for data volume and diversity. Without these, it is difficult to determine whether the gains over the teacher-data baseline are statistically reliable or could arise from incidental factors unrelated to the style/non-style split.

    Authors: We acknowledge the importance of statistical controls. The revised experiments section will report mean performance and standard deviations over three independent runs with different random seeds. Data volume will be matched exactly (same total tokens) across all conditions, and we will add diversity statistics (unique n-gram coverage and entropy) to rule out incidental differences in data quantity or variety. revision: yes
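The statistical controls promised here are cheap to compute. A minimal sketch — the scores and the toy corpus below are invented for illustration, not taken from the paper:

```python
import statistics
from collections import Counter
from math import log2

def mean_std(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def ngram_diversity(tokens, n=2):
    """Unique n-gram coverage and Shannon entropy of the n-gram distribution."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    coverage = len(counts) / total
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return coverage, entropy

runs = [41.2, 42.0, 41.5]  # hypothetical benchmark scores from three seeds
m, s = mean_std(runs)
print(f"{m:.2f} ± {s:.2f}")  # → 41.57 ± 0.40

cov, ent = ngram_diversity("the cat sat on the mat".split(), n=2)
print(cov, ent)  # all five bigrams are unique: coverage 1.0, entropy log2(5)
```

Matching total token counts across conditions and reporting these two diversity numbers alongside mean ± std would address the referee's volume and diversity concern directly.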

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of inputs

full rationale

The paper presents TESSY as a data synthesis method that interleaves teacher and student token generation to address observed stylistic divergence. Performance deltas (drops for teacher-only data, gains for TESSY) are reported as direct experimental measurements on LiveCodeBench-Pro and OJBench. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or result to the inputs by construction. The framework description does not invoke uniqueness theorems, rename known patterns, or smuggle ansatzes. The derivation chain is therefore self-contained and externally falsifiable via the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that style matching drives SFT success and introduces the interleaving procedure as the core mechanism; no free parameters or new entities are explicitly quantified in the abstract.

axioms (1)
  • domain assumption Stylistic divergence between teacher-generated data and student model distribution is a major factor limiting SFT effectiveness for reasoning models.
    Directly stated in the abstract as the identified cause of performance drops.

pith-pipeline@v0.9.0 · 5525 in / 1163 out tokens · 39567 ms · 2026-05-15T00:14:06.591867+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 9 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations. 5

  2. [2]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025. 1, 3.2

  3. [3]

    Opencodereasoning: Advancing data distillation for competitive coding

    Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jo- celyn Huang, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning: Advancing data distillation for competitive coding.arXiv preprint arXiv:2504.01943, 2025. 1

  4. [4]

    Lora learns less and forgets less.Transactions on Machine Learning Research

    Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. Lora learns less and forgets less.Transactions on Machine Learning Research. 5

  5. [5]

    Medusa: Simple LLM inference acceleration framework with multiple decoding heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machin...

  6. [6]

    Retaining by doing: The role of on-policy data in mitigating forgetting.arXiv preprint arXiv:2510.18874, 2025

    Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on-policy data in mitigating forgetting.arXiv preprint arXiv:2510.18874, 2025. 1, 5

  7. [7]

    Your thoughts tell who you are: Characterize the reasoning patterns of lrms.arXiv preprint arXiv:2509.24147, 2025

    Yida Chen, Yuning Mao, Xianjun Yang, Suyu Ge, Shengjie Bi, Lijuan Liu, Saghar Hosseini, Liang Tan, Yixin Nie, and Shaoliang Nie. Your thoughts tell who you are: Characterize the reasoning patterns of lrms.arXiv preprint arXiv:2509.24147, 2025. 1, 5

  8. [8]

    Opencompass: A universal evaluation platform for foundation models

    OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023. 3.2

  9. [9]

    Xtuner: A toolkit for efficiently fine-tuning llm

    XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm. https://github.com/ InternLM/xtuner, 2023. 3.1

  10. [10]

    Think before you speak: Training language models with pause tokens

    Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,

  11. [11]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhu- rina, Jean Mercat, Trung Vu, Zayne Sprague, et al. Openthoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025. 1, 3.2

  12. [12]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 1, 5

  13. [13]

    StyleBench: Evaluating thinking styles in Large Language Models

    Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, and Javad Lavaei. Stylebench: Evaluating thinking styles in large language models.arXiv preprint arXiv:2509.20868, 2025. 5

  14. [14]

    Selective self-to-supervised fine-tuning for generalization in large language models

    Sonam Gupta, Yatin Nandwani, Asaf Yehudai, Dinesh Khandelwal, Dinesh Raghu, and Sachindra Joshi. Selective self-to-supervised fine-tuning for generalization in large language models. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 6240–6249, 2025. 5

  15. [15]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad- level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad- level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

  16. [16]

    Deepmath- 103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning

    Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Deepmath- 103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. 2025. A.2

  17. [17]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Representations. 5

  18. [18]

    Transforming decoder-only models into encoder-only models with improved understanding capabilities

    Zixian Huang, Xinwei Huang, Ao Wu, Xiaxia Wang, and Gong Cheng. Transforming decoder-only models into encoder-only models with improved understanding capabilities.Knowl. Based Syst., 309:112907,

  19. [19]

    Pipelined decoder for efficient context-aware text generation.CoRR, abs/2506.23431, 2025

    Zixian Huang, Chenxu Niu, Yu Gu, Gengyang Xiao, Xinwei Huang, and Gong Cheng. Pipelined decoder for efficient context-aware text generation.CoRR, abs/2506.23431, 2025. 6

  20. [20]

    A branching decoder for set generation

    Zixian Huang, Gengyang Xiao, Yu Gu, and Gong Cheng. A branching decoder for set generation. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. 6

  21. [21]

    Spans, not tokens: a span-centric model for multi-span reading comprehension

    Zixian Huang, Jiaying Zhou, Chenxu Niu, and Gong Cheng. Spans, not tokens: a span-centric model for multi-span reading comprehension. InProceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 874–884, 2023. 2.2

  22. [22]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024. 1

  23. [23]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720,

  24. [24]

    Livecodebench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar- Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. 3.2

  25. [25]

    Taia: Large language models are out-of-distribution data learners.Advances in Neural Information Processing Systems, 37:105200–105235,

    Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, and Yu Wang. Taia: Large language models are out-of-distribution data learners.Advances in Neural Information Processing Systems, 37:105200–105235,

  26. [26]

    Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

    Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms?CoRR, abs/2603.24472, 2026. 3.5

  27. [27]

    Sequence-level knowledge distillation

    Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. InProceedings of the 2016 conference on empirical methods in natural language processing, pages 1317–1327, 2016. 3.4, 5

  28. [28]

    Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017. 5

  29. [29]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles,

  30. [30]

    Small models struggle to learn from strong reasoners

    Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubrama- nian, and Radha Poovendran. Small models struggle to learn from strong reasoners. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, ...

  31. [31]

    Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

    Zhizhong Li and Derek Hoiem. Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017. 5

  32. [32]

    Scar: Data selection via style consistency-aware response ranking for efficient instruction-tuning of large language models

    Zhuang Li, Yuncheng Hua, Thuy Vu, Haolan Zhan, Lizhen Qu, and Gholamreza Haffari. Scar: Data selection via style consistency-aware response ranking for efficient instruction-tuning of large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12756–12790, 2025. 5

  33. [33]

    Autoregressive knowledge distillation through imitation learning

    Alexander Lin, Jeremy Wohlwend, Howard Chen, and Tao Lei. Autoregressive knowledge distillation through imitation learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6121–6133, 2020. 3.4, 5

  34. [34]

    Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy

    Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy.arXiv preprint arXiv:2506.13284, 2025. 1, 3.2

  35. [35]

    Through the valley: Path to effective long cot training for small language models

    Renjie Luo, Jiaxi Li, Chen Huang, and Wei Lu. Through the valley: Path to effective long cot training for small language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025, pa...

  36. [36]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025. 5

  37. [37]

    Learning student-friendly teacher networks for knowledge distillation.Advances in neural information processing systems, 34:13292–13303,

    Dae Young Park, Moon-Hyun Cha, Daesin Kim, Bohyung Han, et al. Learning student-friendly teacher networks for knowledge distillation.Advances in neural information processing systems, 34:13292–13303,

  38. [38]

    Adaswitch: Adaptive switching generation for knowledge distillation.arXiv preprint arXiv:2510.07842, 2025

    Jingyu Peng, Maolin Wang, Hengyi Cai, Yuchen Li, Kai Zhang, Shuaiqiang Wang, Dawei Yin, and Xiangyu Zhao. Adaswitch: Adaptive switching generation for knowledge distillation.arXiv preprint arXiv:2510.07842, 2025. 5

  39. [39]

    Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in llm reasoning.arXiv preprint arXiv:2506.02867, 2025

    Chen Qian, Dongrui Liu, Haochen Wen, Zhen Bai, Yong Liu, and Jing Shao. Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in llm reasoning.arXiv preprint arXiv:2506.02867, 2025. 5

  40. [40]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling, 2024. 3.2

  41. [41]

    A good learner can teach better: Teacher-student collaborative knowledge distillation

    Ayan Sengupta, Shantanu Dixit, Md Shad Akhtar, and Tanmoy Chakraborty. A good learner can teach better: Teacher-student collaborative knowledge distillation. InICLR, 2024. 5

  42. [42]

    Rl’s razor: Why online reinforcement learning forgets less, 2025

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl’s razor: Why online reinforcement learning forgets less.arXiv preprint arXiv:2509.04259, 2025. 5

  43. [43]

    A survey of neural code intelligence: Paradigms, advances and beyond.arXiv preprint arXiv:2403.14734, 2024

    Qiushi Sun, Zhirui Chen, Fangzhi Xu, Kanzhi Cheng, Chang Ma, Zhangyue Yin, Jianing Wang, Chengcheng Han, Renyu Zhu, Shuai Yuan, et al. A survey of neural code intelligence: Paradigms, advances and beyond. arXiv preprint arXiv:2403.14734, 2024. 1

  44. [44]

    Supergpqa: Scaling llm evaluation across 285 graduate disciplines, 2025

    M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming X...

  45. [45]

    Ojbench: A competition level code benchmark for large language models.arXiv preprint arXiv:2506.16395, 2025

    Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao, Muxi Diao, Yanxu Chen, Kelin Fu, Flood Sung, Zhilin Yang, et al. Ojbench: A competition level code benchmark for large language models.arXiv preprint arXiv:2506.16395, 2025. 1, 1, 3.2

  46. [46]

    Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling

    Wenda Xu, Rujun Han, Zifeng Wang, Long T. Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, and Tomas Pfister. Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. 5

  47. [47]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 1, 1, 5

  48. [48]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024. 1

  49. [49]

    Self- distillation bridges distribution gap in language model fine-tuning

    Zhaorui Yang, Tianyu Pang, Haozhe Feng, Han Wang, Wei Chen, Minfeng Zhu, and Qian Liu. Self- distillation bridges distribution gap in language model fine-tuning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thail...

  50. [50]

    Livecodebench pro: How do olympiad medalists judge llms in competitive programming?arXiv preprint arXiv:2506.11928, 2025

    Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu, Hansen He, Dongruixuan Li, Stanley Wei, Hangyi Hao, Jianzhu Yao, et al. Livecodebench pro: How do olympiad medalists judge llms in competitive programming? arXiv preprint arXiv:2506.11928, 2025. 1, 1, 3.2
