pith. machine review for the scientific record.

arxiv: 2604.14164 · v2 · submitted 2026-03-23 · 💻 cs.CL

Recognition: 2 theorem links


How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 00:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords supervised fine-tuning · synthetic data · reasoning models · teacher-student cooperation · stylistic divergence · code generation · TESSY framework · model fine-tuning
0 comments

The pith

Interleaving teacher and student token generation creates synthetic data that improves reasoning-model fine-tuning where pure teacher data causes performance drops.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Using data from a stronger model for supervised fine-tuning often fails to boost, and can even harm, reasoning performance in models like Qwen3-8B. The authors trace the problem to stylistic divergence between the teacher's output and the student's natural distribution. Their TESSY framework has the two models alternate token generation: the student supplies stylistic tokens while the teacher supplies reasoning tokens. This produces training sequences that transfer advanced capabilities without the mismatch. Experiments on code generation show that TESSY data yields gains where pure teacher data produces losses.

Core claim

The paper claims that stylistic divergence between teacher-generated data and the student's distribution is a major cause of SFT failure in reasoning models. TESSY addresses this by interleaving the teacher and student models to generate style tokens from the student and non-style tokens from the teacher alternately, producing synthetic sequences that inherit the teacher's advanced reasoning capabilities while maintaining stylistic consistency with the student.

What carries the argument

TESSY, the Teacher-Student Cooperation Data Synthesis framework that interleaves teacher and student models to alternately generate style and non-style tokens.
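The interleaving loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper does not publish its decoding code, so the `is_style_position` router, the toy generators, and the token budget here are all hypothetical stand-ins.

```python
from typing import Callable, List

def interleaved_synthesis(
    prompt: List[str],
    teacher_next: Callable[[List[str]], str],       # stronger model proposes the next token
    student_next: Callable[[List[str]], str],       # model being fine-tuned proposes the next token
    is_style_position: Callable[[List[str]], bool], # hypothetical style/non-style router
    max_tokens: int = 50,
) -> List[str]:
    """Alternate token sources: the student supplies style tokens,
    the teacher supplies reasoning (non-style) tokens."""
    seq = list(prompt)
    for _ in range(max_tokens):
        source = student_next if is_style_position(seq) else teacher_next
        tok = source(seq)
        if tok == "<eos>":
            break
        seq.append(tok)
    return seq

# Toy stand-ins: the "teacher" walks through a fixed reasoning script,
# the "student" emits a connective whenever the router asks for style.
def toy_teacher(seq):
    # position-based indexing makes the script approximate once style
    # tokens are mixed in -- fine for a toy
    return ["compute", "x", "=", "2", "<eos>"][min(len(seq) - 1, 4)]

def toy_student(seq):
    return "so"

def toy_router(seq):
    return len(seq) % 3 == 0  # every third position is treated as "style"

out = interleaved_synthesis(["Q:"], toy_teacher, toy_student, toy_router, max_tokens=6)
print(out)  # → ['Q:', 'compute', 'x', 'so', '2']
```

In an actual pipeline the two callables would wrap real model forward passes and the router would be whatever style/non-style criterion the paper uses; only the alternation structure is the point here.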

If this is right

  • Fine-tuning Qwen3-8B on TESSY data improves performance by 11.25% on LiveCodeBench-Pro.
  • The same data yields 6.68% gains on OJBench.
  • Pure teacher-generated data causes drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench.
  • The method enables effective transfer of reasoning capabilities through style-consistent synthetic data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The alternation approach could generalize to non-code reasoning domains such as mathematics or general problem solving.
  • Controlling stylistic consistency may prove more important than content alignment alone in making synthetic data useful for fine-tuning.
  • Similar teacher-student cooperation might reduce error accumulation in long generated sequences for other tasks.
  • The framework suggests hybrid data pipelines could combine model strengths more reliably than single-model generation.

Load-bearing premise

Stylistic divergence is the primary cause of SFT performance drops, and alternating token generation preserves reasoning quality without introducing new inconsistencies.

What would settle it

Experiments showing that TESSY data produces no improvement or continued performance drops on LiveCodeBench-Pro and OJBench would falsify the claim that the alternation method effectively bridges the stylistic gap.

read the original abstract

A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even lead to a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher generated data and the distribution of student as a major factor impacting SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while maintaining stylistic consistency with the distribution of the student. In experiments on code generation using GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that supervised fine-tuning of reasoning models such as Qwen3-8B on synthetic data from stronger teachers (e.g., GPT-OSS-120B) often degrades performance due to stylistic divergence between teacher-generated sequences and the student's distribution. To address this, the authors introduce TESSY, a Teacher-Student Cooperation framework that interleaves teacher-generated non-style (reasoning) tokens with student-generated style tokens. On code generation benchmarks, pure teacher data causes drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, while TESSY yields gains of 11.25% and 6.68%.

Significance. If the central empirical results hold under further verification, the work provides a practical, low-overhead method for synthesizing SFT data that retains teacher-level reasoning while aligning stylistically with the student model. This could meaningfully improve fine-tuning outcomes for reasoning tasks where direct teacher data transfer fails, and it draws attention to distribution mismatch as a controllable factor in SFT.

major comments (2)
  1. [TESSY Framework] TESSY Framework (method description): The claim that alternating teacher non-style tokens and student style tokens preserves the teacher's advanced reasoning steps without introducing inconsistencies at token boundaries is load-bearing for attributing the 11.25% and 6.68% gains to stylistic consistency. No ablation is reported that isolates this mechanism (e.g., comparing interleaved sequences against full teacher traces, style-transferred data, or random interleaving) or measures logical coherence of the resulting traces.
  2. [Experiments] Experiments section: The reported benchmark improvements lack error bars, standard deviations across multiple runs, or explicit controls for data volume and diversity. Without these, it is difficult to determine whether the gains over the teacher-data baseline are statistically reliable or could arise from incidental factors unrelated to the style/non-style split.
minor comments (1)
  1. [Abstract and Method] The abstract and method sections use the terms 'style tokens' and 'non-style tokens' without an explicit operational definition or example of how the split is performed at inference time.
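One way the split could be operationalized is a surface-level lexicon of discourse markers. The sketch below is only a guess at such a rule — the paper, per this report, does not define one — and `STYLE_MARKERS` plus the whitespace tokenizer are illustrative assumptions, not the authors' method.

```python
import re

# Hypothetical heuristic: treat discourse connectives and hedges as
# "style" tokens and everything else as "non-style" (reasoning) tokens.
STYLE_MARKERS = {
    "okay", "so", "hmm", "wait", "alternatively", "well",
    "actually", "now", "then", "first", "next", "finally",
}

def split_style_tokens(text):
    """Partition whitespace tokens into (style, non_style) buckets."""
    tokens = re.findall(r"\S+", text)
    style, non_style = [], []
    for tok in tokens:
        bucket = style if tok.lower().strip(",.") in STYLE_MARKERS else non_style
        bucket.append(tok)
    return style, non_style

style, reasoning = split_style_tokens(
    "Okay, so the loop runs n times, then the total cost is O(n)."
)
print(style)      # → ['Okay,', 'so', 'then']
print(reasoning)  # the remaining reasoning tokens
```

A real split would more plausibly operate on model tokenizer IDs or on per-token disagreement between teacher and student distributions; a fixed lexicon is just the simplest testable version of the idea.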

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript accordingly to strengthen the empirical support for TESSY.

read point-by-point responses
  1. Referee: [TESSY Framework] TESSY Framework (method description): The claim that alternating teacher non-style tokens and student style tokens preserves the teacher's advanced reasoning steps without introducing inconsistencies at token boundaries is load-bearing for attributing the 11.25% and 6.68% gains to stylistic consistency. No ablation is reported that isolates this mechanism (e.g., comparing interleaved sequences against full teacher traces, style-transferred data, or random interleaving) or measures logical coherence of the resulting traces.

    Authors: We agree that additional ablations are needed to isolate the interleaving mechanism. In the revised manuscript we will add three controlled comparisons: (1) full teacher traces, (2) style-transferred teacher data (student model used only for style adaptation), and (3) random token interleaving. We will also report a coherence metric that checks logical step consistency across token boundaries using an automated verifier on the generated reasoning traces. revision: yes

  2. Referee: [Experiments] Experiments section: The reported benchmark improvements lack error bars, standard deviations across multiple runs, or explicit controls for data volume and diversity. Without these, it is difficult to determine whether the gains over the teacher-data baseline are statistically reliable or could arise from incidental factors unrelated to the style/non-style split.

    Authors: We acknowledge the importance of statistical controls. The revised experiments section will report mean performance and standard deviations over three independent runs with different random seeds. Data volume will be matched exactly (same total tokens) across all conditions, and we will add diversity statistics (unique n-gram coverage and entropy) to rule out incidental differences in data quantity or variety. revision: yes
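The statistical controls promised here are cheap to compute. A minimal sketch — the scores and the toy corpus below are invented for illustration, not taken from the paper:

```python
import statistics
from collections import Counter
from math import log2

def mean_std(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def ngram_diversity(tokens, n=2):
    """Unique n-gram coverage and Shannon entropy of the n-gram distribution."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    coverage = len(counts) / total
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return coverage, entropy

runs = [41.2, 42.0, 41.5]  # hypothetical benchmark scores from three seeds
m, s = mean_std(runs)
print(f"{m:.2f} ± {s:.2f}")  # → 41.57 ± 0.40

cov, ent = ngram_diversity("the cat sat on the mat".split(), n=2)
print(cov, ent)  # all five bigrams are unique: coverage 1.0, entropy log2(5)
```

Matching total token counts across conditions and reporting these two diversity numbers alongside mean ± std would address the referee's volume and diversity concern directly.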

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of inputs

full rationale

The paper presents TESSY as a data synthesis method that interleaves teacher and student token generation to address observed stylistic divergence. Performance deltas (drops for teacher-only data, gains for TESSY) are reported as direct experimental measurements on LiveCodeBench-Pro and OJBench. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or result to the inputs by construction. The framework description does not invoke uniqueness theorems, rename known patterns, or smuggle ansatzes. The derivation chain is therefore self-contained and externally falsifiable via the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that style matching drives SFT success and introduces the interleaving procedure as the core mechanism; no free parameters or new entities are explicitly quantified in the abstract.

axioms (1)
  • domain assumption Stylistic divergence between teacher-generated data and student model distribution is a major factor limiting SFT effectiveness for reasoning models.
    Directly stated in the abstract as the identified cause of performance drops.

pith-pipeline@v0.9.0 · 5525 in / 1163 out tokens · 39567 ms · 2026-05-15T00:14:06.591867+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 9 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe twelfth international conference on learning representations. 5

  2. [2]

    gpt-oss-120b & gpt-oss-20b Model Card

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025. 1, 3.2

  3. [3]

    Opencodereasoning: Advancing data distillation for competitive coding

    Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jo- celyn Huang, Vahid Noroozi, and Boris Ginsburg. Opencodereasoning: Advancing data distillation for competitive coding.arXiv preprint arXiv:2504.01943, 2025. 1

  4. [4]

    Lora learns less and forgets less.Transactions on Machine Learning Research

    Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. Lora learns less and forgets less.Transactions on Machine Learning Research. 5

  5. [5]

    Medusa: Simple LLM inference acceleration framework with multiple decoding heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machin...

  6. [6]

    Retaining by doing: The role of on-policy data in mitigating forgetting.arXiv preprint arXiv:2510.18874, 2025

    Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on-policy data in mitigating forgetting.arXiv preprint arXiv:2510.18874, 2025. 1, 5

  7. [7]

    Your thoughts tell who you are: Characterize the reasoning patterns of lrms.arXiv preprint arXiv:2509.24147, 2025

    Yida Chen, Yuning Mao, Xianjun Yang, Suyu Ge, Shengjie Bi, Lijuan Liu, Saghar Hosseini, Liang Tan, Yixin Nie, and Shaoliang Nie. Your thoughts tell who you are: Characterize the reasoning patterns of lrms.arXiv preprint arXiv:2509.24147, 2025. 1, 5

  8. [8]

    Opencompass: A universal evaluation platform for foundation models

    OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023. 3.2

  9. [9]

    Xtuner: A toolkit for efficiently fine-tuning llm

    XTuner Contributors. Xtuner: A toolkit for efficiently fine-tuning llm. https://github.com/ InternLM/xtuner, 2023. 3.1

  10. [10]

    Think before you speak: Training language models with pause tokens

    Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net,

  11. [11]

    OpenThoughts: Data Recipes for Reasoning Models

    Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhu- rina, Jean Mercat, Trung Vu, Zayne Sprague, et al. Openthoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025. 1, 3.2

  12. [12]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 1, 5

  13. [13]

    StyleBench: Evaluating thinking styles in Large Language Models

    Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, and Javad Lavaei. Stylebench: Evaluating thinking styles in large language models.arXiv preprint arXiv:2509.20868, 2025. 5

  14. [14]

    Selective self-to-supervised fine-tuning for generalization in large language models

    Sonam Gupta, Yatin Nandwani, Asaf Yehudai, Dinesh Khandelwal, Dinesh Raghu, and Sachindra Joshi. Selective self-to-supervised fine-tuning for generalization in large language models. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 6240–6249, 2025. 5

  15. [15]

    Olympiadbench: A challenging benchmark for promoting agi with olympiad- level bilingual multimodal scientific problems

    Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. Olympiadbench: A challenging benchmark for promoting agi with olympiad- level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape...

  16. [16]

    Deepmath- 103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning

    Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Deepmath- 103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. 2025. A.2

  17. [17]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Representations. 5

  18. [18]

    Transforming decoder-only models into encoder-only models with improved understanding capabilities

    Zixian Huang, Xinwei Huang, Ao Wu, Xiaxia Wang, and Gong Cheng. Transforming decoder-only models into encoder-only models with improved understanding capabilities.Knowl. Based Syst., 309:112907,

  19. [19]

    Pipelined decoder for efficient context-aware text generation.CoRR, abs/2506.23431, 2025

    Zixian Huang, Chenxu Niu, Yu Gu, Gengyang Xiao, Xinwei Huang, and Gong Cheng. Pipelined decoder for efficient context-aware text generation.CoRR, abs/2506.23431, 2025. 6

  20. [20]

    A branching decoder for set generation

    Zixian Huang, Gengyang Xiao, Yu Gu, and Gong Cheng. A branching decoder for set generation. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. 6

  21. [21]

    Spans, not tokens: a span-centric model for multi-span reading comprehension

    Zixian Huang, Jiaying Zhou, Chenxu Niu, and Gong Cheng. Spans, not tokens: a span-centric model for multi-span reading comprehension. InProceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 874–884, 2023. 2.2

  22. [22]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024. 1

  23. [23]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720,

  24. [24]

    Livecodebench: Holistic and contamination free evaluation of large language models for code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar- Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. 3.2

  25. [25]

    Taia: Large language models are out-of-distribution data learners.Advances in Neural Information Processing Systems, 37:105200–105235,

    Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, and Yu Wang. Taia: Large language models are out-of-distribution data learners.Advances in Neural Information Processing Systems, 37:105200–105235,

  26. [26]

    Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

    Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms?CoRR, abs/2603.24472, 2026. 3.5

  27. [27]

    Sequence-level knowledge distillation

    Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. InProceedings of the 2016 conference on empirical methods in natural language processing, pages 1317–1327, 2016. 3.4, 5

  28. [28]

    Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 114(13):3521–3526, 2017. 5

  29. [29]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles,

  30. [30]

    Small models struggle to learn from strong reasoners

    Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubrama- nian, and Radha Poovendran. Small models struggle to learn from strong reasoners. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, ...

  31. [31]

    Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017

    Zhizhong Li and Derek Hoiem. Learning without forgetting.IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017. 5

  32. [32]

    Scar: Data selection via style consistency-aware response ranking for efficient instruction-tuning of large language models

    Zhuang Li, Yuncheng Hua, Thuy Vu, Haolan Zhan, Lizhen Qu, and Gholamreza Haffari. Scar: Data selection via style consistency-aware response ranking for efficient instruction-tuning of large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12756–12790, 2025. 5

  33. [33]

    Autoregressive knowledge distillation through imitation learning

    Alexander Lin, Jeremy Wohlwend, Howard Chen, and Tao Lei. Autoregressive knowledge distillation through imitation learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6121–6133, 2020. 3.4, 5

  34. [34]

    Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy

    Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Acereason-nemotron 1.1: Advancing math and code reasoning through sft and rl synergy.arXiv preprint arXiv:2506.13284, 2025. 1, 3.2

  35. [35]

    Through the valley: Path to effective long cot training for small language models

    Renjie Luo, Jiaxi Li, Chen Huang, and Wei Lu. Through the valley: Path to effective long cot training for small language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025, pa...

  36. [36]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 2025. 5

  37. [37]

    Learning student-friendly teacher networks for knowledge distillation.Advances in neural information processing systems, 34:13292–13303,

    Dae Young Park, Moon-Hyun Cha, Daesin Kim, Bohyung Han, et al. Learning student-friendly teacher networks for knowledge distillation.Advances in neural information processing systems, 34:13292–13303,

  38. [38]

    Adaswitch: Adaptive switching generation for knowledge distillation.arXiv preprint arXiv:2510.07842, 2025

    Jingyu Peng, Maolin Wang, Hengyi Cai, Yuchen Li, Kai Zhang, Shuaiqiang Wang, Dawei Yin, and Xiangyu Zhao. Adaswitch: Adaptive switching generation for knowledge distillation.arXiv preprint arXiv:2510.07842, 2025. 5

  39. [39]

    Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in llm reasoning.arXiv preprint arXiv:2506.02867, 2025

    Chen Qian, Dongrui Liu, Haochen Wen, Zhen Bai, Yong Liu, and Jing Shao. Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in llm reasoning.arXiv preprint arXiv:2506.02867, 2025. 5

  40. [40]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark. InFirst Conference on Language Modeling, 2024. 3.2

  41. [41]

    A good learner can teach better: Teacher-student collaborative knowledge distillation

    Ayan Sengupta, Shantanu Dixit, Md Shad Akhtar, and Tanmoy Chakraborty. A good learner can teach better: Teacher-student collaborative knowledge distillation. InICLR, 2024. 5

  42. [42]

    Rl’s razor: Why online reinforcement learning forgets less, 2025

    Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. Rl’s razor: Why online reinforcement learning forgets less.arXiv preprint arXiv:2509.04259, 2025. 5

  43. [43]

    A survey of neural code intelligence: Paradigms, advances and beyond.arXiv preprint arXiv:2403.14734, 2024

    Qiushi Sun, Zhirui Chen, Fangzhi Xu, Kanzhi Cheng, Chang Ma, Zhangyue Yin, Jianing Wang, Chengcheng Han, Renyu Zhu, Shuai Yuan, et al. A survey of neural code intelligence: Paradigms, advances and beyond. arXiv preprint arXiv:2403.14734, 2024. 1

  44. [44]

    Supergpqa: Scaling llm evaluation across 285 graduate disciplines, 2025

    M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming X...

  45. [45]

    Ojbench: A competition level code benchmark for large language models.arXiv preprint arXiv:2506.16395, 2025

    Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao, Muxi Diao, Yanxu Chen, Kelin Fu, Flood Sung, Zhilin Yang, et al. Ojbench: A competition level code benchmark for large language models.arXiv preprint arXiv:2506.16395, 2025. 1, 1, 3.2

  46. [46]

    Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling

    Wenda Xu, Rujun Han, Zifeng Wang, Long T. Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, and Tomas Pfister. Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. 5

  47. [47]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 1, 1, 5

  48. [48]

    Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

    An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024. 1

  49. [49]

    Self- distillation bridges distribution gap in language model fine-tuning

    Zhaorui Yang, Tianyu Pang, Haozhe Feng, Han Wang, Wei Chen, Minfeng Zhu, and Qian Liu. Self- distillation bridges distribution gap in language model fine-tuning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thail...

  50. [50]

    Livecodebench pro: How do olympiad medalists judge llms in competitive programming?arXiv preprint arXiv:2506.11928, 2025

    Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu, Hansen He, Dongruixuan Li, Stanley Wei, Hangyi Hao, Jianzhu Yao, et al. Livecodebench pro: How do olympiad medalists judge llms in competitive programming? arXiv preprint arXiv:2506.11928, 2025. 1, 1, 3.2
