Recognition: 2 theorem links
How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
Pith reviewed 2026-05-15 00:14 UTC · model grok-4.3
The pith
Interleaving teacher and student token generation yields synthetic data that improves reasoning-model fine-tuning, rather than the performance drops seen with pure teacher-generated data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that stylistic divergence between teacher-generated data and the student's distribution is a major cause of SFT failure in reasoning models. TESSY addresses this by interleaving the two models so that the student generates style tokens and the teacher generates non-style tokens, producing synthetic sequences that inherit the teacher's advanced reasoning capabilities while remaining stylistically consistent with the student.
What carries the argument
TESSY, the Teacher-Student Cooperation Data Synthesis framework that interleaves teacher and student models to alternately generate style and non-style tokens.
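A minimal, runnable sketch of the alternation idea, not the paper's implementation: the student fills the style spans and the teacher fills the non-style reasoning spans, yielding a sequence of the form [s1, t1, s2, t2, ...]. The toy generators and the simple even/odd turn rule below are illustrative assumptions; the paper's Eq. 4 and Alg. 1 define the actual style/non-style split.

```python
# Toy sketch of teacher-student interleaved synthesis, y = [s1, t1, s2, t2, ...].
# The student writes "style" spans (connectives, phrasing); the teacher writes
# "non-style" spans (the reasoning content). Both generators and the even/odd
# alternation rule are stand-ins, not the paper's models or token classifier.

def synthesize_interleaved(prompt, student_step, teacher_step, max_spans=6):
    """Build one synthetic training sequence by alternating student/teacher spans."""
    sequence = list(prompt)
    for turn in range(max_spans):
        step = student_step if turn % 2 == 0 else teacher_step  # even turns: style
        span = step(turn, sequence)
        if not span:  # a model signals the end of the sequence
            break
        sequence.extend(span)
    return sequence

# Toy stand-ins for demonstration only.
def toy_student(turn, seq):
    style = [["Okay,", "let's", "think."], ["So,"], ["Therefore,"]]
    return style[turn // 2] if turn // 2 < len(style) else []

def toy_teacher(turn, seq):
    reasoning = [["sort", "the", "array."], ["sum", "adjacent", "pairs."]]
    return reasoning[turn // 2] if turn // 2 < len(reasoning) else []

print(" ".join(synthesize_interleaved(["PROBLEM:"], toy_student, toy_teacher)))
```

Fine-tuning the student on sequences produced this way is what the benchmarks then evaluate; the sketch only shows the alternating control flow.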
If this is right
- Fine-tuning Qwen3-8B on TESSY data improves performance by 11.25% on LiveCodeBench-Pro.
- The same data yields 6.68% gains on OJBench.
- Pure teacher-generated data causes drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench.
- The method enables effective transfer of reasoning capabilities through style-consistent synthetic data.
Where Pith is reading between the lines
- The alternation approach could generalize to non-code reasoning domains such as mathematics or general problem solving.
- Controlling stylistic consistency may prove more important than content alignment alone in making synthetic data useful for fine-tuning.
- Similar teacher-student cooperation might reduce error accumulation in long generated sequences for other tasks.
- The framework suggests hybrid data pipelines could combine model strengths more reliably than single-model generation.
Load-bearing premise
Stylistic divergence is the primary cause of SFT performance drops, and alternating token generation preserves reasoning quality without introducing new inconsistencies.
What would settle it
Experiments showing that TESSY data produces no improvement or continued performance drops on LiveCodeBench-Pro and OJBench would falsify the claim that the alternation method effectively bridges the stylistic gap.
Original abstract
A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models like Qwen3-8B, this approach often fails to improve reasoning capabilities and can even lead to a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher generated data and the distribution of student as a major factor impacting SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves teacher and student models to alternately generate style and non-style tokens. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while maintaining stylistic consistency with the distribution of the student. In experiments on code generation using GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on teacher-generated data leads to performance drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that supervised fine-tuning of reasoning models such as Qwen3-8B on synthetic data from stronger teachers (e.g., GPT-OSS-120B) often degrades performance due to stylistic divergence between teacher-generated sequences and the student's distribution. To address this, the authors introduce TESSY, a Teacher-Student Cooperation framework that interleaves teacher-generated non-style (reasoning) tokens with student-generated style tokens. On code generation benchmarks, pure teacher data causes drops of 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, while TESSY yields gains of 11.25% and 6.68%.
Significance. If the central empirical results hold under further verification, the work provides a practical, low-overhead method for synthesizing SFT data that retains teacher-level reasoning while aligning stylistically with the student model. This could meaningfully improve fine-tuning outcomes for reasoning tasks where direct teacher data transfer fails, and it draws attention to distribution mismatch as a controllable factor in SFT.
major comments (2)
- [TESSY Framework] Method description: The claim that alternating teacher non-style tokens and student style tokens preserves the teacher's advanced reasoning steps without introducing inconsistencies at token boundaries is load-bearing for attributing the 11.25% and 6.68% gains to stylistic consistency. No ablation is reported that isolates this mechanism (e.g., comparing interleaved sequences against full teacher traces, style-transferred data, or random interleaving) or measures the logical coherence of the resulting traces.
- [Experiments] Experiments section: The reported benchmark improvements lack error bars, standard deviations across multiple runs, or explicit controls for data volume and diversity. Without these, it is difficult to determine whether the gains over the teacher-data baseline are statistically reliable or could arise from incidental factors unrelated to the style/non-style split.
minor comments (1)
- [Abstract and Method] The abstract and method sections use the terms 'style tokens' and 'non-style tokens' without an explicit operational definition or example of how the split is performed at inference time.
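Purely as an illustration of what such an operational definition might look like (this heuristic is an assumption of this note, not the paper's criterion), style tokens could be identified as discourse markers drawn from a fixed list:

```python
# Hypothetical style-token heuristic for illustration only; the paper's own
# style/non-style split (Eq. 4, Alg. 1) is not reproduced here and may differ.
STYLE_MARKERS = {"okay", "so", "hmm", "wait", "let's", "therefore", "alright"}

def is_style_token(token: str) -> bool:
    return token.lower().strip(".,") in STYLE_MARKERS

trace = "Okay , so we sort the array , therefore the answer is n log n".split()
print([tok for tok in trace if is_style_token(tok)])  # ['Okay', 'so', 'therefore']
```

Any real split would presumably be learned or rule-based at the span level; the point is only that the definition can be made explicit and testable.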
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript accordingly to strengthen the empirical support for TESSY.
Point-by-point responses
- Referee: [TESSY Framework] Method description: The claim that alternating teacher non-style tokens and student style tokens preserves the teacher's advanced reasoning steps without introducing inconsistencies at token boundaries is load-bearing for attributing the 11.25% and 6.68% gains to stylistic consistency. No ablation is reported that isolates this mechanism (e.g., comparing interleaved sequences against full teacher traces, style-transferred data, or random interleaving) or measures the logical coherence of the resulting traces.
  Authors: We agree that additional ablations are needed to isolate the interleaving mechanism. In the revised manuscript we will add three controlled comparisons: (1) full teacher traces, (2) style-transferred teacher data (student model used only for style adaptation), and (3) random token interleaving. We will also report a coherence metric that checks logical step consistency across token boundaries using an automated verifier on the generated reasoning traces. revision: yes
- Referee: [Experiments] Experiments section: The reported benchmark improvements lack error bars, standard deviations across multiple runs, or explicit controls for data volume and diversity. Without these, it is difficult to determine whether the gains over the teacher-data baseline are statistically reliable or could arise from incidental factors unrelated to the style/non-style split.
  Authors: We acknowledge the importance of statistical controls. The revised experiments section will report mean performance and standard deviations over three independent runs with different random seeds. Data volume will be matched exactly (same total tokens) across all conditions, and we will add diversity statistics (unique n-gram coverage and entropy) to rule out incidental differences in data quantity or variety. revision: yes
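Two of the promised controls admit compact sketches, under assumed inputs: a random-interleaving baseline built from pre-segmented teacher and student spans, and the diversity statistics (distinct n-gram ratio and unigram entropy). The span lists and toy corpus below are placeholders, not artifacts from the paper.

```python
import math
import random
from collections import Counter

def random_interleave(teacher_spans, student_spans, seed=0):
    """Control condition: fill each slot from a randomly chosen source."""
    rng = random.Random(seed)
    merged = []
    for t_span, s_span in zip(teacher_spans, student_spans):
        merged.extend(t_span if rng.random() < 0.5 else s_span)
    return merged

def distinct_ngram_ratio(tokens, n=2):
    """Fraction of n-grams that are unique: a simple diversity statistic."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

def token_entropy(tokens):
    """Shannon entropy (bits) of the unigram token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

merged = random_interleave([["sort", "the", "array."], ["sum", "pairs."]],
                           [["Okay,", "so"], ["Therefore,"]])
corpus = "let us reason step by step and then verify each step".split()
print(merged)
print(round(distinct_ngram_ratio(corpus, n=2), 3), round(token_entropy(corpus), 3))
```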
Circularity Check
No significant circularity; empirical results independent of inputs
Full rationale
The paper presents TESSY as a data synthesis method that interleaves teacher and student token generation to address observed stylistic divergence. Performance deltas (drops for teacher-only data, gains for TESSY) are reported as direct experimental measurements on LiveCodeBench-Pro and OJBench. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or result to the inputs by construction. The framework description does not invoke uniqueness theorems, rename known patterns, or smuggle ansatzes. The derivation chain is therefore self-contained and externally falsifiable via the stated benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Stylistic divergence between teacher-generated data and the student model's distribution is a major factor limiting SFT effectiveness for reasoning models.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J-cost uniqueness) · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "TESSY ... interleaves teacher and student models to alternately generate style and non-style tokens ... y = [s1, t1, s2, t2, ...]" (Eq. 4, Alg. 1)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "performance drops of 3.25% ... TESSY achieves improvements of 11.25%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations.
- [2] Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925, 2025.
- [3] Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar, Aleksander Ficek, Siddhartha Jain, Jocelyn Huang, Vahid Noroozi, and Boris Ginsburg. OpenCodeReasoning: Advancing data distillation for competitive coding. arXiv preprint arXiv:2504.01943, 2025.
- [4] Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. LoRA learns less and forgets less. Transactions on Machine Learning Research.
- [5] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM inference acceleration framework with multiple decoding heads. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Forty-first International Conference on Machin..., 2024.
- [6] Howard Chen, Noam Razin, Karthik Narasimhan, and Danqi Chen. Retaining by doing: The role of on-policy data in mitigating forgetting. arXiv preprint arXiv:2510.18874, 2025.
- [7] Yida Chen, Yuning Mao, Xianjun Yang, Suyu Ge, Shengjie Bi, Lijuan Liu, Saghar Hosseini, Liang Tan, Yixin Nie, and Shaoliang Nie. Your thoughts tell who you are: Characterize the reasoning patterns of LRMs. arXiv preprint arXiv:2509.24147, 2025.
- [8] OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
- [9] XTuner Contributors. XTuner: A toolkit for efficiently fine-tuning LLM. https://github.com/InternLM/xtuner, 2023.
- [10] Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
- [11] Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, et al. OpenThoughts: Data recipes for reasoning models. arXiv preprint arXiv:2506.04178, 2025.
- [12] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [13] Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, and Javad Lavaei. StyleBench: Evaluating thinking styles in large language models. arXiv preprint arXiv:2509.20868, 2025.
- [14] Sonam Gupta, Yatin Nandwani, Asaf Yehudai, Dinesh Khandelwal, Dinesh Raghu, and Sachindra Joshi. Selective self-to-supervised fine-tuning for generalization in large language models. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 6240–6249, 2025.
- [15] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pape..., 2024.
- [16] Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. DeepMath-103K: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. 2025.
- [17] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
- [18] Zixian Huang, Xinwei Huang, Ao Wu, Xiaxia Wang, and Gong Cheng. Transforming decoder-only models into encoder-only models with improved understanding capabilities. Knowledge-Based Systems, 309:112907.
- [19] Zixian Huang, Chenxu Niu, Yu Gu, Gengyang Xiao, Xinwei Huang, and Gong Cheng. Pipelined decoder for efficient context-aware text generation. CoRR, abs/2506.23431, 2025.
- [20] Zixian Huang, Gengyang Xiao, Yu Gu, and Gong Cheng. A branching decoder for set generation. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.
- [21] Zixian Huang, Jiaying Zhou, Chenxu Niu, and Gong Cheng. Spans, not tokens: A span-centric model for multi-span reading comprehension. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 874–884, 2023.
- [22] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024.
- [23] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720.
- [24] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.
- [25] Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, and Yu Wang. TAIA: Large language models are out-of-distribution data learners. Advances in Neural Information Processing Systems, 37:105200–105235.
- [26] Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of LLMs? CoRR, abs/2603.24472, 2026.
- [27] Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, 2016.
- [28] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
- [29] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- [30] Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, and Radha Poovendran. Small models struggle to learn from strong reasoners. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, ..., 2025.
- [31] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.
- [32] Zhuang Li, Yuncheng Hua, Thuy Vu, Haolan Zhan, Lizhen Qu, and Gholamreza Haffari. SCAR: Data selection via style consistency-aware response ranking for efficient instruction-tuning of large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12756–12790, 2025.
- [33] Alexander Lin, Jeremy Wohlwend, Howard Chen, and Tao Lei. Autoregressive knowledge distillation through imitation learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6121–6133, 2020.
- [34] Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. AceReason-Nemotron 1.1: Advancing math and code reasoning through SFT and RL synergy. arXiv preprint arXiv:2506.13284, 2025.
- [35] Renjie Luo, Jiaxi Li, Chen Huang, and Wei Lu. Through the valley: Path to effective long CoT training for small language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025, ...
- [36] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. IEEE Transactions on Audio, Speech and Language Processing, 2025.
- [37] Dae Young Park, Moon-Hyun Cha, Daesin Kim, Bohyung Han, et al. Learning student-friendly teacher networks for knowledge distillation. Advances in Neural Information Processing Systems, 34:13292–13303.
- [38] Jingyu Peng, Maolin Wang, Hengyi Cai, Yuchen Li, Kai Zhang, Shuaiqiang Wang, Dawei Yin, and Xiangyu Zhao. AdaSwitch: Adaptive switching generation for knowledge distillation. arXiv preprint arXiv:2510.07842, 2025.
- [39] Chen Qian, Dongrui Liu, Haochen Wen, Zhen Bai, Yong Liu, and Jing Shao. Demystifying reasoning dynamics with mutual information: Thinking tokens are information peaks in LLM reasoning. arXiv preprint arXiv:2506.02867, 2025.
- [40] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. In First Conference on Language Modeling, 2024.
- [41] Ayan Sengupta, Shantanu Dixit, Md Shad Akhtar, and Tanmoy Chakraborty. A good learner can teach better: Teacher-student collaborative knowledge distillation. In ICLR, 2024.
- [42] Idan Shenfeld, Jyothish Pari, and Pulkit Agrawal. RL's razor: Why online reinforcement learning forgets less. arXiv preprint arXiv:2509.04259, 2025.
- [43] Qiushi Sun, Zhirui Chen, Fangzhi Xu, Kanzhi Cheng, Chang Ma, Zhangyue Yin, Jianing Wang, Chengcheng Han, Renyu Zhu, Shuai Yuan, et al. A survey of neural code intelligence: Paradigms, advances and beyond. arXiv preprint arXiv:2403.14734, 2024.
- [44] M-A-P Team, Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, Kang Zhu, Minghao Liu, Yiming Liang, Xiaolong Jin, Zhenlin Wei, Chujie Zheng, Kaixing Deng, Shuyue Guo, Shian Jia, Sichao Jiang, Yiyan Liao, Rui Li, Qinrui Li, Sirun Li, Yizhi Li, Yunwen Li, Dehua Ma, Yuansheng Ni, Haoran Que, Qiyao Wang, Zhoufutu Wen, Siwei Wu, Tianshun Xing, Ming X... SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines, 2025.
- [45] Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao, Muxi Diao, Yanxu Chen, Kelin Fu, Flood Sung, Zhilin Yang, et al. OJBench: A competition level code benchmark for large language models. arXiv preprint arXiv:2506.16395, 2025.
- [46] Wenda Xu, Rujun Han, Zifeng Wang, Long T. Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, and Tomas Pfister. Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.
- [47] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [48] An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024.
- [49] Zhaorui Yang, Tianyu Pang, Haozhe Feng, Han Wang, Wei Chen, Minfeng Zhu, and Qian Liu. Self-distillation bridges distribution gap in language model fine-tuning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thail..., 2024.
- [50] Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu, Hansen He, Dongruixuan Li, Stanley Wei, Hangyi Hao, Jianzhu Yao, et al. LiveCodeBench Pro: How do olympiad medalists judge LLMs in competitive programming? arXiv preprint arXiv:2506.11928, 2025.
discussion (0)