pith. machine review for the scientific record.

arxiv: 2605.11260 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links


Curriculum Learning-Guided Progressive Distillation in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords curriculum learning · knowledge distillation · large language models · progressive distillation · reasoning benchmarks · data ordering · teacher scheduling · model compression

The pith

Aligning data difficulty with teacher strength boosts distillation

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that combining an easy-to-hard data curriculum with a progressive schedule of teachers of increasing capacity produces better student models than either technique alone or than standard distillation. This matters because distillation is a common way to make powerful LLMs practical, yet current methods suffer from capacity mismatches that waste the teacher's knowledge. CLPD provides a modular way to align these two factors during training. Experiments report gains on reasoning tasks, suggesting that jointly considering both factors is necessary for effective transfer. If true, this changes how practitioners should design distillation pipelines for small language models.

Core claim

CLPD constructs an explicit curriculum by organizing training examples from easy to hard, while simultaneously applying an implicit curriculum over supervision signals by progressively scheduling teachers of increasing capacity. This unified framework outperforms standard distillation, data ordering alone, and teacher scheduling alone on reasoning benchmarks.
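The mechanics of that unified schedule can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the `difficulty` scorer, the teacher names, and the equal-length phase boundaries are all assumptions made here for clarity.

```python
def clpd_schedule(examples, difficulty, teachers):
    """Yield (example, teacher) pairs implementing both curricula:
    data sorted easy-to-hard (explicit), and the teacher swapped for
    the next-larger model at each phase boundary (implicit)."""
    ordered = sorted(examples, key=difficulty)         # explicit curriculum
    phase_len = max(1, len(ordered) // len(teachers))  # one phase per teacher
    for i, example in enumerate(ordered):
        teacher = teachers[min(i // phase_len, len(teachers) - 1)]
        yield example, teacher

# Toy usage: difficulty proxied by prompt length, teachers named by size.
examples = ["2+2", "12*(3+4)", "solve for x: 0.8x = 500",
            "a long multi-step word problem about discounted prices ..."]
pairs = list(clpd_schedule(examples, difficulty=len,
                           teachers=["teacher-7B", "teacher-70B"]))
# Early (easy) examples pair with the smaller teacher, late (hard) ones
# with the larger teacher, so data difficulty tracks teacher strength.
```

The ablations the paper compares against fall out of this sketch by dropping one half: random ordering with the same teacher schedule, or sorted ordering with a single fixed teacher.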

What carries the argument

The CLPD framework aligning an explicit data difficulty curriculum with an implicit teacher capacity curriculum

Load-bearing premise

That examples in the training set have a stable, meaningful difficulty ranking that aligns with teacher capacities without causing mismatches

What would settle it

Observing no performance difference when the data order is randomized but the teacher progression is kept would falsify the importance of the alignment
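That randomization test is straightforward to wire up. In the sketch below, `train_student` and `evaluate` are hypothetical stand-ins for a real distillation run and a benchmark harness; only the control logic is the point.

```python
import random

def ordering_ablation(examples, difficulty, teachers,
                      train_student, evaluate, seed=0):
    """Train twice with the same progressive teacher schedule: once on
    the easy-to-hard ordering, once on a shuffled ordering. A gap near
    zero would undercut the claim that data/teacher alignment matters."""
    curriculum = sorted(examples, key=difficulty)
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # teacher progression unchanged

    aligned_score = evaluate(train_student(curriculum, teachers))
    shuffled_score = evaluate(train_student(shuffled, teachers))
    return aligned_score - shuffled_score
```

With real training in place, a gap that is statistically flat across seeds would falsify the alignment claim; a consistent positive gap would support it.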

Figures

Figures reproduced from arXiv: 2605.11260 by Aryan Mokhtari, Fanzhi Zeng, Jincheng Cao, Leqi Liu.

Figure 1. Overview of the Curriculum Learning-Guided Progressive Distillation (CLPD) framework.
Figure 2. CLPD vs. PD under different data partitions.
Figure 3. Comparison of CLPD and OPD on GSM8K.
Original abstract

Knowledge distillation is a key technique for transferring the capabilities of large language models (LLMs) into smaller, more efficient student models. Existing distillation approaches often overlook two critical factors: the learning order of training data and the capacity mismatch between teacher and student models. This oversight limits distillation performance, as manifested by the counter-intuitive phenomenon where stronger teachers fail to produce better students. In this work, we propose Curriculum Learning-Guided Progressive Distillation (CLPD), a unified framework that explicitly accounts for both factors by aligning data difficulty with teacher strength. CLPD constructs an explicit curriculum by organizing training examples from easy to hard, while simultaneously applying an implicit curriculum over supervision signals by progressively scheduling teachers of increasing capacity. Our framework is modular and can be integrated into standard distillation algorithms with minimal overhead. Empirical results on the reasoning benchmarks demonstrate that CLPD consistently outperforms standard distillation, data ordering alone, and teacher scheduling alone across multiple settings. These findings highlight the importance of jointly considering data ordering and teacher capacity when distilling reasoning abilities into small language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Curriculum Learning-Guided Progressive Distillation (CLPD), a modular framework for LLM knowledge distillation that combines an explicit curriculum ordering training examples from easy to hard with an implicit curriculum that progressively schedules teachers of increasing capacity. It claims this joint alignment addresses capacity mismatch and data-ordering issues, yielding consistent outperformance over standard distillation, data ordering alone, and teacher scheduling alone on reasoning benchmarks.

Significance. If the empirical results hold under scrutiny, the work offers a practical, low-overhead extension to existing distillation pipelines that could improve transfer of reasoning capabilities to smaller models. The modularity and explicit handling of two known failure modes (stronger teachers not always producing stronger students, and suboptimal data ordering) are strengths that could influence follow-on research in efficient LLM training.

major comments (2)
  1. [Method / Experiments] The central empirical claim requires that a single difficulty ordering of examples remains stably aligned with a sequence of teachers of increasing capacity. No section demonstrates that the chosen ordering (whether loss-based, length-based, or otherwise) is consistent across student sizes or robust to reasonable variations in the difficulty metric; without such evidence the reported gains over the separate ablations cannot be attributed to the unified framework rather than to a particular, non-general alignment.
  2. [Abstract] The abstract states that CLPD 'consistently outperforms' the three baselines 'across multiple settings' yet supplies no information on the reasoning benchmarks used, the concrete difficulty metric, teacher/student model sizes, number of runs, or statistical tests. This absence makes the load-bearing performance claim impossible to evaluate from the manuscript as presented.
minor comments (1)
  1. [Abstract] The phrase 'counter-intuitive phenomenon where stronger teachers fail to produce better students' would benefit from a citation to prior distillation literature that documents this mismatch.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key areas where the manuscript can be strengthened. We address each major comment below and outline the revisions we will incorporate to improve clarity, robustness, and evaluability of the claims.

Point-by-point responses
  1. Referee: [Method / Experiments] The central empirical claim requires that a single difficulty ordering of examples remains stably aligned with a sequence of teachers of increasing capacity. No section demonstrates that the chosen ordering (whether loss-based, length-based, or otherwise) is consistent across student sizes or robust to reasonable variations in the difficulty metric; without such evidence the reported gains over the separate ablations cannot be attributed to the unified framework rather than to a particular, non-general alignment.

    Authors: We agree that demonstrating the stability of the difficulty ordering across student sizes and its robustness to metric variations is necessary to attribute gains specifically to the joint CLPD framework. Our current results use a loss-based ordering and show consistent outperformance over ablations, but we did not include explicit cross-size consistency checks or alternative-metric comparisons. In the revised manuscript we will add a dedicated analysis subsection that (i) evaluates the same loss-based ordering on student models ranging from 1B to 7B parameters and (ii) reports performance when the ordering is derived from length-based and perplexity-based metrics instead. These additions will clarify whether the observed benefits generalize beyond a single alignment choice. revision: yes
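The promised robustness analysis amounts to checking that different difficulty scorers induce nearly the same ordering. A minimal sketch, assuming no tied scores (the d²-based Spearman formula below breaks on ties); the loss values are hypothetical placeholders for real per-example teacher losses.

```python
def spearman_rho(scores_a, scores_b):
    """Rank correlation between two difficulty scorings of the same
    examples; values near 1.0 mean the induced curricula are nearly
    interchangeable. Assumes no tied scores."""
    def ranks(scores):
        order = sorted(range(len(scores)), key=scores.__getitem__)
        r = [0] * len(scores)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d_sq = sum((a - b) ** 2 for a, b in zip(ra, rb))
    return 1 - 6 * d_sq / (n * (n * n - 1))

# Toy check: length-based vs. (hypothetical) loss-based difficulty.
length_scores = [3, 9, 20, 31, 45]
loss_scores = [0.10, 0.40, 0.30, 0.80, 0.90]  # illustrative per-example losses
rho = spearman_rho(length_scores, loss_scores)  # high rho → stable ordering
```

A table of such correlations across student sizes and metrics would directly support (or undercut) attributing the gains to the unified framework rather than one alignment choice.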

  2. Referee: [Abstract] The abstract states that CLPD 'consistently outperforms' the three baselines 'across multiple settings' yet supplies no information on the reasoning benchmarks used, the concrete difficulty metric, teacher/student model sizes, number of runs, or statistical tests. This absence makes the load-bearing performance claim impossible to evaluate from the manuscript as presented.

    Authors: We acknowledge that the current abstract is insufficiently specific and prevents readers from assessing the empirical claims. In the revised version we will expand the abstract to state the concrete reasoning benchmarks (GSM8K, MATH), the difficulty metric (loss-based), the teacher and student model sizes (teachers from 7B to 70B, students 1B–3B), the number of independent runs (three), and the use of statistical significance testing (paired t-tests, p < 0.05). This will make the performance statements directly evaluable while preserving the abstract’s brevity. revision: yes

Circularity Check

0 steps flagged

Empirical modular framework with no load-bearing self-referential steps

Full rationale

The manuscript presents CLPD as a practical combination of explicit data curriculum (easy-to-hard ordering) and implicit teacher scheduling (progressive capacity increase). All central claims rest on empirical outperformance versus standard distillation, data-ordering alone, and teacher-scheduling alone on reasoning benchmarks. No equations, fitted parameters, or derivations are introduced that reduce any reported gain to a quantity defined inside the paper itself. Prior curriculum-learning and distillation literature is cited only for background; the framework is explicitly modular and requires no uniqueness theorems or self-citation chains to justify its construction. The derivation chain is therefore self-contained and externally falsifiable via the reported experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions about data difficulty ordering and teacher capacity progression rather than new theoretical derivations or invented entities.

axioms (2)
  • domain assumption Training examples possess a stable difficulty ordering that can be discovered and used for curriculum construction
    Invoked when the paper states it constructs an explicit curriculum by organizing examples from easy to hard.
  • domain assumption Teacher models of increasing capacity can be scheduled to provide progressively better supervision signals without introducing harmful mismatches
    Basis for the implicit curriculum over supervision signals.

pith-pipeline@v0.9.0 · 5485 in / 1123 out tokens · 30303 ms · 2026-05-13T01:49:37.249301+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 10 internal anchors

  1. [1]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  2. [2]

    Llama 3.2: Revolutionizing edge AI and vision with open, customizable models

    Meta AI. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. Meta AI Blog. Retrieved December 20, 2024

  3. [3]

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Anto...

  4. [4]

    Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025. Preprint

  5. [5]

    LightPAFF: A two-stage distillation framework for pre-training and fine-tuning

    Kaitao Song, Hao Sun, Xu Tan, Tao Qin, Jianfeng Lu, Hongzhi Liu, and Tie-Yan Liu. LightPAFF: A two-stage distillation framework for pre-training and fine-tuning. arXiv preprint arXiv:2004.12817, 2020

  6. [6]

    MiniLLM: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2023

  7. [7]

    Patient knowledge distillation for BERT model compression

    Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for BERT model compression. arXiv preprint arXiv:1908.09355, 2019

  8. [8]

    TinyBERT: Distilling BERT for natural language understanding

    Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174, 2020

  9. [9]

    Efficient knowledge distillation via curriculum extraction

    Shivam Gupta and Sushrut Karmalkar. Efficient knowledge distillation via curriculum extraction. arXiv preprint arXiv:2503.17494, 2025

  10. [10]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024

  11. [11]

    Sequence-level knowledge distillation

    Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, 2016

  12. [12]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024

  13. [13]

    Claude 3.5 sonnet

    Anthropic. Claude 3.5 Sonnet. https://www.anthropic.com/news/claude-3-5-sonnet

  14. [14]

    Published: 21 Jun 2024

  15. [15]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  16. [16]

    Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling

    Wenda Xu, Rujun Han, Zifeng Wang, Long T Le, Dhruv Madeka, Lei Li, William Yang Wang, Rishabh Agarwal, Chen-Yu Lee, and Tomas Pfister. Speculative knowledge distillation: Bridging the teacher-student gap through interleaved sampling. arXiv preprint arXiv:2410.11325, 2024

  17. [17]

    Being strong progressively! Enhancing knowledge distillation of large language models through a curriculum learning framework

    Lingyuan Liu and Mengxiang Zhang. Being strong progressively! Enhancing knowledge distillation of large language models through a curriculum learning framework. arXiv preprint arXiv:2506.05695, 2025

  18. [18]

    When do curricula work?

    Xiaoxia Wu, Ethan Dyer, and Behnam Neyshabur. When do curricula work? arXiv preprint arXiv:2012.03107, 2020

  19. [19]

    A survey on curriculum learning

    Xin Wang, Yudong Chen, and Wenwu Zhu. A survey on curriculum learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):4555–4576, 2021

  20. [20]

    Improved knowledge distillation via teacher assistant

    Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5191–5198, 2020

  21. [21]

    Annealing knowledge distillation

    A. Jafari, Mehdi Rezagholizadeh, Pranav Sharma, and A. Ghodsi. Annealing knowledge distillation. Conference of the European Chapter of the Association for Computational Linguistics, 2021

  22. [22]

    Supervision complexity and its role in knowledge distillation

    Hrayr Harutyunyan, Ankit Singh Rawat, Aditya Krishna Menon, Seungyeon Kim, and Sanjiv Kumar. Supervision complexity and its role in knowledge distillation. arXiv preprint arXiv:2301.12245, 2023

  23. [23]

    Beyond one-step distillation: Bridging the capacity gap in small language models via multi-step knowledge transfer

    Gaeun Yim, Nayoung Ko, and Manasa Bharadwaj. Beyond one-step distillation: Bridging the capacity gap in small language models via multi-step knowledge transfer. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 182–187, 2026

  24. [24]

    Towards the law of capacity gap in distilling language models

    Chen Zhang, Qiuchi Li, Dawei Song, Zheyu Ye, Yan Gao, and Yao Hu. Towards the law of capacity gap in distilling language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22504–22528, 2025

  25. [25]

    Revisiting the Capacity Gap in Chain-of-Thought Distillation from a Practical Perspective

    Tokio Kajitsuka, Ukyo Honda, and Sho Takase. Revisiting the capacity gap in chain-of-thought distillation from a practical perspective. arXiv preprint arXiv:2604.08880, 2026

  26. [26]

    DistiLLM: Towards streamlined distillation for large language models

    Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun. DistiLLM: Towards streamlined distillation for large language models. arXiv preprint arXiv:2402.03898, 2024

  27. [27]

    DistiLLM-2: A contrastive approach boosts the distillation of LLMs

    Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, and Se-Young Yun. DistiLLM-2: A contrastive approach boosts the distillation of LLMs. arXiv preprint arXiv:2503.07067, 2025

  28. [28]

    Teach small models to reason by curriculum distillation

    Wangyi Jiang, Yaojie Lu, Hongyu Lin, Xianpei Han, and Le Sun. Teach small models to reason by curriculum distillation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7423–7433, 2025

  29. [29]

    Progressive distillation induces an implicit curriculum

    Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Andrej Risteski, and Surbhi Goel. Progressive distillation induces an implicit curriculum. arXiv preprint arXiv:2410.05464, 2024

  30. [30]

    Minimal distillation schedule for extreme language model compression

    Chen Zhang, Yang Yang, Qifan Wang, Jiahao Liu, Jingang Wang, Wei Wu, and Dawei Song. Minimal distillation schedule for extreme language model compression. In Findings of the Association for Computational Linguistics: EACL 2024, pages 1378–1394, 2024

  31. [31]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  32. [32]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021

  33. [33]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  34. [34]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  35. [35]

    Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies

    Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021

  36. [36]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018

  37. [37]

    Angles don't lie: Unlocking training-efficient RL through the model's own signals

    Qinsi Wang, Jinghan Ke, Hancheng Ye, Yueqian Lin, Yuzhe Fu, Jianyi Zhang, Kurt Keutzer, Chenfeng Xu, and Yiran Chen. Angles don't lie: Unlocking training-efficient RL through the model's own signals. arXiv preprint arXiv:2506.02281, 2025

  38. [38]

    To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

    Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett. To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning. arXiv preprint arXiv:2409.12183, 2024

  39. [39]

    TRL: Transformer reinforcement learning library

    Leandro von Werra, Younes Belkada, Sajjad Bayat, and Thomas Wolf. TRL: Transformer reinforcement learning library. https://github.com/huggingface/trl, 2020. Hugging Face

  40. [40]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022
