pith. machine review for the scientific record.

arxiv: 2605.07396 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Rubric-based On-policy Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:07 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords on-policy distillation · rubric-based optimization · black-box model alignment · LLM knowledge distillation · sample efficiency · teacher-student contrast · policy optimization

The pith

Rubrics induced from teacher-student response contrasts can replace logits for on-policy distillation in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that on-policy distillation, which typically requires access to a teacher's internal probability outputs, can instead use structured semantic rubrics derived solely from comparing teacher and student answers. This approach allows the method to work even when the teacher model is a black box whose internals are inaccessible. By generating these rubrics for each prompt and then using them to score the student's own generated responses during training, the student model improves its alignment without needing logit information. If successful, this opens distillation to a much wider range of teacher models, including proprietary ones, and can dramatically reduce the number of samples needed for effective training.

Core claim

ROPD induces prompt-specific rubrics from contrasts between teacher-generated responses and student rollouts, then applies these rubrics to score the student's on-policy generations for optimization. This rubric-based scoring serves as a scalable substitute for direct teacher logits, enabling effective distillation in black-box settings. Experiments demonstrate that ROPD surpasses advanced logit-based OPD techniques in most tested scenarios while requiring up to ten times fewer samples.

What carries the argument

The ROPD framework, which creates rubrics by contrasting teacher and student outputs for each prompt and then scores student policy rollouts against these rubrics to drive on-policy updates.
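
To make the pipeline concrete, here is a minimal sketch of one ROPD training step under the description above, assuming a GRPO/PPO-style on-policy update. All object and method names (induce_rubric, score, step_on_rewards, group_size) are illustrative placeholders, not the paper's API or the released code.

```python
# Minimal sketch of one ROPD training step as described above. Every name
# here (induce_rubric, score, step_on_rewards, group_size) is an illustrative
# placeholder, not the paper's API or the released code.

def ropd_step(prompt, teacher, student, rubricator, verifier, optimizer, group_size=8):
    # 1. One teacher response and one student draft for the same prompt;
    #    the teacher is treated as a black box, so only generated text is used.
    teacher_response = teacher.generate(prompt)
    student_draft = student.generate(prompt)

    # 2. Induce a prompt-specific rubric from the teacher-student contrast:
    #    binary, weighted criteria the teacher tends to satisfy and the student misses.
    rubric = rubricator.induce_rubric(prompt, teacher_response, student_draft)

    # 3. Score fresh on-policy rollouts against the rubric; no teacher logits needed.
    rollouts = [student.generate(prompt) for _ in range(group_size)]
    rewards = [verifier.score(rollout, rubric) for rollout in rollouts]

    # 4. Use the rubric rewards as the signal for an on-policy update
    #    (e.g., a GRPO/PPO-style policy-gradient step).
    optimizer.step_on_rewards(prompt, rollouts, rewards)
    return rewards
```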

If this is right

  • Distillation becomes possible from closed-source or proprietary teacher models without internal access.
  • Training requires significantly fewer samples, up to a factor of ten improvement in efficiency.
  • Provides a unified approach applicable to both open-source and proprietary LLMs.
  • Offers a straightforward baseline method that is simple to implement compared to logit-dependent techniques.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Rubric-based feedback might generalize to other reinforcement learning from AI feedback setups where direct logit access is unavailable.
  • Future work could explore automating rubric generation further or adapting rubrics across related prompts to reduce overhead.
  • Testing on a broader set of tasks like reasoning or coding could reveal whether the efficiency gains hold beyond the reported scenarios.

Load-bearing premise

The assumption that contrasts between teacher and student responses yield rubrics that supply feedback as useful or better than the teacher's raw probability distributions for guiding student improvements.
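
To see what this premise trades away, the sketch below contrasts the two feedback signals: a per-token reverse-KL term computed from teacher logits (the white-box OPD signal) versus a scalar rubric score computed from response text alone. The reverse-KL choice, tensor shapes, and function names are assumptions for illustration, not definitions taken from the paper.

```python
import torch
import torch.nn.functional as F

def logit_feedback(teacher_logits: torch.Tensor, student_logits: torch.Tensor) -> torch.Tensor:
    """White-box OPD signal: per-token reverse KL(student || teacher).
    Requires the teacher's logits. Shapes: [seq_len, vocab_size]."""
    log_t = F.log_softmax(teacher_logits, dim=-1)
    log_s = F.log_softmax(student_logits, dim=-1)
    return (log_s.exp() * (log_s - log_t)).sum(dim=-1)  # one value per token

def rubric_feedback(response_text: str, rubric) -> float:
    """Black-box signal: weighted fraction of binary rubric criteria satisfied.
    `rubric` is a list of (check_fn, weight) pairs with check_fn: str -> bool."""
    total = sum(weight for _, weight in rubric)
    earned = sum(weight for check, weight in rubric if check(response_text))
    return earned / total if total else 0.0

# Toy comparison: the logit signal is dense (per token) but needs white-box
# access; the rubric signal is a single scalar computable from text alone.
dense = logit_feedback(torch.randn(6, 32), torch.randn(6, 32))
scalar = rubric_feedback("The final answer is \\boxed{42}.",
                         [(lambda r: "\\boxed{" in r, 2.0),
                          (lambda r: "answer" in r.lower(), 1.0)])
```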

What would settle it

A head-to-head comparison of ROPD against logit-based OPD on the same benchmarks would settle it: if ROPD matches or exceeds performance only when given as many or more samples, or underperforms in most scenarios, the rubric approach does not deliver the claimed advantages.

Figures

Figures reproduced from arXiv: 2605.07396 by Dan Zhang, Gengsheng Li, Haiyun Guo, Houcheng Jiang, Junfeng Fang, Mao Zheng, Mingyang Song, Tat-Seng Chua, Xiang Wang, Zhepei Hong.

Figure 1. ROPD efficiency and reasoning performance. (a) Training dynamics averaged over four math benchmarks (i.e., AIME 24/25 [1, 2] and HMMT 25 Feb./Nov. [3]). ROPD achieves a 10× sample efficiency boost. (b) Comparative results. For fair comparison, all models are trained on DAPO-Math-17K [4] using Qwen3-4B [5] (student) and Qwen3-30B-A3B [5] (teacher). See Section 3.1 for comprehensive experimental settings.

Figure 2. The ROPD Pipeline. A Rubricator induces prompt-specific rubrics by contrasting teacher and student rollouts, which a Verifier then utilizes to provide rewards for on-policy optimization.

Figure 3. ROPD efficiency advantage over LOPD (Qwen3-30B-A3B teacher and Qwen3-4B student, …).

Figure 4. Reward signal alignment with correctness (AIME24). (a) Correctness-alignment AUC for …

Figure 5. Evolution of stylistic mimicry and comparative performance. (a) Mimicry Trajectories: …
read the original abstract

On-policy distillation (OPD) is a powerful paradigm for model alignment, yet its reliance on teacher logits restricts its application to white-box scenarios. We contend that structured semantic rubrics can serve as a scalable alternative to teacher logits, enabling OPD using only teacher-generated responses. To prove it, we introduce ROPD, a simple yet foundational framework for rubric-based OPD. Specifically, ROPD induces prompt-specific rubrics from teacher-student contrasts, and then utilizes these rubrics to score the student rollouts for on-policy optimization. Empirically, ROPD outperforms the advanced logit-based OPD methods across most scenarios, and achieving up to a 10x gain in sample efficiency. These results position rubric-based OPD as a flexible, black-box-compatible alternative to the prevailing logit-based OPD, offering a simple yet strong baseline for scalable distillation across proprietary and open-source LLMs. Code is available at https://github.com/Peregrine123/ROPD_official.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ROPD, a framework for rubric-based on-policy distillation (OPD) that induces prompt-specific structured semantic rubrics from teacher-student response contrasts and uses them to score student rollouts for on-policy optimization. This enables OPD without access to teacher logits, positioning it as a black-box alternative. The central empirical claim is that ROPD outperforms advanced logit-based OPD methods across most scenarios while achieving up to 10x gains in sample efficiency.

Significance. If the results hold under rigorous validation, ROPD could broaden on-policy distillation to proprietary LLMs by replacing logit access with rubric-based rewards, offering a scalable baseline for alignment. The availability of code at the provided GitHub link is a positive step toward reproducibility, though the absence of detailed experimental protocols in the abstract limits immediate assessment of the claimed efficiency gains.

major comments (3)
  1. [Abstract] The central claims of outperformance over logit-based OPD and up to 10x sample efficiency gains lack any supporting details on datasets, evaluation metrics, baselines, number of runs, or statistical significance, rendering it impossible to evaluate whether the results support the claims or if the rubric scoring preserves sufficient ranking information relative to logits.
  2. [Abstract] The method description states that rubrics are induced from teacher-student contrasts and then used to score rollouts, but provides no formal definition of the rubric construction process, scoring function, or how it approximates the teacher policy's preference ordering; this is load-bearing because the 10x efficiency claim requires the rubric rewards to be at least as informative as full logit distributions for stable policy gradients.
  3. [Abstract] No ablation studies or analysis of rubric quality across training stages are mentioned, which is critical given that early-training contrasts may be dominated by gross errors while later stages risk overfitting to student-specific mistakes rather than general teacher preferences.
minor comments (2)
  1. [Abstract] The abstract refers to 'structured semantic rubrics' without clarifying their format (e.g., bullet points, criteria lists) or how they are elicited from the teacher.
  2. [Abstract] The GitHub link is provided but the manuscript does not indicate whether it includes the exact rubric induction code, experimental configurations, or seeds needed for reproduction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and have revised the abstract to provide greater clarity on our claims, method, and supporting analyses while respecting length constraints.

read point-by-point responses
  1. Referee: [Abstract] The central claims of outperformance over logit-based OPD and up to 10x sample efficiency gains lack any supporting details on datasets, evaluation metrics, baselines, number of runs, or statistical significance, rendering it impossible to evaluate whether the results support the claims or if the rubric scoring preserves sufficient ranking information relative to logits.

    Authors: The abstract serves as a high-level summary; full details on datasets, metrics, baselines, multiple runs with statistical significance testing, and the comparative informativeness of rubric rewards versus logits appear in the Experiments section. We have revised the abstract to include brief references to the primary evaluation settings and the consistent gains observed across runs. revision: partial

  2. Referee: [Abstract] The method description states that rubrics are induced from teacher-student contrasts and then used to score rollouts, but provides no formal definition of the rubric construction process, scoring function, or how it approximates the teacher policy's preference ordering; this is load-bearing because the 10x efficiency claim requires the rubric rewards to be at least as informative as full logit distributions for stable policy gradients.

    Authors: A complete formalization of rubric induction from contrasts, the scoring function, and its approximation of teacher preferences is given in Section 3. We have added a concise formal statement to the abstract clarifying how rubric-based rewards enable stable on-policy gradients without logit access. revision: yes

  3. Referee: [Abstract] No ablation studies or analysis of rubric quality across training stages are mentioned, which is critical given that early-training contrasts may be dominated by gross errors while later stages risk overfitting to student-specific mistakes rather than general teacher preferences.

    Authors: Ablation studies and rubric-quality analyses across training stages, including checks against early-stage noise and later-stage overfitting, are reported in the Experiments section and appendix. We have added a short statement to the abstract noting the observed robustness of rubric rewards throughout training. revision: partial

Circularity Check

0 steps flagged

No significant circularity; framework and claims are empirically grounded without self-referential reduction

full rationale

The paper proposes ROPD as a new framework that induces prompt-specific rubrics from teacher-student response contrasts and applies them to score student rollouts for on-policy optimization. This is presented as an alternative to logit-based methods, with central claims (outperformance and up to 10x sample efficiency) resting on empirical results across scenarios rather than any derivation that reduces to fitted inputs, self-definitions, or self-citation chains. No equations or steps in the abstract or description equate predictions to inputs by construction, and no load-bearing self-citations or ansatzes are invoked. The approach is self-contained, with code released for independent verification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

With only the abstract available, the ledger is minimal; the paper introduces a new framework but the underlying assumptions about rubric quality are not detailed.

invented entities (1)
  • structured semantic rubrics (no independent evidence)
    purpose: alternative to teacher logits for scoring student responses
    The rubrics are a key innovation but their construction details and validation are not provided in the abstract.

pith-pipeline@v0.9.0 · 5491 in / 1172 out tokens · 56921 ms · 2026-05-11T02:07:44.292949+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Sparse RL on capable teachers followed by dense distillation to students beats direct GRPO on students for verifiable math reasoning.

  2. Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

    cs.LG 2026-05 unverdicted novelty 5.0

    Sparse RL on a strong teacher followed by dense distillation to the student outperforms direct GRPO on the student for math tasks, with a forward-KL + OPD bridge enabling further gains.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 1 Pith paper · 18 internal anchors

  1. [1]

    Aime 2024: American invitational mathematics examination, 2024

    MAA. Aime 2024: American invitational mathematics examination, 2024

  2. [2]

    Aime 2025: American invitational mathematics examination, 2025

    MAA. Aime 2025: American invitational mathematics examination, 2025

  3. [3]

    Hmmt 2025: Harvard-mit mathematics tournament, 2025

    HMMT. Hmmt 2025: Harvard-mit mathematics tournament, 2025

  4. [4]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  5. [5]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  6. [6]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In International Conference on Learning Representations, 2024

  7. [7]

    On-policy distillation

    Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025

  8. [8]

    Minillm: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. In International Conference on Learning Representations, 2024

  9. [9]

    MiMo-V2-Flash Technical Report

    Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026

  10. [10]

    Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

    DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

  11. [11]

    Gpqa: A graduate-level google-proof q&a benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. arXiv preprint arXiv:2311.12022, 2023

  12. [12]

    HealthBench: Evaluating Large Language Models Towards Improved Human Health

    Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, and Joaquin Quiñonero-Candela. Healthbench: Evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775, 2025

  13. [13]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911, 2023

  14. [14]

    Gemma 3 Technical Report

    Gemma Team, Google DeepMind. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

  15. [15]

    Introducing gpt-5.2

    OpenAI. Introducing gpt-5.2. https://openai.com/index/introducing-gpt-5-2/

  16. [16]

    Accessed: 2026-05-06

  17. [17]

    TIP: Token Importance in On-Policy Distillation

    Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, and Alborz Geramifard. Tip: Token importance in on-policy distillation.arXiv preprint arXiv:2604.14084, 2026

  18. [18]

    Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016, 2026

  19. [19]

    A Survey of On-Policy Distillation for Large Language Models

    Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626, 2026

  20. [20]

    Deepseekmath: Pushing the limits of mathematical reasoning in open language models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  21. [21]

    Ovd: On-policy verbal distillation

    Jing Xiong, Hui Shen, Shansan Gong, Yuxin Cheng, Jianghan Shen, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, and Ngai Wong. Ovd: On-policy verbal distillation.arXiv preprint arXiv:2601.21968, 2026

  22. [22]

    Black-box on-policy distillation of large language models

    Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, and Furu Wei. Black-box on-policy distillation of large language models.arXiv preprint arXiv:2511.10643, 2026

  23. [23]

    Learning beyond teacher: Generalized on-policy distillation with reward extrapolation

    Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation. CoRR, abs/2602.12125, 2026

  24. [24]

    Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025

  25. [25]

    Entropy-aware on-policy distillation of language models

    Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models. arXiv preprint arXiv:2603.07079, 2026

  26. [26]

    Fast and effective on-policy distillation from reasoning prefixes

    Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D Lyng, Sanjit Singh Batra, and Robert E Tillman. Fast and effective on-policy distillation from reasoning prefixes.arXiv preprint arXiv:2602.15260, 2026

  27. [27]

    Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    Yecheng Wu, Song Han, and Hai Cai. Lightning opd: Efficient post-training for large reasoning models with offline on-policy distillation.arXiv preprint arXiv:2604.13010, 2026

  28. [28]

    PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence

    Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang. Paced: Distillation and on-policy self-distillation at the frontier of student competence. arXiv preprint arXiv:2603.11178, 2026

  29. [29]

    SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

    Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, and Xunliang Cai. Scope: Signal-calibrated on-policy distillation enhancement with dual-path adaptive weighting.arXiv preprint arXiv:2604.10688, 2026

  30. [30]

    A dual-space framework for general knowledge distillation of large language models

    Xue Zhang, Songming Zhang, Yunlong Liang, Fandong Meng, Yufeng Chen, Jinan Xu, and Jie Zhou. A dual-space framework for general knowledge distillation of large language models. arXiv preprint arXiv:2504.11426, 2025

  31. [31]

    Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

    Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes.arXiv preprint arXiv:2603.25562, 2026

  32. [32]

    Orpo-distill: Mixed-policy preference optimization for cross-architecture llm distillation

    Aasheesh Singh, Vishal Vaddina, and Dagnachew Birru. Orpo-distill: Mixed-policy preference optimization for cross-architecture llm distillation.arXiv preprint arXiv:2509.25100, 2025

  33. [33]

    Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge

    Yiyang Shen, Lifu Tu, and Weiran Wang. Reinforcement learning-based knowledge distillation with llm-as-a-judge.arXiv preprint arXiv:2604.02621, 2026

  34. [34]

    Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment

    Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, and Haoyu Wang. Openrubrics: Towards scalable synthetic rubric generation for reward modeling and llm alignment.arXiv preprint arXiv:2510.07743, 2025

  35. [35]

    Reinforcement learning with rubric anchors

    Zenan Huang, Yihong Zhuang, Guoshan Lu, Zeyu Qin, Haokai Xu, Tianyu Zhao, Ru Peng, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, Xijun Gu, Peiyi Tu, Jiaxin Liu, Wenyu Chen, Yuzhuo Fu, Zhiting Fan, Yanmei Gu, Yuanyuan Wang, Zhengkai Yang, Jianguo Li, and Junbo Zhao. Reinforcement learning with rubric anchors. arXiv preprint arXiv:2508.12790, 2025

  36. [36]

    Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, and Pang Wei Koh. Dr tulu: Reinforcement learning with ev...

  37. [37]

    Sibylsense: Adaptive rubric learning via memory tuning and adversarial probing

    Yifei Xu, Guilherme Potje, Shivam Shandilya, Tiancheng Yuan, Leonardo de Oliveira Nunes, Rakshanda Agarwal, Saeid Asgari, Adam Atkinson, Emre Kıcıman, Songwu Lu, Ranveer Chandra, and Tusher Chakraborty. Sibylsense: Adaptive rubric learning via memory tuning and adversarial probing.arXiv preprint arXiv:2602.20751, 2026

  38. [38]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  39. [39]

    Sequence-level knowledge distillation

    Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, 2016

Rubric design guidance (from the paper's rubric-induction prompt, extracted alongside the references rather than as citations):

  • Criterion properties: Specific and Measurable (clearly defines a concrete answer-quality merit); Binary Evaluable (a verifier can mark it True or False for a single response); Instructionally Useful (points to a meaningful improvement direction for the student); Alternative-Method Safe (a different valid approach exhibiting the same merit is still rewarded); Distinguishing (prefer merits that teachers consistently show and students systematically lack); Black-Box Compatible (prefer criteria that evaluate observable answer behavior and response quality).
  • Category taxonomy (each criterion is assigned to exactly one category): Task Completion (the response completes the task and presents the required final answer in the correct form, including identifying the target quantity and meeting format requirements); Observable Quality (correct intermediate steps, valid factorization or algebraic manipulation, identification of key constraints such as parity obstructions, and no hallucinated claims or guessed answers); General Reasoning (logical coherence, step-by-step derivation flow, planning structure, self-checking behavior, clarity, and focus, used only when genuinely relevant and improving teacher-student separation).
  • Priorities and design rules: preserve the general validity of the rubric for the question; prioritize Task Completion by default, with at least one high-weight criterion verifying that the response answers the requested target in the required form; prioritize Observable Quality criteria that directly check correctness of intermediate steps, mathematical manipulations, and domain-specific reasoning; use General Reasoning only when it genuinely improves teacher-student separation, avoiding rewards for superficial stylistic performance; make the rubric produce actionable learning-direction signals, with most of the total points coming from criteria that most teacher responses satisfy but most student responses do not.
  • Example induced criterion fragment: "Criterion 7: Uses proof by induction – 2/4 teachers support, 2/4 use direct computation."
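
As a rough illustration of how rubrics obeying these design rules could be represented and scored, the sketch below defines a weighted, binary-evaluable criterion record with the three-category taxonomy and a simple weighted scorer. The field names, weights, and toy checks are assumptions made for illustration and are not taken from the released ROPD code.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Criterion:
    description: str              # specific, measurable answer-quality merit
    category: str                 # "task_completion" | "observable_quality" | "general_reasoning"
    weight: float                 # most points on teacher-yes / student-no merits
    check: Callable[[str], bool]  # binary evaluable on a single response

def score(response: str, rubric: List[Criterion]) -> float:
    """Weighted fraction of satisfied criteria, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total if total else 0.0

# Illustrative rubric for a math prompt; the checks are deliberately simplistic.
example_rubric = [
    Criterion("States the final answer explicitly in \\boxed{}", "task_completion", 3.0,
              lambda r: "\\boxed{" in r),
    Criterion("Identifies the parity constraint before counting", "observable_quality", 2.0,
              lambda r: "parity" in r.lower()),
    Criterion("Verifies the result against a small case", "general_reasoning", 1.0,
              lambda r: "verify" in r.lower() or "check" in r.lower()),
]
```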