SocraticPO: Policy Optimization via Interactive Guidance

Enhong Chen; Jiayu Liu; Jie Ouyang; Jing Sha; Qi Liu; Qingchuan Li; Shijin Wang; Tingyue Pan; Xianquan Wang; Zhenya Huang

arxiv: 2606.09887 · v1 · pith:HVIUSVTDnew · submitted 2026-06-03 · 💻 cs.LG · cs.AI· cs.CL

SocraticPO: Policy Optimization via Interactive Guidance

Zirui Liu , Jie Ouyang , Qi Liu , Xianquan Wang , Jiayu Liu , Tingyue Pan , Qingchuan Li , Jing Sha

show 3 more authors

Zhenya Huang Shijin Wang Enhong Chen

This is my paper

Pith reviewed 2026-06-28 07:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords SocraticPOpolicy optimizationreinforcement learninglarge language modelsreasoningreward decayteacher guidanceSciKnowEval

0 comments

The pith

SocraticPO improves LLM reasoning policies by inserting teacher guidance into RL rollouts paired with reward decay for assisted answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SocraticPO to fix the problem that scalar outcome rewards in RL for language models give little information on how to fix mistakes, often leading to shortcut learning. It lets the student model generate an answer first, then receives concise natural-language diagnosis and guidance from a teacher only if wrong, after which the student continues in the expanded context. Correct final answers reached after this help receive decayed rewards so the policy does not learn to treat teacher input as a free reward path. The method leaves the standard expected-reward objective untouched, so it works with any existing policy-gradient optimizer and with black-box teachers that supply only text. Experiments on undergraduate scientific reasoning benchmarks show gains over strong RL and self-distillation baselines, and ablations confirm both the guidance and the decay are required.

Core claim

SocraticPO modifies only the data-collection phase of reinforcement learning by letting a student answer independently, receiving targeted natural-language corrective guidance from a teacher on errors, and then continuing under the augmented context, while applying reward decay to any correct answers obtained after intervention. Because the expected-reward objective remains standard, the approach integrates directly with backends such as Reinforce++ and requires no access to teacher logits or distribution matching. On SciKnowEval undergraduate scientific reasoning tasks this yields higher performance than strong RL and self-distillation baselines, with ablations establishing that both the gu

What carries the argument

Socratic-style natural-language guidance inserted after incorrect student attempts, paired with reward decay on assisted successes, inside an otherwise standard rollout for policy optimization.

If this is right

Any existing policy-gradient method can adopt SocraticPO by changing only the rollout generation step while keeping the same loss.
Stronger black-box teacher models can be used because the framework requires only text output, not logits or probability distributions.
Reward decay specifically counters the risk that the policy learns to treat teacher intervention as a reliable high-reward route.
Ablation results indicate that guidance without decay or decay without targeted guidance each fails to deliver the observed gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rollout modification could be tested on non-science domains such as mathematical proof or code generation to check whether the benefit generalizes.
Because the teacher need only supply text, the approach opens a route to scaling interactive RL without requiring white-box access to the teacher model.
One could examine whether the learned policy begins to generate self-corrections that mimic the teacher style even when no external guidance is present.
The method might combine usefully with existing chain-of-thought or self-consistency techniques that already operate inside a single rollout.

Load-bearing premise

That concise natural-language teacher guidance combined with lower rewards for answers reached after help will produce stronger independent reasoning rather than encouraging reliance on external input or shifting the rollout distribution in unhelpful ways.

What would settle it

Evaluating the final SocraticPO policy in a setting with no teacher guidance available and finding performance no higher than the RL baselines, or finding that removal of the decay term causes the model to generate more requests for help, would falsify the central claim.

read the original abstract

Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should revise its mistaken reasoning, which can encourage shortcut learning and brittle policies. We propose \textbf{SocraticPO} (Socratic Policy Optimization), a policy-optimization framework that augments RL rollouts with Socratic-style natural-language guidance. During rollout, the student first answers independently; if the answer is incorrect, a teacher diagnoses the attempt and provides concise corrective guidance, after which the student continues under the expanded context. Crucially, this guidance is paired with reward decay: correct answers obtained after teacher intervention only receive decayed rewards, preventing the policy from treating teacher help as a free path to reward. Since SocraticPO only modifies the rollout process while leaving the standard expected-reward objective intact, it can be plugged into existing policy-gradient backends such as Reinforce++. Moreover, because the teacher provides only text-level guidance, SocraticPO can leverage stronger black-box teacher models without requiring access to logits or distribution matching. On undergraduate-level scientific reasoning benchmarks from SciKnowEval, SocraticPO improves over strong RL and self-distillation baselines. Ablations show that both targeted guidance and reward decay are necessary, with reward decay mitigating reliance on assisted correction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SocraticPO adds teacher guidance to RL rollouts with reward decay to push independent reasoning, but the reliance risk the stress-test flags needs the actual numbers and decay details to judge.

read the letter

The main idea in this paper is to inject natural-language corrective guidance into the rollout process for policy optimization in language models. The student tries first, gets help if wrong, but then gets a lower reward if it succeeds only after the help. This keeps the usual RL objective but changes how the trajectories are generated.

It does a few things cleanly. The method slots into existing backends like Reinforce++ without altering the loss. It also works with black-box teachers, which is useful when you have access to a stronger model but not its internals. The claim is that this leads to better policies on scientific reasoning tasks from SciKnowEval compared to plain RL or self-distillation.

The ablations are said to show that both the guidance and the decay are needed, and that decay reduces reliance on assistance.

The potential issue is whether the decay is strong enough to prevent the policy from learning to seek or rely on the guidance. The stress-test note points out that if the model can arrange to get help by making initial mistakes, the effective training signal might still favor assisted paths. The abstract mentions that correct answers after intervention get decayed rewards, but without the exact implementation details like the decay rate or how it affects the gradient, it's difficult to assess how well this holds up.

The paper appears to treat the standard objective as unchanged, which is a strength if the experiments confirm no distribution shift problems.

This work is aimed at people doing RL fine-tuning for reasoning capabilities in LLMs. Someone looking for ways to add more informative signals during training without major changes to the setup would find it relevant.

I would send it for peer review so the results and any implementation choices can be checked thoroughly.

Referee Report

2 major / 1 minor

Summary. The paper proposes SocraticPO, a policy-optimization framework for LLMs that augments standard RL rollouts with Socratic natural-language guidance from a teacher model when the student's initial answer is incorrect. Guidance is followed by continued generation under the expanded context, but correct answers reached only after intervention receive decayed rewards. The method leaves the expected-reward objective unchanged and is compatible with existing policy-gradient algorithms such as Reinforce++. It reports improvements over strong RL and self-distillation baselines on undergraduate scientific reasoning tasks from SciKnowEval, with ablations indicating that both guidance and reward decay are required.

Significance. If the empirical claims hold under full implementation details, SocraticPO would offer a practical route to leverage stronger black-box teacher models for reasoning improvement without requiring logit access or explicit distribution matching. The preservation of the standard RL objective while modifying only the data-generation process is a clean design choice that could generalize to other policy-gradient backends.

major comments (2)

[Abstract] Abstract: The central claim that reward decay 'prevent[s] the policy from treating teacher help as a free path to reward' and that ablations confirm both components are necessary rests on an unspecified decay implementation (multiplicative factor, whether applied only to the terminal reward or the full trajectory, and how the policy gradient is computed on guidance-augmented contexts). Without these details the ablation results cannot be interpreted as evidence against learned reliance or rollout-distribution bias.
[Abstract] Abstract: The statement that 'SocraticPO only modifies the rollout process while leaving the standard expected-reward objective intact' requires explicit confirmation that the policy-gradient estimator remains unbiased when the context distribution is altered by teacher interventions; the current description does not address whether the guidance tokens are treated as part of the state or masked during gradient computation.

minor comments (1)

[Abstract] The abstract mentions 'SciKnowEval' benchmarks but provides no dataset sizes, number of runs, or statistical significance tests for the reported improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater implementation clarity in the abstract. We address both major comments below and will revise the manuscript accordingly to include the requested details.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that reward decay 'prevent[s] the policy from treating teacher help as a free path to reward' and that ablations confirm both components are necessary rests on an unspecified decay implementation (multiplicative factor, whether applied only to the terminal reward or the full trajectory, and how the policy gradient is computed on guidance-augmented contexts). Without these details the ablation results cannot be interpreted as evidence against learned reliance or rollout-distribution bias.

Authors: We agree the abstract omits these specifics. The full manuscript (Section 3.2) defines reward decay as a multiplicative factor of 0.5 applied exclusively to the terminal reward on guided trajectories; the full trajectory reward is not decayed. The policy gradient (REINFORCE++) is computed over the complete sequence, treating guidance tokens as part of the state. We will update the abstract and add an explicit paragraph in Methods to make these choices transparent so that the ablation results can be interpreted correctly. revision: yes
Referee: [Abstract] Abstract: The statement that 'SocraticPO only modifies the rollout process while leaving the standard expected-reward objective intact' requires explicit confirmation that the policy-gradient estimator remains unbiased when the context distribution is altered by teacher interventions; the current description does not address whether the guidance tokens are treated as part of the state or masked during gradient computation.

Authors: The estimator remains unbiased for the following reason: we apply the standard policy-gradient estimator directly to the full generated trajectories under the modified rollout distribution. Guidance tokens are included as part of the state (i.e., they appear in the context for subsequent token prediction and are never masked from the gradient). Because the objective is the expected (decayed) reward under this augmented sampling process, the estimator is unbiased with respect to the policy that generates those trajectories. We will add a short clarifying paragraph in the Methods section confirming this treatment. revision: yes

Circularity Check

0 steps flagged

No circularity: standard expected-reward objective preserved with empirical support only

full rationale

The paper describes a rollout modification (Socratic guidance plus reward decay) while explicitly stating that the objective remains the unmodified expected-reward objective used by standard policy-gradient methods such as Reinforce++. No equations, derivations, or fitted parameters are introduced that would make any claimed improvement equivalent to the inputs by construction. Ablations and benchmark results are presented as external evidence rather than self-referential definitions. This is the most common honest non-finding for an applied RL method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method depends on the teacher providing accurate, concise diagnoses and on reward decay successfully discouraging reliance on assistance; these are not independently evidenced in the abstract.

axioms (2)

domain assumption A teacher model can reliably diagnose student errors and supply useful corrective natural-language guidance
Invoked as the core mechanism that augments the rollout when the initial answer is incorrect.
domain assumption Reward decay applied to post-guidance correct answers prevents the policy from treating teacher help as a free reward path
Stated as crucial to avoid shortcut learning.

pith-pipeline@v0.9.1-grok · 5800 in / 1255 out tokens · 24275 ms · 2026-06-28T07:14:32.589224+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 18 linked inside Pith

[1]

Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023
[2]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InInternational Conference on Learning Representations, volume 2024, pages 21246–21263, 2024

2024
[3]

Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Pith/arXiv arXiv 2023
[4]

Constitutional ai: Harmlessness from ai feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

Pith/arXiv arXiv 2022
[5]

Language models are few-shot learners

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advancesin Neural Information Processing Systems, volume 33, pages 1877–1901, 2020

1901
[6]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. InAdvancesin Neural Information Processing Systems, volume 30, 2017

2017
[7]

Hdpo: Hybrid distillation policy optimization via privileged self-distillation

Ken Ding. Hdpo: Hybrid distillation policy optimization via privileged self-distillation. arXiv preprint arXiv:2603.23871, 2026

arXiv 2026
[8]

Sciknoweval: Evaluating multi-level scientific knowledge of large language models

Kehua Feng, Keyan Ding, Weijie Wang, Xiang Zhuang, Zeyuan Wang, Ming Qin, Yu Zhao, Jianhua Yao, Qiang Zhang, and Huajun Chen. Sciknoweval: Evaluating multi-level scientific knowledge of large language models. arXiv preprint arXiv:2406.09098, 2024

arXiv 2024
[9]

Revisiting on-policy distillation: Empirical failure modes and simple fixes.arXiv preprint arXiv:2603.25562, 2026

Yuqian Fu, Haohuan Huang, and Kaiwen Jiang. Revisiting on-policy distillation: Empirical failure modes and simple fixes.arXiv preprint arXiv:2603.25562, 2026

Pith/arXiv arXiv 2026
[10]

Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015

Pith/arXiv arXiv 2015
[11]

Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. InFindings of the Association for Computational Linguistics: ACL 2023, pages 8003–8017, 2023. 11

2023
[12]

Reinforce++: A simple and efficient approach for aligning large language models.arXiv preprint arXiv:2501.03262, 2025

Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models.arXiv preprint arXiv:2501.03262, 2025

Pith/arXiv arXiv 2025
[13]

How to fine-tune a reasoning model? a teacher-student cooperation framework to synthesize student-consistent sft data

Zixian Huang, Kaichen Yang, Xu Huang, Feiyang Hao, Qiming Ge, Bowen Li, He Du, Kai Chen, and Qipeng Guo. How to fine-tune a reasoning model? a teacher-student cooperation framework to synthesize student-consistent sft data. arXiv preprint arXiv:2604.14164, 2026

Pith/arXiv arXiv 2026
[14]

Reinforcement learning via self-distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026

Pith/arXiv arXiv 2026
[15]

Unifying group-relative and self-distillation policy optimization via sample routing.arXiv preprint arXiv:2604.02288, 2026

Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. Unifying group-relative and self-distillation policy optimization via sample routing.arXiv preprint arXiv:2604.02288, 2026

arXiv 2026
[16]

Socraticlm: Exploring socratic personalized teaching with large language models.Advancesin Neural Information Processing Systems, 37:85693–85721, 2024

Jiayu Liu, Zhenya Huang, Tong Xiao, Jing Sha, Jinze Wu, Qi Liu, Shijin Wang, and Enhong Chen. Socraticlm: Exploring socratic personalized teaching with large language models.Advancesin Neural Information Processing Systems, 37:85693–85721, 2024

2024
[17]

Orca: Progressive learning from complex explanation traces of gpt-4.arXiv preprint arXiv:2306.02707, 2023

Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4.arXiv preprint arXiv:2306.02707, 2023

Pith/arXiv arXiv 2023
[18]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advancesin Neural Information Processing Systems, 35:27730–27744, 2022

2022
[19]

Proximal policy optimization algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

Pith/arXiv arXiv 2017
[20]

Plato’sMeno

Dominic Scott. Plato’sMeno. Cambridge University Press, 2006

2006
[21]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024
[22]

A survey of on-policy distillation for large language models.arXiv preprint arXiv:2604.00626, 2026

Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models.arXiv preprint arXiv:2604.00626, 2026

Pith/arXiv arXiv 2026
[23]

Learning to summarize with human feedback

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. InAdvancesin Neural Information Processing Systems, volume 33, pages 3008–3021, 2020

2020
[24]

Policy gradient methods for reinforce- ment learning with function approximation

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforce- ment learning with function approximation. InAdvances in neural information processing systems, volume 12, 1999

1999
[25]

Alpaca: A strong, replicable instruction-following model.Stanford Center for Research on Foundation Models, 2023

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model.Stanford Center for Research on Foundation Models, 2023

2023
[26]

Llama: Open and efficient foundation language models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023
[27]

Solving math word problems with process-and outcome-based feedback.arXiv preprint arXiv:2211.14275, 2022

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback.arXiv preprint arXiv:2211.14275, 2022

Pith/arXiv arXiv 2022
[28]

The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems

Kurt VanLehn. The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational psychologist, 46(4):197–221, 2011

2011
[29]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, volume 30, 2017. 12

2017
[30]

Steppo: Step-aligned policy optimization for agentic reinforcement learning.arXiv preprint arXiv:2604.18401, 2026

Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, and Enhong Chen. Steppo: Step-aligned policy optimization for agentic reinforcement learning.arXiv preprint arXiv:2604.18401, 2026

Pith/arXiv arXiv 2026
[31]

Self-instruct: Aligning language models with self-generated instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 13484–13508, 2023

2023
[32]

Simple statistical gradient-following algorithms for connectionist reinforcement learning

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992

1992
[33]

The role of tutoring in problem solving.Journal of childpsychology and psychiatry, 17(2):89–100, 1976

David Wood, Jerome S Bruner, and Gail Ross. The role of tutoring in problem solving.Journal of childpsychology and psychiatry, 17(2):89–100, 1976

1976
[34]

Wizardlm: Empowering large language models to follow complex instructions.arXiv preprint arXiv:2304.12244, 2023

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions.arXiv preprint arXiv:2304.12244, 2023

Pith/arXiv arXiv 2023
[35]

On-policy context distillation for language models

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275, 2026

Pith/arXiv arXiv 2026
[36]

Material C also has a slightly higher density (6.111) than Material D,

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences.arXiv preprint arXiv:1909.08593, 2019. 13 A Proof of Theorem 1 Proof.We prove the three properties in Theorem 1. Letmk =|A k|and pk = 1 mk X i∈Ak δ(k) i .(21) LetN k−1 =Pk−1 j=1 |Aj...

Pith/arXiv arXiv 1909

[1] [1]

Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023

[2] [2]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InInternational Conference on Learning Representations, volume 2024, pages 21246–21263, 2024

2024

[3] [3]

Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

Pith/arXiv arXiv 2023

[4] [4]

Constitutional ai: Harmlessness from ai feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

Pith/arXiv arXiv 2022

[5] [5]

Language models are few-shot learners

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advancesin Neural Information Processing Systems, volume 33, pages 1877–1901, 2020

1901

[6] [6]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. InAdvancesin Neural Information Processing Systems, volume 30, 2017

2017

[7] [7]

Hdpo: Hybrid distillation policy optimization via privileged self-distillation

Ken Ding. Hdpo: Hybrid distillation policy optimization via privileged self-distillation. arXiv preprint arXiv:2603.23871, 2026

arXiv 2026

[8] [8]

Sciknoweval: Evaluating multi-level scientific knowledge of large language models

Kehua Feng, Keyan Ding, Weijie Wang, Xiang Zhuang, Zeyuan Wang, Ming Qin, Yu Zhao, Jianhua Yao, Qiang Zhang, and Huajun Chen. Sciknoweval: Evaluating multi-level scientific knowledge of large language models. arXiv preprint arXiv:2406.09098, 2024

arXiv 2024

[9] [9]

Revisiting on-policy distillation: Empirical failure modes and simple fixes.arXiv preprint arXiv:2603.25562, 2026

Yuqian Fu, Haohuan Huang, and Kaiwen Jiang. Revisiting on-policy distillation: Empirical failure modes and simple fixes.arXiv preprint arXiv:2603.25562, 2026

Pith/arXiv arXiv 2026

[10] [10]

Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015

Pith/arXiv arXiv 2015

[11] [11]

Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. InFindings of the Association for Computational Linguistics: ACL 2023, pages 8003–8017, 2023. 11

2023

[12] [12]

Reinforce++: A simple and efficient approach for aligning large language models.arXiv preprint arXiv:2501.03262, 2025

Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models.arXiv preprint arXiv:2501.03262, 2025

Pith/arXiv arXiv 2025

[13] [13]

How to fine-tune a reasoning model? a teacher-student cooperation framework to synthesize student-consistent sft data

Zixian Huang, Kaichen Yang, Xu Huang, Feiyang Hao, Qiming Ge, Bowen Li, He Du, Kai Chen, and Qipeng Guo. How to fine-tune a reasoning model? a teacher-student cooperation framework to synthesize student-consistent sft data. arXiv preprint arXiv:2604.14164, 2026

Pith/arXiv arXiv 2026

[14] [14]

Reinforcement learning via self-distillation

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026

Pith/arXiv arXiv 2026

[15] [15]

Unifying group-relative and self-distillation policy optimization via sample routing.arXiv preprint arXiv:2604.02288, 2026

Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. Unifying group-relative and self-distillation policy optimization via sample routing.arXiv preprint arXiv:2604.02288, 2026

arXiv 2026

[16] [16]

Socraticlm: Exploring socratic personalized teaching with large language models.Advancesin Neural Information Processing Systems, 37:85693–85721, 2024

Jiayu Liu, Zhenya Huang, Tong Xiao, Jing Sha, Jinze Wu, Qi Liu, Shijin Wang, and Enhong Chen. Socraticlm: Exploring socratic personalized teaching with large language models.Advancesin Neural Information Processing Systems, 37:85693–85721, 2024

2024

[17] [17]

Orca: Progressive learning from complex explanation traces of gpt-4.arXiv preprint arXiv:2306.02707, 2023

Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4.arXiv preprint arXiv:2306.02707, 2023

Pith/arXiv arXiv 2023

[18] [18]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advancesin Neural Information Processing Systems, 35:27730–27744, 2022

2022

[19] [19]

Proximal policy optimization algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

Pith/arXiv arXiv 2017

[20] [20]

Plato’sMeno

Dominic Scott. Plato’sMeno. Cambridge University Press, 2006

2006

[21] [21]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024

[22] [22]

A survey of on-policy distillation for large language models.arXiv preprint arXiv:2604.00626, 2026

Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models.arXiv preprint arXiv:2604.00626, 2026

Pith/arXiv arXiv 2026

[23] [23]

Learning to summarize with human feedback

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. InAdvancesin Neural Information Processing Systems, volume 33, pages 3008–3021, 2020

2020

[24] [24]

Policy gradient methods for reinforce- ment learning with function approximation

Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforce- ment learning with function approximation. InAdvances in neural information processing systems, volume 12, 1999

1999

[25] [25]

Alpaca: A strong, replicable instruction-following model.Stanford Center for Research on Foundation Models, 2023

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model.Stanford Center for Research on Foundation Models, 2023

2023

[26] [26]

Llama: Open and efficient foundation language models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023

[27] [27]

Solving math word problems with process-and outcome-based feedback.arXiv preprint arXiv:2211.14275, 2022

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process-and outcome-based feedback.arXiv preprint arXiv:2211.14275, 2022

Pith/arXiv arXiv 2022

[28] [28]

The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems

Kurt VanLehn. The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational psychologist, 46(4):197–221, 2011

2011

[29] [29]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, volume 30, 2017. 12

2017

[30] [30]

Steppo: Step-aligned policy optimization for agentic reinforcement learning.arXiv preprint arXiv:2604.18401, 2026

Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, and Enhong Chen. Steppo: Step-aligned policy optimization for agentic reinforcement learning.arXiv preprint arXiv:2604.18401, 2026

Pith/arXiv arXiv 2026

[31] [31]

Self-instruct: Aligning language models with self-generated instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 13484–13508, 2023

2023

[32] [32]

Simple statistical gradient-following algorithms for connectionist reinforcement learning

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992

1992

[33] [33]

The role of tutoring in problem solving.Journal of childpsychology and psychiatry, 17(2):89–100, 1976

David Wood, Jerome S Bruner, and Gail Ross. The role of tutoring in problem solving.Journal of childpsychology and psychiatry, 17(2):89–100, 1976

1976

[34] [34]

Wizardlm: Empowering large language models to follow complex instructions.arXiv preprint arXiv:2304.12244, 2023

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions.arXiv preprint arXiv:2304.12244, 2023

Pith/arXiv arXiv 2023

[35] [35]

On-policy context distillation for language models

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. arXiv preprint arXiv:2602.12275, 2026

Pith/arXiv arXiv 2026

[36] [36]

Material C also has a slightly higher density (6.111) than Material D,

Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences.arXiv preprint arXiv:1909.08593, 2019. 13 A Proof of Theorem 1 Proof.We prove the three properties in Theorem 1. Letmk =|A k|and pk = 1 mk X i∈Ak δ(k) i .(21) LetN k−1 =Pk−1 j=1 |Aj...

Pith/arXiv arXiv 1909