Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
Pith reviewed 2026-06-27 01:06 UTC · model grok-4.3
The pith
Reformulating hard questions as candidate prompts lets small students learn from larger teachers while staying on-policy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ZPPO constructs Binary Candidate-included Questions that pair one teacher response with one student error as anonymized options the student must choose between, and Negative Candidate-included Questions that bundle the student's failed rollouts into one prompt; both are stored in a replay buffer and re-presented until the question graduates or is evicted, so the teacher influences learning only through the prompt while policy-gradient updates remain strictly on the student's own rollouts.
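The two prompt types described above can be sketched as plain string builders. This is an illustrative sketch only: the paper's actual templates are not given in the text reviewed here, so the function names `build_bcq` and `build_ncq`, the "Candidate A/B" labels, and all instruction wording are hypothetical.

```python
import random


def build_bcq(question, teacher_response, student_error, rng=None):
    """Binary Candidate-included Question (hypothetical template).

    Pairs one teacher response with one student error and shows them as
    anonymized options in random order.
    """
    rng = rng or random.Random()
    candidates = [teacher_response, student_error]
    rng.shuffle(candidates)  # hide which candidate came from which model
    return (
        f"{question}\n\n"
        f"Candidate A:\n{candidates[0]}\n\n"
        f"Candidate B:\n{candidates[1]}\n\n"
        "One candidate is correct and one is not. "
        "Decide which is correct, then answer the question."
    )


def build_ncq(question, failed_rollouts):
    """Negative Candidate-included Question (hypothetical template).

    Bundles the student's failed rollouts into a single prompt.
    """
    listed = "\n\n".join(
        f"Attempt {i}:\n{text}"
        for i, text in enumerate(failed_rollouts, start=1)
    )
    return (
        f"{question}\n\n"
        f"Previous attempts, all incorrect:\n\n{listed}\n\n"
        "Identify what these attempts get wrong, then answer the question."
    )
```

In both cases the output is only a prompt string. The student would still generate its own rollouts against it, which is the sense in which the teacher stays out of the gradient.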
What carries the argument
BCQ and NCQ reformulated prompts recirculated by a replay buffer that keeps each question active until the student's mean rollout accuracy reaches half.
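A minimal sketch of such a buffer follows, assuming binary per-rollout rewards, a graduation test of mean accuracy >= 0.5, and FIFO eviction by insertion order. The class name `PromptReplayBuffer` and its methods are hypothetical, and details such as whether a re-added question keeps its queue position are guesses rather than the paper's specification.

```python
from collections import OrderedDict


class PromptReplayBuffer:
    """Holds hard questions until they graduate or are FIFO-evicted."""

    def __init__(self, capacity, graduate_at=0.5):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.graduate_at = graduate_at
        # question_id -> reformulated prompts, kept in insertion order
        self._items = OrderedDict()

    def add(self, question_id, prompts):
        """Insert a hard question; return the evicted id, or None."""
        if question_id in self._items:
            # Assumption: refreshing prompts keeps the original position.
            self._items[question_id] = prompts
            return None
        evicted = None
        if len(self._items) >= self.capacity:
            evicted, _ = self._items.popitem(last=False)  # oldest first
        self._items[question_id] = prompts
        return evicted

    def report(self, question_id, rollout_rewards):
        """Record rollout rewards (1 = correct, 0 = wrong).

        Returns True if the question graduated and was removed.
        """
        if question_id not in self._items or not rollout_rewards:
            return False
        mean_accuracy = sum(rollout_rewards) / len(rollout_rewards)
        if mean_accuracy >= self.graduate_at:
            del self._items[question_id]
            return True
        return False

    def active(self):
        """Ids of questions still being recirculated, oldest first."""
        return list(self._items)
```

For example, with `capacity=2`, adding a third question evicts the oldest one, and a question whose rollouts score `[1, 1, 0, 0]` graduates because its mean accuracy is exactly 0.5.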
If this is right
- ZPPO outperforms both off-policy and on-policy distillation baselines as well as GRPO on the 31-benchmark suite.
- Relative gains increase as student size decreases from 9B to 0.8B.
- Questions remain in the buffer only while they are still inside the student's current capability band; after that they graduate or are evicted.
- Teacher responses appear exclusively inside prompts rather than inside the advantage or loss terms.
Where Pith is reading between the lines
- The same replay-buffer logic could be tested in non-VLM RLHF settings where external guidance must not alter the policy distribution.
- If the buffer eviction policy proves sensitive to capacity, a priority-based variant might further stabilize training.
- The method implicitly assumes that discriminating between candidates transfers to open-ended generation; direct measurement of that transfer could be added as an auxiliary metric.
Load-bearing premise
Reformulating hard questions into BCQ and NCQ prompts and recirculating them keeps training inside the student's zone of proximal development while preserving valid on-policy gradient updates without new distribution shift.
What would settle it
A controlled run in which BCQ and NCQ prompts are replaced by neutral prompts while the same teacher answers are still shown elsewhere. If the performance gains persist, that would falsify the claim that the prompt reformulation itself is what carries the benefit; if they disappear, the reformulation is doing the work.
Original abstract
Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails, yielding zero advantage and being silently discarded, injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B–9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
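The zero-advantage problem mentioned in the abstract can be seen in a few lines. The sketch below uses the group-relative normalization commonly associated with GRPO (reward minus group mean, divided by group standard deviation). Whether ZPPO uses exactly this form is an assumption, and the function name and `eps` value are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one question's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rewards are identical
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout fails: all advantages are zero, so the question
# contributes nothing to the policy gradient.
all_failed = group_relative_advantages([0, 0, 0, 0])

# One rollout succeeds: the group now carries a learning signal.
one_correct = group_relative_advantages([1, 0, 0, 0])
```

Here `all_failed` contains only zeros, while `one_correct` has a positive advantage for the successful rollout and negative advantages for the rest.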
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Zone of Proximal Policy Optimization (ZPPO) for knowledge distillation, keeping the teacher inside prompts rather than gradients. For hard questions where student rollouts yield zero advantage, it constructs BCQ prompts (pairing one teacher response with one student error as anonymized candidates) and NCQ prompts (aggregating student errors), recirculating them via a replay buffer until the student's mean rollout accuracy reaches 0.5 or the question is FIFO-evicted. On Qwen3.5 students (0.8B–9B) with a 27B teacher, post-trained as VLMs and evaluated on 31 benchmarks (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
Significance. If the results hold after addressing the on-policy validity concerns, ZPPO could offer a practical alternative to logit-based distillation that avoids mode collapse in small-student regimes while staying closer to standard RLHF-style updates. The replay-buffer mechanism for staying within the student's zone of proximal development is a novel empirical heuristic, but its soundness as an unbiased estimator remains unverified.
major comments (3)
- [Abstract / Method] Abstract and method description: The claim that BCQ/NCQ reformulations and replay-buffer recirculation 'preserve the validity of the policy gradient updates without introducing new distribution shift' is load-bearing for the central contribution, yet no analysis, correction term, or diagnostic is provided showing that advantages computed on the modified prompt distribution remain unbiased estimators for the original question distribution.
- [Experiments] Experimental section: The abstract states performance gains over baselines but supplies no details on statistical significance testing, baseline hyperparameter search procedures, number of seeds, or controls for prompt-length and formatting confounds introduced by BCQ/NCQ construction; without these, it is impossible to assess whether the reported improvements are attributable to the proposed mechanism.
- [Method] Method description: The FIFO-eviction and graduation rule (mean accuracy = 0.5) induces non-uniform sampling over the original task distribution, yet no ablation or importance-weighting analysis is reported to quantify selection bias in the resulting policy-gradient estimates.
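One way to make the selection-bias concern in the third comment concrete is to log which question each training prompt came from and compare the empirical sampling frequency with a uniform draw over the original question set. The sketch below is a diagnostic under that assumption, not something the paper reports; `importance_weights` and its arguments are hypothetical names.

```python
from collections import Counter


def importance_weights(sample_log, all_question_ids):
    """Weight w(q) = p_uniform(q) / p_empirical(q) per sampled question.

    sample_log: question ids in the order they were used for training.
    all_question_ids: the original question set (assumed uniform target).
    Questions that were never sampled are omitted, since their weight
    is undefined.
    """
    counts = Counter(sample_log)
    total = len(sample_log)
    uniform = 1.0 / len(all_question_ids)
    return {
        q: uniform / (counts[q] / total)
        for q in all_question_ids
        if counts[q] > 0
    }


# Question "a" was replayed three times, "b" once.
weights = importance_weights(["a", "a", "a", "b"], ["a", "b"])
```

In this toy log, `weights["a"]` is about 0.67 and `weights["b"]` is 2.0, showing how heavily replayed questions would be down-weighted. Such weights address only how often each question is visited; they do not account for the change in prompt content that BCQ and NCQ introduce.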
minor comments (2)
- [Method] Notation for BCQ and NCQ is introduced only in the abstract; a dedicated subsection with explicit prompt templates and an example would improve reproducibility.
- [Experiments] The 31-benchmark suite is described only at the category level (16 VLM, 10 LLM, 5 Video); listing the individual benchmarks and their sources would strengthen the evaluation section.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on ZPPO. We address each major comment below, clarifying our design choices while committing to revisions that strengthen the manuscript's rigor and transparency.
Point-by-point responses
- Referee: [Abstract / Method] Abstract and method description: The claim that BCQ/NCQ reformulations and replay-buffer recirculation 'preserve the validity of the policy gradient updates without introducing new distribution shift' is load-bearing for the central contribution, yet no analysis, correction term, or diagnostic is provided showing that advantages computed on the modified prompt distribution remain unbiased estimators for the original question distribution.
  Authors: We agree that a formal analysis of unbiasedness would strengthen the central claim. BCQ and NCQ are constructed so that the student policy still samples responses to questions derived from the original distribution, with the teacher signal provided only via prompt content rather than gradient terms; the replay buffer simply re-exposes the same hard questions until the student's accuracy improves. This is intended as a practical heuristic to avoid zero-advantage discards while remaining closer to on-policy updates than direct teacher logit injection. However, we did not supply a correction term or diagnostic because the method prioritizes empirical focus on the zone of proximal development over theoretical guarantees. We will revise to remove the strong phrasing about preserving validity without shift, explicitly label the approach as heuristic, and add a limitations paragraph discussing possible distribution effects.
  Revision: partial
- Referee: [Experiments] Experimental section: The abstract states performance gains over baselines but supplies no details on statistical significance testing, baseline hyperparameter search procedures, number of seeds, or controls for prompt-length and formatting confounds introduced by BCQ/NCQ construction; without these, it is impossible to assess whether the reported improvements are attributable to the proposed mechanism.
  Authors: We accept that the experimental reporting is insufficient. The revised manuscript will report: (i) all main results averaged over three independent seeds with standard deviations; (ii) hyperparameter selection via grid search on a 5% held-out validation split of the training questions, with the same search budget applied to all baselines; (iii) explicit controls in which BCQ/NCQ prompts are truncated or padded to match the token-length distribution of the original prompts, with an ablation testing whether length-matched variants retain the reported gains. We will also add paired t-test p-values against the strongest baseline for the primary 31-benchmark aggregate.
  Revision: yes
- Referee: [Method] Method description: The FIFO-eviction and graduation rule (mean accuracy = 0.5) induces non-uniform sampling over the original task distribution, yet no ablation or importance-weighting analysis is reported to quantify selection bias in the resulting policy-gradient estimates.
  Authors: The graduation threshold and replay buffer deliberately create a curriculum that keeps questions inside the student's current zone of proximal development; without them, hard questions would be discarded after a single zero-advantage pass. This does produce non-uniform sampling. We will add a new ablation that disables the replay buffer (single-pass training on hard questions only) and reports both aggregate performance and per-question accuracy trajectories, allowing readers to assess the magnitude of any selection effect. We will also note in the method section that importance weighting was not applied because the buffer operates on a per-question basis rather than reweighting the original task distribution.
  Revision: yes
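The paired t-test promised in the second response can be sketched with the standard library alone. The function below computes the t statistic and degrees of freedom from per-benchmark scores of two methods. The name `paired_t` is hypothetical, and the scores in the usage example are placeholders for illustration, not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(scores_a, scores_b):
    """Paired t statistic for two methods scored on the same benchmarks.

    Returns (t, degrees_of_freedom). Converting t to a p-value needs the
    Student t distribution, which the standard library does not provide.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1


# Placeholder scores for three benchmarks (not from the paper).
t_stat, dof = paired_t([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

With 31 benchmarks the test would have 30 degrees of freedom. If SciPy is available, `scipy.stats.ttest_rel` returns the statistic together with a p-value.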
Circularity Check
No circularity: empirical method with no derivation chain
Full rationale
The paper introduces ZPPO as a practical RL training procedure using BCQ/NCQ prompt reformulations and a replay buffer, then reports empirical gains on Qwen3.5 models across 31 benchmarks. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on benchmark results rather than any mathematical reduction to inputs by construction. The method is presented as an empirical contribution whose validity is to be judged by external replication, not by internal definitional equivalence.