Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

Andrew Tao; Byung-Kwan Lee; Karan Sapra; Minki Kang; Pavlo Molchanov; Ryo Hachiuma; Saurav Muralidharan; Shizhe Diao; Ximing Lu; Yejin Choi

arxiv: 2606.18216 · v1 · pith:WWMAWD23new · submitted 2026-06-16 · 💻 cs.CL

Byung-Kwan Lee, Ximing Lu, Shizhe Diao, Minki Kang, Saurav Muralidharan, Karan Sapra, Andrew Tao, Pavlo Molchanov, Yejin Choi, Yu-Chiang Frank Wang, Ryo Hachiuma

Pith reviewed 2026-06-27 01:06 UTC · model grok-4.3

classification 💻 cs.CL

keywords: knowledge distillation, reinforcement learning, policy optimization, vision-language models, prompt reformulation, replay buffer, model compression

The pith

Reformulating hard questions as candidate prompts lets small students learn from larger teachers while staying on-policy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard distillation fails for small students for two reasons: logit imitation narrows them to the teacher's sharpest modes, and injecting teacher responses directly into the policy gradient violates the on-policy assumption. ZPPO instead keeps the teacher inside the prompt. Questions the student fails are turned into Binary Candidate-included Question (BCQ) prompts, which force the student to discriminate between one correct teacher answer and one wrong student answer, and Negative Candidate-included Question (NCQ) prompts, which surface patterns across multiple wrong answers. A replay buffer recirculates each hard question until the student's mean rollout accuracy on it reaches one half, which is meant to keep the material inside the student's current capability range. Experiments on Qwen3.5 students from 0.8B to 9B with a 27B teacher show consistent gains over distillation and GRPO across 31 benchmarks, largest at the smallest scale.

Core claim

ZPPO constructs Binary Candidate-included Questions that pair one teacher response with one student error as anonymized options the student must choose between, and Negative Candidate-included Questions that bundle the student's failed rollouts into one prompt; both are stored in a replay buffer and re-presented until the question graduates or is evicted, so the teacher influences learning only through the prompt while policy-gradient updates remain strictly on the student's own rollouts.
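
The two prompt types described above might be assembled along the following lines. This is only a sketch under stated assumptions: the function names `build_bcq` and `build_ncq`, the A/B candidate labels, and the instruction wording are all hypothetical, since the abstract does not give the paper's actual templates.

```python
import random
from typing import List


def build_bcq(question: str, teacher_correct: str, student_wrong: str,
              rng: random.Random) -> str:
    """Pair one correct teacher response with one wrong student response.

    The candidates are shuffled and labelled A/B so the prompt does not
    reveal which one came from the stronger model (hypothetical template).
    """
    candidates = [teacher_correct, student_wrong]
    rng.shuffle(candidates)
    return (
        f"Question: {question}\n"
        f"Candidate A: {candidates[0]}\n"
        f"Candidate B: {candidates[1]}\n"
        "Exactly one candidate is correct. Work through the question "
        "and give the final answer."
    )


def build_ncq(question: str, student_wrong: List[str]) -> str:
    """Bundle the student's failed rollouts into a single prompt."""
    listed = "\n".join(
        f"Incorrect attempt {i}: {resp}"
        for i, resp in enumerate(student_wrong, start=1)
    )
    return (
        f"Question: {question}\n"
        f"{listed}\n"
        "All attempts above are wrong. Identify what they share, "
        "then answer the question."
    )
```

In both cases the student would still generate its own rollouts in response to the reformulated prompt, so the teacher text appears only as input context.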

What carries the argument

BCQ and NCQ reformulated prompts recirculated by a replay buffer that keeps each question active until the student's mean rollout accuracy reaches half.
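
The buffer behavior described here (graduate at a mean rollout accuracy of one half, otherwise FIFO eviction under finite capacity) can be sketched as below. The class name `PromptReplayBuffer`, its methods, and the choice to key entries by a question id are assumptions made for illustration, not the paper's implementation.

```python
from collections import OrderedDict
from typing import List


class PromptReplayBuffer:
    """FIFO buffer of hard questions with accuracy-based graduation."""

    def __init__(self, capacity: int, graduate_at: float = 0.5) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.graduate_at = graduate_at
        self._items: "OrderedDict[str, str]" = OrderedDict()  # id -> question

    def add(self, qid: str, question: str) -> None:
        if qid in self._items:
            return  # already queued; keep its original FIFO position
        if len(self._items) >= self.capacity:
            self._items.popitem(last=False)  # evict the oldest entry
        self._items[qid] = question

    def report(self, qid: str, rewards: List[float]) -> bool:
        """Record rollout rewards (1.0 correct, 0.0 wrong) for a question.

        Returns True if the question graduated and was removed.
        """
        if qid not in self._items or not rewards:
            return False
        mean_acc = sum(rewards) / len(rewards)
        if mean_acc >= self.graduate_at:
            del self._items[qid]
            return True
        return False

    def pending(self) -> List[str]:
        """Question ids still waiting to be replayed, oldest first."""
        return list(self._items)
```

A training loop would call `add` for each question whose rollouts all failed, replay the ids returned by `pending`, and call `report` after each replay.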

If this is right

ZPPO outperforms both off-policy and on-policy distillation baselines as well as GRPO on the 31-benchmark suite.
Relative gains increase as student size decreases from 9B to 0.8B.
Questions stay in the buffer only until they graduate (mean rollout accuracy reaches half) or are FIFO-evicted under finite capacity.
Teacher responses appear exclusively inside prompts rather than inside the advantage or loss terms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same replay-buffer logic could be tested in non-VLM RLHF settings where external guidance must not alter the policy distribution.
If the buffer eviction policy proves sensitive to capacity, a priority-based variant might further stabilize training.
The method implicitly assumes that discriminating between candidates transfers to open-ended generation; direct measurement of that transfer could be added as an auxiliary metric.

Load-bearing premise

Reformulating hard questions into BCQ and NCQ prompts and recirculating them keeps training inside the student's zone of proximal development while preserving valid on-policy gradient updates without new distribution shift.

What would settle it

A controlled run in which BCQ and NCQ prompts are replaced by neutral prompts, while the same teacher answers are still shown elsewhere, would settle it: if the performance gains persist, the prompt reformulation itself is not what drives them.

Original abstract

Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
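
To see why an all-fail question contributes nothing, consider a group-normalized advantage of the kind used in GRPO-style training. The sketch below assumes the common form (reward minus group mean, divided by group standard deviation plus a small constant); the function name `group_advantages` and the `eps` value are illustrative choices, and the paper's exact estimator is not stated in the abstract.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout fails: all rewards are equal, so every advantage is zero
# and the question carries no gradient signal.
all_failed = group_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout succeeds: the correct rollout gets a positive advantage,
# the failed ones a negative advantage.
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])
```

This is the gap the reformulated prompts are meant to fill: they aim to turn an all-fail question into one where at least some of the student's own rollouts succeed.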

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ZPPO's BCQ/NCQ prompt reformulations plus replay buffer give a concrete way to handle zero-advantage cases in RL distillation without direct teacher gradients, but the on-policy claims rest on unverified assumptions about distribution shift.

The main contribution is a prompt-engineering approach for hard questions during RL-based distillation: BCQ pairs a teacher answer with a student error as candidates, NCQ aggregates student failures, and a replay buffer recirculates them until mean accuracy hits 0.5 or eviction. This keeps the teacher signal in the prompt rather than the gradient.

What stands out is the combination of those two prompt types with the buffer mechanism, aimed at staying inside the student's current capability range. The experiments cover four student sizes (0.8B to 9B) from the Qwen3.5 family, a 27B teacher, vision-language post-training, and a 31-benchmark suite. The abstract reports larger gains at the smallest scales compared with off/on-policy distillation and GRPO.

The soft spots are in the supporting evidence. The abstract gives no details on baseline implementations, hyperparameter controls, statistical significance, or checks for prompt-length and formatting effects. The distribution-shift concern also holds: once prompts are rewritten to surface teacher answers and failure modes, the state distribution changes, and nothing in the abstract shows that advantages remain unbiased for the original question distribution or that FIFO eviction avoids selection bias.

This is for groups doing post-training of small VLMs and LLMs where standard distillation collapses on hard examples. A reader looking for a practical tweak to try on edge-scale models would find the idea worth testing, even if the current write-up leaves the validity questions open.

It deserves peer review because the problem is common and the method is a distinct angle from prior work.

Referee Report

3 major / 2 minor

Summary. The paper introduces Zone of Proximal Policy Optimization (ZPPO) for knowledge distillation, keeping the teacher inside prompts rather than gradients. For hard questions where student rollouts yield zero advantage, it constructs BCQ prompts (pairing one teacher response with one student error as anonymized candidates) and NCQ prompts (aggregating student errors), recirculating them via a replay buffer until mean accuracy reaches 0.5 or FIFO eviction. On Qwen3.5 students (0.8B–9B) with a 27B teacher, post-trained as VLMs and evaluated on 31 benchmarks (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with largest gains at the smallest scale.

Significance. If the results hold after addressing the on-policy validity concerns, ZPPO could offer a practical alternative to logit-based distillation that avoids mode collapse in small-student regimes while staying closer to standard RLHF-style updates. The replay-buffer mechanism for staying within the student's zone of proximal development is a novel empirical heuristic, but its soundness as an unbiased estimator remains unverified.

major comments (3)

[Abstract / Method] Abstract and method description: The claim that BCQ/NCQ reformulations and replay-buffer recirculation 'preserve the validity of the policy gradient updates without introducing new distribution shift' is load-bearing for the central contribution, yet no analysis, correction term, or diagnostic is provided showing that advantages computed on the modified prompt distribution remain unbiased estimators for the original question distribution.
[Experiments] Experimental section: The abstract states performance gains over baselines but supplies no details on statistical significance testing, baseline hyperparameter search procedures, number of seeds, or controls for prompt-length and formatting confounds introduced by BCQ/NCQ construction; without these, it is impossible to assess whether the reported improvements are attributable to the proposed mechanism.
[Method] Method description: The FIFO-eviction and graduation rule (mean accuracy = 0.5) induces non-uniform sampling over the original task distribution, yet no ablation or importance-weighting analysis is reported to quantify selection bias in the resulting policy-gradient estimates.

minor comments (2)

[Method] Notation for BCQ and NCQ is introduced only in the abstract; a dedicated subsection with explicit prompt templates and an example would improve reproducibility.
[Experiments] The 31-benchmark suite is described only at the category level (16 VLM, 10 LLM, 5 Video); listing the individual benchmarks and their sources would strengthen the evaluation section.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on ZPPO. We address each major comment below, clarifying our design choices while committing to revisions that strengthen the manuscript's rigor and transparency.

Referee: [Abstract / Method] Abstract and method description: The claim that BCQ/NCQ reformulations and replay-buffer recirculation 'preserve the validity of the policy gradient updates without introducing new distribution shift' is load-bearing for the central contribution, yet no analysis, correction term, or diagnostic is provided showing that advantages computed on the modified prompt distribution remain unbiased estimators for the original question distribution.

Authors: We agree that a formal analysis of unbiasedness would strengthen the central claim. BCQ and NCQ are constructed so that the student policy still samples responses to questions derived from the original distribution, with the teacher signal provided only via prompt content rather than gradient terms; the replay buffer simply re-exposes the same hard questions until the student's accuracy improves. This is intended as a practical heuristic to avoid zero-advantage discards while remaining closer to on-policy updates than direct teacher logit injection. However, we did not supply a correction term or diagnostic because the method prioritizes empirical focus on the zone of proximal development over theoretical guarantees. We will revise to remove the strong phrasing about preserving validity without shift, explicitly label the approach as heuristic, and add a limitations paragraph discussing possible distribution effects. revision: partial
Referee: [Experiments] Experimental section: The abstract states performance gains over baselines but supplies no details on statistical significance testing, baseline hyperparameter search procedures, number of seeds, or controls for prompt-length and formatting confounds introduced by BCQ/NCQ construction; without these, it is impossible to assess whether the reported improvements are attributable to the proposed mechanism.

Authors: We accept that the experimental reporting is insufficient. The revised manuscript will report: (i) all main results averaged over three independent seeds with standard deviations; (ii) hyperparameter selection via grid search on a 5% held-out validation split of the training questions, with the same search budget applied to all baselines; (iii) explicit controls in which BCQ/NCQ prompts are truncated or padded to match the token length distribution of the original prompts, with an ablation showing whether length-matched variants retain the reported gains. We will also add paired t-test p-values against the strongest baseline for the primary 31-benchmark aggregate. revision: yes
Referee: [Method] Method description: The FIFO-eviction and graduation rule (mean accuracy = 0.5) induces non-uniform sampling over the original task distribution, yet no ablation or importance-weighting analysis is reported to quantify selection bias in the resulting policy-gradient estimates.

Authors: The graduation threshold and replay buffer deliberately create a curriculum that keeps questions inside the student's current zone of proximal development; without them, hard questions would be discarded after a single zero-advantage pass. This does produce non-uniform sampling. We will add a new ablation that disables the replay buffer (single-pass training on hard questions only) and reports both aggregate performance and per-question accuracy trajectories, allowing readers to assess the magnitude of any selection effect. We will also note in the method section that importance weighting was not applied because the buffer operates on a per-question basis rather than reweighting the original task distribution. revision: yes
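
For readers who want to check the promised significance reporting, the paired t statistic mentioned in the second response can be computed from per-benchmark scores as sketched below. The function name `paired_t` and its inputs are hypothetical; the sketch returns only the statistic and its degrees of freedom, and obtaining a p-value would additionally require the t distribution, for example through a library routine such as SciPy's `scipy.stats.ttest_rel`.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t(method: Sequence[float],
             baseline: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched scores.

    Each position holds the score of the two systems on the same benchmark.
    """
    if len(method) != len(baseline):
        raise ValueError("score lists must have the same length")
    diffs = [m - b for m, b in zip(method, baseline)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (sd / sqrt(n)), n - 1
```

With 31 benchmarks the test would have 30 degrees of freedom, treating each benchmark as one paired observation.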

Circularity Check

0 steps flagged

No circularity: empirical method with no derivation chain

The paper introduces ZPPO as a practical RL training procedure using BCQ/NCQ prompt reformulations and a replay buffer, then reports empirical gains on Qwen3.5 models across 31 benchmarks. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claims rest on benchmark results rather than any mathematical reduction to inputs by construction. The method is presented as an empirical contribution whose validity is to be judged by external replication, not by internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters, axioms, or invented entities beyond the high-level algorithmic components described.

pith-pipeline@v0.9.1-grok · 5885 in / 1282 out tokens · 34506 ms · 2026-06-27T01:06:19.571206+00:00 · methodology


work page doi:10.52202/079017-1274 2024
[56]

TroL: Traversal of layers for large language and vision models

Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, and Yong Man Ro. TroL: Traversal of layers for large language and vision models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11314–11342, Miami, Florida, USA, November 2024. Association...

work page doi:10.18653/v1/2024.emnlp-main.633 2024
[57]

Phantom of latent for large language and vision models.arXiv preprint arXiv:2409.14713, 2024

Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, and Yong Man Ro. Phantom of latent for large language and vision models.arXiv preprint arXiv:2409.14713, 2024

arXiv 2024
[58]

Genrecal: Generation after recalibration from large to small vision-language models.arXiv preprint arXiv:2506.15681, 2025

Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, and Yueh-Hua Wu. Genrecal: Generation after recalibration from large to small vision-language models.arXiv preprint arXiv:2506.15681, 2025

Pith/arXiv arXiv 2025
[59]

Building high-performing, efficient-size vision language models: merge, modify, and distill

Byung-Kwan Lee. Building high-performing, efficient-size vision language models: merge, modify, and distill. 2025

2025
[60]

Hide to see: Reasoning-prefix masking for visual-anchored thinking in vlm distillation.arXiv preprint arXiv:2605.11651, 2026

Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, and Jeany Son. Hide to see: Reasoning-prefix masking for visual-anchored thinking in vlm distillation.arXiv preprint arXiv:2605.11651, 2026

Pith/arXiv arXiv 2026
[61]

Recursive think-answer process for llms and vlms

Byung-Kwan Lee, Youngchae Chee, and Yong Man Ro. Recursive think-answer process for llms and vlms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, pages 9608–9621, June 2026

2026
[62]

Agent explorative policy optimization for multimodal agentic reasoning.arXiv preprint arXiv:2605.28774, 2026

Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, and Byung-Kwan Lee. Agent explorative policy optimization for multimodal agentic reasoning.arXiv preprint arXiv:2605.28774, 2026

Pith/arXiv arXiv 2026
[63]

Spatialclaw: Rethinking action interface for agentic spatial reasoning, 2026

Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, and Min-Hung Chen. Spatialclaw: Rethinking action interface for agentic spatial reasoning, 2026. URL https://arxiv.org/abs/2606.13673

Pith/arXiv arXiv 2026
[64]

Towards adversarial robustness of bayesian neural network through hierarchical variational inference, 2021

Byung-Kwan Lee, Youngjoon Yu, and Yong Man Ro. Towards adversarial robustness of bayesian neural network through hierarchical variational inference, 2021. URLhttps:// openreview.net/forum?id=Cue2ZEBf12

2021
[65]

Distilling robust and non-robust fea- tures in adversarial examples by information bottleneck

Junho Kim, Byung-Kwan Lee, and Yong Man Ro. Distilling robust and non-robust fea- tures in adversarial examples by information bottleneck. In M. Ranzato, A. Beygelz- imer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors,Advances in Neu- ral Information Processing Systems, volume 34, pages 17148–17159. Curran Associates, Inc., 2021. URL https://pro...

2021
[66]

Masking adversarial damage: Finding adversarial saliency for robust and sparse network

Byung-Kwan Lee, Junho Kim, and Yong Man Ro. Masking adversarial damage: Finding adversarial saliency for robust and sparse network. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15126–15136, June 2022

2022
[67]

Demystifying causal features on adversarial examples and causal inoculation for robust network by adversarial instrumental variable regression

Junho Kim, Byung-Kwan Lee, and Yong Man Ro. Demystifying causal features on adversarial examples and causal inoculation for robust network by adversarial instrumental variable regression. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12302–12312, June 2023

2023
[68]

Mitigating adversarial vulnerability through causal parameter estimation by adversarial double machine learning

Byung-Kwan Lee, Junho Kim, and Yong Man Ro. Mitigating adversarial vulnerability through causal parameter estimation by adversarial double machine learning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4499–4509, October 2023

2023
[69]

Why and when visual token pruning fails? a study on relevant visual information shift in mllms decoding

Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, and Chanyoung Park. Why and when visual token pruning fails? a study on relevant visual information shift in mllms decoding. arXiv preprint arXiv:2604.12358, 2026

Pith/arXiv arXiv 2026
[70]

Enhancing conversational agents with skill-of-mind-infused large language model

Young-Jun Lee, Byung-Kwan Lee, Dokyong Lee, Kyeong-Jin Oh, Yechan Hwang, Ho-Jin Choi, et al. Enhancing conversational agents with skill-of-mind-infused large language model
[71]

Training encoder-attention through fully-connected crfs for efficient end-to- end lane detection model

Byung-Kwan Lee. Training encoder-attention through fully-connected crfs for efficient end-to- end lane detection model. 2020

2020
[72]

Spark: Multi-vision sensor perception and reasoning benchmark for large-scale vision-language models.arXiv preprint arXiv:2408.12114, 2024

Youngjoon Yu, Sangyun Chung, Byung-Kwan Lee, and Yong Man Ro. Spark: Multi-vision sensor perception and reasoning benchmark for large-scale vision-language models.arXiv preprint arXiv:2408.12114, 2024

arXiv 2024
[73]

Multiverse: A multi-turn conversation benchmark for evaluating large vision and language models

Young-Jun Lee, Byung-Kwan Lee, Jianshu Zhang, Yechan Hwang, Byungsoo Ko, Han-Gyu Kim, Dongyu Yao, Xuankun Rong, Eojin Joo, Seung-Ho Han, Bowon Ko, and Ho-Jin Choi. Multiverse: A multi-turn conversation benchmark for evaluating large vision and language models. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 708–719,...

2025
[74]

Refinebench: Evaluatingrefinement capability of language models via checklists

Young-Jun Lee, Seungone Kim, Byung-Kwan Lee, Minkyeong Moon, Yechan Hwang, Jong My- oungKim, GrahamNeubig, SeanWelleck, andHo-JinChoi. Refinebench: Evaluatingrefinement capability of language models via checklists. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=GYJFJz9Dy5

2026
[75]

Mitigating dataset bias in image captioning through clip confounder-free captioning network

Yeonju Kim, Junho Kim, Byung-Kwan Lee, Sebin Shin, and Yong Man Ro. Mitigating dataset bias in image captioning through clip confounder-free captioning network. In2023 IEEE International Conference on Image Processing (ICIP), pages 1720–1724, 2023. doi: 10.1109/ICIP49359.2023.10222502

work page doi:10.1109/icip49359.2023.10222502 2023
[76]

Causal unsupervised semantic segmen- tation.Pattern Recognition, 171:112173, 2026

Junho Kim, Byung-Kwan Lee, and Yong Man Ro. Causal unsupervised semantic segmen- tation.Pattern Recognition, 171:112173, 2026. ISSN 0031-3203. doi: https://doi.org/10. 1016/j.patcog.2025.112173. URL https://www.sciencedirect.com/science/article/pii/ S0031320325008349

arXiv 2026
[77]

Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J

Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, et al. Beyond human data: Scaling self-training for problem-solving with language models.arXiv preprint arXiv:2312.06585, 2023. 18 Zone of Proximal Policy Optimization

arXiv 2023
[78]

R1-v: Reinforcing super generalization ability in vision-language models with less than $3.https://github.com/Deep-Agent/R1-V,

Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision-language models with less than $3.https://github.com/Deep-Agent/R1-V,
[79]

Accessed: 2025-02-02

2025
[80]

R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization.arXiv preprint arXiv:2503.10615, 2025

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization.arXiv preprint arXiv:2503.10615, 2025

Pith/arXiv arXiv 2025

Showing first 80 references.

[1] [1]

Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

Pith/arXiv arXiv 2024

[2] [2]

Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

Pith/arXiv arXiv 2023

[3] [3]

The claude 3 model family: Opus, sonnet, haiku

Anthropic. The claude 3 model family: Opus, sonnet, haiku. https: //www.anthropic.com, 2024. URL https://www-cdn.anthropic.com/ de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

2024

[4] [4]

Qwen3.5: Towards native multimodal agents, February 2026

Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https: //qwen.ai/blog?id=qwen3.5

2026

[5] [5]

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

Pith/arXiv arXiv 2025

[6] [6]

Kimi k1.5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599, 2025

Pith/arXiv arXiv 2025

[7] [7]

Understanding R1-Zero-Like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding R1-Zero-Like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

Pith/arXiv arXiv 2025

[8] [8]

DAPO: An open-source LLM reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. DAPO: An open-source LLM reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

Pith/arXiv arXiv 2025

[9] [9]

JustRL: Scaling a 1.5B LLM with a simple RL recipe.arXiv preprint arXiv:2512.16649, 2025

Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, et al. JustRL: Scaling a 1.5B LLM with a simple RL recipe.arXiv preprint arXiv:2512.16649, 2025

arXiv 2025

[10] [10]

The art of scaling reinforcement learning compute for LLMs.arXiv preprint arXiv:2510.13786, 2025

Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S Dhillon, David Brandfonbrener, and Rishabh Agarwal. The art of scaling reinforcement learning compute for LLMs.arXiv preprint arXiv:2510.13786, 2025

Pith/arXiv arXiv 2025

[11] [11]

Vlsi: Verbalized layers-to-interactions from large to small vision language models

Byung-Kwan Lee, Ryo Hachiuma, Yu-Chiang Frank Wang, Yong Man Ro, and Yueh-Hua Wu. Vlsi: Verbalized layers-to-interactions from large to small vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 29545–29557, June 2025

2025

[12] [12]

Uni- fied reinforcement and imitation learning for vision-language models

Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Frank Wang, and Yueh-Hua Wu. Uni- fied reinforcement and imitation learning for vision-language models. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors,Advances in Neural Information Processing Systems, volume 38, pages 156508–156534. Curran As- sociates, Inc., 2025. UR...

2025

[13] [13]

Masking teacher and reinforcing student for distilling vision-language models

Byung-Kwan Lee, Yu-Chiang Frank Wang, and Ryo Hachiuma. Masking teacher and reinforcing student for distilling vision-language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10126–10141, June 2026

2026

[14] [14]

Smolvlm: Redefining small and efficient multimodal models.arXiv preprint arXiv:2504.05299, 2025

Andrés Marafioti, Orr Zohar, Miquel Farré, Merve Noyan, Elie Bakouch, Pedro Cuenca, Cyril Zakka, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, et al. Smolvlm: Redefining small and efficient multimodal models.arXiv preprint arXiv:2504.05299, 2025

Pith/arXiv arXiv 2025

[15] [15]

Fastvlm: Efficient vision encoding for vision language models

Pavan Kumar Anasosalu Vasu, Fartash Faghri, Chun-Liang Li, Cem Koc, Nate True, Albert Antony, Gokula Santhanam, James Gabriel, Peter Grasch, Oncel Tuzel, et al. Fastvlm: Efficient vision encoding for vision language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19769–19780, 2025

2025

[16] [16]

Mobile edge intelligence for large language models: A contemporary survey.IEEE Communications Surveys & Tutorials, 2025

Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, and Kaibin Huang. Mobile edge intelligence for large language models: A contemporary survey.IEEE Communications Surveys & Tutorials, 2025

2025

[17] [17]

Distilling the knowledge in a neural network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

Pith/arXiv arXiv 2015

[18] [18]

Knowledge distillation: A survey.International journal of computer vision, 129(6):1789–1819, 2021

Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey.International journal of computer vision, 129(6):1789–1819, 2021

2021

[19] [19]

Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108, 2019

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108, 2019

Pith/arXiv arXiv 1910

[20] [20]

MiniLLM: Knowledge distillation of large language models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. InThe Twelfth International Conference on Learning Representations, 2024

2024

[21] [21]

DistiLLM: Towards streamlined distillation for large language models.arXiv preprint arXiv:2402.03898, 2024

Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun. DistiLLM: Towards streamlined distillation for large language models.arXiv preprint arXiv:2402.03898, 2024

arXiv 2024

[22] [22]

Why does self-distillation (sometimes) degrade the reasoning capability of llms?arXiv preprint arXiv:2603.24472, 2026

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms?arXiv preprint arXiv:2603.24472, 2026

Pith/arXiv arXiv 2026

[23] [23]

Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe.arXiv preprint arXiv:2604.13016, 2026

Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe.arXiv preprint arXiv:2604.13016, 2026

Pith/arXiv arXiv 2026

[24] [24]

A survey of on-policy distillation for large language models

Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626, 2026

Pith/arXiv arXiv 2026

[25] [25]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InThe Twelfth International Conference on Learning Representations, 2024

2024

[26] [26]

On-policy distillation.Thinking Machines Lab: Con- nectionism, 2025

Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on- policy-distillation. 14 Zone of Proximal Policy Optimization

work page doi:10.64434/tml.20251026 2025

[27] [27]

Revisiting on-policy distillation: Empirical failure modes and simple fixes

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on-policy distillation: Empirical failure modes and simple fixes. arXiv preprint arXiv:2603.25562, 2026

Pith/arXiv arXiv 2026

[28] [28]

Vold: Reasoning transfer from llms to vision-language models via on-policy distillation.arXiv preprint arXiv:2510.23497, 2025

Walid Bousselham, Hilde Kuehne, and Cordelia Schmid. Vold: Reasoning transfer from llms to vision-language models via on-policy distillation.arXiv preprint arXiv:2510.23497, 2025

Pith/arXiv arXiv 2025

[29] [29]

Learning beyond teacher: Generalized on-policy distillation with reward extrapolation.arXiv preprint arXiv:2602.12125, 2026

Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation.arXiv preprint arXiv:2602.12125, 2026

Pith/arXiv arXiv 2026

[30] [30]

Lyng, Sanjit Singh Batra, and Robert E

Dongxu Zhang, Zhichao Yang, Sepehr Janghorbani, Jun Han, Andrew Ressler II, Qian Qian, Gregory D. Lyng, Sanjit Singh Batra, and Robert E. Tillman. Fast and effective on-policy distillation from reasoning prefixes.arXiv preprint arXiv:2602.15260, 2026

arXiv 2026

[31] [31]

Lightning OPD: Efficient post-training for large reasoning models with offline on-policy distillation.arXiv preprint arXiv:2604.13010, 2026

Yecheng Wu, Song Han, and Han Cai. Lightning OPD: Efficient post-training for large reasoning models with offline on-policy distillation.arXiv preprint arXiv:2604.13010, 2026

Pith/arXiv arXiv 2026

[32] [32]

Self-distillation enables continual learning.arXiv preprint arXiv:2601.19897, 2026

Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning.arXiv preprint arXiv:2601.19897, 2026

Pith/arXiv arXiv 2026

[33] [33]

Self-distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734, 2026

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models.arXiv preprint arXiv:2601.18734, 2026

Pith/arXiv arXiv 2026

[34] [34]

Reinforce- ment learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforce- ment learning via self-distillation.arXiv preprint arXiv:2601.20802, 2026

Pith/arXiv arXiv 2026

[35] [35]

Self-distilled rlvr.arXiv preprint arXiv:2604.03128, 2026

Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr.arXiv preprint arXiv:2604.03128, 2026

Pith/arXiv arXiv 2026

[36] [36]

DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024

[37] [37]

REINFORCE++: A simple and efficient approach for aligning large language models

Jian Hu. REINFORCE++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025

Pith/arXiv arXiv 2025

[38] [38]

ProRL: Prolonged reinforcement learning expands reasoning boundaries in large language models.arXiv preprint arXiv:2505.24864, 2025

Mingjie Liu, Shizhe Diao, Ximing Lu, Jian Hu, Xin Dong, Yejin Choi, Jan Kautz, and Yi Dong. ProRL: Prolonged reinforcement learning expands reasoning boundaries in large language models.arXiv preprint arXiv:2505.24864, 2025

Pith/arXiv arXiv 2025

[39] [39]

BroRL: Scaling reinforcement learning via broadened exploration.arXiv preprint arXiv:2510.01180, 2025

Jian Hu, Mingjie Liu, Ximing Lu, Fang Wu, Zaid Harchaoui, Shizhe Diao, Yejin Choi, Pavlo Molchanov, Jun Yang, Jan Kautz, et al. BroRL: Scaling reinforcement learning via broadened exploration.arXiv preprint arXiv:2510.01180, 2025

arXiv 2025

[40] [40]

Dler: Doing length penalty right - incentivizing more intelligence per token via reinforcement learning.arXiv preprint arXiv:2510.15110, 2025

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, and Pavlo Molchanov. Dler: Doing length penalty right - incentivizing more intelligence per token via reinforcement learning.arXiv preprint arXiv:2510.15110, 2025. 15 Zone of Proximal Policy Optimization

arXiv 2025

[41] [41]

GDPO: Group reward- decoupled normalization policy optimization for multi-reward RL optimization.arXiv preprint arXiv:2601.05242, 2026

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, et al. GDPO: Group reward- decoupled normalization policy optimization for multi-reward RL optimization.arXiv preprint arXiv:2601.05242, 2026

Pith/arXiv arXiv 2026

[42] [42]

Harvard university press, 1978

Lev Semenovich Vygotsky and Michael Cole.Mind in society: Development of higher psycho- logical processes. Harvard university press, 1978

1978

[43] [43]

Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. Star: Bootstrapping reasoning with reasoning.arXiv preprint arXiv:2203.14465, 2022

arXiv 2022

[44] [44]

Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

Pith/arXiv arXiv 2017

[45] [45]

Kdrl: Post-training reasoning llms via unified knowledge distillation and reinforcement learning.arXiv preprint arXiv:2506.02208, 2025

Hongling Xu, Qi Zhu, Heyuan Deng, Jinpeng Li, Lu Hou, Yasheng Wang, Lifeng Shang, Ruifeng Xu, and Fei Mi. Kdrl: Post-training reasoning llms via unified knowledge distillation and reinforcement learning.arXiv preprint arXiv:2506.02208, 2025

arXiv 2025

[46] [46]

RLKD: Distilling LLMs’ reasoning via reinforcement learning

Shicheng Xu, Liang Pang, Yunchang Zhu, Jia Gu, Zihao Wei, Jingcheng Deng, Feiyang Pan, Huawei Shen, and Xueqi Cheng. RLKD: Distilling LLMs’ reasoning via reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, 2026

2026

[47] [47]

Wong, and Yu Cheng

Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu, Jing Shao, Derek F. Wong, and Yu Cheng. Exgrpo: Learning to reason from experience.arXiv preprint arXiv:2510.02245, 2025

arXiv 2025

[48] [48]

Rlep: Reinforcement learning with experience replay for llm reasoning.arXiv preprint arXiv:2507.07451, 2025

Hongzhi Zhang, Jia Fu, Jingyuan Zhang, Kai Fu, Qi Wang, Fuzheng Zhang, and Guorui Zhou. Rlep: Reinforcement learning with experience replay for llm reasoning.arXiv preprint arXiv:2507.07451, 2025

arXiv 2025

[49] [49]

Clpo: Curriculum learning meets policy optimization for llm reasoning.arXiv preprint arXiv:2509.25004, 2025

Shijie Zhang, Guohao Sun, Kevin Zhang, Xiang Guo, and Rujun Guo. Clpo: Curriculum learning meets policy optimization for llm reasoning.arXiv preprint arXiv:2509.25004, 2025

Pith/arXiv arXiv 2025

[50] [50]

StepHint: Multi-level stepwise hints enhance reinforcement learning to reason.arXiv preprint arXiv:2507.02841, 2025

Kaiyi Zhang, Ang Lv, Jinpeng Li, Yongbo Wang, Feng Wang, Haoyuan Hu, and Rui Yan. StepHint: Multi-level stepwise hints enhance reinforcement learning to reason.arXiv preprint arXiv:2507.02841, 2025

arXiv 2025

[51] [51]

BREAD: Branched rollouts from expert anchors bridge SFT & RL for reasoning.arXiv preprint arXiv:2506.17211, 2025

Xuechen Zhang, Zijian Huang, Yingcong Li, Chenshun Ni, Jiasi Chen, and Samet Oymak. BREAD: Branched rollouts from expert anchors bridge SFT & RL for reasoning.arXiv preprint arXiv:2506.17211, 2025

arXiv 2025

[52] [52]

Staying in the sweet spot: Responsive reasoning evolution via capability-adaptive hint scaffolding.arXiv preprint arXiv:2509.06923, 2025

Ziheng Li, Zexu Sun, Jinman Zhao, Erxue Min, Yongcheng Zeng, Hui Wu, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Xu Chen, et al. Staying in the sweet spot: Responsive reasoning evolution via capability-adaptive hint scaffolding.arXiv preprint arXiv:2509.06923, 2025

arXiv 2025

[53] [53]

CoLLaVO: Crayon large language and vision mOdel

Byung-Kwan Lee, Beomchan Park, Chae Won Kim, and Yong Man Ro. CoLLaVO: Crayon large language and vision mOdel. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 1121–1138, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/ v1/2024.fin...

2024

[54] [54]

Moai: Mixture of all intelligence for large language and vision models

Byung-Kwan Lee, Beomchan Park, Chae Won Kim, and Yong Man Ro. Moai: Mixture of all intelligence for large language and vision models. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors,Computer Vision – ECCV 2024, pages 273–302, Cham, 2025. Springer Nature Switzerland. ISBN 978-3-031-72967-6. 16 Zone of ...

2024

[55] [55]

Meteor: Mamba-based traversal of rationale for large language and vision models

Byung-Kwan Lee, Chae Won Kim, Beomchan Park, and Yong Man Ro. Meteor: Mamba-based traversal of rationale for large language and vision models. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 40278–40315. Curran Associates, Inc., 2024. doi: 10....

work page doi:10.52202/079017-1274 2024

[56] [56]

TroL: Traversal of layers for large language and vision models

Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, and Yong Man Ro. TroL: Traversal of layers for large language and vision models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11314–11342, Miami, Florida, USA, November 2024. Association...

work page doi:10.18653/v1/2024.emnlp-main.633 2024

[57] [57]

Phantom of latent for large language and vision models.arXiv preprint arXiv:2409.14713, 2024

Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, and Yong Man Ro. Phantom of latent for large language and vision models.arXiv preprint arXiv:2409.14713, 2024

arXiv 2024

[58] [58]

Genrecal: Generation after recalibration from large to small vision-language models.arXiv preprint arXiv:2506.15681, 2025

Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, and Yueh-Hua Wu. Genrecal: Generation after recalibration from large to small vision-language models.arXiv preprint arXiv:2506.15681, 2025

Pith/arXiv arXiv 2025

[59] [59]

Building high-performing, efficient-size vision language models: merge, modify, and distill

Byung-Kwan Lee. Building high-performing, efficient-size vision language models: merge, modify, and distill. 2025

2025

[60] [60]

Hide to see: Reasoning-prefix masking for visual-anchored thinking in vlm distillation.arXiv preprint arXiv:2605.11651, 2026

Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, and Jeany Son. Hide to see: Reasoning-prefix masking for visual-anchored thinking in vlm distillation.arXiv preprint arXiv:2605.11651, 2026

Pith/arXiv arXiv 2026

[61] [61]

Recursive think-answer process for llms and vlms

Byung-Kwan Lee, Youngchae Chee, and Yong Man Ro. Recursive think-answer process for llms and vlms. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, pages 9608–9621, June 2026

2026

[62]

Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov, Yu-Chiang Frank Wang, and Byung-Kwan Lee. Agent explorative policy optimization for multimodal agentic reasoning. arXiv preprint arXiv:2605.28774, 2026

[63]

Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, and Min-Hung Chen. Spatialclaw: Rethinking action interface for agentic spatial reasoning, 2026. URL https://arxiv.org/abs/2606.13673

[64]

Byung-Kwan Lee, Youngjoon Yu, and Yong Man Ro. Towards adversarial robustness of bayesian neural network through hierarchical variational inference, 2021. URL https://openreview.net/forum?id=Cue2ZEBf12

[65]

Junho Kim, Byung-Kwan Lee, and Yong Man Ro. Distilling robust and non-robust features in adversarial examples by information bottleneck. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 17148–17159. Curran Associates, Inc., 2021. URL https://pro...

[66]

Byung-Kwan Lee, Junho Kim, and Yong Man Ro. Masking adversarial damage: Finding adversarial saliency for robust and sparse network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15126–15136, June 2022

[67]

Junho Kim, Byung-Kwan Lee, and Yong Man Ro. Demystifying causal features on adversarial examples and causal inoculation for robust network by adversarial instrumental variable regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12302–12312, June 2023

[68]

Byung-Kwan Lee, Junho Kim, and Yong Man Ro. Mitigating adversarial vulnerability through causal parameter estimation by adversarial double machine learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4499–4509, October 2023

[69]

Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, and Chanyoung Park. Why and when visual token pruning fails? a study on relevant visual information shift in mllms decoding. arXiv preprint arXiv:2604.12358, 2026

[70]

Young-Jun Lee, Byung-Kwan Lee, Dokyong Lee, Kyeong-Jin Oh, Yechan Hwang, Ho-Jin Choi, et al. Enhancing conversational agents with skill-of-mind-infused large language model

[71]

Byung-Kwan Lee. Training encoder-attention through fully-connected crfs for efficient end-to-end lane detection model. 2020

[72]

Youngjoon Yu, Sangyun Chung, Byung-Kwan Lee, and Yong Man Ro. Spark: Multi-vision sensor perception and reasoning benchmark for large-scale vision-language models. arXiv preprint arXiv:2408.12114, 2024

[73]

Young-Jun Lee, Byung-Kwan Lee, Jianshu Zhang, Yechan Hwang, Byungsoo Ko, Han-Gyu Kim, Dongyu Yao, Xuankun Rong, Eojin Joo, Seung-Ho Han, Bowon Ko, and Ho-Jin Choi. Multiverse: A multi-turn conversation benchmark for evaluating large vision and language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 708–719,...

[74]

Young-Jun Lee, Seungone Kim, Byung-Kwan Lee, Minkyeong Moon, Yechan Hwang, Jong Myoung Kim, Graham Neubig, Sean Welleck, and Ho-Jin Choi. Refinebench: Evaluating refinement capability of language models via checklists. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=GYJFJz9Dy5

[75]

Yeonju Kim, Junho Kim, Byung-Kwan Lee, Sebin Shin, and Yong Man Ro. Mitigating dataset bias in image captioning through clip confounder-free captioning network. In 2023 IEEE International Conference on Image Processing (ICIP), pages 1720–1724, 2023. doi: 10.1109/ICIP49359.2023.10222502

[76]

Junho Kim, Byung-Kwan Lee, and Yong Man Ro. Causal unsupervised semantic segmentation. Pattern Recognition, 171:112173, 2026. ISSN 0031-3203. doi: https://doi.org/10.1016/j.patcog.2025.112173. URL https://www.sciencedirect.com/science/article/pii/S0031320325008349

[77]

Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, et al. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585, 2023

[78]

Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision-language models with less than $3. https://github.com/Deep-Agent/R1-V, 2025. Accessed: 2025-02-02

[80]

Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, et al. R1-Onevision: Advancing generalized multimodal reasoning through cross-modal formalization. arXiv preprint arXiv:2503.10615, 2025