DOPD: Dual On-policy Distillation
Pith reviewed 2026-06-30 05:49 UTC · model grok-4.3
The pith
DOPD routes each token's supervision between privileged teacher and student policies using advantage gaps to reduce privilege illusion.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DOPD is an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. Each token receives supervision of a different strength, objective, and strategy from either the teacher or the student itself, so that the student acquires credible capability while auxiliary signals alleviate privilege illusion.
What carries the argument
Advantage-aware dual routing that assigns per-token supervision from either privileged teacher or privileged student based on advantage gap and relative probabilities.
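The page does not reproduce the paper's routing rule, so the following is only a minimal sketch of what per-token routing of this kind could look like. Everything in it is an assumption made for illustration: the function name `route_tokens`, the threshold `tau`, the use of a log-probability difference as a stand-in for the advantage gap, and the weight formula derived from relative probabilities are hypothetical and are not taken from the paper.

```python
import math

def route_tokens(logp_teacher, logp_priv_student, logp_student, tau=0.0):
    """Hypothetical per-token routing sketch (NOT the paper's rule).

    Inputs are log-probabilities of the student-sampled token under the
    privileged teacher, the privileged student, and the plain student.
    """
    routes = []
    for lt, lp, ls in zip(logp_teacher, logp_priv_student, logp_student):
        # Assumed stand-in for the advantage gap: how much more the
        # privileged teacher favors this token than the privileged student.
        gap = lt - lp
        if gap > tau:
            source, l_src = "teacher", lt
        else:
            source, l_src = "self", lp
        # Assumed strength from relative probabilities: grows as the plain
        # student lags the chosen source, and is zero when it already matches.
        weight = max(0.0, 1.0 - math.exp(ls - l_src))
        routes.append((source, weight))
    return routes

# Toy values, made up for illustration only.
routes = route_tokens(
    logp_teacher=[-0.1, -2.0, -1.0],
    logp_priv_student=[-0.5, -0.3, -1.0],
    logp_student=[-1.5, -1.0, -1.0],
)
for source, weight in routes:
    print(source, round(weight, 3))
```

In this toy setup a token is routed to the teacher only when the teacher is more confident than the privileged student, and a token the plain student already matches receives zero weight. Whether the actual method behaves this way cannot be confirmed from the abstract.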
If this is right
- DOPD outperforms vanilla OPD and other counterparts on LLM and VLM settings.
- The method yields gains in stability, robustness, continual learning, and out-of-distribution performance.
- Tokens receive supervision varying in strength, objective, and strategy from either source.
- Privilege illusion is reduced by separating capability transfer from information asymmetry.
Where Pith is reading between the lines
- The routing logic may extend to other distillation or imitation settings where one party holds extra context that cannot be replicated.
- Non-uniform token importance could be used more broadly to focus training on capability-bearing signals rather than uniform dense supervision.
- The dual-policy approach might lower the requirement for exact capability matching between teacher and student in future distillation work.
Load-bearing premise
That advantage gap and relative probabilities can separate transferable capability signals from non-replicable information asymmetry without creating new biases or instability.
What would settle it
An experiment on a small-scale LLM task in which DOPD routing is applied but the resulting model shows no gain over vanilla OPD, or in which the chosen routes fail to correlate with measured capability transfer on held-out data.
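The check described here can be sketched as a small script. This is an illustrative assumption about how such a test might be scored, not a protocol from the paper: the helper names `pearson` and `settles_against`, the thresholds `min_gain` and `min_corr`, and all numbers are hypothetical placeholders.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def settles_against(dopd_scores, opd_scores, routed_to_teacher, transfer_gain,
                    min_gain=0.0, min_corr=0.1):
    """True if either failure condition holds (hypothetical thresholds).

    dopd_scores / opd_scores: paired benchmark scores per task or seed.
    routed_to_teacher: 0/1 route indicator per held-out token.
    transfer_gain: measured capability transfer per held-out token.
    """
    mean_gain = sum(d - o for d, o in zip(dopd_scores, opd_scores)) / len(dopd_scores)
    corr = pearson(routed_to_teacher, transfer_gain)
    return mean_gain <= min_gain or abs(corr) < min_corr

# Made-up numbers: DOPD ahead of vanilla OPD, routes correlated with transfer.
verdict = settles_against([0.6, 0.7], [0.5, 0.6], [0, 1, 0, 1], [0.0, 0.2, 0.1, 0.3])
print(verdict)
```

A real test would also need a significance test over seeds and a defensible way to measure per-token capability transfer, neither of which is specified on this page.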
Original abstract
On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby elevate the performance frontier of distillation, an intuitive direction is to infuse privileged information to either teacher or student itself. However, this additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that students are meant to close, and the information asymmetry gap that can only be mimicked but never replicated. This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals. To this end, we propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. Each token receives supervision of different strength, objective, and strategy from either teacher or student itself, which transfers credible capability while simultaneously receiving auxiliary signals, to alleviate privilege illusion. Extensive experiments on both large language model (LLM) and vision-language model (VLM) settings demonstrate that DOPD consistently outperforms Vanilla OPD and other counterparts. Further results on stability, robustness, continual learning, and out-of-distribution tasks validate its superiority.
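To make "dense token-level signals" concrete, here is a minimal sketch of a per-token distillation loss on a student-sampled trajectory. It assumes the signal is a reverse KL divergence between student and teacher next-token distributions, which is a common choice in on-policy distillation but is not confirmed by the abstract. The function names and the toy two-token vocabulary are hypothetical.

```python
import math

def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) over one next-token distribution."""
    return sum(
        ps * (math.log(ps + eps) - math.log(pt + eps))
        for ps, pt in zip(student_probs, teacher_probs)
        if ps > 0
    )

def opd_token_losses(student_dists, teacher_dists):
    """One loss value per position of a student-sampled trajectory.

    Both arguments are lists of next-token distributions evaluated at the
    same positions of a sequence that the student itself generated.
    """
    return [reverse_kl(s, t) for s, t in zip(student_dists, teacher_dists)]

# Toy two-position trajectory over a two-token vocabulary (made-up values).
losses = opd_token_losses(
    student_dists=[[0.5, 0.5], [0.9, 0.1]],
    teacher_dists=[[0.5, 0.5], [0.5, 0.5]],
)
print(losses)
```

The first position contributes no loss because student and teacher agree, while the second does. This illustrates the non-uniformity the abstract refers to: only some positions carry a meaningful signal.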
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DOPD, an advantage-aware dual on-policy distillation paradigm for LLMs and VLMs. It dynamically routes token-level supervision between privileged teacher and privileged student policies using advantage gap and relative probabilities to address 'privilege illusion' (conflation of transferable capability gaps with non-replicable information asymmetry). The paper claims DOPD outperforms Vanilla OPD and other methods, with further benefits on stability, robustness, continual learning, and OOD tasks.
Significance. If the routing rule reliably isolates replicable capability signals from privilege-only asymmetry without introducing new biases or instability, the approach could meaningfully advance on-policy distillation by leveraging dual privileged policies and token-level adaptivity. The emphasis on non-uniform token supervision is a relevant direction for large-model training.
major comments (2)
- [Abstract] The claim that 'DOPD consistently outperforms Vanilla OPD and other counterparts' is load-bearing for the central experimental result, yet it comes with no metrics, baselines, statistical details, ablation results, or experimental setup.
- [Abstract] The routing mechanism is described only at a conceptual level ('dynamically routes token-level supervision ... based on their advantage gap and relative probabilities'), with no equations, algorithm, or pseudocode. This prevents assessment of whether the heuristic correctly separates capability-bearing tokens from information-asymmetry tokens.
minor comments (1)
- [Abstract] The newly introduced term 'privilege illusion' is not formally defined or illustrated with an example, which reduces clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We agree that the abstract can be strengthened for clarity and will revise it to better support the central claims while respecting length constraints. Below we address each point.
Point-by-point responses
- Referee: [Abstract] The claim that 'DOPD consistently outperforms Vanilla OPD and other counterparts' is load-bearing for the central experimental result, yet it comes with no metrics, baselines, statistical details, ablation results, or experimental setup.
  Authors: We agree the abstract claim would be more informative with supporting details. In the revision we will add concise quantitative indicators (e.g., average relative gains on the primary benchmarks) and name the main baselines and settings, while keeping the statement within abstract length limits.
  Revision: yes
- Referee: [Abstract] The routing mechanism is described only at a conceptual level ('dynamically routes token-level supervision ... based on their advantage gap and relative probabilities'), with no equations, algorithm, or pseudocode. This prevents assessment of whether the heuristic correctly separates capability-bearing tokens from information-asymmetry tokens.
  Authors: The abstract intentionally remains high-level. The full routing rule, advantage-gap formula, probability-based selection, and Algorithm 1 appear in Section 3. To address the concern we will insert a compact inline expression for the token-level routing decision in the revised abstract.
  Revision: yes
Circularity Check
No circularity: method defined conceptually without self-referential derivations or fitted predictions
Full rationale
The provided abstract and description introduce DOPD as a routing heuristic based on advantage gap and relative probabilities to address privilege illusion, but contain no equations, derivations, or first-principles claims that reduce to inputs by construction. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems are invoked. The central contribution is presented as an empirical method validated on LLM/VLM tasks rather than a mathematical chain that collapses to its own definitions. This is the common case of a self-contained algorithmic proposal without detectable circularity in the derivation.