Co-Evolving Policy Distillation
Pith reviewed 2026-05-07 08:18 UTC · model grok-4.3
The pith
Co-Evolving Policy Distillation integrates text, image, and video reasoning into one model by having experts train and distill bidirectionally during RLVR.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Co-Evolving Policy Distillation enables all-in-one integration of text, image, and video reasoning capabilities by encouraging parallel training of experts and introducing bidirectional off-policy distillation during each expert's ongoing RLVR training, with experts serving as mutual teachers so that they co-evolve. This produces more consistent behavioral patterns among experts while preserving sufficient complementary knowledge throughout, yielding a unified model that significantly outperforms mixed RLVR and sequential OPD baselines and even surpasses domain-specific experts.
What carries the argument
Co-Evolving Policy Distillation (CoPD): a training procedure that runs multiple expert RLVR trainings in parallel and inserts bidirectional off-policy distillation steps at intervals during training rather than only after completion, so experts act as mutual teachers and adjust together.
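A minimal sketch of what that loop could look like, assuming toy MLP "experts", a REINFORCE-style stand-in for RLVR, and an arbitrary distillation interval K_STEPS; shapes, rewards, and hyperparameters are illustrative assumptions, not values from the paper.

```python
# Sketch of parallel expert RLVR with interleaved bidirectional distillation.
# Toy policies and rewards are placeholders, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, K_STEPS = 64, 32, 10

def make_expert() -> nn.Module:
    return nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.Tanh(), nn.Linear(HIDDEN, VOCAB))

experts = {"text": make_expert(), "image": make_expert()}   # one policy per domain
opts = {k: torch.optim.Adam(m.parameters(), lr=1e-3) for k, m in experts.items()}

def rlvr_step(name: str, batch: torch.Tensor) -> None:
    """One verifiable-reward policy-gradient update (toy REINFORCE stand-in)."""
    dist = torch.distributions.Categorical(logits=experts[name](batch))
    actions = dist.sample()
    reward = (actions % 2 == 0).float()                      # placeholder verifiable reward
    loss = -(dist.log_prob(actions) * (reward - reward.mean())).mean()
    opts[name].zero_grad(); loss.backward(); opts[name].step()

def bidirectional_opd_step(batch: torch.Tensor) -> None:
    """Each expert distills from the other's current, frozen distribution."""
    for student, teacher in [("text", "image"), ("image", "text")]:
        with torch.no_grad():
            teacher_probs = F.softmax(experts[teacher](batch), dim=-1)
        student_logp = F.log_softmax(experts[student](batch), dim=-1)
        kl = F.kl_div(student_logp, teacher_probs, reduction="batchmean")
        opts[student].zero_grad(); kl.backward(); opts[student].step()

for step in range(100):
    batch = torch.randn(8, HIDDEN)                           # placeholder prompts
    for name in experts:                                     # experts train in parallel
        rlvr_step(name, batch)
    if (step + 1) % K_STEPS == 0:                            # OPD interleaved mid-training,
        bidirectional_opd_step(batch)                        # not after experts finish
```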
If this is right
- A single model can deliver strong performance across text, image, and video reasoning without the interference seen when all tasks train together.
- Bidirectional distillation during ongoing training shrinks the behavioral gaps that limit knowledge transfer in sequential pipelines.
- Mutual teaching while experts adapt preserves complementary knowledge that would otherwise be lost to divergence.
- The parallel training pattern may open a new route to scaling post-training by co-evolving multiple experts instead of merging them after the fact.
- The unified model can exceed the accuracy of models trained for only one reasoning domain.
Where Pith is reading between the lines
- The same parallel co-evolution idea could extend to additional modalities or larger sets of experts, potentially lowering total compute by reducing redundant sequential stages.
- Bidirectional mid-training distillation might improve other knowledge-sharing settings such as supervised fine-tuning or alignment where behavioral consistency matters.
- If the pattern holds at larger scale, training pipelines may shift toward simultaneous expert development rather than post-training merging or mixing.
- One could measure whether the frequency of bidirectional steps can be tuned to optimize the trade-off between alignment and retained specialization.
Load-bearing premise
That inserting bidirectional distillation while experts are still running RLVR will align their behavioral patterns enough to avoid interference costs without erasing the distinct knowledge each expert holds.
What would settle it
A direct comparison experiment in which a CoPD-trained model performs no better than, or worse than, the strongest mixed RLVR run or sequential OPD run on the combined text-image-video tasks, or falls below separately trained domain experts on their individual tasks.
read the original abstract
RLVR and OPD have become standard paradigms for post-training. We provide a unified analysis of these two paradigms in consolidating multiple expert capabilities into a single model, identifying capability loss in different ways: mixed RLVR suffers from inter-capability divergence cost, while the pipeline of first training experts and then performing OPD, though avoiding divergence, fails to fully absorb teacher capabilities due to large behavioral pattern gaps between teacher and student. We propose Co-Evolving Policy Distillation (CoPD), which encourages parallel training of experts and introduces OPD during each expert's ongoing RLVR training rather than after complete expert training, with experts serving as mutual teachers (making OPD bidirectional) to co-evolve. This enables more consistent behavioral patterns among experts while maintaining sufficient complementary knowledge throughout. Experiments validate that CoPD achieves all-in-one integration of text, image, and video reasoning capabilities, significantly outperforming strong baselines such as mixed RLVR and MOPD, and even surpassing domain-specific experts. The model parallel training pattern offered by CoPD may inspire a novel training scaling paradigm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper provides a unified analysis of RLVR and OPD for consolidating multiple expert capabilities (text, image, video reasoning) into one model. It identifies two failure modes: mixed RLVR incurs inter-capability divergence costs, while sequential OPD (train experts fully then distill) suffers from large behavioral pattern gaps that prevent full absorption of teacher capabilities. The proposed Co-Evolving Policy Distillation (CoPD) trains experts in parallel and inserts bidirectional OPD (mutual teaching) during each expert's ongoing RLVR training rather than after completion. This is claimed to produce more consistent behavioral patterns while preserving complementary knowledge. Experiments reportedly show CoPD outperforming mixed RLVR and MOPD baselines and even surpassing the original domain-specific experts.
Significance. If the empirical claims hold, CoPD would represent a meaningful advance in post-training methods for multi-modal reasoning models by enabling co-evolution of experts rather than sequential or mixed approaches. The parallel training pattern could open a new scaling direction for RLVR-style methods, particularly if the bidirectional distillation reliably trades off consistency against complementarity without collapse.
major comments (2)
- [Abstract / Experiments] The central claim that CoPD surpasses domain-specific experts rests on the assertion that bidirectional OPD during ongoing RLVR produces sufficiently consistent behavioral patterns to avoid mixed-RLVR divergence while still retaining complementary knowledge. The abstract states this occurs but supplies no supporting quantitative evidence such as per-modality reward trajectories, KL or action-distribution divergence between experts before/after each OPD step, or an ablation that removes the bidirectional component. Without these measurements it is impossible to verify that the claimed trade-off is achieved rather than one side dominating (e.g., video expert collapsing toward text-like patterns).
- [Method] The method description is high-level: it is unclear how the OPD loss is scheduled relative to the RLVR objective, what temperature or weighting is used for the bidirectional distillation, or whether any auxiliary regularization is added to enforce pattern consistency. These details are load-bearing for reproducibility and for understanding why CoPD avoids the behavioral-gap problem identified for sequential OPD.
minor comments (2)
- [Abstract] The baseline 'MOPD' is referenced without expansion; the acronym should be defined on first use.
- [Conclusion] The final sentence claims the parallel training pattern 'may inspire a novel training scaling paradigm.' This is an interesting forward-looking statement but would be strengthened by a short discussion of potential limitations (e.g., communication overhead in parallel training, sensitivity to expert initialization).
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important areas where additional evidence and implementation details would strengthen the manuscript. We address each major comment below and have revised the paper to incorporate the requested quantitative analyses, ablations, and methodological clarifications.
read point-by-point responses
-
Referee: [Abstract / Experiments] The central claim that CoPD surpasses domain-specific experts rests on the assertion that bidirectional OPD during ongoing RLVR produces sufficiently consistent behavioral patterns to avoid mixed-RLVR divergence while still retaining complementary knowledge. The abstract states this occurs but supplies no supporting quantitative evidence such as per-modality reward trajectories, KL or action-distribution divergence between experts before/after each OPD step, or an ablation that removes the bidirectional component. Without these measurements it is impossible to verify that the claimed trade-off is achieved rather than one side dominating (e.g., video expert collapsing toward text-like patterns).
Authors: We agree that intermediate quantitative diagnostics would make the mechanism more transparent. The submitted manuscript reports only final-task performance, where CoPD exceeds both mixed RLVR and the original domain-specific experts on text, image, and video reasoning benchmarks. This outcome is consistent with successful retention of complementary knowledge without collapse or divergence, but we did not include the requested per-modality reward trajectories, policy KL divergences, or a bidirectional ablation. In the revised version we will add (i) training curves of per-expert rewards, (ii) KL and action-distribution divergence statistics measured immediately before and after each bidirectional OPD step, and (iii) an ablation that disables the bidirectional component while keeping parallel RLVR. These additions will allow direct verification of the claimed consistency-complementarity balance. revision: yes
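For illustration only, a divergence probe of the kind promised here might look like the sketch below; expert_a, expert_b, and probe_batch are assumed stand-ins for the trained experts and a shared evaluation batch, not artifacts from the paper.

```python
# Hedged sketch of the promised diagnostic: symmetric KL between two experts'
# output distributions on a shared probe batch, logged before and after each
# bidirectional OPD step. Expert interfaces and probe data are assumptions.
import torch
import torch.nn.functional as F

def expert_divergence(expert_a, expert_b, probe_batch: torch.Tensor) -> torch.Tensor:
    """Mean symmetric KL between the two experts' output distributions."""
    with torch.no_grad():
        logp_a = F.log_softmax(expert_a(probe_batch), dim=-1)
        logp_b = F.log_softmax(expert_b(probe_batch), dim=-1)
        kl_a_b = F.kl_div(logp_b, logp_a, reduction="batchmean", log_target=True)  # KL(A || B)
        kl_b_a = F.kl_div(logp_a, logp_b, reduction="batchmean", log_target=True)  # KL(B || A)
    return 0.5 * (kl_a_b + kl_b_a)

# gap_before = expert_divergence(text_expert, video_expert, probe_batch)
# ... one bidirectional OPD step ...
# gap_after = expert_divergence(text_expert, video_expert, probe_batch)  # expected to shrink
```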
-
Referee: [Method] The method description is high-level: it is unclear how the OPD loss is scheduled relative to the RLVR objective, what temperature or weighting is used for the bidirectional distillation, or whether any auxiliary regularization is added to enforce pattern consistency. These details are load-bearing for reproducibility and for understanding why CoPD avoids the behavioral-gap problem identified for sequential OPD.
Authors: We acknowledge that the method section in the original submission remained at a conceptual level. The revised manuscript will expand the description with a detailed algorithm box that specifies: the interleaving schedule (OPD loss applied every k RLVR gradient steps), the temperature used for the bidirectional distillation softmax, the scalar weighting coefficient balancing the RLVR and OPD objectives, and any auxiliary consistency regularizer (if present). We will also clarify how the mutual-teacher structure is realized in the parallel training loop, thereby addressing why the online bidirectional setting mitigates the large behavioral-gap issue observed in sequential OPD. revision: yes
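A sketch of how such an interleaved objective could be wired up, assuming an every-k-steps schedule, a distillation temperature tau, and a scalar weight lam; none of these values are reported in the provided text, so they are placeholders.

```python
# Illustrative combined objective: plain RLVR loss on most steps, plus a
# temperature-scaled, weighted distillation term on every k-th step. The
# schedule k, temperature tau, and weight lam are assumed placeholders.
import torch
import torch.nn.functional as F

def copd_style_loss(rlvr_loss: torch.Tensor, student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor, step: int,
                    k: int = 10, tau: float = 2.0, lam: float = 0.5) -> torch.Tensor:
    if (step + 1) % k != 0:
        return rlvr_loss                                             # ordinary RLVR update
    student_logp = F.log_softmax(student_logits / tau, dim=-1)
    teacher_prob = F.softmax(teacher_logits.detach() / tau, dim=-1)  # frozen mutual teacher
    opd = F.kl_div(student_logp, teacher_prob, reduction="batchmean") * tau * tau
    return rlvr_loss + lam * opd                                     # weighted interleaved objective
```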
Circularity Check
No circularity: empirical training proposal with no derivation chain
full rationale
The paper proposes Co-Evolving Policy Distillation (CoPD) as a practical, empirical training procedure that runs bidirectional OPD in parallel with each expert's ongoing RLVR. It offers a descriptive analysis of limitations in mixed RLVR (inter-capability divergence) and sequential OPD (behavioral pattern gaps), then introduces the co-evolution method to address them. No mathematical equations, first-principles derivations, fitted parameters, or predictions appear in the provided text. Claims of all-in-one integration and outperformance rest on experimental validation rather than any self-referential reduction, self-citation chain, or ansatz smuggled via prior work. The central assumption about consistent behavioral patterns is an empirical hypothesis tested in experiments, not a quantity defined by construction from the inputs. This is a standard non-circular empirical methods paper.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: RLVR and OPD are effective base paradigms for post-training expert capabilities into models.
- ad hoc to paper: Behavioral pattern consistency can be improved via mutual distillation without losing complementary knowledge.
invented entities (1)
-
Co-Evolving Policy Distillation (CoPD)
no independent evidence
Forward citations
Cited by 1 Pith paper
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
Reference graph
Works this paper leans on
-
[1]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300
-
[2]
Understanding R1-Zero-Like Training: A Critical Perspective
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025. URL https://arxiv.org/abs/2503.20783
-
[3]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...
-
[4]
Group Sequence Policy Optimization
Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URL https://arxiv.org/abs/2507.18071
-
[5]
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2026. URL https://arxiv.org/abs/2503.06749
-
[6]
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, and Jiaxing Huang. R1-sharevl: Incentivizing reasoning capability of multimodal large language models via share-grpo, 2025. URL https://arxiv.org/abs/2505.16673
-
[7]
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...
-
[8]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
-
[9]
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025
-
[10]
Time-r1: Post-training large vision language model for temporal video grounding
Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, Xiangnan Fang, Zewen He, Zhenbo Luo, Wenxuan Wang, Junqi Lin, Jian Luan, and Qin Jin. Time-r1: Post-training large vision language model for temporal video grounding, 2025. URL https://arxiv.org/abs/2503.13377
-
[11]
Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, Y. Charles, H. S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Chen...
-
[12]
https://arxiv.org/abs/2602.02276
-
[13]
GLM-5: from Vibe Coding to Agentic Engineering
GLM-5-Team, :, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, Chenzheng Zhu, Congfeng Yin, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zho...
-
[14]
Scaling laws for optimal data mixtures
Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, and Pierre Ablin. Scaling laws for optimal data mixtures, 2025. URL https://arxiv.org/abs/2507.09404
-
[15]
On-policy distillation
Kevin Lu and Thinking Machines Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation
-
[16]
Mimo-v2-flash technical report
Core Team, Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, Gang Xie, Hailin Zhang, Hanglong Lv, Hanyu Li, Heyu Chen, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Lei Li, Liang Zhao, Linghao Zhang,...
-
[17]
https://arxiv.org/abs/2601.02780
-
[18]
Deepseek-v4: Towards highly efficient million-token context intelligence, 2026
DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026
-
[19]
Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025
Chenxin An, Zhihui Xie, Xiaonan Li, Lei Li, Jun Zhang, Shansan Gong, Ming Zhong, Jingjing Xu, Xipeng Qiu, Mingxuan Wang, and Lingpeng Kong. Polaris: A post-training recipe for scaling reinforcement learning on advanced reasoning models, 2025. URL https://hkunlp.github.io/blog/2025/Polaris
-
[20]
DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL
Michael Luo, Sijun Tan, Justin Wong, Xiaoxiang Shi, William Y. Tang, Manan Roongta, Colin Cai, Jeffrey Luo, Li Erran Li, Raluca Ada Popa, and Ion Stoica. Deepscaler: Surpassing o1-preview with a 1.5b model by scaling rl. https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2
-
[21]
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. Areal: A large-scale asynchronous reinforcement learning system for language reasoning, 2025. URL https://arxiv.org/abs/2505.24298
-
[22]
Mmfinereason: Closing the multimodal reasoning gap via open data-centric methods, 2026
Honglin Lin, Zheng Liu, Yun Zhu, Chonghan Qin, Juekai Lin, Xiaoran Shang, Conghui He, Wentao Zhang, and Lijun Wu. Mmfinereason: Closing the multimodal reasoning gap via open data-centric methods. arXiv preprint arXiv:2601.21821, 2026
-
[23]
OneThinker: All-in-one Reasoning Model for Image and Video
Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video. arXiv preprint arXiv:2512.03043, 2025
-
[24]
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning, 2025. URL https://arxiv.org/abs/2504.06958
-
[25]
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,...
-
[26]
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark, 2025. URL https://arxiv.org/abs/2409.02813
-
[27]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024. URL https://arxiv.org/abs/2310.02255
-
[28]
Measuring multimodal mathematical reasoning with MATH-vision dataset
Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with MATH-vision dataset. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=QWTCcxMpPA
-
[29]
Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion- Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, David I. Atkinson, Aaditya Baranwal, Alexa...
-
[30]
Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma GongQue, Shanglin Lei, YiFan Zhang, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Xiao Zong, Yida Xu, Peiqing Yang, Zhimin Bao, Muxi Diao, Chen Li, and Honggang Zhang. We-math: Does your large multimodal model achieve human-like mathematical reasoning? In Wan...
-
[31]
Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?, 2024
Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Peng Gao, and Hongsheng Li. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems?, 2024. URL https://arxiv.org/abs/2403.14624
-
[32]
Aime problems and solutions
MAA Committees. Aime problems and solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions
-
[33]
Matharena: Evaluating llms on uncontaminated math competitions, February 2025
Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. Matharena: Evaluating llms on uncontaminated math competitions, February 2025. URL https://matharena.ai/
-
[34]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset, 2021. URL https://arxiv.org/abs/2103.03874
-
[35]
Solving Quantitative Reasoning Problems with Language Models
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022. URL https://arxiv.org/abs/2206.14858
-
[36]
Video-holmes: Can MLLM think like holmes for complex video reasoning?
Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning?, 2025. URL https://arxiv.org/abs/2505.21374
-
[37]
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi-modal video understanding benchmark, 2024. URL https://arxiv.org/abs/2311.17005
-
[38]
Mmvu: Measuring expert-level multi-discipline video understanding,
Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, and Arman Cohan. Mmvu: Measuring expert-level multi-discipline video understanding,
- [39]
-
[40]
Videomathqa: Benchmarking mathematical reasoning via multimodal understanding in videos, 2025
Hanoona Rasheed, Abdelrahman Shaker, Anqi Tang, Muhammad Maaz, Ming-Hsuan Yang, Salman Khan, and Fahad Shahbaz Khan. Videomathqa: Benchmarking mathematical reasoning via multimodal understanding in videos, 2025. URL https://arxiv.org/abs/2506.05349
-
[41]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...
-
[42]
EasyVideoR1: Easier RL for Video Understanding
Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, and Jiaqi Wang. Easyvideor1: Easier rl for video understanding, 2026. URL https://arxiv.org/abs/2604.16893
-
[43]
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024
-
[44]
Easyr1: An efficient, scalable, multi-modality rl training framework
Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, Yuwen Xiong, and Richong Zhang. Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1, 2025
-
[45]
OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kondri...
-
[46]
https://arxiv.org/abs/2412.16720
-
[47]
KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance
Linhao Yu, Tianmeng Yang, Siyu Ding, Renren Jin, Naibin Gu, Xiangzhao Hao, Shuaiyi Nie, Deyi Xiong, Weichong Yin, Yu Sun, and Hua Wu. Knowrl: Boosting llm reasoning via reinforcement learning with minimal-sufficient knowledge guidance, 2026. URL https://arxiv.org/abs/2604.12627
-
[48]
Muzhi Dai, Chenxu Yang, and Qingyi Si. S-grpo: Early exit via reinforcement learning in reasoning models, 2025. URL https://arxiv.org/abs/2505.07686
-
[49]
On-policy distillation of language models: Learning from self-generated mistakes
Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=3zKtaqxLhW
-
[50]
MiniLLM: On-Policy Distillation of Large Language Models
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: On-policy distillation of large language models, 2026. URL https://arxiv.org/abs/2306.08543
-
[51]
Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models, 2026. URL https://arxiv.org/abs/2601.18734
-
[52]
Reinforcement Learning via Self-Distillation
Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation, 2026. URL https://arxiv.org/abs/2601.20802
-
[53]
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe, 2026. URL https://arxiv.org/abs/2604.13016
-
[54]
Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr, 2026. URL https://arxiv.org/abs/2604.03128
-
[55]
Near-Future Policy Optimization
Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao, Zheng Lin, Peng Fu, Nan Duan, and Jiaqi Wang. Near-future policy optimization, 2026. URL https://arxiv.org/abs/2604.20733
-
[56]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347