Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models
Pith reviewed 2026-06-29 22:16 UTC · model grok-4.3
The pith
Multi-domain contrastive policy optimization converts cross-domain interference into knowledge transfer for large reasoning models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MCPO analyzes the structural relationships among rollouts and promotes cross-domain knowledge sharing and in-domain knowledge consolidation in a contrastive manner. Specifically, for a given prompt, MCPO identifies transferable reasoning trajectories from other domains as positive examples, while treating incorrect rollouts as negative ones. It then encourages consistent representations for positive pairs and pushes negative pairs apart, thereby facilitating knowledge transfer and reducing interference. Moreover, MCPO aligns intra-domain correct rollouts to build a consolidated representation space, learning a harmonious representation space that can accommodate diverse multi-domain knowledge.
What carries the argument
The contrastive mechanism that aligns representations of cross-domain correct trajectories as positives and separates incorrect trajectories as negatives, while also consolidating correct trajectories within each domain.
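To make the mechanism concrete, here is a minimal sketch of an InfoNCE-style contrastive term for a single anchor rollout. It is not the paper's loss, which this page does not reproduce: the cosine similarity, the `temperature` value, and the function names `cosine` and `contrastive_loss` are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor rollout embedding (assumed form).

    positives: embeddings of correct rollouts (cross-domain or same-domain).
    negatives: embeddings of incorrect rollouts.
    Returns the mean over positives of
    -log( exp(s_p) / (exp(s_p) + sum_n exp(s_n)) ).
    """
    pos_logits = [cosine(anchor, p) / temperature for p in positives]
    neg_logits = [cosine(anchor, n) / temperature for n in negatives]
    losses = []
    for s_p in pos_logits:
        logits = [s_p] + neg_logits
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - s_p)
    return sum(losses) / len(losses)
```

The loss is small when the anchor sits close to its positives and far from its negatives, and large in the opposite case, which is the pull-together and push-apart behavior the summary describes.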
If this is right
- Reasoning capabilities improve across all participating domains.
- Performance can exceed single-domain training in some cases.
- Cross-domain interactions shift from harmful competition to beneficial transfer.
- Knowledge sharing is enabled while interference is reduced.
Where Pith is reading between the lines
- The same contrastive alignment could apply to other multi-task reinforcement learning settings where domains compete for policy updates.
- Representation-level contrast might help preserve earlier domain performance when new domains are added sequentially.
- Testing the reliability of positive-example selection on domains with conflicting reasoning styles would clarify the method's limits.
Load-bearing premise
Cross-domain correct rollouts can be reliably identified as transferable positive examples whose knowledge will improve rather than degrade the target domain policy without introducing noise or incorrect reasoning patterns.
What would settle it
Applying the method to a held-out domain pair and finding that target-domain performance drops below the single-domain baseline, or that the model begins producing the same errors as the cross-domain positives.
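The first half of this test can be sketched as a simple per-domain comparison. The helper below is hypothetical, as are the `margin` parameter and the scores in the usage note; none of these numbers come from the paper.

```python
def regressed_domains(multi_domain_scores, single_domain_scores, margin=0.0):
    """Return the domains where multi-domain training scores below the
    single-domain baseline by more than `margin` (same metric, same scale)."""
    return sorted(
        domain
        for domain, baseline in single_domain_scores.items()
        if multi_domain_scores[domain] < baseline - margin
    )
```

For example, with made-up accuracies `{"math": 0.60, "code": 0.40}` for the multi-domain run and `{"math": 0.50, "code": 0.45}` for the single-domain baselines, the function returns `["code"]`. In practice `margin` would be set from run-to-run variance so that noise is not mistaken for interference.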
Original abstract
Post-training has significantly enhanced the reasoning capability of Large Reasoning Models (LRMs), especially with Reinforcement Learning (RL) like Group Relative Policy Optimization (GRPO). However, GRPO-style RL methods in multi-domain settings often fail to achieve consistent improvements across all domains due to inherent interference in policy optimization. Prior studies on multi-domain RL primarily focus on alleviating cross-domain interference, while often neglecting the pivotal role of knowledge sharing, which we argue is the key to transforming cross-domain interactions from harmful competition into beneficial transfer. To address this limitation, we propose Multi-domain Contrastive Policy Optimization (MCPO), which analyzes the structural relationships among rollouts and promotes cross-domain knowledge sharing and in-domain knowledge consolidation in a contrastive manner. Specifically, for a given prompt, MCPO identifies transferable reasoning trajectories from other domains as positive examples, while treating incorrect rollouts as negative ones. It then encourages consistent representations for positive pairs and pushes negative pairs apart, thereby facilitating knowledge transfer and reducing interference. Moreover, MCPO aligns intra-domain correct rollouts to build a consolidated representation space. In this way, MCPO contrastively learns a harmonious representation space that can accommodate diverse multi-domain knowledge. Empirical results show that MCPO improves the reasoning capabilities of LRMs across multiple domains and even outperforms single-domain training in some cases. Code is available at https://github.com/Maricalce/MCPO.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Multi-domain Contrastive Policy Optimization (MCPO) to mitigate policy interference in multi-domain RL post-training of Large Reasoning Models. Building on GRPO, MCPO uses a contrastive objective that selects correct cross-domain rollouts as positive pairs (to encourage consistent representations and knowledge transfer) and incorrect rollouts as negatives, while also aligning intra-domain correct rollouts for consolidation. The central claim is that this produces a harmonious multi-domain representation space, yielding reasoning improvements across domains and, in some cases, outperforming single-domain training. Code is released.
Significance. If the empirical results hold under rigorous evaluation, the work would be significant for multi-domain reasoning model training: it reframes cross-domain interactions as opportunities for beneficial transfer rather than solely interference to be minimized. The explicit release of code supports reproducibility and is a strength.
major comments (2)
- [Abstract and §3] Abstract and §3 (method): The selection rule for 'transferable reasoning trajectories from other domains' is described only at a high level ('identifies transferable reasoning trajectories'). No concrete criterion, scoring function, or filtering step is given. This is load-bearing for the central claim, because if the positives contain domain-specific suboptimal patterns, the contrastive term would consolidate representations around noisy trajectories rather than improve the target policy.
- [§4] §4 (experiments): No experimental details, baselines (including GRPO and other multi-domain RL methods), dataset statistics, statistical tests, ablation studies on the contrastive components, or per-domain performance tables are supplied in the manuscript. Without these, the claim that MCPO 'improves ... across multiple domains and even outperforms single-domain training in some cases' cannot be evaluated.
minor comments (2)
- [§3] Notation for the contrastive loss (positive/negative pairs) should be formalized with an equation rather than prose description to allow precise reproduction.
- [Abstract] The abstract states 'Code is available at https://github.com/Maricalce/MCPO' but the manuscript does not indicate which exact experimental configurations are covered by the released code.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight areas where additional clarity and detail will strengthen the manuscript. We address each major comment below and commit to a revised version that incorporates the requested information.
Point-by-point responses
- Referee: [Abstract and §3] Abstract and §3 (method): The selection rule for 'transferable reasoning trajectories from other domains' is described only at a high level ('identifies transferable reasoning trajectories'). No concrete criterion, scoring function, or filtering step is given. This is load-bearing for the central claim, because if the positives contain domain-specific suboptimal patterns, the contrastive term would consolidate representations around noisy trajectories rather than improve the target policy.
  Authors: We agree that the selection mechanism requires a concrete specification to support the central claim. In the revised manuscript we will expand §3 with the precise scoring function, similarity metric, and filtering threshold used to designate cross-domain trajectories as positives, along with a brief analysis showing that the selected positives do not simply propagate domain-specific errors. Revision: yes.
- Referee: [§4] §4 (experiments): No experimental details, baselines (including GRPO and other multi-domain RL methods), dataset statistics, statistical tests, ablation studies on the contrastive components, or per-domain performance tables are supplied in the manuscript. Without these, the claim that MCPO 'improves ... across multiple domains and even outperforms single-domain training in some cases' cannot be evaluated.
  Authors: We concur that the experimental reporting is currently insufficient for rigorous evaluation. The revised §4 will include complete implementation details, comparisons to GRPO and additional multi-domain RL baselines, dataset statistics, statistical significance tests, ablations isolating each contrastive term, and full per-domain performance tables with error bars. Revision: yes.
Circularity Check
No circularity: MCPO is an explicitly defined contrastive objective with independent empirical claims
Full rationale
The paper introduces MCPO as a new loss that selects cross-domain correct rollouts as positives and incorrect ones as negatives, then applies standard contrastive alignment within and across domains. This construction is stated directly from rollout data without any fitted parameter being renamed as a prediction, without self-citation chains justifying uniqueness, and without any equation reducing the claimed improvement to the input selection rule by definition. The reported gains are presented as experimental outcomes of applying the objective, not quantities forced by the method's own parameterization. The central assumption about transferability is an empirical hypothesis, not a definitional equivalence, so the derivation chain remains self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: rollouts can be labeled correct or incorrect via an external reward or verification signal.
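Under that assumption, the three groups the method description names can be sketched as a plain partition of labeled rollouts. The `Rollout` type and `build_contrast_sets` helper are hypothetical names for illustration. Two choices here are assumptions, not the paper's procedure: negatives are taken to be all incorrect rollouts regardless of domain, and cross-domain correct rollouts are returned only as candidates, since the transferability filter is not specified on this page.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    domain: str       # e.g. "math" or "code" (illustrative labels)
    text: str         # the generated reasoning trajectory
    is_correct: bool  # set by an external verifier or reward signal

def build_contrast_sets(rollouts, anchor_domain):
    """Split rollouts into intra-domain positives, cross-domain candidate
    positives, and negatives for a prompt from `anchor_domain`.

    No transferability filter is applied to the cross-domain candidates,
    because the source does not say how that filter works.
    """
    intra_positives = [
        r for r in rollouts if r.is_correct and r.domain == anchor_domain
    ]
    cross_candidates = [
        r for r in rollouts if r.is_correct and r.domain != anchor_domain
    ]
    negatives = [r for r in rollouts if not r.is_correct]
    return intra_positives, cross_candidates, negatives
```

Everything downstream depends on `is_correct` being trustworthy: a verifier that accepts flawed reasoning would place noisy trajectories in both positive groups.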