
arxiv: 2605.25443 · v1 · pith:FKWGND5Knew · submitted 2026-05-25 · 💻 cs.CL

Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models

Pith reviewed 2026-06-29 22:16 UTC · model grok-4.3

classification 💻 cs.CL
keywords multi-domain reinforcement learning · contrastive policy optimization · large reasoning models · knowledge sharing · policy optimization · reasoning capabilities · GRPO

The pith

Multi-domain contrastive policy optimization converts cross-domain interference into knowledge transfer for large reasoning models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that a contrastive policy optimization method can turn potential conflicts between different training domains into useful knowledge sharing for large reasoning models. It identifies correct reasoning paths from other domains as positive examples whose representations should align, treats incorrect paths as negatives to repel, and also consolidates correct paths within each domain. If the method works, models could be trained on multiple domains at once without the usual uneven performance drops seen in standard reinforcement learning approaches. A sympathetic reader would care because current methods struggle to scale reasoning improvements reliably when tasks span several areas.

Core claim

MCPO analyzes the structural relationships among rollouts and promotes cross-domain knowledge sharing and in-domain knowledge consolidation in a contrastive manner. Specifically, for a given prompt, MCPO identifies transferable reasoning trajectories from other domains as positive examples, while treating incorrect rollouts as negative ones. It then encourages consistent representations for positive pairs and pushes negative pairs apart, thereby facilitating knowledge transfer and reducing interference. Moreover, MCPO aligns intra-domain correct rollouts to build a consolidated representation space, learning a harmonious representation space that can accommodate diverse multi-domain knowledge.

What carries the argument

The contrastive mechanism that aligns representations of cross-domain correct trajectories as positives and separates incorrect trajectories as negatives, while also consolidating correct trajectories within each domain.
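
The abstract does not give the loss in closed form, so the following is only a sketch of one common way such an objective is written: a supervised-contrastive-style loss over pooled rollout embeddings. The function name, the embedding input, and the temperature value are assumptions for illustration. The sketch treats every correct rollout as a positive regardless of domain, because the rule for selecting transferable trajectories is not specified in the text reviewed here.

```python
import numpy as np

def contrastive_rollout_loss(embeddings, correct, temperature=0.1):
    """Supervised-contrastive-style loss over rollout embeddings (illustrative).

    embeddings: (n, d) array, one hypothetical pooled vector per rollout.
    correct:    (n,) booleans from an external verifier.
    Correct rollouts act as anchors and as each other's positives;
    incorrect rollouts appear only in the denominator, as negatives.
    """
    correct = np.asarray(correct, dtype=bool)
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature  # scaled cosine similarity
    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)

    # Numerically stable log-softmax over all other rollouts in the batch.
    masked = np.where(not_self, sim, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom

    # Positive pairs: both rollouts correct, and not the same rollout.
    positives = (correct[:, None] & correct[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    if not has_pos.any():
        return 0.0  # no positive pair in this batch, nothing to align
    per_anchor = -np.where(positives, log_prob, 0.0).sum(axis=1)[has_pos] / n_pos[has_pos]
    return float(per_anchor.mean())
```

The loss is small when correct rollouts sit close together and away from incorrect ones, and grows when a correct rollout lies nearer to incorrect rollouts than to other correct ones.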

If this is right

  • Reasoning capabilities improve across all participating domains.
  • Performance can exceed single-domain training in some cases.
  • Cross-domain interactions shift from harmful competition to beneficial transfer.
  • Knowledge sharing is enabled while interference is reduced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same contrastive alignment could apply to other multi-task reinforcement learning settings where domains compete for policy updates.
  • Representation-level contrast might help preserve earlier domain performance when new domains are added sequentially.
  • Testing the reliability of positive-example selection on domains with conflicting reasoning styles would clarify the method's limits.

Load-bearing premise

Cross-domain correct rollouts can be reliably identified as transferable positive examples whose knowledge will improve rather than degrade the target domain policy without introducing noise or incorrect reasoning patterns.

What would settle it

Applying the method to a held-out domain pair and finding that target-domain performance drops below the single-domain baseline, or that the model begins producing the same errors as the cross-domain positives.
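
The first half of that test can be stated as a simple comparison. The sketch below flags domains where multi-domain training scores below the single-domain baseline; the function name, the `tolerance` parameter, and all numbers are placeholders for illustration and are not results from the paper.

```python
def domains_below_baseline(multi_domain_scores, single_domain_scores, tolerance=0.0):
    """Return domains where multi-domain training scores below the single-domain baseline."""
    return sorted(
        domain
        for domain, score in multi_domain_scores.items()
        if score < single_domain_scores[domain] - tolerance
    )

# Placeholder numbers for illustration only; they are not results from the paper.
multi = {"math": 0.62, "code": 0.41, "science": 0.55}
single = {"math": 0.60, "code": 0.45, "science": 0.55}
print(domains_below_baseline(multi, single))  # prints ['code']
```

A non-empty result on a held-out domain pair would count against the claim; checking whether the model reproduces errors found in the cross-domain positives needs a separate qualitative analysis.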

Original abstract

Post-training has significantly enhanced the reasoning capability of Large Reasoning Models (LRMs), especially with Reinforcement Learning (RL) like Group Relative Policy Optimization (GRPO). However, GRPO-style RL methods in multi-domain settings often fail to achieve consistent improvements across all domains due to inherent interference in policy optimization. Prior studies on multi-domain RL primarily focus on alleviating cross-domain interference, while often neglecting the pivotal role of knowledge sharing, which we argue is the key to transforming cross-domain interactions from harmful competition into beneficial transfer. To address this limitation, we propose Multi-domain Contrastive Policy Optimization (MCPO), which analyzes the structural relationships among rollouts and promotes cross-domain knowledge sharing and in-domain knowledge consolidation in a contrastive manner. Specifically, for a given prompt, MCPO identifies transferable reasoning trajectories from other domains as positive examples, while treating incorrect rollouts as negative ones. It then encourages consistent representations for positive pairs and pushes negative pairs apart, thereby facilitating knowledge transfer and reducing interference. Moreover, MCPO aligns intra-domain correct rollouts to build a consolidated representation space. In this way, MCPO contrastively learns a harmonious representation space that can accommodate diverse multi-domain knowledge. Empirical results show that MCPO improves the reasoning capabilities of LRMs across multiple domains and even outperforms single-domain training in some cases. Code is available at https://github.com/Maricalce/MCPO.
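
As background for the GRPO-style methods the abstract refers to: GRPO scores each rollout relative to the other rollouts sampled for the same prompt, normalizing rewards by the group mean and standard deviation. A minimal sketch follows. The function name and the `eps` stabilizer are illustrative, and implementations differ on details such as population versus sample standard deviation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's rollout group to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some codebases use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one prompt, with binary correctness rewards.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every rollout in a group receives the same reward, all advantages are zero and the prompt contributes no policy gradient.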

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Multi-domain Contrastive Policy Optimization (MCPO) to mitigate policy interference in multi-domain RL post-training of Large Reasoning Models. Building on GRPO, MCPO uses a contrastive objective that selects correct cross-domain rollouts as positive pairs (to encourage consistent representations and knowledge transfer) and incorrect rollouts as negatives, while also aligning intra-domain correct rollouts for consolidation. The central claim is that this produces a harmonious multi-domain representation space, yielding reasoning improvements across domains and, in some cases, outperforming single-domain training. Code is released.

Significance. If the empirical results hold under rigorous evaluation, the work would be significant for multi-domain reasoning model training: it reframes cross-domain interactions as opportunities for beneficial transfer rather than solely interference to be minimized. The explicit release of code supports reproducibility and is a strength.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (method): The selection rule for 'transferable reasoning trajectories from other domains' is described only at a high level ('identifies transferable reasoning trajectories'). No concrete criterion, scoring function, or filtering step is given. This is load-bearing for the central claim, because if the positives contain domain-specific suboptimal patterns, the contrastive term would consolidate representations around noisy trajectories rather than improve the target policy.
  2. [§4] §4 (experiments): No experimental details, baselines (including GRPO and other multi-domain RL methods), dataset statistics, statistical tests, ablation studies on the contrastive components, or per-domain performance tables are supplied in the manuscript. Without these, the claim that MCPO 'improves ... across multiple domains and even outperforms single-domain training in some cases' cannot be evaluated.
minor comments (2)
  1. [§3] Notation for the contrastive loss (positive/negative pairs) should be formalized with an equation rather than prose description to allow precise reproduction.
  2. [Abstract] The abstract states 'Code is available at https://github.com/Maricalce/MCPO' but the manuscript does not indicate which exact experimental configurations are covered by the released code.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight areas where additional clarity and detail will strengthen the manuscript. We address each major comment below and commit to a revised version that incorporates the requested information.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (method): The selection rule for 'transferable reasoning trajectories from other domains' is described only at a high level ('identifies transferable reasoning trajectories'). No concrete criterion, scoring function, or filtering step is given. This is load-bearing for the central claim, because if the positives contain domain-specific suboptimal patterns, the contrastive term would consolidate representations around noisy trajectories rather than improve the target policy.

    Authors: We agree that the selection mechanism requires a concrete specification to support the central claim. In the revised manuscript we will expand §3 with the precise scoring function, similarity metric, and filtering threshold used to designate cross-domain trajectories as positives, along with a brief analysis showing that the selected positives do not simply propagate domain-specific errors. revision: yes

  2. Referee: [§4] §4 (experiments): No experimental details, baselines (including GRPO and other multi-domain RL methods), dataset statistics, statistical tests, ablation studies on the contrastive components, or per-domain performance tables are supplied in the manuscript. Without these, the claim that MCPO 'improves ... across multiple domains and even outperforms single-domain training in some cases' cannot be evaluated.

    Authors: We concur that the experimental reporting is currently insufficient for rigorous evaluation. The revised §4 will include complete implementation details, comparisons to GRPO and additional multi-domain RL baselines, dataset statistics, statistical significance tests, ablations isolating each contrastive term, and full per-domain performance tables with error bars. revision: yes

Circularity Check

0 steps flagged

No circularity: MCPO is an explicitly defined contrastive objective with independent empirical claims

Full rationale

The paper introduces MCPO as a new loss that selects cross-domain correct rollouts as positives and incorrect ones as negatives, then applies standard contrastive alignment within and across domains. This construction is stated directly from rollout data without any fitted parameter being renamed as a prediction, without self-citation chains justifying uniqueness, and without any equation reducing the claimed improvement to the input selection rule by definition. The reported gains are presented as experimental outcomes of applying the objective, not quantities forced by the method's own parameterization. The central assumption about transferability is an empirical hypothesis, not a definitional equivalence, so the derivation chain remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; no explicit free parameters, invented entities, or non-standard axioms are stated in the provided text.

axioms (1)
  • domain assumption: Rollouts can be labeled correct or incorrect via an external reward or verification signal.
    Implicit in any RL setup for reasoning models; required for defining positive and negative pairs.
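
Given such a correctness signal, the positive and negative sets the summary describes could be assembled roughly as below. This is a sketch under stated assumptions: the `Rollout` type and the `split_pairs` helper are hypothetical names, and every correct rollout from another domain is counted as a cross-domain positive because no finer transferability criterion is stated in the abstract.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    domain: str    # e.g. "math" or "code"
    correct: bool  # label from an external reward or verifier

def split_pairs(anchor_idx, rollouts):
    """Partition the other rollouts into cross-domain positives,
    intra-domain positives, and negatives, relative to one anchor."""
    anchor = rollouts[anchor_idx]
    cross_pos, intra_pos, negatives = [], [], []
    for i, rollout in enumerate(rollouts):
        if i == anchor_idx:
            continue
        if not rollout.correct:
            negatives.append(i)   # incorrect rollouts are pushed away
        elif rollout.domain == anchor.domain:
            intra_pos.append(i)   # in-domain consolidation
        else:
            cross_pos.append(i)   # cross-domain sharing
    return cross_pos, intra_pos, negatives
```

Everything downstream depends on the `correct` labels, which is why the ledger lists the verifier as the one required assumption.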

pith-pipeline@v0.9.1-grok · 5796 in / 1267 out tokens · 31525 ms · 2026-06-29T22:16:21.683267+00:00 · methodology

