MCPO applies contrastive learning to GRPO-style RL by treating cross-domain correct rollouts as positives and incorrect ones as negatives to improve multi-domain reasoning performance in LRMs.
Rewards as labels: Revisiting RLVR from a classification perspective.CoRR, abs/2602.05630, 2026
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.CL 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models
MCPO applies contrastive learning to GRPO-style RL by treating cross-domain correct rollouts as positives and incorrect ones as negatives to improve multi-domain reasoning performance in LRMs.