POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization. It matches or exceeds GRPO on math benchmarks, reaching, for example, 36.67% on AIME 2025.
CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR
2 Pith papers cite this work. Polarity classification is still indexing.