OrderGrad supplies unbiased likelihood-ratio and reparameterization gradient estimators for finite-sample L-statistics by applying a rank-based reward transformation usable in standard policy-gradient updates.
Learning to reason as action abstractions with scalable mid-training rl.arXiv preprint arXiv:2509.25810, 2025
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.LG 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
Proposes MaxPO using a Leave-Two-Out baseline for centered unbiased advantages in max@K policy gradients, with a unified derivation of finite-batch estimators.
citing papers explorer
-
OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation
OrderGrad supplies unbiased likelihood-ratio and reparameterization gradient estimators for finite-sample L-statistics by applying a rank-based reward transformation usable in standard policy-gradient updates.
-
On Advantage Estimates for Max@K Policy Gradients
Proposes MaxPO using a Leave-Two-Out baseline for centered unbiased advantages in max@K policy gradients, with a unified derivation of finite-batch estimators.