and Ravikumar, Pradeep and Wainwright, Martin J
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role and citation-polarity summary
years: 2026 (3)
verdicts: UNVERDICTED (3)
roles: background (1)
polarities: background (1)
citing papers explorer
- Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization
  PBSD derives a reward-reweighted teacher distribution as the analytic optimum of a reward-regularized objective, yielding better stability and performance than KL-based self-distillation on math reasoning and tool-use tasks.
- Adaptive Estimation and Optimal Control in Offline Contextual MDPs without Stationarity
  A T-estimation-based procedure for adaptive density estimation and optimal control in offline contextual MDPs without stationarity, providing oracle risk bounds under two loss functions and finite-sample cost guarantees.
- Group-Aware Matrix Estimation and Latent Subspace Recovery
  GAME is a convex estimator using overlapping nuclear-norm penalties on subgroup submatrices for low-rank matrix completion with known overlapping groups, providing finite-sample guarantees on reconstruction error and subgroup subspace recovery.