Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
Optimizing language models for inference time objectives using reinforcement learning
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.LG 4roles
background 1polarities
background 1representative citing papers
Derives new loss functions for SFT and RL that optimize directly for test-time inference operators like aggregation or filtering, with empirical gains in scaling.
Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.
citing papers explorer
-
What should post-training optimize? A test-time scaling law perspective
Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
-
Compute Aligned Training: Optimizing for Test Time Inference
Derives new loss functions for SFT and RL that optimize directly for test-time inference operators like aggregation or filtering, with empirical gains in scaling.
-
Polychromic Objectives for Reinforcement Learning
Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.
- Finite-Time Regret Analysis of Retry-Aware Bandits