Optimizing language models for inference time objectives using reinforcement learning. arXiv preprint arXiv:2503.19595

URLhttps://arxiv · 2025 · arXiv 2503.19595

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Finite-Time Regret Analysis of Retry-Aware Bandits

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

ReMax achieves the first sublinear regret bound for Gaussian rewards at M=2 by characterizing the optimal sampling distribution via an expected-improvement balance condition and separating saturation from underestimation effects.

What should post-training optimize? A test-time scaling law perspective

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.

Compute Aligned Training: Optimizing for Test Time Inference

cs.LG · 2026-04-27 · unverdicted · novelty 5.0 · 2 refs

Derives new loss functions for SFT and RL that optimize directly for test-time inference operators like aggregation or filtering, with empirical gains in scaling.

Polychromic Objectives for Reinforcement Learning

cs.LG · 2025-09-29 · unverdicted · novelty 5.0

Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Finite-Time Regret Analysis of Retry-Aware Bandits

cs.LG · 2026-05-20 · unverdicted · none · ref 13

What should post-training optimize? A test-time scaling law perspective

cs.LG · 2026-05-11 · unverdicted · none · ref 22

Compute Aligned Training: Optimizing for Test Time Inference

cs.LG · 2026-04-27 · unverdicted · none · ref 13 · 2 links
