Optimizing language models for inference time objectives using reinforcement learning

URL: https://arxiv · 2025 · arXiv 2503.19595

4 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

What should post-training optimize? A test-time scaling law perspective

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.

Compute Aligned Training: Optimizing for Test Time Inference

cs.LG · 2026-04-27 · unverdicted · novelty 5.0 · 2 refs

Derives new loss functions for SFT and RL that optimize directly for test-time inference operators like aggregation or filtering, with empirical gains in scaling.

Polychromic Objectives for Reinforcement Learning

cs.LG · 2025-09-29 · unverdicted · novelty 5.0

Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.

Finite-Time Regret Analysis of Retry-Aware Bandits

cs.LG · 2026-05-20

citing papers explorer

Showing 4 of 4 citing papers.

What should post-training optimize? A test-time scaling law perspective · cs.LG · 2026-05-11 · unverdicted · none · ref 22
Compute Aligned Training: Optimizing for Test Time Inference · cs.LG · 2026-04-27 · unverdicted · none · ref 13 · 2 links
Polychromic Objectives for Reinforcement Learning · cs.LG · 2025-09-29 · unverdicted · none · ref 40
Finite-Time Regret Analysis of Retry-Aware Bandits · cs.LG · 2026-05-20 · unreviewed · ref 13
