pith. sign in

arxiv: 2508.03058 · v2 · pith:U4EW5DVJnew · submitted 2025-08-05 · 💻 cs.LG · cs.AI· cs.CL

VRPO: Rethinking Value Modeling for Robust RL under Noisy Supervision in LLM Post-Training

classification 💻 cs.LG cs.AIcs.CL
keywords modelvaluesupervisionnoiserobustvrponoisyadvantage
0
0 comments X
read the original abstract

Reinforcement Learning (RL) in real-world environments often suffers from ambiguous or incomplete reward supervision, which undermines policy stability and generalization. Such noise may cause models to ignore key information or even collapse in advantage estimation. We find that a strong value model is essential for absorbing unstable signals and producing reliable advantages, offering denser and more robust supervision than the reward model. To better optimize noisy supervision, we propose VRPO, a framework that enhances value modeling for robust RL in LLM post-training. VRPO integrates (1) auxiliary losses guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck, enabling the value model to filter noise and capture key words. This design allows the value model to correct noise rewards and generate more reliable advantage estimates, transforming it from a passive predictor into an active noise regulator. Experiments on multi-turn dialogue, math reasoning, and science QA with both rule-based and model-based rewards show that VRPO consistently outperforms baselines such as PPO and GRPO. Our work highlight the central role of the value model in Robust RL and provide a principled and practical approach to policy optimization under noisy supervision.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Explicit Critic Guidance for Aligning Diffusion Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.

  2. Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.

  3. Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning

    cs.LG 2026-06 unverdicted novelty 5.0

    Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.

  4. Trust Region On-Policy Distillation

    cs.LG 2026-05 unverdicted novelty 5.0

    TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.