VRPO: Rethinking Value Modeling for Robust RL under Noisy Supervision in LLM Post-Training

Caishuang Huang; Chenhao Huang; Dingwei Zhu; Enyu Zhou; Guoqiang Zhang; Jiazheng Zhang; Junjie Ye; Mingxu Chai; Ming Zhang; Qi Zhang

arxiv: 2508.03058 · v2 · pith:U4EW5DVJnew · submitted 2025-08-05 · cs.LG · cs.AI · cs.CL


Dingwei Zhu, Shihan Dou, Zhiheng Xi, Senjie Jin, Guoqiang Zhang, Jiazheng Zhang, Junjie Ye, Mingxu Chai, Enyu Zhou, Ming Zhang, Yuhui Wang, Caishuang Huang, Chenhao Huang, Yunke Zhang, Yuran Wang, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang


classification cs.LG cs.AI cs.CL

keywords model, value, supervision, noise, robust, vrpo, noisy, advantage


Reinforcement Learning (RL) in real-world environments often suffers from ambiguous or incomplete reward supervision, which undermines policy stability and generalization. Such noise may cause models to ignore key information or even cause advantage estimation to collapse. We find that a strong value model is essential for absorbing unstable signals and producing reliable advantages, offering denser and more robust supervision than the reward model. To better optimize under noisy supervision, we propose VRPO, a framework that enhances value modeling for robust RL in LLM post-training. VRPO integrates (1) auxiliary losses guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck, enabling the value model to filter noise and capture key words. This design allows the value model to correct noisy rewards and generate more reliable advantage estimates, transforming it from a passive predictor into an active noise regulator. Experiments on multi-turn dialogue, math reasoning, and science QA with both rule-based and model-based rewards show that VRPO consistently outperforms baselines such as PPO and GRPO. Our work highlights the central role of the value model in robust RL and provides a principled and practical approach to policy optimization under noisy supervision.
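The abstract names a variational information bottleneck but does not give its form. As a rough illustration only, the sketch below assumes the common Gaussian formulation: the value model encodes its input as a diagonal Gaussian, and a KL penalty toward a standard normal prior is added to the value regression loss. The function names, the squared-error value loss, and the `beta` weight are hypothetical and are not taken from the paper.

```python
import math


def gaussian_kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mean, log_var)
    )


def vib_value_loss(value_pred, value_target, mean, log_var, beta=1e-3):
    """Squared-error value loss plus a beta-weighted compression penalty.

    `beta` (an assumed hyperparameter) trades prediction accuracy against
    how much information the latent code keeps about the input.
    """
    task_loss = (value_pred - value_target) ** 2
    return task_loss + beta * gaussian_kl_to_standard_normal(mean, log_var)


if __name__ == "__main__":
    # A latent that matches the prior incurs no compression penalty.
    print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
    # A shifted latent is penalized in proportion to beta.
    print(vib_value_loss(1.0, 0.0, [1.0], [0.0], beta=0.1))
```

Under this assumption, a larger `beta` pushes the latent code toward the prior, which is one way a value model could be discouraged from memorizing noisy reward signals.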



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Explicit Critic Guidance for Aligning Diffusion Models
cs.LG 2026-05 unverdicted novelty 7.0

Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.

Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.

Modularized Reinforcement Learning on LLMs: From MDP Creation to Exploration and Learning
cs.LG 2026-06 unverdicted novelty 5.0

Survey mapping RL techniques onto LLM training and highlighting gaps in value-based, off-policy, and bootstrapping methods.

Trust Region On-Policy Distillation
cs.LG 2026-05 unverdicted novelty 5.0

TrOPD stabilizes on-policy distillation for LLMs with trust-region learning, outlier estimation, and off-policy guidance, outperforming prior OPD methods on reasoning and code benchmarks.