A Additional Details for VRPO A.1 Pseudocode The full algorithm of VRPO is detailed in Algorithm

Q♯: Provably Optimal Distributional RL for LLM Post-Training · arXiv 2502.20548

1 Pith paper cites this work.


representative citing papers

DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training

cs.LG · 2025-12-03 · unverdicted · novelty 5.0

DVPO learns token-level value distributions and uses asymmetric risk regularization to contract lower tails while expanding upper tails, outperforming PPO and GRPO under noisy supervision in dialogue, math, and QA tasks.

