Title resolution pending

B Mathematical Analysis of Stability, Generalization in Robust Bellman PPO, DVPO This section provides a rigorous mathematical derivation of the training dynamics for Standard PPO, Robust Bellman PPO, our DVPO framework · 2025

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

browse 1 citing papers

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

representative citing papers

DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training

cs.LG · 2025-12-03 · unverdicted · novelty 5.0

DVPO learns token-level value distributions and uses asymmetric risk regularization to contract lower tails while expanding upper tails, outperforming PPO and GRPO under noisy supervision in dialogue, math, and QA tasks.

citing papers explorer

Showing 1 of 1 citing paper.

DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training cs.LG · 2025-12-03 · unverdicted · none · ref 32
DVPO learns token-level value distributions and uses asymmetric risk regularization to contract lower tails while expanding upper tails, outperforming PPO and GRPO under noisy supervision in dialogue, math, and QA tasks.

Title resolution pending

fields

years

verdicts

representative citing papers

citing papers explorer