Continuous-Time Robust Dynamic Programming
1 Pith paper cites this work. Polarity classification is still indexing.
abstract
This paper presents a new theory, known as robust dynamic programming, for a class of continuous-time dynamical systems. Different from traditional dynamic programming (DP) methods, this new theory serves as a fundamental tool to analyze the robustness of DP algorithms, and in particular, to develop novel adaptive optimal control and reinforcement learning methods. In order to demonstrate the potential of this new framework, four illustrative applications in the fields of stochastic optimal control and adaptive DP are presented. Three numerical examples arising from both finance and engineering industries are also given, along with several possible extensions of the proposed framework.
fields: cs.LG (1)
years: 2025 (1)
verdicts: UNVERDICTED (1)
representative citing papers
DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training
DVPO learns token-level value distributions and uses asymmetric risk regularization to contract lower tails while expanding upper tails, outperforming PPO and GRPO under noisy supervision in dialogue, math, and QA tasks.