Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
Spectral decomposition of the logit Jacobian yields an adaptive MSA with linear convergence and a tractable Newton method for path-based SUE, with reported speedups on networks up to Chicago Regional size.
Citing papers explorer
- Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
- Spectral analysis of the logit mapping and implications for stochastic user equilibrium algorithms
Spectral decomposition of the logit Jacobian yields an adaptive MSA with linear convergence and a tractable Newton method for path-based SUE, with reported speedups on networks up to Chicago Regional size.