BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
Phasic policy gradient
3 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
GAE suffers from amplified variance in imperfect-info self-play RL; VRPO with Q-boosting and multi-step Expected SARSA(λ) reduces it and improves performance on mid-to-large games.
citing papers explorer
-
Bounded Ratio Reinforcement Learning
BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
-
Mastering Diverse Domains through World Models
DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
-
GAE Falls Short in Imperfect-Information Self-Play Reinforcement Learning
GAE suffers from amplified variance in imperfect-info self-play RL; VRPO with Q-boosting and multi-step Expected SARSA(λ) reduces it and improves performance on mid-to-large games.