Taming the Noise in Reinforcement Learning via Soft Updates
Model-free reinforcement learning algorithms, such as Q-learning, perform poorly in the early stages of learning in noisy environments, because much effort is spent unlearning biased estimates of the state-action value function. The bias results from selecting, among several noisy estimates, the apparent optimum, which may actually be suboptimal. We propose G-learning, a new off-policy learning algorithm that regularizes the value estimates by penalizing deterministic policies in the beginning of the learning process. We show that this method reduces the bias of the value-function estimation, leading to faster convergence to the optimal value and the optimal policy. Moreover, G-learning enables the natural incorporation of prior domain knowledge, when available. The stochastic nature of G-learning also makes it avoid some exploration costs, a property usually attributed only to on-policy algorithms. We illustrate these ideas in several examples, where G-learning results in significant improvements of the convergence rate and the cost of the learning process.
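The bias described in the abstract can be reproduced in a few lines. The sketch below is illustrative only and is not the paper's G-learning update: it assumes ten actions whose true values are all zero, Gaussian noise on each estimate, a uniform prior over actions, and a hypothetical inverse-temperature parameter `beta`. It compares the hard maximum used by Q-learning with a soft (log-mean-exp) aggregate that approaches the plain average as `beta` goes to zero and the hard maximum as `beta` grows.

```python
import math
import random


def hard_max(values):
    """Greedy aggregate used by Q-learning: pick the apparent optimum."""
    return max(values)


def soft_value(values, beta):
    """Log-mean-exp aggregate under a uniform prior (illustrative assumption).

    Equals (1/beta) * log(mean(exp(beta * v))), computed stably by
    subtracting the maximum before exponentiating.
    """
    m = max(values)
    mean_exp = sum(math.exp(beta * (v - m)) for v in values) / len(values)
    return m + math.log(mean_exp) / beta


def estimate_bias(n_actions=10, noise_std=1.0, beta=0.1, trials=2000, seed=0):
    """Average hard and soft aggregates of noisy estimates of true value 0.

    Because every true value is 0, the averages returned here are
    direct estimates of each aggregate's bias.
    """
    rng = random.Random(seed)
    hard_total = 0.0
    soft_total = 0.0
    for _ in range(trials):
        noisy = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        hard_total += hard_max(noisy)
        soft_total += soft_value(noisy, beta)
    return hard_total / trials, soft_total / trials


if __name__ == "__main__":
    hard_bias, soft_bias = estimate_bias()
    print(f"hard-max bias: {hard_bias:.3f}")
    print(f"soft bias (beta=0.1): {soft_bias:.3f}")
```

With a small `beta` the soft aggregate stays close to the true value, while the hard maximum is biased upward by more than one noise standard deviation. Raising `beta` over the course of learning would move the soft aggregate toward the hard maximum, which mirrors the abstract's idea of penalizing deterministic policies only at the beginning of learning.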
Forward citations
Cited by 6 Pith papers
-
Multi-Armed Sampling Problem and the End of Exploration
Multi-armed sampling framework shows near-optimal regret is achievable with minimal exploration, unlike bandits, and unifies both via a continuous temperature family.
-
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
The paper establishes the first $\tilde{O}(\epsilon^{-1})$ upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general functi...
-
Soft $Q(\lambda)$: A multi-step off-policy method for entropy regularised reinforcement learning using eligibility traces
Soft Q(λ) unifies an n-step formulation of soft Q-learning with a novel Soft Tree Backup operator into an online off-policy eligibility trace framework for learning entropy-regularized value functions.
-
Training Language Models to Self-Correct via Reinforcement Learning
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
-
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.
-
Failure Modes of Maximum Entropy RLHF
Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.