
arxiv: 1512.08562 · v4 · pith:QOA5QNG3new · submitted 2015-12-28 · 💻 cs.LG · cs.IT · math.IT

Taming the Noise in Reinforcement Learning via Soft Updates

classification: cs.LG · cs.IT · math.IT
keywords: learning · G-learning · estimates · value · algorithms · bias · convergence · noisy

Model-free reinforcement learning algorithms, such as Q-learning, perform poorly in the early stages of learning in noisy environments, because much effort is spent unlearning biased estimates of the state-action value function. The bias results from selecting, among several noisy estimates, the apparent optimum, which may actually be suboptimal. We propose G-learning, a new off-policy learning algorithm that regularizes the value estimates by penalizing deterministic policies in the beginning of the learning process. We show that this method reduces the bias of the value-function estimation, leading to faster convergence to the optimal value and the optimal policy. Moreover, G-learning enables the natural incorporation of prior domain knowledge, when available. The stochastic nature of G-learning also makes it avoid some exploration costs, a property usually attributed only to on-policy algorithms. We illustrate these ideas in several examples, where G-learning results in significant improvements of the convergence rate and the cost of the learning process.
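
The selection bias described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's G-learning algorithm. It only compares a hard maximum over noisy value estimates with a soft (log-mean-exp) alternative under a uniform prior over actions. The reward-maximization sign convention, the function names, the number of actions, the noise level, and the inverse temperature `beta` are all assumptions chosen for illustration.

```python
import math
import random


def hard_max(estimates):
    """Value of the apparent optimum among noisy estimates."""
    return max(estimates)


def soft_max_value(estimates, beta):
    """(1/beta) * log of the uniform-prior average of exp(beta * q).

    Small beta approaches the plain average; large beta approaches max.
    """
    m = max(estimates)
    # Subtract the max before exponentiating for numerical stability.
    mean_exp = sum(math.exp(beta * (q - m)) for q in estimates) / len(estimates)
    return m + math.log(mean_exp) / beta


def average_bias(value_fn, n_actions=10, n_trials=5000, noise_std=1.0, seed=0):
    """Mean of value_fn over noisy estimates when every true value is 0.

    Because the true optimal value is 0 in this toy setup (an assumption),
    the returned mean is the bias of the estimator.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        estimates = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += value_fn(estimates)
    return total / n_trials


hard_bias = average_bias(hard_max)
soft_bias = average_bias(lambda q: soft_max_value(q, beta=0.1))
print(f"hard-max bias: {hard_bias:.3f}")
print(f"soft-max bias (beta=0.1): {soft_bias:.3f}")
```

In this toy setting the hard maximum overestimates the true value substantially, while the soft operator with a small `beta` stays close to it. A soft operator can also underestimate when the true action values differ, so the choice of `beta` trades one kind of error against the other.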


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Multi-Armed Sampling Problem and the End of Exploration

    cs.LG 2025-07 conditional novelty 8.0

    Multi-armed sampling framework shows near-optimal regret is achievable with minimal exploration, unlike bandits, and unifies both via a continuous temperature family.

  2. Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

    cs.LG 2026-05 unverdicted novelty 7.0

    The paper establishes the first $\tilde{O}(\epsilon^{-1})$ upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general functi...

  3. Soft $Q(\lambda)$: A multi-step off-policy method for entropy regularised reinforcement learning using eligibility traces

    cs.LG 2026-04 unverdicted novelty 6.0

    Soft Q(λ) unifies an n-step formulation of soft Q-learning with a novel Soft Tree Backup operator into an online off-policy eligibility trace framework for learning entropy-regularized value functions.

  4. Training Language Models to Self-Correct via Reinforcement Learning

    cs.LG 2024-09 unverdicted novelty 6.0

    SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

  5. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

    cs.CL 2023-09 conditional novelty 6.0

    RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.

  6. Failure Modes of Maximum Entropy RLHF

    cs.LG 2025-09 unverdicted novelty 5.0

    Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.