pith. machine review for the scientific record.

arxiv: 1903.08894 · v1 · submitted 2019-03-21 · 💻 cs.LG · cs.AI

Recognition: unknown

Towards Characterizing Divergence in Deep Q-Learning

Authors on Pith: no claims yet
classification: 💻 cs.LG · cs.AI
keywords: divergence · q-learning · analysis · approximation · deep · algorithm · algorithms · control
Abstract

Deep Q-Learning (DQL), a family of temporal difference algorithms for control, employs three techniques collectively known as the 'deadly triad' in reinforcement learning: bootstrapping, off-policy learning, and function approximation. Prior work has demonstrated that together these can lead to divergence in Q-learning algorithms, but the conditions under which divergence occurs are not well-understood. In this note, we give a simple analysis based on a linear approximation to the Q-value updates, which we believe provides insight into divergence under the deadly triad. The central point in our analysis is to consider when the leading order approximation to the deep-Q update is or is not a contraction in the sup norm. Based on this analysis, we develop an algorithm which permits stable deep Q-learning for continuous control without any of the tricks conventionally used (such as target networks, adaptive gradient optimizers, or using multiple Q functions). We demonstrate that our algorithm performs above or near state-of-the-art on standard MuJoCo benchmarks from the OpenAI Gym.
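The sup-norm condition in the abstract is concrete enough to probe numerically. The following is a minimal sketch, not the paper's algorithm or code: it assumes a linearized policy-evaluation setting in which the leading-order update on Q-value errors takes the form (I - alpha*K*D*(I - gamma*P)), with K a Gram matrix of Q-gradients, D the diagonal sampling distribution, and P a fixed transition matrix, and it checks whether the induced infinity-norm of that operator stays below one. All names, sizes, and the evaluation-only simplification are assumptions made for illustration.

```python
import numpy as np

# A minimal sketch (not the paper's code) of the leading-order analysis the
# abstract describes. Hypothetical, simplified setting:
#   - n state-action pairs in a toy, tabular-sized problem
#   - K: Gram matrix of Q-gradients, K[i, j] = grad Q(x_i) . grad Q(x_j),
#     i.e. the neural-tangent-style kernel at the current parameters
#   - D: diagonal matrix of the sampling distribution rho
#   - P: row-stochastic transition matrix under a fixed target policy
#     (evaluation case, so the linearized update is a fixed matrix)
# At leading order, the Q-value error after one update is
#   delta' = (I - alpha * K @ D @ (I - gamma * P)) @ delta,
# and the update contracts in sup norm iff the induced infinity-norm
# (maximum absolute row sum) of that matrix is below 1.

def supnorm_contraction_coeff(K, rho, P, alpha, gamma):
    """Induced infinity-norm of the linearized Q update operator."""
    n = K.shape[0]
    A = np.eye(n) - alpha * K @ np.diag(rho) @ (np.eye(n) - gamma * P)
    return np.linalg.norm(A, ord=np.inf)  # max absolute row sum

rng = np.random.default_rng(0)
n, gamma = 8, 0.99
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)  # row-stochastic
rho = np.full(n, 1.0 / n)                                  # uniform sampling

# Tabular case: K = I recovers exact TD evaluation; with alpha * rho_i = 1
# the operator is gamma * P, a sup-norm contraction with coefficient gamma.
print(supnorm_contraction_coeff(np.eye(n), rho, P, alpha=float(n), gamma=gamma))

# Generalization case: a random Gram matrix with large off-diagonal terms
# typically pushes the coefficient above 1, i.e. the divergence regime.
G = rng.normal(size=(n, n))
print(supnorm_contraction_coeff(G @ G.T, rho, P, alpha=0.5, gamma=gamma))
```

In the tabular case the coefficient reduces to gamma, matching the classical contraction argument; once K couples distinct state-action pairs strongly, the coefficient can exceed one even though every tabular ingredient is benign, which is the divergence regime the abstract points at.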

This paper has not been read by Pith yet.

Discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AdamO: A Collapse-Suppressed Optimizer for Offline RL

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.