Distributional Reinforcement Learning with Quantile Regression

Will Dabney, Mark Rowland, Marc G. Bellemare, Rémi Munos

classification cs.AI cs.LG stat.ML

keywords learning reinforcement distributional algorithm distribution results value actions

In reinforcement learning an agent interacts with the environment by taking actions and observing the next state and reward. When sampled probabilistically, these state transitions, rewards, and actions can all induce randomness in the observed long-term return. Traditionally, reinforcement learning algorithms average over this randomness to estimate the value function. In this paper, we build on recent work advocating a distributional approach to reinforcement learning in which the distribution over returns is modeled explicitly instead of only estimating the mean. That is, we examine methods of learning the value distribution instead of the value function. We give results that close a number of gaps between the theoretical and algorithmic results given by Bellemare, Dabney, and Munos (2017). First, we extend existing results to the approximate distribution setting. Second, we present a novel distributional reinforcement learning algorithm consistent with our theoretical formulation. Finally, we evaluate this new algorithm on the Atari 2600 games, observing that it significantly outperforms many of the recent improvements on DQN, including the related distributional algorithm C51.
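
The abstract does not spell out the algorithm, but the title points to quantile regression as the tool for learning a return distribution. As a rough illustration only, the sketch below fits a handful of quantile estimates to sampled returns by subgradient descent on the quantile-regression (pinball) loss. It is not the paper's algorithm: there is no Bellman update, no neural network, and no Huber smoothing, and the function names, the midpoint quantile levels, the learning rate, and the step count are all assumptions chosen for this example.

```python
import numpy as np


def quantile_midpoints(n):
    """Quantile levels (2i + 1) / (2n) for i = 0..n-1 (an illustrative choice)."""
    return (2 * np.arange(n) + 1) / (2.0 * n)


def pinball_loss(theta, samples, taus):
    """Mean quantile-regression loss of the estimates theta on the samples."""
    u = samples[None, :] - theta[:, None]  # residuals, shape (n, num_samples)
    t = taus[:, None]
    # rho_tau(u) = tau * u if u >= 0, else (tau - 1) * u
    return float(np.mean(np.where(u >= 0, t * u, (t - 1.0) * u)))


def fit_quantiles(samples, n=5, lr=0.05, steps=2000):
    """Fit n quantile estimates to samples by subgradient descent."""
    taus = quantile_midpoints(n)
    theta = np.zeros(n)
    for _ in range(steps):
        u = samples[None, :] - theta[:, None]
        # Subgradient of the pinball loss with respect to theta,
        # averaged over the samples: -(tau - 1{u < 0}).
        grad = -np.mean(taus[:, None] - (u < 0), axis=1)
        theta -= lr * grad
    return taus, theta
```

Calling `fit_quantiles` on an array of sampled returns yields estimates that should settle near the empirical quantiles at the chosen levels, so the fitted vector describes the spread of the returns rather than only their mean.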

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Towards Affordable Energy: A Gymnasium Environment for Electric Utility Demand-Response Programs
cs.AI 2026-05 unverdicted novelty 7.0

DR-Gym is a new Gymnasium-compatible simulator for training utility demand-response policies with regime-switching wholesale prices and physics-based building demand.

Mastering Atari with Discrete World Models
cs.LG 2020-10 accept novelty 7.0

DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.

What Does Flow Matching Bring To TD Learning?
cs.LG 2026-03 conditional novelty 6.0

Flow matching critics outperform monolithic ones in RL, with 2x the performance and 5x the sample efficiency, via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.