
arXiv: 1906.10025 · v2 · pith:O6DHDJFYnew · submitted 2019-06-24 · 💻 cs.LG · cs.AI · stat.ML

Modern Deep Reinforcement Learning Algorithms

Pith reviewed 2026-05-25 17:33 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords deep reinforcement learning · DRL algorithms · survey · theoretical justification · practical limitations · empirical properties · reinforcement learning

The pith

Combining classical reinforcement learning theory with deep neural networks produces algorithms that solve complex decision tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews the latest deep reinforcement learning algorithms that arise from merging classical RL results with the deep learning paradigm. This merger has led to breakthroughs in many AI tasks. The review focuses on theoretical justifications for these algorithms, their practical limitations, and the empirical properties observed in experiments. A sympathetic reader would care because it supplies a structured overview of the DRL field as it stood at the time of writing.

Core claim

Recent advances in Reinforcement Learning, grounded in combining classical theoretical results with the Deep Learning paradigm, led to breakthroughs in many artificial intelligence tasks and gave birth to Deep Reinforcement Learning (DRL) as a field of research. The work reviews the latest DRL algorithms with emphasis on their theoretical basis, practical constraints, and empirical behavior.

What carries the argument

The integration of classical RL theory with deep neural networks as the mechanism for scaling to complex problems.
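
As a rough illustration of that integration (a sketch under stated assumptions, not code from the reviewed paper), the classical one-step Q-learning target can be used as a regression target for a parametric Q-function. Here a linear Q-function stands in for a deep network so the example needs no libraries. The function names `q_values`, `td_target` and `sgd_step`, the toy transition, and all hyperparameter values are hypothetical.

```python
# Illustrative sketch only; not taken from the reviewed paper.
# A linear Q-function stands in for a deep network to stay dependency-free.

def q_values(weights, state):
    """Q(s, a) for every action: one weight vector per action, dotted with the state."""
    return [sum(w * x for w, x in zip(w_a, state)) for w_a in weights]

def td_target(reward, next_state, weights, gamma=0.99, done=False):
    """Classical one-step Q-learning target: r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(q_values(weights, next_state))

def sgd_step(weights, state, action, target, lr=0.1):
    """One gradient step on 0.5 * (target - Q(s, a))^2, with the target held constant."""
    error = target - q_values(weights, state)[action]
    new_weights = [list(w_a) for w_a in weights]
    new_weights[action] = [w + lr * error * x for w, x in zip(weights[action], state)]
    return new_weights

# Hypothetical transition (s, a, r, s') with two state features and two actions.
weights = [[0.0, 0.0], [0.0, 0.0]]
state, action, reward, next_state = [1.0, 0.0], 1, 1.0, [0.0, 1.0]
target = td_target(reward, next_state, weights, gamma=0.9)
weights = sgd_step(weights, state, action, target, lr=0.5)
```

In a deep variant the weight lists would be replaced by network parameters and the manual update by an optimizer step. The structure of the target stays the same.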

Load-bearing premise

The algorithms selected for review are the most representative and important ones in the field at the time of writing.

What would settle it

Discovery of a major DRL algorithm from the review period that was omitted or whose reported theoretical and empirical properties were misstated.

Figures

Figures reproduced from arXiv: 1906.10025 by Alexander D'yakonov, Sergey Ivanov.

Figure 1: Network used for Atari Pong. All activation functions are ReLU. For Rainbow the fully-connected layer
Figure 2: Training curves of vanilla and accelerated version of value-based algorithms on 1M steps of Pong.
Figure 3: Training curves of vanilla and accelerated version of value-based algorithms on 1M steps of Pong from
Figure 4: Training curves of all algorithms on 1M steps of Pong.
Figure 5: Training curves of all algorithms on 1M steps of Pong from wall-clock time.
Figure 6: DQN loss behaviour during training on Pong.
Figure 7: Loss behaviours of c51, QR-DQN and Rainbow during training on Pong.
Figure 8: Rainbow statistics during training. Left: smoothed with window 1000 median of importance sampling
Figure 9: A2C loss behaviour during training.
Figure 10: PPO loss behaviour during training.
Figure 11: DQN playing one episode of Pong.
Figure 12: c51 playing one episode of Pong.
Figure 13: c51 value distribution prediction during one episode of Pong.
Figure 14: Quantile Regression DQN playing one episode of Pong.
Figure 15: Quantile Regression DQN value distribution prediction during one episode of Pong.
Figure 16: Rainbow playing one episode of Pong (exploration turned o
Figure 17: Rainbow value distribution prediction during one episode of Pong (exploration turned o
Figure 18: A2C playing one episode of Pong.
Figure 19: A2C policy distribution during one episode of Pong.
Figure 20: PPO playing one episode of Pong.
Figure 21: PPO policy distribution during one episode of Pong.

Original abstract

Recent advances in Reinforcement Learning, grounded on combining classical theoretical results with Deep Learning paradigm, led to breakthroughs in many artificial intelligence tasks and gave birth to Deep Reinforcement Learning (DRL) as a field of research. In this work latest DRL algorithms are reviewed with a focus on their theoretical justification, practical limitations and observed empirical properties.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that recent advances in Reinforcement Learning, grounded on combining classical theoretical results with the Deep Learning paradigm, have led to breakthroughs in many AI tasks and the emergence of Deep Reinforcement Learning (DRL) as a field. It reviews the latest DRL algorithms with a focus on their theoretical justification, practical limitations, and observed empirical properties.

Significance. If the selected algorithms are representative and the summaries of theory, limitations, and empirics are accurate, the survey could provide a useful synthesis for researchers navigating DRL, particularly by connecting classical RL foundations to modern deep methods. The descriptive framing (no new derivations or predictions) makes the contribution dependent on coverage and fidelity rather than novelty of results.

major comments (1)
  1. [Abstract] The claim to review 'latest DRL algorithms' is load-bearing for the paper's utility as a field map, yet no explicit selection criteria or justification for the chosen set of algorithms is provided; this risks omitting representative methods or over-weighting others, directly affecting the central descriptive claim.
minor comments (2)
  1. Ensure that empirical properties discussed for each algorithm are tied to specific cited experiments or benchmarks rather than general statements.
  2. Clarify the publication cutoff date for 'latest' algorithms to allow readers to assess timeliness of the review.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and the recommendation for minor revision. The single major comment identifies a genuine gap in the manuscript's framing as a survey, and we address it directly below.

Point-by-point responses
  1. Referee: [Abstract] The claim to review 'latest DRL algorithms' is load-bearing for the paper's utility as a field map, yet no explicit selection criteria or justification for the chosen set of algorithms is provided; this risks omitting representative methods or over-weighting others, directly affecting the central descriptive claim.

    Authors: We agree that the absence of explicit selection criteria weakens the paper's utility as a field map. In the revised version we will add a short subsection (likely in the introduction) that states the criteria: (i) recency, with emphasis on algorithms published or popularized after the 2013-2015 deep-RL breakthroughs; (ii) demonstrated impact, measured by subsequent citations and influence on follow-up work; and (iii) coverage of distinct algorithmic families (value-based, policy-gradient, actor-critic, model-based). We will also note that the survey is necessarily non-exhaustive and flag a few prominent omissions (e.g., certain offline RL or meta-RL methods) with brief justification. This addition directly responds to the referee's concern without altering the descriptive nature of the contribution.

    Revision: yes

Circularity Check

0 steps flagged

Survey paper with no derivations or self-referential predictions

Full rationale

This is a literature review surveying recent DRL algorithms and their properties. No original derivations, equations, fitted parameters, or predictive claims are made that could reduce to the paper's own inputs by construction. The central claim is a descriptive review of selected algorithms with attention to theory, limitations, and empirics; this holds independently of any self-citation and does not invoke uniqueness theorems, ansatzes, or renamings of results. No load-bearing steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a review paper the manuscript introduces no free parameters, axioms, or invented entities of its own; all content is drawn from previously published algorithms and results.


