Modern Deep Reinforcement Learning Algorithms

Alexander D'yakonov; Sergey Ivanov

arXiv: 1906.10025 · v2 · pith:O6DHDJFYnew · submitted 2019-06-24 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-25 17:33 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords deep reinforcement learning · DRL algorithms · survey · theoretical justification · practical limitations · empirical properties · reinforcement learning

The pith

Combining classical reinforcement learning theory with deep neural networks produces algorithms that solve complex decision tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews the latest deep reinforcement learning algorithms that arise from merging classical RL results with the deep learning paradigm. This merger has led to breakthroughs in many AI tasks. The review focuses on theoretical justifications for these algorithms, their practical limitations, and the empirical properties observed in experiments. A sympathetic reader would care because it supplies a structured overview of the DRL field as it stood at the time of writing.

Core claim

Recent advances in Reinforcement Learning, grounded in combining classical theoretical results with the Deep Learning paradigm, led to breakthroughs in many artificial intelligence tasks and gave birth to Deep Reinforcement Learning (DRL) as a field of research. The work reviews the latest DRL algorithms with emphasis on their theoretical basis, practical constraints, and empirical behavior.

What carries the argument

The integration of classical RL theory with deep neural networks as the mechanism for scaling to complex problems.

Load-bearing premise

The algorithms selected for review are the most representative and important ones in the field at the time of writing.

What would settle it

Discovery of a major DRL algorithm from the review period that was omitted or whose reported theoretical and empirical properties were misstated.

Figures

Figures reproduced from arXiv: 1906.10025 by Alexander D'yakonov, Sergey Ivanov.

**Figure 2.** Training curves of vanilla and accelerated version of value-based algorithms on 1M steps of Pong.

**Figure 3.** Training curves of vanilla and accelerated version of value-based algorithms on 1M steps of Pong from …

**Figure 4.** Training curves of all algorithms on 1M steps of Pong.

**Figure 5.** Training curves of all algorithms on 1M steps of Pong from wall-clock time.

**Figure 6.** DQN loss behaviour during training on Pong.

**Figure 7.** Loss behaviours of c51, QR-DQN and Rainbow during training on Pong.

**Figure 8.** Rainbow statistics during training. Left: smoothed with window 1000 median of importance sampling …

**Figure 9.** A2C loss behaviour during training. [Plot text: "Proximal Policy Optimization loss behaviour"; x-axis: network update step; y-axis: loss; legend: Actor loss, Critic loss, Entropy loss]

**Figure 10.** PPO loss behaviour during training.

**Figure 11.** DQN playing one episode of Pong. [Plot text: "c51 playing Pong"; x-axis: episode step; y-axis: state value; legend: Predicted V(s), Reward-to-go, losses, wins]

**Figure 12.** c51 playing one episode of Pong. [Plot text: "c51 value distribution during one played episode"; x-axis: episode step; y-axis: state value]

**Figure 13.** c51 value distribution prediction during one episode of Pong.

**Figure 14.** Quantile Regression DQN playing one episode of Pong.

**Figure 15.** Quantile Regression DQN value distribution prediction during one episode of Pong.

**Figure 16.** Rainbow playing one episode of Pong (exploration turned o…

**Figure 17.** Rainbow value distribution prediction during one episode of Pong (exploration turned o…

**Figure 18.** A2C playing one episode of Pong. [Plot text: "A2C policy during one played episode"; x-axis: episode step; y-axis: actions (NOOP, FIRE, RIGHT, LEFT, RIGHTFIRE, LEFTFIRE)]

**Figure 19.** A2C policy distribution during one episode of Pong.

**Figure 20.** PPO playing one episode of Pong. [Plot text: "PPO policy during one played episode"; x-axis: episode step; y-axis: actions (NOOP, FIRE, RIGHT, LEFT, RIGHTFIRE, LEFTFIRE)]

**Figure 21.** PPO policy distribution during one episode of Pong.

Original abstract

Recent advances in Reinforcement Learning, grounded on combining classical theoretical results with Deep Learning paradigm, led to breakthroughs in many artificial intelligence tasks and gave birth to Deep Reinforcement Learning (DRL) as a field of research. In this work latest DRL algorithms are reviewed with a focus on their theoretical justification, practical limitations and observed empirical properties.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This is a plain survey of DRL algorithms that restates existing results with attention to theory and practice but adds no new findings of its own.

The paper reviews a selection of deep reinforcement learning algorithms and tries to tie their theoretical roots to observed behavior and known limitations. It walks through methods that combine classical RL with neural networks, such as variants of policy optimization and value-based approaches, and notes where they have succeeded or run into trouble in reported experiments. That kind of organized recap can be handy for someone who needs a map of the main ideas without hunting through dozens of original papers. The abstract frames the scope clearly enough, and the stress-test note is right that the central claim is just that the review was performed on the chosen algorithms, not that the list is complete or optimal. The soft spot is the usual one for surveys: everything depends on whether the selection is balanced and the summaries are accurate. If key methods are left out or the practical limitations are described too loosely, the piece becomes less reliable as a reference. No new derivations or data are presented, so there is no circularity or fitting issue to worry about. This is the sort of paper a student or practitioner might consult to get oriented in the area around 2019, but it will not change how anyone does research. A serious editor should send it to referees because a competent survey that actually delivers on its stated scope is still worth the time even without original results.

Referee Report

1 major / 2 minor

Summary. The paper claims that recent advances in Reinforcement Learning, grounded on combining classical theoretical results with the Deep Learning paradigm, have led to breakthroughs in many AI tasks and the emergence of Deep Reinforcement Learning (DRL) as a field. It reviews the latest DRL algorithms with a focus on their theoretical justification, practical limitations, and observed empirical properties.

Significance. If the selected algorithms are representative and the summaries of theory, limitations, and empirics are accurate, the survey could provide a useful synthesis for researchers navigating DRL, particularly by connecting classical RL foundations to modern deep methods. The descriptive framing (no new derivations or predictions) makes the contribution dependent on coverage and fidelity rather than novelty of results.

major comments (1)

[Abstract] Abstract: the claim to review 'latest DRL algorithms' is load-bearing for the paper's utility as a field map, yet no explicit selection criteria or justification for the chosen set of algorithms is provided; this risks omitting representative methods or over-weighting others, directly affecting the central descriptive claim.

minor comments (2)

Ensure that empirical properties discussed for each algorithm are tied to specific cited experiments or benchmarks rather than general statements.
Clarify the publication cutoff date for 'latest' algorithms to allow readers to assess timeliness of the review.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and the recommendation for minor revision. The single major comment identifies a genuine gap in the manuscript's framing as a survey, and we address it directly below.

Referee: [Abstract] Abstract: the claim to review 'latest DRL algorithms' is load-bearing for the paper's utility as a field map, yet no explicit selection criteria or justification for the chosen set of algorithms is provided; this risks omitting representative methods or over-weighting others, directly affecting the central descriptive claim.

Authors: We agree that the absence of explicit selection criteria weakens the paper's utility as a field map. In the revised version we will add a short subsection (likely in the introduction) that states the criteria: (i) recency, with emphasis on algorithms published or popularized after the 2013-2015 deep-RL breakthroughs; (ii) demonstrated impact, measured by subsequent citations and influence on follow-up work; and (iii) coverage of distinct algorithmic families (value-based, policy-gradient, actor-critic, model-based). We will also note that the survey is necessarily non-exhaustive and flag a few prominent omissions (e.g., certain offline RL or meta-RL methods) with brief justification. This addition directly responds to the referee's concern without altering the descriptive nature of the contribution.

Revision: yes

Circularity Check

0 steps flagged

Survey paper with no derivations or self-referential predictions

This is a literature review surveying recent DRL algorithms and their properties. No original derivations, equations, fitted parameters, or predictive claims are made that could reduce to the paper's own inputs by construction. The central claim is a descriptive review of selected algorithms with attention to theory, limitations, and empirics; this holds independently of any self-citation and does not invoke uniqueness theorems, ansatzes, or renamings of results. No load-bearing steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a review paper the manuscript introduces no free parameters, axioms, or invented entities of its own; all content is drawn from previously published algorithms and results.

pith-pipeline@v0.9.0 · 5563 in / 982 out tokens · 19061 ms · 2026-05-25T17:33:32.702424+00:00 · methodology

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

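[1]

Select $a$ randomly with probability $\varepsilon(t)$, else $a = \operatorname{argmax}_a \sum_i z_i \zeta^*_i(s, a, \theta)$

[5]

For each transition $T$ from the batch compute target: $P\big(y(T) = r' + \gamma z_i\big) = \zeta^*_i\big(s', \operatorname{argmax}_{a'} \sum_i z_i \zeta^*_i(s', a', \theta^-), \theta^-\big)$

[7]

Compute loss: $\text{Loss} = \frac{1}{B} \sum_T \mathrm{KL}\big(y(T) \,\|\, Z^*(s, a, \theta)\big)$
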
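The three categorical DQN steps listed here (greedy action over distribution means, a target distribution, and a KL loss) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the target distributions are assumed to be already projected onto the same atoms $z_i$ as the predictions, and a real agent would compute the loss on network outputs with automatic differentiation.

```python
import math


def expected_value(atoms, probs):
    """Mean of a categorical return distribution: sum_i z_i * p_i."""
    return sum(z * p for z, p in zip(atoms, probs))


def greedy_action(atoms, probs_per_action):
    """Index of the action whose predicted return distribution has the largest mean."""
    means = [expected_value(atoms, probs) for probs in probs_per_action]
    return max(range(len(means)), key=means.__getitem__)


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two categorical distributions on the same atoms."""
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, predicted)
        if t > 0.0  # terms with zero target mass contribute nothing
    )


def categorical_loss(targets, predictions):
    """Batch mean of KL(y(T) || Z(s, a, theta)); targets are treated as constants."""
    return sum(kl_divergence(y, z) for y, z in zip(targets, predictions)) / len(targets)


# Toy data (assumed): three atoms, two actions, a batch of one transition.
atoms = [-1.0, 0.0, 1.0]
probs_per_action = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
best = greedy_action(atoms, probs_per_action)
loss = categorical_loss([[0.0, 0.5, 0.5]], [[0.2, 0.4, 0.4]])
```

The epsilon-greedy part of the action rule is left out on purpose; it would wrap `greedy_action` in a random choice with probability $\varepsilon(t)$.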
[9]

Bellman equation 21. Hence the last thing to do to design a practical algorithm is to develop a procedure of unbiased estimation of quantiles for the random variable on the right side of distribution

If $t \bmod K = 0$: $\theta^- \leftarrow \theta$

4.3. Quantile Regression DQN (QR-DQN)

Categorical DQN discovered a gap between theory and practice as KL-divergence, used in practical algorithm, is theoretically unjustified. Theorem 12 hints that the true divergence we should care about is actually Wasserstein metric, but it remained unclear how it could be optimized using only samples from transition probabilities T. In [3] it wa...

[10]

Select $a$ randomly with probability $\varepsilon(t)$, else $a = \operatorname{argmax}_a \frac{1}{A} \sum_i \zeta^*_i(s, a, \theta)$

[12]

Add observed transition to experience replay

[13]

Sample batch of size $B$ from experience replay

[14]

For each transition $T$ from the batch compute the support of target distribution: $y(T)_j = r' + \gamma \zeta^*_j\big(s', \operatorname{argmax}_{a'} \frac{1}{A} \sum_i \zeta^*_i(s', a', \theta^-), \theta^-\big)$

[15]

Compute loss: $\text{Loss} = \frac{1}{BA} \sum_T \sum_i \sum_j \big(\tau_i - \mathbb{I}[\zeta^*_i(s, a, \theta) < y(T)_j]\big)\big(\zeta^*_i(s, a, \theta) - y(T)_j\big)$

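A quantile regression loss of the kind named in the QR-DQN steps can be sketched as follows. This is an illustration under assumptions rather than the paper's code: the function names are hypothetical, the midpoint levels $\tau_i = (2i + 1) / (2A)$ are the convention commonly used with QR-DQN and are not stated in the extracted text, and the residual is taken in the textbook orientation $u = y - \zeta$ with $\rho_\tau(u) = u\,(\tau - \mathbb{I}[u < 0])$. The extracted formula appears to write the residual with the opposite sign, so the orientation here should be read as an assumption.

```python
def quantile_levels(num_quantiles):
    """Midpoint quantile levels tau_i = (2i + 1) / (2A) for i = 0..A-1 (assumed convention)."""
    return [(2 * i + 1) / (2 * num_quantiles) for i in range(num_quantiles)]


def pinball(u, tau):
    """Pinball loss rho_tau(u) = u * (tau - 1[u < 0]); non-negative for every u."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def quantile_regression_loss(batch):
    """Average pinball loss over a batch.

    batch: list of (predicted_quantiles, target_samples) pairs, each a list of
    A floats. Every predicted quantile i is compared with every target sample j,
    and the total is scaled by 1 / (B * A).
    """
    num_quantiles = len(batch[0][0])
    taus = quantile_levels(num_quantiles)
    total = 0.0
    for predicted, target in batch:
        for tau, zeta in zip(taus, predicted):
            for y in target:
                total += pinball(y - zeta, tau)
    return total / (len(batch) * num_quantiles)


# Toy data (assumed): one transition, two quantiles.
qr_loss = quantile_regression_loss([([0.0, 2.0], [1.0, 1.0])])
```

Because the loss is asymmetric in $\tau$, minimizing it pushes each output toward a different quantile of the target samples instead of toward their mean.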
[17]

Rainbow DQN

If $t \bmod K = 0$: $\theta^- \leftarrow \theta$

4.4. Rainbow DQN

Success of Deep Q-learning encouraged a full-scale research of value-based deep reinforcement learning by studying various drawbacks of DQN and developing auxiliary extensions. In many articles some extensions from previous research were already considered and embedded in compared algorithms during empirical studies. In Rainbow DQN [7], seven Q-learning-based i...

[18]

Select $a = \operatorname{argmax}_a \sum_i z_i \zeta^*_i(s, a, \theta, \varepsilon)$, $\varepsilon \sim \mathcal{N}(0, I)$

[19]

Observe transition $(s, a, r', s', \text{done})$

[20]

Construct N-step transition $T = \big(s, a, \sum_{n=0}^{N} \gamma^n r^{(n+1)}, s^{(N)}, \text{done}\big)$ and add it to experience replay with priority $\max_T \rho(T)$

[21]

Sample batch of size $B$ from experience replay using probabilities $P(T) \propto \rho(T)^\alpha$

[22]

Compute weights for the batch (where $M$ is the size of experience replay memory): $w(T) = \left(\frac{1}{M P(T)}\right)^{\beta(t)}$

[23]

For each transition $T = (s, a, \bar{r}, \bar{s}, \text{done})$ from the batch compute target (detached from computational graph to prevent backpropagation): $\varepsilon_1, \varepsilon_2 \sim \mathcal{N}(0, I)$, $P\big(y(T) = \bar{r} + \gamma^N z_i\big) = \zeta^*_i\big(\bar{s}, \operatorname{argmax}_{\bar{a}} \sum_i z_i \zeta^*_i(\bar{s}, \bar{a}, \theta, \varepsilon_1), \theta^-, \varepsilon_2\big)$

[24]

Project $y(T)$ on support $\{z_0, z_1 \ldots z_{A-1}\}$

[25]

Update transition priorities: $\rho(T) \leftarrow \mathrm{KL}\big(y(T) \,\|\, Z^*(s, a, \theta, \varepsilon)\big)$, $\varepsilon \sim \mathcal{N}(0, I)$

[26]

Compute loss: $\text{Loss} = \frac{1}{B} \sum_T w(T) \rho(T)$

[27]

Make a step of gradient descent using $\frac{\partial \text{Loss}}{\partial \theta}$

[28]

If $t \bmod K = 0$: $\theta^- \leftarrow \theta$

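The prioritized-replay steps in the Rainbow list above (sampling with $P(T) \propto \rho(T)^\alpha$, weights $w(T) = (1 / (M P(T)))^{\beta(t)}$, and the weighted loss) can be illustrated with a short standard-library sketch. The function names and the flat list of priorities are assumptions made for illustration; practical implementations usually store priorities in a sum-tree, and many also normalize the weights by their maximum, which the listed steps do not mention.

```python
import random


def sampling_probabilities(priorities, alpha):
    """P(T) proportional to rho(T) ** alpha, normalized over the whole memory."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, indices, beta):
    """w(T) = (1 / (M * P(T))) ** beta for the sampled indices, M = memory size."""
    memory_size = len(probs)
    return [(1.0 / (memory_size * probs[i])) ** beta for i in indices]


def sample_batch(priorities, batch_size, alpha, beta, rng):
    """Draw indices according to P(T) and return them with their weights."""
    probs = sampling_probabilities(priorities, alpha)
    indices = rng.choices(range(len(probs)), weights=probs, k=batch_size)
    return indices, importance_weights(probs, indices, beta)


def weighted_loss(weights, per_transition_losses):
    """Loss = (1 / B) * sum_T w(T) * rho(T)."""
    return sum(w * l for w, l in zip(weights, per_transition_losses)) / len(weights)


# Toy usage (assumed values): four stored transitions, batch of three.
rng = random.Random(0)
indices, weights = sample_batch([1.0, 3.0, 2.0, 2.0], 3, alpha=0.5, beta=0.4, rng=rng)
```

With `alpha = 0` the sampling becomes uniform and every weight equals 1, which recovers ordinary experience replay.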
[29]

Policy Gradient algorithms

5.1. Policy Gradient theorem

Alternative approach to solving RL task is direct optimization of objective

$J(\theta) = \mathbb{E}_{T \sim \pi_\theta} \sum_{t=1} \gamma^{t-1} r_t \to \max_\theta$ (33)

as a function of $\theta$. Policy gradient methods provide a framework how to construct an efficient optimization procedure based on stochastic first-order optimization within RL setting. We will assume that $\pi_\theta(a \mid s)$ is a stochastic pol...

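As a small worked example of objective (33), the expectation over trajectories can be approximated by averaging discounted returns of sampled episodes. The sketch below assumes trajectories are given simply as lists of rewards $r_1, r_2, \ldots$; the function names are hypothetical and no policy or environment is modeled.

```python
def discounted_return(rewards, gamma):
    """sum_{t=1}^{n} gamma^(t-1) * r_t for one trajectory, accumulated backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def estimate_objective(trajectories, gamma):
    """Monte Carlo estimate of J(theta): mean discounted return over sampled trajectories."""
    return sum(discounted_return(t, gamma) for t in trajectories) / len(trajectories)


# Toy reward sequences (assumed), as if sampled by running the current policy.
j_hat = estimate_objective([[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]], gamma=0.5)
```

Policy gradient methods differentiate this kind of estimate with respect to the policy parameters instead of only evaluating it.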
[30]

Obtain a roll-out of size $B$ using policy $\pi(\theta)$

[31]

For each transition $T$ from the roll-out compute advantage estimation: $A^\pi(T) = r' + \gamma V^\pi_\phi(s') - V^\pi_\phi(s)$

[32]

Compute target (detached from computational graph to prevent backpropagation): $y(T) = r' + \gamma V^\pi_\phi(s')$

[33]

Compute critic loss: $\text{Loss} = \frac{1}{B} \sum_T \big(y(T) - V^\pi_\phi(s)\big)^2$

[34]

Compute critic gradients: $\nabla^{\text{critic}} = \frac{\partial \text{Loss}}{\partial \phi}$

[35]

Compute actor gradient: $\nabla^{\text{actor}} = \frac{1}{B} \sum_T \nabla_\theta \log \pi_\theta(a \mid s) A^\pi(T)$

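The scalar quantities in the A2C steps above (one-step advantage, critic target, critic loss, and the actor term whose gradient is taken) can be sketched without a deep learning framework. This is only an illustration: the tuple layout and function names are assumptions, the bootstrap term is dropped on terminal transitions (a common convention that the listed steps do not state), and the gradients themselves would come from automatic differentiation in a real implementation.

```python
import math


def td_target(reward, next_value, gamma, done):
    """y(T) = r' + gamma * V(s'); the bootstrap is dropped when done (assumption)."""
    return reward + (0.0 if done else gamma * next_value)


def a2c_quantities(batch, gamma):
    """Compute advantages, critic loss and the actor surrogate for a roll-out.

    batch: list of (reward, value, next_value, done, log_prob) tuples, where
    value = V(s), next_value = V(s') and log_prob = log pi(a | s).
    """
    targets = [td_target(r, nv, gamma, d) for r, _, nv, d, _ in batch]
    advantages = [y - v for y, (_, v, _, _, _) in zip(targets, batch)]
    critic_loss = sum(a * a for a in advantages) / len(batch)
    # Differentiating this surrogate w.r.t. theta, with advantages held constant,
    # gives (1 / B) * sum_T grad log pi(a | s) * A(T).
    actor_surrogate = sum(lp * a for (_, _, _, _, lp), a in zip(batch, advantages)) / len(batch)
    return advantages, critic_loss, actor_surrogate


# Toy roll-out (assumed values): one ordinary and one terminal transition.
rollout = [
    (1.0, 0.5, 1.0, False, math.log(0.5)),
    (0.0, 1.0, 0.0, True, math.log(0.25)),
]
advantages, critic_loss, actor_surrogate = a2c_quantities(rollout, gamma=0.9)
```

Here the advantage and the critic residual coincide numerically; they differ only in which parameters are allowed to receive gradients.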
[36]

Make a step of gradient descent using $\nabla^{\text{actor}} + \alpha \nabla^{\text{critic}}$

5.4. Generalized Advantage Estimation (GAE)

There is a design dilemma in Advantage Actor Critic algorithm concerning the choice whether to use the critic to estimate $Q^\pi(s, a)$ and introduce bias into gradient estimation or to restrict critic employment only for baseline and cause higher variance with necessity of playing the whole episodes for each upda...

[37]

Obtain a roll-out of size $R$ using policy $\pi(\theta)$, storing action probabilities as $\pi^{\text{old}}(a \mid s)$

[38]

For each transition $T$ from the roll-out compute advantage estimation (detached from computational graph to prevent backpropagation): $A^\pi(T) = r' + \gamma V^\pi_\phi(s') - V^\pi_\phi(s)$

[39]

Perform n_epochs passes through roll-out using batches of size $B$; for each batch:

• compute critic target (detached from computational graph to prevent backpropagation): $y(T) = r' + \gamma V^\pi_\phi(s')$

• compute critic loss: $\text{Loss} = \frac{1}{B} \sum_T \big(y(T) - V^\pi_\phi(s)\big)^2$

• compute critic gradients: $\nabla^{\text{critic}} = \frac{\partial \text{Loss}}{\partial \phi}$

• compute importance sampling weights: $r_\theta(T) = \frac{\pi_\theta(a \mid s)}{\pi^{\text{old}}(a \mid s)}$

• compute clipped im...

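The importance sampling weight $r_\theta(T) = \pi_\theta(a \mid s) / \pi^{\text{old}}(a \mid s)$ from the PPO steps above can be illustrated together with a clipped objective. Since the extracted step is truncated, the clipping rule below is an assumption: it is the commonly cited PPO surrogate, the mean of $\min\big(r A, \operatorname{clip}(r, 1 - \epsilon, 1 + \epsilon) A\big)$, and the function names and the `clip_eps` parameter are hypothetical.

```python
import math


def importance_ratio(log_prob_new, log_prob_old):
    """r_theta(T) = pi_theta(a | s) / pi_old(a | s), computed from log-probabilities."""
    return math.exp(log_prob_new - log_prob_old)


def clip(x, low, high):
    """Clamp x to the interval [low, high]."""
    return max(low, min(high, x))


def clipped_surrogate(ratios, advantages, clip_eps):
    """Mean over the batch of min(r * A, clip(r, 1 - eps, 1 + eps) * A) (assumed rule)."""
    terms = [
        min(r * a, clip(r, 1.0 - clip_eps, 1.0 + clip_eps) * a)
        for r, a in zip(ratios, advantages)
    ]
    return sum(terms) / len(terms)


# Toy batch (assumed values): one ratio above and one below the clipping range.
objective = clipped_surrogate([1.5, 0.5], [1.0, -1.0], clip_eps=0.2)
```

Taking the minimum makes the objective pessimistic: moving the ratio further outside the clipping range in the direction the advantage favors no longer increases it, which is what allows several passes over the same roll-out.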
[40]

Experiments

6.1. Setup

We performed our experiments using custom implementation of discussed algorithms attempting to incorporate best features from different official and unofficial sources and unifying all algorithms in a single library interface. The full code is available at our github. While custom implementation might not be the most efficient, it hinted us several ambiguities in algorithms which are r...

[41]

OpenAI Gym

Discussion

We have concerned two main directions of universal model-free RL algorithm design and attempted to recreate several state-of-art pipelines. While the extensions of DQN are reasonable solutions of evident DQN problems, their effect is not clearly seen on simple tasks like Pong 36. Current state-of-art in single-threaded value-based approach, Rainbow DQN, is full of "glue and tape" decisions tha...

internal anchor · arXiv 2017
