Mitigating Value Hallucination in Dyna Planning via Multistep Predecessor Models

Ehsan Imani; Erin Talvitie; Farzane Aminmansour; Martha White; Micheal Bowling; Taher Jafferjee

arxiv: 2006.04363 · v2 · submitted 2020-06-08 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-24 14:22 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords Dyna planning · value hallucination · predecessor models · model-based reinforcement learning · multistep updates · reinforcement learning · sample efficiency · model error robustness

The pith

Multistep predecessor models in Dyna avoid updating real states from simulated values to prevent misleading action values under model error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that Dyna agents can fail when an imperfect model causes the value function to bootstrap real-state estimates from simulated states, producing hallucinated values that distort the learned policy. It introduces the Hallucinated Value Hypothesis to formalize this risk and examines a design space of four Dyna variants that differ in whether they simulate forward or backward and whether they perform one-step or multistep updates. Three of the variants update real states toward simulated states and are therefore exposed to the hypothesized failure mode. The fourth variant, multistep predecessor models, updates without bootstrapping real states from simulated ones and is therefore positioned as more robust to model inaccuracies.

Core claim

Updating the values of real states towards the values of simulated states can produce misleading action values that adversely affect the control policy. The multistep predecessor variant avoids this by not updating real states toward simulated states, and the paper reports experimental evidence for the Hallucinated Value Hypothesis from a comparison of the four Dyna design variants.
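The failure mode in this claim can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the two state names, the optimistic initial value of 10.0, the step size, and the discount. It shows a one-step successor planning update in which an erroneous model transition sends a real state to a state that does not exist in the true environment, so that state's optimistic value is never corrected by real experience.

```python
# Hypothetical toy setup: "real" is a state the agent actually visits,
# "border" is a hallucinated state that only the erroneous model can reach.
GAMMA = 0.9   # discount (assumed)
ALPHA = 0.5   # step size (assumed)

def planning_update(q, s, a, r, s_next, actions):
    """One-step successor (Q-learning style) planning update."""
    target = r + GAMMA * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += ALPHA * (target - q[(s, a)])

actions = ["stay", "out"]
q = {
    ("real", "stay"): 0.0,
    ("real", "out"): 0.0,
    # Optimistic initial values that real experience never corrects,
    # because the true environment never reaches "border".
    ("border", "stay"): 10.0,
    ("border", "out"): 10.0,
}

# The erroneous model claims that taking "out" in "real" leads to "border".
for _ in range(20):
    planning_update(q, "real", "out", 0.0, "border", actions)

greedy = max(actions, key=lambda a: q[("real", a)])
print(q[("real", "out")], greedy)
```

In this sketch the value of the real state is pulled toward the hallucinated state's value, so the greedy policy prefers the action that the true environment does not reward.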

What carries the argument

The multistep predecessor model, which simulates environment dynamics backward over multiple steps so that updates do not bootstrap real-state values from simulated-state values.
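A minimal tabular sketch of how such an update might look is given below. It is an illustration under stated assumptions, not the authors' implementation: the function name, the `backward_traj` format, and the step-size and discount values are hypothetical. The point it tries to capture is that the only bootstrapped value comes from the real state, and the states being updated are the simulated predecessors.

```python
from collections import defaultdict

def multistep_predecessor_update(q, real_state, backward_traj, actions,
                                 gamma=0.9, alpha=0.5):
    """Update simulated predecessors toward multi-step returns.

    backward_traj is a list of (pred_state, action, reward) tuples produced
    by a (hypothetical) predecessor model, ordered from the nearest
    predecessor of real_state to the farthest.
    """
    # Bootstrap only from the value of the real state.
    g = max(q[(real_state, b)] for b in actions)
    for pred_state, action, reward in backward_traj:
        # Multi-step return from this predecessor, ending at the real state.
        g = reward + gamma * g
        q[(pred_state, action)] += alpha * (g - q[(pred_state, action)])
    return q

# Usage with made-up states: s_m1 and s_m2 are simulated predecessors.
actions = ["left", "right"]
q = defaultdict(float)
q[("s_real", "right")] = 1.0
traj = [("s_m1", "right", 0.0), ("s_m2", "right", 0.0)]
multistep_predecessor_update(q, "s_real", traj, actions)
```

Note that the real state's own entries are never written to during this planning step, which is the property the section attributes to the multistep predecessor variant.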

If this is right

Dyna agents using the multistep predecessor approach remain effective even when the learned model contains small errors.
The other three design combinations remain vulnerable to policy distortion from hallucinated values.
Multistep updates paired with backward simulation provide a concrete way to retain the sample-efficiency gains of Dyna without the associated robustness penalty.
Model-based planning can be made more reliable by choosing update directions that isolate real experience from simulated experience.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of real and simulated value updates could be applied to other model-based methods that mix planning with real experience.
If the hypothesis holds, improvements in model accuracy alone may be less critical than previously thought for Dyna-style algorithms.
The approach may extend to settings where the model is learned online and errors cannot be fully eliminated.

Load-bearing premise

Bootstrapping real-state values from simulated-state values is the primary cause of Dyna failure under model error rather than other forms of model inaccuracy or update mechanics.

What would settle it

An experiment in which the multistep predecessor variant still underperforms standard Dyna despite model error, or in which one of the three variants that updates real states from simulated states succeeds at the same rate.

Figures

Figures reproduced from arXiv: 2006.04363 by Ehsan Imani, Erin Talvitie, Farzane Aminmansour, Martha White, Micheal Bowling, Taher Jafferjee.

**Figure 1.** (a) Borderworld. (b) An erroneous simulated transition.

**Figure 2.** Visual comparison of Dyna algorithms. Circles and black arrows show a trajectory; solid circles are real states and dashed circles are simulated states. A red arrow means that the value of the origin state is updated towards the destination state. All algorithms except Multi-Step Predecessor Dyna allow updates towards hallucinated states.

**Figure 3.** Learning curves on Borderworld. Only Multi-step Predecessor Dyna and Uniterated One-step Predecessor Dyna succeed. Error bars are not visible as they are smaller than line thicknesses.

**Figure 4.** Plot of max_a Q(s, a) for all s ∈ S after 100,000 steps. The red rectangles show where values of real states have been contaminated by values of simulated states.

**Figure 5.** Learning curves. The algorithms that update real state values to simulated state values struggle while those that do not show robust performance.

**Figure 6.** Plot of max_a Q(s, a) for all s ∈ S for β = 0 (left) and β > 0 (right) after 2,000 steps.

**Figure 7.** Performance versus β. Multi-step Predecessor is the best performing algorithm.

Original abstract

Dyna-style reinforcement learning (RL) agents improve sample efficiency over model-free RL agents by updating the value function with simulated experience generated by an environment model. However, it is often difficult to learn accurate models of environment dynamics, and even small errors may result in failure of Dyna agents. In this paper, we highlight that one potential cause of that failure is bootstrapping off of the values of simulated states, and introduce a new Dyna algorithm to avoid this failure. We discuss a design space of Dyna algorithms, based on using successor or predecessor models -- simulating forwards or backwards -- and using one-step or multi-step updates. Three of the variants have been explored, but surprisingly the fourth variant has not: using predecessor models with multi-step updates. We present the Hallucinated Value Hypothesis (HVH): updating the values of real states towards values of simulated states can result in misleading action values which adversely affect the control policy. We discuss and evaluate all four variants of Dyna amongst which three update real states toward simulated states -- so potentially toward hallucinated values -- and our proposed approach, which does not. The experimental results provide evidence for the HVH, and suggest that using predecessor models with multi-step updates is a promising direction toward developing Dyna algorithms that are more robust to model error.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper fills the empty cell in the Dyna 2x2 design space with multistep predecessor models that avoid updating real states from simulated values, and claims experiments back the Hallucinated Value Hypothesis.

The paper's main contribution is spotting that predecessor models combined with multi-step updates had not been tried in Dyna, and proposing that this combination avoids the problem of pulling real-state values toward potentially hallucinated simulated values. That's the new piece. They lay out the design space nicely (successor or predecessor, one-step or multi-step) and note that the other three have been done before. The Hallucinated Value Hypothesis is a reasonable way to think about one failure mode when the model is imperfect. By comparing all four, they can point to the one that doesn't do the real-to-simulated update as potentially better. What stands out is that they are explicit about the unexplored variant, which gives the work a clear novelty claim relative to prior Dyna papers.

The soft spot is around whether this update direction is really the dominant issue. As the stress test notes, model error can show up in the simulated trajectories themselves, not just in how values are transferred back to real states. If the experiments don't isolate that, it's possible the gains come from something else about multi-step predecessors. The abstract says the results support the hypothesis, but without seeing the actual numbers, baselines, or statistical details, it's difficult to gauge how convincing that support is.

Overall, this is the kind of paper that tries to make model-based methods more practical by addressing a specific failure mode. Readers working on Dyna or similar planning algorithms would find the design space discussion useful. It deserves to go to peer review because the idea is grounded in the literature and the empirical angle is there, even if the strength of the evidence needs a closer look in revision.

Referee Report

3 major / 2 minor

Summary. The paper claims that a primary cause of Dyna-style RL failure under imperfect models is the 'Hallucinated Value Hypothesis' (HVH): bootstrapping real-state values from simulated-state values produces misleading action values that degrade the policy. It defines a 2x2 design space over successor vs. predecessor models and one-step vs. multi-step updates, observes that three of the four variants update real states toward simulated states, and introduces the previously unexplored multistep-predecessor variant that avoids this direction. The manuscript states that experiments across the four variants supply evidence for the HVH and that the new variant is a promising route to greater robustness.

Significance. If the HVH holds and the multistep-predecessor variant demonstrably outperforms the other three under realistic model error, the work supplies a lightweight algorithmic modification that could improve the reliability of Dyna-style planning without requiring more accurate models or additional regularization. The explicit enumeration of the four design-space cells is a clear organizational contribution.

major comments (3)

[Abstract / Experiments] Abstract and experimental section: the statement that 'the experimental results provide evidence for the HVH' supplies no information on experimental design, number of runs, baselines, statistical tests, or effect sizes, so the support for the central claim cannot be assessed from the manuscript as written.
[Design space / Experiments] Design-space argument (likely §3): the four combinations are separated solely by whether real states are updated toward simulated ones, but no analysis or controlled experiment quantifies the relative contribution of this update direction versus other error channels (biased rewards, incorrect transition probabilities, or rollout distribution shift) when the learned model is imperfect; this leaves the HVH's status as the dominant failure mode untested.
[Introduction / HVH definition] HVH statement: the hypothesis is introduced as an ad-hoc explanatory construct without a formal derivation, bound, or falsifiable prediction that would distinguish it from generic model-error effects; the empirical comparison therefore carries the entire burden of proof.

minor comments (2)

[Design space] Notation for the four variants is introduced informally; a compact table or diagram labeling each cell (successor/one-step, predecessor/multi-step, etc.) would improve readability.
[Related work] The manuscript does not cite prior work on backward models or multi-step value iteration in the Dyna literature; adding these references would clarify novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and note where the manuscript will be revised.

Referee: [Abstract / Experiments] Abstract and experimental section: the statement that 'the experimental results provide evidence for the HVH' supplies no information on experimental design, number of runs, baselines, statistical tests, or effect sizes, so the support for the central claim cannot be assessed from the manuscript as written.

Authors: We agree the abstract statement is too brief. The experimental section details the environments, number of independent runs with means and standard errors, and direct comparisons among variants. We will revise the abstract to qualify the claim or briefly indicate the nature of the supporting experiments. revision: yes

Referee: [Design space / Experiments] Design-space argument (likely §3): the four combinations are separated solely by whether real states are updated toward simulated ones, but no analysis or controlled experiment quantifies the relative contribution of this update direction versus other error channels (biased rewards, incorrect transition probabilities, or rollout distribution shift) when the learned model is imperfect; this leaves the HVH's status as the dominant failure mode untested.

Authors: All variants use the identical learned model, isolating the effect of update direction from other model errors. Performance differences are therefore attributable to whether real states are updated from simulated values. We will add explicit discussion of this controlled comparison. revision: partial

Referee: [Introduction / HVH definition] HVH statement: the hypothesis is introduced as an ad-hoc explanatory construct without a formal derivation, bound, or falsifiable prediction that would distinguish it from generic model-error effects; the empirical comparison therefore carries the entire burden of proof.

Authors: The HVH is derived from the design-space enumeration and yields the testable prediction that the multistep-predecessor variant avoids updating real states from simulated ones. The experiments directly test this prediction. No formal bound is supplied, but the hypothesis organizes the variants and is evaluated empirically. revision: no

Circularity Check

0 steps flagged

No significant circularity; algorithmic proposal and empirical test are self-contained.

full rationale

The paper defines a design space over successor/predecessor models and one-step/multi-step updates, then proposes the multistep-predecessor variant that avoids updating real-state values from simulated ones. The Hallucinated Value Hypothesis is stated as a claim about error propagation and is evaluated via experiments comparing the four variants. No equations, fitted parameters, or self-citations are used to derive the central claim; the benefit is demonstrated by direct comparison rather than by construction or renaming. This matches the default case of an independent algorithmic contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard RL assumptions about MDPs and value iteration plus the newly introduced Hallucinated Value Hypothesis; no free parameters or invented physical entities are described.

axioms (1)

[standard math] The environment can be modeled as a Markov decision process and value functions can be updated via bootstrapping.
The entire Dyna framework and the four variants presuppose the standard MDP and temporal-difference update setting.
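For reference, the one-step bootstrapped update that this axiom presupposes can be written in its usual Q-learning form, as below. The symbols are the standard conventions and are not taken from this page: α is a step size, γ a discount factor, and (s, a, r, s') a transition. The Hallucinated Value Hypothesis concerns the case where s is a real state and s' is produced by the learned model rather than by the environment.

```latex
Q(s, a) \leftarrow Q(s, a)
  + \alpha \Bigl[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \Bigr]
```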

invented entities (1)

Hallucinated Value Hypothesis (no independent evidence)
purpose: To explain a specific mechanism by which model error harms Dyna performance.
The hypothesis is introduced in the paper as the load-bearing explanation for observed failures.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Advantage-Guided Diffusion for Model-Based Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 7.0

Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.

Safety, Security, and Cognitive Risks in World Models
cs.CR 2026-04 unverdicted novelty 6.0

World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and D...

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · cited by 2 Pith papers · 4 internal anchors

[1] Andrychowicz, O. M., Baker, B., Chociej, M., Józefowicz, R., McGrew, B., Pachocki, J. W., Petron, A., Plappert, M., Powell, G., Ray, A., Schneider, J., Sidor, S., Tobin, J., Welinder, P., Weng, L., and Zaremba, W. Learning Dexterous In-hand Manipulation. The International Journal of Robotics Research, 39:20--3, 2018.

[2] Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research, 47:253--279, 2013.

[3] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. ArXiv, abs/1606.01540, 2016.

[4] Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. Sample-efficient Reinforcement Learning with Stochastic Ensemble Value Expansion. In Advances in Neural Information Processing Systems, pp. 8224--8234, 2018.

[5] Degris, T., White, M., and Sutton, R. S. Off-Policy Actor-Critic. ArXiv, abs/1205.4839, 2012.

[6] Goyal, A., Brakel, P., Fedus, W., Singhal, S., Lillicrap, T., Levine, S., Larochelle, H., and Bengio, Y. Recall Traces: Backtracking Models for Efficient Reinforcement Learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HygsfnR9Ym

[7] Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. Continuous deep Q-learning with Model-based Acceleration. In International Conference on Machine Learning, pp. 2829--2838, 2016.

[8] Holland, G. Z., Talvitie, E., and Bowling, M. The Effect of Planning Shape on Dyna-style Planning in High-dimensional State Spaces. arXiv preprint arXiv:1806.01825, 2018.

[9] Kalweit, G. and Boedecker, J. Uncertainty-driven Imagination for Continuous Deep Reinforcement Learning. In Conference on Robot Learning, pp. 195--206, 2017.

[10] Ke, N. R., Singh, A., Touati, A., Goyal, A., Bengio, Y., Parikh, D., and Batra, D. Modelling the Long Term Future in Model-based Reinforcement Learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SkgQBn0cF7

[11] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level Control through Deep Reinforcement Learning. Nature, 518(7540):529, 2015.

[12] Moore, A. W. and Atkeson, C. G. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13(1):103--130, 1993.

[13] Pan, Y., Zaheer, M., White, A., Patterson, A., and White, M. Organizing Experience: A Deeper Look at Replay Mechanisms for Sample-based Planning in Continuous State Domains. arXiv preprint arXiv:1806.04624, 2018.

[14] Peng, J. and Williams, R. J. Efficient Learning and Planning Within the Dyna Framework. Adaptive Behavior, 1(4):437--454, 1993.

[15] Sutton, R. S. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. In Machine Learning Proceedings 1990, pp. 216--224. 1990.

[16] Sutton, R. S. Dyna, an Integrated Architecture for Learning, Planning, and Reacting. ACM SIGART Bulletin, 2(4):160--163, 1991.

[17] Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. 2018.

[18] Sutton, R. S., Szepesvári, C., Geramifard, A., and Bowling, M. Dyna-style Planning with Linear Function Approximation and Prioritized Sweeping. In Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence, pp. 528--536, 2008.

[19] Talvitie, E. Model Regularization for Stable Sample Rollouts. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pp. 780--789, 2014.

[20] Talvitie, E. Self-correcting Models for Model-based Reinforcement Learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[21] Tasfi, N. Pygame Learning Environment. https://github.com/ntasfi/PyGame-Learning-Environment, 2016.

[22] van Hasselt, H. P., Hessel, M., and Aslanides, J. When to Use Parametric Models in Reinforcement Learning? In Advances in Neural Information Processing Systems, pp. 14322--14333, 2019.

[23] Watkins, C. J. and Dayan, P. Q-learning. Machine learning, 8(3-4):279--292, 1992.

[24] Yao, H., Bhatnagar, S., Diao, D., Sutton, R. S., and Szepesvári, C. Multi-step Dyna Planning for Policy Evaluation and Control. In Advances in Neural Information Processing Systems, pp. 2187--2195, 2009.