
arxiv: 2606.21297 · v1 · pith:VWMW33TEnew · submitted 2026-06-19 · 💻 cs.LG · cs.AI

NASDAQ: Normalized Observation Space Dynamics-Augmented Q-Learning

Pith reviewed 2026-06-26 14:37 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · Q-learning · observation prediction · normalization · auxiliary tasks · sample efficiency · dynamics model

The pith

Normalizing observations balances reconstruction losses and enables effective dynamics-augmented Q-learning across observation types.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Observation-predictive RL methods predict future observations to learn better representations, but fail on low-dimensional tasks because dimensions with larger value ranges dominate the prediction loss, causing the model to neglect smaller ones. The paper identifies this imbalance as the key issue and proposes normalizing the observations in an online RL setting to equalize the losses and gradients across dimensions. This normalization also allows dynamics prediction in a shared space for both low- and high-dimensional inputs. Building on this, NASDAQ augments Q-learning with predictions of short-term values and next normalized observations. Experiments indicate it performs competitively with less training time than alternatives.
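The imbalance described above can be illustrated with a small numerical sketch. It is not the paper's experiment: the three value scales, the fixed 10% relative prediction error, and the helper names `per_dimension_mse` and `standardize` are all assumptions chosen for illustration.

```python
import random
import statistics


def per_dimension_mse(targets, predictions):
    """Mean squared error computed separately for each observation dimension."""
    n = len(targets)
    dims = len(targets[0])
    return [
        sum((t[d] - p[d]) ** 2 for t, p in zip(targets, predictions)) / n
        for d in range(dims)
    ]


def standardize(rows, means, stds):
    """Shift and scale each dimension with the given per-dimension statistics."""
    return [[(r[d] - means[d]) / stds[d] for d in range(len(r))] for r in rows]


rng = random.Random(0)
scales = [100.0, 1.0, 0.01]  # hypothetical value ranges, not taken from the paper

# Targets: each dimension is zero-mean Gaussian with its own scale.
targets = [[rng.gauss(0.0, s) for s in scales] for _ in range(2000)]
# A stand-in predictor that is off by the same 10% in every dimension.
predictions = [[0.9 * x for x in row] for row in targets]

# Share of the total loss contributed by each dimension, in raw units.
raw = per_dimension_mse(targets, predictions)
raw_share = [v / sum(raw) for v in raw]

# Same comparison after per-dimension standardization of targets and predictions.
means = [statistics.fmean(col) for col in zip(*targets)]
stds = [statistics.pstdev(col) for col in zip(*targets)]
norm = per_dimension_mse(
    standardize(targets, means, stds), standardize(predictions, means, stds)
)
norm_share = [v / sum(norm) for v in norm]

print("raw loss share per dimension:       ", [round(s, 4) for s in raw_share])
print("normalized loss share per dimension:", [round(s, 4) for s in norm_share])
```

In raw units the largest-scale dimension accounts for nearly all of the loss even though every dimension is predicted equally badly in relative terms; after standardization the three shares are roughly equal.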

Core claim

The central discovery is that normalizing the observation space before performing dynamics prediction corrects the unbalanced losses, providing a unified treatment for different input types and improving the effectiveness of observation-predictive RL.

What carries the argument

The NASDAQ framework, which couples value learning with auxiliary short-term value prediction and next normalized observation prediction tasks.
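One way to picture how the three objectives might be combined is a weighted sum. The sketch below is an assumption, not the paper's stated loss: squared errors stand in for whatever losses the authors use, and `AuxWeights`, `combined_loss`, and `lambda_obs` are hypothetical names. Only the `lambda_nstep` settings (1 for visual benchmarks, 0 for low-dimensional ones) come from the Figure 4 caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuxWeights:
    lambda_nstep: float      # weight of short-term value prediction
    lambda_obs: float = 1.0  # hypothetical weight of next-observation prediction


def mse(pred, target):
    """Mean squared error over the dimensions of one observation."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(q_pred, q_target, nstep_pred, nstep_target,
                  next_obs_pred, next_obs_target, weights):
    """Value loss plus two weighted auxiliary losses (illustrative only)."""
    value_loss = (q_pred - q_target) ** 2
    nstep_loss = (nstep_pred - nstep_target) ** 2
    # Both observation vectors are assumed to be already normalized.
    obs_loss = mse(next_obs_pred, next_obs_target)
    return value_loss + weights.lambda_nstep * nstep_loss + weights.lambda_obs * obs_loss


visual = AuxWeights(lambda_nstep=1.0)   # setting reported for visual benchmarks
low_dim = AuxWeights(lambda_nstep=0.0)  # setting reported for low-dimensional benchmarks

args = (1.0, 2.0, 0.5, 1.5, [0.0, 0.0], [1.0, 1.0])
loss_visual = combined_loss(*args, visual)
loss_low_dim = combined_loss(*args, low_dim)
print(loss_visual, loss_low_dim)
```

With `lambda_nstep` set to 0 the short-term value term drops out entirely, which matches the caption's statement that this auxiliary task is not included for low-dimensional observations.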

Load-bearing premise

The main reason prior observation-predictive RL underperforms on low-dimensional tasks is unbalanced reconstruction losses across observation dimensions.

What would settle it

Running the proposed normalization on low-dimensional benchmark tasks and observing no improvement in performance or no reduction in loss imbalance would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.21297 by Jianting Zhang (3), Junyuan Liang (1), Wuhui Chen (1), Xinwei Liu (1), Zicong Hong (2) ((1) Sun Yat-sen University, China, (2) EPFL, Switzerland, (3) Purdue University, USA).

Figure 1: Histograms and kernel density-estimated PDFs of per-dimension statistics, obtained from a …
Figure 2: Overview of the NASDAQ framework. (a) The value network contains three predictors …
Figure 3: Training wall-time (in hours) comparisons on the …
Figure 4: Gradient norm ratio ρ during training. Except for initial instability during early training, ρ predominantly lies within the interval [0.1, 2]. For λn-step, we set its value to 1 for visual RL benchmarks without tuning. For benchmarks with low-dimensional observations, we simply set λn-step to 0, indicating that short-term value prediction is not included, because we find that this auxiliary task, when λn-…
Figure 5: Learning curves on Gym. Results are over 5 seeds. The shaded area captures a 95% bootstrap confidence interval.
Figure 6: Learning curves on DMC (proprioceptive). Solid lines indicate average performance over 5 seeds, and shaded areas indicate the 95% bootstrap confidence interval. Discrete points with 95% bootstrap confidence interval denote the final results of TD-MPC2 and DreamerV3 reported in MR.Q.
Figure 7: Learning curves on DMC (visual). Solid lines indicate average performance over 5 seeds, and shaded areas indicate the 95% bootstrap confidence interval. Discrete points with 95% bootstrap confidence interval denote the final results of DreamerV3 reported in MR.Q.
Figure 8: Learning curves on Atari100k. Solid lines indicate average performance over 5 seeds, and shaded areas capture the 95% bootstrap confidence interval. Discrete points denote the final results reported in DreamerV3.
Figure 9: Histograms and kernel density-estimated PDFs of per-dimension statistics, obtained from a …
Figure 10: Scatter plots of auxiliary loss versus standard deviation for each observation dimension …
Figure 11: Histograms and kernel density-estimated PDFs of per-dimension auxiliary loss across the …
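Several captions above report a 95% bootstrap confidence interval over 5 seeds. The page does not say which bootstrap variant the authors used; the sketch below assumes a plain percentile bootstrap of the mean, and the function name `bootstrap_ci` and the five return values are made-up placeholders rather than results from the paper.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


# Placeholder final returns for 5 seeds (illustrative numbers only).
seed_returns = [812.0, 790.5, 845.2, 801.3, 828.9]
low, high = bootstrap_ci(seed_returns)
print(f"mean={statistics.fmean(seed_returns):.1f}  95% CI=({low:.1f}, {high:.1f})")
```

With only 5 seeds the interval can never extend beyond the smallest and largest observed values, which is worth keeping in mind when reading the shaded regions.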
Original abstract

Augmenting model-free reinforcement learning (RL) with representations learned through observation dynamics prediction (observation-predictive RL) can improve sample efficiency and performance, with minor modifications and limited additional computation. However, this approach still struggles in challenging tasks with low-dimensional observations. In this paper, we identify a key factor behind this problem: unbalanced reconstruction losses across observation dimensions, where dimensions with larger value ranges dominate the loss. This encourages the agent to neglect dimensions with relatively small ranges, leading to degraded performance. To address this issue, we propose a novel normalization method tailored to online RL, which normalizes low-dimensional observations and balances the resulting losses and gradients. Beyond balancing reconstruction losses, observation normalization enables dynamics prediction to be performed in a normalized observation space, thereby providing a unified treatment of low- and high-dimensional inputs (e.g., physical states and images). Building on this idea, we further introduce Normalized Observation Space Dynamics-Augmented Q-learning (NASDAQ), a framework for observation-predictive RL applicable across diverse domains. NASDAQ learns state-action representations by coupling value learning with two auxiliary tasks: short-term value prediction and next normalized observation prediction. Extensive experiments demonstrate that NASDAQ achieves competitive or superior performance compared with state-of-the-art model-based and self-predictive RL methods, while requiring significantly less training wall-time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies unbalanced per-dimension reconstruction losses as the cause of poor performance in observation-predictive RL on low-dimensional tasks. It proposes an online-RL-tailored normalization that balances losses and gradients, enabling dynamics prediction in normalized space. The resulting NASDAQ framework augments Q-learning with short-term value prediction and next-observation prediction auxiliaries, claiming competitive or superior performance to SOTA model-based and self-predictive methods at substantially lower wall-clock time.

Significance. If the performance claims survive controls that isolate normalization from capacity, exploration, and weighting choices, the work supplies a lightweight, domain-agnostic engineering fix that unifies low- and high-dimensional observation handling in dynamics-augmented RL while preserving sample efficiency.

major comments (2)
  1. [Abstract and §1] The diagnosis that unbalanced reconstruction losses constitute the dominant cause of prior performance gaps is asserted without evidence that alternative explanations (representation capacity, exploration policy, optimizer dynamics, or auxiliary-task weighting) were controlled for in the reported experiments.
  2. [§4 (Experiments)] No ablation is described that removes or reweights the normalization while keeping all other NASDAQ components fixed, so it is impossible to verify that the reported gains are attributable to loss balancing rather than to the auxiliary tasks or other design choices.
minor comments (1)
  1. [§3] Notation for the normalized observation space and the two auxiliary losses should be introduced with explicit equations early in §3 to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The two major comments highlight important points about evidence and experimental controls. We address each below and commit to revisions that strengthen the manuscript without misrepresenting the current results.

Point-by-point responses
  1. Referee: [Abstract and §1] The diagnosis that unbalanced reconstruction losses constitute the dominant cause of prior performance gaps is asserted without evidence that alternative explanations (representation capacity, exploration policy, optimizer dynamics, or auxiliary-task weighting) were controlled for in the reported experiments.

    Authors: The diagnosis originates from direct inspection of per-dimension reconstruction losses in low-dimensional tasks, where scale differences cause certain dimensions to dominate (detailed in §3). We agree that the manuscript does not present explicit controls isolating this factor from representation capacity, exploration, or optimizer choices. In revision we will expand §1 with additional loss-component analysis and a clearer statement of the evidential limits, while noting that comparisons are to published SOTA methods that already vary in capacity and weighting. revision: partial

  2. Referee: [§4 (Experiments)] No ablation is described that removes or reweights the normalization while keeping all other NASDAQ components fixed, so it is impossible to verify that the reported gains are attributable to loss balancing rather than to the auxiliary tasks or other design choices.

    Authors: We concur that an ablation isolating normalization is necessary to attribute gains specifically to loss balancing. Although NASDAQ is compared against baselines lacking normalization, the current experiments do not hold auxiliary tasks fixed while toggling only the normalization. We will add this ablation to the revised §4, reporting performance with and without normalization under otherwise identical NASDAQ components. revision: yes
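The ablation promised here amounts to a paired grid in which only the normalization switch changes. A minimal sketch of such a grid follows; `AblationArm`, `build_ablation`, and the environment names are hypothetical and do not come from the paper or its code.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AblationArm:
    env: str
    seed: int
    normalize_obs: bool  # the only factor that differs within a pair


def build_ablation(envs, seeds):
    """Every (env, seed) pair is run once with and once without normalization."""
    return [
        AblationArm(env=env, seed=seed, normalize_obs=norm)
        for env, seed, norm in product(envs, seeds, (True, False))
    ]


# Placeholder environment names and seeds, for illustration only.
arms = build_ablation(envs=["env_a", "env_b"], seeds=[0, 1, 2])
for arm in arms[:4]:
    print(arm)
```

Sharing seeds across the two settings keeps the comparison paired, so a difference in returns is less likely to be an artifact of seed variance.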

Circularity Check

0 steps flagged

No significant circularity; empirical engineering contribution with independent content

full rationale

The paper presents NASDAQ as an empirical method that normalizes observations to balance reconstruction losses in observation-predictive RL. No equations, fitted parameters, or self-citations are described that reduce the reported performance gains or the normalization step to inputs by construction. The central performance claims rest on experimental comparisons rather than a mathematical derivation chain. The diagnosis of unbalanced losses as the primary cause is presented as an observation motivating the method, but does not create a self-referential loop in any derivation. This is a standard case of an engineering contribution without load-bearing circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; the method implicitly assumes that a per-dimension normalization can be computed stably from online data streams without introducing bias into the value function, but no explicit free parameters, axioms, or invented entities are stated.
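The stability assumption flagged here concerns statistics estimated from a stream. As a point of reference only, the sketch below uses Welford's running mean and variance update, a standard technique for this purpose. It is not the paper's normalization method, which this page does not describe, and the class name `RunningNormalizer` and the `eps` guard are assumptions.

```python
import math


class RunningNormalizer:
    """Per-dimension running mean and variance via Welford's algorithm."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean
        self.eps = eps         # guards against division by zero

    def update(self, obs):
        """Fold one observation into the running statistics."""
        self.count += 1
        for d, x in enumerate(obs):
            delta = x - self.mean[d]
            self.mean[d] += delta / self.count
            self.m2[d] += delta * (x - self.mean[d])

    def std(self):
        """Population standard deviation; falls back to 1 before two samples."""
        if self.count < 2:
            return [1.0] * len(self.mean)
        return [math.sqrt(m2 / self.count) for m2 in self.m2]

    def normalize(self, obs):
        return [
            (x - m) / (s + self.eps)
            for x, m, s in zip(obs, self.mean, self.std())
        ]


normalizer = RunningNormalizer(dim=2)
for obs in ([1.0, 100.0], [2.0, 200.0], [3.0, 300.0]):
    normalizer.update(obs)

z = normalizer.normalize([3.0, 300.0])
print(z)
```

Both dimensions map to the same normalized value despite a 100x difference in raw scale. Because the statistics keep moving during training, normalized targets stored earlier can drift relative to later ones, which is the kind of bias the ledger note alludes to.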


