pith. machine review for the scientific record.

arxiv: 2512.04341 · v3 · submitted 2025-12-04 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Long-Horizon Model-Based Offline Reinforcement Learning Without Explicit Conservatism

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 01:42 UTC · model grok-4.3

classification 💻 cs.LG
keywords offline reinforcement learning · model-based RL · Bayesian methods · epistemic uncertainty · long-horizon rollouts · conservatism · D4RL benchmark · world models

The pith

A Bayesian approach using world-model posteriors and long-horizon rollouts can match conservative offline RL without explicit penalties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper questions whether explicit conservatism is required for stable offline reinforcement learning and proposes a complementary Bayesian method instead. By maintaining a posterior over possible world models and training a history-dependent agent to maximize expected return under that posterior, the approach directly targets epistemic uncertainty. In a bandit example it outperforms conservatism on low-quality data, while in full tasks long rollouts of several hundred steps become feasible once design choices limit compounding model errors. The resulting algorithm NEUBAY reaches competitive or better performance than leading conservative methods on D4RL and NeoRL benchmarks, with new state-of-the-art results on seven datasets. The authors further show that dataset quality and coverage can indicate when the Bayesian route is preferable.

Core claim

The paper claims that a neutral Bayesian principle suffices for long-horizon model-based offline RL: maintain a posterior over world models, then optimize a history-dependent policy to maximize expected return under the posterior. This directly handles epistemic uncertainty without penalizing out-of-dataset actions or shortening rollouts. With additional design choices that control compounding errors, the resulting method NEUBAY performs on par with or better than conservative algorithms on standard benchmarks, setting new records on seven datasets.
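As a concrete reading of this principle, here is a minimal sketch, assuming (as the figures suggest) that an ensemble of learned models approximates the posterior; the class and function names and toy interfaces are hypothetical, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class EnsemblePosterior:
    """Approximate posterior over world models via a bootstrap ensemble.

    Each member stands in for a learned dynamics/reward model; sampling an
    index plays the role of sampling a world model from the posterior.
    """
    def __init__(self, members):
        self.members = members

    def sample(self):
        return self.members[rng.integers(len(self.members))]

def expected_return_under_posterior(policy, posterior, s0, horizon, n_rollouts=32):
    """Monte Carlo estimate of E_{m ~ posterior}[return of policy in m].

    The policy is history-dependent: it conditions on the full (s, a, r)
    history, which is what lets it adapt at test time to whichever world
    model was sampled.
    """
    total = 0.0
    for _ in range(n_rollouts):
        model = posterior.sample()      # one world model per imagined rollout
        s, history = s0, []
        for _ in range(horizon):
            a = policy(history, s)      # history-dependent action
            s, r = model(s, a)          # imagined transition under that model
            history.append((s, a, r))
            total += r
    return total / n_rollouts
```

Optimizing `policy` against this objective is the "neutral" step: no penalty on out-of-dataset actions appears anywhere, only averaging over the posterior.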

What carries the argument

A posterior distribution over world models together with a history-dependent agent that maximizes expected return under the posterior.

If this is right

  • Long-horizon rollouts become viable and necessary once explicit conservatism is removed.
  • The Bayesian method outperforms conservatism on low-quality datasets in bandit settings.
  • Careful design choices allow scaling to realistic tasks while keeping model errors in check.
  • NEUBAY achieves new state-of-the-art results on seven datasets from D4RL and NeoRL.
  • Characterizing datasets by quality and coverage helps decide when the Bayesian approach is preferable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset characterization could be used in practice to select between Bayesian and conservative algorithms based on data properties.
  • Similar posterior modeling over dynamics may reduce reliance on conservatism in other model-based planning settings.
  • The history-dependent agent structure suggests testable extensions to richer observation histories or partial observability.

Load-bearing premise

That specific design choices can sufficiently mitigate compounding model errors during long-horizon rollouts of several hundred steps.

What would settle it

An experiment in which NEUBAY without the proposed design choices for error mitigation produces clear value overestimation and degraded performance on long-horizon D4RL tasks.
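The ζ ablations in the figures imply one concrete error-mitigation mechanism: calibrate a truncation threshold as the ζ-quantile of a model-uncertainty estimate over (s, a) pairs in the real dataset, then stop imagined rollouts wherever uncertainty exceeds it. A hedged sketch follows; treating ensemble disagreement as the uncertainty proxy is an assumption, and the function names are hypothetical:

```python
import numpy as np

def calibrate_threshold(uncertainties_on_dataset, zeta=0.99):
    """Truncation threshold = the zeta-quantile of uncertainty over (s, a)
    pairs in the real offline dataset (cf. the paper's quantile ζ)."""
    return np.quantile(uncertainties_on_dataset, zeta)

def rollout_with_truncation(step_fn, uncertainty_fn, policy_fn, s0, max_T, threshold):
    """Roll out a learned model, stopping early once epistemic uncertainty
    at the visited (s, a) exceeds the calibrated threshold."""
    s, traj = s0, []
    for t in range(max_T):              # hard cap at the episode length T
        a = policy_fn(s)
        if uncertainty_fn(s, a) > threshold:
            break                       # truncate where the model is no longer trusted
        s, r = step_fn(s, a)
        traj.append((s, a, r))
    return traj
```

The key design choice this encodes is *where* to truncate rather than a fixed *when*: rollouts extend as long as the model stays inside its region of confidence.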

Figures

Figures reproduced from arXiv: 2512.04341 by Esther Derman, Pierre-Luc Bacon, Siamak Ravanbakhsh, Tianwei Ni, Vincent Taboga, Vineet Jain.

Figure 1: Our algorithm NEUBAY's result on a D4RL dataset. From left to right: normalized score on the real environment, estimated Q-value on the offline dataset, and rollout horizon statistics over 100 training rollouts (median with interquartile range). Here we vary the uncertainty quantile ζ ∈ {0.9, 0.99, 0.999, 1.0} for the rollout truncation threshold, without using conservatism. Reinforcement learning (RL) oft…
Figure 3: Average return (normalized by T) on test-time bandits with p₁* ∈ {0.01, 0.3, 0.55, 0.7, 0.99}. Since the observed arm has p₀* = 0.5, cases with p₁* < 0.5 are worse and those with p₁* > 0.5 are better. The dataset covers only arm 0: specifically, D = {(a^i_{0:T−1}, r^i_{1:T})}_i is collected by a deterministic behavior policy π_β(a) = 1(a = 0). Thus, for each t < T = 100, a_t = 0, r_{t+1} ∼ B…
Figure 2: Histogram of estimated reward means p₀, p₁ across ensemble members. This skewed dataset D leaves the true reward parameter p₁* for arm 1 completely unobserved, inducing substantial epistemic uncertainty on p₁*. Theoretically, under an uninformative prior for Bernoulli rewards, this uncertainty corresponds to the entropy of Unif[0, 1]. We approximate the posterior by fitting an ensemble of reward mode…
Figure 4: Empirical CDFs of epistemic uncertainty U_θ over (s, a) ∈ supp_{S×A}(D), with logit-scaled y-axis. Uncertainties are normalized by the dataset mean, so 1 is the average value. How to truncate long-horizon rollouts? Since model errors depend on specific (s, a) pairs, a key question is not simply when to truncate rollouts, but where. A natural criterion is the model's uncertainty estimate U_θ(s, a), which correla…
Figure 5: Effect of LayerNorm in world models trained and evaluated on halfcheetah-medium-expert-v2. We collect 200 rollouts and truncate only on float32 overflow, without using an uncertainty threshold. For each metric, we plot the median (solid line) together with the 5-95% percentile band across rollouts. The rightmost scatter plot shows the Spearman's rank coefficient in the with-LayerNorm setting; vertical lines…
Figure 6: Ablation on the uncertainty quantile ζ for rollout truncation in D4RL locomotion datasets. Results for the remaining datasets are shown in the next figure.
Figure 7: Ablation on the uncertainty quantile ζ for rollout truncation in D4RL locomotion datasets.
Figure 8: Ablation on the uncertainty quantile ζ for rollout truncation in NeoRL datasets.
Figure 9: Ablation on the uncertainty quantile ζ for rollout truncation in three D4RL Adroit datasets and four D4RL AntMaze datasets. The Adroit benchmark has short maximum episode steps: T = 100 < 2⁷ in pen and T = 200 < 2⁸ in hammer, which limits the rollout horizon. Maximum episode steps are T = 700 in umaze and T = 1000 in medium maze. Results on the remaining Adroit and AntMaze datasets are omitted as our algorit…
Figure 10: Effect of LayerNorm in world models trained on halfcheetah-random-v2 and halfcheetah…
Figure 11: Effect of LayerNorm in world models trained on halfcheetah-medium-v2 and halfcheetah…
Figure 12: Effect of LayerNorm in world models trained on hopper-medium-replay-v2 and hopper…
Figure 13: Selective learning curves on datasets where performance is…
Figure 14: Selective learning curves on datasets where performance is…
Figure 15: Failure cases in antmaze-medium-play-v2 (left three; different seeds) and antmaze-large…
Original abstract

Popular offline reinforcement learning (RL) methods rely on explicit conservatism, penalizing out-of-dataset actions or restricting rollout horizons. We question the universality of this principle and revisit a complementary Bayesian perspective for test-time adaptation. By modeling a posterior over world models and training a history-dependent agent to maximize expected return, the Bayesian approach directly addresses epistemic uncertainty without explicit conservatism. We first illustrate in a bandit setting that Bayesianism excels on low-quality datasets where conservatism fails. Scaling to realistic tasks, we find that long-horizon rollouts are essential to control value overestimation once conservatism is removed. We introduce design choices that enable learning from long-horizon rollouts while mitigating compounding model errors, yielding our algorithm, NEUBAY, grounded in the neutral Bayesian principle. On D4RL and NeoRL benchmarks, NEUBAY is competitive with leading conservative algorithms, achieving new state-of-the-art on 7 datasets with rollout horizons of several hundred steps. Finally, we characterize datasets by quality and coverage to identify when NEUBAY is preferable to conservative methods.
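To make the bandit illustration concrete, here is a toy simulation under stated assumptions: a uniform posterior over the unseen arm's mean, and explore-then-commit as a crude stand-in for a history-dependent Bayes-adaptive policy. None of this is the paper's exact construction; it only shows why test-time adaptation can beat staying on the dataset's support.

```python
import numpy as np

rng = np.random.default_rng(1)

def conservative_return(p0=0.5):
    # A conservative agent never leaves the dataset's support: it always
    # pulls arm 0, whose mean reward p0 = 0.5 is well estimated from data.
    return p0

def bayes_adaptive_return(p1_true, p0=0.5, T=100, n_probe=10):
    """Explore-then-commit as a crude stand-in for a history-dependent
    Bayes-adaptive policy: probe the unobserved arm 1 for a few steps,
    then commit to whichever arm looks better. Returns average per-step
    reward over the T-step episode (commit phase taken in expectation)."""
    probe = rng.binomial(1, p1_true, size=n_probe)
    best = p1_true if probe.mean() > p0 else p0
    return (probe.sum() + (T - n_probe) * best) / T

# Average over the uniform posterior on the unseen arm's true mean p1*.
avg = np.mean([bayes_adaptive_return(p1) for p1 in rng.uniform(0, 1, 2000)])
```

Averaged over the posterior, the adaptive agent exceeds the conservative baseline of 0.5 because it keeps the upside of p₁* > 0.5 while paying only a short probing cost when p₁* < 0.5.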

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes NEUBAY, a model-based offline RL method that models a posterior over world models and trains a history-dependent agent to maximize expected return under this posterior. It argues that this Bayesian approach directly handles epistemic uncertainty without explicit conservatism, shows that long-horizon rollouts (hundreds of steps) become essential once conservatism is removed, and introduces design choices to mitigate compounding model errors. On D4RL and NeoRL benchmarks, NEUBAY is competitive with leading conservative methods and achieves new state-of-the-art on 7 datasets; the paper also characterizes datasets by quality and coverage to identify when the Bayesian method is preferable.

Significance. If the central claim holds, the work is significant because it challenges the prevailing reliance on explicit conservatism in offline RL and offers a complementary neutral Bayesian perspective that performs well on low-quality datasets where conservatism struggles. The empirical results, including new SOTAs on multiple benchmarks and the dataset characterization, provide practical value. The approach is grounded in Bayesian principles rather than ad-hoc penalties, and the long-horizon emphasis with error-mitigation design choices represents a distinct direction.

major comments (2)
  1. [§4] §4 (Algorithm and design choices): The claim that specific design choices suffice to mitigate compounding model errors over 100-500 step rollouts is load-bearing for the central argument that long-horizon rollouts control value overestimation without conservatism. However, the manuscript provides no explicit quantitative bounds on error accumulation, no sensitivity analysis to rollout horizon length, and no ablation comparing error growth rates against short-horizon baselines.
  2. [§5.1] §5.1 and §5.2 (Empirical evaluation): While new SOTA results are reported on 7 datasets, the verification that model errors remain controlled during long rollouts relies on final performance metrics; additional diagnostics such as rollout-wise prediction error curves or divergence rates under the learned posterior would strengthen the evidence that the Bayesian posterior plus history-dependent agent suffices.
minor comments (2)
  1. The abstract states that long-horizon rollouts are essential once conservatism is removed, but a brief comparison table or plot contrasting short vs. long horizons under the same Bayesian setup would improve clarity.
  2. [§3] Notation for the history-dependent agent and posterior sampling could be made more explicit in the method section to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment below and outline the revisions we plan to incorporate.

Point-by-point responses
  1. Referee: [§4] §4 (Algorithm and design choices): The claim that specific design choices suffice to mitigate compounding model errors over 100-500 step rollouts is load-bearing for the central argument that long-horizon rollouts control value overestimation without conservatism. However, the manuscript provides no explicit quantitative bounds on error accumulation, no sensitivity analysis to rollout horizon length, and no ablation comparing error growth rates against short-horizon baselines.

    Authors: We acknowledge that the manuscript does not provide explicit quantitative bounds on error accumulation, nor does it include a dedicated sensitivity analysis to rollout horizon or an ablation on error growth rates versus short-horizon baselines. Deriving such bounds for general learned dynamics models over hundreds of steps is technically challenging and was not attempted here; our central argument instead rests on the empirical performance across benchmarks together with the bandit illustration. To address the concern directly, we will add a sensitivity study varying rollout horizon length and an ablation comparing prediction error growth for long-horizon versus short-horizon rollouts in the revised manuscript. revision: yes

  2. Referee: [§5.1] §5.1 and §5.2 (Empirical evaluation): While new SOTA results are reported on 7 datasets, the verification that model errors remain controlled during long rollouts relies on final performance metrics; additional diagnostics such as rollout-wise prediction error curves or divergence rates under the learned posterior would strengthen the evidence that the Bayesian posterior plus history-dependent agent suffices.

    Authors: We agree that additional diagnostics would strengthen the evidence that model errors remain controlled. While final performance metrics and competitiveness with conservative baselines provide supporting evidence, we recognize that direct measures of error behavior during rollouts would be valuable. In the revised version we will include rollout-wise prediction error curves and divergence rates under the learned posterior, placed in the experimental section or an expanded appendix. revision: yes
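The diagnostics promised here can be sketched simply: roll the learned model open-loop alongside the true dynamics under a shared action sequence and record per-step state error, whose growth over the horizon is the compounding-error signal. Toy interfaces, not the authors' code:

```python
import numpy as np

def rollout_error_curve(model_step, env_step, policy_fn, s0, horizon, n_rollouts=50):
    """Rollout-wise prediction error: the learned model rolls forward from
    its own predictions (so errors compound) while the true dynamics run in
    parallel under the same actions; per-step state error is averaged over
    rollouts."""
    errs = np.zeros(horizon)
    for _ in range(n_rollouts):
        s_model = np.asarray(s0, dtype=float)
        s_env = np.asarray(s0, dtype=float)
        for t in range(horizon):
            a = policy_fn(s_env)               # shared actions from the real trajectory
            s_model = model_step(s_model, a)   # imagined next state (open loop)
            s_env = env_step(s_env, a)         # true next state
            errs[t] += np.linalg.norm(s_model - s_env)
    return errs / n_rollouts
```

A flat or slowly growing curve over hundreds of steps would directly support the claim that the design choices keep long-horizon rollouts usable.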

Circularity Check

0 steps flagged

No significant circularity; derivation relies on Bayesian principles and external empirical validation

Full rationale

The paper's core argument proceeds from standard Bayesian modeling of a posterior over world models, followed by training a history-dependent policy to maximize expected return under that posterior. This is illustrated first in a bandit setting and then scaled via explicit design choices for long-horizon rollouts (several hundred steps) that mitigate compounding errors. The resulting algorithm NEUBAY is evaluated on D4RL and NeoRL benchmarks, where it is shown competitive with conservative baselines and achieves new SOTA on seven datasets. No load-bearing step reduces by construction to a fitted parameter renamed as a prediction, nor does any central claim rest on a self-citation chain whose prior result is itself unverified or defined in terms of the present work. The design choices for error mitigation are presented as practical engineering decisions whose robustness is assessed empirically rather than derived tautologically from the inputs. The derivation therefore remains self-contained against external benchmarks and does not exhibit the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach assumes standard RL Markov properties and that posterior approximation plus error-mitigation heuristics suffice for long-horizon stability; no new physical entities are postulated.

free parameters (2)
  • rollout horizon length
    Set to several hundred steps to control value overestimation once conservatism is removed.
  • error-mitigation design choices
    Specific heuristics introduced to limit compounding model errors during long rollouts.
axioms (1)
  • domain assumption: A posterior distribution over world models can be maintained and used for expected-return maximization at test time.
    Central to the neutral Bayesian principle invoked in the abstract.

pith-pipeline@v0.9.0 · 5496 in / 1201 out tokens · 33114 ms · 2026-05-17T01:42:56.649416+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 3 internal anchors

  1. [1]

    Combating the Compounding-Error Problem with a Multi-step Model

    7, 19, 20, 24 Anonymous. ADM-v2: Pursuing full-horizon roll-out in dynamics models for offline policy learning and evaluation. In Submitted to International Conference on Learning Representations, 2026. Under review. 22 Arthur Argenson and Gabriel Dulac-Arnold. Model-based offline planning. In International Conference on Learning Representations, 2020. 19 K...

  2. [2]

    Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning

    2, 6, 21 Chenjia Bai, Lingxiao Wang, Zhuoran Yang, Zhi-Hong Deng, Animesh Garg, Peng Liu, and Zhaoran Wang. Pessimistic bootstrapping for uncertainty-driven offline reinforcement learning. In International Conference on Learning Representations, 2022. 24 Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning ...

  3. [3]

    University of Massachusetts Amherst, 2002

    22 Michael O'Gordon Duff. Optimal Learning: Computational Procedures for Bayes-adaptive Markov Decision Processes. University of Massachusetts Amherst, 2002. 2, 3, 19, 21 Gabriel Dulac-Arnold, Nir Levine, Daniel J Mankowitz, Jerry Li, Cosmin Paduraru, Sven Gowal, and Todd Hester. Challenges of real-world reinforcement learning: definitions, benchmarks and ...

  4. [4]

    Tree-based batch mode reinforcement learning.Journal of Machine Learning Research, 6, 2005

    3, 21 Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 2005. 20 Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I Jordan, Joseph E Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning. In International Conference on ...

  5. [5]

    A clean slate for offline reinforcement learning.Advances in Neural Information Processing Systems, 2025

    22 Matthew Thomas Jackson, Uljad Berdica, Jarek Liesen, Shimon Whiteson, and Jakob Nicolaus Foerster. A clean slate for offline reinforcement learning. Advances in Neural Information Processing Systems, 2025. 10, 40 Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. Advances in neural infor...

  6. [6]

    Model-based offline reinforcement learning with count-based conservatism

    36 Byeongchan Kim and Min-hwan Oh. Model-based offline reinforcement learning with count-based conservatism. In International Conference on Machine Learning, pp. 16728–16746. PMLR, 2023. 19 Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. In International Conference on Learning Representations, 2022. ...

  7. [7]

    Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    32 Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020. 1 Gen Li, Laixi Shi, Yuxin Chen, Yuejie Chi, and Yuting Wei. Settling the sample complexity of model-based offline reinforcement learning. The Annals of Statistics, 52(1):23...

  8. [8]

    Uncertainty representations in state-space layers for deep reinforcement learning under partial observability.arXiv preprint arXiv:2409.16824, 2024

    2, 21, 47 Carlos E Luis, Alessandro G Bottero, Julia Vinogradska, Felix Berkenkamp, and Jan Peters. Uncertainty representations in state-space layers for deep reinforcement learning under partial observability. arXiv preprint arXiv:2409.16824, 2024. 22 Fan-Ming Luo, Zuolin Tu, Zefang Huang, and Yang Yu. Efficient recurrent off-policy RL requires a context-...

  9. [9]

    Playing Atari with Deep Reinforcement Learning

    21 Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013. 34 Amir Moeini, Jiuqi Wang, Jacob Beck, Ethan Blaser, Shimon Whiteson, Rohan Chandra, and Shangtong Zhang. A survey of in-context reinforcement lear...

  10. [10]

    Sumo: Search-based uncertainty estimation for model-based offline reinforcement learning

    2 Zhongjian Qiao, Jiafei Lyu, Kechen Jiao, Qi Liu, and Xiu Li. SUMO: Search-based uncertainty estimation for model-based offline reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 20033–20041, 2025. 7, 19 Rong-Jun Qin, Xingyuan Zhang, Songyi Gao, Xiong-Hui Chen, Zewen Li, Weinan Zhang, and Yang Yu. Neor...

  11. [11]

    37 Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem

    PMLR, 2016. 37 Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013. 3, 22, 23 Andrew G Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. Advances in Neural Information Processing Systems, 33:4697–4708, 2020. 3 Haoran Xu,...

  12. [12]

    soft robustness

    19 Zhihan Yang and Hai Nguyen. Recurrent off-policy baselines for memory-based continuous control. arXiv preprint arXiv:2110.12628, 2021. 22 Denis Yarats, David Brandfonbrener, Hao Liu, Michael Laskin, Pieter Abbeel, Alessandro Lazaric, and Lerrel Pinto. Don't change the algorithm, change the data: Exploratory data for offline reinforcement learning. In IC...

  13. [13]

    Then the numerator f(m₁) < ∞ and the denominator g(m₁) > 0, making C(π, m₁) < ∞

    Model m₁ is a small perturbation of m* on all of S × A. Then the numerator f(m₁) < ∞ and the denominator g(m₁) > 0, making C(π, m₁) < ∞.

  14. [14]

    price of Bayesianism

    Model m₂ is equivalent to m* on supp(β), i.e., m₂(s, a) = m*(s, a) for all (s, a) ∈ supp(β), but there exists an off-support pair (s†, a†) ∉ supp(β) with d^π_{m₂}(s†, a†) > 0 and TV(m₂(s†, a†), m*(s†, a†)) > 0. In that case g(m₂) = 0 and f(m₂) > 0, so C(π, m₂) = ∞. If the posterior P_D assigns weights P_D(m₁) = 1 − ε and P_D(m₂) = ε with any ε ∈ (0, 1), then E_{m∼P_D}[f(m)] = (1 − ε)f(m₁) + ε...

  15. [15]

    implicit pessimism

    In particular, the statement holds for the optimal robust policy π*(M). D.3 Auxiliary Lemmas for Theorem 1. We first recall a standard PAC bound for maximum likelihood estimation (MLE), adapted from Agarwal et al. (2020a, Theorem 21). Lemma 3 (MLE PAC-bound). Let β ∈ ∆(S × A) be the offline distribution induced by D, and m̂ = argmax_{m∈M} E_{(s,a,r,s′)∼D}[log m(r, s′ | s, ...

  16. [16]

    4.2, instead of enforcing a fixed horizon H, we truncate rollouts adaptively using an uncertainty threshold calibrated on the real dataset

    Uncertainty truncation: as described in Sec. 4.2, instead of enforcing a fixed horizon H, we truncate rollouts adaptively using an uncertainty threshold calibrated on the real dataset. This allows planning to extend as long as the model remains confident

  17. [17]

    Timeout truncation: to remain consistent with test-time evaluation, we impose a hard cap at the environment's maximum episode length T, regardless of rollout length

  18. [18]

    Including this prior knowledge makes our algorithm directly comparable to model-based baselines, which are our main focus

    Ground-truth termination: we retain the environment’s rule-based terminal function to provide true terminal signals ˆdt+1, following prior model-based RL methods (Yu et al., 2020). Including this prior knowledge makes our algorithm directly comparable to model-based baselines, which are our main focus. Importantly, only the terminal signal disables bootst...

  19. [19]

    Training is terminated early if the validation MSE fails to improve by more than 0.01 relative within five consecutive epochs, following the early stopping procedure in MOBILE

    with a weight decay coefficient of 5×10⁻⁵, and a learning rate of 1×10⁻³ for locomotion tasks and 3×10⁻⁴ for Adroit tasks. Training is terminated early if the validation MSE fails to improve by more than 0.01 relative within five consecutive epochs, following the early stopping procedure in MOBILE. In the bandit task, the model learning rate is 1×10⁻³...

  20. [20]

    Exploration is ϵ-greedy, annealed from 1.0 to 0.1 over the first10%of gradient steps

    as the discrete control algorithm. Exploration is ϵ-greedy, annealed from 1.0 to 0.1 over the first10%of gradient steps. Agent architecture implementation.As introduced in Sec. 4.4, our agent consists of a recurrent actor πν :H t →∆(A) and a recurrent critic Qω :H t × A →R 10. The critic outputs an ensemble of 10 Q-values, following the REDQ design adopte...