Use the Online Network If You Can: Towards Fast and Stable Reinforcement Learning

Ahmed Hendawy; Carlo D'Eramo; Henrik Metternich; Jan Peters; Mahdi Kallel; Théo Vincent

arxiv: 2510.02590 · v2 · pith:BCKMURT3new · submitted 2025-10-02 · 💻 cs.LG

Ahmed Hendawy, Henrik Metternich, Théo Vincent, Mahdi Kallel, Jan Peters, Carlo D'Eramo

Pith reviewed 2026-05-21 21:26 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learning · target networks · value function learning · overestimation bias · deep Q-learning · actor-critic · online and offline RL

The pith

Using the minimum estimate between online and target networks produces faster and stable value learning in reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MINTO, a simple change to how targets are computed when updating value functions. Instead of relying solely on a slowly updated target network or the fast but unstable online network, the target becomes the lower of the two estimates. This targets the overestimation that arises when the online network is used directly for bootstrapping. The approach integrates into many value-based and actor-critic methods at almost no extra cost and shows better results across online and offline settings with both discrete and continuous actions.

Core claim

MINTO sets the bootstrapped target to the minimum of the online network's estimate and the target network's estimate. This yields faster convergence than a fixed target network while avoiding the instability and overestimation that typically occur when the online network serves as the target. The authors report consistent gains when the rule is added to existing algorithms and tested on a wide collection of online RL, offline RL, discrete-action, and continuous-action benchmarks.

What carries the argument

The MINTO target rule, which computes each update target as the minimum between the current online network estimate and the target network estimate.
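
As a minimal sketch of this rule for a discrete-action, Q-learning-style update, the target could be computed as below. The function name `minto_target`, its arguments, and the numeric values are illustrative assumptions and are not taken from the paper's code.

```python
from typing import Sequence


def minto_target(
    reward: float,
    gamma: float,
    q_online_next: Sequence[float],
    q_target_next: Sequence[float],
    terminal: bool = False,
) -> float:
    """r + gamma * max_a' min(Q_target(s', a'), Q_online(s', a'))."""
    if terminal:
        return reward  # no bootstrapping past a terminal state
    if len(q_online_next) != len(q_target_next):
        raise ValueError("both networks must score the same action set")
    # Element-wise minimum of the two estimates, one entry per action.
    pessimistic = [min(qo, qt) for qo, qt in zip(q_online_next, q_target_next)]
    # Greedy maximum over actions, as in Q-learning.
    return reward + gamma * max(pessimistic)


# Hypothetical next-state action values for three actions.
q_online = [1.0, 5.0, 2.0]  # online network: action 1 looks inflated
q_target = [1.5, 2.0, 2.5]  # target network: slower-moving estimate
y = minto_target(reward=1.0, gamma=0.5, q_online_next=q_online, q_target_next=q_target)
print(y)  # 2.0
```

With these hypothetical values, bootstrapping from the online network alone would give 1.0 + 0.5 × 5.0 = 3.5 and from the target network alone 1.0 + 0.5 × 2.5 = 2.25; the minimum caps the inflated online estimate for action 1.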

If this is right

MINTO can be added to a broad range of value-based and actor-critic algorithms with negligible overhead.
Value-function updates converge more quickly while preserving stability.
Performance improves across online RL, offline RL, discrete actions, and continuous actions.
The overestimation bias that appears when the online network is used for bootstrapping is reduced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar minimum operations might be worth testing in other learning systems that face a speed-stability tradeoff.
The approach could be examined in larger-scale or real-world control tasks beyond the current benchmarks.
It may be useful to check whether the same rule helps in policy-gradient or model-based methods.

Load-bearing premise

The minimum operation between the two estimates will reduce overestimation bias without creating underestimation or instability in the settings where it is applied.
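
The underestimation side of this premise can be illustrated with a small simulation. The sketch below assumes, purely for illustration, that the online and target estimates of a single action value are independent, unbiased, and Gaussian with the same noise level; real online and target networks are correlated, so the effect in practice is likely smaller. Under these assumptions the expected minimum is the true value minus σ/√π.

```python
import random

random.seed(0)
TRUE_VALUE = 1.0  # hypothetical true action value
NOISE_STD = 0.5   # hypothetical estimation noise (sigma)
N = 100_000       # number of simulated estimate pairs

sum_single = 0.0
sum_min = 0.0
for _ in range(N):
    online = random.gauss(TRUE_VALUE, NOISE_STD)
    target = random.gauss(TRUE_VALUE, NOISE_STD)
    sum_single += online             # using one estimate alone
    sum_min += min(online, target)   # using the minimum of the two

mean_single = sum_single / N  # close to the true value (unbiased)
mean_min = sum_min / N        # biased downward by about sigma / sqrt(pi)
print(mean_single, mean_min)
```

This only covers the inner minimum for one action. In a full target, the maximum over actions pushes in the opposite direction, so the net bias depends on the setting.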

What would settle it

A controlled comparison on a standard benchmark such as DQN on Atari or SAC on MuJoCo in which MINTO produces slower learning or more unstable training curves than the unmodified target-network baseline.

Figures

Figures reproduced from arXiv: 2510.02590 by Ahmed Hendawy, Carlo D'Eramo, Henrik Metternich, Jan Peters, Mahdi Kallel, Théo Vincent.

**Figure 1.** Results of benchmarking the Minimum operator utilized by MINTO against other potential operators on 15 Atari games with the CNN architecture. We report the AUC metric using IQM and the confidence interval computed across 5 seeds. Methods are trained for 50 million frames.

**Figure 2.** Results of benchmarking MINTO and DQN on …

**Figure 3.** … using the AUC metric. Although Maxmin DQN relies on an ensemble of Q-functions (N = 2 in our study), MINTO achieves better performance. This demonstrates the advantage of leveraging up-to-date estimates from the online network, in combination with the minimum operator, to miti…

**Figure 4.** Results of benchmarking MINTO+IQN and IQN on …

**Figure 5.** Results of benchmarking CQL and CQL+MINTO on …

**Figure 6.** Results of evaluating the impact of MINTO on SimbaV2 ( …

**Figure 7.** Individual results of benchmarking MINTO and DQN on 15 Atari games using the IM…

**Figure 8.** Individual results of benchmarking MINTO and DQN on 15 Atari games using the CNN …

**Figure 9.** Individual results of benchmarking DoubleDQN, FR-DQN, ScDQN, and MINTO on 15 …

**Figure 10.** Individual results of benchmarking CQL and CQL+MINTO on 15 Atari games using …

**Figure 11.** Individual results of benchmarking CQL and CQL+MINTO on 15 Atari games using …

**Figure 12.** Individual results of benchmarking MaxMinDQN and MINTO on 15 Atari games using …

**Figure 13.** Individual results of benchmarking SimbaV2 and SimbaV2+MINTO on the 5 MuJoCo …

**Figure 14.** Individual results of benchmarking SimbaV1 and SimbaV1+MINTO on the 5 MuJoCo …

**Figure 15.** Individual results of benchmarking SimbaV2 and SimbaV2+MINTO on the 14 Hu…

**Figure 16.** Individual results of benchmarking SimbaV1 and SimbaV1+MINTO on the 14 Hu…

**Figure 17.** Individual results of benchmarking SimbaV2 and SimbaV2+MINTO on the 7 DMC …

**Figure 18.** Individual results of benchmarking SimbaV1 and SimbaV1+MINTO on the 7 DMC …

**Figure 19.** Individual results of benchmarking the Minimum operator of MINTO against other …

**Figure 20.** Cumulative results of benchmarking the Minimum operator of MINTO against other …

Original abstract

The use of target networks is a popular approach for estimating value functions in deep Reinforcement Learning (RL). While effective, the target network remains a compromise solution that preserves stability at the cost of slowly moving targets, thus delaying learning. Conversely, using the online network as a bootstrapped target is intuitively appealing, albeit well-known to lead to unstable learning. In this work, we aim to obtain the best out of both worlds by introducing a novel update rule that computes the target using the MINimum estimate between the Target and Online network, giving rise to our method, MINTO. Through this simple, yet effective modification, we show that MINTO enables faster and stable value function learning, by mitigating the potential overestimation bias of using the online network for bootstrapping. Notably, MINTO can be seamlessly integrated into a wide range of value-based and actor-critic algorithms with a negligible cost. We evaluate MINTO extensively across diverse benchmarks, spanning online and offline RL, as well as discrete and continuous action spaces. Across all benchmarks, MINTO consistently improves performance, demonstrating its broad applicability and effectiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MINTO's min(online, target) rule is a simple plug-in that trades some overestimation risk for faster learning, but it can create underestimation when the online network is the smaller one.

The paper's main contribution is MINTO, which sets the bootstrap target to the element-wise minimum of the online network and the target network. This is presented as a lightweight way to keep the fast-moving targets from online bootstrapping while cutting the overestimation that usually comes with it. The change is easy to drop into existing value-based and actor-critic methods with almost no added cost, and the authors test it across online and offline settings plus discrete and continuous actions.

Referee Report

2 major / 2 minor

Summary. The paper proposes MINTO, a simple modification to RL value function updates that sets the bootstrapped target to the element-wise minimum of the online network and target network estimates. The central claim is that this yields faster and more stable learning than standard target networks by reducing overestimation bias from online bootstrapping, while avoiding the instability of pure online targets. The method is presented as easily integrable into value-based and actor-critic algorithms and is evaluated across online/offline RL benchmarks in discrete and continuous action spaces, with consistent reported performance gains.

Significance. If the empirical gains hold under rigorous controls, MINTO offers a low-overhead, broadly applicable tweak that could accelerate value learning in many existing RL pipelines. The claimed seamless integration and cross-benchmark consistency are practical strengths, though the absence of analysis on underestimation risks and statistical validation limits the strength of the contribution.

major comments (2)

[§3] §3 (Method): The description of the min operation does not analyze or bound the cases in which the online estimate is smaller than the target estimate. When this occurs (common early in training or in high-variance settings), the min target can introduce underestimation bias, potentially slowing value propagation or yielding overly conservative policies; this directly affects the central claim that the modification reliably mitigates overestimation without symmetric drawbacks.
[§4] §4 (Experiments): Performance tables and figures report consistent improvements but omit error bars, run counts, statistical significance tests, and details on hyperparameter search or post-hoc selection. Without these, it is impossible to assess whether the gains are robust or could be explained by variance or tuning, which is load-bearing for the broad-applicability conclusion.

minor comments (2)

[Algorithm 1] The pseudocode in Algorithm 1 could explicitly annotate the min operation and clarify whether it is applied only to the critic or also affects the actor update.
Notation for online vs. target networks is occasionally inconsistent between text and equations; a single consistent symbol pair would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving the rigor of our analysis and empirical validation. We address each major comment below and will revise the manuscript accordingly.

Referee: [§3] §3 (Method): The description of the min operation does not analyze or bound the cases in which the online estimate is smaller than the target estimate. When this occurs (common early in training or in high-variance settings), the min target can introduce underestimation bias, potentially slowing value propagation or yielding overly conservative policies; this directly affects the central claim that the modification reliably mitigates overestimation without symmetric drawbacks.

Authors: We appreciate this point and agree that the current description in §3 lacks explicit discussion of underestimation cases. When the online network produces a lower estimate than the target network, the min operation yields a more conservative target. This can occur early in training or in high-variance environments. However, we argue this is not a symmetric drawback to overestimation: underestimation tends to produce safer, more stable updates that still allow value propagation, whereas overestimation can lead to divergence. Our extensive empirical results across benchmarks show faster convergence without the instability seen in pure online bootstrapping. In the revision, we will expand §3 with a new paragraph analyzing these cases, including conditions favoring underestimation and empirical statistics on how often the online estimate is smaller during training.
Revision: yes

Referee: [§4] §4 (Experiments): Performance tables and figures report consistent improvements but omit error bars, run counts, statistical significance tests, and details on hyperparameter search or post-hoc selection. Without these, it is impossible to assess whether the gains are robust or could be explained by variance or tuning, which is load-bearing for the broad-applicability conclusion.

Authors: We fully agree that stronger statistical reporting is needed to support the broad-applicability claims. The revised manuscript will add error bars (mean ± standard deviation) to all tables and figures, explicitly state that all results are averaged over 5 independent runs with different random seeds, include statistical significance tests (e.g., paired t-tests or Wilcoxon rank-sum tests with p-values reported), and provide a dedicated appendix section detailing the hyperparameter search procedure, ranges explored, and selection criteria. These changes will allow readers to better evaluate the robustness of the reported gains.
Revision: yes
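
The kind of reporting discussed in this exchange (an aggregate score with an uncertainty interval across runs) can be sketched as follows. This is a generic illustration of an interquartile mean (IQM) with a percentile bootstrap interval, not the evaluation code used in the paper; the function names, the score values, and the choice of a plain rather than stratified bootstrap are all assumptions.

```python
import random
from statistics import mean


def iqm(scores):
    """Interquartile mean: mean of what is left after trimming 25% from each end."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))  # number of scores dropped per tail
    return mean(ordered[cut: len(ordered) - cut])


def bootstrap_ci(scores, stat=iqm, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `stat` over `scores`."""
    rng = random.Random(seed)
    # Resample with replacement, recompute the statistic, and sort the results.
    stats = sorted(
        stat([rng.choice(scores) for _ in scores]) for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-run scores; the outlier 100.0 is trimmed away by the IQM.
scores = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0]
point = iqm(scores)
lower, upper = bootstrap_ci(scores)
print(point)  # 3.5
print(lower, upper)
```

The IQM ignores the top and bottom quarter of runs, which makes it less sensitive to a single outlier seed than the plain mean (15.125 for these hypothetical scores).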

Circularity Check

0 steps flagged

MINTO is an empirical algorithmic heuristic with no circular derivation

The paper introduces MINTO as a direct modification to the target computation (min of online and target network values) for RL value updates. This is presented as a simple, effective change to balance stability and speed, with performance claims resting entirely on empirical results across benchmarks rather than any mathematical derivation or prediction that reduces to the inputs by construction. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or imported uniqueness theorems appear in the abstract or described method. The derivation chain is therefore self-contained and non-circular; the method is a heuristic whose validity is tested externally via experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract alone; the approach relies on standard RL assumptions about value estimation and overestimation bias.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: $y = r + \gamma \max_{a'} \min\big(Q_{\bar\theta}(s', a'), Q_{\theta}(s', a')\big)$ (Eq. 3); convergence via non-expansion of $G_{\text{MINTO}}$ (Corollary 1, Appendix A)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: MINTO integrated into DQN, IQN, CQL, SAC across discrete/continuous, online/offline settings
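
The non-expansion property cited for the first passage can be sanity-checked numerically. The sketch below assumes the per-state operator has the form "maximum over actions of the minimum over estimators" and tests it on random tables; the names `g_minto` and `sup_distance`, the table sizes, and the value range are illustrative choices. A random check is not a proof, it only fails to find a counterexample.

```python
import random


def g_minto(q):
    """Per-state operator: max over actions of the min over estimators."""
    return max(min(row) for row in q)


def sup_distance(q1, q2):
    """Largest absolute entry-wise gap between two tables of estimates."""
    return max(abs(x - y) for r1, r2 in zip(q1, q2) for x, y in zip(r1, r2))


rng = random.Random(0)
N_ACTIONS, N_ESTIMATORS = 4, 2  # two estimators: target and online network


def random_table():
    return [[rng.uniform(-10, 10) for _ in range(N_ESTIMATORS)] for _ in range(N_ACTIONS)]


violations = 0
for _ in range(10_000):
    q1, q2 = random_table(), random_table()
    # Non-expansion: |G(Q) - G(Q')| <= max_{a,j} |Q_a(j) - Q'_a(j)|
    if abs(g_minto(q1) - g_minto(q2)) > sup_distance(q1, q2) + 1e-12:
        violations += 1

print(violations)  # 0
```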

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 8 internal anchors

[1] Dopamine: A Research Framework for Deep Reinforcement Learning
Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G. Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110.

[2] Soft Actor-Critic Algorithms and Applications
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pp. 1861–1870. PMLR, 2018a. Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhi...

[3] Hyperspherical Normalization for Scalable Deep Reinforcement Learning
Hojoon Lee, Youngdo Lee, Takuma Seno, Donghu Kim, Peter Stone, and Jaegul Choo. Hyperspherical normalization for scalable deep reinforcement learning. arXiv preprint arXiv:2502.15280.

[4] VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning
Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Ziwei Wang. VLA-RL: Towards masterful and general robotic manipulation with scalable reinforcement learning. arXiv preprint arXiv:2505.18719.

[5] Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

[6] Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[7] HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation
Carmelo Sferrazza, Dun-Ming Huang, Xingyu Lin, Youngwoon Lee, and Pieter Abbeel. HumanoidBench: Simulated humanoid benchmark for whole-body locomotion and manipulation. arXiv preprint arXiv:2403.10506.

[8] Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.

[9] DeepMind Control Suite
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite. arXiv preprint arXiv:1801.00690.

[10] Deep Reinforcement Learning and the Deadly Triad
Hado Van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph Modayil. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648.
[11] Under Review

Preprint. Under Review. Appendix:
A Proof of Corollary 1
A.1 Proof of Condition A1.1
A.2 Proof of Condition A1.2
B Algorithmic Details
C Implementation Details
C.1 Online RL and Discrete Control ...
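[12] The second inequality holds because the min operator is also a non-expansion

$$
\begin{aligned}
\left| G_{\text{MINTO}}(Q_s) - G_{\text{MINTO}}(Q'_s) \right|
&= \left| \max_{a \in \mathcal{A}} \min_{j \in \mathcal{T}} Q_{sa}(j) - \max_{a \in \mathcal{A}} \min_{j \in \mathcal{T}} Q'_{sa}(j) \right| && (5) \\
&\le \max_{a \in \mathcal{A}} \left| \min_{j \in \mathcal{T}} Q_{sa}(j) - \min_{j \in \mathcal{T}} Q'_{sa}(j) \right| && (6) \\
&\le \max_{a \in \mathcal{A}} \max_{j \in \mathcal{T}} \left| Q_{sa}(j) - Q'_{sa}(j) \right| && (7) \\
&= \max_{a \in \mathcal{A},\, j \in \mathcal{T}} \left| Q_{sa}(j) - Q'_{sa}(j) \right| && (8)
\end{aligned}
$$

The first inequality holds because the max operator is a non-expansion. The second inequality holds because the min operator is also a non-expansion. The final...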
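[13] 4, SAC is adapted to use a single Q-function critic, following the approach taken in Simba (Lee et al

2: Load offline dataset $\mathcal{D}$ into $\mathcal{B}$
3: repeat
4: Sample a batch of $B$ transitions $\{(s_b, a_b, r_b, s'_b)\}_{b=1}^{B}$ from $\mathcal{B}$
5: Compute target Q-values for next states: $y_b = r_b + \gamma \max_{a'} \min\big(Q_{\bar\theta}(s'_b, a'), Q_{\theta}(s'_b, a')\big)$
6: if $s'_b$ is terminal then $y_b \leftarrow r_b$
7: Compute standard TD loss: $L_{\text{TD}}(\theta) = \frac{1}{2B} \sum_{b=1}^{B} \big( \lceil y_b \rceil - Q_{\theta}(s_b, a_b) \big)^2$
8: Compute conservative regularizer: $L_{\text{CQL}}(\theta) = \alpha \cdot \mathbb{E}_{s_b \sim \mathcal{B}} \big[ \log \sum$ ...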
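
Steps 5 to 7 of the excerpt above (target, terminal handling, TD loss) can be sketched in plain Python. The tabular `q_online` and `q_target` dictionaries, the batch layout, and the function name are hypothetical stand-ins for neural networks and a replay buffer, and the conservative regularizer of step 8 is left out because it is truncated in the excerpt.

```python
def minto_td_loss(batch, q_online, q_target, gamma):
    """Mean squared TD error over a batch, scaled by 1/2, with MINTO targets.

    batch: list of (state, action, reward, next_state, terminal) tuples
    q_online, q_target: dicts mapping a state to a list of action values
    """
    total = 0.0
    for s, a, r, s_next, terminal in batch:
        if terminal:
            y = r  # step 6: no bootstrapping from terminal next states
        else:
            # Step 5: max over actions of the min of target and online values.
            y = r + gamma * max(
                min(qt, qo) for qt, qo in zip(q_target[s_next], q_online[s_next])
            )
        # Step 7: y is a fixed number here; with automatic differentiation it
        # would be wrapped in a stop-gradient so only Q_online(s, a) is trained.
        total += (y - q_online[s][a]) ** 2
    return total / (2 * len(batch))


# Hypothetical two-state, two-action problem.
q_online = {0: [0.0, 1.0], 1: [2.0, 4.0]}
q_target = {0: [0.5, 0.5], 1: [3.0, 1.0]}
batch = [
    (0, 1, 1.0, 1, False),  # y = 1.0 + 0.5 * max(min(3, 2), min(1, 4)) = 2.0
    (1, 0, 0.0, 0, True),   # terminal: y = 0.0
]
loss = minto_td_loss(batch, q_online, q_target, gamma=0.5)
print(loss)  # 1.25
```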
[14] Under Review

Preprint. Under Review. C.3 Offline RL. For the offline reinforcement learning experiments on Atari, we use datasets from RL Unplugged (Gulcehre et al., 2020), which provide standardized and diverse benchmarks. The implementation is built on a stable and well-tested codebase to ensure reproducibility and fair comparison. The code will be shared upon a...
[15] C.4 Online RL and Continuous Control. For our continuous-control experiments with online reinforcement learning, we adopt SimbaV1 and SimbaV2

Column groups in the source table: CNN (CQL, CQL+MINTO) and IMPALA (CQL, CQL+MINTO).

| Hyperparameter | Value |
| --- | --- |
| Dataset Size | 5,000,000 |
| Batch Size | 32 |
| Update Horizon | 1 |
| Discount Factor (γ) | 0.99 |
| Epochs | 100 |
| Learning Rate | 5×10⁻⁵ |
| Adam (ϵ) | 5×3.125 −4 |
| Training Steps per Epoch | 62,500 |
| Tradeoff Factor (α) | 0.1 |
| Target Update Frequency (T) | 2000 |
| Layer Norm | no, yes |

Table 4: Comparison of CQL and CQL+MINTO hyperparameters for the...
[16] Identical values are merged

Hyperparameters for SimbaV1 and SimbaV1+MINTO. Column groups in the source table: DMC-Hard, HumanoidBench, MuJoCo.

| Hyperparameter | Value(s) |
| --- | --- |
| Discount Factor (γ) | 0.99, 0.995 |
| Learning Rate | 1.0×10⁻⁴ |
| Weight Decay | 0.01 |
| Target (τ) | 0.005 |
| Update Horizon (n) | 1 |
| Temperature Initial Value | 0.01 |
| Temperature Target Entropy | −0.5 × \|A\| |
| Batch Size | 256 |
| Buffer Max Length | 1,000,000 |
| Buffer Min Length | 5,000 |
| Num Train Envs | 1 |
| Action Repeat | 2, 1 |
| Max Epis... | |
