Pith · machine review for the scientific record

arxiv: 2605.01968 · v1 · submitted 2026-05-03 · 💻 cs.LG

Recognition: unknown

AdamO: A Collapse-Suppressed Optimizer for Offline RL


Pith reviewed 2026-05-10 14:44 UTC · model grok-4.3

classification 💻 cs.LG
keywords offline reinforcement learning · Adam optimizer · TD learning stability · critic collapse · orthogonality constraint · feedback system analysis

The pith

AdamO modifies Adam with a regulated orthogonality correction to stabilize offline RL critics against collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Offline RL critics often collapse into extreme Q-values because bootstrapped TD updates amplify their own errors. The paper demonstrates that optimizer dynamics can themselves trigger this instability by distorting parameter geometry during Adam steps. By casting the local update as a feedback system, the authors derive a necessary and sufficient spectral-radius condition for stability. They introduce AdamO, which inserts a decoupled orthogonality term controlled by a strict task-alignment budget. The design is shown to guarantee worst-case task safety while leaving Adam's continuous-time dissipative behavior intact, making the method drop-in compatible with existing offline RL algorithms.

Core claim

The paper claims that offline TD critic updates remain stable if and only if the spectral radius of the induced update operator is strictly less than one. Standard Adam can push this radius above one by altering geometry, amplifying TD errors. AdamO adds an explicit orthogonality correction that is decoupled from the main update and bounded by a task-alignment budget; within the modeled regime this correction forces the radius below one, thereby guaranteeing task safety without destroying the dissipative structure of Adam's continuous-time dynamics.
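
For orientation, the kind of local model such a condition refers to can be written out explicitly. The notation below (error vector e_t, one-step update map U_η, operator A(η)) is introduced here for illustration and need not match the paper's symbols.

```latex
% Illustrative local model (our notation, not necessarily the paper's).
% Linearizing one critic update around a reference point \theta^\ast and
% writing e_t = \theta_t - \theta^\ast gives a linear feedback recursion:
\[
  e_{t+1} \;\approx\; A(\eta)\, e_t,
  \qquad
  A(\eta) \;=\; \left.\frac{\partial\, \mathcal{U}_\eta(\theta)}{\partial \theta}\right|_{\theta^\ast},
\]
% where \mathcal{U}_\eta is the one-step optimizer/TD update map with
% learning rate \eta. In this linear model the error decays for every
% initial condition iff \rho\bigl(A(\eta)\bigr) < 1; if \rho(A(\eta)) > 1,
% some error component is amplified at every step, which is the collapse
% mechanism the paper attributes to optimizer geometry.
```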

What carries the argument

AdamO, an Adam-based optimizer that augments each step with a decoupled orthogonality correction whose magnitude is strictly limited by a task-alignment budget, thereby restoring the spectral-radius condition for stability.
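
The review does not reproduce the paper's update rule, but the mechanism named above (a standard Adam step plus a decoupled, budget-capped correction) can be sketched roughly as follows. Everything in the sketch is an illustrative assumption: the function name adamo_like_step, the choice to damp the step's component along the current weights as the "orthogonality" correction, and the hyperparameters ortho_coef and budget are not taken from the paper.

```python
import torch

def adamo_like_step(param, grad, state, lr=3e-4, betas=(0.9, 0.999),
                    eps=1e-8, ortho_coef=0.1, budget=0.5):
    """One hypothetical AdamO-style update for a single parameter tensor.

    Sketch only: (i) a standard Adam step, plus (ii) a decoupled
    correction that damps the step's component along the current
    weights (one plausible reading of an 'orthogonality correction'),
    with the correction's norm capped at `budget` times the Adam
    step's norm (a stand-in for the task-alignment budget).
    """
    m = state.setdefault("m", torch.zeros_like(param))
    v = state.setdefault("v", torch.zeros_like(param))
    state["t"] = state.get("t", 0) + 1
    t = state["t"]

    # Standard Adam moment estimates with bias correction.
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    adam_step = lr * m_hat / (v_hat.sqrt() + eps)

    # Decoupled correction: remove part of the step's radial component
    # (along the weights), nudging the effective update toward the
    # subspace orthogonal to the current parameters.
    p_dir = param / (param.norm() + eps)
    radial = (adam_step * p_dir).sum() * p_dir
    correction = -ortho_coef * radial

    # Task-alignment budget: the correction may never exceed a fixed
    # fraction of the Adam step, so it cannot override task progress.
    max_norm = budget * adam_step.norm()
    if correction.norm() > max_norm:
        correction = correction * (max_norm / (correction.norm() + eps))

    param.data.sub_(adam_step + correction)
```

The one structural point the sketch does try to capture is that the correction is decoupled from the Adam moments and is bounded by a fixed fraction of the Adam step's norm, so task progress is preserved by construction; `state` is a plain per-tensor dict kept across steps, mirroring how torch.optim stores per-parameter state.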

If this is right

  • Any offline RL algorithm that swaps its optimizer for AdamO inherits the local stability guarantee and improved returns.
  • The same orthogonality correction can be added to other first-order methods without changing their continuous-time limits.
  • Worst-case task safety holds as long as the task-alignment budget is respected during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The feedback-system view of TD updates may be reusable for diagnosing instability in online RL or model-based methods.
  • Enforcing orthogonality only on the critic parameters could be tested as a lighter-weight alternative to full AdamO.
  • If the spectral condition is violated in practice, simply rescaling the learning rate might restore stability without the extra orthogonality term, as the toy sketch after this list illustrates.
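
On the last point above: in a linear model θ_{t+1} = (I − ηM) θ_t, the spectral radius depends directly on the learning rate, so rescaling η alone can restore contraction. A toy numerical check, with a made-up symmetric positive definite M standing in for the induced operator:

```python
import numpy as np

# Toy check of the "rescale the learning rate" intuition on a linear
# model theta_{t+1} = A(eta) theta_t with A(eta) = I - eta * M. The
# matrix M below is fabricated purely for illustration; because it is
# symmetric positive definite, the eigenvalues of A(eta) are exactly
# 1 - eta * lambda_i(M), so rho(A(eta)) < 1 iff 0 < eta < 2 / lambda_max.
rng = np.random.default_rng(0)
G = rng.normal(size=(6, 6))
M = G @ G.T + 0.5 * np.eye(6)

lam = np.linalg.eigvalsh(M)
eta_max = 2.0 / lam.max()      # boundary of the stability window

def spectral_radius(eta):
    return np.abs(np.linalg.eigvals(np.eye(6) - eta * M)).max()

print(f"stability window: 0 < eta < {eta_max:.4f}")
for eta in (2.0 * eta_max, 0.5 * eta_max):
    print(f"eta = {eta:.4f}  ->  rho(A) = {spectral_radius(eta):.3f}")
# Outside the window rho exceeds one (expansive regime); halving eta
# into the window restores the contraction condition without any
# orthogonality correction, in this linear toy model.
```

The caveat is the load-bearing premise below: this only speaks to the regime where the update is well approximated by a fixed linear operator.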

Load-bearing premise

The stability proof and spectral-radius condition apply only inside the specific regime where local update dynamics can be represented as a linear feedback system.

What would settle it

An offline RL run using AdamO in which the critic still diverges to unusable Q-values while the spectral radius of the update operator remains below one would falsify the safety claim.
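
Running such a test presupposes a way to track the spectral radius during training. One rough way to do so, assuming the update operator is taken to be the Jacobian of a deterministic one-step critic update on a fixed batch and that its dominant eigenvalue is real and simple, is finite-difference power iteration. The function below, including update_fn, the flattened-parameter convention, and the tolerances, is an illustrative sketch rather than the paper's measurement protocol.

```python
import torch

def estimate_update_spectral_radius(update_fn, theta, iters=50, fd_eps=1e-4):
    """Rough power-iteration estimate of rho(J), where J is the Jacobian
    of one critic update step at theta (a flattened parameter vector).

    `update_fn(theta)` must deterministically return the parameters
    after one optimizer/TD step on a fixed batch. Jacobian-vector
    products are approximated by finite differences, so the estimate
    is only as good as the smoothness of the update map, and it tracks
    the dominant eigenvalue's magnitude only when that eigenvalue is
    real and simple.
    """
    base = update_fn(theta)
    v = torch.randn_like(theta)
    v = v / v.norm()
    radius = torch.tensor(0.0)
    for _ in range(iters):
        jv = (update_fn(theta + fd_eps * v) - base) / fd_eps   # ~ J v
        radius = jv.norm()
        if radius < 1e-12:
            return 0.0
        v = jv / radius
    return radius.item()
```

Logged alongside the critic loss, this would make the falsification test concrete: divergence while the estimate stays below one would count against the safety claim; divergence with the estimate above one would not.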

Figures

Figures reproduced from arXiv: 2605.01968 by Ju Ren, Nan Qiao, Sheng Yue, Shuning Wang.

Figure 1. Convergence vs. collapse of critic loss using TD3+BC. Top: on the walker2d-medium-expert task, the update operator satisfies the contraction condition ρ(A(η)) < 1, which suppresses bootstrapping errors and keeps the critic loss convergent. Bottom: on the pen-human task, the expansive regime ρ(A(η)) > 1 triggers a value explosion (critic loss collapse). The green vertical line highlights the critical bo…

Figure 2. Comparison of computational overhead and wall-clock runtime across different optimizers.

Figure 3. MuJoCo locomotion environments used in our offline RL benchmark.

Figure 4. Adroit dexterous manipulation tasks in D4RL.

Figure 5. AntMaze layouts used for offline evaluation.

Figure 6. FrankaKitchen datasets and an example scene used in our benchmark.

Figure 7. Radar charts visualizing the aggregated normalized scores across diverse D4RL domains.

Figure 8. Sensitivity analysis of various algorithms across different κ values.

Figure 9. Ablation study on the conflict budget τ. A moderate budget successfully prevents collapse while maintaining high performance.

Figure 10. Critic loss across different tasks. Per-task panels (hopper-mr, pen-cloned, pen-human, walker2d-mr) plot normalized score, critic loss, and real and imaginary components for AdamO vs. Adam.

Figure 11. Training dynamics across four representative tasks.
Original abstract

Offline reinforcement learning (RL) can fail spectacularly when bootstrapped temporal-difference (TD) updates amplify their own errors, driving the critic toward extreme and unusable Q-values. A key counterintuitive insight of this work is that collapse is not only a property of the backup rule or network architecture: optimizer dynamics themselves can directly trigger or suppress instability. From a control-theoretic viewpoint, we model offline TD learning as a feedback system and analyze Adam-based critic updates. This yields a necessary and sufficient condition for stability of the induced local update dynamics: within the regime we analyze, these dynamics are stable if and only if the spectral radius of the corresponding update operator is strictly below one. Further analysis suggests that standard Adam updates can inadvertently distort the parameter geometry, motivating explicit orthogonality constraints to prevent TD error amplification. To this end, we propose AdamO, an Adam-based optimizer with a decoupled orthogonality correction regulated by a strict task-alignment budget. We prove that this design theoretically guarantees worst-case task safety and preserves Adam's continuous-time dissipative dynamics. Empirically, AdamO is broadly compatible with diverse offline RL baselines, improving stability and returns across a broad suite of benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AdamO, an Adam-based optimizer augmented with a decoupled orthogonality correction controlled by a task-alignment budget, for stabilizing critic updates in offline RL. It models offline TD learning as a feedback system, derives a necessary-and-sufficient spectral-radius condition for local stability of the update dynamics, and claims a proof that the design guarantees worst-case task safety while preserving Adam's continuous-time dissipative properties. Empirical results indicate improved stability and returns when plugged into diverse offline RL baselines across benchmarks.

Significance. If the local-to-global bridging argument and the worst-case safety proof hold beyond the modeled regime, the contribution would be notable: it reframes optimizer choice itself as a mechanism for suppressing TD collapse rather than relying solely on algorithmic or architectural fixes, and the control-theoretic lens could generalize to other bootstrapped settings. The compatibility with existing baselines is a practical strength.

major comments (2)
  1. [theoretical analysis / feedback-system derivation] The necessary-and-sufficient spectral-radius <1 condition is derived only for the local update dynamics when modeled as a feedback system (theoretical analysis section). No explicit argument is supplied showing why this local operator property prevents TD-error amplification over the full non-stationary offline trajectory, across the state-action distribution, or when bootstrapping pushes the system outside the modeled regime, yet the abstract and introduction assert a proof of worst-case task safety.
  2. [AdamO design / task-alignment budget] The task-alignment budget is introduced to regulate the orthogonality correction and avoid geometry distortion, but the manuscript does not demonstrate that its selection is independent of the stability objective; this creates a risk that the budget is implicitly tuned to the same spectral-radius condition it is meant to enforce (see the definition of the correction term and the budget parameter).
minor comments (2)
  1. [experiments] The experimental section would benefit from an ablation isolating the contribution of the orthogonality correction versus the budget alone, and from reporting the fraction of runs that still exhibit collapse under AdamO.
  2. [preliminaries / notation] Notation for the update operator and its spectral radius should be introduced earlier and used consistently; the transition from continuous-time dissipative dynamics to the discrete feedback model is not clearly sign-posted.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments raise important points about the scope of our theoretical guarantees and the independence of design parameters. We address each major comment below and outline revisions that will strengthen the manuscript.

Point-by-point responses
  1. Referee: The necessary-and-sufficient spectral-radius <1 condition is derived only for the local update dynamics when modeled as a feedback system (theoretical analysis section). No explicit argument is supplied showing why this local operator property prevents TD-error amplification over the full non-stationary offline trajectory, across the state-action distribution, or when bootstrapping pushes the system outside the modeled regime, yet the abstract and introduction assert a proof of worst-case task safety.

    Authors: We appreciate the referee highlighting the need for clearer bridging between local and global regimes. Our feedback-system model is constructed precisely to represent the core TD update operator under the fixed offline data distribution; the necessary-and-sufficient spectral-radius condition establishes contractivity of the linearized dynamics, which precludes local error amplification. In the offline setting the state-action distribution is stationary by construction, and the worst-case safety claim follows from the fact that the orthogonality correction enforces the radius bound uniformly. Nevertheless, we acknowledge that an explicit connection from the local linearization to the full trajectory (including potential distribution shifts induced by bootstrapping) is not spelled out in sufficient detail. In the revised manuscript we will add a dedicated subsection that invokes a discrete-time Lyapunov argument to show that local spectral-radius <1 implies bounded error propagation over finite-length trajectories under the bounded non-stationarity present in offline RL. revision: yes

  2. Referee: The task-alignment budget is introduced to regulate the orthogonality correction and avoid geometry distortion, but the manuscript does not demonstrate that its selection is independent of the stability objective; this creates a risk that the budget is implicitly tuned to the same spectral-radius condition it is meant to enforce (see the definition of the correction term and the budget parameter).

    Authors: The task-alignment budget is introduced to limit the magnitude of the orthogonality correction so that the update direction remains aligned with the original Adam gradient, thereby preserving the continuous-time dissipative properties we analyze. The stability proof shows that the spectral radius is strictly less than one for every budget value in (0,1], independent of the particular numerical choice; the budget therefore does not need to be tuned to the radius condition itself. Its role is geometric rather than stability-specific. To remove any ambiguity we will insert a clarifying paragraph in the AdamO design section that states this independence explicitly and will augment the experimental section with a sensitivity plot demonstrating that performance remains stable across a wide interval of budget values without requiring re-tuning for the spectral-radius guarantee. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper models offline TD updates as a feedback system, derives a necessary-and-sufficient spectral-radius condition for local stability directly from that model, identifies geometry distortion in standard Adam, introduces a decoupled orthogonality correction with task-alignment budget, and claims a proof that the resulting AdamO design guarantees worst-case task safety while preserving dissipative dynamics. Each step is an analytical derivation or design choice motivated by the preceding analysis rather than a definitional equivalence, fitted input renamed as prediction, or load-bearing self-citation. The local-to-global safety claim is presented as a substantive theorem rather than a reduction to the input model by construction, and no equations or steps are shown to collapse into their own premises.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on modeling offline TD learning as a linear feedback system whose stability is governed by the spectral radius of the update operator; the task-alignment budget is an introduced scalar that trades off orthogonality against task progress.

free parameters (1)
  • task-alignment budget
    Scalar that limits the strength of the orthogonality correction while preserving task-relevant updates; its value is chosen to satisfy the stability condition without fully nullifying the Adam step.
axioms (2)
  • domain assumption Offline TD learning can be represented as a feedback system whose local dynamics are captured by an update operator
    Invoked to obtain the necessary-and-sufficient stability condition via spectral radius.
  • domain assumption Standard Adam updates can distort parameter geometry in a manner that amplifies TD errors
    Motivates the need for the explicit orthogonality correction.

pith-pipeline@v0.9.0 · 5505 in / 1434 out tokens · 37830 ms · 2026-05-10T14:44:35.939550+00:00 · methodology

