Frictional Q-Learning

Hyo Kyung Lee; Hyunwoo Kim

arXiv: 2509.19771 · v5 · submitted 2025-09-24 · cs.LG · cs.AI

Pith reviewed 2026-05-18 14:42 UTC · model grok-4.3

keywords: reinforcement learning, off-policy algorithms, extrapolation errors, variational autoencoders, action manifolds, continuous control, Q-learning

The pith

Frictional Q-Learning mitigates extrapolation errors in off-policy reinforcement learning by decomposing the replay buffer into tangential supported directions and normal error components using a friction analogy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve extrapolation errors that arise in off-policy reinforcement learning when policies choose actions poorly represented in the replay buffer. By analogizing to static friction, the authors model the replay buffer as a smooth low-dimensional manifold where supported actions are tangential and errors are normal. A contrastive variational autoencoder learns to identify these tangent directions for supported actions. This setup creates a natural stability condition like a friction threshold that discourages movement into unsupported regions. The result is more reliable learning demonstrated on standard continuous control problems.

Core claim

Off-policy reinforcement learning suffers from extrapolation errors when a learned policy selects actions that are weakly supported in the replay buffer. From this perspective, the replay buffer is represented as a smooth, low-dimensional action manifold, where the support directions correspond to the tangential component, while the normal component captures the dominant first-order extrapolation error. This decomposition reveals an intrinsic anisotropy in value sensitivity that naturally induces a stability condition analogous to a friction threshold. Frictional Q-Learning encodes supported actions as tangent directions using a contrastive variational autoencoder and shows that an orthonormal basis of the orthogonal complement corresponds to normal components under mild local isometry assumptions.

What carries the argument

The contrastive variational autoencoder encoding supported actions as tangent directions on the action manifold, combined with the friction threshold stability condition derived from the tangential-normal decomposition.
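
As a minimal sketch of the tangential-normal split, assume the tangent directions at a supported action are already available as plain vectors. The paper obtains them with a contrastive VAE; here they are supplied by hand, and the name `split_tangent_normal` and the toy vectors are illustrative choices, not taken from the paper.

```python
import numpy as np

def split_tangent_normal(delta, tangent_vectors):
    """Split `delta` into a part inside span(tangent_vectors) and a part orthogonal to it."""
    # Orthonormalise the (possibly non-orthogonal) tangent vectors.
    # Columns of q form an orthonormal basis of the tangent space.
    q, _ = np.linalg.qr(np.asarray(tangent_vectors, dtype=float).T)
    tangential = q @ (q.T @ delta)   # orthogonal projection onto the tangent space
    normal = delta - tangential      # remainder lies in the orthogonal complement
    return tangential, normal

# Hypothetical 3-D action space in which supported actions vary only in the x-y plane.
tangent_vectors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
delta = np.array([0.3, -0.2, 0.5])   # proposed deviation from a supported action

tangential, normal = split_tangent_normal(delta, tangent_vectors)
print("tangential:", tangential)
print("normal:    ", normal)
```

Under this reading, the size of `normal` relative to `tangential` is the quantity a friction-style threshold would act on; how the paper actually defines that threshold is not visible from the abstract.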

If this is right

Robust and stable performance on standard continuous-control benchmarks compared with competitive baselines.
The method enforces avoidance of actions with large normal components, reducing value estimation errors.
Under local isometry assumptions, the orthogonal complement provides the normal directions for error correction.
Value sensitivity exhibits anisotropy due to the manifold structure, leading to directional stability preferences.
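
The third point above can be illustrated with plain linear algebra. The sketch below assumes the tangent directions are given as rows of a matrix and computes an orthonormal basis of their orthogonal complement with an SVD. The function name and the tolerance are assumptions made for illustration, and the local isometry conditions under which the paper identifies this basis with normal directions are not modelled here.

```python
import numpy as np

def orthogonal_complement_basis(tangent_vectors, tol=1e-10):
    """Return an orthonormal basis (as columns) of the complement of span(tangent_vectors)."""
    t = np.asarray(tangent_vectors, dtype=float)       # shape (k, d), one tangent vector per row
    _, s, vt = np.linalg.svd(t, full_matrices=True)    # vt has shape (d, d) with orthonormal rows
    rank = int(np.sum(s > tol))                        # dimension of the tangent space
    return vt[rank:].T                                 # remaining rows span the null space of t

# Same hypothetical tangent vectors as before: the x-y plane inside a 3-D action space.
tangent = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
normals = orthogonal_complement_basis(tangent)
print(normals.shape)       # one normal direction in a 3-D space with a 2-D tangent plane
print(tangent @ normals)   # numerically zero: every normal is orthogonal to every tangent
```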

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This friction-based decomposition might extend to other off-policy methods like actor-critic algorithms beyond Q-learning.
Applying the manifold representation to high-dimensional state spaces could test the scalability of the low-dimensional action manifold assumption.
Investigating whether the contrastive VAE can be replaced with simpler density estimators would clarify the necessity of the variational approach.
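
To make the last extension concrete, here is a rough sketch of what a "simpler density estimator" could look like: an isotropic Gaussian kernel density estimate over buffer actions, used as a support score. This is a swapped-in technique for illustration only. The bandwidth, the synthetic buffer, and the function name `kde_log_density` are assumptions, and nothing here reproduces the paper's contrastive VAE.

```python
import numpy as np

def kde_log_density(query, buffer_actions, bandwidth=0.2):
    """Log of an isotropic Gaussian kernel density estimate evaluated at `query`."""
    buffer_actions = np.asarray(buffer_actions, dtype=float)
    _, d = buffer_actions.shape
    sq_dist = np.sum((buffer_actions - query) ** 2, axis=1)
    log_kernels = -0.5 * sq_dist / bandwidth**2 - 0.5 * d * np.log(2 * np.pi * bandwidth**2)
    # Log-mean-exp keeps the computation stable when all kernels are tiny.
    m = np.max(log_kernels)
    return m + np.log(np.mean(np.exp(log_kernels - m)))

rng = np.random.default_rng(0)
# Synthetic stand-in for a replay buffer: 2-D actions lying near the line a2 = a1.
a1 = rng.uniform(-1.0, 1.0, size=500)
buffer_actions = np.stack([a1, a1 + 0.05 * rng.normal(size=500)], axis=1)

supported = kde_log_density(np.array([0.5, 0.5]), buffer_actions)     # on the line
unsupported = kde_log_density(np.array([0.5, -0.5]), buffer_actions)  # far from the line
print(supported > unsupported)
```

A density score of this kind says how well an action is supported but gives no directions, which is one reason the comparison with a tangent-direction encoder would be informative.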

Load-bearing premise

The replay buffer forms a smooth low-dimensional manifold where deviations to unsupported actions are dominated by first-order extrapolation errors in the normal direction.

What would settle it

Running Frictional Q-Learning on a benchmark where the replay buffer does not approximate a low-dimensional manifold and observing persistent instability or no performance gain would falsify the core manifold assumption.
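
One cheap way to screen a buffer before running such a test is to look at how much variance a few principal directions explain. The sketch below is a linear check on synthetic data, with all names and sizes chosen for illustration. A curved manifold could fail this check and still be low-dimensional, so it is a first filter rather than a verdict.

```python
import numpy as np

def explained_variance_ratio(actions):
    """Fraction of total variance captured by each principal direction, largest first."""
    centered = actions - actions.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values**2
    return variances / variances.sum()

rng = np.random.default_rng(0)

# Synthetic buffer A: 6-D actions driven by 2 latent factors plus small noise.
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 6))
low_dim_actions = latent @ mixing + 0.01 * rng.normal(size=(1000, 6))

# Synthetic buffer B: 6-D isotropic actions with no low-dimensional structure.
isotropic_actions = rng.normal(size=(1000, 6))

low_dim_ratio = explained_variance_ratio(low_dim_actions)
isotropic_ratio = explained_variance_ratio(isotropic_actions)
print("top-2 share, structured buffer:", low_dim_ratio[:2].sum())
print("top-2 share, isotropic buffer: ", isotropic_ratio[:2].sum())
```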

Figures

Figures reproduced from arXiv: 2509.19771 by Hyo Kyung Lee, Hyunwoo Kim.

**Figure 1.** Body of mass m on an inclined plane at angle θ, illustrating force components and static friction. Consider a body of mass m resting on a plane inclined at an angle θ relative to the horizontal. The gravitational force mg acts vertically downward and can be decomposed into two components with respect to the plane: a tangential component mg sin θ directed downslope, and a normal component mg cos θ perpendic…
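
For readers who want the textbook condition behind this picture, the standard static-friction result is worked out below. The coefficient of static friction, written here as \mu_s, is introduced for this sketch; the truncated caption does not show the paper's own notation.

```latex
% Forces along and perpendicular to the incline
F_{\parallel} = m g \sin\theta, \qquad N = m g \cos\theta .

% The body stays at rest while the tangential pull does not exceed
% the maximum static friction \mu_s N:
m g \sin\theta \;\le\; \mu_s \, m g \cos\theta
\quad\Longleftrightarrow\quad
\tan\theta \;\le\; \mu_s .
```

The threshold depends only on the ratio of the tangential to the normal component, which is the feature the friction analogy borrows.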

**Figure 2.** Average return (solid line) and standard deviation (shaded area) across five independent …

**Figure 3.** Average return (solid line) and standard deviation (shaded area) across five runs with …

**Figure 4.** Average return (solid line) and standard deviation (shaded area) across five runs with differ…

**Figure 5.** Density distribution of replay buffer actions (blue) and orthonormal actions (orange) in …

Original abstract

Off-policy reinforcement learning suffers from extrapolation errors when a learned policy selects actions that are weakly supported in the replay buffer. In this study, we address this issue by drawing an analogy to static friction. From this perspective, the replay buffer is represented as a smooth, low-dimensional action manifold, where the support directions correspond to the tangential component, while the normal component captures the dominant first-order extrapolation error. This decomposition reveals an intrinsic anisotropy in value sensitivity that naturally induces a stability condition analogous to a friction threshold. To mitigate deviations toward unsupported actions, we propose Frictional Q-Learning, an off-policy algorithm that encodes supported actions as tangent directions using a contrastive variational autoencoder. We further show that an orthonormal basis of the orthogonal complement corresponds to normal components under mild local isometry assumptions. Extensive empirical results on standard continuous-control benchmarks consistently demonstrate robust and stable performance compared with competitive baselines.
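
The phrase "dominant first-order extrapolation error" can be unpacked with a first-order expansion. The following is a sketch under assumed notation, not the paper's derivation: P_T denotes the orthogonal projector onto the tangent space of the action manifold at a supported action a, and P_N = I - P_T the projector onto its orthogonal complement.

```latex
% Split a deviation \delta from a supported action a
\delta = P_T \delta + P_N \delta , \qquad P_N = I - P_T .

% First-order change in the value estimate
Q(s, a + \delta) - Q(s, a)
  = \underbrace{\langle \nabla_a Q(s,a),\, P_T \delta \rangle}_{\text{along supported directions}}
  + \underbrace{\langle \nabla_a Q(s,a),\, P_N \delta \rangle}_{\text{along unsupported directions}}
  + O\!\left(\lVert \delta \rVert^{2}\right) .
```

On this reading, buffer data constrain the gradient of Q only along tangential directions, so the second term is where an unconstrained estimate can be arbitrarily wrong. That asymmetry is one way to interpret the anisotropy in value sensitivity the abstract describes.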

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Frictional Q-Learning tries a static-friction analogy plus contrastive VAE to split extrapolation error into tangential and normal parts on an action manifold, but the abstract shows no derivations or checks on the key assumptions.

The main thing to know is that this paper frames extrapolation error in off-policy continuous-control RL as a decomposition on an action manifold, where supported directions are tangential and unsupported ones are normal, then uses a contrastive VAE to encode the tangents and a friction threshold to keep the policy stable. They add a claim that an orthonormal basis of the orthogonal complement gives the normal components under mild local isometry. The reported experiments claim more robust performance than baselines on standard benchmarks. That combination of the friction analogy and the VAE encoding step is the clearest new element.

It targets a real, recurring issue in data-limited settings like robotics, and the conceptual split into anisotropic value sensitivity is a reasonable way to motivate a stability condition. If the manifold representation holds, it could give a more structured regularizer than existing correction methods.

The soft spots are mostly around missing support for the central steps. The abstract states the manifold decomposition and local-isometry result but supplies no derivations, error bounds, or validation of the isometry assumption. The premise that the replay buffer forms a smooth low-dimensional manifold with clear tangential support and dominant normal error components is a strong modeling choice that could break in noisy or high-dimensional data. The friction threshold is an extra free parameter that will need tuning. Without seeing the full proofs or the actual experimental controls, it is hard to tell how much of the claimed stability comes from the new components versus standard tricks.

This work is for RL researchers who already work on off-policy methods in continuous spaces and are open to geometric or manifold-based fixes. A reader who cares about practical reliability in robotics or data-scarce domains would get the most out of it.

The paper deserves a serious referee because the problem is important, the framework is internally coherent at the abstract level, and the empirical claim is falsifiable even if the math needs checking.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Frictional Q-Learning, an off-policy RL algorithm that models the replay buffer as a smooth low-dimensional action manifold. Supported actions are encoded as tangent directions via a contrastive variational autoencoder, while the normal component of an orthonormal basis of the orthogonal complement is claimed to capture first-order extrapolation error under mild local isometry assumptions. This decomposition is said to induce an intrinsic anisotropy in value sensitivity that yields a stability condition analogous to a static friction threshold. The paper reports robust and stable empirical performance on standard continuous-control benchmarks relative to competitive baselines.

Significance. If the manifold decomposition, local-isometry correspondence, and friction-threshold stability condition can be rigorously established, the work would supply a geometrically motivated mechanism for controlling extrapolation error in off-policy continuous-control RL. The contrastive-VAE encoding of tangent directions and the explicit friction analogy constitute a distinctive framing that could influence subsequent research on action-space geometry and value-function stability.

major comments (2)

[Abstract and §3, theoretical development] The central claims that the replay buffer constitutes a smooth low-dimensional action manifold, that support directions are exactly the tangential component, and that an orthonormal basis of the orthogonal complement corresponds to normal components under mild local isometry are asserted without derivation, error bounds, or explicit construction of the manifold from the replay buffer.
[§4 (algorithm) and §5 (experiments)] The friction threshold is introduced as a free parameter whose value is not derived from the fitted manifold or from any previously computed quantity; no ablation or sensitivity analysis quantifies how performance depends on this choice, undermining the claim that the method is parameter-light relative to baselines.

minor comments (2)

[§3] Notation for the contrastive VAE loss and the precise definition of the tangent/normal decomposition should be stated explicitly with consistent symbols across sections.
[§5] Figure captions and axis labels in the experimental section would benefit from explicit indication of which baseline each curve corresponds to and whether shaded regions represent standard error or min/max.
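
The distinction raised in the second minor comment matters because the three candidate bands differ in width for the same data. The sketch below computes all of them for a learning curve over five runs. The returns are synthetic placeholders generated on the spot, not results from the paper, and the array layout `returns[run, checkpoint]` is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic placeholder data: returns[run, checkpoint] for five runs and 20 checkpoints.
returns = rng.normal(loc=1000.0, scale=50.0, size=(5, 20))

mean = returns.mean(axis=0)                          # solid line
std = returns.std(axis=0, ddof=1)                    # sample standard deviation across runs
sem = std / np.sqrt(returns.shape[0])                # standard error of the mean
low, high = returns.min(axis=0), returns.max(axis=0) # min/max envelope

print("std band half-width at first checkpoint:", std[0])
print("sem band half-width at first checkpoint:", sem[0])
print("min/max envelope at first checkpoint:   ", low[0], high[0])
```

With five runs the standard-error band is narrower than the standard-deviation band by a factor of the square root of five, so a caption that does not say which one is shaded can make curves look more or less separated than they are.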

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment in detail below and outline the revisions we plan to make.

Referee: [Abstract and §3, theoretical development] The central claims that the replay buffer constitutes a smooth low-dimensional action manifold, that support directions are exactly the tangential component, and that an orthonormal basis of the orthogonal complement corresponds to normal components under mild local isometry are asserted without derivation, error bounds, or explicit construction of the manifold from the replay buffer.

Authors: The theoretical claims in the abstract and §3 are derived under the stated mild local isometry assumptions, which allow us to identify the tangential and normal components with respect to the action manifold fitted via the contrastive VAE. However, we agree that providing an explicit step-by-step construction of the manifold from the replay buffer data and including error bounds would improve clarity. In the revised manuscript, we will expand §3 with a detailed derivation, including how the replay buffer points are used to learn the manifold and bounds on the first-order extrapolation error.
revision: yes

Referee: [§4 (algorithm) and §5 (experiments)] The friction threshold is introduced as a free parameter whose value is not derived from the fitted manifold or from any previously computed quantity; no ablation or sensitivity analysis quantifies how performance depends on this choice, undermining the claim that the method is parameter-light relative to baselines.

Authors: The friction threshold is a hyperparameter motivated by the stability condition in §3, but we acknowledge that it is not automatically derived from the manifold. To strengthen the empirical validation and address the concern about parameter sensitivity, we will include an ablation study in the revised version of §5. This study will vary the threshold value and report performance metrics on the benchmarks, showing that the method maintains competitive performance across a range of values and remains relatively parameter-light compared to baselines that require more extensive tuning.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation introduces novel components

The paper's derivation chain begins with an explicit analogy to static friction and defines the replay buffer as a smooth low-dimensional action manifold with tangential support directions and normal extrapolation-error components. It then introduces a contrastive variational autoencoder to encode tangent directions and states that an orthonormal basis of the orthogonal complement corresponds to normal components under mild local isometry assumptions. These steps are presented as new modeling choices rather than reductions of outputs to previously fitted parameters or self-citations. No equation is shown to equal its input by construction, no parameter is fitted on a subset and then relabeled as a prediction, and no load-bearing uniqueness theorem is imported from prior author work. The empirical evaluation on standard continuous-control benchmarks therefore rests on independently stated assumptions and architectural innovations, rendering the central claims self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Abstract-only view limits visibility; main unverified elements are the manifold representation of the replay buffer and the local isometry assumption used to align orthonormal bases with normal components.

free parameters (1)

friction threshold
Stability condition analogous to friction threshold; appears to function as a tunable hyperparameter controlling deviation into unsupported directions.

axioms (1)

mild local isometry assumptions (domain assumption)
Invoked to establish that an orthonormal basis of the orthogonal complement corresponds to normal components.

invented entities (1)

action manifold (no independent evidence)
purpose: Represent replay buffer as smooth low-dimensional surface separating supported (tangent) and unsupported (normal) directions
Core modeling choice that enables the tangential/normal decomposition and friction analogy.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "replay buffer is represented as a smooth, low-dimensional action manifold, where the support directions correspond to the tangential component, while the normal component captures the dominant first-order extrapolation error"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "an orthonormal basis of the orthogonal complement corresponds to normal components under mild local isometry assumptions"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
