pith. sign in

arxiv: 2509.19771 · v5 · submitted 2025-09-24 · 💻 cs.LG · cs.AI

Frictional Q-Learning

Pith reviewed 2026-05-18 14:42 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learningoff-policy algorithmsextrapolation errorsvariational autoencodersaction manifoldscontinuous controlQ-learning
0
0 comments X

The pith

Frictional Q-Learning mitigates extrapolation errors in off-policy reinforcement learning by decomposing the replay buffer into tangential supported directions and normal error components using a friction analogy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve extrapolation errors that arise in off-policy reinforcement learning when policies choose actions poorly represented in the replay buffer. By analogizing to static friction, the authors model the replay buffer as a smooth low-dimensional manifold where supported actions are tangential and errors are normal. A contrastive variational autoencoder learns to identify these tangent directions for supported actions. This setup creates a natural stability condition like a friction threshold that discourages movement into unsupported regions. The result is more reliable learning demonstrated on standard continuous control problems.

Core claim

Off-policy reinforcement learning suffers from extrapolation errors when a learned policy selects actions that are weakly supported in the replay buffer. From this perspective, the replay buffer is represented as a smooth, low-dimensional action manifold, where the support directions correspond to the tangential component, while the normal component captures the dominant first-order extrapolation error. This decomposition reveals an intrinsic anisotropy in value sensitivity that naturally induces a stability condition analogous to a friction threshold. Frictional Q-Learning encodes supported actions as tangent directions using a contrastive variational autoencoder and shows that an orthonorm

What carries the argument

The contrastive variational autoencoder encoding supported actions as tangent directions on the action manifold, combined with the friction threshold stability condition derived from the tangential-normal decomposition.

If this is right

  • Robust and stable performance on standard continuous-control benchmarks compared with competitive baselines.
  • The method enforces avoidance of actions with large normal components, reducing value estimation errors.
  • Under local isometry assumptions, the orthogonal complement provides the normal directions for error correction.
  • Value sensitivity exhibits anisotropy due to the manifold structure, leading to directional stability preferences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This friction-based decomposition might extend to other off-policy methods like actor-critic algorithms beyond Q-learning.
  • Applying the manifold representation to high-dimensional state spaces could test the scalability of the low-dimensional action manifold assumption.
  • Investigating whether the contrastive VAE can be replaced with simpler density estimators would clarify the necessity of the variational approach.

Load-bearing premise

The replay buffer forms a smooth low-dimensional manifold where deviations to unsupported actions are dominated by first-order extrapolation errors in the normal direction.

What would settle it

Running Frictional Q-Learning on a benchmark where the replay buffer does not approximate a low-dimensional manifold and observing persistent instability or no performance gain would falsify the core manifold assumption.

Figures

Figures reproduced from arXiv: 2509.19771 by Hyo Kyung Lee, Hyunwoo Kim.

Figure 1
Figure 1. Figure 1: Body of mass m on an inclined plane at angle θ, illustrating force components and static friction. Consider a body of mass m resting on a plane inclined at an angle θ relative to the horizontal. The gravitational force mg acts vertically downward and can be decomposed into two components with respect to the plane: a tangential component mg sin θ directed downslope, and a normal component mg cos θ perpendic… view at source ↗
Figure 2
Figure 2. Figure 2: Average return (solid line) and standard deviation (shaded area) across five independent [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Average return (solid line) and standard deviation (shaded area) across five runs with [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Average return (solid line) and standard deviation (shaded area) across five runs with differ [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Density distribution of replay buffer actions (blue) and orthonormal actions (orange) in [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
read the original abstract

Off-policy reinforcement learning suffers from extrapolation errors when a learned policy selects actions that are weakly supported in the replay buffer. In this study, we address this issue by drawing an analogy to static friction. From this perspective, the replay buffer is represented as a smooth, low-dimensional action manifold, where the support directions correspond to the tangential component, while the normal component captures the dominant first-order extrapolation error. This decomposition reveals an intrinsic anisotropy in value sensitivity that naturally induces a stability condition analogous to a friction threshold. To mitigate deviations toward unsupported actions, we propose Frictional Q-Learning, an off-policy algorithm that encodes supported actions as tangent directions using a contrastive variational autoencoder. We further show that an orthonormal basis of the orthogonal complement corresponds to normal components under mild local isometry assumptions. Extensive empirical results on standard continuous-control benchmarks consistently demonstrate robust and stable performance compared with competitive baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Frictional Q-Learning, an off-policy RL algorithm that models the replay buffer as a smooth low-dimensional action manifold. Supported actions are encoded as tangent directions via a contrastive variational autoencoder, while the normal component of an orthonormal basis of the orthogonal complement is claimed to capture first-order extrapolation error under mild local isometry assumptions. This decomposition is said to induce an intrinsic anisotropy in value sensitivity that yields a stability condition analogous to a static friction threshold. The paper reports robust and stable empirical performance on standard continuous-control benchmarks relative to competitive baselines.

Significance. If the manifold decomposition, local-isometry correspondence, and friction-threshold stability condition can be rigorously established, the work would supply a geometrically motivated mechanism for controlling extrapolation error in off-policy continuous-control RL. The contrastive-VAE encoding of tangent directions and the explicit friction analogy constitute a distinctive framing that could influence subsequent research on action-space geometry and value-function stability.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (theoretical development): the central claims that the replay buffer constitutes a smooth low-dimensional action manifold, that support directions are exactly the tangential component, and that an orthonormal basis of the orthogonal complement corresponds to normal components under mild local isometry are asserted without derivation, error bounds, or explicit construction of the manifold from the replay buffer.
  2. [§4 and §5] §4 (algorithm) and §5 (experiments): the friction threshold is introduced as a free parameter whose value is not derived from the fitted manifold or from any previously computed quantity; no ablation or sensitivity analysis quantifies how performance depends on this choice, undermining the claim that the method is parameter-light relative to baselines.
minor comments (2)
  1. [§3] Notation for the contrastive VAE loss and the precise definition of the tangent/normal decomposition should be stated explicitly with consistent symbols across sections.
  2. [§5] Figure captions and axis labels in the experimental section would benefit from explicit indication of which baseline each curve corresponds to and whether shaded regions represent standard error or min/max.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment in detail below and outline the revisions we plan to make.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (theoretical development): the central claims that the replay buffer constitutes a smooth low-dimensional action manifold, that support directions are exactly the tangential component, and that an orthonormal basis of the orthogonal complement corresponds to normal components under mild local isometry are asserted without derivation, error bounds, or explicit construction of the manifold from the replay buffer.

    Authors: The theoretical claims in the abstract and §3 are derived under the stated mild local isometry assumptions, which allow us to identify the tangential and normal components with respect to the action manifold fitted via the contrastive VAE. However, we agree that providing an explicit step-by-step construction of the manifold from the replay buffer data and including error bounds would improve clarity. In the revised manuscript, we will expand §3 with a detailed derivation, including how the replay buffer points are used to learn the manifold and bounds on the first-order extrapolation error. revision: yes

  2. Referee: [§4 and §5] §4 (algorithm) and §5 (experiments): the friction threshold is introduced as a free parameter whose value is not derived from the fitted manifold or from any previously computed quantity; no ablation or sensitivity analysis quantifies how performance depends on this choice, undermining the claim that the method is parameter-light relative to baselines.

    Authors: The friction threshold is a hyperparameter motivated by the stability condition in §3, but we acknowledge that it is not automatically derived from the manifold. To strengthen the empirical validation and address the concern about parameter sensitivity, we will include an ablation study in the revised version of §5. This study will vary the threshold value and report performance metrics on the benchmarks, showing that the method maintains competitive performance across a range of values and remains relatively parameter-light compared to baselines that require more extensive tuning. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation introduces novel components

full rationale

The paper's derivation chain begins with an explicit analogy to static friction and defines the replay buffer as a smooth low-dimensional action manifold with tangential support directions and normal extrapolation-error components. It then introduces a contrastive variational autoencoder to encode tangent directions and states that an orthonormal basis of the orthogonal complement corresponds to normal components under mild local isometry assumptions. These steps are presented as new modeling choices rather than reductions of outputs to previously fitted parameters or self-citations. No equation is shown to equal its input by construction, no parameter is fitted on a subset and then relabeled as a prediction, and no load-bearing uniqueness theorem is imported from prior author work. The empirical evaluation on standard continuous-control benchmarks therefore rests on independently stated assumptions and architectural innovations, rendering the central claims self-contained.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Abstract-only view limits visibility; main unverified elements are the manifold representation of the replay buffer and the local isometry assumption used to align orthonormal bases with normal components.

free parameters (1)
  • friction threshold
    Stability condition analogous to friction threshold; appears to function as a tunable hyperparameter controlling deviation into unsupported directions.
axioms (1)
  • domain assumption mild local isometry assumptions
    Invoked to establish that an orthonormal basis of the orthogonal complement corresponds to normal components.
invented entities (1)
  • action manifold no independent evidence
    purpose: Represent replay buffer as smooth low-dimensional surface separating supported (tangent) and unsupported (normal) directions
    Core modeling choice that enables the tangential/normal decomposition and friction analogy.

pith-pipeline@v0.9.0 · 5666 in / 1261 out tokens · 59327 ms · 2026-05-18T14:42:45.995973+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 5 internal anchors

  1. [1]

    Contrastive Variational Autoencoder Enhances Salient Features

    Abubakar Abid and James Zou. Contrastive variational autoencoder enhances salient features. arXiv preprint arXiv:1902.04601, 2019

  2. [2]

    The mechanics of n-player differentiable games

    David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. In Proceedings of the 35th International Conference on Machine Learning. PMLR, 2018

  3. [3]

    Neuro-dynamic programming

    Dimitri P Bertsekas. Neuro-dynamic programming. In Encyclopedia of optimization, pp.\ 2555--2560. Springer, 2008

  4. [4]

    Maximum entropy reinforcement learning via energy-based normalizing flow, 2024

    Chen-Hao Chao, Chien Feng, Wei-Fang Sun, Cheng-Kuang Lee, Simon See, and Chun-Yi Lee. Maximum entropy reinforcement learning via energy-based normalizing flow, 2024. URL https://arxiv.org/abs/2405.13629

  5. [5]

    Th \'e orie des machines simples en ayant \'e gard au frottement de leurs parties et \`a la roideur des cordages

    Charles Augustin Coulomb. Th \'e orie des machines simples en ayant \'e gard au frottement de leurs parties et \`a la roideur des cordages . Bachelier, 1821

  6. [6]

    Improved deep reinforcement learning for robotics through distribution-based experience retention

    Tim de Bruin, Jens Kober, Karl Tuyls, and Robert Babu s ka. Improved deep reinforcement learning for robotics through distribution-based experience retention. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.\ 3947--3952. IEEE, 2016

  7. [7]

    Addressing function approximation error in actor-critic methods

    Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International conference on machine learning, pp.\ 1587--1596. PMLR, 2018

  8. [8]

    Off-policy deep reinforcement learning without exploration

    Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pp.\ 2052--2062. PMLR, 2019

  9. [9]

    Reinforcement learning with deep energy-based policies

    Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning. PMLR, 2017

  10. [10]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pp.\ 1861--1870. Pmlr, 2018

  11. [11]

    Selective experience replay for lifelong learning

    David Isele and Akansel Cosgun. Selective experience replay for lifelong learning. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  12. [12]

    Planning with Diffusion for Flexible Behavior Synthesis

    Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991, 2022

  13. [13]

    Policy gradient and actor--critic in continuous time

    Yanwei Jia and Xun Yu Zhou. Policy gradient and actor--critic in continuous time. Journal of Machine Learning Research, 23 0 (84): 0 1--50, 2022

  14. [14]

    q-learning in continuous time

    Yanwei Jia and Xun Yu Zhou. q-learning in continuous time. Journal of Machine Learning Research, 24 0 (130): 0 1--53, 2023

  15. [15]

    Hamilton--jacobi deep q-learning for continuous-time control

    Jeongho Kim, Jaeuk Shin, and Insoon Yang. Hamilton--jacobi deep q-learning for continuous-time control. Journal of Machine Learning Research, 22 0 (262): 0 1--51, 2021

  16. [16]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  17. [17]

    Actor-critic algorithms

    Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Advances in neural information processing systems, 12, 1999

  18. [18]

    Continuous control with deep reinforcement learning

    Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

  19. [19]

    Stein variational policy gradient

    Yang Liu, Prasanna Ramachandran, Qiang Liu, Jian Peng, et al. Stein variational policy gradient. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2017

  20. [20]

    Stochastic hamiltonian gradient methods for smooth games

    Nicolas Loizou, Sharan Vaswani, Volkan Cevher, and Simon Lacoste-Julien. Stochastic hamiltonian gradient methods for smooth games. In Proceedings of the 37th International Conference on Machine Learning. PMLR, 2020

  21. [21]

    Philosophiae naturalis principia mathematica, volume 1

    Isaac Newton. Philosophiae naturalis principia mathematica, volume 1. G. Brookman, 1833

  22. [22]

    Off-policy temporal-difference learning with function approximation

    Doina Precup, Richard S Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning with function approximation. In ICML, pp.\ 417--424, 2001

  23. [23]

    Stable-baselines3: Reliable reinforcement learning implementations

    Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of machine learning research, 22 0 (268): 0 1--8, 2021

  24. [24]

    Deterministic policy gradient algorithms

    David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International conference on machine learning, pp.\ 387--395. Pmlr, 2014

  25. [25]

    Trimesh Authors

    Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.\ 5026--5033, 2012. doi:10.1109/IROS.2012.6386109

  26. [26]

    Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U. Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Hannah Tan, and Omar G. Younis. Gymnasium: A standard interface for reinforcement learning environments, 2024. URL https://arxiv.org...

  27. [27]

    Reinforcement learning in continuous time and space: A stochastic control approach

    Hao Wang and Xun Yu Zhou. Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research, 21 0 (178): 0 1--34, 2020

  28. [28]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  29. [29]

    @esa (Ref

    \@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

  30. [30]

    \@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

  31. [31]

    eoدloIvR&O6r'XppYY1vң e8&_ 5n (& X fjßeƭ X N;'n;cd [95 /V 5ZȄ[[k T-f1rEeʰe._R֢A[ x0 #'1j&( Aʀ X9M XO _fjLk'iOKj 9 #E K #a8ѵɑw

    @open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...