pith · machine review for the scientific record

arxiv: 2604.18978 · v2 · submitted 2026-04-21 · 💻 cs.LG · cs.AI


Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning


Pith reviewed 2026-05-10 02:51 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords low-rank adaptation · critic learning · off-policy reinforcement learning · structural regularization · overfitting mitigation · SAC · actor-critic methods

The pith

Freezing base critic weights and training only low-rank adapters reduces overfitting and boosts performance in off-policy RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores using Low-Rank Adaptation to scale up critic networks in off-policy reinforcement learning without the usual overfitting and instability. It freezes the randomly initialized base matrices in the critic and optimizes only the low-rank adapter parameters, restricting the critic's updates to a low-dimensional subspace. This acts as a form of structural regularization during replay-based training. If effective, it allows larger critics to be used more reliably, leading to better policy learning in algorithms like SAC and FastTD3. The approach is tested across various tasks and shows reduced critic loss along with competitive or superior policy results.
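To make the mechanism concrete, here is a minimal PyTorch sketch of such a layer, assuming the standard LoRA conventions of [14] (zero-initialized B so training starts at the frozen base, update scaled by alpha/rank); the paper's exact initialization and scaling of the frozen base may differ:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with a frozen random base W0 and a trainable
    low-rank adapter BA (a sketch, not the paper's exact module)."""

    def __init__(self, d_in, d_out, rank=16, alpha=1.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)  # freeze the random base W0
        self.A = nn.Parameter(torch.randn(rank, d_in) / d_in**0.5)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # BA = 0 at init
        self.scale = alpha / rank

    def forward(self, x):
        # Effective weight is W0 + (alpha/rank) * B @ A, but only A and B
        # receive gradients, so updates live in a rank-r subspace.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

Replacing each dense linear in the critic with such a layer leaves the optimizer touching only the adapter parameters.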

Core claim

Our approach freezes randomly initialized base matrices and optimizes only the corresponding low-rank adapters in the critic, thereby constraining critic updates to a low-dimensional subspace. This provides a simple structural regularizer that efficiently reduces critic loss and improves policy performance in off-policy RL settings.

What carries the argument

Low-Rank Adaptation (LoRA) on critic networks, freezing base matrices and training only low-rank adapters to constrain updates to a low-dimensional subspace.
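Written out in standard LoRA notation (assumed here; the paper may scale the update differently), the adapted weight of each critic layer is

$$
W = W_0 + \Delta W = W_0 + BA,
\qquad
W_0 \in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{in}}} \ \text{frozen},\;
B \in \mathbb{R}^{d_{\mathrm{out}} \times r},\;
A \in \mathbb{R}^{r \times d_{\mathrm{in}}},\;
r \ll \min(d_{\mathrm{in}}, d_{\mathrm{out}}),
$$

so every gradient step moves only $\Delta W = BA$, a matrix of rank at most $r$, while $W_0$ supplies fixed full-rank structure.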

If this is right

  • Critic loss decreases more efficiently throughout training.
  • Policy performance improves or matches the best results on most tasks.
  • The method applies across SAC, FastTD3, and different network architectures.
  • Structural regularization emerges without extra hyperparameters or loss terms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The subspace constraint may support longer training runs before instability sets in.
  • Low-rank critics could scale to higher capacities in environments with scarce or noisy data.
  • Similar freezing of base weights might regularize other components such as actors in the same framework.

Load-bearing premise

Freezing randomly initialized base matrices and optimizing only low-rank adapters sufficiently preserves the critic's expressive power while preventing overfitting in replay-based training.
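The premise leans on the gap between trainable and effective parameters. With assumed dimensions (illustrative, not figures from the paper), a square layer of width $d$ with LoRA rank $r$ trains

$$
2dr \ \text{parameters instead of}\ d^2,
\qquad
d = 1024,\ r = 16:\quad 2dr = 32{,}768 \approx 3.1\%\ \text{of}\ d^2 = 1{,}048{,}576,
$$

yet the effective weight $W_0 + BA$ stays full-sized, so representational width is preserved and only the update directions are constrained.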

What would settle it

A direct comparison on tasks with minimal overfitting risk where LoRA critics achieve noticeably lower returns than full-parameter critics, showing the low-rank constraint has removed necessary capacity.

Figures

Figures reproduced from arXiv: 2604.18978 by Fei Miao, Jie Feng, Jonathan Petit, Qing Su, Shihao Ji, Sihong He, Songyang Han, Yuanyuan Shi, Yuan Zhuang, Yuexin Bian.

Figure 1
Figure 1: A motivating example. Left: Bellman residual (RMS) over training. Right: final true-Q error on d^π vs. LoRA rank with 5 seeds; dashed line is the dense baseline.
Figure 2
Figure 2: Overview of the proposed LoRA-based critic architecture. For each linear layer in the critic residual block, we freeze the randomly initialized dense base matrix W0 and train only the low-rank adapters A and B. Panel (a) shows the LoRA method applied to the SimbaV2 architecture; panel (b) illustrates the LoRA method for the BRC with BroNet structure.
Figure 3
Figure 3: Performance comparison across RL algorithms and network architectures. We evaluate three settings: SAC+SimbaV2 on DMC-Hard tasks, SAC+BRC on DMC-Hard tasks, and FastTD3+SimbaV2⋆ on IsaacLab robotics tasks. For each setting, the leftmost column reports the average normalized return and critic loss across all tasks, while the remaining columns show representative individual tasks. LoRA consistently achieves …
Figure 4
Figure 4: Comparison of pure LoRA training and hybrid schemes that perform dense updates.
Figure 5
Figure 5: Ablation study of the LoRA critic design. We evaluate four key design choices on Dog-Trot (top) and Humanoid-Run (bottom): (a) the rank of the frozen base matrix W0, (b) the initialization norm ∥w0∥2, (c) the effect of removing hyperspherical weight normalization or the frozen base matrix, and (d) the trainable-parameter trade-off obtained by varying the LoRA rank and comparing against static sparsity. …
Figure 6
Figure 6: LoRA-compatible hyperspherical projection. Top: standard SimbaV2 projects each updated weight vector W onto the unit hypersphere to obtain W′. Bottom: in the LoRA parameterization, the base vector W0 is frozen, so directly normalizing W0 + ∆W would also rescale the frozen base. Instead, for the SAC+SimbaV2 setting, we solve for a scalar sj > 0 and absorb it into the LoRA update so that W0 + sj∆W lies on …
(One way to realize this per-row rescaling is sketched after the figure list.)
Figure 7
Figure 7: Per-environment learning curves.
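Reading the Figure 6 caption literally, the scalar sj can be obtained row-wise by solving a quadratic. The NumPy sketch below is one possible realization, assuming each frozen base row satisfies ∥w0∥2 < 1 so a positive root exists; the paper's exact construction may differ:

```python
import numpy as np

def lora_hypersphere_scale(w0, dw):
    """Per-row scale s > 0 such that ||w0 + s*dw||_2 = 1.

    One reading of Figure 6: expand ||w0 + s*dw||^2 = 1 into
    a*s^2 + b*s + c = 0 and keep the positive root. Assumes
    ||w0|| < 1 (so c < 0) and a nonzero dw in each row.
    """
    a = np.sum(dw * dw, axis=-1)         # ||dw||^2
    b = 2.0 * np.sum(w0 * dw, axis=-1)   # 2 <w0, dw>
    c = np.sum(w0 * w0, axis=-1) - 1.0   # ||w0||^2 - 1 < 0
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

# w0 + s[:, None] * dw then lies on the unit hypersphere while the
# frozen base w0 itself is never modified, matching the caption's intent.
```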
Original abstract

Scaling critic capacity is a promising direction for improving off-policy reinforcement learning (RL). However, recent work shows that larger critics are prone to overfitting and instability in replay-based bootstrapped training. In this paper, we propose using Low-Rank Adaptation (LoRA) as a structural regularizer for critic learning. Our approach freezes randomly initialized base matrices and optimizes only the corresponding low-rank adapters, thereby constraining critic updates to a low-dimensional subspace. We evaluate our method across different off-policy RL algorithms, including SAC and FastTD3 based on different network architectures. Empirically, LoRA efficiently reduces critic loss during training and improves overall policy performance, achieving the best or competitive results on most tasks. Extensive experiments demonstrate that our low-rank updates provide a simple and effective form of structural regularization for critic learning in off-policy RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Low-Rank Adaptation (LoRA) as a structural regularizer for critic networks in off-policy RL algorithms such as SAC and FastTD3. Randomly initialized base matrices are frozen while only low-rank adapters are optimized, constraining critic updates to a low-dimensional subspace during replay-based bootstrapped training. The authors claim this reduces critic loss, mitigates overfitting from large critics, and yields best or competitive policy performance across tasks, supported by extensive experiments on different network architectures.

Significance. If the central claim holds after the experimental controls below are addressed, the work offers a simple, hyperparameter-light way to regularize critic capacity in off-policy RL by importing a parameter-efficient adaptation technique. This could help scale critics without instability, building on known benefits of LoRA in other domains. The approach is notable for its straightforward integration and its focus on a practical failure mode of replay-based training.

major comments (2)
  1. [Experiments] Experiments section: The claim that low-rank subspace constraints provide structural regularization (distinct from capacity reduction) is not isolated, as no control baseline matches the number of trainable parameters using a full-rank but narrower architecture. Without this, performance gains could be explained by reduced capacity alone, which is already known to mitigate overfitting in off-policy settings. (A width-matching sketch for such a control follows the comment list.)
  2. [Results] Results and abstract: No quantitative metrics, ablation details, error bars, or statistical significance tests are reported for the claimed improvements on SAC and FastTD3, preventing assessment of effect sizes or robustness of the empirical findings.
minor comments (1)
  1. [Method] Method section: The description of how base matrices are initialized and frozen could include an explicit equation for the adapted critic output to clarify the low-rank update form.
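As a concrete aid for the parameter-matched control in major comment 1, the following sketch (a hypothetical helper, not from the paper) sizes a narrow dense MLP so its trainable count matches a given LoRA budget:

```python
import math

def matched_width(d_in, d_out, depth, lora_budget):
    """Pick hidden width h for a dense MLP with `depth` hidden layers so
    that d_in*h + (depth-1)*h^2 + h*d_out matches the LoRA adapters'
    trainable budget; solves the quadratic in h."""
    a = depth - 1
    b = d_in + d_out
    if a == 0:
        return max(1, round(lora_budget / b))
    h = (-b + math.sqrt(b * b + 4 * a * lora_budget)) / (2 * a)
    return max(1, round(h))

# Example: rank-16 adapters on four 1024x1024 layers train
# 4 * 2 * 1024 * 16 = 131072 parameters; matched_width(1024, 1024, 4, 131072)
# returns 59, i.e. a 59-unit-wide dense critic of the same depth.
```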

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we intend to make to strengthen the paper.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The claim that low-rank subspace constraints provide structural regularization (distinct from capacity reduction) is not isolated, as no control baseline matches the number of trainable parameters using a full-rank but narrower architecture. Without this, performance gains could be explained by reduced capacity alone, which is already known to mitigate overfitting in off-policy settings.

    Authors: We agree that the current experiments do not fully isolate the effect of constraining updates to a low-rank subspace from the general benefit of reduced trainable parameters. A narrower full-rank critic with matched parameter count would provide a stronger control. We will add this baseline to the revised experiments section, evaluating it on representative tasks from the SAC and FastTD3 suites to better distinguish the structural regularization aspect of LoRA. revision: yes

  2. Referee: [Results] Results and abstract: No quantitative metrics, ablation details, error bars, or statistical significance tests are reported for the claimed improvements on SAC and FastTD3, preventing assessment of effect sizes or robustness of the empirical findings.

    Authors: We acknowledge that the presentation of results can be improved by including more quantitative details. Although our experiments were conducted with multiple random seeds, we will revise the results section and figures to report mean returns with standard deviations, include error bars, expand ablation studies on LoRA rank and related hyperparameters, and add statistical significance tests (such as paired t-tests) where appropriate to quantify the improvements on SAC and FastTD3. revision: yes
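For concreteness, a minimal sketch of the paired test the rebuttal proposes, assuming one final-return value per seed for each variant (function and variable names are illustrative, not from the paper):

```python
import numpy as np
from scipy import stats

def paired_seed_test(returns_lora, returns_dense):
    """Paired t-test over seeds (each seed trains both variants);
    inputs are arrays of final returns, one entry per seed."""
    returns_lora = np.asarray(returns_lora, dtype=float)
    returns_dense = np.asarray(returns_dense, dtype=float)
    t, p = stats.ttest_rel(returns_lora, returns_dense)
    return {"t": float(t), "p": float(p),
            "mean_gain": float(np.mean(returns_lora - returns_dense))}
```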

Circularity Check

0 steps flagged

No circularity: empirical proposal grounded in external task performance

Full rationale

The paper proposes LoRA-based low-rank updates as a structural regularizer for critics in off-policy RL and validates it through experiments on SAC and FastTD3 across tasks. No derivation chain, equations, or first-principles predictions are presented that reduce to fitted parameters or self-referential definitions by construction. The central claim rests on observed reductions in critic loss and policy improvements, which are measured against external benchmarks rather than internal fits or self-citations. The approach is self-contained as an empirical regularization technique.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract introduces no new free parameters, axioms, or invented entities beyond standard RL assumptions and the LoRA technique imported from another domain.

pith-pipeline@v0.9.0 · 5461 in / 934 out tokens · 27916 ms · 2026-05-10T02:51:09.967005+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 8 canonical work pages · 2 internal anchors

  1. [1] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning, 2018.
  2. [2] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. International Conference on Machine Learning, 2018.
  3. [3] Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, et al. Stop regressing: Training value functions via classification for scalable deep RL. arXiv preprint arXiv:2403.03950, 2024.
  4. [4] Hado van Hasselt. Double Q-learning. Advances in Neural Information Processing Systems, 23, 2010.
  5. [5] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning, 2015.
  6. [6] Aditya Bhatt, Daniel Palenicek, Boris Belousov, Max Argus, Artemij Amiranashvili, Thomas Brox, and Jan Peters. CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity. The Twelfth International Conference on Learning Representations, 2024.
  7. [7] Hojoon Lee, Dongyoon Hwang, Donghu Kim, Hyunseung Kim, Jun Jet Tai, Kaushik Subramanian, Peter R. Wurman, Jaegul Choo, Peter Stone, and Takuma Seno. SimBa: Simplicity bias for scaling up parameters in deep reinforcement learning. The Thirteenth International Conference on Learning Representations, 2025.
  8. [8] Hojoon Lee, Youngdo Lee, Takuma Seno, Donghu Kim, Peter Stone, and Jaegul Choo. Hyperspherical normalization for scalable deep reinforcement learning. International Conference on Machine Learning, 2025.
  9. [9] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458. PMLR, 2017.
  10. [10] Daniel Palenicek, Florian Vogt, Joe Watson, and Jan Peters. Scaling off-policy reinforcement learning with batch and weight normalization. arXiv preprint arXiv:2502.07523, 2025.
  11. [11] Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. The Twelfth International Conference on Learning Representations, 2024.
  12. [12] Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, and Marek Cygan. Bigger, regularized, optimistic: Scaling for compute and sample-efficient continuous control. Advances in Neural Information Processing Systems, 2024.
  13. [13] Guozheng Ma, Lu Li, Zilin Wang, Li Shen, Pierre-Luc Bacon, and Dacheng Tao. Network sparsity unlocks the scaling potential of deep reinforcement learning. International Conference on Machine Learning, 2025.
  14. [14] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations, 2022.
  15. [15] Michal Nauman, Marek Cygan, Carmelo Sferrazza, Aviral Kumar, and Pieter Abbeel. Bigger, regularized, categorical: High-capacity value functions are efficient multi-task learners. Advances in Neural Information Processing Systems, 2025.
  16. [16] Younggyo Seo, Carmelo Sferrazza, Haoran Geng, Michal Nauman, Zhao-Heng Yin, and Pieter Abbeel. FastTD3: Simple, fast, and capable reinforcement learning for humanoid control, 2025.
  17. [17] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  18. [18] Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphael Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al. What matters in on-policy reinforcement learning? A large-scale empirical study. arXiv preprint arXiv:2006.05990, 2020.
  19. [19] Yuexin Bian, Jie Feng, Tao Wang, Yijiang Li, Sicun Gao, and Yuanyuan Shi. Rn-d: Discretized categorical actors with regularized networks for on-policy reinforcement learning. arXiv preprint arXiv:2601.23075, 2026.
  20. [20] Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, and Sergey Levine. Implicit under-parameterization inhibits data-efficient deep reinforcement learning. International Conference on Learning Representations, 2021.
  21. [21] Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks. International Conference on Machine Learning, 2023.
  22. [22] Evgenii Nikishin, Max Schwarzer, Pierluca D'Oro, Pierre-Luc Bacon, and Aaron Courville. The primacy bias in deep reinforcement learning. International Conference on Machine Learning, 2022.
  23. [23] Nils Bjorck, Carla P. Gomes, and Kilian Q. Weinberger. Towards deeper deep reinforcement learning with spectral normalization. Advances in Neural Information Processing Systems, 34:8242–8255, 2021.
  24. [24] Anders Krogh and John Hertz. A simple weight decay can improve generalization. Advances in Neural Information Processing Systems, 4, 1991.
  25. [25] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  26. [26] Zihao Fu, Haoran Yang, Anthony Man-Cho So, Wai Lam, Lidong Bing, and Nigel Collier. On the effectiveness of parameter-efficient fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 12799–12807, 2023.
  27. [27] Takuya Hiraoka, Takahisa Imagawa, Taisei Hashimoto, Takashi Onishi, and Yoshimasa Tsuruoka. Dropout Q-functions for doubly efficient reinforcement learning. arXiv preprint arXiv:2110.02034, 2021.
  28. [28] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.
  29. [29] Mayank Mittal, Calvin Yu, Qinxi Yu, Jingzhou Liu, Nikita Rudin, David Hoeller, Jia Lin Yuan, Ritvik Singh, Yunrong Guo, Hammad Mazhar, Ajay Mandlekar, Buck Babich, Gavriel State, Marco Hutter, and Animesh Garg. Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters, 8(6):3740–3747, 2023.