pith · machine review for the scientific record

arxiv: 2604.18978 · v2 · submitted 2026-04-21 · 💻 cs.LG · cs.AI


Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning


Pith reviewed 2026-05-10 02:51 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords low-rank adaptation · critic learning · off-policy reinforcement learning · structural regularization · overfitting mitigation · SAC · actor-critic methods

The pith

Freezing base critic weights and training only low-rank adapters reduces overfitting and boosts performance in off-policy RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explores using Low-Rank Adaptation to scale up critic networks in off-policy reinforcement learning without the usual overfitting and instability. It freezes the randomly initialized base matrices in the critic and optimizes only the low-rank adapter parameters, restricting the critic's updates to a low-dimensional subspace. This acts as a form of structural regularization during replay-based training. If effective, it allows larger critics to be used more reliably, leading to better policy learning in algorithms like SAC and FastTD3. The approach is tested across various tasks and shows reduced critic loss along with competitive or superior policy results.
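To make the mechanism concrete, here is a minimal PyTorch sketch of such a layer, assuming the standard LoRA conventions of [14] (zero-initialized B so training starts at the frozen base, update scaled by alpha/rank); the paper's exact initialization and scaling of the frozen base may differ:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Linear layer with a frozen random base W0 and a trainable
    low-rank adapter BA (a sketch, not the paper's exact module)."""

    def __init__(self, d_in, d_out, rank=16, alpha=1.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)  # freeze the random base W0
        self.A = nn.Parameter(torch.randn(rank, d_in) / d_in**0.5)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # BA = 0 at init
        self.scale = alpha / rank

    def forward(self, x):
        # Effective weight is W0 + (alpha/rank) * B @ A, but only A and B
        # receive gradients, so updates live in a rank-r subspace.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

Replacing each dense linear in the critic with such a layer leaves the optimizer touching only the adapter parameters.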

Core claim

Our approach freezes randomly initialized base matrices and optimizes only the corresponding low-rank adapters in the critic, thereby constraining critic updates to a low-dimensional subspace. This provides a simple structural regularizer that efficiently reduces critic loss and improves policy performance in off-policy RL settings.

What carries the argument

Low-Rank Adaptation (LoRA) on critic networks, freezing base matrices and training only low-rank adapters to constrain updates to a low-dimensional subspace.
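Written out in standard LoRA notation (assumed here; the paper may scale the update differently), the adapted weight of each critic layer is

$$
W = W_0 + \Delta W = W_0 + BA,
\qquad
W_0 \in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{in}}} \ \text{frozen},\;
B \in \mathbb{R}^{d_{\mathrm{out}} \times r},\;
A \in \mathbb{R}^{r \times d_{\mathrm{in}}},\;
r \ll \min(d_{\mathrm{in}}, d_{\mathrm{out}}),
$$

so every gradient step moves only $\Delta W = BA$, a matrix of rank at most $r$, while $W_0$ supplies fixed full-rank structure.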

If this is right

  • Critic loss decreases more efficiently throughout training.
  • Policy performance improves or matches the best results on most tasks.
  • The method applies across SAC, FastTD3, and different network architectures.
  • Structural regularization emerges without extra hyperparameters or loss terms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The subspace constraint may support longer training runs before instability sets in.
  • Low-rank critics could scale to higher capacities in environments with scarce or noisy data.
  • Similar freezing of base weights might regularize other components such as actors in the same framework.

Load-bearing premise

Freezing randomly initialized base matrices and optimizing only low-rank adapters sufficiently preserves the critic's expressive power while preventing overfitting in replay-based training.
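The premise leans on the gap between trainable and effective parameters. With assumed dimensions (illustrative, not figures from the paper), a square layer of width $d$ with LoRA rank $r$ trains

$$
2dr \ \text{parameters instead of}\ d^2,
\qquad
d = 1024,\ r = 16:\quad 2dr = 32{,}768 \approx 3.1\%\ \text{of}\ d^2 = 1{,}048{,}576,
$$

yet the effective weight $W_0 + BA$ stays full-sized, so representational width is preserved and only the update directions are constrained.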

What would settle it

A direct comparison on tasks with minimal overfitting risk where LoRA critics achieve noticeably lower returns than full-parameter critics, showing the low-rank constraint has removed necessary capacity.

Figures

Figures reproduced from arXiv: 2604.18978 by Fei Miao, Jie Feng, Jonathan Petit, Qing Su, Shihao Ji, Sihong He, Songyang Han, Yuanyuan Shi, Yuan Zhuang, Yuexin Bian.

Figure 1
Figure 1: A motivating example. Left: Bellman residual (RMS) over training. Right: final true-Q error on d^π vs. LoRA rank with 5 seeds; dashed line is the dense baseline.
Figure 2
Figure 2: Overview of the proposed LoRA-based critic architecture. For each linear layer in the critic residual block, we freeze the randomly initialized dense base matrix W0 and train only the low-rank adapters A and B. Panel (a) shows the LoRA method applied to the SimbaV2 architecture; panel (b) illustrates the LoRA method for the BRC with BroNet structure.
Figure 3
Figure 3: Performance comparison across RL algorithms and network architectures. We evaluate three settings: SAC+SimbaV2 on DMC-Hard tasks, SAC+BRC on DMC-Hard tasks, and FastTD3+SimbaV2⋆ on IsaacLab robotics tasks. For each setting, the leftmost column reports the average normalized return and critic loss across all tasks, while the remaining columns show representative individual tasks. LoRA consistently achieves …
Figure 4
Figure 4: Comparison of pure LoRA training and hybrid schemes that perform dense updates.
Figure 5
Figure 5: Ablation study of the LoRA critic design. We evaluate four key design choices on Dog-Trot (top) and Humanoid-Run (bottom): (a) the rank of the frozen base matrix W0, (b) the initialization norm ∥w0∥2, (c) the effect of removing hyperspherical weight normalization or the frozen base matrix, and (d) the trainable-parameter trade-off obtained by varying the LoRA rank and comparing against static sparsity. …
Figure 6
Figure 6: LoRA-compatible hyperspherical projection. Top: standard SimbaV2 projects each updated weight vector W onto the unit hypersphere to obtain W′. Bottom: in the LoRA parameterization, the base vector W0 is frozen, so directly normalizing W0 + ∆W would also rescale the frozen base. Instead, for the SAC+SimbaV2 setting, we solve for a scalar sj > 0 and absorb it into the LoRA update so that W0 + sj∆W lies on …
(One way to realize this per-row rescaling is sketched after the figure list.)
Figure 7
Figure 7: Per-environment learning curves.
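Reading the Figure 6 caption literally, the scalar sj can be obtained row-wise by solving a quadratic. The NumPy sketch below is one possible realization, assuming each frozen base row satisfies ∥w0∥2 < 1 so a positive root exists; the paper's exact construction may differ:

```python
import numpy as np

def lora_hypersphere_scale(w0, dw):
    """Per-row scale s > 0 such that ||w0 + s*dw||_2 = 1.

    One reading of Figure 6: expand ||w0 + s*dw||^2 = 1 into
    a*s^2 + b*s + c = 0 and keep the positive root. Assumes
    ||w0|| < 1 (so c < 0) and a nonzero dw in each row.
    """
    a = np.sum(dw * dw, axis=-1)         # ||dw||^2
    b = 2.0 * np.sum(w0 * dw, axis=-1)   # 2 <w0, dw>
    c = np.sum(w0 * w0, axis=-1) - 1.0   # ||w0||^2 - 1 < 0
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

# w0 + s[:, None] * dw then lies on the unit hypersphere while the
# frozen base w0 itself is never modified, matching the caption's intent.
```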
Original abstract

Scaling critic capacity is a promising direction for improving off-policy reinforcement learning (RL). However, recent work shows that larger critics are prone to overfitting and instability in replay-based bootstrapped training. In this paper, we propose using Low-Rank Adaptation (LoRA) as a structural regularizer for critic learning. Our approach freezes randomly initialized base matrices and optimizes only the corresponding low-rank adapters, thereby constraining critic updates to a low-dimensional subspace. We evaluate our method across different off-policy RL algorithms, including SAC and FastTD3 based on different network architectures. Empirically, LoRA efficiently reduces critic loss during training and improves overall policy performance, achieving the best or competitive results on most tasks. Extensive experiments demonstrate that our low-rank updates provide a simple and effective form of structural regularization for critic learning in off-policy RL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Low-Rank Adaptation (LoRA) as a structural regularizer for critic networks in off-policy RL algorithms such as SAC and FastTD3. Randomly initialized base matrices are frozen while only low-rank adapters are optimized, constraining critic updates to a low-dimensional subspace during replay-based bootstrapped training. The authors claim this reduces critic loss, mitigates overfitting from large critics, and yields best or competitive policy performance across tasks, supported by extensive experiments on different network architectures.

Significance. If the central claim holds after the experimental controls below are addressed, the work offers a simple, hyperparameter-light way to regularize critic capacity in off-policy RL by importing a parameter-efficient adaptation technique. This could help scale critics without instability, building on known benefits of LoRA in other domains. The approach is notable for its straightforward integration and its focus on a practical failure mode of replay-based training.

major comments (2)
  1. [Experiments] Experiments section: The claim that low-rank subspace constraints provide structural regularization (distinct from capacity reduction) is not isolated, as no control baseline matches the number of trainable parameters using a full-rank but narrower architecture. Without this, performance gains could be explained by reduced capacity alone, which is already known to mitigate overfitting in off-policy settings. (A width-matching sketch for such a control follows the comment list.)
  2. [Results] Results and abstract: No quantitative metrics, ablation details, error bars, or statistical significance tests are reported for the claimed improvements on SAC and FastTD3, preventing assessment of effect sizes or robustness of the empirical findings.
minor comments (1)
  1. [Method] Method section: The description of how base matrices are initialized and frozen could include an explicit equation for the adapted critic output to clarify the low-rank update form.
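As a concrete aid for the parameter-matched control in major comment 1, the following sketch (a hypothetical helper, not from the paper) sizes a narrow dense MLP so its trainable count matches a given LoRA budget:

```python
import math

def matched_width(d_in, d_out, depth, lora_budget):
    """Pick hidden width h for a dense MLP with `depth` hidden layers so
    that d_in*h + (depth-1)*h^2 + h*d_out matches the LoRA adapters'
    trainable budget; solves the quadratic in h."""
    a = depth - 1
    b = d_in + d_out
    if a == 0:
        return max(1, round(lora_budget / b))
    h = (-b + math.sqrt(b * b + 4 * a * lora_budget)) / (2 * a)
    return max(1, round(h))

# Example: rank-16 adapters on four 1024x1024 layers train
# 4 * 2 * 1024 * 16 = 131072 parameters; matched_width(1024, 1024, 4, 131072)
# returns 59, i.e. a 59-unit-wide dense critic of the same depth.
```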

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline the revisions we intend to make to strengthen the paper.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The claim that low-rank subspace constraints provide structural regularization (distinct from capacity reduction) is not isolated, as no control baseline matches the number of trainable parameters using a full-rank but narrower architecture. Without this, performance gains could be explained by reduced capacity alone, which is already known to mitigate overfitting in off-policy settings.

    Authors: We agree that the current experiments do not fully isolate the effect of constraining updates to a low-rank subspace from the general benefit of reduced trainable parameters. A narrower full-rank critic with matched parameter count would provide a stronger control. We will add this baseline to the revised experiments section, evaluating it on representative tasks from the SAC and FastTD3 suites to better distinguish the structural regularization aspect of LoRA. revision: yes

  2. Referee: [Results] Results and abstract: No quantitative metrics, ablation details, error bars, or statistical significance tests are reported for the claimed improvements on SAC and FastTD3, preventing assessment of effect sizes or robustness of the empirical findings.

    Authors: We acknowledge that the presentation of results can be improved by including more quantitative details. Although our experiments were conducted with multiple random seeds, we will revise the results section and figures to report mean returns with standard deviations, include error bars, expand ablation studies on LoRA rank and related hyperparameters, and add statistical significance tests (such as paired t-tests) where appropriate to quantify the improvements on SAC and FastTD3. revision: yes
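For concreteness, a minimal sketch of the paired test the rebuttal proposes, assuming one final-return value per seed for each variant (function and variable names are illustrative, not from the paper):

```python
import numpy as np
from scipy import stats

def paired_seed_test(returns_lora, returns_dense):
    """Paired t-test over seeds (each seed trains both variants);
    inputs are arrays of final returns, one entry per seed."""
    returns_lora = np.asarray(returns_lora, dtype=float)
    returns_dense = np.asarray(returns_dense, dtype=float)
    t, p = stats.ttest_rel(returns_lora, returns_dense)
    return {"t": float(t), "p": float(p),
            "mean_gain": float(np.mean(returns_lora - returns_dense))}
```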

Circularity Check

0 steps flagged

No circularity: empirical proposal grounded in external task performance

Full rationale

The paper proposes LoRA-based low-rank updates as a structural regularizer for critics in off-policy RL and validates it through experiments on SAC and FastTD3 across tasks. No derivation chain, equations, or first-principles predictions are presented that reduce to fitted parameters or self-referential definitions by construction. The central claim rests on observed reductions in critic loss and policy improvements, which are measured against external benchmarks rather than internal fits or self-citations. The approach is self-contained as an empirical regularization technique.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract introduces no new free parameters, axioms, or invented entities beyond standard RL assumptions and the LoRA technique imported from another domain.

pith-pipeline@v0.9.0 · 5461 in / 934 out tokens · 27916 ms · 2026-05-10T02:51:09.967005+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 8 canonical work pages · 2 internal anchors

  1. [1] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning, 2018.
  2. [2] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. International Conference on Machine Learning, 2018.
  3. [3] Jesse Farebrother, Jordi Orbay, Quan Vuong, Adrien Ali Taïga, Yevgen Chebotar, Ted Xiao, Alex Irpan, Sergey Levine, Pablo Samuel Castro, Aleksandra Faust, et al. Stop regressing: Training value functions via classification for scalable deep RL. arXiv preprint arXiv:2403.03950, 2024.
  4. [4] Hado van Hasselt. Double Q-learning. Advances in Neural Information Processing Systems, 23, 2010.
  5. [5] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning, 2015.
  6. [6] Aditya Bhatt, Daniel Palenicek, Boris Belousov, Max Argus, Artemij Amiranashvili, Thomas Brox, and Jan Peters. CrossQ: Batch normalization in deep reinforcement learning for greater sample efficiency and simplicity. The Twelfth International Conference on Learning Representations, 2024.
  7. [7] Hojoon Lee, Dongyoon Hwang, Donghu Kim, Hyunseung Kim, Jun Jet Tai, Kaushik Subramanian, Peter R. Wurman, Jaegul Choo, Peter Stone, and Takuma Seno. SimBa: Simplicity bias for scaling up parameters in deep reinforcement learning. The Thirteenth International Conference on Learning Representations, 2025.
  8. [8] Hojoon Lee, Youngdo Lee, Takuma Seno, Donghu Kim, Peter Stone, and Jaegul Choo. Hyperspherical normalization for scalable deep reinforcement learning. International Conference on Machine Learning, 2025.
  9. [9] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458. PMLR, 2017.
  10. [10] Daniel Palenicek, Florian Vogt, Joe Watson, and Jan Peters. Scaling off-policy reinforcement learning with batch and weight normalization. arXiv preprint arXiv:2502.07523, 2025.
  11. [11] Nicklas Hansen, Hao Su, and Xiaolong Wang. TD-MPC2: Scalable, robust world models for continuous control. The Twelfth International Conference on Learning Representations, 2024.
  12. [12] Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, and Marek Cygan. Bigger, regularized, optimistic: Scaling for compute and sample-efficient continuous control. Advances in Neural Information Processing Systems, 2024.
  13. [13] Guozheng Ma, Lu Li, Zilin Wang, Li Shen, Pierre-Luc Bacon, and Dacheng Tao. Network sparsity unlocks the scaling potential of deep reinforcement learning. International Conference on Machine Learning, 2025.
  14. [14] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations, 2022.
  15. [15] Michal Nauman, Marek Cygan, Carmelo Sferrazza, Aviral Kumar, and Pieter Abbeel. Bigger, regularized, categorical: High-capacity value functions are efficient multi-task learners. Advances in Neural Information Processing Systems, 2025.
  16. [16] Younggyo Seo, Carmelo Sferrazza, Haoran Geng, Michal Nauman, Zhao-Heng Yin, and Pieter Abbeel. FastTD3: Simple, fast, and capable reinforcement learning for humanoid control, 2025.
  17. [17] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  18. [18] Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphael Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al. What matters in on-policy reinforcement learning? A large-scale empirical study. arXiv preprint arXiv:2006.05990, 2020.
  19. [19] Yuexin Bian, Jie Feng, Tao Wang, Yijiang Li, Sicun Gao, and Yuanyuan Shi. Rn-d: Discretized categorical actors with regularized networks for on-policy reinforcement learning. arXiv preprint arXiv:2601.23075, 2026.
  20. [20] Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, and Sergey Levine. Implicit under-parameterization inhibits data-efficient deep reinforcement learning. International Conference on Learning Representations, 2021.
  21. [21] Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney. Understanding plasticity in neural networks. International Conference on Machine Learning, 2023.
  22. [22] Evgenii Nikishin, Max Schwarzer, Pierluca D'Oro, Pierre-Luc Bacon, and Aaron Courville. The primacy bias in deep reinforcement learning. International Conference on Machine Learning, 2022.
  23. [23] Nils Bjorck, Carla P. Gomes, and Kilian Q. Weinberger. Towards deeper deep reinforcement learning with spectral normalization. Advances in Neural Information Processing Systems, 34:8242–8255, 2021.
  24. [24] Anders Krogh and John Hertz. A simple weight decay can improve generalization. Advances in Neural Information Processing Systems, 4, 1991.
  25. [25] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  26. [26] Zihao Fu, Haoran Yang, Anthony Man-Cho So, Wai Lam, Lidong Bing, and Nigel Collier. On the effectiveness of parameter-efficient fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 12799–12807, 2023.
  27. [27] Takuya Hiraoka, Takahisa Imagawa, Taisei Hashimoto, Takashi Onishi, and Yoshimasa Tsuruoka. Dropout Q-functions for doubly efficient reinforcement learning. arXiv preprint arXiv:2110.02034, 2021.
  28. [28] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind Control Suite. arXiv preprint arXiv:1801.00690, 2018.
  29. [29] Mayank Mittal, Calvin Yu, Qinxi Yu, Jingzhou Liu, Nikita Rudin, David Hoeller, Jia Lin Yuan, Ritvik Singh, Yunrong Guo, Hammad Mazhar, Ajay Mandlekar, Buck Babich, Gavriel State, Marco Hutter, and Animesh Garg. Orbit: A unified simulation framework for interactive robot learning environments. IEEE Robotics and Automation Letters, 8(6):3740–3747, 2023.