Adaptive Control in Autonomous Driving via Real-Time Recurrent RL
Pith reviewed 2026-05-21 13:42 UTC · model grok-4.3
The pith
Online recurrent RL fine-tunes pretrained driving policies in real time to handle distribution shifts with event-camera inputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Extending Real-Time Recurrent Reinforcement Learning to LrcSSM models enables effective online adaptation of pretrained autonomous driving policies to distribution shifts. When combined with offline behavioral cloning, the method produces rapid and reliable improvements during both simulated CarRacing runs and real-world line-following on a RoboRacer platform equipped with an event camera, marking the first demonstration of such online RL fine-tuning on standard hardware in closed-loop settings.
What carries the argument
Real-Time Recurrent Reinforcement Learning (RTRRL), a memory-efficient online update rule that adjusts policy parameters at every time step without backpropagation through time, extended to support LrcSSM nonlinear diagonal state-space models.
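To illustrate why a diagonal state-space layer suits updates without backpropagation through time, here is a minimal sketch of forward-mode (RTRL-style) gradient traces. It assumes a toy linear diagonal recurrence h_t = a * h_{t-1} + b * x_t, not the nonlinear LrcSSM itself; the function name `run_with_traces` and the parameters `a` and `b` are hypothetical. Because each unit depends only on its own previous state, the sensitivity of the state to each parameter can be carried forward with one extra number per parameter, so no history of past states needs to be stored.

```python
def run_with_traces(a, b, xs):
    """Run a toy diagonal recurrence and carry dh/da, dh/db forward in time.

    Assumed model (illustrative only): h_t[i] = a[i] * h_{t-1}[i] + b[i] * x_t
    """
    n = len(a)
    h = [0.0] * n
    dh_da = [0.0] * n  # sensitivity of h[i] to a[i]
    dh_db = [0.0] * n  # sensitivity of h[i] to b[i]
    for x in xs:
        for i in range(n):
            # Traces use the previous state, so update them before h[i].
            dh_da[i] = h[i] + a[i] * dh_da[i]
            dh_db[i] = x + a[i] * dh_db[i]
            h[i] = a[i] * h[i] + b[i] * x
    return h, dh_da, dh_db


if __name__ == "__main__":
    state, grad_a, grad_b = run_with_traces([0.9, 0.5], [0.3, -0.2], [1.0, 0.5, -1.0, 2.0])
    print(state, grad_a, grad_b)
```

The memory cost of the traces is proportional to the number of parameters and independent of sequence length, which is the property that makes per-step online updates practical.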
Load-bearing premise
Online parameter updates performed at every time step will remain stable and safe under real sensor noise and latency without extra safeguards or fallback controllers.
What would settle it
A closed-loop real-world run in which the online-fine-tuned policy loses lane tracking or becomes unstable under normal event-camera noise and latency would falsify the stability premise.
Original abstract
We study online fine-tuning of pretrained control policies for autonomous driving using Real-Time Recurrent Reinforcement Learning (RTRRL), a memory-efficient algorithm that updates policy parameters at every time step without backpropagation through time. We extend RTRRL to support LrcSSM, a recently proposed nonlinear diagonal state-space model, and combine offline behavioral cloning with online RTRRL fine-tuning to adapt policies to distribution shifts at deployment. We validate the approach in the CarRacing simulation and on a 1:10-scale RoboRacer platform equipped with an event camera, where a pretrained policy is fine-tuned online during real-world line-following. To our knowledge, this is the first demonstration of online RL fine-tuning with event-camera observations on standard (non-spiking) hardware in closed-loop control. LrcSSM-based policies improve fastest and most consistently across both settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Real-Time Recurrent Reinforcement Learning (RTRRL) for online fine-tuning of pretrained autonomous-driving policies, extending the algorithm to LrcSSM nonlinear diagonal state-space models. It combines offline behavioral cloning with per-timestep online updates to adapt to distribution shifts, and reports validation in the CarRacing simulator plus a closed-loop line-following experiment on a 1:10 RoboRacer platform equipped with an event camera. The central empirical claim is that LrcSSM-based policies improve fastest and most consistently in both domains, together with the assertion that this constitutes the first demonstration of online RL fine-tuning with event-camera observations on standard (non-spiking) hardware.
Significance. If the stability and performance claims are substantiated with quantitative evidence, the work would offer a memory-efficient route to real-time policy adaptation in autonomous driving without BPTT, and the event-camera closed-loop result on commodity hardware would be a practical contribution to robust perception-action loops under sensor sparsity.
major comments (2)
- [Abstract / Validation] Abstract and validation sections: the headline claim that LrcSSM policies 'improve fastest and most consistently across both settings' is presented without any quantitative metrics, baselines, statistical tests, success rates, or failure-case analysis, rendering the central empirical result unverifiable from the reported text.
- [Real-world experiment] Real-world RoboRacer experiment section: the description of closed-loop fine-tuning treats per-timestep RTRRL + LrcSSM updates as inherently stable under event-camera noise and latency, yet provides no per-trial divergence rates, safety-intervention counts, or fallback-controller behavior when events become sparse or latency spikes occur; this information is load-bearing for the 'most consistently' qualifier.
minor comments (1)
- [Method] Notation for LrcSSM and RTRRL could be introduced with a short equation or pseudocode block to clarify the per-step update rule.
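One form such a block could take is sketched here, assuming the per-step rule is a standard actor-critic TD(λ) update with eligibility traces. The discount γ, trace decay λ, and step sizes α_A and α_C are symbols introduced for illustration, and the update of the recurrent parameters (via RTRL or RFLO) is left out. This is not taken from the manuscript.

```latex
\begin{aligned}
\delta_t &= r_t + \gamma\,\hat v_{\theta_C}(h_{t+1}) - \hat v_{\theta_C}(h_t) \\
e_C &\leftarrow \gamma\lambda\, e_C + \nabla_{\theta_C}\hat v_{\theta_C}(h_t),
  & \theta_C &\leftarrow \theta_C + \alpha_C\,\delta_t\, e_C \\
e_A &\leftarrow \gamma\lambda\, e_A + \nabla_{\theta_A}\log\pi_{\theta_A}(a_t \mid h_t),
  & \theta_A &\leftarrow \theta_A + \alpha_A\,\delta_t\, e_A
\end{aligned}
```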
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address the major comments point by point below and commit to revisions that strengthen the quantitative presentation of our results without altering the core claims or methodology.
Point-by-point responses
- Referee: [Abstract / Validation] Abstract and validation sections: the headline claim that LrcSSM policies 'improve fastest and most consistently across both settings' is presented without any quantitative metrics, baselines, statistical tests, success rates, or failure-case analysis, rendering the central empirical result unverifiable from the reported text.
  Authors: We agree that the abstract is a high-level summary and that additional quantitative detail would improve verifiability. The full manuscript contains learning-curve figures and baseline comparisons in the validation sections for both simulation and real-world domains. To directly address this point, we will expand the text to report explicit metrics (e.g., mean improvement per update step, success rates across trials), include statistical tests where appropriate, and add a short failure-case discussion.
  Revision: yes
- Referee: [Real-world experiment] Real-world RoboRacer experiment section: the description of closed-loop fine-tuning treats per-timestep RTRRL + LrcSSM updates as inherently stable under event-camera noise and latency, yet provides no per-trial divergence rates, safety-intervention counts, or fallback-controller behavior when events become sparse or latency spikes occur; this information is load-bearing for the 'most consistently' qualifier.
  Authors: The current section emphasizes the feasibility of closed-loop event-camera control on commodity hardware. We acknowledge that quantitative stability metrics are not reported in detail. In the revision we will add per-trial statistics, including divergence rates, counts of safety interventions, and a description of any fallback behavior observed when event rates drop or latency increases.
  Revision: yes
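The quantities promised in the two responses above could be computed along the following lines. This is a minimal sketch, not the authors' evaluation code: the `Trial` record, its field names, and the choice of an exact sign-flip permutation test are assumptions made for illustration, and the numbers in the usage example are placeholders rather than results from the paper.

```python
from dataclasses import dataclass
from itertools import product
from statistics import mean


@dataclass
class Trial:
    """Hypothetical per-trial log for one online fine-tuning run."""
    returns: list       # episode return at each evaluation checkpoint
    success: bool       # e.g. completed the lap or kept the line
    diverged: bool      # policy became unstable during fine-tuning
    interventions: int  # number of safety interventions
    steps: int          # number of online update steps


def summarize(trials):
    """Aggregate performance and stability statistics across trials."""
    n = len(trials)
    total_steps = sum(t.steps for t in trials)
    gains = [(t.returns[-1] - t.returns[0]) / t.steps for t in trials]
    return {
        "mean_improvement_per_step": mean(gains),
        "success_rate": sum(t.success for t in trials) / n,
        "divergence_rate": sum(t.diverged for t in trials) / n,
        "interventions_per_1k_steps": 1000 * sum(t.interventions for t in trials) / total_steps,
    }


def sign_flip_p_value(paired_diffs):
    """Exact two-sided sign-flip permutation test on paired differences."""
    observed = abs(mean(paired_diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(paired_diffs)):
        total += 1
        stat = abs(mean([s * d for s, d in zip(signs, paired_diffs)]))
        if stat >= observed - 1e-12:
            hits += 1
    return hits / total


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    demo = [Trial([10.0, 20.0], True, False, 0, 1000),
            Trial([10.0, 5.0], False, True, 3, 500)]
    print(summarize(demo))
```

The sign-flip test enumerates every sign pattern, so it is only practical for a small number of paired trials; it is used here because it makes no distributional assumption about the per-trial differences between two policy types.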
Circularity Check
Empirical validation of RTRRL+LrcSSM extension contains no derivation chain
The paper frames its contribution as an empirical demonstration of online fine-tuning for autonomous driving policies using Real-Time Recurrent Reinforcement Learning extended to LrcSSM models. It reports performance improvements from CarRacing simulation and closed-loop RoboRacer experiments with event-camera input, without presenting equations, first-principles derivations, or predictions that reduce to fitted inputs by construction. The central claims rest on observed experimental outcomes rather than self-referential definitions or load-bearing self-citations that would force the results. Prior work on RTRRL and LrcSSM is referenced as background but does not substitute for the new empirical validation, keeping the overall circularity low.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Real-Time Recurrent Reinforcement Learning (RTRRL) ... performs parameter updates at every time-step ... using RTRL or RFLO"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Require: linear actor policy π_θA(a|h), linear critic value function v̂_θC(h), and recurrent layer RNN_θR([o, a, r], h, Ĵ)
θ_A, θ_C, θ_R ← initialize network parameters
B_A, B_C ← initialize feedback matrices
h, e_A, e_C, e_R ← 0
o ← reset environment
h, Ĵ ← RNN_θR([o, 0, 0], h, 0)
v ← v̂_θC(h)
while not done do
    π ← π_θA(h)
    a ← sample(π)
    o, r ← take action a
    ...
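The visible part of this listing can be turned into a small runnable sketch. The version below is a simplification under stated assumptions, not the paper's algorithm: the recurrent layer is replaced by a fixed leaky integrator with no trainable parameters (so θ_R, e_R, Ĵ, and the feedback matrices B_A, B_C are omitted), the truncated remainder of the loop is filled with a standard TD(λ) actor-critic update, and the environment is a toy in which action 1 always yields reward 1. The function names and all hyperparameters are hypothetical.

```python
import math
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def train(steps=2000, gamma=0.9, lam=0.5, lr_actor=0.05, lr_critic=0.1, seed=0):
    """Online actor-critic with eligibility traces on a toy two-action task."""
    rng = random.Random(seed)
    n_act, n_feat = 2, 2
    theta_a = [[0.0] * n_feat for _ in range(n_act)]  # linear actor
    theta_c = [0.0] * n_feat                          # linear critic
    e_a = [[0.0] * n_feat for _ in range(n_act)]      # actor trace
    e_c = [0.0] * n_feat                              # critic trace
    obs = 1.0                                         # toy constant observation
    h = 0.5 * obs                                     # fixed leaky state (assumed)
    phi = [h, 1.0]                                    # features: state plus bias
    v = dot(theta_c, phi)
    for _ in range(steps):
        pi = softmax([dot(w, phi) for w in theta_a])
        a = 0 if rng.random() < pi[0] else 1
        r = 1.0 if a == 1 else 0.0                    # toy reward (assumed)
        h = 0.5 * h + 0.5 * obs
        phi_next = [h, 1.0]
        delta = r + gamma * dot(theta_c, phi_next) - v  # TD error
        for j in range(n_feat):
            e_c[j] = gamma * lam * e_c[j] + phi[j]
        for k in range(n_act):
            g = (1.0 if k == a else 0.0) - pi[k]      # grad of log softmax
            for j in range(n_feat):
                e_a[k][j] = gamma * lam * e_a[k][j] + g * phi[j]
        for j in range(n_feat):
            theta_c[j] += lr_critic * delta * e_c[j]
        for k in range(n_act):
            for j in range(n_feat):
                theta_a[k][j] += lr_actor * delta * e_a[k][j]
        phi = phi_next
        v = dot(theta_c, phi)
    return softmax([dot(w, phi) for w in theta_a])


if __name__ == "__main__":
    print(train())  # action probabilities after online training
```

Every parameter update here uses only the current transition and the running traces, which mirrors the per-time-step structure of the listing while leaving out the recurrent gradient computation.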