
arXiv: 2602.02236 · v4 · pith:AVKQZCPDnew · submitted 2026-02-02 · 💻 cs.RO · cs.LG · cs.NE · cs.SY · eess.SY

Adaptive Control in Autonomous Driving via Real-Time Recurrent RL

Pith reviewed 2026-05-21 13:42 UTC · model grok-4.3

classification 💻 cs.RO · cs.LG · cs.NE · cs.SY · eess.SY
keywords online reinforcement learning · autonomous driving · event cameras · policy adaptation · state-space models · real-time control · behavioral cloning

The pith

Online recurrent RL fine-tunes pretrained driving policies in real time to handle distribution shifts with event-camera inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies online fine-tuning of control policies for autonomous driving using Real-Time Recurrent Reinforcement Learning, an algorithm that updates policy parameters at every time step without backpropagation through time. It extends this method to LrcSSM, a nonlinear diagonal state-space model, and pairs offline behavioral cloning with online adaptation to respond to changes encountered at deployment. The approach is tested in the CarRacing simulator and on a 1:10-scale physical platform with an event camera performing line-following tasks. LrcSSM-based policies show the fastest and most consistent performance gains in both environments. This constitutes the first reported case of online RL fine-tuning using event-camera observations on standard non-spiking hardware inside closed-loop control.

Core claim

Extending Real-Time Recurrent Reinforcement Learning to LrcSSM models enables effective online adaptation of pretrained autonomous driving policies to distribution shifts. When combined with offline behavioral cloning, the method produces rapid and reliable improvements during both simulated CarRacing runs and real-world line-following on a RoboRacer platform equipped with an event camera, marking the first demonstration of such online RL fine-tuning on standard hardware in closed-loop settings.

What carries the argument

Real-Time Recurrent Reinforcement Learning (RTRRL), a memory-efficient online update rule that adjusts policy parameters at every time step without backpropagation through time, extended to support LrcSSM nonlinear diagonal state-space models.
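
To make the "without backpropagation through time" point concrete, here is a minimal sketch of forward-mode (RTRL-style) sensitivity propagation for a toy diagonal recurrence. It is not the paper's LrcSSM model or its RTRRL implementation: the update rule h_i <- tanh(a_i * h_i + b_i * x_i), the function names `rtrl_diagonal_step` and `rollout`, and the parameters `a` and `b` are assumptions chosen for illustration. Because each unit in a diagonal model depends only on its own state and parameters, one scalar sensitivity per parameter is enough, and it can be carried forward alongside the state.

```python
import math


def rtrl_diagonal_step(h, x, a, b, s_a, s_b):
    """Advance a toy diagonal recurrence one step and update its sensitivities.

    State update (illustrative, not LrcSSM): h_i <- tanh(a_i * h_i + b_i * x_i).
    s_a[i] tracks d h_i / d a_i and s_b[i] tracks d h_i / d b_i.
    """
    new_h, new_s_a, new_s_b = [], [], []
    for h_i, x_i, a_i, b_i, sa_i, sb_i in zip(h, x, a, b, s_a, s_b):
        out = math.tanh(a_i * h_i + b_i * x_i)
        slope = 1.0 - out * out  # derivative of tanh at the pre-activation
        new_h.append(out)
        # Chain rule through z = a*h + b*x, where h itself depends on the
        # parameters through all earlier steps.
        new_s_a.append(slope * (h_i + a_i * sa_i))
        new_s_b.append(slope * (x_i + a_i * sb_i))
    return new_h, new_s_a, new_s_b


def rollout(a, b, inputs):
    """Run the recurrence from a zero state, carrying sensitivities forward."""
    n = len(a)
    h, s_a, s_b = [0.0] * n, [0.0] * n, [0.0] * n
    for x in inputs:
        h, s_a, s_b = rtrl_diagonal_step(h, x, a, b, s_a, s_b)
    return h, s_a, s_b


if __name__ == "__main__":
    inputs = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.8]]
    h, s_a, s_b = rollout([0.9, 0.4], [0.7, -0.5], inputs)
    print(h, s_a, s_b)
```

In this sketch the gradient of a per-step loss with respect to `a[i]` is the loss gradient with respect to `h[i]` multiplied by `s_a[i]`, so it is available at every step and the memory cost does not grow with the length of the episode.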

Load-bearing premise

Online parameter updates performed at every time step will remain stable and safe under real sensor noise and latency without extra safeguards or fallback controllers.

What would settle it

A closed-loop real-world run in which the online-fine-tuned policy loses lane tracking or becomes unstable under normal event-camera noise and latency would falsify the stability premise.

Figures

Figures reproduced from arXiv: 2602.02236 by Daniela Rus, Felix Resch, Julian Lemmel, Mónika Farsang, Radu Grosu, Ramin Hasani.

Figure 1: Overview of our proposed method and experiments. After collecting human control data in the environment, a policy is pretrained using behavioral cloning. The policy is then fine-tuned online using RTRRL. The gradients needed for optimization are computed with RTRL or RFLO for diagonalized or fully connected RNN models, respectively.
Figure 2: The model structure used for our experiments. Core components are the convolutional encoder and the recurrent policy, which are pretrained first using supervised learning and later fine-tuned using reinforcement learning. The convolutional decoder and the recurrent critic are used only during pretraining and fine-tuning, respectively.
Figure 3: RoboRacer car equipped with Sony/Prophesee IMX636 sensor for the real-world deployment of the proposed algorithm. Unlike an RGB optical sensor, the DVS captures changes in pixel intensity and generates a stream of intensity change events, each triggered when the change exceeds a predefined threshold. We use aggregated events to generate frame-based representations for use with conventional (non-spiking) neu…
Figure 4: RGB frame and the corresponding DVS event frame representation. Typically, filtering is applied to each representation to remove noise from the event stream, and some representations also flatten event polarities. Gallego et al. (2022) describe the different representations in more detail and typical algorithms for event data. In the LineTracking experiment, we use the dataset collected by (Resch et al.,…
Figure 5: Boxplots of evaluation reward on three different tracks for five different pretrained models, aggregated per type. Left shows rewards before fine-tuning; right, after. We found that a learning rate around 10^-6 for the actor is best. The critic learning rate appeared to be of less importance, with values in the range of 10^-3 to 10^-5 being acceptable. Entropy regularization did show negligible impact overal…
Figure 7: Shown are trajectories of five laps of fine-tuning a suboptimal policy. Initially, the car goes off the road (red), but it improves each lap, eventually completing the track (blue).
Figure 8: Boxplots of cumulative rewards for the LineTracking experiment of five different pretrained models, aggregated per type. Left shows rewards before fine-tuning; right, after.
Figure 9: Median cumulative rewards per lap for the LineTracking experiment. Shaded regions show the standard deviation. Section 5.2, Real-world deployment: For the LineTracking task, the agent is placed on a pre-determined starting point on a line marked on the floor with clearly distinguishable tape. The goal of this task is to follow the line as closely as possible, while avoiding rapid steering inputs. For evaluation pur…
Figure 11: Mean validation loss during pretraining on the CarRacing dataset. Shown is the mean reward of five seeds per model type with standard deviation shown as shaded regions.
Figure 12: Mean validation loss during pretraining on the LineTracking dataset. Shown is the mean reward of five seeds per model type with standard deviation shown as shaded regions.
Figure 13: Left: Exemplary decoded images predicted by the CNN autoencoder after pretraining on the CarRacing dataset. Right: Actions predicted by the pretrained policy.
Figure 14: Left: Exemplary decoded images predicted by the CNN autoencoder after pretraining on the LineTracking dataset. Right: Actions predicted by the pretrained policy.
Figure 15: Trajectories of policies without fine-tuning. Crosses indicate manual intervention, and circles indicate the resumption by the policy.
Figure 16: Number of interventions per lap of the pre-trained LineTracking models. C.2. Policies During Fine-tuning: (a) CT-RNN. (b) LRC
Figure 17: Trajectories of policies with fine-tuning. Laps that required manual intervention were terminated upon intervention.
Original abstract

We study online fine-tuning of pretrained control policies for autonomous driving using Real-Time Recurrent Reinforcement Learning (RTRRL), a memory-efficient algorithm that updates policy parameters at every time step without backpropagation through time. We extend RTRRL to support LrcSSM, a recently proposed nonlinear diagonal state-space model, and combine offline behavioral cloning with online RTRRL fine-tuning to adapt policies to distribution shifts at deployment. We validate the approach in the CarRacing simulation and on a 1:10-scale RoboRacer platform equipped with an event camera, where a pretrained policy is fine-tuned online during real-world line-following. To our knowledge, this is the first demonstration of online RL fine-tuning with event-camera observations on standard (non-spiking) hardware in closed-loop control. LrcSSM-based policies improve fastest and most consistently across both settings.
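
The Figure 3 caption notes that events are aggregated into frame-based representations for conventional (non-spiking) networks. The following is a minimal sketch of one common way to do that, a per-pixel event count over a time window with one channel per polarity. The source does not state which representation or filtering the authors use, so the `(x, y, t, polarity)` tuple layout, the function names, and the window parameters are all assumptions for illustration.

```python
def events_to_frame(events, width, height, t_start, t_end):
    """Accumulate events in [t_start, t_end) into a 2-channel count frame.

    Each event is a hypothetical (x, y, t, polarity) tuple; channel 0 counts
    negative-polarity events and channel 1 counts positive-polarity events.
    """
    frame = [[[0] * width for _ in range(height)] for _ in range(2)]
    for x, y, t, polarity in events:
        inside = 0 <= x < width and 0 <= y < height
        if inside and t_start <= t < t_end:
            channel = 1 if polarity > 0 else 0
            frame[channel][y][x] += 1
    return frame


def flatten_polarity(frame):
    """Merge both polarity channels into a single per-pixel event count."""
    negative, positive = frame
    return [[n + p for n, p in zip(row_n, row_p)]
            for row_n, row_p in zip(negative, positive)]


if __name__ == "__main__":
    # Made-up events for demonstration; the last one falls outside the window.
    events = [(0, 0, 10, 1), (0, 0, 20, 1), (1, 2, 30, -1), (3, 1, 999, 1)]
    frame = events_to_frame(events, width=4, height=3, t_start=0, t_end=100)
    print(flatten_polarity(frame))
```

Keeping polarities in separate channels preserves the direction of the intensity change, while `flatten_polarity` corresponds to the caption's remark that some representations discard it.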

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents Real-Time Recurrent Reinforcement Learning (RTRRL) for online fine-tuning of pretrained autonomous-driving policies, extending the algorithm to LrcSSM nonlinear diagonal state-space models. It combines offline behavioral cloning with per-timestep online updates to adapt to distribution shifts, and reports validation in the CarRacing simulator plus a closed-loop line-following experiment on a 1:10 RoboRacer platform equipped with an event camera. The central empirical claim is that LrcSSM-based policies improve fastest and most consistently in both domains, together with the assertion that this constitutes the first demonstration of online RL fine-tuning with event-camera observations on standard (non-spiking) hardware.

Significance. If the stability and performance claims are substantiated with quantitative evidence, the work would offer a memory-efficient route to real-time policy adaptation in autonomous driving without BPTT, and the event-camera closed-loop result on commodity hardware would be a practical contribution to robust perception-action loops under sensor sparsity.

major comments (2)
  1. [Abstract / Validation] Abstract and validation sections: the headline claim that LrcSSM policies 'improve fastest and most consistently across both settings' is presented without any quantitative metrics, baselines, statistical tests, success rates, or failure-case analysis, rendering the central empirical result unverifiable from the reported text.
  2. [Real-world experiment] Real-world RoboRacer experiment section: the description of closed-loop fine-tuning treats per-timestep RTRRL + LrcSSM updates as inherently stable under event-camera noise and latency, yet provides no per-trial divergence rates, safety-intervention counts, or fallback-controller behavior when events become sparse or latency spikes occur; this information is load-bearing for the 'most consistently' qualifier.
minor comments (1)
  1. [Method] Notation for LrcSSM and RTRRL could be introduced with a short equation or pseudocode block to clarify the per-step update rule.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the major comments point by point below and commit to revisions that strengthen the quantitative presentation of our results without altering the core claims or methodology.

Point-by-point responses
  1. Referee: [Abstract / Validation] Abstract and validation sections: the headline claim that LrcSSM policies 'improve fastest and most consistently across both settings' is presented without any quantitative metrics, baselines, statistical tests, success rates, or failure-case analysis, rendering the central empirical result unverifiable from the reported text.

    Authors: We agree that the abstract is a high-level summary and that additional quantitative detail would improve verifiability. The full manuscript contains learning-curve figures and baseline comparisons in the validation sections for both simulation and real-world domains. To directly address this point, we will expand the text to report explicit metrics (e.g., mean improvement per update step, success rates across trials), include statistical tests where appropriate, and add a short failure-case discussion. revision: yes

  2. Referee: [Real-world experiment] Real-world RoboRacer experiment section: the description of closed-loop fine-tuning treats per-timestep RTRRL + LrcSSM updates as inherently stable under event-camera noise and latency, yet provides no per-trial divergence rates, safety-intervention counts, or fallback-controller behavior when events become sparse or latency spikes occur; this information is load-bearing for the 'most consistently' qualifier.

    Authors: The current section emphasizes the feasibility of closed-loop event-camera control on commodity hardware. We acknowledge that quantitative stability metrics are not reported in detail. In the revision we will add per-trial statistics, including divergence rates, counts of safety interventions, and a description of any fallback behavior observed when event rates drop or latency increases. revision: yes
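
The per-trial statistics named in this response (divergence rates and safety-intervention counts) could be tabulated along the following lines. This is only a sketch under stated assumptions: the log format, the field names `diverged` and `interventions_per_lap`, and the function `summarize_trials` are hypothetical, and the criterion for calling a run "diverged" is left to whoever produces the logs.

```python
from statistics import mean


def summarize_trials(trials):
    """Summarize hypothetical per-trial logs of closed-loop fine-tuning runs.

    Each trial is a dict with:
      "diverged": bool, whether the run was judged unstable (criterion assumed)
      "interventions_per_lap": list of int, manual interventions in each lap
    """
    if not trials:
        raise ValueError("at least one trial is required")
    diverged = sum(1 for t in trials if t["diverged"])
    per_trial_totals = [sum(t["interventions_per_lap"]) for t in trials]
    all_laps = [n for t in trials for n in t["interventions_per_lap"]]
    return {
        "num_trials": len(trials),
        "divergence_rate": diverged / len(trials),
        "mean_interventions_per_trial": mean(per_trial_totals),
        "mean_interventions_per_lap": mean(all_laps) if all_laps else 0.0,
    }


if __name__ == "__main__":
    # Placeholder values for demonstration only, not results from the paper.
    example = [
        {"diverged": False, "interventions_per_lap": [2, 1, 0]},
        {"diverged": True, "interventions_per_lap": [3]},
    ]
    print(summarize_trials(example))
```

Reporting both the per-trial and the per-lap mean matters because, as the Figure 17 caption indicates, laps that required intervention were terminated, so trials can differ in how many laps they contain.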

Circularity Check

0 steps flagged

Empirical validation of RTRRL+LrcSSM extension contains no derivation chain

full rationale

The paper frames its contribution as an empirical demonstration of online fine-tuning for autonomous driving policies using Real-Time Recurrent Reinforcement Learning extended to LrcSSM models. It reports performance improvements from CarRacing simulation and closed-loop RoboRacer experiments with event-camera input, without presenting equations, first-principles derivations, or predictions that reduce to fitted inputs by construction. The central claims rest on observed experimental outcomes rather than self-referential definitions or load-bearing self-citations that would force the results. Prior work on RTRRL and LrcSSM is referenced as background but does not substitute for the new empirical validation, keeping the overall circularity low.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit equations, so free parameters, axioms, and invented entities cannot be extracted; the central claim rests on the unstated assumption that RTRRL updates remain stable in closed loop.

pith-pipeline@v0.9.0 · 5702 in / 1024 out tokens · 27689 ms · 2026-05-21T13:42:41.860795+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Tag meanings:
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 3 internal anchors

  1. [1] Bellec, G., Scherr, F., Subramoney, A., Hajek, E., Salaj, D., Legenstein, R., and Maass, W. A solution to the learning dilemma for recurrent networks of spiking neurons. Nature Communications, 11(1):3625,

  2. [2] Chen, K., Wei, H., Deng, Z., and Lin, S. Towards fast safe online reinforcement learning via policy finetuning. Transactions on Machine Learning Research,

  3. [3] Farsang, M., Lechner, M., Lung, D., Hasani, R., Rus, D., and Grosu, R. Learning with chemical versus electrical synapses does it make a difference? In 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 15106–15112. IEEE, 2024a.
      Farsang, M., Neubauer, S. A., and Grosu, R. Liquid Resistance Liquid Capacitance Networks. In The First ...

  4. [4] Gallego, G., Delbrück, T., Orchard, G., Bartolozzi, C., Taba, B., Censi, A., Leutenegger, S., Davison, A. J., Conradt, J., Daniilidis, K., and Scaramuzza, D. Event-based vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):154–180,

  5. [5] Gerstner, W., Kistler, W. M., Naud, R., and Paninski, L. Neuronal Dynamics: From Single Neurons to Networks and Models of Cognition. Cambridge University Press, Cambridge,

  6. [6] Korkmaz, E. A survey analyzing generalization in deep reinforcement learning. arXiv preprint arXiv:2401.02349,

  7. [7] Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643,

  8. [8] Liu, C., Liu, Y., Wang, T., Zhuang, Q., Liang, J. C., Yang, W., Xu, R., Wang, Q., Liu, D., and Han, C. On-the-fly VLA adaptation via test-time reinforcement learning. arXiv preprint arXiv:2601.06748,

  9. [9] Murray, J. M. Local online learning in recurrent networks with random feedback. eLife, 8:e43299, May 2019a. ISSN 2050-084X. doi: 10.7554/eLife.43299.
      Murray, J. M. Local online learning in recurrent networks with random feedback. eLife, 8:e43299, May 2019b. ISSN 2050-084X. doi: 10.7554/eLife.43299.
      Orvieto, A., Smith, S. L., Gu, A., Fernando...

  10. [10] Voogd, K. L., Allamaa, J. P., Alonso-Mora, J., and Son, T. D. Reinforcement learning from simulation to real world autonomous driving using digital twin. IFAC-PapersOnLine, 56(2):1510–1515,

  11. [11] Walsh, C. and Karaman, S. CDDT: Fast approximate 2D ray casting for accelerated localization. abs/1705.01167,

  12. [12] Xiao, Z. and Snoek, C. G. Beyond model adaptation at test time: A survey. arXiv preprint arXiv:2411.03687,

  13. [13] Zheng, X., Liu, Y., Lu, Y., Hua, T., Pan, T., Zhang, W., Tao, D., and Wang, L. Deep learning for event-based vision: A comprehensive survey and benchmarks. arXiv preprint arXiv:2302.08890,

  14. [14] Zhu, R., Sun, E., Huang, G., and Celiktutan, O. Efficient continual adaptation of pretrained robotic policy with online meta-learned adapters. arXiv preprint arXiv:2503.18684,

  15. [15] Additional Pre-training Results: We show the validation loss curves from pre-training on the CarRacing and LineTracking dataset in Fig

    Require: Linear actor policy π_θA(a|h), linear critic value function v̂_θC(h), and recurrent layer RNN_θR([o, a, r], h, Ĵ)
    θ_A, θ_C, θ_R ← initialize network parameters
    B_A, B_C ← initialize feedback matrices
    h, e_A, e_C, e_R ← 0
    o ← reset environment
    h, Ĵ ← RNN_θR([o, 0, 0], h, 0)
    v ← v̂_θC(h)
    while not done do
        π ← π_θA(h)
        a ← sample(π)
        o, r ← take action a
        ...
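
The algorithm listing above initializes eligibility traces (e_A, e_C, e_R) and a linear critic, but its update steps are truncated in this extract. As a rough illustration of what a per-step, trace-based update looks like, here is a sketch of textbook TD(λ) for a linear value function. It is not a reconstruction of the missing steps: the function `td_lambda_critic_step`, the helper `dot`, the accumulating-trace form, and the default hyperparameters are all assumptions, and the feedback matrices and recurrent-layer gradients from the listing are not modeled.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(u_i * v_i for u_i, v_i in zip(u, v))


def td_lambda_critic_step(w, e, h, h_next, r,
                          gamma=0.99, lam=0.9, lr=0.1, done=False):
    """One online TD(lambda) update of a linear critic v(h) = w . h.

    w: critic weights, e: eligibility trace, h / h_next: hidden-state features
    before and after the transition, r: reward received on the transition.
    """
    v = dot(w, h)
    v_next = 0.0 if done else dot(w, h_next)
    delta = r + gamma * v_next - v  # temporal-difference error
    # Accumulating trace: decay the old trace, then add grad_w v(h) = h.
    e = [gamma * lam * e_i + h_i for e_i, h_i in zip(e, h)]
    w = [w_i + lr * delta * e_i for w_i, e_i in zip(w, e)]
    return w, e, delta


if __name__ == "__main__":
    w, e = [0.0, 0.0], [0.0, 0.0]
    w, e, delta = td_lambda_critic_step(
        w, e, h=[1.0, 0.0], h_next=[0.0, 1.0], r=1.0, gamma=0.5, lam=0.5)
    print(w, e, delta)
```

Each call uses only the current transition and the running trace, which is what allows parameters to change at every time step without storing or replaying the trajectory.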