Continual Robot Policy Learning via Variational Neural Dynamics

Davide Scaramuzza; Ismail Geles; Jiaxu Xing; Rudolf Reiter; Yifan Zhai; Yunfan Ren; Zhiyuan Zhu

arxiv: 2606.27353 · v1 · pith:LEBKU776new · submitted 2026-06-25 · 💻 cs.RO


Jiaxu Xing, Zhiyuan Zhu, Yunfan Ren, Ismail Geles, Yifan Zhai, Rudolf Reiter, Davide Scaramuzza

Pith reviewed 2026-06-26 04:26 UTC · model grok-4.3


keywords continual learning · robot policy · dynamics modeling · variational inference · recurrent encoder · online adaptation · quadrotor control · neural residual


The pith

A variational dynamics model lets robot policies recover from recurring disturbances like wind changes by inferring hidden conditions online instead of re-fitting residuals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a continual learning method for robot controllers that must handle hidden, recurring changes in dynamics such as shifting wind or varying payloads. It builds a condition-aware model that merges an analytical physics prior with a neural residual for unknown effects, then uses a recurrent encoder to extract the current hidden condition from short recent state-action histories. Policies are trained through differentiable simulation that draws diverse conditions from the learned latent distribution. At runtime the policy is conditioned on conditions inferred directly from real interaction, enabling recognition-based adaptation rather than repeated residual updates. Experiments on a quadrotor show recovery from wind disturbances in roughly one second and large error reductions relative to prior online adaptation techniques.
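The encoder step described above can be pictured with a minimal sketch. It assumes a tiny Elman-style recurrent network with random, untrained weights; the names `make_encoder` and `encode_history`, the dimensions, and the toy histories are all hypothetical and are not taken from the paper.

```python
import math
import random


def make_encoder(input_dim, hidden_dim, latent_dim, seed=0):
    """Build random weights for a tiny Elman-style RNN.

    Hypothetical stand-in: a real encoder would be trained, not random.
    """
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_in": matrix(hidden_dim, input_dim),
        "W_h": matrix(hidden_dim, hidden_dim),
        "W_out": matrix(latent_dim, hidden_dim),
    }


def matvec(m, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def encode_history(params, history):
    """Fold a short list of (state, action) pairs into one latent vector."""
    hidden = [0.0] * len(params["W_h"])
    for state, action in history:
        x = list(state) + list(action)
        pre = [a + b for a, b in zip(matvec(params["W_in"], x),
                                     matvec(params["W_h"], hidden))]
        hidden = [math.tanh(p) for p in pre]
    return matvec(params["W_out"], hidden)


# Toy histories (made up): 10 steps of a 2-D state and a 1-D action.
params = make_encoder(input_dim=3, hidden_dim=4, latent_dim=2)
calm = [((0.0, 0.0), (0.0,))] * 10
windy = [((0.1 * t, 0.5), (0.2,)) for t in range(10)]
z_calm = encode_history(params, calm)
z_windy = encode_history(params, windy)
```

The point of the sketch is only that the latent is a deterministic function of a short window of recent interaction, so different recent histories map to different latent estimates.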

Core claim

By training a policy on dynamics sampled from a variational model whose latent conditions are inferred online by a recurrent encoder, the robot can adapt to recurring hidden dynamics through recognition rather than residual re-fitting, yielding recovery times around one second and error reductions of 65.7 percent in hover and 53.3 percent in tracking on real quadrotors under changing wind.
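For readers checking percentages of this kind, a relative error reduction is conventionally computed as shown here. That the paper uses exactly this definition is an assumption, and the symbols $e_{\text{base}}$ and $e_{\text{ours}}$ are introduced only for illustration. The worked instance uses the figure-eight tracking errors quoted in the Figure 2 caption (41 cm before refinement, 9 cm after), which is a different comparison from the 65.7 and 53.3 percent figures.

```latex
\text{reduction} = \frac{e_{\text{base}} - e_{\text{ours}}}{e_{\text{base}}} \times 100\%,
\qquad
\frac{41\,\text{cm} - 9\,\text{cm}}{41\,\text{cm}} \times 100\% \approx 78.0\%
```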

What carries the argument

A variational neural dynamics model that fuses an analytical physics prior with a neural residual and conditions both the residual and the policy on a latent state inferred by a recurrent encoder from recent trajectories.
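A minimal sketch of one such hybrid step follows. It assumes a 1-D double integrator as the analytical prior and a one-hidden-layer residual whose features are scaled and shifted by the latent (the FiLM-style modulation named in the Figure 1 caption). The function names, the weight layout, and the 0.02 s step (chosen to match the 50 Hz deployment rate in that caption) are hypothetical.

```python
def prior_step(state, action, dt):
    """Analytical prior: 1-D double integrator (stand-in for a rigid-body model)."""
    pos, vel = state
    return (pos + vel * dt, vel + action * dt)


def film_residual(state, action, latent, weights):
    """Residual acceleration from a tiny network with FiLM-style conditioning.

    Each hidden unit computes relu(scale * (w . features) + shift), where
    scale and shift are linear functions of the latent.
    """
    features = [state[0], state[1], action]
    hidden = []
    for w_row, g_row, b_row in zip(weights["W1"], weights["gamma"], weights["beta"]):
        pre = sum(w * f for w, f in zip(w_row, features))
        scale = 1.0 + sum(g * z for g, z in zip(g_row, latent))
        shift = sum(b * z for b, z in zip(b_row, latent))
        hidden.append(max(0.0, scale * pre + shift))
    return sum(w * h for w, h in zip(weights["w_out"], hidden))


def hybrid_step(state, action, latent, weights, dt=0.02):
    """Prior prediction plus the latent-conditioned residual acceleration."""
    pos, vel = prior_step(state, action, dt)
    return (pos, vel + film_residual(state, action, latent, weights) * dt)


# Hand-picked toy weights (not learned): the residual equals the latent value.
weights = {"W1": [[0.0, 0.0, 0.0]], "gamma": [[0.0]], "beta": [[1.0]], "w_out": [1.0]}
no_wind = hybrid_step((0.0, 0.0), 0.0, [0.0], weights)
wind = hybrid_step((0.0, 0.0), 0.0, [2.0], weights)
```

With a zero latent the step reduces to the prior alone, which is the separation between prior and residual that the premise below relies on.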

If this is right

Policies recover from recurring disturbances in roughly one second on real quadrotors under changing wind.
Large-disturbance hover errors drop by 65.7 percent and tracking errors by 53.3 percent versus state-of-the-art online adaptation.
Policy learning proceeds by sampling diverse conditions from the latent model inside differentiable simulation.
At deployment, real-time encoder outputs replace sampled conditions to enable fast recognition of known dynamics.
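The last two points, sampling during training and swapping in encoder outputs at deployment, can be sketched as a single switch. This assumes a diagonal Gaussian over latents with a reparameterised draw, and it assumes the deployed policy consumes the encoder's point estimate; neither detail is confirmed by the text above, and all names are hypothetical.

```python
import math
import random


def sample_latent(mean, log_std, rng):
    """Reparameterised draw z = mean + exp(log_std) * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]


def latent_for_policy(mode, *, prior_mean=None, prior_log_std=None,
                      encoder_mean=None, rng=None):
    """Pick where the policy's conditioning latent comes from."""
    if mode == "train":
        # Training in simulation: expose the policy to diverse sampled conditions.
        return sample_latent(prior_mean, prior_log_std, rng)
    if mode == "deploy":
        # Deployment: use the latent inferred online from recent interaction.
        return list(encoder_mean)
    raise ValueError(f"unknown mode: {mode}")


rng = random.Random(0)
z_train_a = latent_for_policy("train", prior_mean=[0.0, 0.0],
                              prior_log_std=[0.0, 0.0], rng=rng)
z_train_b = latent_for_policy("train", prior_mean=[0.0, 0.0],
                              prior_log_std=[0.0, 0.0], rng=rng)
z_deploy = latent_for_policy("deploy", encoder_mean=[0.3, -0.1])
```

Keeping the policy's input interface identical in both modes is what lets the same network run unchanged once sampled latents are replaced by inferred ones.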

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same encoder-based inference could be tested on ground robots facing recurring terrain or payload shifts.
If the separation between prior and residual holds, the method might reduce the frequency of full policy retraining in long deployments.
Combining the latent condition with other adaptation signals such as visual cues could be examined as an extension.
The approach suggests a route for continual learning on platforms where dynamics recur but are not fully observable from single steps.

Load-bearing premise

The framework assumes recurring hidden dynamics can be reliably inferred online from short recent interaction histories via the recurrent encoder without significant interference or mode collapse between the physics prior and neural residual.

What would settle it

If recovery time from recurring wind disturbances on the quadrotor equals or exceeds the time required by online residual re-fitting, the advantage of inference-based adaptation would be falsified.
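One way to make this test concrete is to measure recovery time from a logged error trace. The sketch below assumes recovery means the error staying under a threshold for a few consecutive samples; the threshold, the hold length, and the synthetic trace are illustrative choices and not values from the paper.

```python
def recovery_time(errors, dt, onset_index, threshold, hold_steps=1):
    """Seconds from disturbance onset until the error stays below `threshold`
    for `hold_steps` consecutive samples. Returns None if that never happens."""
    run = 0
    for i in range(onset_index, len(errors)):
        run = run + 1 if errors[i] < threshold else 0
        if run >= hold_steps:
            first_good = i - hold_steps + 1
            return (first_good - onset_index) * dt
    return None


def advantage_falsified(t_recognition, t_refit):
    """The criterion above: recognition must be strictly faster than re-fitting."""
    return t_recognition >= t_refit


# Synthetic trace (made up): error jumps at sample 10 and settles 50 samples later.
dt = 0.02
errors = [0.05] * 10 + [0.5] * 50 + [0.05] * 40
t_rec = recovery_time(errors, dt, onset_index=10, threshold=0.1, hold_steps=5)
```

Running the same function on traces from both methods under identical wind schedules would give the two numbers the criterion compares.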

Figures

Figures reproduced from arXiv: 2606.27353 by Davide Scaramuzza, Ismail Geles, Jiaxu Xing, Rudolf Reiter, Yifan Zhai, Yunfan Ren, Zhiyuan Zhu.

**Figure 1.** Method overview. Our framework learns latent-conditioned residual dynamics from real trajectories, augmenting a rigid-body prior with a FiLM-modulated neural residual inferred from recent state-action history. Sampled latents drive parallel differentiable policy training, while online inferred latents enable 50 Hz condition-aware deployment without privileged disturbance labels. Existing approaches only ad…

**Figure 2.** Real-world policy refinement. On hardware, continual policy learning reduces figure-eight tracking error from 41 cm for the base policy to 9 cm after refinement. The lower panels show representative xy tracking traces before and after refinement, illustrating improved trajectory tracking under the same deployment setup.

4.2 System Analysis. Dynamics Learning and Latent Space Analysis. We next examine whet…

**Figure 4.** Latent dynamics analysis. Unsupervised latents cluster by wind direction and magnitude, and the learned dynamics accurately reproduce a representative wind-conditioned rollout.

… learning: the hidden dynamics can be inferred from real trajectories, and the latent embedding can generate plausible condition-specific rollouts for training.

Table fragment (columns: Method; Hover Small↓, Large↓; Tracking Small↓, Large↓): Base DiffSim 0.328 1.…

**Figure 5.** Quantitative benchmark comparisons. Across quadrotor landing, vision-based hovering, trajectory tracking, and contact-rich manipulation, the proposed latent dynamics framework approaches the ground-truth-disturbance oracle and outperforms residual, single-condition, and latent-adaptation baselines. Lower error is better except for box-pushing success rate.

… tracking, our method is consistently closest to …

**Figure 6.** Manipulation simulation setup. We apply the same continual-learning pipeline to manipulation domains with hidden object, payload, and contact dynamics. Deployment trajectories update a latent-conditioned residual model, and sampled latents expose the policy to recurring hidden manipulation conditions during differentiable policy optimization.

Quadrotor wind model. To evaluate the system in controlled sim…

**Figure 7.** Training curves comparing return versus epoch.

**Figure 8.** Experiments on Learned Dynamics. Our learned Variational Dynamics Model is able to capture diverse disturbance conditions with very low open-loop cumulative errors.

[Plot panels: xy trajectories (X [m] vs. Y [m]) comparing Reference and Flown paths. Recoverable panel titles: Wind (+1.5, +0.0), |1.5|, err 0.02 m; Wind (+0.0, +1.5), |1.5|, err 0.02 m; remaining panels truncated.]

**Figure 9.** Policy performance in diverse disturbance conditions. Our learned policy through the continual learning framework is able to capture diverse external conditions.

The reconstruction term decodes the latent back to the normalized context sequence,

$$\mathcal{L}_{\mathrm{rec}} = \frac{1}{N} \sum_{n=1}^{N} \left\lVert G_\eta\big(E_\phi(H_n)\big) - \mathrm{norm}(H_n) \right\rVert_2^2, \tag{16}$$

where $G_\eta$ is the auxiliary decoder and $\mathrm{norm}(\cdot)$ applies the training-buffer input normalization. This term is a…
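The reconstruction term in Eq. (16) can be written out directly. This is a minimal sketch assuming each context $H_n$ has been flattened to a plain list of floats and that the encoder, decoder, and normalization are passed in as callables; the function name and the toy callables are hypothetical.

```python
def reconstruction_loss(contexts, encoder, decoder, normalize):
    """Mean over N contexts of ||decoder(encoder(H_n)) - normalize(H_n)||_2^2."""
    total = 0.0
    for h in contexts:
        target = normalize(h)
        recon = decoder(encoder(h))
        total += sum((r - t) ** 2 for r, t in zip(recon, target))
    return total / len(contexts)


# Toy check with placeholder callables (not trained networks).
contexts = [[1.0, 2.0], [3.0, 4.0]]
identity = lambda v: list(v)
zeros = lambda v: [0.0] * len(v)
perfect = reconstruction_loss(contexts, identity, identity, identity)
worst = reconstruction_loss(contexts, identity, zeros, identity)
```

With an identity encoder and decoder the loss is zero, and with a decoder that outputs zeros it equals the mean squared norm of the normalized contexts.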

Original abstract

Robots deployed in the real world rarely operate under a single fixed dynamics model: wind changes, payloads vary, batteries drain, contacts shift, and hardware wears. Yet most learning-based controllers are trained once and deployed as if learning were complete. This prevents the robot from using deployment experience to further improve task performance. In this work, we propose a continual learning framework that uses real-world experience to improve robot policies under hidden and recurring dynamics. Our method learns a condition-aware dynamics model from real state-action trajectories by combining an analytical physics prior with a neural residual for unmodeled effects. A recurrent encoder infers the current hidden condition from recent interaction, and this estimate conditions both the residual model and the policy. Policy learning is performed via differentiable simulation using diverse learned dynamics sampled from the latent model. At deployment, these sampled conditions are replaced by conditions inferred online from recent real interaction, allowing the policy to recover recurring dynamics by recognition rather than residual re-fitting. Through extensive simulation studies and real-world experiments, we demonstrate that the framework improves policy performance under diverse unobserved disturbances. On real quadrotor trajectory tracking under changing wind, the policy recovers from recurring disturbances in roughly 1 s, about 5x faster than online residual re-fitting. It also reduces large-disturbance hover and tracking errors by 65.7% and 53.3%, respectively, over state-of-the-art online adaptation approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core idea is using a recurrent encoder to infer hidden dynamics conditions online so the policy can recognize and adapt to recurring disturbances instead of refitting residuals each time.


The main takeaway is that this work gives a concrete way to do continual policy improvement on robots when dynamics change in recurring but hidden ways. It combines an analytical physics model with a neural residual, puts both under a variational latent condition, and uses a recurrent encoder to pull the current condition from short recent trajectories. That condition then drives both the residual and the policy. Policy training happens in differentiable simulation by sampling from the learned latent model; at test time the samples get swapped for real-time inferences.

What stands out as new is the specific loop that turns online inference into policy adaptation by recognition rather than repeated system identification. The real quadrotor experiments under changing wind are the strongest part: they report recovery in about one second (roughly 5x faster than online residual refitting) and error reductions of 65.7% and 53.3% versus prior online adaptation methods. Those numbers come from actual hardware, which is worth something.

The soft spot is exactly where the stress-test note flags it. The headline gains rest on the recurrent encoder producing a clean, usable latent condition from short histories without mode collapse or leakage into the analytical prior. The abstract gives no numbers on inference accuracy, disentanglement quality, or behavior under sensor noise, so it is hard to judge whether the 5x speedup is robust or tied to favorable conditions. Without those checks the advantage over simpler residual methods could shrink.

This is for people working on adaptive robot control who already have differentiable simulators and want to handle non-stationary but recurring effects. It deserves a serious referee because the real-robot results are there and the framing is coherent, even if the inference step needs more scrutiny in revision.

Referee Report

1 major / 1 minor

Summary. The paper proposes a continual learning framework for robot policies under hidden recurring dynamics. It learns a condition-aware dynamics model by combining an analytical physics prior with a neural residual, uses a recurrent encoder to infer the current hidden condition from recent state-action trajectories, and conditions both the residual and the policy on this latent estimate. Policy learning occurs via differentiable simulation sampling diverse conditions from the variational model; at deployment, online-inferred conditions replace sampling to enable recognition-based recovery rather than residual re-fitting. Real quadrotor experiments under changing wind report ~1s recovery (5x faster than online residual re-fitting) and 65.7%/53.3% reductions in large-disturbance hover/tracking errors versus SOTA online adaptation methods.

Significance. If the central claims hold, the framework provides a practical route to continual policy improvement in real-world robotics by leveraging variational inference for condition recognition instead of repeated adaptation. Strengths include the hybrid analytical-neural model, differentiable simulation for policy optimization, and demonstration on physical hardware with recurring disturbances. This could influence adaptive control and lifelong learning in robotics if the inference step proves robust.

major comments (1)

[recurrent encoder and online inference procedure (Section 3)] The headline performance claims (1s recovery, 5x speedup, 65.7% and 53.3% error reductions) rest on the recurrent encoder reliably extracting a usable latent condition from short recent histories without mode collapse or leakage into the analytical prior. No quantitative evaluation of inference accuracy, latent disentanglement, or robustness to sensor noise is reported, which is load-bearing for the advantage over residual re-fitting.

minor comments (1)

Notation for the variational posterior and the conditioning mechanism could be clarified with an explicit diagram or additional equations showing how the latent sample is injected into the residual and policy.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the paper.


Referee: [recurrent encoder and online inference procedure (Section 3)] The headline performance claims (1s recovery, 5x speedup, 65.7% and 53.3% error reductions) rest on the recurrent encoder reliably extracting a usable latent condition from short recent histories without mode collapse or leakage into the analytical prior. No quantitative evaluation of inference accuracy, latent disentanglement, or robustness to sensor noise is reported, which is load-bearing for the advantage over residual re-fitting.

Authors: We agree that the reported performance advantages depend on the recurrent encoder's ability to perform reliable online inference. While the end-to-end simulation and hardware results provide indirect support through task-level metrics, we acknowledge the value of direct quantitative analysis. In the revised manuscript, we will add: (i) inference accuracy metrics on trajectories with known ground-truth conditions in simulation, (ii) quantitative measures of latent disentanglement (e.g., mutual information or correlation analysis between latent dimensions and condition parameters), and (iii) robustness evaluations under injected sensor noise. These additions will directly substantiate the claims regarding the inference procedure's contribution to faster recovery versus residual re-fitting.

Revision: yes
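The correlation analysis proposed in item (ii) could look roughly like the following. This is a sketch assuming simulation logs that pair each inferred latent with its ground-truth condition parameters; the Pearson coefficient is one possible choice of measure, and the function names and toy data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 by convention if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)


def latent_condition_correlations(latents, conditions):
    """table[i][j] = correlation between latent dimension i and condition parameter j."""
    latent_dims = list(zip(*latents))
    condition_dims = list(zip(*conditions))
    return [[pearson(z, c) for c in condition_dims] for z in latent_dims]


# Toy data (made up): latent dim 0 tracks wind speed, dim 1 is constant.
latents = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
conditions = [[0.0], [2.0], [4.0]]
table = latent_condition_correlations(latents, conditions)
```

A table with one dominant entry per row would indicate that individual latent dimensions align with individual condition parameters, whereas diffuse rows would suggest entanglement.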

Circularity Check

0 steps flagged

No significant circularity; performance claims rest on external experimental comparisons


The paper's core claims concern empirical recovery speed and error reductions on real quadrotor hardware under wind disturbances, benchmarked against online residual re-fitting and state-of-the-art adaptation methods. The framework description (analytical prior + neural residual, recurrent encoder for latent condition, differentiable simulation for policy training) introduces no self-definitional loops, no fitted parameters renamed as predictions, and no load-bearing self-citations that substitute for independent verification. All reported metrics derive from held-out real trajectories and baseline comparisons rather than algebraic reduction to the model's own fitted values. This is the normal non-circular outcome for an experimental robotics paper whose central results are falsifiable outside its training loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no specific free parameters, axioms, or invented entities can be extracted or audited.

