One Demonstration Is Enough for Real-World Robotic Reinforcement Learning

Ceyao Zhang; Hongze Yu; Junge Zhang; Song Liu; Yaodong Yang; Yuanpei Chen; Yuhan Wang; Yuwan Liu

arxiv: 2607.01651 · v1 · pith:L4IYUY44new · submitted 2026-07-02 · 💻 cs.RO

Yuwan Liu, Hongze Yu, Song Liu, Yuhan Wang, Junge Zhang, Yaodong Yang, Yuanpei Chen, Ceyao Zhang

Pith reviewed 2026-07-03 12:36 UTC · model grok-4.3

keywords robotic reinforcement learning · single demonstration · automated intervention · contact-intensive manipulation · real-world robot learning · safety recovery · sliding window intervention

The pith

AutoSERL trains effective real-world robot policies from one demonstration by automating all human intervention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a single demonstration can replace both large demonstration sets and ongoing human oversight in physical robot reinforcement learning. It introduces three mechanisms that together guide safe exploration, recover from failures, and stop intervening once the learned policy succeeds alone. A reader would care because data collection and human time remain the dominant costs when moving RL from simulation to contact-rich hardware tasks such as peg insertion and object hanging. The reported experiments demonstrate that the resulting policies exceed several multi-demonstration and imitation baselines while matching continuous human-in-the-loop performance across six tasks on two robot arms.

Core claim

AutoSERL automates intervention in real-world robot RL from a single demonstration by combining a sliding window intervention mechanism that continuously steers exploration away from local optima, a safety recovery mechanism that returns the robot to predefined trajectory points after detected failures, and an intervention termination criterion that disables guidance once the policy completes the task independently. On six contact-intensive manipulation tasks the method reaches 100 percent success on insertion problems, exceeds SERL initialized with twenty demonstrations, behavior cloning, and a dedicated one-shot imitation baseline, and matches human-in-the-loop SERL while improving robustness to positional variations.

What carries the argument

The AutoSERL framework, whose three mechanisms (sliding-window guidance, recovery-point safety correction, and automatic termination) together convert one demonstration into continuous automated supervision until the policy no longer needs it.
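
A minimal sketch of how the three mechanisms might be arbitrated at each control step. This page does not give the paper's actual logic, so the priority order, the `StepState` fields, and the `required_successes` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    POLICY = "policy"
    SLIDING_WINDOW = "sliding_window"
    RECOVERY = "recovery"


@dataclass
class StepState:
    # All fields are hypothetical signals, not names taken from the paper.
    unassisted_successes: int  # consecutive episodes solved without guidance
    is_stuck: bool             # output of some failure detector
    off_track: bool            # sliding-window check asks for guidance


def choose_source(state: StepState, required_successes: int = 3) -> Source:
    """Pick who controls the robot on this step (assumed priority order)."""
    # Termination: once the policy succeeds alone often enough, stop guiding.
    if state.unassisted_successes >= required_successes:
        return Source.POLICY
    # Safety recovery takes precedence over ordinary guidance.
    if state.is_stuck:
        return Source.RECOVERY
    if state.off_track:
        return Source.SLIDING_WINDOW
    return Source.POLICY
```

Under these assumptions, `choose_source(StepState(0, True, True))` selects recovery, while the same state with three unassisted successes hands control back to the policy.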

If this is right

Policies reach 100 percent success on insertion tasks using only one demonstration.
Performance exceeds methods that start with twenty demonstrations or rely on behavior cloning.
Robustness to positional variations improves compared with the listed baselines.
Results match those of continuous human-in-the-loop training without requiring the human during learning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach lowers the barrier to deploying RL on new robot hardware by reducing both demonstration volume and live supervision.
If recovery points can be generated automatically rather than supplied by hand, the method would extend to a wider set of tasks without additional engineering.
The termination criterion may allow the same framework to scale to longer-horizon sequences once the policy stabilizes on the initial sub-tasks.

Load-bearing premise

Predefined trajectory recovery points must be supplied for each task so that the safety mechanism can correct failures without introducing hidden task-specific engineering.

What would settle it

Measure success rates on the same insertion tasks after removing or randomly perturbing the predefined recovery points; a sharp drop would falsify the claim that one demonstration plus the described automation is sufficient.
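
One way such a check could be set up is sketched below. The helper names, the Gaussian noise model, and the use of a plain difference in success rate are assumptions for illustration; the paper does not describe this experiment.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def perturb_recovery_points(
    points: Sequence[Point], sigma_m: float, seed: int = 0
) -> List[Point]:
    """Add zero-mean Gaussian noise (std sigma_m, in metres) to each coordinate."""
    rng = random.Random(seed)  # seeded so a perturbation can be reproduced
    return [tuple(c + rng.gauss(0.0, sigma_m) for c in p) for p in points]


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of successful episodes."""
    return sum(outcomes) / len(outcomes)


def success_drop(baseline: Sequence[bool], perturbed: Sequence[bool]) -> float:
    """Positive values mean the perturbed condition did worse."""
    return success_rate(baseline) - success_rate(perturbed)
```

A drop near zero across several noise levels would support the sufficiency claim; a drop that grows quickly with `sigma_m` would point to hand-tuned recovery points doing hidden work.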

Figures

Figures reproduced from arXiv: 2607.01651 by Ceyao Zhang, Hongze Yu, Junge Zhang, Song Liu, Yaodong Yang, Yuanpei Chen, Yuhan Wang, Yuwan Liu.

**Figure 1.** Overview of AutoSERL. Auto Intervention 1 (Sliding Window Intervention): the robot is guided to the nearest point within the sliding window only when the angle θ between the trajectory’s forward direction and the vector to that point satisfies θ ≤ 90°, preventing the robot from being pulled back to already-visited positions. Auto Intervention 2 (Safety Recovery Mechanism): when the robot is stuck, it is g…

**Figure 1.** All tasks considered in this paper involve interaction between a hand …

**Figure 2.** Experimental setup. Left: the setup for the hanging and hinge-based tasks, consisting of a UR5 robot, an Inspire dexterous hand, and two Intel RealSense D435 cameras. Right: the setup for the insertion tasks, consisting of a Franka robot and two wrist-mounted Intel RealSense D405 cameras. During evaluation, all automatic intervention mechanisms are disabled, and each task is evaluated over 50 episodes. Task…

**Figure 3.** Overview of the experimental tasks: (A) Plug Insertion. (B) USB Insertion. (C) Hanger Suspension. (D) Correction Tape Suspension. (E) Spoon Suspension. (F) Drawer Opening.

**Figure 4.** Overview of the stuck cases across different tasks: (A) Plug Insertion. (B) USB Insertion. (C) Drawer Opening. (D) Hanger Suspension. (E) Correction Tape Suspension. (F) Spoon Suspension.

**Figure 5.** Training curves of time versus intervention steps and time versus episode return for each task under SERL and AutoSERL. Positional variations: in the plug insertion task, we randomize the initial plug position within a ±3 cm range in the x–y plane while keeping the socket position fixed to evaluate robustness to positional variations. For each episode, the intervention reference trajectory consists of the …

**Figure 6.** Robustness and Heuristic Hyperparameter Analysis: (a) and (b) show the training curves for the plug insertion task across five random seeds and under positional variations. (c) and (d) present the training curves for the plug insertion task under different settings of hyperparameters th1 and th2, respectively.

**Figure 7.** Ablation study and trajectory comparison: (a) and (b) results on the plug insertion and USB insertion tasks under the No sliding window intervention and No recovery mechanism settings, respectively. (c) results on the drawer opening task under the No intervention termination setting. (d) results on 3D visualization of the demo trajectory and the policy rollout trajectory trained from it for the plug insert…
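
The sliding-window rule in the Figure 1 caption can be read geometrically: θ ≤ 90° is equivalent to a non-negative dot product between the trajectory's forward direction and the vector from the robot to the candidate point. The sketch below is one possible reading of that rule, not the authors' implementation. The window indexing, the finite-difference estimate of the forward direction, and the function name `guidance_target` are assumptions.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def _sub(a: Point, b: Point) -> Point:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Point, b: Point) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def guidance_target(
    demo: Sequence[Point], start: int, width: int, position: Point
) -> Optional[int]:
    """Return the index of the demo point to steer toward, or None.

    demo[start : start + width] is the (assumed) sliding window.
    """
    end = min(start + width, len(demo))
    if end - start < 1:
        return None
    # Nearest demo point inside the window.
    idx = min(range(start, end), key=lambda i: math.dist(demo[i], position))
    # Forward direction of the demo near that point (assumed: finite difference).
    nxt, prev = min(idx + 1, len(demo) - 1), max(idx - 1, 0)
    forward = _sub(demo[nxt], demo[prev])
    to_point = _sub(demo[idx], position)
    # theta <= 90 degrees is the same as a non-negative dot product.
    return idx if _dot(forward, to_point) >= 0.0 else None
```

With a straight demo along the x-axis, a robot slightly past the nearest point gets no guidance (the point lies behind it), while a robot slightly short of the nearest point is steered toward it.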

read the original abstract

Learning effective robot control policies on physical hardware is challenging due to costly data collection and the difficulty of reward specification. Prior work has incorporated demonstrations into reinforcement learning (RL), yet existing approaches either require large numbers of demonstrations or depend on continuous human intervention during training. To address these limitations, we present AutoSERL, a framework that leverages a single demonstration to fully automate the intervention process in real-world robot RL. The framework includes three complementary mechanisms to accomplish certain tasks: a sliding window intervention mechanism that continuously guides exploration to prevent local optima and unsafe deviations, a safety recovery mechanism that detects and corrects failure states via predefined trajectory recovery points, and an intervention termination criterion that automatically disables guidance once the policy can independently complete the task, preserving its exploration advantage. We evaluate AutoSERL on six contact-intensive manipulation tasks across two robot platforms, spanning insertion, hanging, and hinge-based tasks. AutoSERL consistently outperforms SERL initialized with 20 demonstrations, behavior cloning, and MILES -- a dedicated one-shot imitation learning baseline -- across all tasks while matching HIL-SERL, achieves 100% success rate on insertion tasks, and demonstrates improved robustness to positional variations, all from a single demonstration. Code and videos are available on our project website: https://autoserl.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AutoSERL shows workable hardware results from one demo but the safety recovery points likely add uncounted human setup that undercuts the automation claim.

The main thing to know is that AutoSERL combines sliding-window guidance, safety recovery via predefined trajectory points, and automatic termination to run real-robot RL after a single demonstration, reporting better results than SERL with 20 demos and matching a human-in-the-loop version on insertion and hanging tasks.

The paper does the useful work of testing on physical hardware across two platforms and six contact-rich tasks, with claims of 100% success on insertions and improved robustness to position shifts. Releasing code and videos is also straightforward and helpful for anyone who wants to check the implementation.

The soft spot is the recovery mechanism. The abstract states it corrects failures with predefined trajectory recovery points, yet gives no indication these points are derived automatically from the single demonstration. If they require separate manual specification per task, the total human cost exceeds one demonstration and the "fully automate" framing does not hold. That matches the stress-test concern exactly, and it is load-bearing for the central claim. The abstract also omits error bars or statistical tests, which makes the consistency of the outperformance harder to judge without the full tables.

This is for roboticists working on sample-efficient or safe RL for manipulation. A reader already running physical experiments would get concrete comparisons to evaluate. It deserves peer review because the hardware results are the substantive part and the mechanisms are specific enough to critique in detail.

Referee Report

1 major / 1 minor

Summary. The manuscript presents AutoSERL, a framework for automating intervention in real-world robotic RL using only one demonstration. It features a sliding window intervention, safety recovery via predefined trajectory recovery points, and an automatic termination criterion. Evaluations on six tasks across two platforms show outperformance over several baselines and 100% success on insertion tasks.

Significance. Should the central claims regarding full automation from a single demonstration hold, this work could significantly lower the barrier to applying RL on physical robots by minimizing human input. The provision of code and videos is a positive aspect for reproducibility.

major comments (1)

Abstract: The safety recovery mechanism relies on 'predefined trajectory recovery points' to detect and correct failure states. However, the paper's claim is that it 'fully automate[s] the intervention process' from 'a single demonstration.' The manuscript does not specify how these recovery points are sourced or derived from the single demonstration alone. This is a load-bearing issue for the automation and one-demonstration claims, as manual specification per task would introduce additional human effort not accounted for in the central guarantee.

minor comments (1)

Abstract: The results are presented without error bars, statistical tests, or implementation details for baselines, which limits assessment of the performance claims.
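
To make the missing error bars concrete, a standard Wilson score interval can be computed for a success count out of the 50 evaluation episodes mentioned in the Figure 2 caption. This is a generic statistical sketch, not an analysis from the paper; the function name and the choice of a 95% level are assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


low, high = wilson_interval(50, 50)
print(f"50/50 successes: [{low:.3f}, {high:.3f}]")
```

For 50 successes in 50 episodes the lower bound comes out at roughly 0.93, so a reported 100% success rate is still compatible with a true rate a few points lower.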

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and constructive comments. We address the single major comment below.

Referee: Abstract: The safety recovery mechanism relies on 'predefined trajectory recovery points' to detect and correct failure states. However, the paper's claim is that it 'fully automate[s] the intervention process' from 'a single demonstration.' The manuscript does not specify how these recovery points are sourced or derived from the single demonstration alone. This is a load-bearing issue for the automation and one-demonstration claims, as manual specification per task would introduce additional human effort not accounted for in the central guarantee.

Authors: We agree that the abstract and methods would benefit from greater explicitness on this point to support the central claim. The recovery points are obtained directly from the single demonstration by automatically selecting key states along the demonstrated trajectory that enable return to safe configurations. In the revised manuscript we will update the abstract and the relevant methods description to state this derivation process explicitly, confirming that no per-task manual specification beyond the initial demonstration is required. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on external robot benchmarks

The paper reports physical-robot success rates and robustness metrics for AutoSERL versus independent baselines (SERL with 20 demonstrations, behavior cloning, MILES, HIL-SERL). No equations, fitted parameters, or derivations appear in the provided text that could reduce the claimed 100% insertion success or outperformance to a quantity defined by the method itself. The safety-recovery mechanism is described at the level of implementation rather than as a mathematical reduction; any cost of supplying recovery points is an empirical assumption, not a self-referential derivation. The evaluation therefore remains self-contained against external hardware benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the three mechanisms are presented as engineering components rather than new theoretical entities.

pith-pipeline@v0.9.1-grok · 5776 in / 1091 out tokens · 48646 ms · 2026-07-03T12:36:13.051578+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references · 25 canonical work pages · 9 internal anchors

[1] Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., Topcu, U.: Safe reinforcement learning via shielding (2017), https://arxiv.org/abs/1708.08611

[2] Chen, Y., Tian, S., Liu, S., Zhou, Y., Li, H., Zhao, D.: Conrft: A reinforced fine-tuning method for vla models via consistency policy (2025), https://arxiv.org/abs/2502.05450

[3] Deng, H., Gao, Y., Lin, Y., Liu, H., Wu, Z., Wang, Z.: Uniintervene: Agentic intervention for efficient real-world reinforcement learning. arXiv preprint arXiv:2606.12372 (2026)

[4] Dulac-Arnold, G., Mankowitz, D., Hester, T.: Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901 (2019)

[5] Fisac, J.F., Akametalu, A.K., Zeilinger, M.N., Kaynama, S., Gillula, J., Tomlin, C.J.: A general safety framework for learning-based control in uncertain robotic systems (2018), https://arxiv.org/abs/1705.01292

[6] Gillula, J.H., Tomlin, C.J.: Guaranteed safe online learning via reachability: tracking a ground target using a quadrotor. In: 2012 IEEE International Conference on Robotics and Automation. pp. 2723–2730 (2012). https://doi.org/10.1109/ICRA.2012.6225136

[7] Hoque, R., Balakrishna, A., Novoseller, E., Wilcox, A., Brown, D.S., Goldberg, K.: Thriftydagger: Budget-aware novelty and risk gating for interactive imitation learning (2021), https://arxiv.org/abs/2109.08273

[8] Hu, K., Shi, H., He, Y., Wang, W., Liu, C.K., Song, S.: Robot trains robot: Automatic real-world policy adaptation and learning for humanoids (2025), https://arxiv.org/abs/2508.12252

[9] Johns, E.: Coarse-to-fine imitation learning: Robot manipulation from a single demonstration (2021), https://arxiv.org/abs/2105.06411

[10] Kelly, M., Sidrane, C., Driggs-Campbell, K., Kochenderfer, M.J.: Hg-dagger: Interactive imitation learning with human experts (2019), https://arxiv.org/abs/1810.02890

[11] Li, H., Lei, K., Zang, S., Hu, K., Liang, Y., An, B., Li, X., Xu, H.: Failure-aware rl: Reliable offline-to-online reinforcement learning with self-recovery for real-world manipulation. arXiv preprint arXiv:2601.07821 (2026)

[12] Li, S., Bastani, O.: Robust model predictive shielding for safe reinforcement learning with stochastic dynamics (2020), https://arxiv.org/abs/1910.10885

[13] Liu, H., Nasiriany, S., Zhang, L., Bao, Z., Zhu, Y.: Robot learning on the job: Human-in-the-loop autonomy and learning during deployment (2023), https://arxiv.org/abs/2211.08416

[14] Luo, J., Hu, Z., Xu, C., Tan, Y.L., Berg, J., Sharma, A., Schaal, S., Finn, C., Gupta, A., Levine, S.: Serl: A software suite for sample-efficient robotic reinforcement learning. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 16961–16969. IEEE (2024)

[15] Luo, J., Xu, C., Wu, J., Levine, S.: Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning. Science Robotics 10(105), eads5033 (2025)

[16] Mandlekar, A., Xu, D., Martín-Martín, R., Zhu, Y., Fei-Fei, L., Savarese, S.: Human-in-the-loop imitation learning using remote teleoperation (2020), https://arxiv.org/abs/2012.06733

[17] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)

[18] Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., Abbeel, P.: Overcoming exploration in reinforcement learning with demonstrations. In: 2018 IEEE international conference on robotics and automation (ICRA). pp. 6292–6299. IEEE (2018)

[19] Palo, N.D., Johns, E.: On the effectiveness of retrieval, alignment, and replay in manipulation (2023), https://arxiv.org/abs/2312.12345

[20] Papagiannis, G., Johns, E.: Miles: Making imitation learning easy with self-supervision. arXiv preprint arXiv:2410.19693 (2024)

[21] Rajeswaran, A., Kumar, V., Gupta, A., Vezzani, G., Schulman, J., Todorov, E., Levine, S.: Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087 (2017)

[22] Ross, S., Gordon, G.J., Bagnell, J.A.: A reduction of imitation learning and structured prediction to no-regret online learning (2011), https://arxiv.org/abs/1011.0686

[23] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)

[24] Thananjeyan, B., Balakrishna, A., Nair, S., Luo, M., Srinivasan, K., Hwang, M., Gonzalez, J.E., Ibarz, J., Finn, C., Goldberg, K.: Recovery rl: Safe reinforcement learning with learned recovery zones (2021), https://arxiv.org/abs/2010.15920

[25] Valassakis, E., Papagiannis, G., Palo, N.D., Johns, E.: Demonstrate once, imitate immediately (dome): Learning visual servoing for one-shot imitation learning (2022), https://arxiv.org/abs/2204.02863

[26] Vecerik, M., Hester, T., Scholz, J., Wang, F., Pietquin, O., Piot, B., Heess, N., Rothörl, T., Lampe, T., Riedmiller, M.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817 (2017)

[27] Wen, B., Lian, W., Bekris, K., Schaal, S.: You only demonstrate once: Category-level manipulation from single visual demonstration (2022), https://arxiv.org/abs/2201.12716

[28] Wu, P., Shentu, Y., Liao, Q., Jin, D., Guo, M., Sreenath, K., Lin, X., Abbeel, P.: Robocopilot: Human-in-the-loop interactive imitation learning for robot manipulation (2025), https://arxiv.org/abs/2503.07771

[29] Xu, X., Hou, Y., Xin, C., Liu, Z., Song, S.: Compliant residual dagger: Improving real-world contact-rich manipulation with human corrections (2025), https://arxiv.org/abs/2506.16685

Appendix A Learning Details

Our training framework is based on SERL [14]. Following SERL, we maintain both a demo buffer and a replay buffer for data storage. The demo bu...
