
arxiv: 2606.22027 · v2 · pith:T3O5A7XNnew · submitted 2026-06-20 · 💻 cs.RO · cs.AI

RARM: Confidence-Gated Progress Reward Modeling for RL in Manipulation

Pith reviewed 2026-06-26 12:16 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords robot manipulation · reinforcement learning · reward modeling · progress estimation · visual comparator · dense rewards · long-horizon tasks

The pith

A general-video comparator turns one demonstration into a dense, gated reward that improves RL success on robot manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RARM to address the reward design bottleneck in reinforcement learning for robot manipulation. It trains a lightweight visual model once on general-purpose videos using a contrastive temporal objective, then at test time anchors to a single successful demonstration and only rewards confident forward progress. This avoids the need for task-specific labels or engineering while suppressing false-positive rewards from visually plausible but incorrect states. Experiments show it yields the highest success rates across simulated and real tasks, with the largest improvements on long-horizon problems.

Core claim

RARM is a reference-anchored visual comparator trained on general videos that converts one demonstration into a progress-aware reward by matching rollout clips to reference clips and applying confidence gating to suppress uncertain or false-positive matches, resulting in higher RL success rates without robot-specific training data or per-task reward engineering.

What carries the argument

The Reference-Anchored Reward Model (RARM), a lightweight visual comparator that matches rollout clips against a reference demonstration and issues rewards only for confident forward progress.
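
A minimal sketch of how such a gate could work, assuming the comparator yields a score matrix between rollout clips and reference clips. The function name, the softmax-based confidence, the temperature `tau`, and the threshold `conf_threshold` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gated_progress_rewards(sim, tau=0.1, conf_threshold=0.5):
    """Hypothetical confidence-gated progress reward.

    sim[t, k] is an assumed comparator score between rollout clip t and
    reference clip k (higher means more similar). Returns one reward per
    rollout clip. All hyperparameters are illustrative.
    """
    sim = np.asarray(sim, dtype=float)
    n_ref = sim.shape[1]
    # Softmax over reference clips turns scores into a match distribution.
    logits = sim / tau
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)

    rewards = np.zeros(sim.shape[0])
    best_progress = 0.0  # furthest confidently matched point so far, in [0, 1]
    for t in range(sim.shape[0]):
        k = int(probs[t].argmax())
        confidence = probs[t, k]
        progress = k / (n_ref - 1) if n_ref > 1 else 1.0
        # Gate: pay only for confident matches that move forward.
        if confidence >= conf_threshold and progress > best_progress:
            rewards[t] = progress - best_progress
            best_progress = progress
    return rewards
```

In this sketch a reward is paid only when a confident match moves past the furthest point reached so far, so the total reward over a rollout cannot exceed 1 and an ambiguous clip earns nothing.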

If this is right

  • RL agents reach higher success rates on both simulated and real manipulation tasks when using RARM rewards.
  • Gains are largest on long-horizon tasks where false-positive rewards are especially damaging.
  • No task-specific demonstrations or progress labels are required beyond a single successful rollout.
  • The same pretrained model works across multiple tasks without per-task reward redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may reduce the engineering effort needed to apply RL to new manipulation problems if the general-video pretraining transfers reliably.
  • It suggests that temporal contrastive objectives on broad video corpora can serve as a foundation for progress estimation in other sequential decision domains.
  • If the confidence gate proves robust, similar gating mechanisms could be added to other learned reward models to limit over-optimism.

Load-bearing premise

A model trained only on general-purpose videos can produce reliable progress estimates for robot manipulation without any robot data or task labels, and its confidence gate will reliably block false-positive rewards.

What would settle it

Running the same RL training loop with RARM rewards disabled on a long-horizon task such as cloth folding and measuring whether success rates drop to the level of sparse-reward baselines would test the claim.

Figures

Figures reproduced from arXiv: 2606.22027 by Fan Shi, Kehan Wen, Minghao Fu, Pengyu Jing, Pengzhi Yang, Xin Liu, Xinyu Wang, Yaheng Shen, Yiduo Qu, Zhenhao Huang.

Figure 1: Motivation. Existing reward models often overestimate failed states and provide unreliable progress signals during failures. RARM addresses these failure modes with a confidence gate, rewarding only confident forward progress along the reference and suppressing uncertain matches. Existing methods rarely achieve both. VLM-based reward models [12, 13, 14, 15] carry broad semantic priors but are costly to q…
Figure 2: Method Overview. Reward Model Training: As shown in (c), we sample an anchor clip from an unlabeled video, positives from the same temporal region, and negatives from three complementary sources. These clip pairs are scored by the comparator in (a), which is trained with the soft-nearest-neighbours loss in Eq. (1). RL Training: Given a reference video and a rollout video in (b), we compare each rollout cli…
Figure 3: Success rate over 1M environment steps across 4 MetaWorld and 5 LIBERO-10 tasks.
Figure 4: Ablation. Success rate over 1M steps for four component ablations on one LIBERO and one MetaWorld task, averaged over the same five seeds.
Figure 5: Progress prediction on cloth folding tasks. Cumulative predicted reward on a paired success/failure cloth-folding rollout, normalized per model by its own success-rollout final reward. Our model (red) tracks the linear oracle on success and saturates at 0.50 of its success total on failure, while all four baselines assign more than 80% of their success reward to failure (GVL 0.97, Robometer 0.88, RoboDopam…
Figure 6: Task 1: Pick Eraser from Box. Filmstrip of one successful rollout sampled at uniform intervals; action-phrase labels appear under each frame. Task description. A single-arm short-horizon pick-and-place task. The robot must reach into a green storage box, grasp an orange-and-yellow blackboard eraser whose pose inside the box is randomised across episodes, lift it cleanly past the box rim, and release it o…
Figure 7: Task 2: Open Red Drawer. Filmstrip of one successful rollout sampled at uniform intervals; action-phrase labels appear under each frame. Task description. A single-arm short-horizon articulated-object manipulation task. The robot must localise the blue plastic handle on the red drawer (mounted on the drawer face that points toward the robot, hence not visible in the top-down filmstrip view above), close t…
Figure 8: Task 3: Bimanual Hand-over. Filmstrip of one successful rollout sampled at uniform intervals; action-phrase labels appear under each frame. Task description. A dual-arm long-horizon coordination task. The left arm picks the off-white container off the tabletop, raises it to a mid-air hand-over pose, and must hold steady while the right arm reaches in and closes around the same object; only after the right…
Figure 9: Task 4: Bimanual Clothes Folding. Filmstrip of one successful rollout sampled at uniform intervals; action-phrase labels appear under each frame. B.4 Task 4: Bimanual Clothes Folding (Dual-Arm, Long-Horizon, Multi-Stage). Task description. A dual-arm long-horizon multi-stage cloth-manipulation task on a flat orange shirt. Stage 1: both arms simultaneously pinch one side of the shirt and fold it across to t…
Figure 10: Generated reference demonstration on the MetaWorld pull task. (a) Learning curves comparing RARM using a real reference with a different object instance, and synthetic references generated by SeeDance 2.0. (b) Example frames from two synthetic trajectories generated by SeeDance 2.0 from an initial scene image and task instruction. RARM requires a single reference demonstration for each new task configur…
Figure 11: RARM Robustness: RARM progress estimates on augmented rollouts compared with a fixed reference demonstration. … rollout trajectory, we digitally applied augmentations to its frames and then used RARM to re-estimate the progress of the altered rollout against a different, but unaltered, reference demonstration. We evaluated four augmentations, each targeting a distinct real-world failure mode: • Augmen…
Original abstract

Reinforcement learning for robot manipulation is often bottlenecked by reward design, especially in long-horizon tasks: sparse success rewards provide weak supervision, while hand-crafted dense rewards are tedious to design and generalize poorly across tasks. Progress-based reward models offer a promising alternative by estimating how far an observation has advanced toward task completion, but existing approaches often require task-specific demonstrations or progress labels, and can assign high rewards to visually plausible but physically incorrect states. We introduce the Reference-Anchored Reward Model (RARM), a lightweight visual comparator that converts a single successful demonstration into a dense, progress-aware reward. RARM is trained once on general-purpose videos with a contrastive temporal objective, requiring no robot-specific data, task-specific reward labels, or per-task reward engineering. At deployment, RARM matches rollout clips to reference clips and rewards only confident forward progress, suppressing uncertain matches that may otherwise produce false-positive rewards. Across 9 simulated manipulation tasks from LIBERO and MetaWorld and 4 real-world tasks, RARM achieves the best overall success rates in subsequent RL training, with particularly large gains on long-horizon tasks such as cloth folding, where unreliable progress estimates are especially harmful.
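
The Figure 2 caption names a soft-nearest-neighbours loss as the contrastive temporal objective, but Eq. (1) is not reproduced on this page. The following is a hedged sketch of the generic form of such a loss for one anchor clip, written over similarity scores; the function name, the `temperature` value, and the score convention are assumptions, and the paper's exact formulation may differ.

```python
import numpy as np

def soft_nn_loss(pos_scores, neg_scores, temperature=0.1):
    """Generic soft-nearest-neighbours style loss for a single anchor clip.

    pos_scores / neg_scores: assumed comparator scores between the anchor
    and its positive / negative clips (higher means more similar).
    """
    pos = np.asarray(pos_scores, dtype=float) / temperature
    neg = np.asarray(neg_scores, dtype=float) / temperature
    all_logits = np.concatenate([pos, neg])
    m = all_logits.max()  # subtract the max for numerical stability
    log_pos = m + np.log(np.exp(pos - m).sum())
    log_all = m + np.log(np.exp(all_logits - m).sum())
    # Negative log of the probability mass assigned to the positives.
    return float(log_all - log_pos)
```

The loss approaches zero when every positive outscores every negative by a wide margin, and grows as negatives take a larger share of the softmax mass.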

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Reference-Anchored Reward Model (RARM), a lightweight visual comparator trained once on general-purpose videos via a contrastive temporal objective. It converts a single successful demonstration into a dense progress-aware reward for RL by matching rollout clips to reference clips and applying confidence gating to reward only confident forward progress, suppressing uncertain matches. No robot-specific data, task-specific labels, or per-task engineering is required. The central empirical claim is that RARM yields the best overall success rates in subsequent RL training across 9 simulated manipulation tasks (LIBERO, MetaWorld) and 4 real-world tasks, with particularly large gains on long-horizon tasks such as cloth folding.

Significance. If the generalization claim holds, RARM would offer a practical way to obtain dense, task-agnostic rewards from a single demo and general video pretraining, reducing the reward-design bottleneck in long-horizon robot manipulation RL. The approach would demonstrate that contrastive temporal embeddings can transfer progress signals across visual domains without adaptation, which is a non-trivial result if supported by rigorous evidence of domain-gap handling and gating efficacy.

major comments (2)
  1. [Abstract (and Experiments section)] The central claim that a model trained solely on general videos produces reliable progress estimates for robot states (without domain adaptation or robot data) is load-bearing for all reported gains. No quantitative evidence is supplied on how frequently robot observations fall into the low-confidence regime or whether gating removes false-positive rewards on physically invalid but visually similar states (e.g., incorrectly folded cloth). Without such measurements, the attribution of success-rate improvements on long-horizon tasks to the method remains unverified.
  2. [Abstract] The abstract asserts superior empirical results on 9 simulated + 4 real tasks but supplies no quantitative details, baselines, statistical tests, or ablation evidence. This absence prevents assessment of whether the data support the generalization claim or whether gains could arise from other factors (e.g., reward scaling or RL hyperparameters).
minor comments (1)
  1. [Method] Notation for the confidence threshold and the exact form of the contrastive loss should be defined explicitly with equations rather than described only in prose.
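
The measurements requested in major comment 1 could be sketched roughly as follows, assuming per-clip match confidences and per-step rewards have been logged. Both function names and the threshold value are hypothetical; the second function mirrors the per-model normalization described in the Figure 5 caption.

```python
import numpy as np

def gate_rate(confidences, conf_threshold=0.5):
    """Fraction of rollout clips whose match confidence falls below the gate."""
    c = np.asarray(confidences, dtype=float)
    return float((c < conf_threshold).mean())

def failure_reward_ratio(success_rewards, failure_rewards):
    """Cumulative reward on a failed rollout, normalized by the cumulative
    reward of a paired successful rollout scored by the same model."""
    total_success = float(np.sum(success_rewards))
    if total_success <= 0:
        raise ValueError("success rollout must accumulate positive reward")
    return float(np.sum(failure_rewards)) / total_success
```

A ratio near 1 would indicate that a model rewards a failed rollout almost as much as a successful one, while a low gate rate on robot observations would suggest the gate is rarely active.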

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the empirical support for RARM's generalization claims. We address each major comment below.

point-by-point responses
  1. Referee: [Abstract (and Experiments section)] The central claim that a model trained solely on general videos produces reliable progress estimates for robot states (without domain adaptation or robot data) is load-bearing for all reported gains. No quantitative evidence is supplied on how frequently robot observations fall into the low-confidence regime or whether gating removes false-positive rewards on physically invalid but visually similar states (e.g., incorrectly folded cloth). Without such measurements, the attribution of success-rate improvements on long-horizon tasks to the method remains unverified.

    Authors: We agree that the manuscript lacks explicit quantitative measurements of low-confidence frequency on robot observations and direct verification that gating suppresses false positives on invalid states. The reported RL success rates, especially the large gains on long-horizon tasks, provide indirect support, but do not substitute for the requested analysis. In revision we will add a new subsection with confidence histograms comparing robot rollouts to general videos and qualitative examples of gating on physically invalid but visually similar states. revision: yes

  2. Referee: [Abstract] The abstract asserts superior empirical results on 9 simulated + 4 real tasks but supplies no quantitative details, baselines, statistical tests, or ablation evidence. This absence prevents assessment of whether the data support the generalization claim or whether gains could arise from other factors (e.g., reward scaling or RL hyperparameters).

    Authors: Abstracts are length-constrained and intended to convey the high-level contribution; the full quantitative results (per-task success rates, baseline comparisons, statistical tests, and gating ablations) appear in Section 4 with tables and figures. We will expand the abstract with one sentence summarizing the magnitude of gains if space allows. revision: partial

Circularity Check

0 steps flagged

No circularity detected; derivation is self-contained

full rationale

The paper defines RARM via a contrastive temporal objective trained once on external general-purpose videos, with no robot data or task labels required. Deployment uses reference-clip matching plus confidence gating on rollout observations. No equations or claims reduce the progress reward or success-rate gains to a fitted quantity defined by the same inputs, nor to self-citations whose load-bearing premise is unverified. The central generalization claim is presented as an empirical result rather than a definitional identity, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the transferability of contrastive video representations to robot states and on the effectiveness of confidence gating, neither of which is independently verified in the abstract.

axioms (1)
  • domain assumption · A contrastive temporal objective on general videos learns representations that capture task progress transferable to robot manipulation
    Invoked to justify training without robot data or labels.
invented entities (1)
  • RARM visual comparator · no independent evidence
    purpose: To generate dense progress rewards from a single reference demonstration
    New model introduced by the paper; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 5767 in / 1241 out tokens · 20361 ms · 2026-06-26T12:16:09.514791+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 13 linked inside Pith

  1. [1]

    O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024

  2. [2]

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

  3. [3]

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y. Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  4. [4]

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. Pi 0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  5. [5]

    P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, et al. Pi 0.6: a vla that learns from experience. arXiv preprint arXiv:2511.14759, 2025

  6. [6]

    S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

  7. [7]

    P. De Haan, D. Jayaraman, and S. Levine. Causal confusion in imitation learning. Advances in neural information processing systems, 32, 2019

  8. [8]

    F. Codevilla, E. Santana, A. M. López, and A. Gaidon. Exploring the limitations of behavior cloning for autonomous driving. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9329–9338, 2019

  9. [9]

    R. Tian, Y. Wu, and A. Bajcsy. Position: Good embodied reward models need bad behavior data, 2026

  10. [10]

    T. W. Ayalew, X. Zhang, K. Y. Wu, T. Jiang, M. Maire, and M. R. Walter. Progressor: A perceptually guided reward estimator with self-supervised online refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10297–10306, 2025

  11. [11]

    J. Leng, C. Huang, B. Zhu, and J. Huang. Taming overconfidence in llms: Reward calibration in rlhf. In International Conference on Learning Representations, volume 2025, pages 16484–16517, 2025

  12. [12]

    Y. J. Ma, J. Hejna, C. Fu, D. Shah, J. Liang, Z. Xu, S. Kirmani, P. Xu, D. Driess, T. Xiao, et al. Vision language models are in-context value learners. In The Thirteenth International Conference on Learning Representations, 2024

  13. [13]

    T. Lee, A. Wagenmaker, K. Pertsch, P. Liang, S. Levine, and C. Finn. Roboreward: General-purpose vision-language reward models for robotics. arXiv preprint arXiv:2601.00675, 2026

  14. [14]

    A. Liang, Y. Korkmaz, J. Zhang, M. Hwang, A. Anwar, S. Kaushik, A. Shah, A. S. Huang, L. Zettlemoyer, D. Fox, et al. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons. arXiv preprint arXiv:2603.02115, 2026

  15. [15]

    J. Zhang, Y. Luo, A. Anwar, S. A. Sontakke, J. J. Lim, J. Thomason, E. Biyik, and J. Zhang. Rewind: Language-guided rewards teach robot policies without new demonstrations. arXiv preprint arXiv:2505.10911, 2025

  16. [16]

    Q. Chen, J. Yu, M. Schwager, P. Abbeel, Y. Shentu, and P. Wu. Sarm: Stage-aware reward modeling for long horizon robot manipulation. arXiv preprint arXiv:2509.25358, 2025

  17. [17]

    Y. Liu, C. Wen, Y. Hu, D. Jayaraman, and Y. Gao. Timerewarder: Learning dense reward from passive videos via frame-wise temporal distance. arXiv preprint arXiv:2509.26627, 2025

  18. [18]

    D. Yang, D. Tjia, J. Berg, D. Damen, P. Agrawal, and A. Gupta. Rank2reward: Learning shaped reward functions from passive video. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 2806–2813. IEEE, 2024

  19. [19]

    A. Escontrela, A. Adeniji, W. Yan, A. Jain, X. B. Peng, K. Goldberg, Y. Lee, D. Hafner, and P. Abbeel. Video prediction models as rewards for reinforcement learning. Advances in Neural Information Processing Systems, 36:68760–68783, 2023

  20. [20]

    T. Huang, G. Jiang, Y. Ze, and H. Xu. Diffusion reward: Learning rewards via conditional video diffusion. In European Conference on Computer Vision, pages 478–495. Springer, 2024

  21. [21]

    B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  22. [22]

    T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pages 1094–1100. PMLR, 2020

  23. [23]

    Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. arXiv preprint arXiv:2210.00030, 2022

  24. [24]

    Y. J. Ma, V. Kumar, A. Zhang, O. Bastani, and D. Jayaraman. Liv: Language-image representations and rewards for robotic control. In International Conference on Machine Learning, pages 23301–23320. PMLR, 2023

  25. [25]

    Z. Zhang, Y. Li, O. Bastani, A. Gupta, D. Jayaraman, Y. J. Ma, and L. Weihs. Universal visual decomposer: Long-horizon manipulation made easy. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6973–6980. IEEE, 2024

  26. [26]

    C. Kim, M. Heo, D. Lee, J. Shin, H. Lee, J. J. Lim, and K. Lee. Subtask-aware visual reward learning from segmented demonstrations. arXiv preprint arXiv:2502.20630, 2025

  27. [27]

    Y. Yang, M. Chen, Q. Qiu, J. Wu, W. Wang, B. Lin, Z. Guan, and X. He. Adapt2reward: Adapting video-language models to generalizable robotic rewards via failure prompts. In European Conference on Computer Vision, pages 163–180. Springer, 2024

  28. [28]

    A. K. Jain, V. Mohta, S. Kim, A. Bhardwaj, J. Ren, Y. Feng, S. Choudhury, and G. Swamy. A smooth sea never made a skilled sailor: Robust imitation via learning to search. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  29. [29]

    S. Zhai, Q. Zhang, T. Zhang, F. Huang, H. Zhang, M. Zhou, S. Zhang, L. Liu, S. Lin, and J. Pang. A vision-language-action-critic model for robotic real-world reinforcement learning. arXiv preprint arXiv:2509.15937, 2025

  30. [30]

    H. Tan, S. Chen, Y. Xu, Z. Wang, Y. Ji, C. Chi, Y. Lyu, Z. Zhao, X. Chen, P. Co, et al. Robo-dopamine: General process reward modeling for high-precision robotic manipulation. arXiv preprint arXiv:2512.23703, 2025

  31. [31]

    S. Chen, C. Harrison, Y.-C. Lee, A. J. Yang, Z. Ren, L. J. Ratliff, J. Duan, D. Fox, and R. Krishna. Topreward: Token probabilities as hidden zero-shot rewards for robotics. arXiv preprint arXiv:2602.19313, 2026

  32. [32]

    S. Kumar, J. Zamora, N. Hansen, R. Jangir, and X. Wang. Graph inverse reinforcement learning from diverse videos. In Conference on Robot Learning, pages 55–66. PMLR, 2023

  33. [33]

    Y. Fu, H. Zhang, D. Wu, W. Xu, and B. Boulet. Robot policy learning with temporal optimal transport reward. Advances in Neural Information Processing Systems, 37:122078–122103, 2024

  34. [34]

    J. Shi, J. Smith, J. Qian, and D. Jayaraman. Points2reward: Robotic manipulation rewards from just one video

  35. [35]

    I. Guzey, Y. Dai, G. Savva, R. Bhirangi, and L. Pinto. Bridging the human to robot dexterity gap through object-oriented rewards. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3344–3351. IEEE, 2025

  36. [36]

    K. Yu, S. Zhang, H. Soora, F. Huang, H. Huang, P. Tokekar, and R. Gao. Genflowrl: Shaping rewards with generative object-centric flow in visual reinforcement learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13183–13192, 2025

  37. [37]

    Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023

  38. [38]

    S. K. S. Ghasemipour, A. Wahid, J. Tompson, P. Sanketi, and I. Mordatch. Self-improving embodied foundation models. arXiv preprint arXiv:2509.15155, 2025

  39. [39]

    L. Gao, J. Schulman, and J. Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023

  40. [40]

    W. Shen, R. Zheng, W. Zhan, J. Zhao, S. Dou, T. Gui, Q. Zhang, and X.-J. Huang. Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2859–2873, 2023

  41. [41]

    M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Wiele, V. Mnih, N. Heess, and J. T. Springenberg. Learning by playing solving sparse reward tasks from scratch. In International conference on machine learning, pages 4344–4353. PMLR, 2018

  42. [42]

    J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017

  43. [43]

    O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020

  44. [44]

    D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on robot learning, pages 651–673. PMLR, 2018

  45. [45]

    A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017

  46. [46]

    J. Luo, Z. Hu, C. Xu, Y. L. Tan, J. Berg, A. Sharma, S. Schaal, C. Finn, A. Gupta, and S. Levine. Serl: A software suite for sample-efficient robotic reinforcement learning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16961–16969. IEEE, 2024

  47. [47]

    J. Ho and S. Ermon. Generative adversarial imitation learning. Advances in neural information processing systems, 29, 2016

  48. [48]

    C. Finn, S. Levine, and P. Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. In International conference on machine learning, pages 49–58. PMLR, 2016

  49. [49]

    M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. Pieter Abbeel, and W. Zaremba. Hindsight experience replay. Advances in neural information processing systems, 30, 2017

  50. [50]

    D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In International conference on machine learning, pages 2778–2787. PMLR, 2017

  51. [51]

    T. Z. Zhao, V. Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023

  52. [52]

    R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In Artificial intelligence and statistics, pages 412–419. PMLR, 2007

  53. [53]

    O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

  54. [54]

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

  55. [55]

    J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  56. [56]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  57. [57]

    D. Yarats, R. Fergus, A. Lazaric, and L. Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021

  58. [58]

    A. Wagenmaker, M. Nakamoto, Y. Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning. arXiv preprint arXiv:2506.15799, 2025

  59. [59]

    S. Sontakke, J. Zhang, S. Arnold, K. Pertsch, E. Bıyık, D. Sadigh, C. Finn, and L. Itti. Roboclip: One demonstration is enough to learn robot policies. Advances in Neural Information Processing Systems, 36:55681–55693, 2023

  60. [60]

    S. Caelles, J. Pont-Tuset, F. Perazzi, A. Montes, K.-K. Maninis, and L. Van Gool. The 2019 davis challenge on vos: Unsupervised multi-object segmentation. arXiv:1905.00737, 2019

  61. [61]

    H. Ding, C. Liu, S. He, X. Jiang, P. H. Torr, and S. Bai. MOSE: A new dataset for video object segmentation in complex scenes. In ICCV, 2023

A Simulation Environments

Table 2: Simulation tasks used in our evaluation. The image column is reserved for task visualizations. Columns: Simulation Environment, Task, Image, Description. MetaWorld (MT50), Task 7: Bypass a ...