RARM: Confidence-Gated Progress Reward Modeling for RL in Manipulation

Fan Shi; Kehan Wen; Minghao Fu; Pengyu Jing; Pengzhi Yang; Xin Liu; Xinyu Wang; Yaheng Shen; Yiduo Qu; Zhenhao Huang

arxiv: 2606.22027 · v2 · pith:T3O5A7XNnew · submitted 2026-06-20 · 💻 cs.RO · cs.AI

Pith reviewed 2026-06-26 12:16 UTC · model grok-4.3

keywords: robot manipulation, reinforcement learning, reward modeling, progress estimation, visual comparator, dense rewards, long-horizon tasks

The pith

A general-video comparator turns one demonstration into a dense, gated reward that improves RL success on robot manipulation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RARM to address the reward design bottleneck in reinforcement learning for robot manipulation. It trains a lightweight visual model once on general-purpose videos using a contrastive temporal objective. At deployment, it anchors to a single successful demonstration and rewards only confident forward progress. This avoids the need for task-specific labels or reward engineering while suppressing false-positive rewards from visually plausible but incorrect states. The paper reports the best overall success rates across 9 simulated tasks from LIBERO and MetaWorld and 4 real-world tasks, with the largest improvements on long-horizon problems.

Core claim

RARM is a reference-anchored visual comparator trained on general videos that converts one demonstration into a progress-aware reward by matching rollout clips to reference clips and applying confidence gating to suppress uncertain or false-positive matches, resulting in higher RL success rates without robot-specific training data or per-task reward engineering.

What carries the argument

The Reference-Anchored Reward Model (RARM), a lightweight visual comparator that matches rollout clips against a reference demonstration and issues rewards only for confident forward progress.
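The gating idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the softmax-based confidence, the threshold `tau`, and the reward scaling are hypothetical and are not taken from the paper. It assumes a comparator has already produced one score per (rollout clip, reference clip) pair.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_progress_rewards(score_rows, tau=0.6, temperature=1.0):
    """Hypothetical confidence-gated progress reward.

    score_rows[t][k] is an assumed comparator score between rollout
    clip t and reference clip k (needs at least two reference clips).
    A reward is paid only when the best match is confident (>= tau)
    AND lies further along the reference than anything confirmed so far.
    """
    num_ref = len(score_rows[0])
    best = 0  # furthest reference index confirmed so far
    rewards = []
    for scores in score_rows:
        probs = softmax(scores, temperature)
        k = max(range(num_ref), key=probs.__getitem__)  # matched index
        if probs[k] >= tau and k > best:
            rewards.append((k - best) / (num_ref - 1))
            best = k
        else:
            rewards.append(0.0)  # uncertain or non-forward match
    return rewards


# Five rollout clips scored against four reference clips.
rows = [
    [5, 0, 0, 0],  # confident match to the start: no progress yet
    [0, 5, 0, 0],  # confident forward step
    [1, 1, 1, 1],  # ambiguous match: gated out
    [0, 0, 0, 5],  # confident jump to the final reference clip
    [0, 5, 0, 0],  # confident but backward: no reward
]
rewards = gated_progress_rewards(rows)
```

In this toy run only the second and fourth clips earn reward, and the rewards sum to 1 once the final reference clip is confirmed. Whether the paper's gate uses a softmax confidence or a monotone progress pointer of this kind is not stated in the material reproduced here.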

If this is right

RL agents reach higher success rates on both simulated and real manipulation tasks when using RARM rewards.
Gains are largest on long-horizon tasks where false-positive rewards are especially damaging.
Beyond a single successful reference demonstration per task, no task-specific demonstrations or progress labels are required.
The same pretrained model works across multiple tasks without per-task reward redesign.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may reduce the engineering effort needed to apply RL to new manipulation problems if the general-video pretraining transfers reliably.
It suggests that temporal contrastive objectives on broad video corpora can serve as a foundation for progress estimation in other sequential decision domains.
If the confidence gate proves robust, similar gating mechanisms could be added to other learned reward models to limit over-optimism.

Load-bearing premise

A model trained only on general-purpose videos can produce reliable progress estimates for robot manipulation without any robot data or task labels, and its confidence gate will reliably block false-positive rewards.

What would settle it

Running the same RL training loop with RARM rewards disabled on a long-horizon task such as cloth folding and measuring whether success rates drop to the level of sparse-reward baselines would test the claim.

Figures

Figures reproduced from arXiv: 2606.22027 by Fan Shi, Kehan Wen, Minghao Fu, Pengyu Jing, Pengzhi Yang, Xin Liu, Xinyu Wang, Yaheng Shen, Yiduo Qu, Zhenhao Huang.

**Figure 1.** Motivation. Existing reward models often overestimate failed states and provide unreliable progress signals during failures. RARM addresses these failure modes with a confidence gate, rewarding only confident forward progress along the reference and suppressing uncertain matches. Existing methods rarely achieve both. VLM-based reward models [12, 13, 14, 15] carry broad semantic priors but are costly to q…

**Figure 2.** Method Overview. Reward Model Training: As shown in (c), we sample an anchor clip from an unlabeled video, positives from the same temporal region, and negatives from three complementary sources. These clip pairs are scored by the comparator in (a), which is trained with the soft-nearest-neighbours loss in Eq. (1). RL Training: Given a reference video and a rollout video in (b), we compare each rollout cli…

**Figure 3.** Success rate over 1M environment steps across 4 MetaWorld and 5 LIBERO-10 tasks.

**Figure 4.** Ablation. Success rate over 1M steps for four component ablations on one LIBERO and one MetaWorld task, averaged over the same five seeds.

**Figure 5.** Progress prediction on cloth folding tasks. Cumulative predicted reward on a paired success/failure cloth-folding rollout, normalized per model by its own success-rollout final reward. Our model (red) tracks the linear oracle on success and saturates at 0.50 of its success total on failure, while all four baselines assign more than 80% of their success reward to failure (GVL 0.97, Robometer 0.88, RoboDopam…
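The normalization described in the Figure 5 caption (each model's cumulative reward divided by that model's own success-rollout total) can be sketched as follows. The function names and the toy reward sequences are hypothetical; they only illustrate how a failure ratio such as 0.50 would be computed, and are not the paper's data.

```python
def normalized_cumulative(rewards, success_total):
    """Running sum of per-step rewards, divided by the success-rollout total."""
    if success_total <= 0:
        raise ValueError("success_total must be positive")
    curve, running = [], 0.0
    for r in rewards:
        running += r
        curve.append(running / success_total)
    return curve


def failure_ratio(success_rewards, failure_rewards):
    """Share of the success-rollout reward that the failed rollout also collects."""
    success_total = sum(success_rewards)
    return normalized_cumulative(failure_rewards, success_total)[-1]


# Toy sequences (illustrative only): the failed rollout stalls halfway.
success = [0.25, 0.25, 0.25, 0.25]
failure = [0.25, 0.25, 0.0, 0.0]

success_curve = normalized_cumulative(success, sum(success))
ratio = failure_ratio(success, failure)
```

A ratio near 1 means the reward model barely distinguishes the failed rollout from the successful one; a lower ratio means less reward leaks to the failure.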

**Figure 6.** Task 1: Pick Eraser from Box. Filmstrip of one successful rollout sampled at uniform intervals; action-phrase labels appear under each frame. Task description. A single-arm short-horizon pick-and-place task. The robot must reach into a green storage box, grasp an orange-and-yellow blackboard eraser whose pose inside the box is randomised across episodes, lift it cleanly past the box rim, and release it o…

**Figure 7.** Task 2: Open Red Drawer. Filmstrip of one successful rollout sampled at uniform intervals; action-phrase labels appear under each frame. Task description. A single-arm short-horizon articulated-object manipulation task. The robot must localise the blue plastic handle on the red drawer (mounted on the drawer face that points toward the robot, hence not visible in the top-down filmstrip view above), close t…

**Figure 8.** Task 3: Bimanual Hand-over. Filmstrip of one successful rollout sampled at uniform intervals; action-phrase labels appear under each frame. Task description. A dual-arm long-horizon coordination task. The left arm picks the off-white container off the tabletop, raises it to a mid-air hand-over pose, and must hold steady while the right arm reaches in and closes around the same object; only after the right…

**Figure 9.** Task 4: Bimanual Clothes Folding. Filmstrip of one successful rollout sampled at uniform intervals; action-phrase labels appear under each frame. B.4 Task 4: Bimanual Clothes Folding (Dual-Arm, Long-Horizon, Multi-Stage). Task description. A dual-arm long-horizon multi-stage cloth-manipulation task on a flat orange shirt. Stage 1: both arms simultaneously pinch one side of the shirt and fold it across to t…

**Figure 10.** Generated reference demonstration on the MetaWorld pull task. (a) Learning curves comparing RARM using a real reference with a different object instance, and synthetic references generated by SeeDance 2.0. (b) Example frames from two synthetic trajectories generated by SeeDance 2.0 from an initial scene image and task instruction. RARM requires a single reference demonstration for each new task configur…

**Figure 11.** RARM Robustness: RARM progress estimates on augmented rollouts compared with a fixed reference demonstration. rollout trajectory, we digitally applied augmentations to its frames and then used RARM to re-estimate the progress of the altered rollout against a different, but unaltered, reference demonstration. We evaluated four augmentations, each targeting a distinct real-world failure mode: • Augmen…

Original abstract

Reinforcement learning for robot manipulation is often bottlenecked by reward design, especially in long-horizon tasks: sparse success rewards provide weak supervision, while hand-crafted dense rewards are tedious to design and generalize poorly across tasks. Progress-based reward models offer a promising alternative by estimating how far an observation has advanced toward task completion, but existing approaches often require task-specific demonstrations or progress labels, and can assign high rewards to visually plausible but physically incorrect states. We introduce the Reference-Anchored Reward Model (RARM), a lightweight visual comparator that converts a single successful demonstration into a dense, progress-aware reward. RARM is trained once on general-purpose videos with a contrastive temporal objective, requiring no robot-specific data, task-specific reward labels, or per-task reward engineering. At deployment, RARM matches rollout clips to reference clips and rewards only confident forward progress, suppressing uncertain matches that may otherwise produce false-positive rewards. Across 9 simulated manipulation tasks from LIBERO and MetaWorld and 4 real-world tasks, RARM achieves the best overall success rates in subsequent RL training, with particularly large gains on long-horizon tasks such as cloth folding, where unreliable progress estimates are especially harmful.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RARM's general-video contrastive model with reference matching and confidence gating is a plausible shot at dense rewards for robot RL, but the domain gap and lack of supporting numbers leave the main claims unconvincing.

The paper's main contribution is a reward model trained once on general videos with a contrastive temporal loss, then used at deployment by matching short rollout clips against a single successful demo and only crediting confident forward progress. That specific pairing of reference-anchored comparison plus explicit gating to suppress uncertain matches is presented as new and does not obviously reduce to prior published equations.

It handles the sparse-reward problem in long-horizon manipulation without robot-specific data or per-task labels, which is a practical direction worth testing. The motivation for gating—to avoid false positives on visually plausible but physically wrong states—is stated clearly.

The soft spots are the ones flagged in the stress test. The central assumption is that an embedding learned on ordinary videos will produce reliable progress signals on robot observations despite differences in camera, lighting, and dynamics, yet the abstract gives no measurement of how often robot states fall into the low-confidence regime or whether gating actually removes invalid rewards. No quantitative results, baselines, ablations, or statistical details appear, so it is impossible to judge whether the reported gains on LIBERO, MetaWorld, and real cloth-folding tasks are robust or attributable to the method. The weakest link remains the claim that no domain adaptation or robot data is needed.

This is for people working on reward design inside robotic RL. A reader looking for a new dense-reward construction might find the setup useful to try, but only after seeing the full experiments.

I would send it to peer review so the empirical claims can be checked, but I would not cite it on the basis of the abstract alone.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Reference-Anchored Reward Model (RARM), a lightweight visual comparator trained once on general-purpose videos via a contrastive temporal objective. It converts a single successful demonstration into a dense progress-aware reward for RL by matching rollout clips to reference clips and applying confidence gating to reward only confident forward progress, suppressing uncertain matches. No robot-specific data, task-specific labels, or per-task engineering is required. The central empirical claim is that RARM yields the best overall success rates in subsequent RL training across 9 simulated manipulation tasks (LIBERO, MetaWorld) and 4 real-world tasks, with particularly large gains on long-horizon tasks such as cloth folding.

Significance. If the generalization claim holds, RARM would offer a practical way to obtain dense, task-agnostic rewards from a single demo and general video pretraining, reducing the reward-design bottleneck in long-horizon robot manipulation RL. The approach would demonstrate that contrastive temporal embeddings can transfer progress signals across visual domains without adaptation, which is a non-trivial result if supported by rigorous evidence of domain-gap handling and gating efficacy.

major comments (2)

[Abstract (and Experiments section)] The central claim that a model trained solely on general videos produces reliable progress estimates for robot states (without domain adaptation or robot data) is load-bearing for all reported gains. No quantitative evidence is supplied on how frequently robot observations fall into the low-confidence regime or whether gating removes false-positive rewards on physically invalid but visually similar states (e.g., incorrectly folded cloth). Without such measurements, the attribution of success-rate improvements on long-horizon tasks to the method remains unverified.
[Abstract] The abstract asserts superior empirical results on 9 simulated + 4 real tasks but supplies no quantitative details, baselines, statistical tests, or ablation evidence. This absence prevents assessment of whether the data support the generalization claim or whether gains could arise from other factors (e.g., reward scaling or RL hyperparameters).
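The measurements requested in the first major comment could be gathered along the following lines. This is a minimal sketch under stated assumptions: per-clip confidences and ungated rewards are assumed to be logged during rollouts, and `is_valid` is a hypothetical hand-made label marking whether each clip shows a physically valid state. None of these names or values come from the paper.

```python
def gating_diagnostics(confidences, raw_rewards, is_valid, tau=0.6):
    """Summarise how often the gate fires and how much invalid reward it removes.

    confidences : assumed per-clip match confidences in [0, 1]
    raw_rewards : assumed per-clip rewards before gating
    is_valid    : hypothetical manual labels (True = physically valid state)
    tau         : assumed confidence threshold
    """
    n = len(confidences)
    if n == 0 or not (n == len(raw_rewards) == len(is_valid)):
        raise ValueError("inputs must be non-empty and of equal length")

    # Reward after gating: zero out anything below the threshold.
    gated = [r if c >= tau else 0.0 for c, r in zip(confidences, raw_rewards)]

    low_conf_fraction = sum(1 for c in confidences if c < tau) / n
    fp_before = sum(r for r, ok in zip(raw_rewards, is_valid) if not ok)
    fp_after = sum(r for r, ok in zip(gated, is_valid) if not ok)
    return {
        "low_conf_fraction": low_conf_fraction,
        "fp_reward_before": fp_before,
        "fp_reward_after": fp_after,
    }


# Toy log (illustrative only): two valid clips, two invalid ones.
stats = gating_diagnostics(
    confidences=[0.9, 0.3, 0.8, 0.2],
    raw_rewards=[0.5, 0.4, 0.5, 0.6],
    is_valid=[True, False, True, False],
)
```

Reporting `low_conf_fraction` separately for robot rollouts and for held-out general videos would speak to the domain-gap concern, while the before/after false-positive totals would speak to gating efficacy.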

minor comments (1)

[Method] Notation for the confidence threshold and the exact form of the contrastive loss should be defined explicitly with equations rather than described only in prose.
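For orientation, explicit notation of the kind requested here might look like the sketch below. It gives a generic soft-nearest-neighbour loss and one possible gated reward. All symbols are illustrative assumptions; this is not the paper's Eq. (1) or its gate definition, neither of which is reproduced in this page.

```latex
% Generic soft-nearest-neighbour loss for an anchor clip a (illustrative notation).
% s(a, j): comparator score for the pair (a, j); P(a): positives; N(a): negatives;
% T > 0: temperature.
\mathcal{L}(a) = -\log
  \frac{\sum_{p \in P(a)} \exp\!\big(s(a,p)/T\big)}
       {\sum_{j \in P(a) \cup N(a)} \exp\!\big(s(a,j)/T\big)}

% One possible confidence gate with threshold \tau (an assumption, not the paper's definition).
% c_t: confidence of the match for rollout clip t; k_t: matched reference index;
% k^{*}_{t-1}: furthest reference index confirmed before step t; K: number of reference clips.
r_t = \mathbb{1}\big[\,c_t \ge \tau\,\big] \cdot \frac{\max\!\big(0,\; k_t - k^{*}_{t-1}\big)}{K - 1}
```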

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the empirical support for RARM's generalization claims. We address each major comment below.

Referee: [Abstract (and Experiments section)] The central claim that a model trained solely on general videos produces reliable progress estimates for robot states (without domain adaptation or robot data) is load-bearing for all reported gains. No quantitative evidence is supplied on how frequently robot observations fall into the low-confidence regime or whether gating removes false-positive rewards on physically invalid but visually similar states (e.g., incorrectly folded cloth). Without such measurements, the attribution of success-rate improvements on long-horizon tasks to the method remains unverified.

Authors: We agree that the manuscript lacks explicit quantitative measurements of low-confidence frequency on robot observations and direct verification that gating suppresses false positives on invalid states. The reported RL success rates, especially the large gains on long-horizon tasks, provide indirect support, but do not substitute for the requested analysis. In revision we will add a new subsection with confidence histograms comparing robot rollouts to general videos and qualitative examples of gating on physically invalid but visually similar states. revision: yes
Referee: [Abstract] The abstract asserts superior empirical results on 9 simulated + 4 real tasks but supplies no quantitative details, baselines, statistical tests, or ablation evidence. This absence prevents assessment of whether the data support the generalization claim or whether gains could arise from other factors (e.g., reward scaling or RL hyperparameters).

Authors: Abstracts are length-constrained and intended to convey the high-level contribution; the full quantitative results (per-task success rates, baseline comparisons, statistical tests, and gating ablations) appear in Section 4 with tables and figures. We will expand the abstract with one sentence summarizing the magnitude of gains if space allows. revision: partial

Circularity Check

0 steps flagged

No circularity detected; derivation is self-contained

The paper defines RARM via a contrastive temporal objective trained once on external general-purpose videos, with no robot data or task labels required. Deployment uses reference-clip matching plus confidence gating on rollout observations. No equations or claims reduce the progress reward or success-rate gains to a fitted quantity defined by the same inputs, nor to self-citations whose load-bearing premise is unverified. The central generalization claim is presented as an empirical result rather than a definitional identity, satisfying the criteria for a self-contained derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the transferability of contrastive video representations to robot states and on the effectiveness of confidence gating, neither of which is independently verified in the abstract.

axioms (1)

Domain assumption: A contrastive temporal objective on general videos learns representations that capture task progress and transfer to robot manipulation.
Invoked to justify training without robot data or labels.

invented entities (1)

RARM visual comparator (no independent evidence)
Purpose: to generate dense progress rewards from a single reference demonstration.
New model introduced by the paper; no independent evidence supplied in abstract.

Reference graph

Works this paper leans on

61 extracted references · 13 linked inside Pith

[1]

O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, T. Kreiman, C. Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

Pith/arXiv arXiv 2024
[2]

Zitkovich, T

B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023

2023
[3]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025
[4]

Black, N

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, et al. Pi 0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[5]

Intelligence, A

P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, et al. Pi 0.6: a vla that learns from experience.arXiv preprint arXiv:2511.14759, 2025

Pith/arXiv arXiv 2025
[6]

S. Ross, G. Gordon, and D. Bagnell. A reduction of imitation learning and structured predic- tion to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Pro- ceedings, 2011

2011
[7]

De Haan, D

P. De Haan, D. Jayaraman, and S. Levine. Causal confusion in imitation learning.Advances in neural information processing systems, 32, 2019

2019
[8]

Codevilla, E

F. Codevilla, E. Santana, A. M. L ´opez, and A. Gaidon. Exploring the limitations of behavior cloning for autonomous driving. InProceedings of the IEEE/CVF international conference on computer vision, pages 9329–9338, 2019

2019
[9]

R. Tian, Y . Wu, and A. Bacjsy. Position: Good embodied reward models need bad behavior data, 2026

2026
[10]

T. W. Ayalew, X. Zhang, K. Y . Wu, T. Jiang, M. Maire, and M. R. Walter. Progressor: A perceptually guided reward estimator with self-supervised online refinement. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10297–10306, 2025

2025
[11]

J. Leng, C. Huang, B. Zhu, and J. Huang. Taming overconfidence in llms: Reward calibration in rlhf. InInternational Conference on Learning Representations, volume 2025, pages 16484– 16517, 2025

2025
[12]

Y . J. Ma, J. Hejna, C. Fu, D. Shah, J. Liang, Z. Xu, S. Kirmani, P. Xu, D. Driess, T. Xiao, et al. Vision language models are in-context value learners. InThe Thirteenth International Conference on Learning Representations, 2024

2024
[13]

T. Lee, A. Wagenmaker, K. Pertsch, P. Liang, S. Levine, and C. Finn. Roboreward: General- purpose vision-language reward models for robotics.arXiv preprint arXiv:2601.00675, 2026

arXiv 2026
[14]

Liang, Y

A. Liang, Y . Korkmaz, J. Zhang, M. Hwang, A. Anwar, S. Kaushik, A. Shah, A. S. Huang, L. Zettlemoyer, D. Fox, et al. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons.arXiv preprint arXiv:2603.02115, 2026

Pith/arXiv arXiv 2026
[15]

Zhang, Y

J. Zhang, Y . Luo, A. Anwar, S. A. Sontakke, J. J. Lim, J. Thomason, E. Biyik, and J. Zhang. Rewind: Language-guided rewards teach robot policies without new demonstrations.arXiv preprint arXiv:2505.10911, 2025. 10

arXiv 2025
[16]

Q. Chen, J. Yu, M. Schwager, P. Abbeel, Y . Shentu, and P. Wu. Sarm: Stage-aware reward modeling for long horizon robot manipulation.arXiv preprint arXiv:2509.25358, 2025

Pith/arXiv arXiv 2025
[17]

Y . Liu, C. Wen, Y . Hu, D. Jayaraman, and Y . Gao. Timerewarder: Learning dense reward from passive videos via frame-wise temporal distance.arXiv preprint arXiv:2509.26627, 2025

Pith/arXiv arXiv 2025
[18]

D. Yang, D. Tjia, J. Berg, D. Damen, P. Agrawal, and A. Gupta. Rank2reward: Learning shaped reward functions from passive video. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 2806–2813. IEEE, 2024

2024
[19]

Escontrela, A

A. Escontrela, A. Adeniji, W. Yan, A. Jain, X. B. Peng, K. Goldberg, Y . Lee, D. Hafner, and P. Abbeel. Video prediction models as rewards for reinforcement learning.Advances in Neural Information Processing Systems, 36:68760–68783, 2023

2023
[20]

Huang, G

T. Huang, G. Jiang, Y . Ze, and H. Xu. Diffusion reward: Learning rewards via conditional video diffusion. InEuropean Conference on Computer Vision, pages 478–495. Springer, 2024

2024
[21]

B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowl- edge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

2023
[22]

T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. InConference on robot learning, pages 1094–1100. PMLR, 2020

2020
[23]

Y . J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V . Kumar, and A. Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training.arXiv preprint arXiv:2210.00030, 2022

Pith/arXiv arXiv 2022
[24]

Y . J. Ma, V . Kumar, A. Zhang, O. Bastani, and D. Jayaraman. Liv: Language-image repre- sentations and rewards for robotic control. InInternational Conference on Machine Learning, pages 23301–23320. PMLR, 2023

2023
[25]

Zhang, Y

Z. Zhang, Y . Li, O. Bastani, A. Gupta, D. Jayaraman, Y . J. Ma, and L. Weihs. Universal visual decomposer: Long-horizon manipulation made easy. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6973–6980. IEEE, 2024

2024
[26]

C. Kim, M. Heo, D. Lee, J. Shin, H. Lee, J. J. Lim, and K. Lee. Subtask-aware visual reward learning from segmented demonstrations.arXiv preprint arXiv:2502.20630, 2025

arXiv 2025
[27]

Y . Yang, M. Chen, Q. Qiu, J. Wu, W. Wang, B. Lin, Z. Guan, and X. He. Adapt2reward: Adapt- ing video-language models to generalizable robotic rewards via failure prompts. InEuropean Conference on Computer Vision, pages 163–180. Springer, 2024

2024
[28]

A. K. Jain, V . Mohta, S. Kim, A. Bhardwaj, J. Ren, Y . Feng, S. Choudhury, and G. Swamy. A smooth sea never made a skilled sailor: Robust imitation via learning to search. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[29]

S. Zhai, Q. Zhang, T. Zhang, F. Huang, H. Zhang, M. Zhou, S. Zhang, L. Liu, S. Lin, and J. Pang. A vision-language-action-critic model for robotic real-world reinforcement learning. arXiv preprint arXiv:2509.15937, 2025

arXiv 2025
[30]

H. Tan, S. Chen, Y . Xu, Z. Wang, Y . Ji, C. Chi, Y . Lyu, Z. Zhao, X. Chen, P. Co, et al. Robo- dopamine: General process reward modeling for high-precision robotic manipulation.arXiv preprint arXiv:2512.23703, 2025

arXiv 2025
[31]

S. Chen, C. Harrison, Y .-C. Lee, A. J. Yang, Z. Ren, L. J. Ratliff, J. Duan, D. Fox, and R. Kr- ishna. Topreward: Token probabilities as hidden zero-shot rewards for robotics.arXiv preprint arXiv:2602.19313, 2026. 11

arXiv 2026
[32]

Kumar, J

S. Kumar, J. Zamora, N. Hansen, R. Jangir, and X. Wang. Graph inverse reinforcement learning from diverse videos. InConference on Robot Learning, pages 55–66. PMLR, 2023

2023
[33]

Y . Fu, H. Zhang, D. Wu, W. Xu, and B. Boulet. Robot policy learning with temporal optimal transport reward.Advances in Neural Information Processing Systems, 37:122078–122103, 2024

2024
[34]

J. Shi, J. Smith, J. Qian, and D. Jayaraman. Points2reward: Robotic manipulation rewards from just one video
[35]

Guzey, Y

I. Guzey, Y . Dai, G. Savva, R. Bhirangi, and L. Pinto. Bridging the human to robot dexterity gap through object-oriented rewards. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3344–3351. IEEE, 2025

2025
[36]

K. Yu, S. Zhang, H. Soora, F. Huang, H. Huang, P. Tokekar, and R. Gao. Genflowrl: Shaping rewards with generative object-centric flow in visual reinforcement learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 13183–13192, 2025

2025
[37]

Y . J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y . Zhu, L. Fan, and A. Anandkumar. Eureka: Human-level reward design via coding large language models.arXiv preprint arXiv:2310.12931, 2023

Pith/arXiv arXiv 2023
[38]

S. K. S. Ghasemipour, A. Wahid, J. Tompson, P. Sanketi, and I. Mordatch. Self-improving embodied foundation models.arXiv preprint arXiv:2509.15155, 2025

arXiv 2025
[39]

L. Gao, J. Schulman, and J. Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866. PMLR, 2023

2023
[40]

W. Shen, R. Zheng, W. Zhan, J. Zhao, S. Dou, T. Gui, Q. Zhang, and X.-J. Huang. Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 2859–2873, 2023

2023
[41]

Riedmiller, R

M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Wiele, V . Mnih, N. Heess, and J. T. Springenberg. Learning by playing solving sparse reward tasks from scratch. In International conference on machine learning, pages 4344–4353. PMLR, 2018

2018
[42]

Tobin, R

J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017

2017
[43]

O. M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020

2020
[44]

Kalashnikov, A

D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V . Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. InConference on robot learning, pages 651–673. PMLR, 2018

2018
[45]

Rajeswaran, V

A. Rajeswaran, V . Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstra- tions.arXiv preprint arXiv:1709.10087, 2017

Pith/arXiv arXiv 2017
[46]

J. Luo, Z. Hu, C. Xu, Y . L. Tan, J. Berg, A. Sharma, S. Schaal, C. Finn, A. Gupta, and S. Levine. Serl: A software suite for sample-efficient robotic reinforcement learning. In2024 IEEE Inter- national Conference on Robotics and Automation (ICRA), pages 16961–16969. IEEE, 2024

2024
[47]

Ho and S

J. Ho and S. Ermon. Generative adversarial imitation learning.Advances in neural information processing systems, 29, 2016. 12

2016
[48]

C. Finn, S. Levine, and P. Abbeel. Guided cost learning: Deep inverse optimal control via policy optimization. InInternational conference on machine learning, pages 49–58. PMLR, 2016

2016
[49]

Andrychowicz, F

M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. To- bin, O. Pieter Abbeel, and W. Zaremba. Hindsight experience replay.Advances in neural information processing systems, 30, 2017

2017
[50]

Pathak, P

D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self- supervised prediction. InInternational conference on machine learning, pages 2778–2787. PMLR, 2017

2017
[51]

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

Pith/arXiv arXiv 2023
[52]

Salakhutdinov and G

R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neigh- bourhood structure. InArtificial intelligence and statistics, pages 412–419. PMLR, 2007

2007
[53]

Sim ´eoni, H

O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

Pith/arXiv arXiv 2025
[54]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models.Iclr, 1(2):3, 2022

2022
[55]

J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

2024
[56]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polo- sukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

2017
[57]

Yarats, R

D. Yarats, R. Fergus, A. Lazaric, and L. Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning.arXiv preprint arXiv:2107.09645, 2021

arXiv 2021
[58]

Wagenmaker, M

A. Wagenmaker, M. Nakamoto, Y . Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning.arXiv preprint arXiv:2506.15799, 2025

Pith/arXiv arXiv 2025
[59]

Sontakke, J

S. Sontakke, J. Zhang, S. Arnold, K. Pertsch, E. Bıyık, D. Sadigh, C. Finn, and L. Itti. Robo- clip: One demonstration is enough to learn robot policies.Advances in Neural Information Processing Systems, 36:55681–55693, 2023

2023
[60]

Caelles, J

S. Caelles, J. Pont-Tuset, F. Perazzi, A. Montes, K.-K. Maninis, and L. Van Gool. The 2019 davis challenge on vos: Unsupervised multi-object segmentation.arXiv:1905.00737, 2019

[61]

H. Ding, C. Liu, S. He, X. Jiang, P. H. Torr, and S. Bai. MOSE: A new dataset for video object segmentation in complex scenes. In ICCV, 2023.

A Simulation Environments

Table 2: Simulation tasks used in our evaluation. The image column is reserved for task visualizations.

Columns: Simulation Environment, Task, Image, Description.

MetaWorld (MT50) Task 7 Bypass a ...

