pith. machine review for the scientific record.

arxiv: 2603.02115 · v2 · submitted 2026-03-02 · 💻 cs.RO · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Pith reviewed 2026-05-15 17:54 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.LG
keywords robot reward models · trajectory comparisons · preference supervision · generalizable rewards · failure trajectories · RBM-1M dataset · robot learning

The pith

Robometer trains generalizable robot reward models by combining frame-level progress with inter-trajectory preferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Robometer trains general-purpose robot reward models using a dual objective that pairs local progress labels on expert demonstrations with global ordering constraints from comparing entire trajectories. Prior approaches scale poorly because assigning dense progress labels becomes ambiguous on the abundant suboptimal and failed trajectories found in large robotics datasets. The added preference loss imposes ordering across same-task trajectories, letting the model learn from both successes and failures. Training occurs on the new RBM-1M dataset of over one million trajectories spanning multiple robot embodiments and tasks. The resulting rewards generalize better than earlier methods and raise performance on a range of downstream robot learning benchmarks and real-world applications.

Core claim

Robometer is a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. It is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task. This formulation supports effective learning from both real and augmented failed trajectories on the RBM-1M dataset of over one million trajectories.

What carries the argument

The dual objective of a frame-level progress loss anchored on expert demonstrations plus a trajectory-comparison preference loss that supplies global ordering constraints across same-task trajectories.
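A minimal sketch of how such a dual objective could be combined, under stated assumptions: this page does not give the exact loss forms, so the progress term is assumed here to be a mean squared error and the preference term a Bradley-Terry (logistic) loss on scalar trajectory scores. The Figure 11 caption describes a dedicated preference head over both trajectories, so the scalar-score comparison is a substituted simplification. The function names and the `pref_weight` parameter are hypothetical.

```python
import math

def progress_loss(pred, target):
    """Assumed form: mean squared error between predicted and labeled per-frame progress."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def preference_loss(score_a, score_b, a_preferred):
    """Assumed form: Bradley-Terry loss, -log(sigmoid(margin)), on two trajectory scores."""
    margin = score_a - score_b if a_preferred else score_b - score_a
    # Numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def dual_objective(pred_progress, target_progress, score_a, score_b,
                   a_preferred, pref_weight=1.0):
    """Progress term anchors magnitude; preference term orders same-task trajectories."""
    return (progress_loss(pred_progress, target_progress)
            + pref_weight * preference_loss(score_a, score_b, a_preferred))

# Example: an expert trajectory with progress labels, compared against a failed one.
loss = dual_objective([0.1, 0.5, 0.9], [0.0, 0.5, 1.0],
                      score_a=2.0, score_b=-1.0, a_preferred=True)
print(loss)
```

In this sketch the progress term is only computed on the trajectory that has labels, which mirrors the claim that failed trajectories contribute through the ordering term alone.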

If this is right

  • Reward functions become more generalizable across diverse robot embodiments and tasks.
  • Robot learning performance improves across benchmarks and real-world downstream applications.
  • Large-scale datasets containing many failed and suboptimal trajectories can be used directly for reward learning.
  • Augmented failure trajectories contribute positively to reward quality without dense manual labeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The preference component could allow reward models to be fine-tuned on new tasks with only a small number of comparative labels rather than full demonstrations.
  • Human preference collection on trajectory pairs might be crowdsourced more easily than dense progress annotation, lowering data costs for future datasets.
  • The same dual-loss structure could be tested in simulation-to-real transfer settings where failure data is cheap to generate.
  • Extending the comparison loss to cross-embodiment trajectory pairs could test whether a single reward model can serve multiple robot platforms without retraining.

Load-bearing premise

Inter-trajectory preference supervision from comparisons imposes reliable global ordering constraints even on ambiguous suboptimal and failure trajectories without introducing significant labeling noise or bias.

What would settle it

Collect fresh human preference labels on a held-out set of trajectories from multiple tasks and embodiments; if the learned rewards fail to match the human ordering on a majority of pairs, the central claim is falsified.
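The test above reduces to a pairwise agreement rate. A minimal sketch, assuming each trajectory has been summarized by a single scalar reward (for example its final predicted progress, which is an assumption, not something this page specifies) and that human labels arrive as `(preferred_id, other_id)` pairs. The function name and data layout are hypothetical.

```python
def pairwise_agreement(rewards, human_prefs):
    """Fraction of human-labeled pairs whose order the learned rewards reproduce.

    rewards: dict mapping trajectory id -> scalar reward summary
    human_prefs: list of (preferred_id, other_id) tuples
    Ties in reward count as disagreement.
    """
    if not human_prefs:
        raise ValueError("need at least one labeled pair")
    matches = sum(1 for preferred, other in human_prefs
                  if rewards[preferred] > rewards[other])
    return matches / len(human_prefs)

rewards = {"a": 0.9, "b": 0.4, "c": 0.1}
prefs = [("a", "b"), ("b", "c"), ("c", "a")]
rate = pairwise_agreement(rewards, prefs)
print(rate, "falsified" if rate <= 0.5 else "not falsified")
```

Under the criterion stated above, an agreement rate at or below one half on held-out pairs would count against the central claim.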

Figures

Figures reproduced from arXiv: 2603.02115 by Abhishek Gupta, Abrar Anwar, Aditya Shah, Alex S. Huang, Andreea Bobu, Anqi Li, Anthony Liang, Dieter Fox, Erdem Biyik, Jesse Zhang, Jiahui Zhang, Luke Zettlemoyer, Minyoung Hwang, Sidhant Kaushik, Stephen Tu, Yigit Korkmaz, Yu Xiang.

Figure 1. ROBOMETER Overview. ROBOMETER is trained on RBM-1M, a 1M-trajectory dataset spanning 21 robot embodiments, containing both reward-labeled/expert trajectories and reward-unlabeled, failed trajectories. The model is supervised with a dual objective: predicting frame-level task progress (reward) and learning trajectory-level preferences from pairwise comparisons. To help with downstream RL, it is also trained…
Figure 2. ROBOMETER is a VLM-based reward model that predicts dense, per-frame progress-based rewards and success labels for the first of two video trajectories. To be able to train with failed, non-expert data, we also predict which of the two video trajectories better completes the task. We use three strategies for curating training examples from our given datasets, which are further detailed in Section III-D wit…
Figure 3. Video-Language Reward Confusion Matrix. For each task sampled at random from self-collected, unseen data from RBM-EVAL-OOD, we compute rewards for all combinations of demonstration videos and language descriptions. ROBOMETER produces the most diagonal-heavy confusion matrix, indicating strong alignment between unseen demos and instructions. We also report the column-normalized diagonal mean under each mode…
Figure 4. Qualitative Analysis of Failure, Suboptimal and Successful Trajectories. We visualize the progress predictions for three trajectories of different quality for the same task. Notably, for the suboptimal trajectory, ROBOMETER predicts steadily increasing progress as the robot approaches the pen holder, but sharply reduces its progress estimate when the marker is dropped, correctly reflecting regression in ta…
Figure 5. RL w/ Ablation Models in LIBERO-90 tasks from scratch, corresponding to ablations trained only on LIBERO-10/Object/Goal/Spatial data from Table IV. We report the average success rate ± standard deviation across 5 seeds. H3 Fine-tuning from pre-trained VLMs helps with reward predictions on unseen tasks. Our main analysis is performed via a controlled setting with data from the LIBERO [66] robot manipulation…
Figure 6. Automatic online RL with DSRL on a DROID setup with ROBOMETER improves π0 from 20% to 85% on a single-stage task and 20% to 70% on a two-stage task, outperforming RoboReward’s overall success rate by 2.5×. DSRL with ROBOMETER learns to avoid base π0 errors such as collisions or moving the wrong object. The setup is deemed “automatic” because success detection and stage advancement are handled automatically…
Figure 7. Offline RL results using IQL on a mixture of Noisy and Expert trajectories. ROBOMETER rewards consistently outperform both RoboReward and sparse rewards: 2.4× average success rate improvement over the best baseline for each task. evaluation trials. Additional details and finer-grained results on each experiment can be found in Appendix E. Automatic Online RL. First, we evaluate ROBOMETER in an automated on…
Figure 8. (a): Proportion of task-relevant subtrajectories out of 100 retrieval queries. Our method consistently retrieves a high number of relevant subtrajectories using either the preference or progress objective. (b): Success rates of LoRA-finetuned π0.5 policies using the retrieved trajectories from each method. Small amounts of suboptimal & unrelated data retrieved by other baselines degrade policy-learning per…
Figure 9. Failure Detection Examples. (a): Terminal events such as drops cause a sharp regression in predicted task progress, which ROBOMETER flags shortly after the event. (b): Non-terminal failures correctly exhibit oscillatory progress with ROBOMETER. cases. Irreversible failures such as object drops induce sharp regressions in predicted task progress, which ROBOMETER flags shortly after the event, while non-term…
Figure 10. Pie chart of RBM-1M dataset types. Full table with individual dataset details in Table IX. accuracy. To avoid this issue, progress-only generalist robotic reward models would have to either manually label end states or simply forgo using these data sources. However, due to ROBOMETER’s trajectory comparison-based preference prediction objective, we can still use these noisier datasets for preference predic…
Figure 11. ROBOMETER model architecture. For a given task, our VLM-based model takes in a language description and two trajectories, 1 and 2, separated by split tokens. The VLM output for trajectory 1 is fed into two MLP heads: progress (task completion percent) and success (task completion probability). Finally, the full output is passed into a preference MLP to choose which trajectory best completes the provided task.…
Figure 12. Failure Detection OOD confusion matrices with ternary ground truth and binary prediction. Rows indicate ground-truth execution outcomes (failure, suboptimal, success), while columns indicate binary predictions (predicted failure vs. predicted success). Suboptimal trajectories correspond to executions that make partial progress but do not complete the task. Suboptimal trajectories are treated as failures i…
Figure 13. Irreversible failures. Terminal events such as drops or spills cause a sharp regression in predicted task progress, which our model reliably flags as failures shortly after the event. stalls, oscillates, or terminates execution before completing the task. We additionally highlight semantic failures, where the robot executes a physically plausible behavior that violates the task instruction. All qualitativ…
Figure 14. Semantic failures. The robot executes smooth and physically plausible trajectories but violates the task instruction, resulting in persistently low predicted progress and failure detection without an abrupt terminal event.
Figure 16. Data filtering & retrieval scene configuration from all cameras. Given a task language instruction, we perform retrieval using different reward models as follows. For RoboReward and ROBOMETER-Prog, we first compute per-timestep reward predictions for each subtrajectory conditioned on the instruction. We then calculate the value-order correlation of each subtrajectory using the predicted rewards and selec…
Figure 17. Model-Based RL with ROBOMETER integrated into DreamZero [77]. In this cluttered scene, ROBOMETER improves DreamZero’s performance from 20% success rate to 70%. receptacle plates of different colors. DreamZero typically places the ice cream cone in the wrong plate, while integrating ROBOMETER corrects for this mistake. Overall, results in…
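The Figure 16 caption mentions ranking subtrajectories by value-order correlation. A hedged sketch of that quantity, assuming it is the Spearman rank correlation between per-timestep predicted rewards and chronological order (the usual reading of the term; the caption is truncated and does not define it). The helper names are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def value_order_correlation(rewards):
    """Spearman correlation between predicted rewards and timestep index."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two timesteps")
    time_ranks = [float(t + 1) for t in range(n)]
    reward_ranks = average_ranks(rewards)
    mean = (n + 1) / 2  # mean rank is the same for both sequences
    cov = sum((a - mean) * (b - mean) for a, b in zip(time_ranks, reward_ranks))
    var_t = sum((a - mean) ** 2 for a in time_ranks)
    var_r = sum((b - mean) ** 2 for b in reward_ranks)
    if var_r == 0:
        return 0.0  # constant predictions carry no ordering signal
    return cov / math.sqrt(var_t * var_r)

print(value_order_correlation([0.1, 0.3, 0.2, 0.8]))
```

A subtrajectory whose predicted reward rises steadily scores near 1, so sorting candidates by this value is one plausible way to pick task-relevant clips.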
Original abstract

General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Robometer, a scalable reward modeling framework for general-purpose robotic reward models. It combines intra-trajectory progress supervision (frame-level loss anchored on expert data) with inter-trajectory preference supervision (pairwise comparison loss imposing global ordering). The authors curate the RBM-1M dataset containing over one million trajectories across diverse embodiments and tasks, including substantial suboptimal and failure data, and report that the resulting models are more generalizable than prior methods while improving downstream robot learning performance on benchmarks and real-world tasks.

Significance. If the empirical results hold under the dual-objective formulation, the work would be significant for scaling reward learning beyond expert-only datasets, a key bottleneck in robotics. The curation of RBM-1M and the explicit dual loss (progress anchoring plus preference ordering) represent a concrete advance; the public release of code, weights, and videos further strengthens reproducibility and potential impact.

major comments (2)
  1. [§3.2] §3.2 (Dual Objective): The preference loss term relies on inter-trajectory comparisons over the full RBM-1M set, including ambiguous failure trajectories. The manuscript must provide explicit details on how these preference labels are generated (heuristics, human annotation, or automated proxies) together with quantitative label-consistency metrics (e.g., inter-annotator agreement or sensitivity to near-equivalent failures). Without such verification, contradictory gradients from noisy preferences could undermine the claimed global ordering and generalization.
  2. [§4.3] §4.3 and Table 2 (Ablation Studies): The reported downstream gains are attributed to the combination of progress and preference losses, yet the ablation isolating the contribution of the preference term on failure trajectories is not shown. Adding this ablation (or reporting the performance drop when the preference loss is removed) is required to confirm that the global-ordering component, rather than the progress anchor alone, drives the claimed improvements.
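One of the label-consistency metrics named in the first major comment, inter-annotator agreement, can be sketched with Cohen's kappa. This is an illustration of the metric only; the annotator data below is invented for the example and does not come from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both annotators labeled independently at their own base rates
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical preference labels: which trajectory of each pair is better, "first" or "second".
annotator_1 = ["first", "first", "second", "second"]
annotator_2 = ["first", "first", "second", "first"]
print(cohens_kappa(annotator_1, annotator_2))
```

Reporting kappa separately for success-versus-failure pairs and failure-versus-failure pairs would speak directly to the concern about near-equivalent failures.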
minor comments (2)
  1. [Abstract] The abstract states that the method works on 'real and augmented failed trajectories,' but the augmentation procedure and its effect on label quality are only briefly mentioned; a dedicated paragraph or figure clarifying the augmentation pipeline would improve clarity.
  2. [Figure 3] Figure 3 (Reward visualization): The color scale and normalization used for reward heatmaps across different tasks are not stated, making direct visual comparison between methods difficult.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the dual-objective formulation and ablation studies. We address each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Dual Objective): The preference loss term relies on inter-trajectory comparisons over the full RBM-1M set, including ambiguous failure trajectories. The manuscript must provide explicit details on how these preference labels are generated (heuristics, human annotation, or automated proxies) together with quantitative label-consistency metrics (e.g., inter-annotator agreement or sensitivity to near-equivalent failures). Without such verification, contradictory gradients from noisy preferences could undermine the claimed global ordering and generalization.

    Authors: We agree that explicit details on preference label generation are necessary for reproducibility and to address potential concerns about noisy labels. In the revised manuscript, we will expand §3.2 to describe that preference labels are generated via automated proxies based on task success indicators and trajectory efficiency metrics from the RBM-1M curation process, with a subset of pairs validated through human annotation. We will also include quantitative label-consistency metrics, such as inter-annotator agreement on the validated subset and sensitivity analysis for near-equivalent failures, to demonstrate that the global ordering remains stable. revision: yes

  2. Referee: [§4.3] §4.3 and Table 2 (Ablation Studies): The reported downstream gains are attributed to the combination of progress and preference losses, yet the ablation isolating the contribution of the preference term on failure trajectories is not shown. Adding this ablation (or reporting the performance drop when the preference loss is removed) is required to confirm that the global-ordering component, rather than the progress anchor alone, drives the claimed improvements.

    Authors: We acknowledge that the current ablations in §4.3 and Table 2 do not isolate the preference term's contribution specifically on failure trajectories. In the revised manuscript, we will add a new ablation experiment comparing the full dual-objective model against a variant trained only with the progress loss (including on failure data). We will report the performance drop on downstream benchmarks to confirm that the inter-trajectory preference supervision drives the observed generalization gains. revision: yes
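The first response refers to automated proxies built from task success indicators and trajectory efficiency. The simulated rebuttal gives no concrete rule, so the following is a purely hypothetical sketch of what such a proxy could look like: success beats failure, and among successes the shorter trajectory wins. The `Trajectory` fields and the decision rule are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trajectory:
    task: str        # task identifier; only same-task pairs are compared
    success: bool    # hypothetical task success indicator
    num_steps: int   # hypothetical efficiency proxy: fewer steps is better

def proxy_preference(a: Trajectory, b: Trajectory) -> Optional[int]:
    """Return 0 if `a` is preferred, 1 if `b` is preferred, None if the pair is skipped."""
    if a.task != b.task:
        return None  # ordering constraints apply within a task only
    if a.success != b.success:
        return 0 if a.success else 1
    if a.success and a.num_steps != b.num_steps:
        return 0 if a.num_steps < b.num_steps else 1
    return None  # two failures, or equally efficient successes: too ambiguous to label

fast = Trajectory("stack cups", True, 80)
slow = Trajectory("stack cups", True, 140)
failed = Trajectory("stack cups", False, 60)
print(proxy_preference(slow, fast), proxy_preference(failed, slow))
```

Skipping failure-versus-failure pairs in this sketch is a conservative choice: those are exactly the pairs where the referee expects proxy labels to be noisy.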

Circularity Check

0 steps flagged

No circularity in Robometer's dual-objective reward modeling

Full rationale

The paper's central derivation uses an externally curated RBM-1M dataset of over one million trajectories and a dual loss (frame-level progress anchored on expert data plus inter-trajectory preference loss) to learn rewards. No step reduces a claimed prediction to a fitted parameter by construction, invokes self-citation as load-bearing uniqueness, or renames a known result; the method remains self-contained against external benchmarks and downstream evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate specific free parameters, axioms, or invented entities; the approach relies on standard RL concepts and a new dataset without explicit new postulates.

pith-pipeline@v0.9.0 · 5556 in / 1000 out tokens · 46424 ms · 2026-05-15T17:54:56.992930+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

    cs.RO 2026-05 unverdicted novelty 7.0

    DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.

  2. Reinforcing VLAs in Task-Agnostic World Models

    cs.AI 2026-05 unverdicted novelty 6.0

    RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.

  3. Grounded World Model for Semantically Generalizable Planning

    cs.RO 2026-04 conditional novelty 6.0

    A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.

  4. ARM: Advantage Reward Modeling for Long-Horizon Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ARM trains reward models on Progressive/Regressive/Stagnant labels to enable adaptive reweighting in offline RL, reaching 99.4% success on towel-folding with minimal human intervention.

  5. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.

  6. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.

Reference graph

Works this paper leans on

166 extracted references · 166 canonical work pages · cited by 5 Pith papers · 15 internal anchors

  1. [1]

    The relativity of ‘absolute’ judgements,

    D. R. J. Laming, “The relativity of ‘absolute’ judgements,” British Journal of Mathematical and Statistical Psychology, vol. 37, pp. 152–183, 1984

  2. [2]

    Absolute identification by relative judgment

    N. Stewart, G. D. A. Brown, and N. Chater, “Absolute identification by relative judgment,” Psychological Review, vol. 112, no. 4, pp. 881–911, 2005

  3. [3]

    The effect of relative encoding on memory-based judgments,

    M. A. Sharif and D. M. Oppenheimer, “The effect of relative encoding on memory-based judgments,” Psychological Science, vol. 27, no. 8, pp. 1136–1145, 2016

  4. [4]

    Rank2reward: Learning shaped reward functions from passive video,

    D. Yang, D. Tjia, J. Berg, D. Damen, P. Agrawal, and A. Gupta, “Rank2reward: Learning shaped reward functions from passive video,” in International Conference on Robotics and Automation (ICRA), 2024

  5. [5]

    ReWiND: Language-guided rewards teach robot policies without new demonstrations,

    J. Zhang, Y. Luo, A. Anwar, S. A. Sontakke, J. J. Lim, J. Thomason, E. Biyik, and J. Zhang, “ReWiND: Language-guided rewards teach robot policies without new demonstrations,” in Conference on Robot Learning (CoRL), 2025

  6. [6]

    Vision language models are in-context value learners,

    Y. J. Ma, J. Hejna, A. Wahid, C. Fu, D. Shah, J. Liang, Z. Xu, S. Kirmani et al., “Vision language models are in-context value learners,” in International Conference on Learning Representations (ICLR), 2025

  7. [7]

    SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation

    Q. Chen, J. Yu, M. Schwager, P. Abbeel, F. Shentu, and P. Wu, “Sarm: Stage-aware reward modeling for long horizon robot manipulation,” arXiv preprint arXiv:2509.25358, 2025

  8. [8]

    SAFE: Multitask failure detection for vision-language-action models,

    Q. Gu, Y. Ju, S. Sun, I. Gilitschenski, H. Nishimura, M. Itkina, and F. Shkurti, “SAFE: Multitask failure detection for vision-language-action models,” in Conference on Neural Information Processing Systems (NeurIPS), 2025

  9. [9]

    Real-world offline reinforcement learning from vision language model feedback,

    S. Venkataraman, Y. Wang, Z. Wang, N. S. Ravie, Z. Erickson, and D. Held, “Real-world offline reinforcement learning from vision language model feedback,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025

  10. [10]

    A vision-language-action-critic model for robotic real-world reinforcement learning,

    S. Zhai, Q. Zhang, T. Zhang, F. Huang, H. Zhang, M. Zhou, S. Zhang, L. Liu et al., “A vision-language-action-critic model for robotic real-world reinforcement learning,” arXiv preprint arXiv:2509.15937, 2025

  11. [11]

    Roboreward: General-purpose vision-language reward models for robotics,

    T. Lee, A. Wagenmaker, K. Pertsch, P. Liang, S. Levine, and C. Finn, “Roboreward: General-purpose vision-language reward models for robotics,” arXiv preprint arXiv:2601.00675, 2026

  12. [12]

    Robo-dopamine: General process reward modeling for high-precision robotic manipulation,

    H. Tan, S. Chen, Y. Xu, Z. Wang, Y. Ji, C. Chi, Y. Lyu, Z. Zhao et al., “Robo-dopamine: General process reward modeling for high-precision robotic manipulation,” arXiv preprint arXiv:2512.23703, 2025

  13. [13]

    Position: Good embodied reward models need bad behavior data,

    R. Tian, Y. Wu, and A. Bajcsy, “Position: Good embodied reward models need bad behavior data,” Carnegie Mellon University, Tech. Rep., 2026. [Online]. Available: https://cmu-intentlab.github.io/pdf/ tian icml 26 position.pdf

  14. [14]

    Algorithms for inverse reinforcement learning,

    A. Y. Ng and S. J. Russell, “Algorithms for inverse reinforcement learning,” in International Conference on Machine Learning (ICML), 2000

  15. [15]

    Apprenticeship learning via inverse reinforcement learning,

    P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in International Conference on Machine Learning (ICML), 2004

  16. [16]

    Maximum entropy inverse reinforcement learning,

    B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning,” in AAAI Conference on Artificial Intelligence, 2008

  17. [17]

    Guided cost learning: Deep inverse optimal control via policy optimization,

    C. Finn, S. Levine, and P. Abbeel, “Guided cost learning: Deep inverse optimal control via policy optimization,” in International Conference on Machine Learning (ICML), 2016

  18. [18]

    Feature expansive reward learning: Rethinking human input,

    A. Bobu, M. Wiggert, C. Tomlin, and A. D. Dragan, “Feature expansive reward learning: Rethinking human input,” in ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2021

  19. [19]

    Generative adversarial imitation learning,

    J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Advances in Neural Information Processing Systems (NeurIPS), 2016

  20. [20]

    Socially compliant navigation through raw depth inputs with generative adversarial imitation learning,

    L. Tai, J. Zhang, M. Liu, and W. Burgard, “Socially compliant navigation through raw depth inputs with generative adversarial imitation learning,” in International Conference on Robotics and Automation (ICRA), 2018

  21. [21]

    Learning robust rewards with adversarial inverse reinforcement learning,

    J. Fu, K. Luo, and S. Levine, “Learning robust rewards with adversarial inverse reinforcement learning,” in International Conference on Learning Representations (ICLR), 2018

  22. [22]

    Variational inverse control with events: A general framework for data-driven reward definition,

    J. Fu, A. Singh, D. Ghosh, L. Yang, and S. Levine, “Variational inverse control with events: A general framework for data-driven reward definition,” in Advances in Neural Information Processing Systems (NeurIPS), 2018

  23. [23]

    Robot policy learning with temporal optimal transport reward,

    Y. Fu, H. Zhang, D. Wu, W. Xu, and B. Boulet, “Robot policy learning with temporal optimal transport reward,” in Conference on Neural Information Processing Systems (NeurIPS), 2024

  24. [24]

    A smooth sea never made a skilled SAILOR: Robust imitation via learning to search,

    A. K. Jain, V. Mohta, S. Kim, A. Bhardwaj, J. Ren, Y. Feng, S. Choudhury, and G. Swamy, “A smooth sea never made a skilled SAILOR: Robust imitation via learning to search,” in Conference on Neural Information Processing Systems (NeurIPS), 2025

  25. [25]

    Deep reinforcement learning from human preferences,

    P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in Advances in Neural Information Processing Systems (NeurIPS), 2017

  26. [26]

    Active preference-based learning of reward functions,

    D. Sadigh, A. D. Dragan, S. S. Sastry, and S. A. Seshia, “Active preference-based learning of reward functions,” in Robotics: Science and Systems (RSS), 2017

  27. [27]

    Learning from physical human corrections, one feature at a time,

    A. Bajcsy, D. P. Losey, M. K. O’Malley, and A. D. Dragan, “Learning from physical human corrections, one feature at a time,” in International Conference on Human-Robot Interaction (HRI), 2018

  28. [28]

    Active preference-based gaussian process regression for reward learning,

    E. Biyik, N. Huynh, M. J. Kochenderfer, and D. Sadigh, “Active preference-based gaussian process regression for reward learning,” in Robotics: Science and Systems (RSS), 2020

  29. [29]

    Pebble: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training,

    K. Lee, L. Smith, and P. Abbeel, “Pebble: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training,” in International Conference on Machine Learning (ICML), 2021

  30. [30]

    Few-shot preference learning for human-in-the-loop rl,

    J. Hejna and D. Sadigh, “Few-shot preference learning for human-in-the-loop rl,” in Conference on Robot Learning (CoRL), 2022

  31. [31]

    Learning multimodal rewards from rankings,

    V. Myers, E. Biyik, N. Anari, and D. Sadigh, “Learning multimodal rewards from rankings,” in Conference on Robot Learning (CoRL), 2021

  32. [32]

    Trajectory improvement and reward learning from comparative language feedback,

    Z. Yang, M. Jun, J. Tien, S. J. Russell, A. Dragan, and E. Bıyık, “Trajectory improvement and reward learning from comparative language feedback,” in Conference on Robot Learning (CoRL), 2024

  33. [33]

    Mile: Model-based intervention learning,

    Y. Korkmaz and E. Bıyık, “Mile: Model-based intervention learning,” in International Conference on Robotics and Automation (ICRA), 2025

  34. [34]

    Robomonkey: Scaling test-time sampling and verification for vision-language-action models,

    J. Kwok, C. Agia, R. Sinha, M. Foutter, S. Li, I. Stoica, A. Mirhoseini, and M. Pavone, “Robomonkey: Scaling test-time sampling and verification for vision-language-action models,” in Conference on Robot Learning (CoRL), 2025

  35. [35]

    Rl-vlm-f: Reinforcement learning from vision language foundation model feedback,

    Y. Wang, Z. Sun, J. Zhang, Z. Xian, E. Biyik, D. Held, and Z. Erickson, “Rl-vlm-f: Reinforcement learning from vision language foundation model feedback,” in International Conference on Machine Learning (ICML), 2024

  36. [36]

    Eureka: Human-level reward design via coding large language models,

    Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan et al., “Eureka: Human-level reward design via coding large language models,” in International Conference on Learning Representations (ICLR), 2024

  37. [37]

    Language to rewards for robotic skill synthesis,

    W. Yu, N. Gileadi, C. Fu, S. Kirmani, K.-H. Lee, M. Gonzalez Arenas, H.-T. Lewis Chiang, T. Erez et al., “Language to rewards for robotic skill synthesis,” in Conference on Robot Learning (CoRL), 2023

  38. [38]

    Text2reward: Reward shaping with language models for reinforcement learning,

    T. Xie, S. Zhao, C. H. Wu, Y. Liu, Q. Luo, V. Zhong, Y. Yang, and T. Yu, “Text2reward: Reward shaping with language models for reinforcement learning,” in International Conference on Learning Representations (ICLR), 2024

  39. [39]

    Reward design with language models,

    M. Kwon, S. M. Xie, K. Bullard, and D. Sadigh, “Reward design with language models,” in International Conference on Learning Representations (ICLR), 2023

  40. [40]

    Masked irl: Llm-guided reward disambiguation from demonstrations and language,

    M. Hwang, A. Forsey-Smerek, N. Dennler, and A. Bobu, “Masked irl: Llm-guided reward disambiguation from demonstrations and language,” arXiv preprint arXiv:2511.14565, 2025

  41. [41]

    Can foundation models perform zero-shot task specification for robot manipulation?

    Y. Cui, S. Niekum, A. Gupta, V. Kumar, and A. Rajeswaran, “Can foundation models perform zero-shot task specification for robot manipulation?” in Learning for Dynamics and Control Conference (L4DC), 2022

  42. [42]

    Zero-shot reward specification via grounded natural language,

    P. Mahmoudieh, D. Pathak, and T. Darrell, “Zero-shot reward specification via grounded natural language,” in International Conference on Machine Learning (ICML), 2022

  43. [43]

    Roboclip: One demonstration is enough to learn robot policies,

    S. A. Sontakke, J. Zhang, S. Arnold, K. Pertsch, E. Biyik, D. Sadigh, C. Finn, and L. Itti, “Roboclip: One demonstration is enough to learn robot policies,” in Advances in Neural Information Processing Systems (NeurIPS), 2023

  44. [44]

    Vision-language models as success detectors,

    Y. Du, K. Konyushkova, M. Denil, A. Raju, J. Landon, F. Hill, N. de Freitas, and S. Cabi, “Vision-language models as success detectors,” in Conference on Lifelong Learning Agents, 2023

  45. [45]

    Vip: Towards universal visual reward and representation via value-implicit pre-training,

    Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang, “Vip: Towards universal visual reward and representation via value-implicit pre-training,” in International Conference on Learning Representations (ICLR), 2023

  46. [46]

    Language reward modulation for pretraining reinforcement learning,

    A. Adeniji, A. Xie, C. Sferrazza, Y. Seo, S. James, and P. Abbeel, “Language reward modulation for pretraining reinforcement learning,” arXiv preprint arXiv:2308.12270, 2024

  47. [47]

    FuRL: Visual-language models as fuzzy rewards for reinforcement learning,

    Y. Fu, H. Zhang, D. Wu, W. Xu, and B. Boulet, “FuRL: Visual-language models as fuzzy rewards for reinforcement learning,” in International Conference on Machine Learning (ICML), 2024

  48. [48]

    Task success is not enough: Investigating the use of video-language models as behavior critics for catching undesirable agent behaviors,

    L. Guan, Y. Zhou, D. Liu, Y. Zha, H. B. Amor, and S. Kambhampati, “Task success is not enough: Investigating the use of video-language models as behavior critics for catching undesirable agent behaviors,” in Conference on Language Modeling (COLM), 2024

  49. [49]

    Vision-language models are zero-shot reward models for reinforcement learning,

    J. Rocamonde, V. Montesinos, E. Nava, E. Perez, and D. Lindner, “Vision-language models are zero-shot reward models for reinforcement learning,” in International Conference on Learning Representations (ICLR), 2024

  50. [50]

    Topreward: Token probabilities as hidden zero-shot rewards for robotics,

    S. Chen, C. Harrison, Y.-C. Lee, A. J. Yang, Z. Ren, L. J. Ratliff, J. Duan, D. Fox et al., “Topreward: Token probabilities as hidden zero-shot rewards for robotics,” arXiv preprint arXiv:2602.19313, 2026

  51. [51]

    Minedojo: Building open-ended embodied agents with internet-scale knowledge,

    L. Fan, G. Wang, Y. Jiang, A. Mandlekar, Y. Yang, H. Zhu, A. Tang, D.-A. Huang et al., “Minedojo: Building open-ended embodied agents with internet-scale knowledge,” in Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022

  52. [52]

    Do embodied agents dream of pixelated sheep?: Embodied decision making using language guided world modelling,

    K. Nottingham, P. Ammanabrolu, A. Suhr, Y. Choi, H. Hajishirzi, S. Singh, and R. Fox, “Do embodied agents dream of pixelated sheep?: Embodied decision making using language guided world modelling,” in International Conference on Machine Learning (ICML), 2023

  53. [53]

    Lift: Unsupervised reinforcement learning with foundation models as teachers,

    T. Nam, J. Lee, J. Zhang, S. J. Hwang, J. J. Lim, and K. Pertsch, “Lift: Unsupervised reinforcement learning with foundation models as teachers,” arXiv preprint arXiv:2312.08958, 2023

  54. [54]

    Liv: Language-image representations and rewards for robotic control,

    Y. J. Ma, W. Liang, V. Som, V. Kumar, A. Zhang, O. Bastani, and D. Jayaraman, “Liv: Language-image representations and rewards for robotic control,” in International Conference on Machine Learning (ICML), 2023

  55. [55]

    VICtor: Learning hierarchical vision-instruction correlation rewards for long-horizon manipulation,

    K.-H. Hung, P.-C. Lo, J.-F. Yeh, H.-Y. Hsu, Y.-T. Chen, and W. H. Hsu, “VICtor: Learning hierarchical vision-instruction correlation rewards for long-horizon manipulation,” in International Conference on Learning Representations (ICLR), 2025

  56. [56]

    Subtask-aware visual reward learning from segmented demonstrations,

    C. Kim, M. Heo, D. Lee, H. Lee, J. Shin, J. J. Lim, and K. Lee, “Subtask-aware visual reward learning from segmented demonstrations,” in International Conference on Learning Representations (ICLR), 2025

  57. [57]

    Opengvl: Benchmarking visual temporal progress for data curation,

    P. Budzianowski, E. Wiśnios, G. Góral, I. Kulakov, V. Petrenko, and K. Walas, “Opengvl: Benchmarking visual temporal progress for data curation,” arXiv preprint arXiv:2509.17321, 2025

  58. [58]

    Self-improving embodied foundation models,

    S. K. S. Ghasemipour, A. Wahid, J. Tompson, P. R. Sanketi, and I. Mordatch, “Self-improving embodied foundation models,” in Advances in Neural Information Processing Systems (NeurIPS), 2025

  59. [59]

    $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian et al., “$\pi^{*}_{0.6}$: A vla that learns from experience,” arXiv preprint arXiv:2511.14759, 2025

  60. [60]

    Progresslm: Towards progress reasoning in vision-language models,

    J. Zhang, C. Qian, H. Sun, H. Lu, D. Wang, L. Xue, and H. Liu, “Progresslm: Towards progress reasoning in vision-language models,” arXiv preprint arXiv:2601.15224, 2026

  61. [61]

    Roboarena: Distributed real-world evaluation of generalist robot policies,

    P. Atreya, K. Pertsch, T. Lee, M. J. Kim, A. Jain, A. Kuramshin, C. Eppner, C. Neary et al., “Roboarena: Distributed real-world evaluation of generalist robot policies,” in Conference on Robot Learning (CoRL), 2025

  62. [62]

    Open X-Embodiment: Robotic learning datasets and RT-X models,

    O. X.-E. Collaboration, A. O’Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee et al., “Open X-Embodiment: Robotic learning datasets and RT-X models,” in International Conference on Robotics and Automation (ICRA), 2024

  63. [63]

    Agibot world colosseum,

    A. W. C. contributors, “Agibot world colosseum,” 2024

  64. [64]

    Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100,

    D. Damen, H. Doughty, G. M. Farinella, A. Furnari, J. Ma, E. Kazakos, D. Moltisanti, J. Munro et al., “Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100,” International Journal of Computer Vision (IJCV), vol. 130, pp. 33–55, 2022

  65. [65]

    Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot,

    H.-S. Fang, H. Fang, Z. Tang, J. Liu, C. Wang, J. Wang, H. Zhu, and C. Lu, “Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot,” arXiv preprint arXiv:2307.00595, 2023

  66. [66]

    LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,” arXiv preprint arXiv:2306.03310, 2023

  67. [67]

    Autonomous improvement of instruction following skills via foundation models,

    Z. Zhou, P. Atreya, A. Lee, H. R. Walke, O. Mees, and S. Levine, “Autonomous improvement of instruction following skills via foundation models,” in Conference on Robot Learning (CoRL), 2024

  68. [68]

    Failsafe: Reasoning and recovery from failures in vision-language-action models,

    Z. Lin, J. Duan, H. Fang, D. Fox, R. Krishna, C. Tan, and B. Wen, “Failsafe: Reasoning and recovery from failures in vision-language-action models,” arXiv preprint arXiv:2510.01642, 2025

  69. [69]

    A distributional perspective on reinforcement learning,

    M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” in International Conference on Machine Learning (ICML), 2017

  70. [70]

    Adapower: Specializing world foundation models for predictive manipulation,

    Y. Huang, S. Zou, J. Zhang, X. Liu, R. Hu, and K. Xu, “Adapower: Specializing world foundation models for predictive manipulation,” arXiv preprint arXiv:2512.035358, 2025

  71. [71]

    Towards improving reward design in RL: A reward alignment metric for RL practitioners,

    C. Muslimani, K. Johnstonbaugh, S. Chandramouli, S. Booth, W. B. Knox, and M. E. Taylor, “Towards improving reward design in RL: A reward alignment metric for RL practitioners,” in Reinforcement Learning Conference (RLC), 2025

  72. [72]

    Robofac: A comprehensive framework for robotic failure analysis and correction,

    W. Lu, M. Ye, Z. Ye, R. Tao, S. Yang, and B. Zhao, “Robofac: A comprehensive framework for robotic failure analysis and correction,” arXiv preprint arXiv:2505.12224, 2025

  73. [73]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations (ICLR), 2022

  74. [74]

    Steering your diffusion policy with latent space reinforcement learning,

    A. Wagenmaker, M. Nakamoto, Y. Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine, “Steering your diffusion policy with latent space reinforcement learning,” in Conference on Robot Learning (CoRL), 2025

  75. [75]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom et al., “$\pi_0$: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024

  76. [76]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama et al., “Droid: A large-scale in-the-wild robot manipulation dataset,” arXiv preprint arXiv:2403.12945, 2024

  77. [77]

    World Action Models are Zero-shot Policies

    S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan et al., “World action models are zero-shot policies,” arXiv preprint arXiv:2602.15922, 2026

  78. [78]

    Offline Reinforcement Learning with Implicit Q-Learning

    I. Kostrikov, A. Nair, and S. Levine, “Offline reinforcement learning with implicit q-learning,” arXiv preprint arXiv:2110.06169, 2021

  79. [79]

    Learning latent plans from play,

    C. Lynch, M. Khansari, T. Xiao, V. Kumar, J. Tompson, S. Levine, and P. Sermanet, “Learning latent plans from play,” in Conference on Robot Learning (CoRL), 2019

  80. [80]

    Sigmoid loss for language image pre-training,

    X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in International Conference on Computer Vision (ICCV), 2023

Showing first 80 references.