arXiv: 2603.02115 · v2 · submitted 2026-03-02 · 💻 cs.RO · cs.AI · cs.LG

Recognition: 2 theorem links


Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, Jesse Zhang


Pith reviewed 2026-05-15 17:54 UTC · model grok-4.3

classification 💻 cs.RO cs.AI cs.LG

keywords robot reward models · trajectory comparisons · preference supervision · generalizable rewards · failure trajectories · RBM-1M dataset · robot learning


The pith

Robometer trains generalizable robot reward models by combining frame-level progress with inter-trajectory preferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Robometer trains general-purpose robot reward models using a dual objective that pairs local progress labels on expert demonstrations with global ordering constraints from comparing entire trajectories. Prior approaches scale poorly because assigning dense progress labels becomes ambiguous on the abundant suboptimal and failed trajectories found in large robotics datasets. The added preference loss imposes ordering across same-task trajectories, letting the model learn from both successes and failures. Training occurs on the new RBM-1M dataset of over one million trajectories spanning multiple robot embodiments and tasks. The resulting rewards generalize better than earlier methods and raise performance on a range of downstream robot learning benchmarks and real-world applications.

Core claim

Robometer is a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. It is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task. This formulation supports effective learning from both real and augmented failed trajectories on the RBM-1M dataset of over one million trajectories.

What carries the argument

The dual objective of a frame-level progress loss anchored on expert demonstrations plus a trajectory-comparison preference loss that supplies global ordering constraints across same-task trajectories.
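As a rough illustration of how the two terms might combine, the sketch below adds a per-frame progress term to a pairwise preference term. The specific choices are assumptions, not details confirmed by the paper: mean squared error for progress, binary cross-entropy on a single preference logit, and a hypothetical `pref_weight` coefficient. Passing `target_progress=None` stands in for a trajectory that has no progress labels, such as a failed rollout.

```python
import math


def progress_loss(pred, target):
    """Mean squared error between predicted and labeled per-frame progress."""
    if not pred or len(pred) != len(target):
        raise ValueError("pred and target must be non-empty and equally long")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def preference_loss(logit, first_is_better):
    """Binary cross-entropy on one preference logit.

    A positive logit means the model favors the first trajectory.
    """
    # Sign the logit toward the correct answer, then compute
    # log(1 + exp(-z)) in a numerically stable way.
    z = logit if first_is_better else -logit
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def dual_objective(pred_progress, target_progress, pref_logit,
                   first_is_better, pref_weight=1.0):
    """Progress term (skipped when unlabeled) plus weighted preference term."""
    if target_progress is None:
        prog = 0.0  # assumption: unlabeled trajectories skip the progress term
    else:
        prog = progress_loss(pred_progress, target_progress)
    return prog + pref_weight * preference_loss(pref_logit, first_is_better)
```

Under these assumptions, a failed trajectory still contributes a training signal through the preference term even though it carries no progress label.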

If this is right

Reward functions become more generalizable across diverse robot embodiments and tasks.
Robot learning performance improves across benchmarks and real-world downstream applications.
Large-scale datasets containing many failed and suboptimal trajectories can be used directly for reward learning.
Augmented failure trajectories contribute positively to reward quality without dense manual labeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The preference component could allow reward models to be fine-tuned on new tasks with only a small number of comparative labels rather than full demonstrations.
Human preference collection on trajectory pairs might be crowdsourced more easily than dense progress annotation, lowering data costs for future datasets.
The same dual-loss structure could be tested in simulation-to-real transfer settings where failure data is cheap to generate.
Extending the comparison loss to cross-embodiment trajectory pairs could test whether a single reward model can serve multiple robot platforms without retraining.

Load-bearing premise

Inter-trajectory preference supervision from comparisons imposes reliable global ordering constraints even on ambiguous suboptimal and failure trajectories without introducing significant labeling noise or bias.

What would settle it

Collect fresh human preference labels on a held-out set of trajectories from multiple tasks and embodiments; if the learned rewards fail to match the human ordering on a majority of pairs, the central claim is falsified.
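The check described above can be sketched as a pairwise agreement rate. Everything here is illustrative: the function name, the use of one scalar score per trajectory (for example, the final predicted progress), and the choice to count ties as disagreements are all assumptions.

```python
def pairwise_agreement(rewards, human_prefs):
    """Fraction of human-labeled pairs that the reward model orders the same way.

    rewards: dict mapping trajectory id -> scalar score from the reward model.
    human_prefs: list of (preferred_id, other_id) pairs from annotators.
    Ties in the model scores count as disagreement.
    """
    if not human_prefs:
        raise ValueError("need at least one labeled pair")
    agree = sum(1 for better, worse in human_prefs
                if rewards[better] > rewards[worse])
    return agree / len(human_prefs)


# Hypothetical scores and labels for three trajectories of one task.
scores = {"a": 0.9, "b": 0.4, "c": 0.1}
labels = [("a", "b"), ("b", "c"), ("c", "a")]
rate = pairwise_agreement(scores, labels)  # two of the three pairs agree
```

A rate below 0.5 on the held-out pairs would correspond to the "fail to match on a majority of pairs" condition stated above.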

Figures

Figures reproduced from arXiv: 2603.02115 by Abhishek Gupta, Abrar Anwar, Aditya Shah, Alex S. Huang, Andreea Bobu, Anqi Li, Anthony Liang, Dieter Fox, Erdem Biyik, Jesse Zhang, Jiahui Zhang, Luke Zettlemoyer, Minyoung Hwang, Sidhant Kaushik, Stephen Tu, Yigit Korkmaz, Yu Xiang.

**Figure 1.** ROBOMETER Overview. ROBOMETER is trained on RBM-1M, a 1M-trajectory dataset spanning 21 robot embodiments, containing both reward-labeled/expert trajectories and reward-unlabeled, failed trajectories. The model is supervised with a dual objective: predicting frame-level task progress (reward) and learning trajectory-level preferences from pairwise comparisons. To help with downstream RL, it is also trained…

**Figure 2.** ROBOMETER is a VLM-based reward model that predicts dense, per-frame progress-based rewards and success labels for the first of two video trajectories. To be able to train with failed, non-expert data, we also predict which of the two video trajectories better completes the task. We use three strategies for curating training examples from our given datasets, which are further detailed in Section III-D wit…

**Figure 3.** Video-Language Reward Confusion Matrix. For each task sampled at random from self-collected, unseen data from RBM-EVAL-OOD, we compute rewards for all combinations of demonstration videos and language descriptions. ROBOMETER produces the most diagonal-heavy confusion matrix, indicating strong alignment between unseen demos and instructions. We also report the column-normalized diagonal mean under each mode…

**Figure 4.** Qualitative Analysis of Failure, Suboptimal and Successful Trajectories. We visualize the progress predictions for three trajectories of different quality for the same task. Notably, for the suboptimal trajectory, ROBOMETER predicts steadily increasing progress as the robot approaches the pen holder, but sharply reduces its progress estimate when the marker is dropped, correctly reflecting regression in ta…

**Figure 5.** RL w/ Ablation Models in LIBERO-90 tasks from scratch, corresponding to ablations trained only on LIBERO10/Object/Goal/Spatial data from Table IV. We report the average success rate ± standard deviation across 5 seeds.

**Figure 6.** Automatic online RL with DSRL on a DROID setup with ROBOMETER improves π0 from 20% to 85% on a single-stage task and 20% to 70% on a two-stage task, outperforming RoboReward’s overall success rate by 2.5×. DSRL with ROBOMETER learns to avoid base π0 errors such as collisions or moving the wrong object. The setup is deemed “automatic” because success detection and stage advancement are handled automatically…

**Figure 7.** Offline RL results using IQL on a mixture of Noisy and Expert trajectories. ROBOMETER rewards consistently outperform both RoboReward and sparse rewards: 2.4× average success rate improvement over the best baseline for each task.

**Figure 8.** (a): Proportion of task-relevant subtrajectories out of 100 retrieval queries. Our method consistently retrieves a high number of relevant subtrajectories using either the preference or progress objective. (b): Success rates of LoRA-finetuned π0.5 policies using the retrieved trajectories from each method. Small amounts of suboptimal & unrelated data retrieved by other baselines degrade policy-learning per…

**Figure 9.** Failure Detection Examples. (a): Terminal events such as drops cause a sharp regression in predicted task progress, which ROBOMETER flags shortly after the event. (b): Non-terminal failures correctly exhibit oscillatory progress with ROBOMETER.

**Figure 10.** Pie chart of RBM-1M dataset types. Full table with individual dataset details in Table IX.

**Figure 11.** ROBOMETER model architecture. For a given task, our VLM-based model takes in a language description and two trajectories, 1 and 2, separated by split tokens. The VLM output for trajectory 1 is fed into two MLP heads: progress (task completion percent) and success (task completion probability). Finally, the full output is passed into a preference MLP to choose which trajectory best completes the provided task.…

**Figure 12.** Failure Detection OOD confusion matrices with ternary ground truth and binary prediction. Rows indicate ground-truth execution outcomes (failure, suboptimal, success), while columns indicate binary predictions (predicted failure vs. predicted success). Suboptimal trajectories correspond to executions that make partial progress but do not complete the task. Suboptimal trajectories are treated as failures i…

**Figure 13.** Irreversible failures. Terminal events such as drops or spills cause a sharp regression in predicted task progress, which our model reliably flags as failures shortly after the event.

**Figure 14.** Semantic failures. The robot executes smooth and physically plausible trajectories but violates the task instruction, resulting in persistently low predicted progress and failure detection without an abrupt terminal event.

**Figure 16.** Data filtering & retrieval scene configuration from all cameras. Given a task language instruction, we perform retrieval using different reward models as follows. For RoboReward and ROBOMETER-Prog, we first compute per-timestep reward predictions for each subtrajectory conditioned on the instruction. We then calculate the value-order correlation of each subtrajectory using the predicted rewards and selec…

**Figure 17.** Model-Based RL with ROBOMETER integrated into DreamZero [77]. In this cluttered scene, ROBOMETER improves DreamZero’s performance from 20% success rate to 70%.
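The Figure 16 caption mentions scoring each subtrajectory by its value-order correlation. A minimal sketch of one common reading of that metric follows, assuming it is the Spearman rank correlation between per-timestep predicted rewards and chronological frame order. The function names and the tie handling are illustrative, and the paper's exact definition may differ.

```python
import math


def _average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def value_order_correlation(rewards):
    """Spearman correlation between predicted rewards and frame index.

    +1 means rewards rise monotonically over time, -1 means they fall.
    """
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two timesteps")
    rx = _average_ranks(rewards)
    ry = [float(t + 1) for t in range(n)]  # chronological order
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0:
        return 0.0  # constant rewards carry no ordering information
    return cov / math.sqrt(vx * vy)
```

Under this reading, subtrajectories whose predicted rewards increase steadily for the given instruction would rank highest for retrieval.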

Original abstract

General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Robometer's dual loss on a million-trajectory dataset with failures is a practical step forward for reward modeling, but the quality of those inter-trajectory preferences needs direct verification.


The main point is that Robometer trains reward models by anchoring local progress on expert frames while adding pairwise trajectory comparisons across a large set that includes failures and suboptimal runs. This dual setup lets the model draw signal from data that standard progress-only methods discard, and the authors back it with a new RBM-1M dataset of over a million trajectories spanning multiple robots and tasks. They report stronger generalization and better downstream robot performance than prior approaches, which matches the scaling goal they set out.

Referee Report

2 major / 2 minor

Summary. The paper introduces Robometer, a scalable reward modeling framework for general-purpose robotic reward models. It combines intra-trajectory progress supervision (frame-level loss anchored on expert data) with inter-trajectory preference supervision (pairwise comparison loss imposing global ordering). The authors curate the RBM-1M dataset containing over one million trajectories across diverse embodiments and tasks, including substantial suboptimal and failure data, and report that the resulting models are more generalizable than prior methods while improving downstream robot learning performance on benchmarks and real-world tasks.

Significance. If the empirical results hold under the dual-objective formulation, the work would be significant for scaling reward learning beyond expert-only datasets, a key bottleneck in robotics. The curation of RBM-1M and the explicit dual loss (progress anchoring plus preference ordering) represent a concrete advance; the public release of code, weights, and videos further strengthens reproducibility and potential impact.

major comments (2)

§3.2 (Dual Objective): The preference loss term relies on inter-trajectory comparisons over the full RBM-1M set, including ambiguous failure trajectories. The manuscript must provide explicit details on how these preference labels are generated (heuristics, human annotation, or automated proxies) together with quantitative label-consistency metrics (e.g., inter-annotator agreement or sensitivity to near-equivalent failures). Without such verification, contradictory gradients from noisy preferences could undermine the claimed global ordering and generalization.
§4.3 and Table 2 (Ablation Studies): The reported downstream gains are attributed to the combination of progress and preference losses, yet the ablation isolating the contribution of the preference term on failure trajectories is not shown. Adding this ablation (or reporting the performance drop when the preference loss is removed) is required to confirm that the global-ordering component, rather than the progress anchor alone, drives the claimed improvements.
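One of the label-consistency metrics named in the first comment, inter-annotator agreement, could be reported as Cohen's kappa over pairs labeled by two annotators. The sketch below is a generic implementation of that statistic, not something taken from the manuscript. The label encoding (0 when the first trajectory is preferred, 1 when the second is) is an assumption.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    1.0 is perfect agreement; 0.0 is agreement no better than chance.
    """
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label lists must be non-empty and equally long")
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: share of items where both annotators match.
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Reporting kappa separately for pairs that involve near-equivalent failures would speak to the sensitivity concern raised in the same comment.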

minor comments (2)

[Abstract] The abstract states that the method works on 'real and augmented failed trajectories,' but the augmentation procedure and its effect on label quality are only briefly mentioned; a dedicated paragraph or figure clarifying the augmentation pipeline would improve clarity.
Figure 3 (Reward visualization): The color scale and normalization used for reward heatmaps across different tasks are not stated, making direct visual comparison between methods difficult.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the dual-objective formulation and ablation studies. We address each point below and will revise the manuscript accordingly.


Referee: §3.2 (Dual Objective): The preference loss term relies on inter-trajectory comparisons over the full RBM-1M set, including ambiguous failure trajectories. The manuscript must provide explicit details on how these preference labels are generated (heuristics, human annotation, or automated proxies) together with quantitative label-consistency metrics (e.g., inter-annotator agreement or sensitivity to near-equivalent failures). Without such verification, contradictory gradients from noisy preferences could undermine the claimed global ordering and generalization.

Authors: We agree that explicit details on preference label generation are necessary for reproducibility and to address potential concerns about noisy labels. In the revised manuscript, we will expand §3.2 to describe that preference labels are generated via automated proxies based on task success indicators and trajectory efficiency metrics from the RBM-1M curation process, with a subset of pairs validated through human annotation. We will also include quantitative label-consistency metrics, such as inter-annotator agreement on the validated subset and sensitivity analysis for near-equivalent failures, to demonstrate that the global ordering remains stable. revision: yes
Referee: §4.3 and Table 2 (Ablation Studies): The reported downstream gains are attributed to the combination of progress and preference losses, yet the ablation isolating the contribution of the preference term on failure trajectories is not shown. Adding this ablation (or reporting the performance drop when the preference loss is removed) is required to confirm that the global-ordering component, rather than the progress anchor alone, drives the claimed improvements.

Authors: We acknowledge that the current ablations in §4.3 and Table 2 do not isolate the preference term's contribution specifically on failure trajectories. In the revised manuscript, we will add a new ablation experiment comparing the full dual-objective model against a variant trained only with the progress loss (including on failure data). We will report the performance drop on downstream benchmarks to confirm that the inter-trajectory preference supervision drives the observed generalization gains. revision: yes

Circularity Check

0 steps flagged

No circularity in Robometer's dual-objective reward modeling


The paper's central derivation uses an externally curated RBM-1M dataset of over one million trajectories and a dual loss (frame-level progress anchored on expert data plus inter-trajectory preference loss) to learn rewards. No step reduces a claimed prediction to a fitted parameter by construction, invokes self-citation as load-bearing uniqueness, or renames a known result; the method remains self-contained against external benchmarks and downstream evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate specific free parameters, axioms, or invented entities; the approach relies on standard RL concepts and a new dataset without explicit new postulates.

pith-pipeline@v0.9.0 · 5556 in / 1000 out tokens · 46424 ms · 2026-05-15T17:54:56.992930+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

Paper passage: "ROBOMETER is trained with a dual objective: ... preference-prediction loss over trajectory comparisons"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies
cs.RO 2026-05 unverdicted novelty 7.0

DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.
Reinforcing VLAs in Task-Agnostic World Models
cs.AI 2026-05 unverdicted novelty 6.0

RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.
Grounded World Model for Semantically Generalizable Planning
cs.RO 2026-04 conditional novelty 6.0

A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
ARM: Advantage Reward Modeling for Long-Horizon Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

ARM trains reward models on Progressive/Regressive/Stagnant labels to enable adaptive reweighting in offline RL, reaching 99.4% success on towel-folding with minimal human intervention.
RLDX-1 Technical Report
cs.RO 2026-05 unverdicted novelty 4.0

RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.
RLDX-1 Technical Report
cs.RO 2026-05 unverdicted novelty 4.0

RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.

Reference graph

Works this paper leans on

166 extracted references · 166 canonical work pages · cited by 5 Pith papers · 15 internal anchors

[1]

The relativity of ‘absolute’ judgements

D. R. J. Laming, “The relativity of ‘absolute’ judgements,” British Journal of Mathematical and Statistical Psychology, vol. 37, pp. 152–183, 1984

work page 1984
[2]

Absolute identification by relative judgment

N. Stewart, G. D. A. Brown, and N. Chater, “Absolute identification by relative judgment,” Psychological Review, vol. 112, no. 4, pp. 881–911, 2005

work page 2005
[3]

The effect of relative encoding on memory-based judgments

M. A. Sharif and D. M. Oppenheimer, “The effect of relative encoding on memory-based judgments,” Psychological Science, vol. 27, no. 8, pp. 1136–1145, 2016

work page 2016
[4]

Rank2reward: Learning shaped reward functions from passive video

D. Yang, D. Tjia, J. Berg, D. Damen, P. Agrawal, and A. Gupta, “Rank2reward: Learning shaped reward functions from passive video,” in International Conference on Robotics and Automation (ICRA), 2024

work page 2024
[5]

ReWiND: Language-guided rewards teach robot policies without new demonstrations

J. Zhang, Y. Luo, A. Anwar, S. A. Sontakke, J. J. Lim, J. Thomason, E. Biyik, and J. Zhang, “ReWiND: Language-guided rewards teach robot policies without new demonstrations,” in Conference on Robot Learning (CoRL), 2025

work page 2025
[6]

Vision language models are in-context value learners

Y. J. Ma, J. Hejna, A. Wahid, C. Fu, D. Shah, J. Liang, Z. Xu, S. Kirmani et al., “Vision language models are in-context value learners,” in International Conference on Learning Representations (ICLR), 2025

work page 2025
[7]

SARM: Stage-Aware Reward Modeling for Long Horizon Robot Manipulation

Q. Chen, J. Yu, M. Schwager, P. Abbeel, F. Shentu, and P. Wu, “SARM: Stage-aware reward modeling for long horizon robot manipulation,” arXiv preprint arXiv:2509.25358, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

SAFE: Multitask failure detection for vision-language-action models

Q. Gu, Y. Ju, S. Sun, I. Gilitschenski, H. Nishimura, M. Itkina, and F. Shkurti, “SAFE: Multitask failure detection for vision-language-action models,” in Conference on Neural Information Processing Systems (NeurIPS), 2025

work page 2025
[9]

Real-world offline reinforcement learning from vision language model feedback

S. Venkataraman, Y. Wang, Z. Wang, N. S. Ravie, Z. Erickson, and D. Held, “Real-world offline reinforcement learning from vision language model feedback,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025

work page 2025
[10]

A vision-language-action-critic model for robotic real-world reinforcement learning

S. Zhai, Q. Zhang, T. Zhang, F. Huang, H. Zhang, M. Zhou, S. Zhang, L. Liu et al., “A vision-language-action-critic model for robotic real-world reinforcement learning,” arXiv preprint arXiv:2509.15937, 2025

work page arXiv 2025
[11]

Roboreward: General-purpose vision-language reward models for robotics

T. Lee, A. Wagenmaker, K. Pertsch, P. Liang, S. Levine, and C. Finn, “Roboreward: General-purpose vision-language reward models for robotics,” arXiv preprint arXiv:2601.00675, 2026

work page arXiv 2026
[12]

Robo-dopamine: General process reward modeling for high-precision robotic manipulation

H. Tan, S. Chen, Y. Xu, Z. Wang, Y. Ji, C. Chi, Y. Lyu, Z. Zhao et al., “Robo-dopamine: General process reward modeling for high-precision robotic manipulation,” arXiv preprint arXiv:2512.23703, 2025

work page arXiv 2025
[13]

Position: Good embodied reward models need bad behavior data

R. Tian, Y. Wu, and A. Bajcsy, “Position: Good embodied reward models need bad behavior data,” Carnegie Mellon University, Tech. Rep., 2026. [Online]. Available: https://cmu-intentlab.github.io/pdf/ tian icml 26 position.pdf

work page 2026
[14]

Algorithms for inverse reinforcement learning

A. Y. Ng and S. J. Russell, “Algorithms for inverse reinforcement learning,” in International Conference on Machine Learning (ICML), 2000

work page 2000
[15]

Apprenticeship learning via inverse reinforcement learning

P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in International Conference on Machine Learning (ICML), 2004

work page 2004
[16]

Maximum entropy inverse reinforcement learning

B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning,” in AAAI Conference on Artificial Intelligence, 2008

work page 2008
[17]

Guided cost learning: Deep inverse optimal control via policy optimization

C. Finn, S. Levine, and P. Abbeel, “Guided cost learning: Deep inverse optimal control via policy optimization,” in International Conference on Machine Learning (ICML), 2016

work page 2016
[18]

Feature expansive reward learning: Rethinking human input

A. Bobu, M. Wiggert, C. Tomlin, and A. D. Dragan, “Feature expansive reward learning: Rethinking human input,” in ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2021

work page 2021
[19]

Generative adversarial imitation learning

J. Ho and S. Ermon, “Generative adversarial imitation learning,” in Advances in Neural Information Processing Systems (NeurIPS), 2016

work page 2016
[20]

Socially compliant navigation through raw depth inputs with generative adversarial imitation learning

L. Tai, J. Zhang, M. Liu, and W. Burgard, “Socially compliant navigation through raw depth inputs with generative adversarial imitation learning,” in International Conference on Robotics and Automation (ICRA), 2018

work page 2018
[21]

Learning robust rewards with adversarial inverse reinforcement learning

J. Fu, K. Luo, and S. Levine, “Learning robust rewards with adversarial inverse reinforcement learning,” in International Conference on Learning Representations (ICLR), 2018

work page 2018
[22]

Variational inverse control with events: A general framework for data-driven reward definition

J. Fu, A. Singh, D. Ghosh, L. Yang, and S. Levine, “Variational inverse control with events: A general framework for data-driven reward definition,” in Advances in Neural Information Processing Systems (NeurIPS), 2018

work page 2018
[23]

Robot policy learning with temporal optimal transport reward

Y. Fu, H. Zhang, D. Wu, W. Xu, and B. Boulet, “Robot policy learning with temporal optimal transport reward,” in Conference on Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[24]

A smooth sea never made a skilled SAILOR: Robust imitation via learning to search

A. K. Jain, V. Mohta, S. Kim, A. Bhardwaj, J. Ren, Y. Feng, S. Choudhury, and G. Swamy, “A smooth sea never made a skilled SAILOR: Robust imitation via learning to search,” in Conference on Neural Information Processing Systems (NeurIPS), 2025

work page 2025
[25]

Deep reinforcement learning from human preferences

P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” in Advances in Neural Information Processing Systems (NeurIPS), 2017

work page 2017
[26]

Active preference-based learning of reward functions

D. Sadigh, A. D. Dragan, S. S. Sastry, and S. A. Seshia, “Active preference-based learning of reward functions,” in Robotics: Science and Systems (RSS), 2017

work page 2017
[27]

Learning from physical human corrections, one feature at a time

A. Bajcsy, D. P. Losey, M. K. O’Malley, and A. D. Dragan, “Learning from physical human corrections, one feature at a time,” in International Conference on Human-Robot Interaction (HRI), 2018

work page 2018
[28]

Active preference-based gaussian process regression for reward learning

E. Biyik, N. Huynh, M. J. Kochenderfer, and D. Sadigh, “Active preference-based gaussian process regression for reward learning,” in Robotics: Science and Systems (RSS), 2020

work page 2020
[29]

Pebble: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training

K. Lee, L. Smith, and P. Abbeel, “Pebble: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training,” in International Conference on Machine Learning (ICML), 2021

work page 2021
[30]

Few-shot preference learning for human-in-the-loop rl

J. Hejna and D. Sadigh, “Few-shot preference learning for human-in-the-loop rl,” in Conference on Robot Learning (CoRL), 2022

work page 2022
[31]

Learning multimodal rewards from rankings

V. Myers, E. Biyik, N. Anari, and D. Sadigh, “Learning multimodal rewards from rankings,” in Conference on Robot Learning (CoRL), 2021

work page 2021
[32]

Z. Yang, M. Jun, J. Tien, S. J. Russell, A. Dragan, and E. Bıyık, “Trajectory improvement and reward learning from comparative language feedback,” in Conference on Robot Learning (CoRL), 2024
[33]

Y. Korkmaz and E. Bıyık, “Mile: Model-based intervention learning,” in International Conference on Robotics and Automation (ICRA), 2025
[34]

J. Kwok, C. Agia, R. Sinha, M. Foutter, S. Li, I. Stoica, A. Mirhoseini, and M. Pavone, “Robomonkey: Scaling test-time sampling and verification for vision-language-action models,” in Conference on Robot Learning (CoRL), 2025
[35]

Y. Wang, Z. Sun, J. Zhang, Z. Xian, E. Biyik, D. Held, and Z. Erickson, “Rl-vlm-f: Reinforcement learning from vision language foundation model feedback,” in International Conference on Machine Learning (ICML), 2024
[36]

Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan et al., “Eureka: Human-level reward design via coding large language models,” in International Conference on Learning Representations (ICLR), 2024
[37]

W. Yu, N. Gileadi, C. Fu, S. Kirmani, K.-H. Lee, M. Gonzalez Arenas, H.-T. Lewis Chiang, T. Erez et al., “Language to rewards for robotic skill synthesis,” in Conference on Robot Learning (CoRL), 2023
[38]

T. Xie, S. Zhao, C. H. Wu, Y. Liu, Q. Luo, V. Zhong, Y. Yang, and T. Yu, “Text2reward: Reward shaping with language models for reinforcement learning,” in International Conference on Learning Representations (ICLR), 2024
[39]

M. Kwon, S. M. Xie, K. Bullard, and D. Sadigh, “Reward design with language models,” in International Conference on Learning Representations (ICLR), 2023
[40]

M. Hwang, A. Forsey-Smerek, N. Dennler, and A. Bobu, “Masked irl: Llm-guided reward disambiguation from demonstrations and language,” arXiv preprint arXiv:2511.14565, 2025
[41]

Y. Cui, S. Niekum, A. Gupta, V. Kumar, and A. Rajeswaran, “Can foundation models perform zero-shot task specification for robot manipulation?” in Learning for Dynamics and Control Conference (L4DC), 2022
[42]

P. Mahmoudieh, D. Pathak, and T. Darrell, “Zero-shot reward specification via grounded natural language,” in International Conference on Machine Learning (ICML), 2022
[43]

S. A. Sontakke, J. Zhang, S. Arnold, K. Pertsch, E. Biyik, D. Sadigh, C. Finn, and L. Itti, “Roboclip: One demonstration is enough to learn robot policies,” in Advances in Neural Information Processing Systems (NeurIPS), 2023
[44]

Y. Du, K. Konyushkova, M. Denil, A. Raju, J. Landon, F. Hill, N. de Freitas, and S. Cabi, “Vision-language models as success detectors,” in Conference on Lifelong Learning Agents, 2023
[45]

Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang, “Vip: Towards universal visual reward and representation via value-implicit pre-training,” in International Conference on Learning Representations (ICLR), 2023
[46]

A. Adeniji, A. Xie, C. Sferrazza, Y. Seo, S. James, and P. Abbeel, “Language reward modulation for pretraining reinforcement learning,” arXiv preprint arXiv:2308.12270, 2024
[47]

Y. Fu, H. Zhang, D. Wu, W. Xu, and B. Boulet, “FuRL: Visual-language models as fuzzy rewards for reinforcement learning,” in International Conference on Machine Learning (ICML), 2024
[48]

L. Guan, Y. Zhou, D. Liu, Y. Zha, H. B. Amor, and S. Kambhampati, “Task success is not enough: Investigating the use of video-language models as behavior critics for catching undesirable agent behaviors,” in Conference on Language Modeling (COLM), 2024
[49]

J. Rocamonde, V. Montesinos, E. Nava, E. Perez, and D. Lindner, “Vision-language models are zero-shot reward models for reinforcement learning,” in International Conference on Learning Representations (ICLR), 2024
[50]

S. Chen, C. Harrison, Y.-C. Lee, A. J. Yang, Z. Ren, L. J. Ratliff, J. Duan, D. Fox et al., “Topreward: Token probabilities as hidden zero-shot rewards for robotics,” arXiv preprint arXiv:2602.19313, 2026
[51]

L. Fan, G. Wang, Y. Jiang, A. Mandlekar, Y. Yang, H. Zhu, A. Tang, D.-A. Huang et al., “Minedojo: Building open-ended embodied agents with internet-scale knowledge,” in Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022
[52]

K. Nottingham, P. Ammanabrolu, A. Suhr, Y. Choi, H. Hajishirzi, S. Singh, and R. Fox, “Do embodied agents dream of pixelated sheep?: Embodied decision making using language guided world modelling,” in International Conference on Machine Learning (ICML), 2023
[53]

T. Nam, J. Lee, J. Zhang, S. J. Hwang, J. J. Lim, and K. Pertsch, “Lift: Unsupervised reinforcement learning with foundation models as teachers,” arXiv preprint arXiv:2312.08958, 2023
[54]

Y. J. Ma, W. Liang, V. Som, V. Kumar, A. Zhang, O. Bastani, and D. Jayaraman, “Liv: Language-image representations and rewards for robotic control,” in International Conference on Machine Learning (ICML), 2023
[55]

K.-H. Hung, P.-C. Lo, J.-F. Yeh, H.-Y. Hsu, Y.-T. Chen, and W. H. Hsu, “VICtor: Learning hierarchical vision-instruction correlation rewards for long-horizon manipulation,” in International Conference on Learning Representations (ICLR), 2025
[56]

C. Kim, M. Heo, D. Lee, H. Lee, J. Shin, J. J. Lim, and K. Lee, “Subtask-aware visual reward learning from segmented demonstrations,” in International Conference on Learning Representations (ICLR), 2025
[57]

P. Budzianowski, E. Wiśnios, G. Góral, I. Kulakov, V. Petrenko, and K. Walas, “Opengvl: Benchmarking visual temporal progress for data curation,” arXiv preprint arXiv:2509.17321, 2025
[58]

S. K. S. Ghasemipour, A. Wahid, J. Tompson, P. R. Sanketi, and I. Mordatch, “Self-improving embodied foundation models,” in Advances in Neural Information Processing Systems (NeurIPS), 2025
[59]

P. Intelligence, A. Amin, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian et al., “$\pi^{*}_{0.6}$: A VLA that learns from experience,” arXiv preprint arXiv:2511.14759, 2025
[60]

J. Zhang, C. Qian, H. Sun, H. Lu, D. Wang, L. Xue, and H. Liu, “Progresslm: Towards progress reasoning in vision-language models,” arXiv preprint arXiv:2601.15224, 2026
[61]

P. Atreya, K. Pertsch, T. Lee, M. J. Kim, A. Jain, A. Kuramshin, C. Eppner, C. Neary et al., “Roboarena: Distributed real-world evaluation of generalist robot policies,” in Conference on Robot Learning (CoRL), 2025
[62]

O. X.-E. Collaboration, A. O’Neill, A. Rehman, A. Gupta, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee et al., “Open X-Embodiment: Robotic learning datasets and RT-X models,” in International Conference on Robotics and Automation (ICRA), 2024
[63]

A. W. C. contributors, “Agibot world colosseum,” 2024
[64]

D. Damen, H. Doughty, G. M. Farinella, A. Furnari, J. Ma, E. Kazakos, D. Moltisanti, J. Munro et al., “Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100,” International Journal of Computer Vision (IJCV), vol. 130, pp. 33–55, 2022
[65]

H.-S. Fang, H. Fang, Z. Tang, J. Liu, C. Wang, J. Wang, H. Zhu, and C. Lu, “Rh20t: A comprehensive robotic dataset for learning diverse skills in one-shot,” arXiv preprint arXiv:2307.00595, 2023
[66]

B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, “LIBERO: Benchmarking knowledge transfer for lifelong robot learning,” arXiv preprint arXiv:2306.03310, 2023
[67]

Z. Zhou, P. Atreya, A. Lee, H. R. Walke, O. Mees, and S. Levine, “Autonomous improvement of instruction following skills via foundation models,” in Conference on Robot Learning (CoRL), 2024
[68]

Z. Lin, J. Duan, H. Fang, D. Fox, R. Krishna, C. Tan, and B. Wen, “Failsafe: Reasoning and recovery from failures in vision-language-action models,” arXiv preprint arXiv:2510.01642, 2025
[69]

M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” in International Conference on Machine Learning (ICML), 2017
[70]

Y. Huang, S. Zou, J. Zhang, X. Liu, R. Hu, and K. Xu, “Adapower: Specializing world foundation models for predictive manipulation,” arXiv preprint arXiv:2512.035358, 2025
[71]

C. Muslimani, K. Johnstonbaugh, S. Chandramouli, S. Booth, W. B. Knox, and M. E. Taylor, “Towards improving reward design in RL: A reward alignment metric for RL practitioners,” in Reinforcement Learning Conference (RLC), 2025
[72]

W. Lu, M. Ye, Z. Ye, R. Tao, S. Yang, and B. Zhao, “Robofac: A comprehensive framework for robotic failure analysis and correction,” arXiv preprint arXiv:2505.12224, 2025
[73]

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations (ICLR), 2022
[74]

A. Wagenmaker, M. Nakamoto, Y. Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine, “Steering your diffusion policy with latent space reinforcement learning,” in Conference on Robot Learning (CoRL), 2025
[75]

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom et al., “$\pi_0$: A vision-language-action flow model for general robot control,” arXiv preprint arXiv:2410.24164, 2024
[76]

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama et al., “DROID: A large-scale in-the-wild robot manipulation dataset,” arXiv preprint arXiv:2403.12945, 2024
[77]

S. Ye, Y. Ge, K. Zheng, S. Gao, S. Yu, G. Kurian, S. Indupuru, Y. L. Tan et al., “World action models are zero-shot policies,” arXiv preprint arXiv:2602.15922, 2026
[78]

I. Kostrikov, A. Nair, and S. Levine, “Offline reinforcement learning with implicit q-learning,” arXiv preprint arXiv:2110.06169, 2021
[79]

C. Lynch, M. Khansari, T. Xiao, V. Kumar, J. Tompson, S. Levine, and P. Sermanet, “Learning latent plans from play,” in Conference on Robot Learning (CoRL), 2019
[80]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” in International Conference on Computer Vision (ICCV), 2023
