Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Anthony Liang, Yigit Korkmaz, Jiahui Zhang, Minyoung Hwang, Abrar Anwar, Sidhant Kaushik, Aditya Shah, Alex S. Huang, Luke Zettlemoyer, Dieter Fox, Yu Xiang, Anqi Li, Andreea Bobu, Abhishek Gupta, Stephen Tu, Erdem Biyik, Jesse Zhang

classification 💻 cs.RO cs.AI cs.LG

keywords reward, robometer, progress, trajectories, across, expert, robot, supervision

Abstract

General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.
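
The abstract names the two loss terms but does not give their formulas. The sketch below is one plausible reading, not the authors' implementation. It assumes a mean-squared-error progress loss against linear-in-time targets on an expert trajectory, and a Bradley-Terry style logistic preference loss on scalar trajectory scores. The function names and the `pref_weight` parameter are hypothetical.

```python
import math
from typing import Sequence


def progress_loss(pred_progress: Sequence[float]) -> float:
    """Frame-level loss on ONE expert trajectory.

    Assumption: the target progress rises linearly from 0 to 1 over the
    trajectory, and the loss is mean squared error.
    """
    n = len(pred_progress)
    if n < 2:
        raise ValueError("need at least two frames")
    targets = [t / (n - 1) for t in range(n)]
    return sum((p - y) ** 2 for p, y in zip(pred_progress, targets)) / n


def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Trajectory-comparison loss for two trajectories of the same task.

    Assumption: Bradley-Terry form, -log(sigmoid(s_pref - s_rej)),
    computed as a numerically stable softplus of the negated margin.
    """
    margin = score_preferred - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def dual_objective(
    expert_pred_progress: Sequence[float],
    score_preferred: float,
    score_rejected: float,
    pref_weight: float = 1.0,  # hypothetical weighting between the two terms
) -> float:
    """Progress term anchors magnitude; preference term orders trajectories."""
    return progress_loss(expert_pred_progress) + pref_weight * preference_loss(
        score_preferred, score_rejected
    )


if __name__ == "__main__":
    # A perfect progress prediction and a correctly ordered pair give a small loss.
    print(dual_objective([0.0, 0.5, 1.0], score_preferred=2.0, score_rejected=-1.0))
```

In this sketch only the expert trajectory contributes to the progress term, while failed or suboptimal trajectories enter through the preference term alone. That matches the abstract's description of anchoring reward magnitude on expert data and learning ordering from failures.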

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies
cs.RO 2026-05 unverdicted novelty 7.0

DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.
Reinforcing VLAs in Task-Agnostic World Models
cs.AI 2026-05 unverdicted novelty 6.0

RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.
Grounded World Model for Semantically Generalizable Planning
cs.RO 2026-04 conditional novelty 6.0

A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
ARM: Advantage Reward Modeling for Long-Horizon Manipulation
cs.RO 2026-04 unverdicted novelty 6.0

ARM trains reward models on Progressive/Regressive/Stagnant labels to enable adaptive reweighting in offline RL, reaching 99.4% success on towel-folding with minimal human intervention.
RLDX-1 Technical Report
cs.RO 2026-05 unverdicted novelty 4.0

RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.
RLDX-1 Technical Report
cs.RO 2026-05 unverdicted novelty 4.0

RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.