Recognition: 2 theorem links · Lean Theorem
Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons
Pith reviewed 2026-05-15 17:54 UTC · model grok-4.3
The pith
Robometer trains generalizable robot reward models by combining frame-level progress with inter-trajectory preferences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Robometer is a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. It is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task. This formulation supports effective learning from both real and augmented failed trajectories on the RBM-1M dataset of over one million trajectories.
What carries the argument
The dual objective of a frame-level progress loss anchored on expert demonstrations plus a trajectory-comparison preference loss that supplies global ordering constraints across same-task trajectories.
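As a rough illustration of the dual objective, the sketch below pairs a per-frame mean-squared-error progress term with a Bradley-Terry style logistic preference term. The abstract does not give the exact loss forms, so the MSE term, the logistic term, the scalar trajectory scores, and the `pref_weight` parameter are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Sequence


def progress_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and target per-frame progress.

    Assumption: progress targets exist only for expert trajectories.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def preference_loss(score_preferred: float, score_other: float) -> float:
    """Negative log-likelihood that the preferred trajectory wins.

    Uses a Bradley-Terry model (an assumption): P(win) = sigmoid(margin).
    """
    margin = score_preferred - score_other
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def dual_objective(
    pred: Sequence[float],
    target: Sequence[float],
    score_preferred: float,
    score_other: float,
    pref_weight: float = 1.0,  # hypothetical weighting between the two terms
) -> float:
    """Progress term plus weighted preference term."""
    return progress_loss(pred, target) + pref_weight * preference_loss(
        score_preferred, score_other
    )


if __name__ == "__main__":
    # Expert trajectory with linearly increasing progress labels (assumed).
    loss = dual_objective(
        pred=[0.1, 0.4, 0.9],
        target=[0.0, 0.5, 1.0],
        score_preferred=0.9,  # e.g. score of a successful trajectory
        score_other=0.2,      # e.g. score of a failed trajectory
    )
    print(f"dual objective: {loss:.4f}")
```

In this reading, the progress term fixes the scale of the reward on expert data, while the preference term only constrains the relative order of two same-task trajectories, which is why it can use failures that have no dense progress labels.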
If this is right
- Reward functions become more generalizable across diverse robot embodiments and tasks.
- Robot learning performance improves across benchmarks and real-world downstream applications.
- Large-scale datasets containing many failed and suboptimal trajectories can be used directly for reward learning.
- Augmented failure trajectories contribute positively to reward quality without dense manual labeling.
Where Pith is reading between the lines
- The preference component could allow reward models to be fine-tuned on new tasks with only a small number of comparative labels rather than full demonstrations.
- Human preference collection on trajectory pairs might be crowdsourced more easily than dense progress annotation, lowering data costs for future datasets.
- The same dual-loss structure could be tested in simulation-to-real transfer settings where failure data is cheap to generate.
- Extending the comparison loss to cross-embodiment trajectory pairs could test whether a single reward model can serve multiple robot platforms without retraining.
Load-bearing premise
Inter-trajectory preference supervision from comparisons imposes reliable global ordering constraints even on ambiguous suboptimal and failure trajectories without introducing significant labeling noise or bias.
What would settle it
Collect fresh human preference labels on a held-out set of trajectories from multiple tasks and embodiments; if the learned rewards fail to match the human ordering on a majority of pairs, the central claim is falsified.
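The test described above amounts to a pairwise agreement rate between the learned reward and held-out human preferences. The sketch below is one way to compute it; the trajectory identifiers, the use of a single scalar reward per trajectory, and the treatment of ties as mismatches are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple


def pairwise_agreement(
    rewards: Dict[str, float],
    human_pairs: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of (preferred, other) pairs the reward orders the same way.

    A pair counts as a match only if the preferred trajectory receives a
    strictly higher reward; ties count as mismatches (an assumption).
    """
    pairs = list(human_pairs)
    if not pairs:
        raise ValueError("at least one labeled pair is required")
    matches = sum(1 for preferred, other in pairs if rewards[preferred] > rewards[other])
    return matches / len(pairs)


def majority_mismatch(agreement: float) -> bool:
    """True when the reward disagrees with humans on more than half the pairs."""
    return agreement < 0.5


if __name__ == "__main__":
    # Hypothetical trajectory-level rewards and human labels.
    rewards = {"traj_a": 0.9, "traj_b": 0.4, "traj_c": 0.1}
    human_pairs = [("traj_a", "traj_b"), ("traj_b", "traj_c"), ("traj_c", "traj_a")]
    rate = pairwise_agreement(rewards, human_pairs)
    print(f"agreement: {rate:.3f}, majority mismatch: {majority_mismatch(rate)}")
```

Reporting the rate per task and per embodiment, rather than only in aggregate, would show whether disagreement is concentrated in the ambiguous failure cases that the load-bearing premise depends on.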
Original abstract
General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations, providing only local, frame-level supervision. While effective for expert demonstrations, this paradigm scales poorly to large-scale robotics datasets where failed and suboptimal trajectories are abundant and assigning dense progress labels is ambiguous. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints across trajectories of the same task, enabling effective learning from both real and augmented failed trajectories. To support this formulation at scale, we curate RBM-1M, a reward-learning dataset comprising over one million trajectories spanning diverse robot embodiments and tasks, including substantial suboptimal and failure data. Across benchmarks and real-world evaluations, Robometer learns more generalizable reward functions than prior methods and improves robot learning performance across a diverse set of downstream applications. Code, model weights, and videos at https://robometer.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Robometer, a scalable reward modeling framework for general-purpose robotic reward models. It combines intra-trajectory progress supervision (frame-level loss anchored on expert data) with inter-trajectory preference supervision (pairwise comparison loss imposing global ordering). The authors curate the RBM-1M dataset containing over one million trajectories across diverse embodiments and tasks, including substantial suboptimal and failure data, and report that the resulting models are more generalizable than prior methods while improving downstream robot learning performance on benchmarks and real-world tasks.
Significance. If the empirical results hold under the dual-objective formulation, the work would be significant for scaling reward learning beyond expert-only datasets, a key bottleneck in robotics. The curation of RBM-1M and the explicit dual loss (progress anchoring plus preference ordering) represent a concrete advance; the public release of code, weights, and videos further strengthens reproducibility and potential impact.
major comments (2)
- [§3.2] §3.2 (Dual Objective): The preference loss term relies on inter-trajectory comparisons over the full RBM-1M set, including ambiguous failure trajectories. The manuscript must provide explicit details on how these preference labels are generated (heuristics, human annotation, or automated proxies) together with quantitative label-consistency metrics (e.g., inter-annotator agreement or sensitivity to near-equivalent failures). Without such verification, contradictory gradients from noisy preferences could undermine the claimed global ordering and generalization.
- [§4.3] §4.3 and Table 2 (Ablation Studies): The reported downstream gains are attributed to the combination of progress and preference losses, yet the ablation isolating the contribution of the preference term on failure trajectories is not shown. Adding this ablation (or reporting the performance drop when the preference loss is removed) is required to confirm that the global-ordering component, rather than the progress anchor alone, drives the claimed improvements.
minor comments (2)
- [Abstract] The abstract states that the method works on 'real and augmented failed trajectories,' but the augmentation procedure and its effect on label quality are only briefly mentioned; a dedicated paragraph or figure clarifying the augmentation pipeline would improve clarity.
- [Figure 3] Figure 3 (Reward visualization): The color scale and normalization used for reward heatmaps across different tasks are not stated, making direct visual comparison between methods difficult.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the dual-objective formulation and ablation studies. We address each point below and will revise the manuscript accordingly.
- Referee: [§3.2] §3.2 (Dual Objective): The preference loss term relies on inter-trajectory comparisons over the full RBM-1M set, including ambiguous failure trajectories. The manuscript must provide explicit details on how these preference labels are generated (heuristics, human annotation, or automated proxies) together with quantitative label-consistency metrics (e.g., inter-annotator agreement or sensitivity to near-equivalent failures). Without such verification, contradictory gradients from noisy preferences could undermine the claimed global ordering and generalization.
Authors: We agree that explicit details on preference label generation are necessary for reproducibility and to address potential concerns about noisy labels. In the revised manuscript, we will expand §3.2 to describe that preference labels are generated via automated proxies based on task success indicators and trajectory efficiency metrics from the RBM-1M curation process, with a subset of pairs validated through human annotation. We will also include quantitative label-consistency metrics, such as inter-annotator agreement on the validated subset and sensitivity analysis for near-equivalent failures, to demonstrate that the global ordering remains stable. revision: yes
- Referee: [§4.3] §4.3 and Table 2 (Ablation Studies): The reported downstream gains are attributed to the combination of progress and preference losses, yet the ablation isolating the contribution of the preference term on failure trajectories is not shown. Adding this ablation (or reporting the performance drop when the preference loss is removed) is required to confirm that the global-ordering component, rather than the progress anchor alone, drives the claimed improvements.
Authors: We acknowledge that the current ablations in §4.3 and Table 2 do not isolate the preference term's contribution specifically on failure trajectories. In the revised manuscript, we will add a new ablation experiment comparing the full dual-objective model against a variant trained only with the progress loss (including on failure data). We will report the performance drop on downstream benchmarks to confirm that the inter-trajectory preference supervision drives the observed generalization gains. revision: yes
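The first response above promises inter-annotator agreement on the human-validated subset of preference pairs. The rebuttal does not name a statistic; the sketch below uses Cohen's kappa for two annotators as one common choice, and the label values ("first", "second") are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed agreement: share of items both annotators labeled identically.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Each label says which trajectory of a pair the annotator preferred.
    annotator_1 = ["first", "first", "second", "second", "first"]
    annotator_2 = ["first", "second", "second", "second", "first"]
    print(f"kappa: {cohens_kappa(annotator_1, annotator_2):.3f}")
```

A kappa near zero on near-equivalent failure pairs would indicate that those labels are close to chance, which is the situation the referee's first major comment is concerned about.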
Circularity Check
No circularity in Robometer's dual-objective reward modeling
The paper's central derivation uses an externally curated RBM-1M dataset of over one million trajectories and a dual loss (frame-level progress anchored on expert data plus inter-trajectory preference loss) to learn rewards. No step reduces a claimed prediction to a fitted parameter by construction, invokes self-citation as load-bearing uniqueness, or renames a known result; the method remains self-contained against external benchmarks and downstream evaluations.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: ROBOMETER is trained with a dual objective: ... preference-prediction loss over trajectory comparisons
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 6 Pith papers
- DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies
  DreamAvoid uses a Dream Trigger, Action Proposer, and Dream Evaluator trained on success/failure/boundary data to let VLA policies avoid critical-phase failures via test-time future dreaming.
- Reinforcing VLAs in Task-Agnostic World Models
  RAW-Dream lets VLAs learn new tasks in zero-shot imagination by using a world model pre-trained only on task-free behaviors and an unmodified VLM to supply rewards, with dual-noise verification to limit hallucinations.
- Grounded World Model for Semantically Generalizable Planning
  A vision-language-aligned world model turns visuomotor MPC into a language-following planner that reaches 87% success on 288 unseen semantic tasks where standard VLAs drop to 22%.
- ARM: Advantage Reward Modeling for Long-Horizon Manipulation
  ARM trains reward models on Progressive/Regressive/Stagnant labels to enable adaptive reweighting in offline RL, reaching 99.4% success on towel-folding with minimal human intervention.
- RLDX-1 Technical Report
  RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.
- RLDX-1 Technical Report
  RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.