Recognition: 2 Lean theorem links
ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation
Pith reviewed 2026-05-12 01:36 UTC · model grok-4.3
The pith
ProcVLM estimates robotic task progress by first inferring the remaining atomic actions and then scoring intra-stage visual change.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ProcVLM grounds progress estimation in procedural structure and intra-stage visual change, adopting a reasoning-before-estimation paradigm that infers the remaining atomic actions before estimating task progress. Supervision is created by synthesizing frame-level subtask-semantic annotations across 30 embodied datasets, assigning progress budgets by subtask structure, and distributing each budget proportionally to measured visual change inside the subtask. The resulting model is trained on a large corpus of annotated frames with progress estimation as the central objective alongside action segmentation and future planning.
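As a concrete illustration of the label-synthesis recipe, here is a minimal Python sketch. It assumes per-frame embeddings from some frozen visual encoder and known subtask boundaries; the uniform-budget choice and the embedding-difference change measure are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def synthesize_progress_labels(feats, segments, budgets=None):
    """Per-frame progress labels in [0, 1] (illustrative, not the paper's exact pipeline).

    feats:    (T, D) array of frame embeddings from some frozen visual encoder.
    segments: ordered list of (start, end) frame indices, one pair per subtask.
    budgets:  per-subtask progress budgets summing to 1; uniform if None.
    """
    labels = np.zeros(len(feats))
    if budgets is None:
        # equal budget per subtask: one simple reading of "by subtask structure"
        budgets = np.full(len(segments), 1.0 / len(segments))
    consumed = 0.0
    for (s, e), b in zip(segments, budgets):
        # intra-subtask visual change: norms of successive embedding differences
        diffs = np.linalg.norm(np.diff(feats[s:e + 1], axis=0), axis=1)
        cum = np.concatenate([[0.0], np.cumsum(diffs)])
        frac = cum / max(cum[-1], 1e-8)        # fraction of this subtask's change so far
        labels[s:e + 1] = consumed + b * frac  # distribute the subtask's budget
        consumed += b
    return labels
```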
What carries the argument
The reasoning-before-estimation paradigm, which first infers remaining atomic actions from the current frame and then assigns progress based on subtask structure plus intra-subtask visual change.
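A minimal sketch of what the two-stage paradigm could look like at inference time, assuming a generic `vlm.generate(image=..., prompt=...)` interface; the prompt wording and interface are hypothetical, not the paper's actual API.

```python
from typing import Protocol

class VLM(Protocol):
    def generate(self, image: object, prompt: str) -> str: ...

def estimate_progress(vlm: VLM, frame: object, task: str) -> float:
    """Two-stage reasoning-before-estimation query (illustrative prompts)."""
    # Stage 1: reason about procedural state before committing to a number
    remaining = vlm.generate(
        image=frame,
        prompt=f"Task: {task}. List the atomic actions that remain to be done.",
    )
    # Stage 2: estimate progress conditioned on the inferred remaining actions
    answer = vlm.generate(
        image=frame,
        prompt=(
            f"Task: {task}. Remaining actions: {remaining}. "
            "Estimate overall task progress as a single number in [0, 1]."
        ),
    )
    return float(answer.strip())
```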
If this is right
- Progress estimates become more discriminative inside individual trajectories, correctly identifying unfinished steps, stagnation, and failure states.
- The same model improves performance on related procedure-aware tasks such as action segmentation and future planning.
- Synthesized procedural supervision from 30 datasets scales to 60 million frames while preserving subtask structure.
- The trained model can serve directly as a dense reward signal for reward-guided policy optimization in robotic manipulation (a minimal shaping sketch follows this list).
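One standard way to consume such progress estimates as a dense reward, though not necessarily the paper's, is potential-based shaping, which densifies the signal without changing the optimal policy. The `progress_fn(obs, task) -> float in [0, 1]` interface is an assumption standing in for a ProcVLM-style estimator.

```python
def shaped_reward(progress_fn, obs, next_obs, task: str, gamma: float = 0.99) -> float:
    """Potential-based shaping with the progress estimate as the potential."""
    return gamma * progress_fn(next_obs, task) - progress_fn(obs, task)
```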
Where Pith is reading between the lines
- The synthesis pipeline could be reused to create similar supervision for non-robotic sequential tasks that have clear procedural stages.
- Progress estimates grounded in remaining actions rather than elapsed time may reduce reward hacking in reinforcement learning settings where agents can stall or loop.
- Combining the reasoning step with existing vision-language models might allow real-time procedural monitoring in deployed robots without additional fine-tuning.
Load-bearing premise
Frame-level subtask annotations synthesized from multiple embodied datasets, with budgets assigned by subtask structure and distributed by visual change, accurately reflect genuine task progress without synthesis artifacts or systematic biases.
What would settle it
Train a policy using ProcVLM progress scores as the dense reward on a held-out long-horizon manipulation benchmark and observe that success rates remain no higher than those obtained with simple time-based or terminal-success rewards.
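For concreteness, the two baseline reward schemes named in this test might look like the sketch below; these definitions are assumptions, and the exact baselines would come from the benchmark itself.

```python
def time_based_reward(t: int, horizon: int) -> float:
    # elapsed-time proxy: a constant per-step increment, credited
    # whether or not the task actually advanced
    return 1.0 / horizon

def terminal_success_reward(done: bool, success: bool) -> float:
    # sparse outcome signal: 1 only at a successful terminal step
    return 1.0 if done and success else 0.0
```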
Original abstract
Long-horizon robotic manipulation requires dense feedback that reflects how a task advances through its procedural stages, not merely whether the final outcome is successful. Existing reward models often rely on trajectory-level success labels or time-based interpolation, which can conflate elapsed time with true task progress and therefore fail to capture unfinished steps, stagnation, and failure states. We present ProcVLM, a progress-aware vision-language model that learns procedure-grounded progress as a dense reward signal for manipulation. Rather than deriving progress from terminal outcomes or temporal proxies, ProcVLM grounds progress estimation in procedural structure and intra-stage visual change, and further adopts a reasoning-before-estimation paradigm that infers the remaining atomic actions before estimating task progress. Specifically, we construct this supervision by synthesizing frame-level subtask-semantic annotations, assigning progress budgets according to subtask structure, and distributing each budget based on intra-subtask visual change. To train ProcVLM at scale, we build a standardized procedural supervision synthesis pipeline and construct ProcCorpus-60M from 30 embodied datasets with 60M annotated frames, from which we derive ProcVQA for procedure-aware pretraining, with progress estimation as the central task alongside action segmentation and future planning. Experiments on ProcVQA and reward-model benchmarks show that ProcVLM improves embodied procedural reasoning and yields more discriminative trajectory-internal progress estimates than representative baselines, supporting its use as a dense reward model for downstream reward-guided policy optimization. Project page: https://procvlm.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ProcVLM, a vision-language model for learning dense, procedure-grounded progress rewards in robotic manipulation. It constructs ProcCorpus-60M (60M frames from 30 embodied datasets) via a synthesis pipeline that generates frame-level subtask annotations, assigns progress budgets by subtask structure, and distributes them proportionally to intra-subtask visual change. ProcVLM is pretrained on the derived ProcVQA dataset using a reasoning-before-estimation paradigm (infer remaining atomic actions before progress estimation) alongside action segmentation and future planning. Experiments on ProcVQA and reward benchmarks report improved procedural reasoning and more discriminative trajectory-internal progress estimates than baselines, positioning it as a dense reward model for policy optimization.
Significance. If the synthesized supervision accurately reflects true procedural progress, ProcVLM could meaningfully advance dense reward modeling for long-horizon manipulation by addressing conflation of time with progress and better capturing unfinished steps or failures. The scale of ProcCorpus-60M and multi-task pretraining on ProcVQA constitute a valuable resource for the community. The reasoning-before-estimation design offers a concrete mechanism that may yield more reliable estimates than direct regression approaches.
major comments (2)
- [§3 (Method), Synthesis Pipeline] The central claim that ProcVLM yields faithful procedure-grounded progress estimates rests on the assumption that progress budgets assigned by subtask structure and allocated by intra-subtask visual change produce supervision that matches actual task advancement; however, visual change is a proxy that can fail for non-visual progress, visually similar actions, or camera-dominated signals, introducing potential systematic bias into the 60M-frame corpus and ProcVQA labels.
- [§4 (Experiments)] The reported gains in discriminative progress estimates and ProcVQA performance are presented without ablations that isolate the reasoning-before-estimation component from the synthesis pipeline itself, nor comparisons against methods using independent ground-truth progress annotations, making it difficult to confirm that improvements arise from procedure-grounding rather than artifacts of the self-generated labels.
minor comments (2)
- [Abstract] The phrase 'representative baselines' is used without enumeration, which obscures the strength of the comparative claims for readers evaluating the reward-model results.
- [Notation throughout] The terms 'progress budget per subtask' and 'intra-subtask visual change metric' are introduced without explicit formulas or pseudocode for their computation, which could be clarified in the method section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with point-by-point responses and indicate planned revisions.
Point-by-point responses
- Referee: [§3 (Method), Synthesis Pipeline] The central claim that ProcVLM yields faithful procedure-grounded progress estimates rests on the assumption that progress budgets assigned by subtask structure and allocated by intra-subtask visual change produce supervision that matches actual task advancement; however, visual change is a proxy that can fail for non-visual progress, visually similar actions, or camera-dominated signals, introducing potential systematic bias into the 60M-frame corpus and ProcVQA labels.
Authors: We agree that intra-subtask visual change is an imperfect proxy and can introduce systematic bias for non-visual progress, visually similar actions, or camera-dominated signals. Our synthesis pipeline was designed for scalability across 30 diverse embodied datasets where manual or sensor-based progress labels are unavailable. We will add a dedicated limitations paragraph in the revised manuscript discussing these failure modes, their potential impact on label fidelity, and the conditions under which the proxy is expected to be reliable for visual manipulation tasks. revision: partial
- Referee: [§4 (Experiments)] The reported gains in discriminative progress estimates and ProcVQA performance are presented without ablations that isolate the reasoning-before-estimation component from the synthesis pipeline itself, nor comparisons against methods using independent ground-truth progress annotations, making it difficult to confirm that improvements arise from procedure-grounding rather than artifacts of the self-generated labels.
Authors: We acknowledge the value of isolating the reasoning-before-estimation component. We will add an ablation in the revised experiments that trains a direct-estimation variant (removing the reasoning step) while holding the synthesis pipeline fixed, to quantify its contribution to ProcVQA performance and progress discriminativeness. However, comparisons against methods using independent ground-truth progress annotations are not feasible, as no such large-scale ground-truth datasets exist for the 30 embodied datasets in ProcCorpus-60M; the synthesis pipeline was developed precisely to provide scalable supervision in their absence. We will clarify this design motivation in the experiments section. revision: partial
- Not planned: direct experimental comparisons to methods trained on independent ground-truth progress annotations, as no such large-scale annotated datasets exist for the tasks in ProcCorpus-60M.
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper explicitly constructs synthetic supervision by synthesizing frame-level subtask-semantic annotations, assigning progress budgets according to subtask structure, and distributing budgets based on intra-subtask visual change, then trains ProcVLM to predict these labels via a standard supervised vision-language modeling pipeline on ProcCorpus-60M and ProcVQA. This setup does not reduce any prediction to its inputs by construction, as the model learns a generalizable mapping from visual observations to progress scores rather than tautologically echoing the synthesis rules. No equations or claims equate the output progress estimate directly to the label-generation procedure, and central claims are supported by external benchmark evaluations rather than self-referential definitions. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing elements in the provided derivation.
Axiom & Free-Parameter Ledger
free parameters (2)
- progress budget per subtask
- intra-subtask visual change metric
axioms (2)
- Domain assumption: Intra-subtask visual change is a monotonic and unbiased indicator of progress within each procedural stage.
- Domain assumption: Synthesized frame-level subtask annotations from 30 heterogeneous embodied datasets preserve semantic consistency across sources.
invented entities (3)
- ProcVLM: no independent evidence
- ProcCorpus-60M: no independent evidence
- ProcVQA: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Quoted passage: "we define progress as the normalized accumulation of local visual change weighted by subtask duration: p(t) = ∫₀ᵗ w(τ) r(τ) dτ / ∫₀ᵀ w(τ) r(τ) dτ, where w(τ) = clip(K(e_k − s_k)/T, 0.75, 1.25) and r(τ) = ‖ϕ̇(τ)‖ / ∫₀ᵀ ‖ϕ̇(u)‖ du"
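A numerical sketch of this definition, under the assumptions that ϕ(t) is a per-frame feature vector, frame-to-frame difference norms approximate ‖ϕ̇‖, and the numerator integrates over [0, t] while the denominator integrates over the full horizon [0, T] so that p(T) = 1:

```python
import numpy as np

def procedure_grounded_progress(phi, segments, clip_lo=0.75, clip_hi=1.25):
    """Discrete sketch of p(t) with K subtasks over T frames.

    phi:      (T, D) per-frame feature vectors; frame differences stand in for phi-dot.
    segments: list of (s_k, e_k) frame indices, one pair per subtask.
    """
    T, K = len(phi), len(segments)
    rate = np.zeros(T)
    rate[1:] = np.linalg.norm(np.diff(phi, axis=0), axis=1)
    r = rate / max(rate.sum(), 1e-8)       # r(tau): normalized visual-change rate
    w = np.zeros(T)
    for s, e in segments:                  # w(tau): clipped duration-based weight
        w[s:e + 1] = np.clip(K * (e - s) / T, clip_lo, clip_hi)
    num = np.cumsum(w * r)                 # running integral of w * r
    return num / max(num[-1], 1e-8)        # p(t) for every frame index t
```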
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Quoted passage: "ProcVLM grounds progress estimation in procedural structure and intra-stage visual change... synthesizing frame-level subtask-semantic annotations, assigning progress budgets according to subtask structure, and distributing each budget based on intra-subtask visual change"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.