pith. machine review for the scientific record.

arxiv: 2605.08774 · v1 · submitted 2026-05-09 · 💻 cs.RO · cs.LG

Recognition: 2 Lean theorem links

ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 01:36 UTC · model grok-4.3

classification 💻 cs.RO · cs.LG
keywords progress rewards · robotic manipulation · vision-language model · procedure grounding · dense rewards · embodied reasoning · action segmentation

The pith

ProcVLM estimates robotic task progress by first inferring the remaining atomic actions, then scoring intra-stage visual changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a vision-language model that treats progress estimation as a procedure-aware reasoning task rather than a direct mapping from image to scalar. It first identifies which atomic steps are still left in the current subtask, then allocates progress credit according to visible changes within that subtask. This replaces common shortcuts such as elapsed time or final success labels, which often assign credit to stalled or failed trajectories. A sympathetic reader would care because long-horizon manipulation policies need dense, stage-sensitive feedback to learn efficient sequences instead of relying on sparse terminal rewards.

Core claim

ProcVLM grounds progress estimation in procedural structure and intra-stage visual change, adopting a reasoning-before-estimation paradigm that infers the remaining atomic actions before estimating task progress. Supervision is created by synthesizing frame-level subtask-semantic annotations across 30 embodied datasets, assigning progress budgets by subtask structure, and distributing each budget proportionally to measured visual change inside the subtask. The resulting model is trained on a large corpus of annotated frames with progress estimation as the central objective alongside action segmentation and future planning.
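
The abstract specifies this labeling rule only at a high level. Below is a minimal sketch of one plausible reading, assuming per-frame visual embeddings and known subtask boundaries are available; the names `features`, `subtask_bounds`, and the uniform-budget default are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def synthesize_progress_labels(features, subtask_bounds, budgets=None):
    """Budget-then-distribute labeling, sketched from the abstract's description.

    features: (T, D) per-frame visual embeddings from some frozen encoder.
    subtask_bounds: list of (start, end) frame indices, one pair per subtask.
    budgets: per-subtask progress budgets summing to 1 (uniform if None).
    Returns a (T,) array of progress labels in [0, 1].
    """
    T, K = len(features), len(subtask_bounds)
    if budgets is None:
        budgets = np.full(K, 1.0 / K)  # budget assigned by subtask structure
    labels, consumed = np.zeros(T), 0.0
    for (s, e), b in zip(subtask_bounds, budgets):
        # Intra-subtask visual change: norms of frame-to-frame feature deltas.
        deltas = np.linalg.norm(np.diff(features[s:e + 1], axis=0), axis=1)
        cum = np.concatenate([[0.0], np.cumsum(deltas)])
        # Distribute this subtask's budget proportionally to accumulated change.
        frac = cum / cum[-1] if cum[-1] > 0 else np.linspace(0, 1, len(cum))
        labels[s:e + 1] = consumed + b * frac
        consumed += b
    return labels
```

Under this reading, a stalled trajectory accumulates little visual change inside its current subtask, so its label plateaus instead of rising with elapsed time, which is exactly the failure mode of time-based interpolation the paper targets.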

What carries the argument

The reasoning-before-estimation paradigm, which first infers remaining atomic actions from the current frame and then assigns progress based on subtask structure plus intra-subtask visual change.
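
At inference time this amounts to two chained queries. A hedged sketch against a hypothetical `vlm.generate(image, prompt) -> str` interface; the prompt wording and the scalar-parsing step are our illustration, not the paper's protocol.

```python
def estimate_progress(vlm, frame, instruction):
    """Reasoning-before-estimation: infer the to-do list, then score progress."""
    # Stage 1: infer which atomic actions remain from the current observation.
    todo = vlm.generate(
        image=frame,
        prompt=f"Task: {instruction}\nList the atomic actions still remaining.",
    )
    # Stage 2: estimate progress conditioned on the inferred to-do list, so
    # credit tracks procedural state rather than elapsed time.
    answer = vlm.generate(
        image=frame,
        prompt=(f"Task: {instruction}\nRemaining actions: {todo}\n"
                "Answer with overall task progress as a number in [0, 1]."),
    )
    return float(answer.strip())
```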

If this is right

  • Progress estimates become more discriminative inside individual trajectories, correctly identifying unfinished steps, stagnation, and failure states.
  • The same model improves performance on related procedure-aware tasks such as action segmentation and future planning.
  • Synthesized procedural supervision from 30 datasets scales to 60 million frames while preserving subtask structure.
  • The trained model can serve directly as a dense reward signal for reward-guided policy optimization in robotic manipulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The synthesis pipeline could be reused to create similar supervision for non-robotic sequential tasks that have clear procedural stages.
  • Progress estimates grounded in remaining actions rather than elapsed time may reduce reward hacking in reinforcement learning settings where agents can stall or loop.
  • Combining the reasoning step with existing vision-language models might allow real-time procedural monitoring in deployed robots without additional fine-tuning.

Load-bearing premise

Frame-level subtask annotations synthesized from multiple embodied datasets, with budgets assigned by subtask structure and distributed by visual change, accurately reflect genuine task progress without synthesis artifacts or systematic biases.

What would settle it

Train a policy using ProcVLM progress scores as the dense reward on a held-out long-horizon manipulation benchmark. If success rates are no higher than those obtained with simple time-based or terminal-success rewards, the central claim fails; a clear improvement would support it.
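
A minimal harness for that test, sketched under assumptions: a gymnasium-style environment and a `score_fn(obs, instruction)` stand-in for either a trained ProcVLM scorer or a time-based baseline. None of this is the paper's actual training setup.

```python
import gymnasium as gym

class DenseRewardWrapper(gym.Wrapper):
    """Replace the environment's sparse reward with a dense progress signal.

    Rewarding the *delta* in estimated progress (rather than its level) avoids
    paying the agent for standing still at a partially completed state.
    """
    def __init__(self, env, score_fn, instruction):
        super().__init__(env)
        self.score_fn, self.instruction = score_fn, instruction
        self.prev = 0.0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.prev = self.score_fn(obs, self.instruction)
        return obs, info

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        p = self.score_fn(obs, self.instruction)
        reward, self.prev = p - self.prev, p
        return obs, reward, terminated, truncated, info

# Train identical policies under a ProcVLM-backed score_fn and a time-based
# baseline (e.g., step count / horizon), then compare held-out success rates.
```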

Figures

Figures reproduced from arXiv: 2605.08774 by Chengyang Zhang, Hansen Shi, Haoyang Li, Jie Tang, Jing Zhang, Jinkai Zhang, Xiaohan Zhang, Xinlei Guo, Yang Wang, Youhe Feng.

Figure 1. Overview of ProcVLM. We first synthesize frame-wise procedural annotations from robot …
Figure 2. Overview of the procedural supervision synthesis pipeline. Raw episodes are processed …
Figure 3. GPU power consumption and utilization over time across multiple GPUs during the …
Figure 4. Example of Embodied Chain-of-Thought (ECoT) annotation in ProcCorpus. Given multi-view robot observations and a task instruction, ECoT enriches the raw frame with task-centric scene reasoning, completion assessment, future action planning, remaining to-do actions, target-object grounding, and optional discrete action tokens for VLA training.
Figure 5. Training curves of ProcVLM under the two-stage training pipeline. Green denotes Stage 1 …
Figure 6. Representative real-robot rollout records on the JAKA tabletop stack-bowls task. We show …
Figure 7. Zero-shot reward modeling cases on RoboMIND and RoboTwin. ProcVLM provides task …
Figure 8. Zero-shot reward editing on the same video sequence. Left: reward for “put the apple into …
Figure 9. One-shot adaptation case for a close-oven task. ProcVLM adapts from one successful …
Figure 10. One-shot adaptation case for an insert-cylinder task. ProcVLM transfers the demonstrated …
Figure 11. One-shot adaptation case for a move-target task. ProcVLM transfers the demonstrated …
Figure 12. One-shot adaptation case for a put-in-box task. ProcVLM generalizes from one successful …
Original abstract

Long-horizon robotic manipulation requires dense feedback that reflects how a task advances through its procedural stages, not merely whether the final outcome is successful. Existing reward models often rely on trajectory-level success labels or time-based interpolation, which can conflate elapsed time with true task progress and therefore fail to capture unfinished steps, stagnation, and failure states. We present ProcVLM, a progress-aware vision-language model that learns procedure-grounded progress as a dense reward signal for manipulation. Rather than deriving progress from terminal outcomes or temporal proxies, ProcVLM grounds progress estimation in procedural structure and intra-stage visual change, and further adopts a reasoning-before-estimation paradigm that infers the remaining atomic actions before estimating task progress. Specifically, we construct this supervision by synthesizing frame-level subtask-semantic annotations, assigning progress budgets according to subtask structure, and distributing each budget based on intra-subtask visual change. To train ProcVLM at scale, we build a standardized procedural supervision synthesis pipeline and construct ProcCorpus-60M from 30 embodied datasets with 60M annotated frames, from which we derive ProcVQA for procedure-aware pretraining, with progress estimation as the central task alongside action segmentation and future planning. Experiments on ProcVQA and reward-model benchmarks show that ProcVLM improves embodied procedural reasoning and yields more discriminative trajectory-internal progress estimates than representative baselines, supporting its use as a dense reward model for downstream reward-guided policy optimization. Project page: https://procvlm.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ProcVLM, a vision-language model for learning dense, procedure-grounded progress rewards in robotic manipulation. It constructs ProcCorpus-60M (60M frames from 30 embodied datasets) via a synthesis pipeline that generates frame-level subtask annotations, assigns progress budgets by subtask structure, and distributes them proportionally to intra-subtask visual change. ProcVLM is pretrained on the derived ProcVQA dataset using a reasoning-before-estimation paradigm (infer remaining atomic actions before progress estimation) alongside action segmentation and future planning. Experiments on ProcVQA and reward benchmarks report improved procedural reasoning and more discriminative trajectory-internal progress estimates than baselines, positioning it as a dense reward model for policy optimization.

Significance. If the synthesized supervision accurately reflects true procedural progress, ProcVLM could meaningfully advance dense reward modeling for long-horizon manipulation by addressing conflation of time with progress and better capturing unfinished steps or failures. The scale of ProcCorpus-60M and multi-task pretraining on ProcVQA constitute a valuable resource for the community. The reasoning-before-estimation design offers a concrete mechanism that may yield more reliable estimates than direct regression approaches.

major comments (2)
  1. [§3 (Method), Synthesis Pipeline] The central claim that ProcVLM yields faithful procedure-grounded progress estimates rests on the assumption that progress budgets assigned by subtask structure and allocated by intra-subtask visual change produce supervision matching actual task advancement. Visual change, however, is a proxy that can fail for non-visual progress, visually similar actions, or camera-dominated signals, introducing potential systematic bias into the 60M-frame corpus and the ProcVQA labels.
  2. [§4 (Experiments)] The reported gains in discriminative progress estimates and ProcVQA performance are presented without ablations that isolate the reasoning-before-estimation component from the synthesis pipeline itself, and without comparisons against methods using independent ground-truth progress annotations, making it difficult to confirm that improvements arise from procedure grounding rather than from artifacts of the self-generated labels.
minor comments (2)
  1. [Abstract] The phrase 'representative baselines' is used without enumeration, which obscures the strength of the comparative claims for readers evaluating the reward-model results.
  2. [Notation throughout] The terms 'progress budget per subtask' and 'intra-subtask visual change metric' are introduced without explicit formulas or pseudocode; the method section should specify how each is computed, for reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below with point-by-point responses and indicate planned revisions.

Point-by-point responses
  1. Referee: [§3 (Method), Synthesis Pipeline] The central claim that ProcVLM yields faithful procedure-grounded progress estimates rests on the assumption that progress budgets assigned by subtask structure and allocated by intra-subtask visual change produce supervision matching actual task advancement; visual change is a proxy that can fail for non-visual progress, visually similar actions, or camera-dominated signals, introducing potential systematic bias into the 60M-frame corpus and the ProcVQA labels.

    Authors: We agree that intra-subtask visual change is an imperfect proxy and can introduce systematic bias for non-visual progress, visually similar actions, or camera-dominated signals. Our synthesis pipeline was designed for scalability across 30 diverse embodied datasets where manual or sensor-based progress labels are unavailable. We will add a dedicated limitations paragraph in the revised manuscript discussing these failure modes, their potential impact on label fidelity, and the conditions under which the proxy is expected to be reliable for visual manipulation tasks. revision: partial

  2. Referee: [§4 (Experiments)] The reported gains in discriminative progress estimates and ProcVQA performance are presented without ablations that isolate the reasoning-before-estimation component from the synthesis pipeline itself, and without comparisons against methods using independent ground-truth progress annotations, making it difficult to confirm that improvements arise from procedure grounding rather than from artifacts of the self-generated labels.

    Authors: We acknowledge the value of isolating the reasoning-before-estimation component. We will add an ablation in the revised experiments that trains a direct-estimation variant (removing the reasoning step) while holding the synthesis pipeline fixed, to quantify its contribution to ProcVQA performance and progress discriminativeness. However, comparisons against methods using independent ground-truth progress annotations are not feasible, as no such large-scale ground-truth datasets exist for the 30 embodied datasets in ProcCorpus-60M; the synthesis pipeline was developed precisely to provide scalable supervision in their absence. We will clarify this design motivation in the experiments section. revision: partial

standing simulated objections not resolved
  • Direct experimental comparisons to methods trained on independent ground-truth progress annotations, as no such large-scale annotated datasets exist for the tasks in ProcCorpus-60M.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper explicitly constructs synthetic supervision by synthesizing frame-level subtask-semantic annotations, assigning progress budgets according to subtask structure, and distributing budgets based on intra-subtask visual change, then trains ProcVLM to predict these labels via a standard supervised vision-language modeling pipeline on ProcCorpus-60M and ProcVQA. This setup does not reduce any prediction to its inputs by construction, as the model learns a generalizable mapping from visual observations to progress scores rather than tautologically echoing the synthesis rules. No equations or claims equate the output progress estimate directly to the label-generation procedure, and central claims are supported by external benchmark evaluations rather than self-referential definitions. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing elements in the provided derivation.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 3 invented entities

The central claim rests on the validity of the procedural supervision synthesis pipeline and the assumption that visual change within subtasks is a faithful proxy for progress. The VLM itself contains the usual large number of learned parameters, but the load-bearing free parameters are the progress budget assignments and the choice of intra-subtask visual features used for distribution.

free parameters (2)
  • progress budget per subtask
    Assigned according to subtask structure; exact values or fitting procedure not specified in abstract.
  • intra-subtask visual change metric
    Used to distribute each budget; definition and weighting unknown from abstract.
axioms (2)
  • domain assumption: Intra-subtask visual change is a monotonic and unbiased indicator of progress within each procedural stage.
    Invoked when distributing progress budgets based on visual change.
  • domain assumption: Synthesized frame-level subtask annotations from 30 heterogeneous embodied datasets preserve semantic consistency across sources.
    Required for the standardized synthesis pipeline to produce usable supervision.
invented entities (3)
  • ProcVLM (no independent evidence)
    purpose: Progress-aware vision-language model for dense reward generation
    New model introduced in the paper.
  • ProcCorpus-60M (no independent evidence)
    purpose: Large-scale annotated dataset for procedure-grounded pretraining
    Constructed via the authors' synthesis pipeline from 30 datasets.
  • ProcVQA (no independent evidence)
    purpose: Procedure-aware VQA dataset with progress estimation as central task
    Derived from ProcCorpus-60M for pretraining.

pith-pipeline@v0.9.0 · 5594 in / 1857 out tokens · 65111 ms · 2026-05-12T01:36:46.985163+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    "we define progress as the normalized accumulation of local visual change weighted by subtask duration:"

    $$p(t) = \frac{\int_0^t w(\tau)\, r(\tau)\, d\tau}{\int_0^T w(\tau)\, r(\tau)\, d\tau}, \qquad w(\tau) = \operatorname{clip}\!\left(\frac{K\,(e_k - s_k)}{T},\ 0.75,\ 1.25\right), \qquad r(\tau) = \frac{\lVert \dot{\phi}(\tau) \rVert}{\int_0^T \lVert \dot{\phi}(u) \rVert\, du}$$

    (integration limits inferred from "normalized accumulation"; a numerical sanity check follows this list)

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem.

    "ProcVLM grounds progress estimation in procedural structure and intra-stage visual change... synthesizing frame-level subtask-semantic annotations, assigning progress budgets according to subtask structure, and distributing each budget based on intra-subtask visual change"
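
As a sanity check on the transcribed progress formula, a toy numerical sketch; the synthetic feature trajectory, equal subtask bounds, and discretization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 100, 4                                    # frames, subtasks
phi = np.cumsum(rng.random((T, 8)), axis=0)      # toy monotone feature trajectory
bounds = [(i * T // K, (i + 1) * T // K) for i in range(K)]

rate = np.linalg.norm(np.diff(phi, axis=0), axis=1)  # ||phi_dot||, length T-1
r = rate / rate.sum()                                # normalized change rate r(tau)
w = np.zeros(T - 1)
for s, e in bounds:                                  # clipped duration weight w(tau)
    w[s:min(e, T - 1)] = np.clip(K * (e - s) / T, 0.75, 1.25)

p = np.cumsum(w * r) / np.sum(w * r)                 # discrete p(t) in [0, 1]
assert np.isclose(p[-1], 1.0) and np.all(np.diff(p) >= 0)  # monotone, ends at 1
```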

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
