pith. machine review for the scientific record.

arxiv: 2605.08774 · v1 · submitted 2026-05-09 · 💻 cs.RO · cs.LG

Recognition: 2 Lean theorem links

ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 01:36 UTC · model grok-4.3

classification 💻 cs.RO · cs.LG
keywords progress rewards · robotic manipulation · vision-language model · procedure grounding · dense rewards · embodied reasoning · action segmentation

The pith

ProcVLM estimates robotic task progress by first inferring the remaining atomic actions, then scoring intra-stage visual changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a vision-language model that treats progress estimation as a procedure-aware reasoning task rather than a direct mapping from image to scalar. It first identifies which atomic steps are still left in the current subtask, then allocates progress credit according to visible changes within that subtask. This replaces common shortcuts such as elapsed time or final success labels, which often assign credit to stalled or failed trajectories. A sympathetic reader would care because long-horizon manipulation policies need dense, stage-sensitive feedback to learn efficient sequences instead of relying on sparse terminal rewards.

Core claim

ProcVLM grounds progress estimation in procedural structure and intra-stage visual change, adopting a reasoning-before-estimation paradigm that infers the remaining atomic actions before estimating task progress. Supervision is created by synthesizing frame-level subtask-semantic annotations across 30 embodied datasets, assigning progress budgets by subtask structure, and distributing each budget proportionally to measured visual change inside the subtask. The resulting model is trained on a large corpus of annotated frames with progress estimation as the central objective alongside action segmentation and future planning.
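
The abstract specifies this labeling rule only at a high level. Below is a minimal sketch of one plausible reading, assuming per-frame visual embeddings and known subtask boundaries are available; the names `features`, `subtask_bounds`, and the uniform-budget default are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def synthesize_progress_labels(features, subtask_bounds, budgets=None):
    """Budget-then-distribute labeling, sketched from the abstract's description.

    features: (T, D) per-frame visual embeddings from some frozen encoder.
    subtask_bounds: list of (start, end) frame indices, one pair per subtask.
    budgets: per-subtask progress budgets summing to 1 (uniform if None).
    Returns a (T,) array of progress labels in [0, 1].
    """
    T, K = len(features), len(subtask_bounds)
    if budgets is None:
        budgets = np.full(K, 1.0 / K)  # budget assigned by subtask structure
    labels, consumed = np.zeros(T), 0.0
    for (s, e), b in zip(subtask_bounds, budgets):
        # Intra-subtask visual change: norms of frame-to-frame feature deltas.
        deltas = np.linalg.norm(np.diff(features[s:e + 1], axis=0), axis=1)
        cum = np.concatenate([[0.0], np.cumsum(deltas)])
        # Distribute this subtask's budget proportionally to accumulated change.
        frac = cum / cum[-1] if cum[-1] > 0 else np.linspace(0, 1, len(cum))
        labels[s:e + 1] = consumed + b * frac
        consumed += b
    return labels
```

Under this reading, a stalled trajectory accumulates little visual change inside its current subtask, so its label plateaus instead of rising with elapsed time, which is exactly the failure mode of time-based interpolation the paper targets.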

What carries the argument

The reasoning-before-estimation paradigm, which first infers remaining atomic actions from the current frame and then assigns progress based on subtask structure plus intra-subtask visual change.
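
At inference time this amounts to two chained queries. A hedged sketch against a hypothetical `vlm.generate(image, prompt) -> str` interface; the prompt wording and the scalar-parsing step are our illustration, not the paper's protocol.

```python
def estimate_progress(vlm, frame, instruction):
    """Reasoning-before-estimation: infer the to-do list, then score progress."""
    # Stage 1: infer which atomic actions remain from the current observation.
    todo = vlm.generate(
        image=frame,
        prompt=f"Task: {instruction}\nList the atomic actions still remaining.",
    )
    # Stage 2: estimate progress conditioned on the inferred to-do list, so
    # credit tracks procedural state rather than elapsed time.
    answer = vlm.generate(
        image=frame,
        prompt=(f"Task: {instruction}\nRemaining actions: {todo}\n"
                "Answer with overall task progress as a number in [0, 1]."),
    )
    return float(answer.strip())
```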

If this is right

  • Progress estimates become more discriminative inside individual trajectories, correctly identifying unfinished steps, stagnation, and failure states.
  • The same model improves performance on related procedure-aware tasks such as action segmentation and future planning.
  • Synthesized procedural supervision from 30 datasets scales to 60 million frames while preserving subtask structure.
  • The trained model can serve directly as a dense reward signal for reward-guided policy optimization in robotic manipulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The synthesis pipeline could be reused to create similar supervision for non-robotic sequential tasks that have clear procedural stages.
  • Progress estimates grounded in remaining actions rather than elapsed time may reduce reward hacking in reinforcement learning settings where agents can stall or loop.
  • Combining the reasoning step with existing vision-language models might allow real-time procedural monitoring in deployed robots without additional fine-tuning.

Load-bearing premise

Frame-level subtask annotations synthesized from multiple embodied datasets, with budgets assigned by subtask structure and distributed by visual change, accurately reflect genuine task progress without synthesis artifacts or systematic biases.

What would settle it

Train a policy using ProcVLM progress scores as the dense reward on a held-out long-horizon manipulation benchmark. If success rates are no higher than those obtained with simple time-based or terminal-success rewards, the central claim fails; a clear improvement would support it.
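
A minimal harness for that test, sketched under assumptions: a gymnasium-style environment and a `score_fn(obs, instruction)` stand-in for either a trained ProcVLM scorer or a time-based baseline. None of this is the paper's actual training setup.

```python
import gymnasium as gym

class DenseRewardWrapper(gym.Wrapper):
    """Replace the environment's sparse reward with a dense progress signal.

    Rewarding the *delta* in estimated progress (rather than its level) avoids
    paying the agent for standing still at a partially completed state.
    """
    def __init__(self, env, score_fn, instruction):
        super().__init__(env)
        self.score_fn, self.instruction = score_fn, instruction
        self.prev = 0.0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.prev = self.score_fn(obs, self.instruction)
        return obs, info

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        p = self.score_fn(obs, self.instruction)
        reward, self.prev = p - self.prev, p
        return obs, reward, terminated, truncated, info

# Train identical policies under a ProcVLM-backed score_fn and a time-based
# baseline (e.g., step count / horizon), then compare held-out success rates.
```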

Figures

Figures reproduced from arXiv: 2605.08774 by Chengyang Zhang, Hansen Shi, Haoyang Li, Jie Tang, Jing Zhang, Jinkai Zhang, Xiaohan Zhang, Xinlei Guo, Yang Wang, Youhe Feng.

Figure 1. Overview of ProcVLM. We first synthesize frame-wise procedural annotations from robot …
Figure 2. Overview of the procedural supervision synthesis pipeline. Raw episodes are processed …
Figure 3. GPU power consumption and utilization over time across multiple GPUs during the …
Figure 4. Example of Embodied Chain-of-Thought (ECoT) annotation in ProcCorpus. Given multi-view robot observations and a task instruction, ECoT enriches the raw frame with task-centric scene reasoning, completion assessment, future action planning, remaining to-do actions, target-object grounding, and optional discrete action tokens for VLA training.
Figure 5. Training curves of ProcVLM under the two-stage training pipeline. Green denotes Stage 1 …
Figure 6. Representative real-robot rollout records on the JAKA tabletop stack-bowls task. We show …
Figure 7. Zero-shot reward modeling cases on RoboMIND and RoboTwin. ProcVLM provides task …
Figure 8. Zero-shot reward editing on the same video sequence. Left: reward for “put the apple into …
Figure 9. One-shot adaptation case for a close-oven task. ProcVLM adapts from one successful …
Figure 10. One-shot adaptation case for an insert-cylinder task. ProcVLM transfers the demonstrated …
Figure 11. One-shot adaptation case for a move-target task. ProcVLM transfers the demonstrated …
Figure 12. One-shot adaptation case for a put-in-box task. ProcVLM generalizes from one successful …
Original abstract

Long-horizon robotic manipulation requires dense feedback that reflects how a task advances through its procedural stages, not merely whether the final outcome is successful. Existing reward models often rely on trajectory-level success labels or time-based interpolation, which can conflate elapsed time with true task progress and therefore fail to capture unfinished steps, stagnation, and failure states. We present ProcVLM, a progress-aware vision-language model that learns procedure-grounded progress as a dense reward signal for manipulation. Rather than deriving progress from terminal outcomes or temporal proxies, ProcVLM grounds progress estimation in procedural structure and intra-stage visual change, and further adopts a reasoning-before-estimation paradigm that infers the remaining atomic actions before estimating task progress. Specifically, we construct this supervision by synthesizing frame-level subtask-semantic annotations, assigning progress budgets according to subtask structure, and distributing each budget based on intra-subtask visual change. To train ProcVLM at scale, we build a standardized procedural supervision synthesis pipeline and construct ProcCorpus-60M from 30 embodied datasets with 60M annotated frames, from which we derive ProcVQA for procedure-aware pretraining, with progress estimation as the central task alongside action segmentation and future planning. Experiments on ProcVQA and reward-model benchmarks show that ProcVLM improves embodied procedural reasoning and yields more discriminative trajectory-internal progress estimates than representative baselines, supporting its use as a dense reward model for downstream reward-guided policy optimization. Project page: https://procvlm.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ProcVLM, a vision-language model for learning dense, procedure-grounded progress rewards in robotic manipulation. It constructs ProcCorpus-60M (60M frames from 30 embodied datasets) via a synthesis pipeline that generates frame-level subtask annotations, assigns progress budgets by subtask structure, and distributes them proportionally to intra-subtask visual change. ProcVLM is pretrained on the derived ProcVQA dataset using a reasoning-before-estimation paradigm (infer remaining atomic actions before progress estimation) alongside action segmentation and future planning. Experiments on ProcVQA and reward benchmarks report improved procedural reasoning and more discriminative trajectory-internal progress estimates than baselines, positioning it as a dense reward model for policy optimization.

Significance. If the synthesized supervision accurately reflects true procedural progress, ProcVLM could meaningfully advance dense reward modeling for long-horizon manipulation by addressing conflation of time with progress and better capturing unfinished steps or failures. The scale of ProcCorpus-60M and multi-task pretraining on ProcVQA constitute a valuable resource for the community. The reasoning-before-estimation design offers a concrete mechanism that may yield more reliable estimates than direct regression approaches.

major comments (2)
  1. [§3 (Method), Synthesis Pipeline] The central claim that ProcVLM yields faithful procedure-grounded progress estimates rests on the assumption that progress budgets assigned by subtask structure and allocated by intra-subtask visual change produce supervision matching actual task advancement. Visual change, however, is a proxy that can fail for non-visual progress, visually similar actions, or camera-dominated signals, introducing potential systematic bias into the 60M-frame corpus and the ProcVQA labels.
  2. [§4 (Experiments)] The reported gains in discriminative progress estimates and ProcVQA performance are presented without ablations that isolate the reasoning-before-estimation component from the synthesis pipeline itself, and without comparisons against methods using independent ground-truth progress annotations, making it difficult to confirm that improvements arise from procedure grounding rather than from artifacts of the self-generated labels.
minor comments (2)
  1. [Abstract] The phrase 'representative baselines' is used without enumeration, which obscures the strength of the comparative claims for readers evaluating the reward-model results.
  2. [Notation throughout] The terms 'progress budget per subtask' and 'intra-subtask visual change metric' are introduced without explicit formulas or pseudocode; the method section should specify how each is computed, for reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below with point-by-point responses and indicate planned revisions.

Point-by-point responses
  1. Referee: [§3 (Method), Synthesis Pipeline] The central claim that ProcVLM yields faithful procedure-grounded progress estimates rests on the assumption that progress budgets assigned by subtask structure and allocated by intra-subtask visual change produce supervision matching actual task advancement; visual change is a proxy that can fail for non-visual progress, visually similar actions, or camera-dominated signals, introducing potential systematic bias into the 60M-frame corpus and the ProcVQA labels.

    Authors: We agree that intra-subtask visual change is an imperfect proxy and can introduce systematic bias for non-visual progress, visually similar actions, or camera-dominated signals. Our synthesis pipeline was designed for scalability across 30 diverse embodied datasets where manual or sensor-based progress labels are unavailable. We will add a dedicated limitations paragraph in the revised manuscript discussing these failure modes, their potential impact on label fidelity, and the conditions under which the proxy is expected to be reliable for visual manipulation tasks. revision: partial

  2. Referee: [§4 (Experiments)] The reported gains in discriminative progress estimates and ProcVQA performance are presented without ablations that isolate the reasoning-before-estimation component from the synthesis pipeline itself, and without comparisons against methods using independent ground-truth progress annotations, making it difficult to confirm that improvements arise from procedure grounding rather than from artifacts of the self-generated labels.

    Authors: We acknowledge the value of isolating the reasoning-before-estimation component. We will add an ablation in the revised experiments that trains a direct-estimation variant (removing the reasoning step) while holding the synthesis pipeline fixed, to quantify its contribution to ProcVQA performance and progress discriminativeness. However, comparisons against methods using independent ground-truth progress annotations are not feasible, as no such large-scale ground-truth datasets exist for the 30 embodied datasets in ProcCorpus-60M; the synthesis pipeline was developed precisely to provide scalable supervision in their absence. We will clarify this design motivation in the experiments section. revision: partial

standing simulated objections not resolved
  • Direct experimental comparisons to methods trained on independent ground-truth progress annotations, as no such large-scale annotated datasets exist for the tasks in ProcCorpus-60M.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper explicitly constructs synthetic supervision by synthesizing frame-level subtask-semantic annotations, assigning progress budgets according to subtask structure, and distributing budgets based on intra-subtask visual change, then trains ProcVLM to predict these labels via a standard supervised vision-language modeling pipeline on ProcCorpus-60M and ProcVQA. This setup does not reduce any prediction to its inputs by construction, as the model learns a generalizable mapping from visual observations to progress scores rather than tautologically echoing the synthesis rules. No equations or claims equate the output progress estimate directly to the label-generation procedure, and central claims are supported by external benchmark evaluations rather than self-referential definitions. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing elements in the provided derivation.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 3 invented entities

The central claim rests on the validity of the procedural supervision synthesis pipeline and the assumption that visual change within subtasks is a faithful proxy for progress. The VLM itself contains the usual large number of learned parameters, but the load-bearing free parameters are the progress budget assignments and the choice of intra-subtask visual features used for distribution.

free parameters (2)
  • progress budget per subtask
    Assigned according to subtask structure; exact values or fitting procedure not specified in abstract.
  • intra-subtask visual change metric
    Used to distribute each budget; definition and weighting unknown from abstract.
axioms (2)
  • domain assumption: Intra-subtask visual change is a monotonic and unbiased indicator of progress within each procedural stage.
    Invoked when distributing progress budgets based on visual change.
  • domain assumption: Synthesized frame-level subtask annotations from 30 heterogeneous embodied datasets preserve semantic consistency across sources.
    Required for the standardized synthesis pipeline to produce usable supervision.
invented entities (3)
  • ProcVLM (no independent evidence)
    purpose: Progress-aware vision-language model for dense reward generation
    New model introduced in the paper.
  • ProcCorpus-60M (no independent evidence)
    purpose: Large-scale annotated dataset for procedure-grounded pretraining
    Constructed via the authors' synthesis pipeline from 30 datasets.
  • ProcVQA (no independent evidence)
    purpose: Procedure-aware VQA dataset with progress estimation as central task
    Derived from ProcCorpus-60M for pretraining.

pith-pipeline@v0.9.0 · 5594 in / 1857 out tokens · 65111 ms · 2026-05-12T01:36:46.985163+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    "we define progress as the normalized accumulation of local visual change weighted by subtask duration:"

    $$p(t) = \frac{\int_0^t w(\tau)\, r(\tau)\, d\tau}{\int_0^T w(\tau)\, r(\tau)\, d\tau}, \qquad w(\tau) = \operatorname{clip}\!\left(\frac{K\,(e_k - s_k)}{T},\ 0.75,\ 1.25\right), \qquad r(\tau) = \frac{\lVert \dot{\phi}(\tau) \rVert}{\int_0^T \lVert \dot{\phi}(u) \rVert\, du}$$

    (integration limits inferred from "normalized accumulation"; a numerical sanity check follows this list)

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem.

    "ProcVLM grounds progress estimation in procedural structure and intra-stage visual change... synthesizing frame-level subtask-semantic annotations, assigning progress budgets according to subtask structure, and distributing each budget based on intra-subtask visual change"
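
As a sanity check on the transcribed progress formula, a toy numerical sketch; the synthetic feature trajectory, equal subtask bounds, and discretization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 100, 4                                    # frames, subtasks
phi = np.cumsum(rng.random((T, 8)), axis=0)      # toy monotone feature trajectory
bounds = [(i * T // K, (i + 1) * T // K) for i in range(K)]

rate = np.linalg.norm(np.diff(phi, axis=0), axis=1)  # ||phi_dot||, length T-1
r = rate / rate.sum()                                # normalized change rate r(tau)
w = np.zeros(T - 1)
for s, e in bounds:                                  # clipped duration weight w(tau)
    w[s:min(e, T - 1)] = np.clip(K * (e - s) / T, 0.75, 1.25)

p = np.cumsum(w * r) / np.sum(w * r)                 # discrete p(t) in [0, 1]
assert np.isclose(p[-1], 1.0) and np.all(np.diff(p) >= 0)  # monotone, ends at 1
```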

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
