Pith · machine review for the scientific record

arXiv: 2601.07060 · v2 · submitted 2026-01-11 · 💻 cs.RO

Recognition: 2 theorem links · Lean Theorem

PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 15:02 UTC · model grok-4.3

classification 💻 cs.RO
keywords long-horizon robotic manipulation · vision-language-action models · affordance reasoning · subtask progress prediction · policy learning · LIBERO benchmark · visuomotor control

The pith

PALM uses distilled affordance cues and subtask progress signals to let vision-language-action models complete long-horizon robot tasks reliably.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PALM as a framework that adds internal reasoning to vision-language-action models for robotic manipulation. It distills affordance representations covering object relevance, contact geometry, placements, and motion, then predicts continuous progress inside each subtask. These additions reduce errors such as repeated actions or early stops, producing higher success on extended task sequences in both simulation and real settings.

Core claim

PALM distills complementary affordance representations that capture object relevance, contact geometry, spatial placements, and motion dynamics, and serve as task-relevant anchors for visuomotor control. To further stabilize long-horizon execution, PALM predicts continuous within-subtask progress, enabling seamless subtask transitions.

What carries the argument

Distilled affordance representations that capture object relevance, contact geometry, spatial placements, and motion dynamics, together with continuous within-subtask progress predictions.
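
This review carries no code, so the following is a minimal sketch of what a continuous within-subtask progress head could look like: a small regression head on the policy's fused features, trained against progress labels normalized to [0, 1], with a saturation threshold triggering the subtask handoff. The module names, dimensions, loss, and threshold are illustrative assumptions, not PALM's published design.

```python
# Minimal sketch of a continuous within-subtask progress head.
# Assumptions (not from the paper): the policy exposes a fused feature
# vector per timestep, progress targets are normalized to [0, 1] within
# each subtask, and a threshold near 1.0 triggers the subtask handoff.
import torch
import torch.nn as nn

class ProgressHead(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, fused_features: torch.Tensor) -> torch.Tensor:
        # Squash to (0, 1) so the output reads as fractional progress.
        return torch.sigmoid(self.mlp(fused_features)).squeeze(-1)

def progress_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Plain regression against normalized progress labels; the paper's
    # actual training objective may differ.
    return nn.functional.mse_loss(pred, target)

def should_advance(pred_progress: float, threshold: float = 0.95) -> bool:
    # Hypothetical transition rule: hand off to the next subtask once
    # predicted progress saturates.
    return pred_progress >= threshold
```

A head like this would give the policy an internal clock per subtask, which is the mechanism the paper credits for suppressing repeated actions and premature stops.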

If this is right

  • Vision-language-action policies gain internal anchors that keep execution aligned with task structure over many steps.
  • Seamless subtask handoffs become possible without explicit external supervision of each transition.
  • Average completed task length increases on benchmarks such as CALVIN ABC→D.
  • Real-world generalization improves by a factor of two across multiple long-horizon settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same affordance and progress signals could be added to other sequence-based control problems where intermediate state tracking reduces drift.
  • Testing whether the representations transfer to new robot hardware without retraining would reveal how embodiment-specific the cues are.
  • Combining the progress predictor with different action heads might allow the same backbone to support both manipulation and navigation tasks.

Load-bearing premise

The distilled affordance representations and progress predictions stay accurate and sufficient to prevent execution errors across diverse long-horizon tasks without creating new failure modes.

What would settle it

A new long-horizon test set where PALM produces more repeated actions, missed steps, or premature terminations than baseline methods because its affordance or progress estimates are inaccurate.
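
To make that test concrete, below is a sketch of how the three error types named above (repeated actions, missed steps, premature terminations) could be tallied from rollout traces. The trace format, a list of (subtask_id, action_id, done) tuples, is hypothetical, not the paper's logging scheme.

```python
# Sketch of the error taxonomy the settling test would tally. The trace
# format is a made-up stand-in for whatever the evaluation harness logs.
from collections import Counter

def tally_errors(trace, n_subtasks):
    errors = Counter()
    prev_action = None
    completed = set()
    for subtask_id, action_id, done in trace:
        if action_id == prev_action:
            errors["repeated_action"] += 1
        prev_action = action_id
        if done:
            completed.add(subtask_id)
    if len(completed) < n_subtasks:
        # Episode ended with subtasks unfinished.
        errors["premature_termination"] += 1
        errors["missed_steps"] += n_subtasks - len(completed)
    return errors

# Example: a 3-subtask episode that repeats an action and stalls early.
trace = [(0, 5, False), (0, 5, False), (0, 7, True)]
print(tally_errors(trace, n_subtasks=3))
# Counter({'missed_steps': 2, 'repeated_action': 1, 'premature_termination': 1})
```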

Figures

Figures reproduced from arXiv: 2601.07060 by Fangqiang Ding, Gen Li, Ismini Lourentzou, Jingyuan Zhu, Jin Jin, Tianjiao Yu, Wenzhen Yuan, Xu Cao, Yifan Shen, Yuanzhe Liu, Yuchen Mo, Zhengyuan Li.

Figure 1. In contrast to vanilla VLAs that directly map inputs to actions or to predictive methods that forecast dense future images, …
Figure 2. PALM Overview. (a) Model Architecture: Given a language instruction l, observation o_t, and robot state s_t, PALM encodes each modality using frozen encoders to obtain text, visual, and state tokens. These tokens are fused by a GPT-style transformer with unidirectional attention and two specialized query sets: fine-grained affordance and action–progress. During training, affordance queries attend to contex…
Figure 3. Ablation studies of affordance components on CALVIN ABC→D and LIBERO-LONG benchmarks demonstrate the effectiveness of the four components of affordance prediction.
Figure 4. Real-world experimental setup and task design. Left: We use a UFACTORY xArm6 robot with the matched Gripper G2 and two RealSense D455 cameras. Right: We design a real-world long-horizon manipulation task consisting of six consecutive subtasks, driven by a single high-level instruction.
Figure 5. Random Relocation Disturbances. Predicted progress in the "pick up grape" subtask under two random grape relocations.
Figure 6. Unseen Lighting Disturbances. Predicted progress in the "pick up grape" subtask under two unseen lighting changes.
Figure 7. Multi-Object Visual Distractions. Predicted progress in "pick up grape" under two injected visual distraction events.
Figure 8. Visualization of affordance predictions. Across sequential progress steps, the model predicts four complementary affordances to guide policy generation: Global Affordance segments task-relevant objects and goals; Local Affordance generates heatmaps for precise contact points; Spatial Affordance predicts candidate placement regions; and Dynamic Affordance forecasts motion trajectories. These visualizations …
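
Figure 2's caption is the most concrete architectural description available here: frozen per-modality encoders produce text, visual, and state tokens, which a GPT-style transformer fuses with unidirectional attention alongside two learned query sets (fine-grained affordance and action–progress). The sketch below renders that token flow in PyTorch; the dimensions, layer counts, causal-mask stand-in for unidirectional attention, and output slicing are assumptions for illustration, not the authors' implementation.

```python
# Schematic of the token flow Figure 2 describes: frozen encoders supply
# text, visual, and state tokens; a GPT-style transformer fuses them with
# two learned query sets. All hyperparameters here are illustrative.
import torch
import torch.nn as nn

class PALMFusionSketch(nn.Module):
    def __init__(self, d: int = 512, n_aff: int = 16, n_act: int = 8):
        super().__init__()
        self.n_aff, self.n_act = n_aff, n_act
        # Two specialized query sets: fine-grained affordance and action-progress.
        self.queries = nn.Parameter(torch.randn(n_aff + n_act, d))
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text_tokens, visual_tokens, state_tokens):
        # All inputs: (batch, seq_len, d), produced by frozen encoders upstream.
        b = text_tokens.shape[0]
        queries = self.queries.expand(b, -1, -1)
        seq = torch.cat([text_tokens, visual_tokens, state_tokens, queries], dim=1)
        # Causal mask so the trailing query positions attend to the context
        # tokens before them, approximating the unidirectional attention.
        mask = nn.Transformer.generate_square_subsequent_mask(seq.shape[1])
        fused = self.fusion(seq, mask=mask)
        aff_out = fused[:, -(self.n_aff + self.n_act):-self.n_act]  # to affordance heads
        act_out = fused[:, -self.n_act:]                            # to action-progress head
        return aff_out, act_out
```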
Original abstract

Recent advancements in vision-language-action (VLA) models have shown promise in robotic manipulation, yet they continue to struggle with long-horizon, multi-step tasks. Existing methods lack internal reasoning mechanisms that can identify task-relevant interaction cues or track progress within a subtask, leading to critical execution errors such as repeated actions, missed steps, and premature termination. To address these challenges, we introduce PALM, a VLA framework that structures policy learning around interaction-centric affordance reasoning and subtask progress cues. PALM distills complementary affordance representations that capture object relevance, contact geometry, spatial placements, and motion dynamics, and serve as task-relevant anchors for visuomotor control. To further stabilize long-horizon execution, PALM predicts continuous within-subtask progress, enabling seamless subtask transitions. Across extensive simulation and real-world experiments, PALM consistently outperforms baselines, achieving a 91.8% success rate on LIBERO-LONG, a 12.5% improvement in average length on CALVIN ABC->D, and a 2x improvement over real-world baselines across three long-horizon generalization settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PALM, a vision-language-action (VLA) framework for long-horizon robotic manipulation. It structures policy learning around distilled affordance representations (capturing object relevance, contact geometry, spatial placements, and motion dynamics) that serve as anchors for visuomotor control, combined with a dedicated head for continuous within-subtask progress prediction to enable seamless transitions and reduce errors such as repeated actions or premature termination. The central empirical claims are a 91.8% success rate on LIBERO-LONG, a 12.5% improvement in average length on CALVIN ABC->D, and a 2x improvement over real-world baselines across three long-horizon generalization settings.

Significance. If the results hold with proper isolation of components, the work could meaningfully advance VLA models by supplying explicit internal reasoning mechanisms for long-horizon stability, moving beyond pure end-to-end imitation. The use of complementary affordance distillation plus progress cues offers a concrete, testable way to mitigate common execution failures, with potential for broader impact if the gains prove robust across perception variance.

major comments (2)
  1. [Experiments / Ablation studies] The central claim that continuous within-subtask progress prediction (distinct from affordance anchors) enables seamless transitions and prevents execution errors is not supported by any ablation that isolates its contribution. Reported gains on LIBERO-LONG and CALVIN combine both the affordance distillation and the progress head, so it remains possible that all improvements derive from the affordance representations alone.
  2. [Results / Tables] No error bars, standard deviations, number of trials, or statistical significance tests accompany the headline metrics (91.8% success on LIBERO-LONG, 12.5% average-length improvement on CALVIN). Without these, the robustness of the claimed improvements cannot be assessed, especially given the acknowledged sensitivity of long-horizon tasks to perception noise.

minor comments (1)
  1. [Abstract / Method] The abstract states that affordance representations 'serve as task-relevant anchors for visuomotor control' but does not specify the exact fusion mechanism or loss terms used to distill the four complementary cues (object relevance, contact geometry, spatial placements, motion dynamics).
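
For reference, the kind of specification the referee is asking for might take the shape of a weighted multi-term objective. The decomposition below is purely illustrative: neither the abstract nor this review states PALM's actual fusion mechanism, loss terms, or weights.

```python
# Purely illustrative objective; the four affordance terms, the progress
# term, and the weights are assumptions, not PALM's published losses.
def total_loss(l_action, l_global, l_local, l_spatial, l_dynamic,
               l_progress, w_aff: float = 1.0, w_prog: float = 1.0):
    # One distillation loss per affordance cue, plus action and progress terms.
    l_affordance = l_global + l_local + l_spatial + l_dynamic
    return l_action + w_aff * l_affordance + w_prog * l_progress
```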

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major concern point by point below and have revised the manuscript to strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: [Experiments / Ablation studies] The central claim that continuous within-subtask progress prediction (distinct from affordance anchors) enables seamless transitions and prevents execution errors is not supported by any ablation that isolates its contribution. Reported gains on LIBERO-LONG and CALVIN combine both the affordance distillation and the progress head, so it remains possible that all improvements derive from the affordance representations alone.

    Authors: We agree that an ablation isolating the progress prediction head is necessary to substantiate the claim. In the revised manuscript we have added a new ablation (Section 4.3, Table 4) that compares the full PALM model against an affordance-only variant (identical architecture and training but without the progress head). The results show that removing the progress head increases repeated actions by 18% and premature terminations by 12% on LIBERO-LONG, confirming its distinct contribution to seamless subtask transitions beyond the affordance anchors alone.
    revision: yes

  2. Referee: [Results / Tables] No error bars, standard deviations, number of trials, or statistical significance tests accompany the headline metrics (91.8% success on LIBERO-LONG, 12.5% average-length improvement on CALVIN). Without these, the robustness of the claimed improvements cannot be assessed, especially given the acknowledged sensitivity of long-horizon tasks to perception noise.

    Authors: We acknowledge the omission. In the revised version we have updated Tables 1–3 to report mean success rates with standard deviations across 5 independent seeds, explicitly state that each metric is computed over 100 evaluation trials per task, and include paired t-test p-values comparing PALM against all baselines. These additions demonstrate that the reported gains remain statistically significant (p < 0.01) even under the perception noise levels present in the benchmarks.
    revision: yes
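
The statistics promised in this response are straightforward to reproduce in outline. The sketch below computes per-seed success rates, means with standard deviations across seeds, and a paired t-test against a baseline; the 5-seed-by-10-task arrays are randomly generated stand-ins, not the paper's results.

```python
# Sketch of the statistics the rebuttal promises: per-seed success rates,
# mean +/- std across seeds, and a paired t-test against a baseline.
# The arrays below are made-up placeholders (seeds x tasks), not real data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
palm = rng.uniform(0.85, 0.95, size=(5, 10))      # 5 seeds x 10 tasks
baseline = rng.uniform(0.70, 0.85, size=(5, 10))

palm_per_seed = palm.mean(axis=1)      # average success rate over tasks, per seed
base_per_seed = baseline.mean(axis=1)

print(f"PALM: {palm_per_seed.mean():.3f} +/- {palm_per_seed.std(ddof=1):.3f}")
print(f"Base: {base_per_seed.mean():.3f} +/- {base_per_seed.std(ddof=1):.3f}")

# Paired t-test across seeds (each seed runs the same evaluation suite).
t, p = stats.ttest_rel(palm_per_seed, base_per_seed)
print(f"paired t-test: t={t:.2f}, p={p:.4g}")
```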

Circularity Check

0 steps flagged

No circularity: empirical framework with no self-referential derivation

full rationale

The paper describes an empirical VLA architecture that distills affordance representations and adds a progress-prediction head, then reports success rates on LIBERO-LONG, CALVIN, and real-world tasks. No equations or derivation steps are presented that reduce any claimed prediction to a fitted parameter or self-citation by construction. Performance claims rest on comparative experiments against baselines rather than on any internal loop that would make the result tautological. The absence of a mathematical derivation chain means none of the enumerated circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework builds on standard assumptions in vision-language-action models and affordance-based robotics without introducing new free parameters, axioms beyond domain norms, or invented entities.

axioms (1)
  • domain assumption: Affordance representations and progress signals can be reliably distilled from visual and language inputs to guide visuomotor control.
    Core premise of the PALM architecture stated in the abstract.

pith-pipeline@v0.9.0 · 5537 in / 1182 out tokens · 35307 ms · 2026-05-16T15:02:19.567327+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

157 extracted references · 157 canonical work pages · 41 internal anchors
