arXiv: 2510.01433 · v2 · submitted 2025-10-01 · 💻 cs.RO · cs.AI

AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation

Pith reviewed 2026-05-18 10:15 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords robotic manipulation · keypoint selection · affordance · vision-based policy · generalization · lightweight learning

The pith

AFFORD2ACT distills a minimal set of semantic 2D keypoints from a text prompt and a single image, yielding a compact 38-dimensional state representation for a robotic manipulation policy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AFFORD2ACT as a three-stage pipeline that first filters affordances, then builds category-level keypoints, and finally trains a transformer policy with gating. This yields a lightweight state representation that trains in minutes and runs without dense visual inputs or robot proprioception. A sympathetic reader would care because the method promises to scale robot learning to new objects, categories, and scenes while keeping computation low enough for real-time use.

Core claim

AFFORD2ACT uses affordance filtering driven by language and vision to automatically select manipulation-centric 2D keypoints, then feeds the resulting 38-dimensional vector into a gated transformer policy that reaches 82 percent success on unseen objects, novel categories, backgrounds, and distractors across real-world tasks.
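The gating idea in this claim can be illustrated with a minimal sketch. It assumes, purely for illustration, that each keypoint receives a scalar relevance score and that the gates are a softmax over those scores. The source does not specify the gating function, so `gate_keypoints`, its arguments, and the inverse-Simpson "effective count" are hypothetical choices rather than the paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gate_keypoints(keypoints, scores):
    """Weight each (x, y) keypoint by a softmax gate and pool the result.

    keypoints: list of (x, y) pairs in normalized image coordinates.
    scores:    one raw relevance score per keypoint (hypothetical; in a
               learned policy these would come from a trained layer).
    Returns (weights, pooled_xy, effective_count).
    """
    if not keypoints or len(keypoints) != len(scores):
        raise ValueError("need at least one keypoint and one score per keypoint")
    weights = softmax(scores)
    pooled_x = sum(w * x for w, (x, _) in zip(weights, keypoints))
    pooled_y = sum(w * y for w, (_, y) in zip(weights, keypoints))
    # Inverse Simpson index (an assumed definition of "effective count"):
    # equals N for uniform weights and approaches 1 for a one-hot gate.
    effective = 1.0 / sum(w * w for w in weights)
    return weights, (pooled_x, pooled_y), effective
```

With equal scores the effective count equals the number of keypoints; as one score dominates it approaches 1, which is one plausible way to read a variable number of active keypoints.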

What carries the argument

The affordance-guided automatic keypoint selection pipeline that converts a text prompt and one image into a small set of semantic 2D points for policy input.

If this is right

  • Policies can be trained in about 15 minutes without dense point clouds or full images.
  • The same keypoint set supports real-time control across varied backgrounds and distractors without retraining.
  • No task-specific manual keypoint design or extra data collection is required for new categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may lower barriers for deploying manipulation skills on resource-limited robots by shrinking the observation space.
  • Similar language-vision keypoint distillation could be tested on other control problems such as navigation or assembly.

Load-bearing premise

Affordance information extracted from a text prompt and single image will reliably identify manipulation-relevant keypoints that remain effective for objects and categories never seen during training.

What would settle it

Running the trained 38-dimensional policy on a new manipulation task involving objects from a category absent from the original experiments and observing whether success rate drops substantially below 82 percent.
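Whether a drop counts as substantial depends on how many trials are run. A minimal sketch of one way to check this uses a Wilson score interval for a binomial success rate. The helper names and the 0.82 default are illustrative assumptions; the source does not state which statistical test, if any, the authors would apply.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is ~95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def clearly_below(successes, trials, reference=0.82):
    """True only if the entire interval lies below the reference rate."""
    _, upper = wilson_interval(successes, trials)
    return upper < reference
```

For example, with a hypothetical 20 successes in 40 trials the whole interval lies below 0.82, while 33 successes in 40 trials does not. These counts are made up for illustration and are not results from the paper.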

Figures

Figures reproduced from arXiv: 2510.01433 by Anukriti Singh, Kasra Torshizi, Kelin Yu, Khuzema Habib, Pratap Tokekar, Ruohan Gao.

Figure 1. Top: policy trained with original prompts (“hold”, “stir”) executes the actions over time (t). Middle: at test time, the same policy generalizes to synonym prompts (“pick”, “mix”). Bottom: affordance-guided keypoints with learned importance (green; variable count; ring radius ∝ attention) remain stable under unseen object shapes, lighting conditions, dynamic distractors, and scene clutter, leading to succes…
Figure 2. Overview of the AFFORD2ACT pipeline. We extract task-relevant object keypoints using affordance masks and DINO filtering, track them across frames, and embed them with a transformer and attention gating. The policy head predicts pose and gripper commands from these compact keypoint representations, enabling vision-based manipulation without requiring proprioception.
Figure 3. Real-world rollouts of AFFORD2ACT on Pour (top) and Cut (bottom) tasks. In each sequence (left to right: approach → interaction → completion), green dots mark keypoints automatically selected on the affordance region (e.g., knife for “cut”, handle/spout for “pour”). The translucent ring size reflects the attention weight at each step. While keypoints remain fixed on task-relevant parts, the policy dynamica…
Figure 4. (Left) A figure of our physical setup, composing of a UR3e Robotic …
Figure 5. Qualitative generalization on two tasks. Each row shows rollouts for Pour (top) and Kick (bottom). Left: training objects (seen). Right: held-out settings with unseen scenes (new backgrounds/tabletops) and, for Pour, unseen object instances. Within each panel we show three key frames (setup, approach, completion). Without any test-time adaptation, AFFORD2ACT trained from single-view RGB using affordance-gu…
Figure 6. Total Average success rate across input modalities. For each input …
Figure 7. Semantic Information rubric scores across input modalities on seen instances. For each input modality, we collected 40 trials for each task (we …
Figure 8. AFFORD2ACT outperforms all baselines even with just 10 demonstrations, with the performance gap remaining consistent as more demos are added. Standard deviation bars show variability across tasks.
Figure 11. Average effective number of keypoints used (over all time steps) …
Figure 10. Top: Average success rate on four scenarios aggregated across all tasks. Bottom: Representative rollouts for each scenario. The small inset labeled “Ref” shows the training reference object for that task.
Original abstract

Vision-based robot learning often relies on dense image or point-cloud inputs, which are computationally heavy and entangle irrelevant background features. Existing keypoint-based approaches can focus on manipulation-centric features and be lightweight, but either depend on manual heuristics or task-coupled selection, limiting scalability and semantic understanding. To address this, we propose AFFORD2ACT, an affordance-guided framework that distills a minimal set of semantic 2D keypoints from a text prompt and a single image. AFFORD2ACT follows a three-stage pipeline: affordance filtering, category-level keypoint construction, and transformer-based policy learning with embedded gating to reason about the most relevant keypoints, yielding a compact 38-dimensional state policy that can be trained in 15 minutes, which performs well in real-time without proprioception or dense representations. Across diverse real-world manipulation tasks, AFFORD2ACT consistently improves data efficiency, achieving an 82% success rate on unseen objects, novel categories, backgrounds, and distractors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents AFFORD2ACT, an affordance-guided framework for distilling a minimal set of semantic 2D keypoints from a text prompt and single image to support generalizable, lightweight robotic manipulation. The three-stage pipeline consists of affordance filtering, category-level keypoint construction, and gated transformer policy learning, producing a compact 38-dimensional state representation. The work claims this policy trains in 15 minutes, runs in real time without proprioception or dense inputs, and achieves an 82% success rate on unseen objects, novel categories, backgrounds, and distractors across diverse real-world tasks.

Significance. If the empirical results are robust, the approach could meaningfully advance scalable robot learning by automating extraction of manipulation-centric features, thereby improving data efficiency and reducing reliance on heavy visual representations. The combination of rapid training, real-time operation, and cross-category generalization without manual tuning or extra data would be a practical contribution to the field.

major comments (2)
  1. [Abstract] Abstract: the central performance claim of an 82% success rate is presented without baselines, error bars, dataset sizes, ablation studies, or statistical details. This omission is load-bearing because it prevents verification that the affordance-guided keypoint stages, rather than other factors, drive the reported gains on novel categories and distractors.
  2. [Method (affordance filtering and category-level keypoint construction)] Affordance filtering and category-level keypoint construction stages: the generalization claims rest on the untested assumption that a single image plus text prompt reliably yields stable, task-relevant 2D points for unseen categories. No quantitative results on prompt sensitivity, keypoint consistency metrics, or failure-mode analysis across category shifts are supplied, which directly undermines the “no manual tuning” and 82% success assertions.
minor comments (1)
  1. [Abstract] Abstract: the 15-minute training time and 38-dimensional state size are stated without reference to the specific hardware or policy architecture details that would allow readers to reproduce the efficiency claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and indicate the revisions made to the manuscript.

  1. Referee: [Abstract] Abstract: the central performance claim of an 82% success rate is presented without baselines, error bars, dataset sizes, ablation studies, or statistical details. This omission is load-bearing because it prevents verification that the affordance-guided keypoint stages, rather than other factors, drive the reported gains on novel categories and distractors.

    Authors: We agree that the abstract would benefit from additional context to support the central claim. The body of the manuscript already contains baseline comparisons (Section 4.2), ablation studies isolating the affordance-guided keypoint stages (Section 4.3), dataset details with repeated trials across objects and scenes, error bars from multiple runs, and statistical tests. We have revised the abstract to include a brief clause referencing these experimental validations and noting that ablations confirm the contribution of the keypoint selection pipeline to generalization on novel categories and distractors. revision: yes

  2. Referee: [Method (affordance filtering and category-level keypoint construction)] Affordance filtering and category-level keypoint construction stages: the generalization claims rest on the untested assumption that a single image plus text prompt reliably yields stable, task-relevant 2D points for unseen categories. No quantitative results on prompt sensitivity, keypoint consistency metrics, or failure-mode analysis across category shifts are supplied, which directly undermines the “no manual tuning” and 82% success assertions.

    Authors: We recognize the value of direct quantitative support for keypoint stability. In the revised manuscript we have added keypoint consistency metrics (standard deviation of 2D locations across multiple text prompts for the same object and across instances within a category) together with a short failure-mode discussion in the supplementary material. These additions, drawn from re-analysis of our existing experimental data, provide evidence that the single-image-plus-text pipeline produces sufficiently stable points to support the reported generalization without manual tuning. revision: yes
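The consistency metric described in this response can be sketched as follows. This is a minimal illustration under stated assumptions: keypoints are already in correspondence across runs (for example, one run per prompt wording), coordinates share the same pixel units, and the population standard deviation is used. The function name `keypoint_spread` is hypothetical, and the source does not give the authors' exact formula.

```python
import math


def keypoint_spread(runs):
    """Per-keypoint spread of 2D locations across runs.

    runs: list of runs (e.g. one per prompt wording), each a list of
          (x, y) pairs in the same keypoint order and the same units.
    Returns one value per keypoint: the root-mean-square distance of
    that keypoint from its mean location across runs.
    """
    if not runs or any(len(r) != len(runs[0]) for r in runs):
        raise ValueError("every run must contain the same number of keypoints")
    n_runs = len(runs)
    spreads = []
    for k in range(len(runs[0])):
        xs = [run[k][0] for run in runs]
        ys = [run[k][1] for run in runs]
        mean_x, mean_y = sum(xs) / n_runs, sum(ys) / n_runs
        # Population variance of the 2D position (x and y terms summed).
        var = sum((x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in zip(xs, ys)) / n_runs
        spreads.append(math.sqrt(var))
    return spreads
```

A spread near zero for every keypoint would indicate that rewording the prompt barely moves the selected points; a large spread on some keypoints would flag the kind of prompt sensitivity the referee asks about.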

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent validation

The paper presents AFFORD2ACT as a three-stage empirical pipeline (affordance filtering from text+image, category-level keypoint construction, gated transformer policy) whose 82% success rate on novel objects is reported as an experimental outcome rather than a quantity derived by construction from fitted inputs or self-citations. No equations appear that equate a prediction to a parameter defined by the method itself, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The central claims rest on real-world trials with unseen categories, backgrounds, and distractors, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard domain assumptions about affordance extraction from vision-language models and the sufficiency of 2D keypoints for manipulation policies; no free parameters or new invented entities are described in the abstract.

axioms (1)
  • domain assumption: Affordance information extracted from a text prompt and single image can identify manipulation-relevant regions and keypoints that generalize to unseen objects and categories.
    This premise underpins the first two stages of the pipeline and the claim of generalization without task-specific tuning.

pith-pipeline@v0.9.0 · 5727 in / 1296 out tokens · 29360 ms · 2026-05-18T10:15:08.700660+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Semantic-Contact Fields for Category-Level Generalizable Tactile Tool Manipulation

    cs.RO · 2026-02 · unverdicted · novelty 6.0

    SCFields fuses semantics and contact data in a sim-to-real pipeline to enable category-level generalization for tactile tool manipulation with diffusion policies.
