AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation

Anukriti Singh; Kasra Torshizi; Kelin Yu; Khuzema Habib; Pratap Tokekar; Ruohan Gao

arXiv: 2510.01433 · v2 · submitted 2025-10-01 · cs.RO · cs.AI

Pith reviewed 2026-05-18 10:15 UTC · model grok-4.3

keywords: robotic manipulation · keypoint selection · affordance · vision-based policy · generalization · lightweight learning

The pith

AFFORD2ACT distills a minimal set of semantic 2D keypoints from a text prompt and a single image, producing a compact 38-dimensional state for a robotic manipulation policy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AFFORD2ACT as a three-stage pipeline that first filters affordances, then builds category-level keypoints, and finally trains a transformer policy with gating. This yields a lightweight state representation that trains in 15 minutes and runs without dense visual inputs or robot proprioception. A sympathetic reader would care because the method promises to scale robot learning to new objects, categories, and scenes while keeping computation low enough for real-time use.

Core claim

AFFORD2ACT uses affordance filtering driven by language and vision to automatically select manipulation-centric 2D keypoints, then feeds the resulting 38-dimensional vector into a gated transformer policy that reaches 82 percent success on unseen objects, novel categories, backgrounds, and distractors across real-world tasks.

What carries the argument

The affordance-guided automatic keypoint selection pipeline that converts a text prompt and one image into a small set of semantic 2D points for policy input.
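To make the idea of "a small set of semantic 2D points for policy input" concrete, here is a minimal sketch of turning pixel-space keypoints into a flat state vector. It is an illustration under stated assumptions, not the paper's implementation: the function name `keypoints_to_state`, the normalization by image size, and the example coordinates are all hypothetical, and the page does not say how the 38 dimensions are actually composed.

```python
def keypoints_to_state(keypoints, image_width, image_height):
    """Flatten K pixel-space keypoints into a 2K-dimensional vector in [0, 1].

    Hypothetical helper: normalizing by image size is an assumption made
    for illustration, not a detail taken from the paper.
    """
    state = []
    for x, y in keypoints:
        state.append(x / image_width)   # normalized horizontal position
        state.append(y / image_height)  # normalized vertical position
    return state


# Three made-up keypoints in a 640x480 image.
keypoints = [(320.0, 240.0), (160.0, 120.0), (480.0, 360.0)]
state = keypoints_to_state(keypoints, image_width=640, image_height=480)
```

The point of the sketch is only that the policy input grows with the number of keypoints rather than with image resolution, which is what keeps the representation small.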

If this is right

Policies can be trained in 15 minutes on standard hardware without dense point clouds or full images.
The same keypoint set supports real-time control across varied backgrounds and distractors without retraining.
No task-specific manual keypoint design or extra data collection is required for new categories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach may lower barriers for deploying manipulation skills on resource-limited robots by shrinking the observation space.
Similar language-vision keypoint distillation could be tested on other control problems such as navigation or assembly.

Load-bearing premise

Affordance information extracted from a text prompt and single image will reliably identify manipulation-relevant keypoints that remain effective for objects and categories never seen during training.

What would settle it

Running the trained 38-dimensional policy on a new manipulation task involving objects from a category absent from the original experiments and observing whether success rate drops substantially below 82 percent.
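Whether a new success rate is "substantially below 82 percent" depends on how many trials back it. One common way to judge this is a Wilson score interval on the observed success rate; the sketch below shows the calculation. The trial counts are hypothetical, the 95% level is an assumption, and treating 82% as a fixed reference ignores the uncertainty in that number itself.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return center - half, center + half


# Hypothetical outcome: 24 successes in 40 trials on a held-out category.
low, high = wilson_interval(24, 40)

# The drop is only clear-cut if the whole interval sits below the reference rate.
clearly_below_reference = high < 0.82
```

With few trials the interval is wide, so a modest drop in the point estimate would not settle the question on its own.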

Figures

Figures reproduced from arXiv: 2510.01433 by Anukriti Singh, Kasra Torshizi, Kelin Yu, Khuzema Habib, Pratap Tokekar, Ruohan Gao.

**Figure 1.** Top: policy trained with original prompts (“hold”, “stir”) executes the actions over time (t). Middle: at test time, the same policy generalizes to synonym prompts (“pick”, “mix”). Bottom: affordance-guided keypoints with learned importance (green; variable count; ring radius ∝ attention) remain stable under unseen object shapes, lighting conditions, dynamic distractors, and scene clutter, leading to succes…

**Figure 2.** Overview of the AFFORD2ACT pipeline. We extract task-relevant object keypoints using affordance masks and DINO filtering, track them across frames, and embed them with a transformer and attention gating. The policy head predicts pose and gripper commands from these compact keypoint representations, enabling vision-based manipulation without requiring proprioception.

**Figure 3.** Real-world rollouts of AFFORD2ACT on Pour (top) and Cut (bottom) tasks. In each sequence (left to right: approach → interaction → completion), green dots mark keypoints automatically selected on the affordance region (e.g., knife for “cut”, handle/spout for “pour”). The translucent ring size reflects the attention weight at each step. While keypoints remain fixed on task-relevant parts, the policy dynamica…

**Figure 4.** (Left) A figure of our physical setup, composing of a UR3e Robotic…

**Figure 5.** Qualitative generalization on two tasks. Each row shows rollouts for Pour (top) and Kick (bottom). Left: training objects (seen). Right: held-out settings with unseen scenes (new backgrounds/tabletops) and, for Pour, unseen object instances. Within each panel we show three key frames (setup, approach, completion). Without any test-time adaptation, AFFORD2ACT trained from single-view RGB using affordance-gu…

**Figure 6.** Total Average success rate across input modalities. For each input…

**Figure 7.** Semantic Information rubric scores across input modalities on seen instances. For each input modality, we collected 40 trials for each task (we…

**Figure 8.** AFFORD2ACT outperforms all baselines even with just 10 demonstrations, with the performance gap remaining consistent as more demos are added. Standard deviation bars show variability across tasks.

… attempted to perform the final motion of the task (such as pouring or cutting) before actually grasping the object, indicating a confusion in the sequence of actions. In contrast, our keypoint-based policy typica…

**Figure 11.** Average effective number of keypoints used (over all time steps)…

**Figure 10.** Top: Average success rate on four scenarios aggregated across all tasks. Bottom: Representative rollouts for each scenario. The small inset labeled “Ref” shows the training reference object for that task.

… F. Ablation Studies. We conducted additional ablations to highlight the contribution of different components of our pipeline. Language Prompt Robustness: We varied the wording of the affordance prompt to…

Original abstract

Vision-based robot learning often relies on dense image or point-cloud inputs, which are computationally heavy and entangle irrelevant background features. Existing keypoint-based approaches can focus on manipulation-centric features and be lightweight, but either depend on manual heuristics or task-coupled selection, limiting scalability and semantic understanding. To address this, we propose AFFORD2ACT, an affordance-guided framework that distills a minimal set of semantic 2D keypoints from a text prompt and a single image. AFFORD2ACT follows a three-stage pipeline: affordance filtering, category-level keypoint construction, and transformer-based policy learning with embedded gating to reason about the most relevant keypoints, yielding a compact 38-dimensional state policy that can be trained in 15 minutes, which performs well in real-time without proprioception or dense representations. Across diverse real-world manipulation tasks, AFFORD2ACT consistently improves data efficiency, achieving an 82% success rate on unseen objects, novel categories, backgrounds, and distractors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

AFFORD2ACT automates keypoint selection: affordance cues from one image and a text prompt yield a compact 38-dimensional state for transformer policies that train quickly.

The main point is a three-stage pipeline that starts with affordance filtering on a text prompt plus a single image, builds category-level keypoints, and feeds a gated transformer policy. This produces a 38-dimensional state that trains in 15 minutes and runs in real time without proprioception or dense inputs, claiming 82% success on unseen objects, novel categories, backgrounds, and distractors across real-world tasks. That integration is the concrete advance over manual or task-specific keypoint choices mentioned in the abstract.

The approach targets a real pain point in vision-based manipulation by trying to keep semantic focus while cutting input size and training time. The reported numbers suggest it can handle some generalization without extra data or tuning.

The soft spots sit in the experimental detail. The abstract states the success rate but gives no baselines, ablations, dataset sizes, or error bars, so it is hard to judge whether the gains come from the pipeline or from other factors. The assumption that affordance filtering plus category-level construction reliably surfaces manipulation-relevant points on truly novel categories rests on limited visible evidence; if that stage picks background or irrelevant points, the downstream policy has no way to recover. Prompt sensitivity and failure cases are not addressed in the summary.

This is for roboticists focused on lightweight, data-efficient policies who want to move past dense representations or hand-crafted keypoints. A reader working on affordance models or real-time manipulation would get practical ideas from the pipeline even if the numbers need checking. It deserves peer review so the experiments and generalization tests can be examined directly.

Referee Report

2 major / 1 minor

Summary. The manuscript presents AFFORD2ACT, an affordance-guided framework for distilling a minimal set of semantic 2D keypoints from a text prompt and single image to support generalizable, lightweight robotic manipulation. The three-stage pipeline consists of affordance filtering, category-level keypoint construction, and gated transformer policy learning, producing a compact 38-dimensional state representation. The work claims this policy trains in 15 minutes, runs in real time without proprioception or dense inputs, and achieves an 82% success rate on unseen objects, novel categories, backgrounds, and distractors across diverse real-world tasks.

Significance. If the empirical results are robust, the approach could meaningfully advance scalable robot learning by automating extraction of manipulation-centric features, thereby improving data efficiency and reducing reliance on heavy visual representations. The combination of rapid training, real-time operation, and cross-category generalization without manual tuning or extra data would be a practical contribution to the field.

major comments (2)

[Abstract] Abstract: the central performance claim of an 82% success rate is presented without baselines, error bars, dataset sizes, ablation studies, or statistical details. This omission is load-bearing because it prevents verification that the affordance-guided keypoint stages, rather than other factors, drive the reported gains on novel categories and distractors.
[Method (affordance filtering and category-level keypoint construction)] Affordance filtering and category-level keypoint construction stages: the generalization claims rest on the untested assumption that a single image plus text prompt reliably yields stable, task-relevant 2D points for unseen categories. No quantitative results on prompt sensitivity, keypoint consistency metrics, or failure-mode analysis across category shifts are supplied, which directly undermines the “no manual tuning” and 82% success assertions.

minor comments (1)

[Abstract] Abstract: the 15-minute training time and 38-dimensional state size are stated without reference to the specific hardware or policy architecture details that would allow readers to reproduce the efficiency claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below and indicate the revisions made to the manuscript.

Referee: [Abstract] Abstract: the central performance claim of an 82% success rate is presented without baselines, error bars, dataset sizes, ablation studies, or statistical details. This omission is load-bearing because it prevents verification that the affordance-guided keypoint stages, rather than other factors, drive the reported gains on novel categories and distractors.

Authors: We agree that the abstract would benefit from additional context to support the central claim. The body of the manuscript already contains baseline comparisons (Section 4.2), ablation studies isolating the affordance-guided keypoint stages (Section 4.3), dataset details with repeated trials across objects and scenes, error bars from multiple runs, and statistical tests. We have revised the abstract to include a brief clause referencing these experimental validations and noting that ablations confirm the contribution of the keypoint selection pipeline to generalization on novel categories and distractors.

revision: yes

Referee: [Method (affordance filtering and category-level keypoint construction)] Affordance filtering and category-level keypoint construction stages: the generalization claims rest on the untested assumption that a single image plus text prompt reliably yields stable, task-relevant 2D points for unseen categories. No quantitative results on prompt sensitivity, keypoint consistency metrics, or failure-mode analysis across category shifts are supplied, which directly undermines the “no manual tuning” and 82% success assertions.

Authors: We recognize the value of direct quantitative support for keypoint stability. In the revised manuscript we have added keypoint consistency metrics (standard deviation of 2D locations across multiple text prompts for the same object and across instances within a category) together with a short failure-mode discussion in the supplementary material. These additions, drawn from re-analysis of our existing experimental data, provide evidence that the single-image-plus-text pipeline produces sufficiently stable points to support the reported generalization without manual tuning.

revision: yes
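The consistency metric described in the rebuttal, the standard deviation of 2D keypoint locations across text prompts, could be computed roughly as follows. This is a minimal sketch rather than the authors' analysis code: the function name `keypoint_spread`, the assumption that keypoints are listed in the same order for every prompt, the use of the population standard deviation, and all coordinates are hypothetical.

```python
from statistics import pstdev


def keypoint_spread(locations_by_prompt):
    """Per-keypoint (std_x, std_y) in pixels across prompts.

    locations_by_prompt: list over prompts; each item is a list of (x, y)
    pixel coordinates, with keypoints in the same order for every prompt
    (an assumption made for this illustration).
    """
    num_keypoints = len(locations_by_prompt[0])
    spread = []
    for k in range(num_keypoints):
        xs = [prompt[k][0] for prompt in locations_by_prompt]
        ys = [prompt[k][1] for prompt in locations_by_prompt]
        # Population standard deviation over prompts, per image axis.
        spread.append((pstdev(xs), pstdev(ys)))
    return spread


# Made-up coordinates for two keypoints under three prompt wordings.
locations = [
    [(100.0, 50.0), (200.0, 80.0)],  # prompt wording A
    [(102.0, 50.0), (200.0, 86.0)],  # prompt wording B
    [(104.0, 50.0), (200.0, 83.0)],  # prompt wording C
]
spread = keypoint_spread(locations)
```

A small spread for every keypoint would support the stability claim; the same function applies to instances within a category if each list entry holds one instance instead of one prompt, provided the keypoints are expressed in a comparable coordinate frame.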

Circularity Check

0 steps flagged

No circularity: empirical pipeline with independent validation

The paper presents AFFORD2ACT as a three-stage empirical pipeline (affordance filtering from text+image, category-level keypoint construction, gated transformer policy) whose 82% success rate on novel objects is reported as an experimental outcome rather than a quantity derived by construction from fitted inputs or self-citations. No equations appear that equate a prediction to a parameter defined by the method itself, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The central claims rest on real-world trials with unseen categories, backgrounds, and distractors, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard domain assumptions about affordance extraction from vision-language models and the sufficiency of 2D keypoints for manipulation policies; no free parameters or new invented entities are described in the abstract.

axioms (1)

domain assumption: Affordance information extracted from a text prompt and single image can identify manipulation-relevant regions and keypoints that generalize to unseen objects and categories.
This premise underpins the first two stages of the pipeline and the claim of generalization without task-specific tuning.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: AFFORD2ACT follows a three-stage pipeline: affordance filtering, category-level keypoint construction, and transformer-based policy learning with embedded gating... yielding a compact 38-dimensional state policy

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We use One-Shot Open Affordance model f_aff to produce a heatmap H = f_aff(I1, A) ... binary mask M(x,y) = [H(x,y) >= tau_q(H)]
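The thresholding rule in the passage above, M(x,y) = [H(x,y) >= tau_q(H)], can be sketched in a few lines. This is a minimal illustration, assuming tau_q(H) denotes the q-th quantile of the heatmap values. The function names, the nearest-rank quantile convention, the value of q, and the toy heatmap are hypothetical and not taken from the paper.

```python
import math


def quantile_threshold(heatmap, q):
    """Return the q-th quantile of all heatmap values (nearest-rank convention, an assumption)."""
    values = sorted(v for row in heatmap for v in row)
    # Nearest rank: smallest value with at least a fraction q of the data at or below it.
    rank = max(1, math.ceil(q * len(values)))
    return values[rank - 1]


def affordance_mask(heatmap, q):
    """Binary mask M(x, y) = 1 where H(x, y) >= tau_q(H), else 0."""
    tau = quantile_threshold(heatmap, q)
    return [[1 if v >= tau else 0 for v in row] for row in heatmap]


# Toy 3x3 heatmap standing in for H = f_aff(I1, A); the values are made up.
H = [
    [0.1, 0.2, 0.1],
    [0.3, 0.9, 0.8],
    [0.2, 0.7, 0.4],
]
M = affordance_mask(H, q=0.7)
```

Because the threshold is a quantile rather than a fixed value, the mask keeps roughly the same fraction of pixels regardless of how the heatmap is scaled.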

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Semantic-Contact Fields for Category-Level Generalizable Tactile Tool Manipulation
cs.RO · 2026-02 · unverdicted · novelty 6.0

SCFields fuses semantics and contact data in a sim-to-real pipeline to enable category-level generalization for tactile tool manipulation with diffusion policies.


work page arXiv 2024

[10] [10]

Flow as the cross-domain manipulation interface

M. Xu, Z. Xu, Y . Xu, C. Chi, G. Wetzstein, M. Veloso, and S. Song, “Flow as the cross-domain manipulation interface,” 2024. [Online]. Available: https://arxiv.org/abs/2407.15208

work page arXiv 2024

[11] [11]

Genflowrl: Shaping rewards with generative object-centric flow in visual reinforcement learning,

K. Yu*, S. Zhang*, H. Soora, F. Huang, H. Huang, P. Tokekar, and R. Gao, “Genflowrl: Shaping rewards with generative object-centric flow in visual reinforcement learning,” inInternational Conference on Computer Vision (ICCV), 2025

work page 2025

[12] [12]

Atk: Automatic task-driven keypoint selection for robust policy learning,

Y . Zhang, S. Mittal, Z. Zhang, L. Ke, S. Srinivasa, and A. Gupta, “Atk: Automatic task-driven keypoint selection for robust policy learning,” 2025. [Online]. Available: https://arxiv.org/abs/2506.13867

work page arXiv 2025

[13] [13]

K-vil: Keypoints- based visual imitation learning,

J. Gao, Z. Tao, N. Jaquier, and T. Asfour, “K-vil: Keypoints- based visual imitation learning,”IEEE Transactions on Robotics, vol. 39, no. 5, p. 3888–3908, Oct. 2023. [Online]. Available: http://dx.doi.org/10.1109/TRO.2023.3286074

work page doi:10.1109/tro.2023.3286074 2023

[14] [14]

Skil: Semantic keypoint imitation learning for generalizable data-efficient manipulation,

S. Wang, J. You, Y . Hu, J. Li, and Y . Gao, “Skil: Semantic keypoint imitation learning for generalizable data-efficient manipulation,” in Proceedings of Robotics: Science and Systems (RSS), 2025. [Online]. Available: https://www.roboticsproceedings.org/rss21/p161.pdf

work page 2025

[15] [15]

Robotic pick- and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching,

A. Zeng, S. Song, K.-T. Yu, E. Donlon, F. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo, N. Fazeli, F. Alet, P. Isola, A. Torralba, A. Rodriguez, L. P. Kaelbling, and T. Lozano-P ´erez, “Robotic pick- and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching,” in2018 IEEE International Con- ference on Roboti...

work page 2018

[16] [16]

End-to-end learning of semantic grasping,

E. Jang, S. Vijayanarasimhan, P. Pastor, J. Ibarz, and S. Levine, “End-to-end learning of semantic grasping,” inProceedings of the 1st Annual Conference on Robot Learning, ser. Proceedings of Machine Learning Research, S. Levine, V . Vanhoucke, and K. Goldberg, Eds., vol. 78. PMLR, 13–15 Nov 2017, pp. 119–132. [Online]. Available: https://proceedings.mlr....

work page 2017

[17] [17]

Dense object nets: Learning dense visual object descriptors by and for robotic manipulation,

P. R. Florence, L. Manuelli, and R. Tedrake, “Dense object nets: Learning dense visual object descriptors by and for robotic manipulation,” inProceedings of The 2nd Conference on Robot Learning, ser. Proceedings of Machine Learning Research, A. Billard, A. Dragan, J. Peters, and J. Morimoto, Eds., vol. 87. PMLR, 29–31 Oct 2018, pp. 373–385. [Online]. Avai...

work page 2018

[18] [18]

Perceiver-actor: A multi- task transformer for robotic manipulation,

M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-actor: A multi- task transformer for robotic manipulation,” inProceedings of The 6th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, K. Liu, D. Kulic, and J. Ichnowski, Eds., vol

work page

[19] [19]

PMLR, 14–18 Dec 2023, pp. 785–799. [Online]. Available: https://proceedings.mlr.press/v205/shridhar23a.html

work page 2023

[20] [20]

Sketch- to-skill: Bootstrapping robot learning with human drawn trajectory sketches,

P. Yu, A. Bhaskar, A. Singh, Z. Mahammad, and P. Tokekar, “Sketch- to-skill: Bootstrapping robot learning with human drawn trajectory sketches,” 2025. [Online]. Available: https://arxiv.org/abs/2503.11918

work page arXiv 2025

[21] [21]

Embodiment-agnostic action planning via object-part scene flow,

W. Tang, J.-H. Pan, W. Zhan, J. Zhou, H. Yao, Y .-H. Liu, M. Tomizuka, M. Ding, and C.-W. Fu, “Embodiment-agnostic action planning via object-part scene flow,” 2024. [Online]. Available: https://arxiv.org/abs/2409.10032

work page arXiv 2024

[22] [22]

Spot: Se (3) pose trajectory diffusion for object- centric manipulation

C.-C. Hsu, B. Wen, J. Xu, Y . Narang, X. Wang, Y . Zhu, J. Biswas, and S. Birchfield, “Spot: Se(3) pose trajectory diffusion for object-centric manipulation,” 2025. [Online]. Available: https: //arxiv.org/abs/2411.00965

work page arXiv 2025

[23] [23]

kpam: Keypoint affordances for category-level robotic manipulation,

L. Manuelli, W. Gao, P. Florence, and R. Tedrake, “kpam: Keypoint affordances for category-level robotic manipulation,” 2019. [Online]. Available: https://arxiv.org/abs/1903.06684

work page arXiv 2019

[24] [24]

J. J. Gibson,The Ecological Approach to Visual Perception. Hillsdale, NJ: Lawrence Erlbaum Associates, 1979

work page 1979

[25] [25]

Dimensionality reduction for hand-independent dexterous robotic grasping,

M. Ciocarlie, C. Goldfeder, and P. K. Allen, “Dimensionality reduction for hand-independent dexterous robotic grasping,”The International Journal of Robotics Research, vol. 28, no. 7, pp. 851–871, 2009

work page 2009

[26] [26]

Affordance detection of tool parts from geometric features,

A. Myers, C. L. Teo, C. Ferm ¨uller, and Y . Aloimonos, “Affordance detection of tool parts from geometric features,” inIEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 1374– 1381

work page 2015

[27] [27]

Affordancenet: An end-to-end deep learning approach for object affordance detection,

T.-T. Do, A. Nguyen, and I. Reid, “Affordancenet: An end-to-end deep learning approach for object affordance detection,” inIEEE International Conference on Robotics and Automation (ICRA), 2018

work page 2018

[28] [28]

Affordances from human videos as a versatile representation for robotics,

S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak, “Affordances from human videos as a versatile representation for robotics,”CVPR, 2023

work page 2023

[29] [29]

One-shot open affordance learning with foundation models,

G. Li, D. Sun, L. Sevilla-Lara, and V . Jampani, “One-shot open affordance learning with foundation models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024

[30] [30]

DINOv2: Learning robust visual features without supervision,

M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. HAZIZA, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y . Huang, S.-W. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “DINOv2: Learning robust visual features without ...

work page 2024

[31] [31]

Cotracker: It is better to track together,

N. Karaev, I. Rocco, B. Graham, N. Neverova, A. Vedaldi, and C. Rupprecht, “Cotracker: It is better to track together,” inEuropean conference on computer vision. Springer, 2024, pp. 18–35

work page 2024

[32] [32]

Baku: An efficient transformer for multi-task policy learning, 2024

S. Haldar, Z. Peng, and L. Pinto, “Baku: An efficient transformer for multi-task policy learning,” 2024. [Online]. Available: https: //arxiv.org/abs/2406.07539

work page arXiv 2024

[33] [33]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

T. Z. Zhao, V . Kumar, S. Levine, and C. Finn, “Learning fine-grained bimanual manipulation with low-cost hardware,” 2023. [Online]. Available: https://arxiv.org/abs/2304.13705

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

Learning latent plans from play,

C. Lynch, M. Khansari, T. Xiao, V . Kumar, J. Tompson, S. Levine, and P. Sermanet, “Learning latent plans from play,” 2019. [Online]. Available: https://arxiv.org/abs/1903.01973

work page arXiv 2019

[35] [35]

Behavior transformers: Cloning k modes with one stone, 2022

N. M. M. Shafiullah, Z. J. Cui, A. Altanzaya, and L. Pinto, “Behavior transformers: Cloningkmodes with one stone,” 2022. [Online]. Available: https://arxiv.org/abs/2206.11251

work page arXiv 2022

[36] [36]

π0: A vision-language-action flow model for general robot control,

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, L. X. Shi, J. Tanner, Q. Vuong, A. Walling, H. Wang, and U. Zhilinsky, “π0: A vision-language-action flow model for general robot control,”

work page

[37] [37]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

[Online]. Available: https://arxiv.org/abs/2410.24164

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

Robot learning as an empirical science: Best practices for policy evaluation,

H. Kress-Gazit, K. Hashimoto, N. Kuppuswamy, P. Shah, P. Horgan, G. Richardson, S. Feng, and B. Burchfiel, “Robot learning as an empirical science: Best practices for policy evaluation,”arXiv preprint arXiv:2409.09491, 2024

work page arXiv 2024

[39] [39]

You Only Look Once: Unified, Real-Time Object Detection

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” 2016. [Online]. Available: https://arxiv.org/abs/1506.02640

work page internal anchor Pith review Pith/arXiv arXiv 2016

[40] [40]

Segment Anything

A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, P. Doll ´ar, and R. Girshick, “Segment anything,” 2023. [Online]. Available: https://arxiv.org/abs/2304.02643

work page internal anchor Pith review Pith/arXiv arXiv 2023