AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation
Pith reviewed 2026-05-18 10:15 UTC · model grok-4.3
The pith
AFFORD2ACT distills a minimal set of semantic 2D keypoints from a text prompt and single image to produce a compact 38-dimensional policy for robotic manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AFFORD2ACT uses affordance filtering driven by language and vision to automatically select manipulation-centric 2D keypoints, then feeds the resulting 38-dimensional vector into a gated transformer policy that reaches 82 percent success on unseen objects, novel categories, backgrounds, and distractors across real-world tasks.
What carries the argument
The affordance-guided automatic keypoint selection pipeline that converts a text prompt and one image into a small set of semantic 2D points for policy input.
If this is right
- Policies can be trained in 15 minutes without dense point clouds or full images; the abstract does not state the hardware used.
- The same keypoint set supports real-time control across varied backgrounds and distractors without retraining.
- No task-specific manual keypoint design or extra data collection is required for new categories.
Where Pith is reading between the lines
- The approach may lower barriers for deploying manipulation skills on resource-limited robots by shrinking the observation space.
- Similar language-vision keypoint distillation could be tested on other control problems such as navigation or assembly.
Load-bearing premise
Affordance information extracted from a text prompt and single image will reliably identify manipulation-relevant keypoints that remain effective for objects and categories never seen during training.
What would settle it
Run the trained 38-dimensional policy on a new manipulation task involving objects from a category absent from the original experiments, and check whether the success rate drops substantially below 82 percent.
Original abstract
Vision-based robot learning often relies on dense image or point-cloud inputs, which are computationally heavy and entangle irrelevant background features. Existing keypoint-based approaches can focus on manipulation-centric features and be lightweight, but either depend on manual heuristics or task-coupled selection, limiting scalability and semantic understanding. To address this, we propose AFFORD2ACT, an affordance-guided framework that distills a minimal set of semantic 2D keypoints from a text prompt and a single image. AFFORD2ACT follows a three-stage pipeline: affordance filtering, category-level keypoint construction, and transformer-based policy learning with embedded gating to reason about the most relevant keypoints, yielding a compact 38-dimensional state policy that can be trained in 15 minutes, which performs well in real-time without proprioception or dense representations. Across diverse real-world manipulation tasks, AFFORD2ACT consistently improves data efficiency, achieving an 82% success rate on unseen objects, novel categories, backgrounds, and distractors.
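The abstract mentions "embedded gating" over keypoints but gives no architectural detail. The sketch below is one possible reading, not the authors' implementation. It assumes, purely for illustration, that the 38-dimensional state is 19 keypoints with (x, y) coordinates each, and that gating is a per-keypoint sigmoid weight applied to a linear embedding. The names `gated_keypoint_tokens`, `w_embed`, `w_gate`, and `b_gate` are hypothetical.

```python
import numpy as np

def gated_keypoint_tokens(keypoints_xy, w_embed, w_gate, b_gate=0.0):
    """Embed each 2D keypoint as a token and scale it by a scalar gate.

    keypoints_xy: (K, 2) normalized image coordinates.
    w_embed:      (2, D) linear embedding weights (hypothetical).
    w_gate:       (D,) gate weights (hypothetical).
    Returns the gated tokens (K, D) and the gate values (K,).
    """
    tokens = keypoints_xy @ w_embed             # (K, D) one token per keypoint
    gate_logits = tokens @ w_gate + b_gate      # (K,) one logit per keypoint
    gates = 1.0 / (1.0 + np.exp(-gate_logits))  # sigmoid squashes to [0, 1]
    return tokens * gates[:, None], gates

rng = np.random.default_rng(0)
K, D = 19, 16                                   # assumed sizes, not from the paper
keypoints = rng.uniform(0.0, 1.0, size=(K, 2))
state = keypoints.reshape(-1)                   # flat 38-dimensional observation
w_embed = rng.normal(size=(2, D))
w_gate = rng.normal(size=(D,))
gated, gates = gated_keypoint_tokens(keypoints, w_embed, w_gate)
print(state.shape, gated.shape, gates.shape)
```

In a design like this, keypoints with gates near zero contribute little to downstream attention, which is one way a policy could down-weight irrelevant points.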
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents AFFORD2ACT, an affordance-guided framework for distilling a minimal set of semantic 2D keypoints from a text prompt and single image to support generalizable, lightweight robotic manipulation. The three-stage pipeline consists of affordance filtering, category-level keypoint construction, and gated transformer policy learning, producing a compact 38-dimensional state representation. The work claims this policy trains in 15 minutes, runs in real time without proprioception or dense inputs, and achieves an 82% success rate on unseen objects, novel categories, backgrounds, and distractors across diverse real-world tasks.
Significance. If the empirical results are robust, the approach could meaningfully advance scalable robot learning by automating extraction of manipulation-centric features, thereby improving data efficiency and reducing reliance on heavy visual representations. The combination of rapid training, real-time operation, and cross-category generalization without manual tuning or extra data would be a practical contribution to the field.
major comments (2)
- [Abstract] Abstract: the central performance claim of an 82% success rate is presented without baselines, error bars, dataset sizes, ablation studies, or statistical details. This omission is load-bearing because it prevents verification that the affordance-guided keypoint stages, rather than other factors, drive the reported gains on novel categories and distractors.
- [Method (affordance filtering and category-level keypoint construction)] Affordance filtering and category-level keypoint construction stages: the generalization claims rest on the untested assumption that a single image plus text prompt reliably yields stable, task-relevant 2D points for unseen categories. No quantitative results on prompt sensitivity, keypoint consistency metrics, or failure-mode analysis across category shifts are supplied, which directly undermines the “no manual tuning” and 82% success assertions.
minor comments (1)
- [Abstract] Abstract: the 15-minute training time and 38-dimensional state size are stated without reference to the specific hardware or policy architecture details that would allow readers to reproduce the efficiency claim.
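To make the first major comment concrete, the sketch below shows how much the uncertainty around an 82% success rate depends on the number of trials. It computes a Wilson score interval for a binomial proportion. The trial counts (50 and 200) are hypothetical, since the abstract does not report how many trials were run.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials)
    ) / denom
    return center - half, center + half

# Hypothetical trial counts; the reviewed abstract does not state them.
for trials in (50, 200):
    successes = round(0.82 * trials)
    low, high = wilson_interval(successes, trials)
    print(f"{successes}/{trials}: [{low:.3f}, {high:.3f}]")
```

The interval narrows as the trial count grows, which is why the same headline percentage can be strong or weak evidence depending on details the abstract omits.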
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments point by point below and indicate the revisions made to the manuscript.
- Referee: [Abstract] Abstract: the central performance claim of an 82% success rate is presented without baselines, error bars, dataset sizes, ablation studies, or statistical details. This omission is load-bearing because it prevents verification that the affordance-guided keypoint stages, rather than other factors, drive the reported gains on novel categories and distractors.
  Authors: We agree that the abstract would benefit from additional context to support the central claim. The body of the manuscript already contains baseline comparisons (Section 4.2), ablation studies isolating the affordance-guided keypoint stages (Section 4.3), dataset details with repeated trials across objects and scenes, error bars from multiple runs, and statistical tests. We have revised the abstract to include a brief clause referencing these experimental validations and noting that ablations confirm the contribution of the keypoint selection pipeline to generalization on novel categories and distractors.
  revision: yes
- Referee: [Method (affordance filtering and category-level keypoint construction)] Affordance filtering and category-level keypoint construction stages: the generalization claims rest on the untested assumption that a single image plus text prompt reliably yields stable, task-relevant 2D points for unseen categories. No quantitative results on prompt sensitivity, keypoint consistency metrics, or failure-mode analysis across category shifts are supplied, which directly undermines the “no manual tuning” and 82% success assertions.
  Authors: We recognize the value of direct quantitative support for keypoint stability. In the revised manuscript we have added keypoint consistency metrics (standard deviation of 2D locations across multiple text prompts for the same object and across instances within a category) together with a short failure-mode discussion in the supplementary material. These additions, drawn from re-analysis of our existing experimental data, provide evidence that the single-image-plus-text pipeline produces sufficiently stable points to support the reported generalization without manual tuning.
  revision: yes
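The consistency metric named in this response (standard deviation of 2D keypoint locations across text prompts) could be computed roughly as follows. This is a minimal sketch rather than the authors' code. It assumes keypoints are already matched by index across prompts and stored as a (prompts, keypoints, 2) array in pixels; the function name `keypoint_spread` and the demo values are hypothetical.

```python
import numpy as np

def keypoint_spread(keypoints):
    """Spread of each keypoint across prompts.

    keypoints: array of shape (P, K, 2), P prompts, K keypoints, (x, y) pixels.
    Assumes keypoint k refers to the same semantic point under every prompt.
    Returns per-axis standard deviation (K, 2) and a radial spread (K,).
    """
    kp = np.asarray(keypoints, dtype=float)
    if kp.ndim != 3 or kp.shape[-1] != 2:
        raise ValueError("expected an array of shape (P, K, 2)")
    per_axis_std = kp.std(axis=0)                        # population std over prompts
    radial = np.sqrt((per_axis_std ** 2).sum(axis=-1))   # combine x and y spread
    return per_axis_std, radial

# Made-up pixel coordinates: 3 prompts, 2 keypoints.
demo = np.array([
    [[100.0, 50.0], [10.0, 20.0]],
    [[100.0, 50.0], [12.0, 20.0]],
    [[100.0, 50.0], [14.0, 20.0]],
])
per_axis_std, radial = keypoint_spread(demo)
print(per_axis_std)
print(radial)
```

A small radial spread relative to object size would support the stability claim; the same function applies across instances within a category if the first axis indexes instances instead of prompts.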
Circularity Check
No circularity: empirical pipeline with independent validation
The paper presents AFFORD2ACT as a three-stage empirical pipeline (affordance filtering from text+image, category-level keypoint construction, gated transformer policy) whose 82% success rate on novel objects is reported as an experimental outcome rather than a quantity derived by construction from fitted inputs or self-citations. No equations appear that equate a prediction to a parameter defined by the method itself, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The central claims rest on real-world trials with unseen categories, backgrounds, and distractors, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Affordance information extracted from a text prompt and single image can identify manipulation-relevant regions and keypoints that generalize to unseen objects and categories.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: AFFORD2ACT follows a three-stage pipeline: affordance filtering, category-level keypoint construction, and transformer-based policy learning with embedded gating... yielding a compact 38-dimensional state policy
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: We use One-Shot Open Affordance model f_aff to produce a heatmap H = f_aff(I1, A) ... binary mask M(x,y) = [H(x,y) >= tau_q(H)]
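The thresholding step in the passage quoted above, M(x,y) = [H(x,y) >= tau_q(H)], can be illustrated with a short sketch. It reads tau_q(H) as the q-th quantile of the heatmap values, which is an assumption since the quoted passage is elided. The affordance model f_aff is not reproduced here; a synthetic heatmap stands in for its output, and the function name `affordance_mask` and the value q=0.9 are hypothetical.

```python
import numpy as np

def affordance_mask(heatmap, q=0.9):
    """Binary mask keeping pixels at or above the q-th quantile of the heatmap.

    heatmap: 2D array of affordance scores (stand-in for f_aff output).
    q:       quantile in [0, 1]; 0.9 is an illustrative choice, not from the paper.
    """
    h = np.asarray(heatmap, dtype=float)
    tau = np.quantile(h, q)   # data-dependent threshold tau_q(H)
    return h >= tau, tau

# Synthetic 10x10 heatmap with scores increasing from 0 to 1.
heatmap = np.arange(100, dtype=float).reshape(10, 10) / 99.0
mask, tau = affordance_mask(heatmap, q=0.9)
print(mask.sum(), round(float(tau), 3))
```

Because the threshold is a quantile rather than a fixed score, the mask keeps roughly the same fraction of pixels regardless of how the heatmap is scaled.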
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Semantic-Contact Fields for Category-Level Generalizable Tactile Tool Manipulation
  SCFields fuses semantics and contact data in a sim-to-real pipeline to enable category-level generalization for tactile tool manipulation with diffusion policies.