Assessing VLM-Driven Semantic-Affordance Inference for Non-Humanoid Robot Morphologies

Jess Jones; Raul Santos-Rodriguez; Sabine Hauert

arXiv: 2604.19509 · v1 · submitted 2026-04-21 · cs.RO · cs.MA


Pith reviewed 2026-05-10 01:57 UTC · model grok-4.3

classification: cs.RO, cs.MA

keywords: vision-language models, affordance inference, non-humanoid robots, robot morphologies, semantic understanding, conservative predictions, false negative bias


The pith

VLMs generalize to non-humanoid robot affordances but with a consistent conservative bias.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether vision-language models that work well for human-like object interactions can also determine what non-humanoid robots can do with objects. It builds a hybrid dataset of real robotic examples mixed with synthetic ones generated by the models themselves, then measures performance across different robot body types and object categories. The analysis finds that the models do extend to unusual robot shapes but produce inconsistent results by object domain. They almost never suggest unsafe or impossible uses yet frequently overlook valid ones, especially in new or unconventional situations.

Core claim

While VLMs show promising generalisation to non-humanoid robot forms, their performance is notably inconsistent across different object domains. Low false positive rates paired with high false negative rates appear consistently across all morphologies and object categories, indicating that VLMs tend toward conservative affordance predictions. This tendency is particularly pronounced for novel tool use scenarios and unconventional object manipulations.
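To make the low-false-positive, high-false-negative pattern concrete, the following minimal sketch computes both rates from binary confusion-matrix counts (affordance valid vs. not valid). The function name `error_rates` and all counts are assumptions made for illustration; none of the numbers come from the paper.

```python
def error_rates(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute error rates from binary confusion-matrix counts."""

    def ratio(num: int, den: int) -> float:
        # Guard against empty classes instead of raising ZeroDivisionError.
        return num / den if den else 0.0

    return {
        "fpr": ratio(fp, fp + tn),        # invalid affordances wrongly predicted
        "fnr": ratio(fn, fn + tp),        # valid affordances that were missed
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


# Hypothetical counts, chosen only to mimic a conservative predictor.
rates = error_rates(tp=40, fp=5, tn=95, fn=60)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

With these made-up counts the false positive rate is 0.05 while the false negative rate is 0.60: the predictor rarely proposes an invalid affordance but misses most valid ones, which is the shape of behaviour the core claim describes.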

What carries the argument

The hybrid dataset of annotated real-world robotic affordance-object relations combined with VLM-generated synthetic scenarios, used as the testbed to measure affordance inference performance across morphologies.

If this is right

VLMs can be integrated into robotic systems while retaining safety advantages from their low false-positive rates.
Complementary techniques are required to reduce the high false-negative rates and improve coverage of valid affordances.
Affordance performance varies enough by object category that domain-specific adjustments or additional signals will be needed.
Novel tool-use and unconventional manipulation cases are the areas where the conservative bias is strongest and most limiting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The conservative tendency may make VLMs more useful as safety filters than as primary planners in robot control loops.
Extending tests to additional morphologies such as aerial or soft robots could reveal whether the bias is morphology-dependent.
Pairing VLM outputs with geometric or physics simulation checks might offset the missed affordances without raising false-positive risk.
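One way to picture the pairing idea is a two-stage triage: affordances the VLM accepts pass through, and affordances it rejects get a second look from an independent check. The sketch below assumes such a check exists; `Candidate`, `triage`, and both predicates are hypothetical names, and the toy sets stand in for a real VLM and a real geometric or physics check.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Candidate:
    obj: str
    affordance: str


Check = Callable[[Candidate], bool]


def triage(candidates: Iterable[Candidate], vlm_accepts: Check, sim_accepts: Check):
    """Split candidates into VLM-accepted, recovered by the second check, and rejected."""
    accepted, recovered, rejected = [], [], []
    for cand in candidates:
        if vlm_accepts(cand):
            accepted.append(cand)    # the VLM already proposes this affordance
        elif sim_accepts(cand):
            recovered.append(cand)   # missed by the VLM, passed by the second check
        else:
            rejected.append(cand)    # neither stage supports it
    return accepted, recovered, rejected


# Toy stand-ins (assumptions): sets of affordances each stage would accept.
candidates = [
    Candidate("box", "push"),
    Candidate("plank", "use as ramp"),
    Candidate("wall", "lift"),
]
vlm_yes = {Candidate("box", "push")}
sim_yes = {Candidate("box", "push"), Candidate("plank", "use as ramp")}

accepted, recovered, rejected = triage(
    candidates, vlm_yes.__contains__, sim_yes.__contains__
)
print("accepted:", accepted)
print("recovered:", recovered)
print("rejected:", rejected)
```

Whether this keeps the false-positive rate low depends entirely on how reliable the second check is; the sketch only shows where such a check would sit, not that it works.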

Load-bearing premise

The hybrid dataset of annotated real-world robotic affordance-object relations combined with VLM-generated synthetic scenarios provides a valid and representative testbed for measuring true affordance inference performance across non-humanoid morphologies.

What would settle it

A new evaluation that uses only human-annotated real-world data for the same non-humanoid morphologies and objects, yet still produces the same high false-negative pattern, would support the claim; the opposite result would indicate the conservatism arises mainly from how the hybrid dataset was built.

Figures

Figures reproduced from arXiv: 2604.19509 by Jess Jones, Raul Santos-Rodriguez, Sabine Hauert.

**Figure 1.** An illustration of our Semantic-Affordance Mapping pipeline applied to a non-humanoid robot (1) The camera feed…

**Figure 2.** Affordance-object inference F1 scores and standard deviation over five independent trials. Objects have been clustered…

**Figure 3.** Confusion matrices across the three VLMs for aggregated performance of True-Positive (green), False-Positive (red),…

**Figure 4.** Examples of VLM semantic-affordance inference mapped to bounding boxes with GroundingDINO. The top row…

Original abstract

Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding human-object interactions, but their application to robotic systems with non-humanoid morphologies remains largely unexplored. This work investigates whether VLMs can effectively infer affordances for robots with fundamentally different embodiments than humans, addressing a critical gap in the deployment of these models for diverse robotic applications. We introduce a novel hybrid dataset that combines annotated real-world robotic affordance-object relations with VLM-generated synthetic scenarios, and perform an empirical analysis of VLM performance across multiple object categories and robot morphologies, revealing significant variations in affordance inference capabilities. Our experiments demonstrate that while VLMs show promising generalisation to non-humanoid robot forms, their performance is notably inconsistent across different object domains. Critically, we identify a consistent pattern of low false positive rates but high false negative rates across all morphologies and object categories, indicating that VLMs tend toward conservative affordance predictions. Our analysis reveals that this pattern is particularly pronounced for novel tool use scenarios and unconventional object manipulations, suggesting that effective integration of VLMs in robotic systems requires complementary approaches to mitigate over-conservative behaviour while preserving the inherent safety benefits of low false positive rates.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives some empirical data on VLM conservatism for non-humanoid robots but the VLM-generated synthetic data undercuts how independent the test really is.


The main thing to know is that VLMs show some generalization to non-humanoid robot bodies but perform inconsistently across object types and lean conservative, with low false positives and high false negatives that could limit their usefulness in practice. The work flags this pattern especially in novel tool-use cases and suggests complementary methods to balance safety with better coverage. That observation is the useful part.

What is actually new is the explicit targeting of non-humanoid morphologies, which the abstract positions as an unexplored area, plus the attempt to quantify the conservative bias across morphologies and categories using a hybrid dataset. The paper does a reasonable job of laying out the empirical setup and connecting the bias pattern to real deployment implications, such as preserving low false positives for safety while noting the downside of missed affordances. It stays grounded in the robotics context without overclaiming broader AI advances.

The soft spot is the hybrid dataset itself. A substantial portion comes from VLM-generated synthetic scenarios, and without clear separation between the generator and evaluator models, human validation of those labels, or an ablation restricted to the real-world annotated portion, the reported inconsistency and bias could partly reflect the generator's own priors rather than independent evidence about embodiment differences. The abstract does not spell out those safeguards, so the central claims rest on weaker footing than they appear. Minor gaps like missing error bars or exact quantitative breakdowns in the summary are secondary but would need fixing for clarity.

This paper is for roboticists and embodied-AI researchers who are already working on VLM integration for non-standard robot platforms. A reader in that niche would pick up practical notes on limitations and the safety trade-off, even if the results are incremental. It deserves peer review because the gap it addresses is real and the empirical direction is worth refining rather than discarding outright.

Referee Report

1 major / 1 minor

Summary. The paper claims that VLMs generalize affordance inference to non-humanoid robot morphologies with promising but inconsistent performance across object domains. Using a hybrid dataset of real-world annotated affordance relations and VLM-generated synthetic scenarios, experiments reveal a consistent conservative bias (low false-positive rates paired with high false-negative rates) that is especially pronounced for novel tool-use and unconventional manipulations, implying that VLMs require complementary methods to reduce over-conservatism while retaining safety benefits.

Significance. If the empirical patterns are shown to be independent of dataset construction artifacts, the work would usefully document embodiment-related limitations in VLM affordance reasoning and motivate hybrid VLM-plus-verification pipelines for diverse robot platforms. It addresses an underexplored gap between human-centric VLM benchmarks and non-humanoid robotic deployment.

major comments (1)

[Section 3] Dataset Construction (Section 3): the hybrid dataset description states that a substantial portion consists of VLM-generated synthetic scenarios, yet provides no explicit statement that the evaluator VLMs are distinct from the generator, no human validation protocol for the synthetic labels, and no ablation restricted to the real-world annotated subset. Because the central claims of domain inconsistency and the low-FP/high-FN conservative pattern rest on these observations, the reported generalization results risk being artifacts of the generator VLM's own priors rather than independent evidence of embodiment effects.

minor comments (1)

[Results] The abstract and results sections report performance patterns without accompanying quantitative tables, confidence intervals, or per-morphology/per-category breakdowns; adding these (with explicit sample sizes) would improve reproducibility and allow readers to assess the magnitude of the reported inconsistency.
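The kind of breakdown this comment asks for can be sketched as follows: per-group F1 computed for each trial, then summarised as a mean and sample standard deviation alongside explicit sample sizes. The group names, counts, and function names are hypothetical placeholders, not the paper's categories or results.

```python
from statistics import mean, stdev


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarise(trials_by_group: dict[str, list[tuple[int, int, int]]]) -> dict:
    """Summarise per-trial (tp, fp, fn) counts for each group."""
    summary = {}
    for group, trials in trials_by_group.items():
        scores = [f1_score(tp, fp, fn) for tp, fp, fn in trials]
        summary[group] = {
            "n_trials": len(scores),
            "n_positives": [tp + fn for tp, _, fn in trials],  # sample size per trial
            "mean_f1": mean(scores),
            "std_f1": stdev(scores) if len(scores) > 1 else 0.0,
        }
    return summary


# Hypothetical counts for five trials per group (illustrative only).
trials = {
    "kitchenware": [(18, 1, 7), (17, 2, 8), (19, 1, 6), (18, 0, 7), (16, 1, 9)],
    "tools": [(9, 1, 16), (11, 0, 14), (8, 1, 17), (10, 2, 15), (9, 0, 16)],
}
summary = summarise(trials)
for group, stats in summary.items():
    print(f"{group}: F1 = {stats['mean_f1']:.3f} +/- {stats['std_f1']:.3f} "
          f"over {stats['n_trials']} trials")
```

Reporting the per-trial sample sizes next to each mean lets a reader judge whether a gap between groups is large relative to trial-to-trial variation.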

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of our dataset construction that require clarification to strengthen the paper's claims. We address the single major comment below and will incorporate the suggested improvements in the revised manuscript.


Referee: [Section 3] Dataset Construction (Section 3): the hybrid dataset description states that a substantial portion consists of VLM-generated synthetic scenarios, yet provides no explicit statement that the evaluator VLMs are distinct from the generator, no human validation protocol for the synthetic labels, and no ablation restricted to the real-world annotated subset. Because the central claims of domain inconsistency and the low-FP/high-FN conservative pattern rest on these observations, the reported generalization results risk being artifacts of the generator VLM's own priors rather than independent evidence of embodiment effects.

Authors: We acknowledge the validity of this concern and agree that additional transparency is needed. In the revised manuscript, we will explicitly identify the distinct VLMs used for synthetic scenario generation versus evaluation to rule out circularity. We will also document the protocol for synthetic label validation, including any human review steps performed during dataset curation. To directly mitigate the risk of artifacts, we will add an ablation analysis restricted to the real-world annotated subset and report whether the conservative bias (low FP, high FN) and domain inconsistencies persist in that subset alone. These changes will provide clearer evidence that the observed patterns are attributable to embodiment effects rather than generator priors.

Revision: yes
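An ablation of the kind discussed in this exchange could be as simple as tagging every evaluation record with its data source and recomputing the error rates per source. The sketch below assumes a record layout of `(source, ground_truth, prediction)`; that layout, the function name, and the toy records are illustrative assumptions and do not reflect the paper's data.

```python
from collections import Counter


def rates_by_source(records):
    """records: iterable of (source, ground_truth, prediction) with boolean labels."""
    counts: dict[str, Counter] = {}
    for source, truth, pred in records:
        # Build the confusion-matrix cell name: tp, fp, tn, or fn.
        cell = ("t" if truth == pred else "f") + ("p" if pred else "n")
        counts.setdefault(source, Counter())[cell] += 1

    result = {}
    for source, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        result[source] = {
            "n": positives + negatives,
            "fnr": c["fn"] / positives if positives else 0.0,
            "fpr": c["fp"] / negatives if negatives else 0.0,
        }
    return result


# Hypothetical records (illustrative only).
records = [
    ("real", True, True), ("real", True, False),
    ("real", False, False), ("real", True, False),
    ("synthetic", True, True), ("synthetic", True, False),
    ("synthetic", False, False), ("synthetic", False, True),
]
by_source = rates_by_source(records)
for source, stats in by_source.items():
    print(source, stats)
```

If the high false-negative rate persisted in the `real` subset alone, that would support the embodiment reading; if it appeared only in the `synthetic` subset, the generator's priors would be the more likely explanation.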

Circularity Check

0 steps flagged

No significant circularity in empirical VLM evaluation study


The paper is a purely empirical assessment with no mathematical derivations, equations, fitted parameters, or predictive models. Claims rest on observed performance metrics (low FP/high FN rates, inconsistency across domains) from testing VLMs on the introduced hybrid dataset. No load-bearing step reduces by construction to its own inputs, no self-citations justify uniqueness or ansatz, and no renaming of known results occurs. The hybrid dataset (real annotations plus VLM-generated scenarios) is presented as an experimental testbed rather than a self-referential loop; without explicit quotes showing identical generator/evaluator models or ground-truth labels derived from the evaluated VLM itself, the evaluation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the introduced hybrid dataset and the assumption that VLM performance on synthetic scenarios transfers to real robot affordances; no free parameters, new entities, or non-standard axioms are introduced in the abstract.

axioms (1)

Domain assumption: VLMs have demonstrated remarkable capabilities in understanding human-object interactions.
Invoked in the opening sentence as established background to motivate extension to non-humanoid cases.



