pith. sign in

arxiv: 2606.04172 · v1 · pith:LAL3XJG4new · submitted 2026-06-02 · 💻 cs.RO

Affordance2Action: Task-Conditioned Scene-level Affordance Grounding for Real-Time Manipulation

Pith reviewed 2026-06-28 09:43 UTC · model grok-4.3

classification 💻 cs.RO
keywords affordance groundingtask-conditioned manipulationscene-level affordancerobotic benchmarkpart segmentationmulti-region correspondencemanipulation policy
0
0 comments X

The pith

Task-conditioned manipulation requires grounding instructions to functional parts rather than object categories, often with multiple valid regions in cluttered scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing affordance datasets fail to match task-conditioned manipulation because they emphasize grasping, object-level labels, synthetic data, or single instruction-region pairs. It introduces the A2A framework built around A2A-Bench to supply scene-level annotations that explicitly include both single-region and multi-region correspondences in real everyday environments. The A2A-AffordGen pipeline scales annotation through language-model filtering, interactive segmentation, mask refinement, and human verification. Experiments using the benchmark data show that generic segmentation, vision-language models, and prior affordance methods leave large gaps in task-level accuracy, while the new supervision yields tighter localization and usable spatial priors for manipulation policies.

Core claim

Affordance2Action establishes a benchmark-centered framework in which A2A-Bench provides supervision for scene-dependent, task-conditioned part affordance grounding that accounts for one-to-many correspondences between instructions and functional regions, directly supporting real-time grounding and affordance-conditioned manipulation policies.

What carries the argument

A2A-Bench, the manipulation-oriented benchmark that records single-region and multi-region instruction correspondences in cluttered everyday scenes.

If this is right

  • Generic segmentation and VLM-based grounding methods exhibit substantial gaps when evaluated on task-level localization in multi-object scenes.
  • Models trained with A2A-Bench supervision achieve higher accuracy at identifying task-relevant functional regions.
  • The resulting affordance maps supply spatial priors that improve performance of downstream manipulation policies.
  • Real-time affordance grounding becomes practical once trained on the new scene-level annotations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robotic systems could execute more varied interactions in unstructured settings by querying multiple valid affordance regions instead of committing to one.
  • Annotation pipelines that combine language models with targeted human review may transfer to other perception tasks that involve inherent ambiguity.
  • Future benchmarks in related domains such as instruction following may need explicit multi-correspondence designs to avoid underestimating model capability.

Load-bearing premise

The A2A-AffordGen pipeline of language-model filtering, interactive part segmentation, mask-out refinement, task-reasoning generation, and human verification yields accurate, unbiased annotations that match actual scene affordances.

What would settle it

A controlled study in which independent human annotators consistently select different functional regions for the same task instructions than those stored in A2A-Bench would show the labels do not capture true affordances.

Figures

Figures reproduced from arXiv: 2606.04172 by Enze Yuan, Hanqing Wang, Haoran Du, Jingjin Yu, Litao Liu, Michael Liu, Pengfei Yi, Qi Zhang, Ruiding Feng, Wenbo Yu, Yifan Han, Zilin Yuan.

Figure 1
Figure 1. Figure 1: Overview of Affordance2Action (A2A). A2A is a benchmark-centered learning framework that constructs A2A-Bench with A2A-AffordGen, producing task-conditioned functional-region annotations in natural multi-object scenes. The middle panel visualizes a representative subset of the generated annotations for clarity. A2A-Bench provides supervision for A2A-GroundingModel and supports A2A-Policy by converting grou… view at source ↗
Figure 2
Figure 2. Figure 2: A2A-AffordGen pipeline. SAG (left) casts part segmentation as an iterative point-prompting loop: the policy ingests an ORPS prompt and the current mask, emits a {+, −} point prompt, and a frozen SAM3 applies it. MAG (right) extends SAG to scene-level multi-instance and multi-part annotation by triaging SAM3 text-prompted masks, refining low-confidence instances with SAG, and reassembling the resulting mask… view at source ↗
Figure 3
Figure 3. Figure 3: A2A-GroundingModel and A2A-Policy. A2A adapts a frozen SAM3 backbone with lightweight adapters and text-conditioned visual prompt injection for task-conditioned affordance grounding, and transfers the resulting affordance prior to manipulation through explicit mask highlighting or implicit feature injection. (ii) are expanded via the SimpleClick distance-maximizing oracle [28] into ∼150K click-trajectory s… view at source ↗
Figure 4
Figure 4. Figure 4: Per-step training loss of A2A-AffordGen (red) vs. the LISA-7B finetune baseline [47] (blue) on the same A2A-Bench split with matched optimiser settings. Red triangles are A2A-AffordGen’s validation-loss probes; the solid blue curve is LISA’s running average over its light-blue per-step trace. Discussion. Without any RefCOCO supervision, A2A-AffordGen reaches 75.86/64.78/72.55 cIoU on RefCOCO/+/g, surpassin… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison on A2A-Bench. Each example shows the input image, ground-truth mask, A2A-GroundingModel prediction, binarized UAD prediction, and UAD heatmap. Compared with UAD, A2A￾GroundingModel produces more compact and instruction-consistent affordance masks under task-level TRPS instructions. Please zoom in to see the details. C A2A-Policy Implementation Details and Simulation Results Baselines… view at source ↗
Figure 6
Figure 6. Figure 6: Stack the blue cube on the wooden cube. Start End A2A -Explicit UAD -DP Task: place the phone on the phone stand A2A -Implicit [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Place the phone on the phone stand. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Open the microwave oven. Start End A2A -Explicit A2A -Implicit UAD -DP Task: place the cup on the mat [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Place the cup on the mat. 23 [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
read the original abstract

Task-conditioned manipulation requires grounding instructions to task-relevant functional parts rather than object categories. This setting is scene-dependent and often one-to-many in cluttered scenes: the same object may afford different interactions across tasks, while a single task may correspond to either one functional region or multiple valid functional regions, depending on the scene layout. Existing affordance datasets and benchmarks remain misaligned with this setting, as they typically focus on grasping or object-level affordances, rely on synthetic scenes, or assume a single instruction-region correspondence. We present Affordance2Action (A2A), a benchmark-centered learning framework for scene-level, task-conditioned part affordance grounding. At its core is A2A-Bench, a manipulation-oriented benchmark that covers both single-region and multi-region instruction correspondences in everyday scenes, with the latter highlighting the ambiguity and diversity of affordance grounding in realistic multi-object environments. To construct it at scale, we build A2A-AffordGen, an agent-assisted annotation pipeline that combines language-model filtering, interactive part segmentation, instance-level mask-out refinement, task-reasoning instruction generation, and human verification. A2A-Bench's supervision further supports diverse downstream applications, with real-time affordance grounding and affordance-conditioned manipulation policies as two representative examples. Experiments show that A2A exposes substantial gaps in generic segmentation, VLM-based grounding, and affordance distillation baselines, while improving task-level localization and providing useful spatial priors for downstream manipulation. All datasets and code will be publicly released to promote open research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Affordance2Action (A2A), a benchmark-centered learning framework for scene-level, task-conditioned part affordance grounding aimed at manipulation. It presents A2A-Bench covering single-region and multi-region instruction correspondences in everyday scenes, constructed at scale via the A2A-AffordGen pipeline (LM filtering, interactive part segmentation, instance-level mask-out refinement, task-reasoning instruction generation, and human verification). The work demonstrates applications to real-time affordance grounding and affordance-conditioned manipulation policies, with experiments claiming to expose gaps in generic segmentation, VLM-based grounding, and affordance distillation baselines while improving task-level localization.

Significance. If valid, the benchmark addresses a clear gap by targeting scene-dependent, one-to-many affordance correspondences rather than object-level or grasping-focused datasets. Public release of datasets and code is a concrete strength that supports reproducibility and follow-on work. The downstream policy examples provide a useful bridge from grounding to manipulation.

major comments (1)
  1. [A2A-AffordGen pipeline (Section 3)] A2A-AffordGen pipeline (Section 3): the central claim that A2A-Bench correctly captures scene-dependent one-to-many affordance correspondences rests on the pipeline producing accurate, unbiased labels, yet no quantitative inter-annotator agreement, error rates, or ablation on pipeline stages (LM filtering, segmentation, mask-out refinement) is reported. Human verification alone does not address potential systematic biases in multi-region cases.
minor comments (2)
  1. [Abstract] Abstract: no dataset statistics (number of scenes, instructions, regions), evaluation protocol, or quantitative results are provided, which weakens the ability to assess the scale and strength of the reported gaps and improvements.
  2. [Conclusion] The paper states that all datasets and code will be publicly released; confirming the exact release contents and license in the camera-ready version would strengthen the reproducibility claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the importance of validating the A2A-AffordGen pipeline. We address the concern point-by-point below and commit to strengthening the manuscript accordingly.

read point-by-point responses
  1. Referee: A2A-AffordGen pipeline (Section 3): the central claim that A2A-Bench correctly captures scene-dependent one-to-many affordance correspondences rests on the pipeline producing accurate, unbiased labels, yet no quantitative inter-annotator agreement, error rates, or ablation on pipeline stages (LM filtering, segmentation, mask-out refinement) is reported. Human verification alone does not address potential systematic biases in multi-region cases.

    Authors: We agree that quantitative validation of the annotation pipeline is important for supporting the central claims. In the revised version, we will add inter-annotator agreement metrics (e.g., Cohen's kappa or IoU overlap) computed on a held-out subset of annotations, along with error rates observed during the human verification stage. We will also include ablations isolating the impact of each pipeline component (LM filtering, interactive part segmentation, instance-level mask-out refinement, and task-reasoning instruction generation) on final label quality and downstream task performance. For multi-region cases specifically, the human verification protocol required annotators to flag and resolve ambiguous or inconsistent correspondences; we will augment the revision with a dedicated breakdown of multi-region annotation statistics and any detected biases to directly address this point. revision: yes

Circularity Check

0 steps flagged

No significant circularity; new benchmark and pipeline are independent contributions

full rationale

The paper introduces A2A-Bench via the A2A-AffordGen annotation pipeline (LM filtering, interactive segmentation, mask refinement, instruction generation, human verification) and evaluates baselines on the resulting data. No equations, fitted parameters, or derivations are present that reduce by construction to the inputs. The central experimental claims compare methods on newly generated annotations rather than predicting quantities from self-referential fits or self-citation chains. The pipeline's accuracy is an unverified assumption (as noted by the skeptic), but this is a validity concern, not a circular reduction of the derivation to its own outputs. No self-citations or uniqueness theorems from prior author work are invoked as load-bearing in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Abstract-only review yields limited visibility into internal parameters or assumptions; the central contribution rests on the unverified quality of the generated benchmark labels and the representativeness of the scenes chosen for annotation.

axioms (1)
  • domain assumption The agent-assisted annotation pipeline produces accurate and representative labels for real cluttered scenes
    The entire benchmark and downstream claims depend on this pipeline step being reliable; invoked in the description of A2A-Bench construction.
invented entities (2)
  • A2A-Bench no independent evidence
    purpose: Manipulation-oriented benchmark covering single- and multi-region task correspondences
    New dataset introduced by the paper; no independent evidence outside this work.
  • A2A-AffordGen no independent evidence
    purpose: Agent-assisted annotation pipeline for scaling benchmark creation
    New pipeline introduced by the paper; no independent evidence outside this work.

pith-pipeline@v0.9.1-grok · 5852 in / 1499 out tokens · 24355 ms · 2026-06-28T09:43:09.759260+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

58 extracted references · 16 canonical work pages · 10 internal anchors

  1. [1]

    J. Li, Y . Zhu, Z. Tang, J. Wen, M. Zhu, X. Liu, C. Li, R. Cheng, Y . Peng, Y . Peng, et al. Coa-vla: Improving vision-language-action models via visual-text chain-of-affordance. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9759–9769, 2025

  2. [2]

    D. Wu, Y . Fu, S. Huang, Y . Liu, F. Jia, N. Liu, F. Dai, T. Wang, R. M. Anwer, F. S. Khan, et al. Ragnet: Large-scale reasoning-based affordance segmentation benchmark towards gen- eral grasping. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11980–11990, 2025

  3. [3]

    Z. Wan, Y . Xie, C. Zhang, Z. Lin, Z. Wang, S. Stepputtis, D. Ramanan, and K. P. Sycara. Instructpart: Task-oriented part segmentation with instruction reasoning. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 24202–24227, 2025

  4. [4]

    Y . Tang, W. Huang, Y . Wang, C. Li, R. Yuan, R. Zhang, J. Wu, and L. Fei-Fei. Uad: Un- supervised affordance distillation for generalization in robotic manipulation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3822–3831. IEEE, 2025

  5. [5]

    J. Li, Y . Feng, Y . Guo, J. Huang, W. Ji, Q. Bi, Y . Piao, M. Zhang, X. Zhao, Q. Chen, et al. Sam3-i: Segment anything with instructions.arXiv preprint arXiv:2512.04585, 2025

  6. [6]

    SAM 3: Segment Anything with Concepts

    N. Carion, L. Gustafson, Y .-T. Hu, S. Debnath, R. Hu, D. Suris, C. Ryali, K. V . Alwala, H. Khedr, A. Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025

  7. [7]

    Y . Qing, Y . Chi, S. Chen, S. Liu, K. Yao, S. Lin, L. Liu, and C. Zou. Bitrajdiff: Bidirectional trajectory generation with diffusion models for offline reinforcement learning.arXiv preprint arXiv:2506.05762, 2025

  8. [8]

    S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun. Objects365: A large- scale, high-quality dataset for object detection. InProceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019

  9. [9]

    M. Zhu, Y . Tian, H. Chen, C. Zhou, Q. Guo, Y . Liu, M. Yang, and C. Shen. Segagent: Ex- ploring pixel understanding capabilities in mllms by imitating human annotator trajectories. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3686–3696, 2025

  10. [10]

    C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

  11. [11]

    G. Li, V . Jampani, D. Sun, and L. Sevilla-Lara. Locate: Localize and transfer object parts for weakly supervised affordance grounding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10922–10931, 2023

  12. [12]

    L. Liu, W. Wang, Y . Han, Z. Xie, P. Yi, J. Li, and W. Lian. Foam: Foresight-augmented multi-task imitation policy for robotic manipulation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18460–18468, 2026

  13. [13]

    G. Li, D. Sun, L. Sevilla-Lara, and V . Jampani. One-shot open affordance learning with foun- dation models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3086–3096, 2024

  14. [14]

    H. Luo, W. Zhai, J. Zhang, Y . Cao, and D. Tao. Learning affordance grounding from exo- centric images. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2252–2261, 2022. 10

  15. [15]

    Y . Yang, W. Zhai, H. Luo, Y . Cao, J. Luo, and Z.-J. Zha. Grounding 3d object affordance from 2d interactions in images. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 10905–10915, 2023

  16. [16]

    S. Deng, X. Xu, C. Wu, K. Chen, and K. Jia. 3d affordancenet: A benchmark for visual object affordance understanding. Inproceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1778–1787, 2021

  17. [17]

    S. Lin, J. Chen, H. Xu, Z. Li, G. Wang, Y . Jing, S. Xu, R. Zhao, B. Sheil, L.-P. Chau, et al. Roboflow4d: A lightweight flow world model toward real-time flow-guided robotic manipula- tion.arXiv preprint arXiv:2605.17522, 2026

  18. [18]

    H. Wang, M. Liu, X. Chen, C. Ma, Y . Zhong, W. Yin, Y . Liu, Z. Cui, J. Yuan, L. Dai, et al. Videoafford: Grounding 3d affordance from human-object-interaction videos via multimodal large language model.arXiv preprint arXiv:2602.09638, 2026

  19. [19]

    S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023

  20. [20]

    S. Lin, Y . Qing, L. Liu, M. Zhou, R. Jin, X. Fan, and G. Liu. Dygro-vla: Cross-task scaling of vision-language-action models via dynamic grouped residual optimization.arXiv preprint arXiv:2605.17486, 2026

  21. [21]

    C. Yu, H. Wang, Y . Shi, H. Luo, S. Yang, J. Yu, and J. Wang. Seqafford: Sequential 3d affordance reasoning via multimodal large language model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1691–1701, 2025

  22. [22]

    Zhang, K

    Z. Zhang, K. Chen, H. Wang, H. Zhang, H. H. Chen, C. Liao, L. Guo, and Y .-C. Chen. A4-agent: An agentic framework for zero-shot affordance reasoning.arXiv preprint arXiv:2512.14442, 2025

  23. [23]

    S. Qian, W. Chen, M. Bai, X. Zhou, Z. Tu, and L. E. Li. Affordancellm: Grounding affordance from vision language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7587–7597, 2024

  24. [24]

    Zhang, Y

    Z. Zhang, Y . Shi, L. Yang, S. Ni, Q. Ye, and J. Wang. Openhoi: Open-world hand-object interaction synthesis with multimodal large language model.Advances in Neural Information Processing Systems, 38:166582–166612, 2026

  25. [25]

    H. Wang, S. Wang, Y . Zhong, Z. Yang, J. Wang, Z. Cui, J. Yuan, Y . Han, M. Liu, and Y . Ma. Affordance-r1: Reinforcement learning for generalizable affordance reasoning in multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 9738–9746, 2026

  26. [26]

    Yu, C.-H

    Y . Yu, C.-H. H. Yang, J. Kolehmainen, P. G. Shivakumar, Y . Gu, S. R. R. Ren, Q. Luo, A. Gourav, I.-F. Chen, Y .-C. Liu, et al. Low-rank adaptation of large language model rescoring for parameter-efficient speech recognition. In2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 1–8. IEEE, 2023

  27. [27]

    Q. Team. Qwen3. 5-omni technical report.arXiv preprint arXiv:2604.15804, 2026

  28. [28]

    Q. Liu, Z. Xu, G. Bertasius, and M. Niethammer. Simpleclick: Interactive image segmentation with simple vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22290–22300, 2023

  29. [29]

    N. Xu, B. Price, S. Cohen, J. Yang, and T. S. Huang. Deep interactive object selection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 373– 381, 2016. 11

  30. [30]

    S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In European conference on computer vision, pages 38–55. Springer, 2024

  31. [31]

    R. M. Haralick, S. R. Sternberg, and X. Zhuang. Image analysis using mathematical morphol- ogy.IEEE transactions on pattern analysis and machine intelligence, (4):532–550, 1987

  32. [32]

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  33. [33]

    A. Guo, B. Wen, J. Yuan, J. Tremblay, S. Tyree, J. Smith, and S. Birchfield. Handal: A dataset of real-world manipulable object categories with pose annotations, affordances, and reconstructions. In2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 11428–11435. IEEE, 2023

  34. [34]

    Z. Wan, Y . Xie, C. Zhang, Z. Lin, Z. Wang, S. Stepputtis, D. Ramanan, and K. P. Sycara. Instructpart: Affordance-based part segmentation from language instruction. InAAAI-2024 Workshop on Public Sector LLMs: Algorithmic and Sociotechnical Design, 2024

  35. [35]

    H.-S. Fang, C. Wang, M. Gou, and C. Lu. Graspnet-1billion: A large-scale benchmark for general object grasping. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11444–11453, 2020

  36. [36]

    O’Neill, A

    A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

  37. [37]

    C. Zhu, F. Xiao, A. Alvarado, Y . Babaei, J. Hu, H. El-Mohri, S. Culatana, R. Sumbaly, and Z. Yan. Egoobjects: A large-scale egocentric dataset for fine-grained object understanding. InProceedings of the IEEE/CVF international conference on computer vision, pages 20110– 20120, 2023

  38. [38]

    Qian and D

    S. Qian and D. F. Fouhey. Understanding 3d object interaction from a single image. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 21753–21763, 2023

  39. [39]

    T. Z. Zhao, V . Kumar, S. Levine, and C. Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

  40. [40]

    H. Fang, M. Grotz, W. Pumacay, Y . R. Wang, D. Fox, R. Krishna, and J. Duan. Sam2act: In- tegrating visual foundation model with a memory architecture for robotic manipulation.arXiv preprint arXiv:2501.18564, 2025

  41. [41]

    L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expres- sions. InEuropean conference on computer vision, pages 69–85. Springer, 2016

  42. [42]

    Kazemzadeh, V

    S. Kazemzadeh, V . Ordonez, M. Matten, and T. Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014

  43. [43]

    J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016

  44. [44]

    B. Liu, Y . Zhu, C. Gao, Y . Feng, Q. Liu, Y . Zhu, and P. Stone. Libero: Benchmarking knowl- edge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023. 12

  45. [45]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haz- iza, F. Massa, A. El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  46. [46]

    X. Zou, J. Yang, H. Zhang, F. Li, L. Li, J. Wang, L. Wang, J. Gao, and Y . J. Lee. Segment everything everywhere all at once.Advances in neural information processing systems, 36: 19769–19782, 2023

  47. [47]

    X. Lai, Z. Tian, Y . Chen, Y . Li, Y . Yuan, S. Liu, and J. Jia. Lisa: Reasoning segmentation via large language model. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9579–9589, 2024

  48. [48]

    J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  49. [49]

    Kirillov, E

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Lo, et al. Segment anything. InProceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  50. [50]

    Zhong, Z

    Z. Zhong, Z. Tang, T. He, H. Fang, and C. Yuan. Convolution meets lora: Parameter efficient finetuning for segment anything model.arXiv preprint arXiv:2401.17868, 2024

  51. [51]

    Myers, C

    A. Myers, C. L. Teo, C. Ferm ¨uller, and Y . Aloimonos. Affordance detection of tool parts from geometric features. In2015 IEEE International Conference on Robotics and Automation (ICRA), pages 1374–1381, 2015. doi:10.1109/ICRA.2015.7139369

  52. [52]

    Z. Yang, J. Wang, Y . Tang, K. Chen, H. Zhao, and P. H. Torr. Lavt: Language-aware vision transformer for referring image segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18155–18165, 2022

  53. [53]

    Z. Wang, Y . Lu, Q. Li, X. Tao, Y . Guo, M. Gong, and T. Liu. Cris: Clip-driven referring image segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11686–11695, 2022

  54. [54]

    J. Liu, H. Ding, Z. Cai, Y . Zhang, R. K. Satzoda, V . Mahadevan, and R. Manmatha. Poly- former: Referring image segmentation as sequential polygon generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18653–18663, 2023

  55. [55]

    Z. Ren, Z. Huang, Y . Wei, Y . Zhao, D. Fu, J. Feng, and X. Jin. Pixellm: Pixel reasoning with large multimodal model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024

  56. [56]

    R. Pi, L. Yao, J. Gao, J. Zhang, and T. Zhang. Perceptiongpt: Effectively fusing visual per- ception into llm. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 27124–27133, 2024

  57. [57]

    Z. Xia, D. Han, Y . Han, X. Pan, S. Song, and G. Huang. Gsva: Generalized segmentation via multimodal large language models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3858–3869, 2024

  58. [58]

    Positive point:(x, y). Next IoU:v. Gain:δ

    Y .-C. Chen, W.-H. Li, C. Sun, Y .-C. F. Wang, and C.-S. Chen. Sam4mllm: Enhance multi- modal large language model for referring expression segmentation. InEuropean Conference on Computer Vision, pages 323–340. Springer, 2024. 13 Appendix A Implementation Details of A2A-AffordGen This appendix complements Sec. 3.1 with (i) the distance-maximizing oracle a...