CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects
Pith reviewed 2026-05-21 09:48 UTC · model grok-4.3
The pith
CompassNet selects the right object for an implicit task among functionally similar competitors in a 3D scene.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that Intent-Driven Confusable Affordance Grounding requires predicting a per-point affordance mask on the single appropriate object inside a multi-object point cloud when the query supplies only implicit natural-language intent. CompassAD supplies the first large-scale testbed for this setting: 30 confusing pairs spanning 16 affordance types, 6,422 compositions, and more than 88,000 query-answer pairs. CompassNet addresses the task by combining Instance-bounded Cross Injection, which confines language-geometry fusion to each object’s boundary, with Bi-level Contrastive Refinement, which contrasts target and confusable surfaces at both geometric-group and per-point levels.
What carries the argument
Instance-bounded Cross Injection (ICI) and Bi-level Contrastive Refinement (BCR) modules that together enforce language-geometry alignment inside object boundaries and discriminate target from confusable surfaces at two scales.
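The contrastive half of this pairing can be pictured with a generic InfoNCE-style loss applied at two granularities. The sketch below is an illustration under stated assumptions, not CompassNet's BCR formulation: the function names, the temperature, the grouping of consecutive points into fixed-size chunks, and the use of a single text embedding as the anchor are all hypothetical.

```python
import numpy as np

def _normalize(x):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(anchor, positives, negatives, temperature=0.1):
    """Generic InfoNCE-style loss. anchor: (D,), positives: (P, D), negatives: (N, D)."""
    a = _normalize(anchor)
    pos = _normalize(positives) @ a / temperature  # (P,) logits for target features
    neg = _normalize(negatives) @ a / temperature  # (N,) logits for confusable features
    m = np.concatenate([pos, neg]).max()           # shift for numerical stability
    neg_sum = np.exp(neg - m).sum()
    # Per positive: -log( exp(pos) / (exp(pos) + sum_j exp(neg_j)) )
    losses = -(pos - m) + np.log(np.exp(pos - m) + neg_sum)
    return float(losses.mean())

def _pool_groups(feats, group_size):
    # Hypothetical grouping: average consecutive chunks of `group_size` points.
    if len(feats) < group_size:
        raise ValueError("need at least one full group of points")
    n = (len(feats) // group_size) * group_size
    return feats[:n].reshape(-1, group_size, feats.shape[1]).mean(axis=1)

def bi_level_loss(text_emb, target_pts, confusable_pts, group_size=4):
    """Sum of a point-level term and a group-level term (assumed equal weighting)."""
    point_term = info_nce(text_emb, target_pts, confusable_pts)
    group_term = info_nce(
        text_emb,
        _pool_groups(target_pts, group_size),
        _pool_groups(confusable_pts, group_size),
    )
    return point_term + group_term
```

In this toy form the loss is small when target features align with the text embedding and confusable features do not, and it grows when the two roles are swapped, which is the discrimination behaviour the review attributes to BCR.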
If this is right
- The framework reaches state-of-the-art accuracy on both seen and unseen queries in the CompassAD benchmark.
- The same model transfers directly to real-world robotic grasping experiments involving cluttered compositions of competing objects.
- Affordance prediction no longer requires explicit category names in the language query.
- Cross-object semantic leakage is reduced by keeping language features inside instance boundaries.
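To make the leakage point concrete, one simple way to keep information inside instance boundaries is to mask attention by instance ID. This is a minimal sketch and an assumption on our part, not the paper's ICI module: the function name, the use of plain dot-product self-attention, and the premise that point features are already language-conditioned are all hypothetical.

```python
import numpy as np

def instance_bounded_attention(point_feats, instance_ids):
    """Attention over points where keys outside the query point's instance are masked.

    point_feats: (N, D) per-point features (assumed already language-conditioned).
    instance_ids: (N,) integer instance label for each point.
    Returns the (N, D) updated features and the (N, N) attention weights.
    """
    dim = point_feats.shape[1]
    scores = point_feats @ point_feats.T / np.sqrt(dim)
    # Allow attention only between points that share an instance label.
    same_instance = instance_ids[:, None] == instance_ids[None, :]
    scores = np.where(same_instance, scores, -np.inf)
    # Row-wise softmax; every row keeps at least its own (finite) diagonal entry.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ point_feats, weights
```

Because masked entries receive zero weight, changing the features of one object cannot alter the output for points on another object, which is the property the bullet describes.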
Where Pith is reading between the lines
- The same boundary-constrained injection idea could be tested on other ambiguous grounding tasks such as part segmentation under vague instructions.
- Extending the contrastive refinement to include temporal consistency might help in video-based robotic manipulation.
- If the benchmark construction pipeline generalizes, similar datasets could be built for affordance confusion in 2D images or depth-only inputs.
Load-bearing premise
The benchmark of 30 confusing object pairs and 6,422 compositions captures enough of the variety found in everyday scenes where objects share affordances but differ in task appropriateness.
What would settle it
Run CompassNet on a fresh collection of confusing object pairs and intent queries that were never seen during benchmark construction and check whether per-point mask accuracy on the correct object drops sharply below the reported SOTA figures.
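A check of this kind could be scored with something like the sketch below. The metrics actually reported in the paper are not stated in this review, so both functions, the 0.5 threshold, and the argument names are assumptions chosen for illustration: a per-point IoU against the ground-truth mask, plus the share of predicted points that land on the wrong object.

```python
import numpy as np

def point_iou(pred_scores, gt_mask, threshold=0.5):
    """Intersection-over-union between a thresholded prediction and a binary mask."""
    pred = np.asarray(pred_scores) >= threshold
    gt = np.asarray(gt_mask).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match by convention here
    return float(np.logical_and(pred, gt).sum() / union)

def wrong_object_rate(pred_scores, instance_ids, target_id, threshold=0.5):
    """Fraction of predicted-positive points that fall outside the target instance."""
    pred = np.asarray(pred_scores) >= threshold
    if pred.sum() == 0:
        return 0.0  # nothing predicted, so nothing landed on the wrong object
    return float((np.asarray(instance_ids)[pred] != target_id).mean())
```

Reporting the second number alongside IoU separates two failure modes: a sloppy mask on the right object versus a clean mask on the confusable one.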
Original abstract
When told to "cut the cake," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Intent-Driven Confusable Affordance Grounding, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusing multi-object compositions. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 compositions, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object compositions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes Intent-Driven Confusable Affordance Grounding as a new 3D task requiring per-point affordance masks on the correct object in multi-object point clouds, conditioned on implicit natural language intent rather than explicit categories. It introduces the CompassAD benchmark (30 confusing object pairs spanning 16 affordance types, 6,422 compositions, 88K+ query-answer pairs) and CompassNet, which adds Instance-bounded Cross Injection (ICI) to constrain language-geometry alignment within object boundaries and Bi-level Contrastive Refinement (BCR) for geometric-group and point-level discrimination. Experiments report state-of-the-art results on seen and unseen query splits, with additional validation via robotic manipulator deployment in real-world confusing multi-object scenes.
Significance. If the central claims hold, the work fills a clear gap in existing 3D affordance methods, which typically evaluate isolated objects with explicit queries. The new benchmark and the two dedicated modules (ICI and BCR) provide concrete tools for handling functionally competing objects under implicit intent, with potential impact on robotic manipulation. The scale of the benchmark and the inclusion of both seen/unseen splits plus real-robot transfer are positive contributions that could seed follow-on research in intent-aware 3D perception.
major comments (2)
- [Benchmark construction] Benchmark construction (described in the section introducing CompassAD): no quantitative metrics are reported for intent implicitness, inter-annotator agreement on task appropriateness, or coverage of edge cases such as fine-grained geometric similarity versus culturally varying intents. This is load-bearing because the central claim—that ICI and BCR enable resolution of affordance conflicts via implicit intent rather than surface cues or memorization—rests on the benchmark genuinely forcing such distinctions; without these measures the SOTA gains on unseen queries and robotic transfer results are difficult to interpret.
- [Robotic experiments] Robotic deployment experiments: the transfer results to real-world grasping lack detailed quantitative comparison against baselines or ablation of ICI/BCR in the physical setting, and no error analysis is provided for failure cases in confusing compositions. This weakens the claim of effective real-world transfer, as the abstract-only description leaves open whether gains derive from the proposed modules or from other factors.
minor comments (2)
- Notation for the two modules (ICI and BCR) should be introduced with full names on first use and used consistently thereafter.
- Figure captions for example confusing pairs would benefit from explicit call-outs indicating which object is appropriate under the given implicit intent.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for highlighting the potential of our work on intent-driven affordance grounding for functionally competing objects. We address the major comments point by point below, agreeing to revisions that strengthen the manuscript where the concerns are valid.
Point-by-point responses
- Referee: [Benchmark construction] Benchmark construction (described in the section introducing CompassAD): no quantitative metrics are reported for intent implicitness, inter-annotator agreement on task appropriateness, or coverage of edge cases such as fine-grained geometric similarity versus culturally varying intents. This is load-bearing because the central claim, that ICI and BCR enable resolution of affordance conflicts via implicit intent rather than surface cues or memorization, rests on the benchmark genuinely forcing such distinctions; without these measures the SOTA gains on unseen queries and robotic transfer results are difficult to interpret.
Authors: We agree that quantitative validation of benchmark quality would better substantiate the central claims. The manuscript describes the selection of 30 confusing pairs based on shared affordances, the generation of implicit natural language queries, and the resulting 6,422 compositions with 88K+ pairs, along with seen/unseen splits. However, it does not report metrics such as inter-annotator agreement or explicit implicitness scores. In the revised version, we will add a dedicated subsection with inter-annotator agreement (e.g., Fleiss' kappa computed on a sampled subset for intent appropriateness and object choice), a quantitative implicitness metric (percentage of queries omitting explicit object categories), and coverage statistics for edge cases including geometric similarity (via average Chamfer distance between confusable pairs) and a discussion of cultural intent variations in the limitations. These additions will directly support the claim that performance gains reflect intent resolution rather than memorization or surface cues. revision: yes
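The three statistics named in this response are standard enough to sketch. The code below is a hedged illustration of how they might be computed, not the authors' pipeline: the function names and input layouts are assumptions, the implicitness check is a naive whole-word match that would miss inflected forms such as "knives", and the Chamfer variant used here is the sum of the two mean nearest-neighbour Euclidean distances (other variants use squared distances).

```python
import re
import numpy as np

def fleiss_kappa(counts):
    """counts[i][j] = number of raters assigning item i to category j.
    Every row must sum to the same number of raters (at least 2)."""
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    if n_raters < 2 or not np.allclose(counts.sum(axis=1), n_raters):
        raise ValueError("each item needs the same number of raters, at least 2")
    # Observed agreement per item, then averaged over items.
    p_item = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_item.mean()
    # Chance agreement from the overall category proportions.
    p_cat = counts.sum(axis=0) / (n_items * n_raters)
    p_exp = np.square(p_cat).sum()
    if np.isclose(p_exp, 1.0):
        return float("nan")  # undefined when every rating falls in one category
    return float((p_bar - p_exp) / (1.0 - p_exp))

def implicit_query_rate(queries, category_names):
    """Share of queries that mention none of the category names as whole words."""
    if not queries or not category_names:
        raise ValueError("queries and category_names must be non-empty")
    names = "|".join(re.escape(name) for name in category_names)
    pattern = re.compile(r"\b(" + names + r")\b", re.IGNORECASE)
    implicit = [q for q in queries if not pattern.search(q)]
    return len(implicit) / len(queries)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```

The pairwise distance matrix makes `chamfer_distance` quadratic in memory, so for dense point clouds a nearest-neighbour index would be the more practical choice.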
- Referee: [Robotic experiments] Robotic deployment experiments: the transfer results to real-world grasping lack detailed quantitative comparison against baselines or ablation of ICI/BCR in the physical setting, and no error analysis is provided for failure cases in confusing compositions. This weakens the claim of effective real-world transfer, as the abstract-only description leaves open whether gains derive from the proposed modules or from other factors.
Authors: We acknowledge that the robotic deployment section demonstrates successful transfer in real confusing multi-object scenes but provides only high-level confirmation without detailed quantitative baselines, module ablations, or error analysis. The current text reports effective grasping using CompassNet but does not include side-by-side success rates versus baselines or physical ablations of ICI/BCR. In revision, we will expand this section with quantitative tables reporting grasping success rates for CompassNet versus adapted baselines in the physical setup, ablations isolating the contribution of ICI and BCR on real-robot performance, and a categorized error analysis of failure cases (e.g., intent misinterpretation versus geometric confusion) with frequencies and representative examples from the trials. These changes will clarify that observed gains are attributable to the proposed modules. revision: yes
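The promised tables reduce to a simple tally over trial logs. As a rough sketch, assuming each trial is recorded as a `(method, succeeded, failure_category)` tuple (a hypothetical format, not one described in the paper), per-method success rates and failure-category frequencies could be gathered like this:

```python
from collections import Counter, defaultdict

def summarize_trials(trials):
    """Aggregate grasp trials into per-method success rates and failure counts.

    trials: iterable of (method, succeeded, failure_category) tuples, where
    failure_category is a string for failed trials and may be None otherwise.
    """
    attempts = Counter()
    successes = Counter()
    failures = defaultdict(Counter)
    for method, succeeded, failure_category in trials:
        attempts[method] += 1
        if succeeded:
            successes[method] += 1
        else:
            # Failed trials without a label are kept visible rather than dropped.
            failures[method][failure_category or "uncategorized"] += 1
    return {
        method: {
            "success_rate": successes[method] / attempts[method],
            "failures": dict(failures[method]),
        }
        for method in attempts
    }
```

Running the same tally for the full model, each ablated variant, and each baseline on identical scene layouts would give the side-by-side comparison the referee asks for.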
Circularity Check
No significant circularity; new benchmark and modules are independently introduced and experimentally validated
full rationale
The paper defines a new task (Intent-Driven Confusable Affordance Grounding), constructs a fresh benchmark (CompassAD with 30 pairs, 6,422 compositions, 88K+ pairs), and proposes CompassNet with two novel modules (ICI and BCR). No load-bearing step reduces by construction to prior self-citations, fitted inputs renamed as predictions, or self-definitional loops. Central claims rest on experimental results for seen/unseen queries and robotic transfer rather than tautological derivations. The work is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from overlapping prior work in a circular manner.