CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects

Bofan Lyu; Boyu Ma; Chuhao Zhou; Gen Li; Jianfei Yang; Jindou Jia; Jingliang Li; Shilin Shan; Tuo An; Xiangyu Chen

arxiv: 2604.02060 · v2 · pith:ZCLXOZNAnew · submitted 2026-04-02 · 💻 cs.CV · cs.RO




Pith reviewed 2026-05-21 09:48 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords 3D affordance grounding · confusable objects · intent-driven grounding · multi-object point clouds · robotic grasping · language-conditioned segmentation · CompassAD benchmark · CompassNet


The pith

CompassNet selects the right object for an implicit task among functionally similar competitors in a 3D scene.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to solve affordance grounding when multiple objects in one scene share the same basic function yet only one fits the unspoken goal expressed in natural language. Prior 3D methods usually test on single isolated objects and often supply the object category name directly, which avoids the real-world problem of choice among look-alike tools. The authors therefore define Intent-Driven Confusable Affordance Grounding, release the CompassAD benchmark of confusing object pairs, and introduce CompassNet whose two modules keep language features from leaking across object boundaries while sharpening geometric differences at both group and point scales. If the approach works, a robot given the instruction “cut the cake” will reliably mask the knife rather than the scissors even when both are present and both can cut.

Core claim

The central claim is that Intent-Driven Confusable Affordance Grounding requires predicting a per-point affordance mask on the single appropriate object inside a multi-object point cloud when the query supplies only implicit natural-language intent. CompassAD supplies the first large-scale testbed for this setting: 30 confusing pairs spanning 16 affordance types, 6,422 compositions, and more than 88,000 query-answer pairs. CompassNet addresses the task by combining Instance-bounded Cross Injection, which confines language-geometry fusion to each object's boundary, with Bi-level Contrastive Refinement, which contrasts target and confusable surfaces at both geometric-group and per-point levels.

What carries the argument

Instance-bounded Cross Injection (ICI) and Bi-level Contrastive Refinement (BCR) modules that together enforce language-geometry alignment inside object boundaries and discriminate target from confusable surfaces at two scales.
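The boundary constraint can be pictured as an attention mask. The sketch below is an illustration only, not the paper's ICI module: it assumes each point carries an instance id and that some language-conditioned affinity score already exists between every pair of points. The function name `instance_bounded_attention` and the toy scores are hypothetical.

```python
import math


def instance_bounded_attention(scores, instance_ids):
    """Row-wise softmax over scores, restricted to same-instance point pairs.

    scores[i][j] is a hypothetical language-conditioned affinity between
    point i and point j; instance_ids[i] is the object point i belongs to.
    """
    n = len(instance_ids)
    weights = []
    for i in range(n):
        # Only points on the same object as point i may contribute.
        allowed = [j for j in range(n) if instance_ids[j] == instance_ids[i]]
        # Subtract the row maximum for numerical stability.
        row_max = max(scores[i][j] for j in allowed)
        exps = {j: math.exp(scores[i][j] - row_max) for j in allowed}
        total = sum(exps.values())
        # Points on other objects receive exactly zero weight.
        weights.append([exps[j] / total if j in exps else 0.0 for j in range(n)])
    return weights


# Toy scene: points 0 and 1 lie on a knife, point 2 on scissors.
# The large cross-object scores (5.0) are ignored because of the mask.
toy_weights = instance_bounded_attention(
    scores=[[0.0, 0.0, 5.0], [0.0, 0.0, 5.0], [5.0, 5.0, 0.0]],
    instance_ids=[0, 0, 1],
)
```

In this toy case the knife points split their weight evenly between themselves and the scissors point attends only to itself, however strong the cross-object affinity is. That is the sense in which leakage across object boundaries is blocked.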

If this is right

The framework reaches state-of-the-art accuracy on both seen and unseen queries in the CompassAD benchmark.
The same model transfers directly to real-world robotic grasping experiments involving cluttered compositions of competing objects.
Affordance prediction no longer requires explicit category names in the language query.
Cross-object semantic leakage is reduced by keeping language features inside instance boundaries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same boundary-constrained injection idea could be tested on other ambiguous grounding tasks such as part segmentation under vague instructions.
Extending the contrastive refinement to include temporal consistency might help in video-based robotic manipulation.
If the benchmark construction pipeline generalizes, similar datasets could be built for affordance confusion in 2D images or depth-only inputs.

Load-bearing premise

The benchmark of 30 confusing object pairs and 6,422 compositions captures enough of the variety found in everyday scenes where objects share affordances but differ in task appropriateness.

What would settle it

Run CompassNet on a fresh collection of confusing object pairs and intent queries that were never seen during benchmark construction and check whether per-point mask accuracy on the correct object drops sharply below the reported SOTA figures.
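A check of this kind needs a per-point score and a way to see whether errors land on the competing object. The following is a minimal sketch under assumptions: masks are binary, intersection over union stands in for whatever metrics the paper actually reports, and the helper names `point_iou` and `wrong_object_rate` are hypothetical.

```python
def point_iou(pred_mask, gt_mask):
    """Intersection over union of two binary per-point masks."""
    inter = sum(1 for p, g in zip(pred_mask, gt_mask) if p and g)
    union = sum(1 for p, g in zip(pred_mask, gt_mask) if p or g)
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def wrong_object_rate(pred_mask, instance_ids, target_id):
    """Fraction of predicted-positive points that fall off the target object."""
    positives = [iid for p, iid in zip(pred_mask, instance_ids) if p]
    if not positives:
        return 0.0
    return sum(1 for iid in positives if iid != target_id) / len(positives)


# Toy scene: points 0-2 belong to the target (id 0), points 3-4 to the competitor (id 1).
pred = [1, 1, 0, 1, 0]
gt = [1, 1, 1, 0, 0]
ids = [0, 0, 0, 1, 1]
iou = point_iou(pred, gt)                         # 2 shared points / 4 in the union
leak = wrong_object_rate(pred, ids, target_id=0)  # 1 of 3 predicted points is on the competitor
```

Reporting the two numbers side by side separates a mask that is merely imprecise on the right object from one that has chosen the wrong object.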

Original abstract

When told to "cut the cake," a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Intent-Driven Confusable Affordance Grounding, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusing multi-object compositions. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 compositions, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object compositions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper carves out a new setting and benchmark for intent-driven affordance in confusing multi-object scenes, but the evidence that the data actually forces implicit intent resolution is still thin.


The main thing to know is that this work formalizes Intent-Driven Confusable Affordance Grounding and releases CompassAD, the first benchmark built around implicit natural-language intent in multi-object point clouds where objects share affordances. That is a concrete step beyond the single-object, explicit-query setups that dominate prior 3D affordance papers. The authors also supply CompassNet with two modules, ICI to limit language-geometry alignment to object boundaries and BCR to add contrast at both group and point levels, which look like direct responses to cross-object leakage and surface similarity. Those choices make sense for the stated problem and give the method a fighting chance on the new data. The robotic grasping deployment is a reasonable next step to show transfer.

The soft spot is exactly the one the stress-test note flags. The central claim that the model picks the right object via implicit intent on both seen and unseen queries depends on the 30 pairs and 6,422 compositions genuinely requiring that reasoning rather than geometric shortcuts or query templating. The abstract gives no numbers on how implicit the intents are, no inter-annotator agreement on task appropriateness, and no ablation that isolates whether ICI and BCR gains survive when surface cues are controlled. Until those checks appear, the SOTA numbers and real-world results are harder to interpret as strong evidence for the intended capability.

This paper is aimed at people working on language-conditioned 3D perception and manipulation in cluttered scenes. A reader who needs a starting benchmark for multi-object affordance conflicts will find the data release and the formalization worth examining. It deserves a serious referee because the new setting is practical and the benchmark opens a direction that prior datasets skipped, even if the current experiments need more diagnostic tests on intent implicitness. I would send it to peer review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper formalizes Intent-Driven Confusable Affordance Grounding as a new 3D task requiring per-point affordance masks on the correct object in multi-object point clouds, conditioned on implicit natural language intent rather than explicit categories. It introduces the CompassAD benchmark (30 confusing object pairs spanning 16 affordance types, 6,422 compositions, 88K+ query-answer pairs) and CompassNet, which adds Instance-bounded Cross Injection (ICI) to constrain language-geometry alignment within object boundaries and Bi-level Contrastive Refinement (BCR) for geometric-group and point-level discrimination. Experiments report state-of-the-art results on seen and unseen query splits, with additional validation via robotic manipulator deployment in real-world confusing multi-object scenes.

Significance. If the central claims hold, the work fills a clear gap in existing 3D affordance methods, which typically evaluate isolated objects with explicit queries. The new benchmark and the two dedicated modules (ICI and BCR) provide concrete tools for handling functionally competing objects under implicit intent, with potential impact on robotic manipulation. The scale of the benchmark and the inclusion of both seen/unseen splits plus real-robot transfer are positive contributions that could seed follow-on research in intent-aware 3D perception.

major comments (2)

[Benchmark construction] Benchmark construction (described in the section introducing CompassAD): no quantitative metrics are reported for intent implicitness, inter-annotator agreement on task appropriateness, or coverage of edge cases such as fine-grained geometric similarity versus culturally varying intents. This is load-bearing because the central claim—that ICI and BCR enable resolution of affordance conflicts via implicit intent rather than surface cues or memorization—rests on the benchmark genuinely forcing such distinctions; without these measures the SOTA gains on unseen queries and robotic transfer results are difficult to interpret.
[Robotic experiments] Robotic deployment experiments: the transfer results to real-world grasping lack detailed quantitative comparison against baselines or ablation of ICI/BCR in the physical setting, and no error analysis is provided for failure cases in confusing compositions. This weakens the claim of effective real-world transfer, as the abstract-only description leaves open whether gains derive from the proposed modules or from other factors.
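One candidate measure of intent implicitness is the share of queries that never name an object category, which is the definition the simulated rebuttal adopts. The sketch below assumes plain-text queries and a known list of category names. The function name `implicitness_rate` and the example queries are hypothetical and not drawn from CompassAD.

```python
import re


def implicitness_rate(queries, category_names):
    """Share of queries that mention none of the given object category names."""
    if not queries:
        return 0.0
    if not category_names:
        # With no names to look for, every query counts as implicit.
        return 1.0
    # Whole-word, case-insensitive match on any category name.
    # Inflected forms such as "knives" are NOT caught by this simple pattern.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in category_names) + r")\b",
        re.IGNORECASE,
    )
    implicit = sum(1 for q in queries if not pattern.search(q))
    return implicit / len(queries)


example_queries = [
    "cut the cake",
    "use the knife to slice the bread",
    "trim the loose thread",
    "open the letter with scissors",
]
rate = implicitness_rate(example_queries, ["knife", "scissors"])
```

A string match like this only detects explicit naming. A query can avoid the category word and still give the answer away through a templated phrase, so a high rate is a necessary signal of implicitness rather than a sufficient one.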

minor comments (2)

Notation for the two modules (ICI and BCR) should be introduced with full names on first use and used consistently thereafter.
Figure captions for example confusing pairs would benefit from explicit call-outs indicating which object is appropriate under the given implicit intent.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the potential of our work on intent-driven affordance grounding for functionally competing objects. We address the major comments point by point below, agreeing to revisions that strengthen the manuscript where the concerns are valid.


Referee: [Benchmark construction] Benchmark construction (described in the section introducing CompassAD): no quantitative metrics are reported for intent implicitness, inter-annotator agreement on task appropriateness, or coverage of edge cases such as fine-grained geometric similarity versus culturally varying intents. This is load-bearing because the central claim—that ICI and BCR enable resolution of affordance conflicts via implicit intent rather than surface cues or memorization—rests on the benchmark genuinely forcing such distinctions; without these measures the SOTA gains on unseen queries and robotic transfer results are difficult to interpret.

Authors: We agree that quantitative validation of benchmark quality would better substantiate the central claims. The manuscript describes the selection of 30 confusing pairs based on shared affordances, the generation of implicit natural language queries, and the resulting 6,422 compositions with 88K+ pairs, along with seen/unseen splits. However, it does not report metrics such as inter-annotator agreement or explicit implicitness scores. In the revised version, we will add a dedicated subsection with inter-annotator agreement (e.g., Fleiss' kappa computed on a sampled subset for intent appropriateness and object choice), a quantitative implicitness metric (percentage of queries omitting explicit object categories), and coverage statistics for edge cases including geometric similarity (via average Chamfer distance between confusable pairs) and a discussion of cultural intent variations in the limitations. These additions will directly support that performance gains reflect intent resolution rather than memorization or surface cues. revision: yes
Referee: [Robotic experiments] Robotic deployment experiments: the transfer results to real-world grasping lack detailed quantitative comparison against baselines or ablation of ICI/BCR in the physical setting, and no error analysis is provided for failure cases in confusing compositions. This weakens the claim of effective real-world transfer, as the abstract-only description leaves open whether gains derive from the proposed modules or from other factors.

Authors: We acknowledge that the robotic deployment section demonstrates successful transfer in real confusing multi-object scenes but provides only high-level confirmation without detailed quantitative baselines, module ablations, or error analysis. The current text reports effective grasping using CompassNet but does not include side-by-side success rates versus baselines or physical ablations of ICI/BCR. In revision, we will expand this section with quantitative tables reporting grasping success rates for CompassNet versus adapted baselines in the physical setup, ablations isolating the contribution of ICI and BCR on real-robot performance, and a categorized error analysis of failure cases (e.g., intent misinterpretation versus geometric confusion) with frequencies and representative examples from the trials. These changes will clarify that observed gains are attributable to the proposed modules. revision: yes
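Two of the benchmark statistics promised in the first response, Fleiss' kappa and the Chamfer distance between confusable objects, are standard and easy to state concretely. The following is a minimal pure-Python sketch, assuming rating counts per item and small point sets given as coordinate tuples. The Chamfer variant shown (sum of mean squared nearest-neighbor distances in both directions) is one common convention and may differ from the one the authors would use.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = number of raters who put item i in category j.

    Every row must sum to the same number of raters (at least 2).
    Undefined (division by zero) when all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the marginal category proportions.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1.0 - p_e)


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance using squared Euclidean nearest-neighbor distances."""
    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    def one_way(src, dst):
        # Mean over src of the squared distance to the closest point in dst.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Three raters choosing between two objects on four queries, in full agreement.
perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
# Two tiny point sets that differ in one point.
cd = chamfer_distance([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 1)])
```

A low Chamfer distance between the two objects of a pair, together with high rater agreement on which object fits the intent, would be the combination that supports the benchmark's claim to force intent resolution rather than geometric shortcuts.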

Circularity Check

0 steps flagged

No significant circularity; new benchmark and modules are independently introduced and experimentally validated


The paper defines a new task (Intent-Driven Confusable Affordance Grounding), constructs a fresh benchmark (CompassAD with 30 pairs, 6,422 compositions, 88K+ pairs), and proposes CompassNet with two novel modules (ICI and BCR). No load-bearing step reduces by construction to prior self-citations, fitted inputs renamed as predictions, or self-definitional loops. Central claims rest on experimental results for seen/unseen queries and robotic transfer rather than tautological derivations. The work is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from overlapping prior work in a circular manner.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only access prevents identification of specific free parameters, axioms, or invented entities; no equations or implementation details are available to audit.

pith-pipeline@v0.9.0 · 5832 in / 1145 out tokens · 51462 ms · 2026-05-21T09:48:10.058181+00:00 · methodology


