pith. machine review for the scientific record.

arxiv: 2605.11616 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 Lean theorem links

Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D affordance grounding · functional affordances · cross-scene memory · scene graph · vision-language models · training-free method · SceneFun3D dataset

The pith

A reusable memory bank of source-scene affordance images plus an in-scene spatial graph lets a frozen vision-language model localize precise functional regions in new 3D scenes without fine-tuning or target labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that functional affordance grounding in 3D can be achieved by recalling geometry from past scenes and tracking spatial relations within the current one, allowing a pre-trained model to handle small, ambiguous interaction points that text prompts overlook. This would matter for building agents that interact reliably with real-world objects, for example by pulling handles or pressing buttons, especially when multiple similar items appear in a scene. The method builds a cross-scene memory bank of RGB images with affordance overlays and an in-scene scene graph for resolving references that hinge on distance or order. If the approach holds, it demonstrates that memory can replace scene-specific training data in grounding tasks.
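To make that description concrete, the sketch below shows one way the cross-scene ingredient could be represented in code. All class and field names (MemoryEntry, CrossSceneMemoryBank, and so on) are hypothetical illustrations inferred from the abstract, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    """One cross-scene memory item: a source-scene view plus its affordance overlay."""
    category: str      # e.g. "cabinet_handle" (category-level key, per the abstract)
    rgb_path: str      # path to the source-scene RGB image
    overlay_path: str  # same view with the affordance region rendered as an overlay
    embedding: list    # precomputed image embedding used for recall

class CrossSceneMemoryBank:
    """Category-keyed store of annotated source-scene examples (illustrative sketch)."""
    def __init__(self):
        self._bank = {}

    def add(self, entry: MemoryEntry) -> None:
        self._bank.setdefault(entry.category, []).append(entry)

    def candidates(self, category: str) -> list:
        # All stored examples for the queried category; the paper recalls only
        # the "most informative" subset of these at query time.
        return self._bank.get(category, [])
```

Nothing target-scene-specific appears in this structure by design: the bank is populated once from source scenes and reused, which is what makes the pipeline training-free.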

Core claim

AFFORDMEM grounds 3D functional affordances by maintaining a category-level memory bank of RGB images with affordance regions from source scenes to guide a frozen VLM toward small operable subregions, and by organizing candidate instances into a structured scene graph for in-scene spatial memory that resolves references over distant or unobserved objects.

What carries the argument

Dual-level memory: cross-scene affordance memory bank of annotated source images recalled at query time, and in-scene spatial memory as a scene graph for spatial relations.
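The in-scene half is described only as a structured scene graph over candidate instances and their 3D relations. The sketch below shows one plausible way such a graph could resolve an ordinal reference such as 'the second handle from the top'; the node fields, the z-up convention, and the sorting rule are assumptions for illustration, not the paper's construction.

```python
from dataclasses import dataclass

@dataclass
class InstanceNode:
    instance_id: int
    category: str      # e.g. "handle"
    centroid: tuple    # (x, y, z) centroid in scene coordinates

@dataclass
class SpatialEdge:
    src: int
    dst: int
    relation: str      # e.g. "above", "near", "left_of"

class InSceneSpatialMemory:
    """Sketch of a scene graph used to resolve spatial qualifiers over candidates."""
    def __init__(self, nodes, edges):
        self.nodes = nodes
        self.edges = edges

    def nth_from_top(self, category: str, n: int) -> InstanceNode:
        # "The second handle from the top" reduces to sorting same-category
        # candidates along the vertical axis (assumed z-up) and indexing.
        matches = [v for v in self.nodes if v.category == category]
        matches.sort(key=lambda v: v.centroid[2], reverse=True)
        return matches[n - 1]

# Three stacked handles; "the second handle from the top" picks instance 1.
graph = InSceneSpatialMemory(
    nodes=[InstanceNode(0, "handle", (0.2, 1.0, 1.6)),
           InstanceNode(1, "handle", (0.2, 1.0, 1.2)),
           InstanceNode(2, "handle", (0.2, 1.0, 0.8))],
    edges=[],
)
assert graph.nth_from_top("handle", 2).instance_id == 1
```

The point the paper makes is that these candidates can include distant or currently unobserved instances, which is what a persistent graph buys over per-view prompting.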

If this is right

  • Cross-scene affordance memory improves fine-grained localization of small actionable regions that text-only prompting misses.
  • In-scene spatial memory delivers larger gains on queries involving spatial qualifiers such as 'the second handle from the top.'
  • The full method raises AP50 by 3.23 on Split 0 and 3.7 on Split 1 of SceneFun3D over prior training-free baselines (the AP50 criterion is sketched after this list).
  • No model fine-tuning or target-scene annotation is required since the memory bank is built only from source scenes.
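For context on the headline numbers: AP50 counts a prediction as correct only when its intersection-over-union with the ground-truth region is at least 0.5, then averages precision over the ranked predictions. The snippet below sketches just the IoU test; it is a generic illustration, not the SceneFun3D evaluation code.

```python
import numpy as np

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks (pixels or 3D points)."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 0.0

def counts_toward_ap50(pred_mask: np.ndarray, gt_mask: np.ndarray) -> bool:
    # A predicted affordance region is a true positive at the AP50 threshold
    # only if it overlaps the ground truth with IoU >= 0.5.
    return iou(pred_mask, gt_mask) >= 0.5
```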

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the memory bank with more varied source scenes could further improve robustness to novel layouts.
  • The scene-graph approach might transfer to other tasks requiring disambiguation of repeated object parts across time or views.
  • Online updating of the in-scene memory could support agents moving through changing environments.

Load-bearing premise

Examples stored from source scenes will match the visual and spatial patterns of affordances in completely unseen target scenes well enough to guide the VLM effectively.

What would settle it

Testing the method on a new dataset with object categories absent from the source memory bank and measuring whether AP50 falls back to or below the level of a text-only VLM baseline.

Figures

Figures reproduced from arXiv: 2605.11616 by Jingyi He, Qirui Wang, Shijie Li, Xulei Yang, Yining Pan.

Figure 3. Qualitative comparison on four representative cases. Each column shows one affordance grounding example. The first row shows the ground-truth target regions, the second row shows Fun3DU failure cases where no corresponding target mask is produced, and the third row shows predictions from AFFORDMEM. AFFORDMEM recovers compact affordance regions for small and visually ambiguous targets that are missed by the…
Original abstract

Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes candidate instances and their 3D spatial relations into a structured scene graph, enabling the language model to resolve references over distant or currently unobserved candidates such as "the second handle from the top." AFFORDMEM requires no model fine-tuning and no target-scene annotation, using a reusable memory bank built from source scenes. On SceneFun3D, our method improves AP50 over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1. Ablation studies support complementary benefits: cross-scene affordance memory improves fine-grained localization, while in-scene spatial memory provides the larger gain on spatially qualified queries. The project homepage is available at the project page.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes AFFORDMEM, a training-free framework for 3D functional affordance grounding in unseen scenes. It maintains a reusable cross-scene category-level memory bank of RGB images with rendered affordance overlays to guide a frozen VLM toward small operable subregions, and builds an in-scene spatial memory as a structured scene graph to resolve references over distant or unobserved instances. On SceneFun3D, the method reports AP50 gains of 3.23 on Split 0 and 3.7 on Split 1 over prior training-free SOTA, with ablations attributing complementary benefits to the two memory components.

Significance. If the empirical results hold under a fully specified protocol, the work demonstrates that explicit cross-scene and in-scene memory mechanisms can measurably improve zero-shot fine-grained localization with frozen VLMs, without target-scene fine-tuning or annotation. The reusable memory bank and the ablation-supported split between localization and spatial-reference benefits constitute a clear, practical contribution to training-free 3D vision-language pipelines. The absence of invented parameters or self-referential derivations is a strength.

major comments (2)
  1. [Results section, Table 2 and associated text] The reported AP50 improvements of 3.23 and 3.7 are given as single-point estimates without error bars, standard deviations, or the number of independent runs. Because the central claim is an empirical benchmark advance on a fixed split, this omission makes it impossible to judge whether the gains exceed typical run-to-run variance and therefore weakens the quantitative conclusion.
  2. [Section 3.1, cross-scene memory retrieval] The procedure for selecting the 'most informative examples' from the memory bank is described at a high level but does not specify the exact similarity metric, top-k value, or rendering overlay format used at inference time. This detail is load-bearing for the claim that cross-scene memory supplies fine-grained localization cues that text-only prompting misses.
minor comments (3)
  1. [Abstract and Section 4.1] The abstract and Section 4.1 refer to 'prior training-free state of the art' without an explicit citation or short description of the strongest baseline method; adding this reference would improve reproducibility.
  2. [Figure 3] Figure 3 (qualitative examples) would benefit from an additional row or inset showing the exact memory-bank image that was retrieved for each query, to illustrate the cross-scene guidance mechanism.
  3. [Section 3.2] Notation for the scene-graph nodes and edges is introduced in Section 3.2 but never summarized in a single table; a compact legend would reduce reader effort.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work and for the constructive comments. We address each major comment point by point below. We agree that both points identify areas where the manuscript can be strengthened and will revise accordingly.

Point-by-point responses
  1. Referee: [Results section, Table 2 and associated text] The reported AP50 improvements of 3.23 and 3.7 are given as single-point estimates without error bars, standard deviations, or the number of independent runs. Because the central claim is an empirical benchmark advance on a fixed split, this omission makes it impossible to judge whether the gains exceed typical run-to-run variance and therefore weakens the quantitative conclusion.

    Authors: We thank the referee for this observation. AFFORDMEM is a training-free method whose core components (cross-scene memory retrieval via fixed embeddings and in-scene scene-graph construction) are fully deterministic; we employ greedy decoding in the frozen VLM, so run-to-run variance is not expected and a single execution on the fixed splits is reproducible. To address the concern and improve the presentation, we will add an explicit statement in the revised results section clarifying the deterministic nature of the pipeline and confirming that the reported AP50 figures come from a single, fully specified run. revision: partial

  2. Referee: [Section 3.1, cross-scene memory retrieval] The procedure for selecting the 'most informative examples' from the memory bank is described at a high level but does not specify the exact similarity metric, top-k value, or rendering overlay format used at inference time. This detail is load-bearing for the claim that cross-scene memory supplies fine-grained localization cues that text-only prompting misses.

    Authors: We agree that these implementation details are necessary for reproducibility and for readers to understand how the memory bank supplies the claimed localization cues. We will revise Section 3.1 to specify the exact similarity metric, the top-k value, and the rendering overlay format used at inference time. We will also consider adding a short algorithm box or pseudocode to make the retrieval procedure fully transparent. revision: yes
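To illustrate the kind of specification the referee asks for and the authors promise to add, the sketch below shows one common recipe for recalling the 'most informative examples': cosine similarity between a query embedding and the stored memory embeddings, followed by a top-k cut. The similarity metric, the value of k, and the embedding model are assumptions for illustration; the paper's actual choices are exactly what the revised Section 3.1 is meant to pin down.

```python
import numpy as np

def top_k_recall(query_emb: np.ndarray, memory_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Indices of the k memory entries most similar to the query (illustrative).

    query_emb:   (d,)   embedding of the current query view or instruction.
    memory_embs: (n, d) embeddings of the stored source-scene overlay images.
    Cosine similarity and k=3 are placeholder choices, not the paper's values.
    """
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    m = memory_embs / (np.linalg.norm(memory_embs, axis=1, keepdims=True) + 1e-8)
    scores = m @ q                      # cosine similarities, shape (n,)
    return np.argsort(-scores)[:k]      # indices of the top-k entries
```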

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical method (AFFORDMEM) for 3D affordance grounding via reusable cross-scene memory banks and in-scene scene graphs, evaluated on SceneFun3D with reported AP50 gains and ablations. There is no derivation chain, no fitted parameter renamed as a prediction, and no self-referential definition; the memory construction is independent of target scenes, and the results are benchmark measurements rather than quantities forced by construction or self-citation loops. The approach is judged against an external benchmark rather than against quantities it defines for itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard assumptions about VLM behavior under visual prompting and the utility of scene graphs for spatial reference resolution; no new free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption: Frozen VLMs can be effectively steered toward small operable regions by retrieved example images with affordance overlays.
    Central to the cross-scene memory component described in the abstract; a rough visual-prompting sketch follows this list.
  • domain assumption: A structured scene graph of 3D object relations enables resolution of spatial references across unobserved instances.
    Underpins the in-scene spatial memory mechanism.
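As a rough picture of what steering a frozen VLM with retrieved overlay exemplars can look like, the sketch below assembles a generic image-plus-text message from recalled memory entries. The chat-message schema, field names, and wording are assumptions for illustration, not the paper's prompt or any specific model's API.

```python
def build_visual_prompt(instruction: str, recalled_overlays: list, query_view: str) -> list:
    """Assemble a generic multimodal prompt for a frozen VLM (illustrative sketch).

    recalled_overlays: paths to memory-bank images with affordance overlays.
    query_view:        path to the current target-scene view.
    """
    content = [{"type": "image", "path": p} for p in recalled_overlays]
    content.append({"type": "text",
                    "text": "The highlighted regions above mark where this "
                            "affordance is operated on similar objects."})
    content.append({"type": "image", "path": query_view})
    content.append({"type": "text",
                    "text": f"Instruction: {instruction}. "
                            "Localize the corresponding operable region in the last image."})
    return [{"role": "user", "content": content}]
```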

pith-pipeline@v0.9.0 · 5599 in / 1418 out tokens · 34061 ms · 2026-05-13T01:51:18.489084+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

    Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...