Recognition: 2 theorem links
Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances
Pith reviewed 2026-05-13 01:51 UTC · model grok-4.3
The pith
A reusable memory bank of source-scene affordance images plus an in-scene spatial graph lets a frozen vision-language model localize precise functional regions in new 3D scenes without fine-tuning or target labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AFFORDMEM grounds 3D functional affordances by maintaining a category-level memory bank of RGB images with affordance regions from source scenes to guide a frozen VLM toward small operable subregions, and by organizing candidate instances into a structured scene graph for in-scene spatial memory that resolves references over distant or unobserved objects.
What carries the argument
Dual-level memory: cross-scene affordance memory bank of annotated source images recalled at query time, and in-scene spatial memory as a scene graph for spatial relations.
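To make the dual-level design concrete, here is a minimal Python sketch of how the two memories might be organized. The class names, fields, and retrieval signatures are assumptions for exposition, not the paper's actual interfaces.

```python
# Illustrative sketch of the two memory levels described above.
# Class names, fields, and signatures are assumptions for exposition,
# not the interfaces used in the paper.
from dataclasses import dataclass, field


@dataclass
class AffordanceExemplar:
    """A source-scene RGB image with its affordance region rendered as an overlay."""
    category: str            # e.g. "cabinet_handle"
    image_path: str          # rendered RGB with the overlay baked in
    embedding: list[float]   # visual embedding used for recall


@dataclass
class CrossSceneMemory:
    """Category-level bank of annotated source-scene exemplars, reusable across target scenes."""
    bank: dict[str, list[AffordanceExemplar]] = field(default_factory=dict)

    def add(self, exemplar: AffordanceExemplar) -> None:
        self.bank.setdefault(exemplar.category, []).append(exemplar)

    def recall(self, category: str, k: int = 3) -> list[AffordanceExemplar]:
        # The paper recalls the "most informative" examples; this stub simply
        # returns the first k stored exemplars for the category.
        return self.bank.get(category, [])[:k]


@dataclass
class InSceneMemory:
    """Scene graph of candidate instances and their 3D spatial relations."""
    nodes: dict[int, dict] = field(default_factory=dict)             # id -> {"label", "center_xyz"}
    edges: list[tuple[int, str, int]] = field(default_factory=list)  # (src_id, relation, dst_id)

    def add_instance(self, node_id: int, label: str, center_xyz: tuple) -> None:
        self.nodes[node_id] = {"label": label, "center_xyz": center_xyz}
```

Presumably the recalled exemplars are supplied to the frozen VLM as visual context while the scene graph is serialized into text for the language model; the paper's exact prompting format is not reproduced here.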
If this is right
- Cross-scene affordance memory improves fine-grained localization of small actionable regions that text-only prompting misses.
- In-scene spatial memory delivers larger gains on queries involving spatial qualifiers such as 'the second handle from the top.'
- The full method raises AP50 by 3.23 on Split 0 and 3.7 on Split 1 of SceneFun3D over prior training-free baselines.
- No model fine-tuning or target-scene annotation is required since the memory bank is built only from source scenes.
Where Pith is reading between the lines
- Extending the memory bank with more varied source scenes could further improve robustness to novel layouts.
- The scene-graph approach might transfer to other tasks requiring disambiguation of repeated object parts across time or views.
- Online updating of the in-scene memory could support agents moving through changing environments.
Load-bearing premise
Examples stored from source scenes will match the visual and spatial patterns of affordances in completely unseen target scenes well enough to guide the VLM effectively.
What would settle it
Testing the method on a new dataset with object categories absent from the source memory bank and measuring whether AP50 falls back to or below the level of a text-only VLM baseline.
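A sketch of how such a category-transfer test could be scored is below. The simplified AP50 (hit rate at IoU ≥ 0.5 over boolean point masks) and the method/baseline callables are stand-ins; the official SceneFun3D metric uses confidence-ranked average precision.

```python
# Sketch of the settling experiment: evaluate on object categories absent from
# the source memory bank and compare against a text-only VLM baseline.
# Function names and the AP50 helper are placeholders, not an existing harness.
# Masks are assumed to be numpy boolean arrays over the scene's points.
def iou_3d(pred_mask, gt_mask) -> float:
    """IoU between two boolean point-level masks defined over the same point cloud."""
    inter = (pred_mask & gt_mask).sum()
    union = (pred_mask | gt_mask).sum()
    return float(inter) / float(union) if union else 0.0


def ap50(predictions, ground_truths) -> float:
    """Fraction of queries whose prediction reaches IoU >= 0.5 (simplified AP50 proxy)."""
    hits = sum(1 for pred, gt in zip(predictions, ground_truths) if iou_3d(pred, gt) >= 0.5)
    return hits / max(len(ground_truths), 1)


def category_transfer_test(method, text_only_baseline, unseen_category_queries):
    """Compare the memory-based method to a text-only VLM on unseen categories."""
    gts = [q.gt_mask for q in unseen_category_queries]
    method_ap = ap50([method(q) for q in unseen_category_queries], gts)
    baseline_ap = ap50([text_only_baseline(q) for q in unseen_category_queries], gts)
    # The load-bearing premise fails if the memory-based method no longer beats the baseline.
    return method_ap, baseline_ap
```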
Original abstract
Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes candidate instances and their 3D spatial relations into a structured scene graph, enabling the language model to resolve references over distant or currently unobserved candidates such as "the second handle from the top." AFFORDMEM requires no model fine-tuning and no target-scene annotation, using a reusable memory bank built from source scenes. On SceneFun3D, our method improves AP50 over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1. Ablation studies support complementary benefits: cross-scene affordance memory improves fine-grained localization, while in-scene spatial memory provides the larger gain on spatially qualified queries. The project homepage is available at the project page.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes AFFORDMEM, a training-free framework for 3D functional affordance grounding in unseen scenes. It maintains a reusable cross-scene category-level memory bank of RGB images with rendered affordance overlays to guide a frozen VLM toward small operable subregions, and builds an in-scene spatial memory as a structured scene graph to resolve references over distant or unobserved instances. On SceneFun3D, the method reports AP50 gains of 3.23 on Split 0 and 3.7 on Split 1 over prior training-free SOTA, with ablations attributing complementary benefits to the two memory components.
Significance. If the empirical results hold under a fully specified protocol, the work demonstrates that explicit cross-scene and in-scene memory mechanisms can measurably improve zero-shot fine-grained localization with frozen VLMs, without target-scene fine-tuning or annotation. The reusable memory bank and the ablation-supported split between localization and spatial-reference benefits constitute a clear, practical contribution to training-free 3D vision-language pipelines. The absence of invented parameters or self-referential derivations is a strength.
major comments (2)
- [Results section (Table 2)] Results section (Table 2 and associated text): the reported AP50 improvements of 3.23 and 3.7 are given as single-point estimates without error bars, standard deviations, or the number of independent runs. Because the central claim is an empirical benchmark advance on a fixed split, this omission makes it impossible to judge whether the gains exceed typical run-to-run variance and therefore weakens the strength of the quantitative conclusion.
- [Section 3.1] Section 3.1 (cross-scene memory retrieval): the procedure for selecting the 'most informative examples' from the memory bank is described at a high level but does not specify the exact similarity metric, top-k value, or rendering overlay format used at inference time. This detail is load-bearing for the claim that cross-scene memory supplies fine-grained localization cues that text-only prompting misses.
minor comments (3)
- [Abstract and Section 4.1] The abstract and Section 4.1 refer to 'prior training-free state of the art' without an explicit citation or short description of the strongest baseline method; adding this reference would improve reproducibility.
- [Figure 3] Figure 3 (qualitative examples) would benefit from an additional row or inset showing the exact memory-bank image that was retrieved for each query, to illustrate the cross-scene guidance mechanism.
- [Section 3.2] Notation for the scene-graph nodes and edges is introduced in Section 3.2 but never summarized in a single table; a compact legend would reduce reader effort.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and for the constructive comments. We address each major comment point by point below. We agree that both points identify areas where the manuscript can be strengthened and will revise accordingly.
Point-by-point responses
-
Referee: [Results section (Table 2)] Results section (Table 2 and associated text): the reported AP50 improvements of 3.23 and 3.7 are given as single-point estimates without error bars, standard deviations, or the number of independent runs. Because the central claim is an empirical benchmark advance on a fixed split, this omission makes it impossible to judge whether the gains exceed typical run-to-run variance and therefore weakens the strength of the quantitative conclusion.
Authors: We thank the referee for this observation. AFFORDMEM is a training-free method whose core components (cross-scene memory retrieval via fixed embeddings and in-scene scene-graph construction) are fully deterministic; we employ greedy decoding in the frozen VLM, so run-to-run variance is not expected and a single execution on the fixed splits is reproducible. To address the concern and improve the presentation, we will add an explicit statement in the revised results section clarifying the deterministic nature of the pipeline and confirming that the reported AP50 figures come from a single, fully specified run. revision: partial
-
Referee: [Section 3.1] Section 3.1 (cross-scene memory retrieval): the procedure for selecting the 'most informative examples' from the memory bank is described at a high level but does not specify the exact similarity metric, top-k value, or rendering overlay format used at inference time. This detail is load-bearing for the claim that cross-scene memory supplies fine-grained localization cues that text-only prompting misses.
Authors: We agree that these implementation details are necessary for reproducibility and for readers to understand how the memory bank supplies the claimed localization cues. We will revise Section 3.1 to specify the exact similarity metric, the top-k value, and the rendering overlay format used at inference time. We will also consider adding a short algorithm box or pseudocode to make the retrieval procedure fully transparent. revision: yes
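For illustration only, here is one plausible instantiation of the retrieval step the referee asks to see spelled out. Cosine similarity over precomputed image embeddings and k = 3 are assumptions made in this sketch, not the paper's reported choices.

```python
# One plausible cross-scene retrieval step. The paper does not state its
# similarity metric, top-k value, or overlay format; cosine similarity over
# CLIP-style embeddings and k = 3 are assumptions for illustration.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def recall_top_k(query_embedding: np.ndarray,
                 bank_embeddings: list[np.ndarray],
                 k: int = 3) -> list[int]:
    """Return indices of the k memory-bank images most similar to the query view."""
    scores = [cosine_similarity(query_embedding, e) for e in bank_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

The k retrieved overlay images would presumably then be placed into the VLM prompt alongside the query view.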
Circularity Check
No significant circularity
Full rationale
The paper presents an empirical method (AFFORDMEM) for 3D affordance grounding via reusable cross-scene memory banks and in-scene scene graphs, evaluated on SceneFun3D with reported AP50 gains and ablations. There is no derivation chain, no fitted parameters renamed as predictions, and no self-referential definitions; the memory construction is independent of target scenes, and the results are benchmark measurements rather than quantities forced by construction or self-citation loops. The approach is evaluated against an external benchmark rather than against quantities of its own making.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Frozen VLMs can be effectively steered toward small operable regions by retrieved example images with affordance overlays
- domain assumption A structured scene graph of 3D object relations enables resolution of spatial references across unobserved instances
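A toy illustration of the second assumption: with instance centroids stored in a scene graph, an ordinal reference like "the second handle from the top" reduces to sorting one coordinate. The z-up convention, node schema, and hard-coded query are assumptions for this example only, not the paper's parser.

```python
# Toy illustration: resolving "the second handle from the top" by sorting
# scene-graph instance centroids along the vertical (z) axis. The coordinate
# convention and example values are assumptions made here for exposition.
def resolve_ordinal_from_top(nodes: dict[int, dict], label: str, ordinal: int) -> int:
    """Return the node id of the `ordinal`-th instance of `label`, counted top-down."""
    candidates = [(nid, n["center_xyz"][2]) for nid, n in nodes.items() if n["label"] == label]
    candidates.sort(key=lambda item: item[1], reverse=True)  # highest z first
    return candidates[ordinal - 1][0]


handles = {
    1: {"label": "handle", "center_xyz": (0.4, 1.2, 1.50)},
    2: {"label": "handle", "center_xyz": (0.4, 1.2, 1.10)},
    3: {"label": "handle", "center_xyz": (0.4, 1.2, 0.70)},
}
assert resolve_ordinal_from_top(handles, "handle", 2) == 2  # second handle from the top
```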
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean (D=3 forcing via circle linking), theorem alexander_duality_circle_linking · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked paper passage: "improves AP50 over prior training-free SOTA by +3.23 / +3.7 on SceneFun3D splits"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Referit3d: Neural listeners for fine-grained 3D object identification in real-world scenes
Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, and Leonidas Guibas. Referit3d: Neural listeners for fine-grained 3D object identification in real-world scenes. In European Conference on Computer Vision (ECCV), 2020
work page 2020
-
[2]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. A...
work page · arXiv 2025
-
[3]
Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Globerson, and Alexei A. Efros. Visual prompting via image inpainting. In Advances in Neural Information Processing Systems (NeurIPS), 2022
work page 2022
-
[4]
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts.arXiv preprint arXiv:2511.16719, 2025
work page · arXiv 2025
-
[5]
ScanRefer: 3D object localization in RGB-D scans using natural language
Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. ScanRefer: 3D object localization in RGB-D scans using natural language. In European Conference on Computer Vision (ECCV), 2020
work page 2020
-
[6]
Functionality understanding and segmentation in 3D scenes
Jaime Corsetti, Francesco Giuliari, Alice Fasoli, Davide Boscaini, and Fabio Poiesi. Functionality understanding and segmentation in 3D scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24550–24559, 2025
work page 2025
-
[7]
SceneFun3D: Fine-grained functionality and affordance understanding in 3D scenes
Alexandros Delitzas, Ayca Takmaz, Federico Tombari, Robert Sumner, Marc Pollefeys, and Francis Engelmann. SceneFun3D: Fine-grained functionality and affordance understanding in 3D scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
work page 2024
-
[8]
3D affordancenet: A benchmark for visual object affordance understanding
Shengheng Deng, Xun Xu, Chaozheng Wu, Ke Chen, and Kui Jia. 3D affordancenet: A benchmark for visual object affordance understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
work page 2021
-
[9]
Learning 2d invariant affordance knowledge for 3d affordance grounding
Xianqiang Gao, Pingrui Zhang, Delin Qu, Dong Wang, Zhigang Wang, Yan Ding, and Bin Zhao. Learning 2d invariant affordance knowledge for 3d affordance grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3095–3103, 2025
work page 2025
-
[10]
Task-aware 3d affordance segmentation via 2d guidance and geometric refinement
Lian He, Meng Liu, Qilang Ye, Yu Zhou, Xiang Deng, and Gangyi Ding. Task-aware 3d affordance segmentation via 2d guidance and geometric refinement. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 4654–4662, 2026
work page 2026
-
[11]
Segment3d: Learning fine-grained class-agnostic 3d segmentation without manual labels
Rui Huang, Songyou Peng, Ayca Takmaz, Federico Tombari, Marc Pollefeys, Shiji Song, Gao Huang, and Francis Engelmann. Segment3d: Learning fine-grained class-agnostic 3d segmentation without manual labels. pages 278–295, 2024
work page 2024
-
[12]
OpenIns3D: Snap and lookup for 3D open-vocabulary instance segmentation
Zhening Huang, Xiaoyang Wu, Xi Chen, Hengshuang Zhao, Lei Zhu, and Joan Lasenby. OpenIns3D: Snap and lookup for 3D open-vocabulary instance segmentation. In European Conference on Computer Vision (ECCV), 2024
work page 2024
-
[13]
ConceptFusion: Open-set multimodal 3D mapping
Krishna Murthy Jatavallabhula, Alihusein Kuwajerwala, Qiao Gu, Mohd Omama, Tao Chen, Shuang Li, Ganesh Iyer, Soroush Saryazdi, Nikhil Keetha, Ayush Tewari, Joshua B Tenenbaum, Shalini de Mello, Liangkai Liu, Ravi Ramamoorthi, Charless C Fowlkes, Siddharth Garg, and Liam Paull. ConceptFusion: Open-set multimodal 3D mapping. In Robotics: Science and Systems ...
work page 2023
-
[14]
LERF: Language embedded radiance fields
Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. LERF: Language embedded radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023
work page 2023
-
[15]
Segment anything
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023
work page 2023
-
[16]
Where2Act: From pixels to actions for articulated 3D objects
Kaichun Mo, Leonidas Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2Act: From pixels to actions for articulated 3D objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021
work page 2021
-
[17]
OpenScene: 3D scene understanding with open vocabularies
Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, and Thomas Funkhouser. OpenScene: 3D scene understanding with open vocabularies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
work page 2023
-
[18]
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Kheradmand, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:240...
work page · arXiv 2024
-
[19]
OpenMask3D: Open-vocabulary 3D instance segmentation
Ayca Takmaz, Elisabetta Fedele, Robert J. Griffin, Matthias Nießner, Federico Tombari, and Francis Engelmann. OpenMask3D: Open-vocabulary 3D instance segmentation. In Advances in Neural Information Processing Systems (NeurIPS), 2023
work page 2023
-
[20]
SegGPT: Segmenting everything in context
Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, and Tiejun Huang. SegGPT: Segmenting everything in context. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023
work page 2023
-
[21]
Xinyi Wang, Xun Yang, Yanlong Xu, Yuchen Wu, Zhen Li, and Na Zhao. Affordbot: 3d fine-grained embodied reasoning via multimodal large language models. arXiv preprint arXiv:2511.10017, 2025
-
[22]
3d-gres: Generalized 3d referring expression segmentation
Changli Wu, Yihang Liu, Jiayi Ji, Yiwei Ma, Haowei Wang, Gen Luo, Henghui Ding, Xiaoshuai Sun, and Rongrong Ji. 3d-gres: Generalized 3d referring expression segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 7852–7861, 2024
work page 2024
-
[23]
Changli Wu, Haodong Wang, Jiayi Ji, Yutian Yao, Chunsai Du, Jihua Kang, Yanwei Fu, and Liujuan Cao. MVGGT: Multimodal visual geometry grounded transformer for multiview 3D referring expression segmentation. arXiv preprint arXiv:2601.06874, 2026