pith. machine review for the scientific record.

arxiv: 2605.11616 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 Lean theorem links

Grounding by Remembering: Cross-Scene and In-Scene Memory for 3D Functional Affordances

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D affordance grounding · functional affordances · cross-scene memory · scene graph · vision-language models · training-free method · SceneFun3D dataset

The pith

A reusable memory bank of source-scene affordance images plus an in-scene spatial graph lets a frozen vision-language model localize precise functional regions in new 3D scenes without fine-tuning or target labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that functional affordance grounding in 3D can be achieved by recalling geometry from past scenes and tracking spatial relations within the current one, allowing a pre-trained model to handle small, ambiguous interaction points that text prompts overlook. This would matter for building agents that interact reliably with real-world objects, for example by pulling handles or pressing buttons, especially when multiple similar items appear in a scene. The method builds a cross-scene memory bank of RGB images with affordance overlays and an in-scene scene graph for resolving references that hinge on distance or order. If the approach holds, it demonstrates that memory can replace scene-specific training data in grounding tasks.
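To make that description concrete, the sketch below shows one way the cross-scene ingredient could be represented in code. All class and field names (MemoryEntry, CrossSceneMemoryBank, and so on) are hypothetical illustrations inferred from the abstract, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    """One cross-scene memory item: a source-scene view plus its affordance overlay."""
    category: str      # e.g. "cabinet_handle" (category-level key, per the abstract)
    rgb_path: str      # path to the source-scene RGB image
    overlay_path: str  # same view with the affordance region rendered as an overlay
    embedding: list    # precomputed image embedding used for recall

class CrossSceneMemoryBank:
    """Category-keyed store of annotated source-scene examples (illustrative sketch)."""
    def __init__(self):
        self._bank = {}

    def add(self, entry: MemoryEntry) -> None:
        self._bank.setdefault(entry.category, []).append(entry)

    def candidates(self, category: str) -> list:
        # All stored examples for the queried category; the paper recalls only
        # the "most informative" subset of these at query time.
        return self._bank.get(category, [])
```

Nothing target-scene-specific appears in this structure by design: the bank is populated once from source scenes and reused, which is what makes the pipeline training-free.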

Core claim

AFFORDMEM grounds 3D functional affordances by maintaining a category-level memory bank of RGB images with affordance regions from source scenes to guide a frozen VLM toward small operable subregions, and by organizing candidate instances into a structured scene graph for in-scene spatial memory that resolves references over distant or unobserved objects.

What carries the argument

Dual-level memory: cross-scene affordance memory bank of annotated source images recalled at query time, and in-scene spatial memory as a scene graph for spatial relations.
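The in-scene half is described only as a structured scene graph over candidate instances and their 3D relations. The sketch below shows one plausible way such a graph could resolve an ordinal reference such as 'the second handle from the top'; the node fields, the z-up convention, and the sorting rule are assumptions for illustration, not the paper's construction.

```python
from dataclasses import dataclass

@dataclass
class InstanceNode:
    instance_id: int
    category: str      # e.g. "handle"
    centroid: tuple    # (x, y, z) centroid in scene coordinates

@dataclass
class SpatialEdge:
    src: int
    dst: int
    relation: str      # e.g. "above", "near", "left_of"

class InSceneSpatialMemory:
    """Sketch of a scene graph used to resolve spatial qualifiers over candidates."""
    def __init__(self, nodes, edges):
        self.nodes = nodes
        self.edges = edges

    def nth_from_top(self, category: str, n: int) -> InstanceNode:
        # "The second handle from the top" reduces to sorting same-category
        # candidates along the vertical axis (assumed z-up) and indexing.
        matches = [v for v in self.nodes if v.category == category]
        matches.sort(key=lambda v: v.centroid[2], reverse=True)
        return matches[n - 1]

# Three stacked handles; "the second handle from the top" picks instance 1.
graph = InSceneSpatialMemory(
    nodes=[InstanceNode(0, "handle", (0.2, 1.0, 1.6)),
           InstanceNode(1, "handle", (0.2, 1.0, 1.2)),
           InstanceNode(2, "handle", (0.2, 1.0, 0.8))],
    edges=[],
)
assert graph.nth_from_top("handle", 2).instance_id == 1
```

The point the paper makes is that these candidates can include distant or currently unobserved instances, which is what a persistent graph buys over per-view prompting.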

If this is right

  • Cross-scene affordance memory improves fine-grained localization of small actionable regions that text-only prompting misses.
  • In-scene spatial memory delivers larger gains on queries involving spatial qualifiers such as 'the second handle from the top.'
  • The full method raises AP50 by 3.23 on Split 0 and 3.7 on Split 1 of SceneFun3D over prior training-free baselines (the AP50 criterion is sketched after this list).
  • No model fine-tuning or target-scene annotation is required since the memory bank is built only from source scenes.
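For context on the headline numbers: AP50 counts a prediction as correct only when its intersection-over-union with the ground-truth region is at least 0.5, then averages precision over the ranked predictions. The snippet below sketches just the IoU test; it is a generic illustration, not the SceneFun3D evaluation code.

```python
import numpy as np

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks (pixels or 3D points)."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 0.0

def counts_toward_ap50(pred_mask: np.ndarray, gt_mask: np.ndarray) -> bool:
    # A predicted affordance region is a true positive at the AP50 threshold
    # only if it overlaps the ground truth with IoU >= 0.5.
    return iou(pred_mask, gt_mask) >= 0.5
```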

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the memory bank with more varied source scenes could further improve robustness to novel layouts.
  • The scene-graph approach might transfer to other tasks requiring disambiguation of repeated object parts across time or views.
  • Online updating of the in-scene memory could support agents moving through changing environments.

Load-bearing premise

Examples stored from source scenes will match the visual and spatial patterns of affordances in completely unseen target scenes well enough to guide the VLM effectively.

What would settle it

Testing the method on a new dataset with object categories absent from the source memory bank and measuring whether AP50 falls back to or below the level of a text-only VLM baseline.

Figures

Figures reproduced from arXiv: 2605.11616 by Jingyi He, Qirui Wang, Shijie Li, Xulei Yang, Yining Pan.

Figure 3. Qualitative comparison on four representative cases. Each column shows one affordance grounding example. The first row shows the ground-truth target regions, the second row shows Fun3DU failure cases where no corresponding target mask is produced, and the third row shows predictions from AFFORDMEM. AFFORDMEM recovers compact affordance regions for small and visually ambiguous targets that are missed by the…
Original abstract

Functional affordance grounding requires more than recognizing an object: an agent must localize the specific region that supports an interaction, such as the handle to pull or the button to press. This is difficult for training-free vision-language pipelines because actionable regions are often small, visually ambiguous, and repeated across multiple same-category instances in a scene. We propose AFFORDMEM, a framework that grounds 3D functional affordances by remembering geometry at two levels. The first is cross-scene affordance memory: the agent maintains a category-level memory bank of RGB images with affordance regions rendered as overlays, and recalls the most informative examples at query time to guide a frozen VLM toward small operable subregions that text-only prompting consistently misses. The second is in-scene spatial memory: as the agent processes the scene, it organizes candidate instances and their 3D spatial relations into a structured scene graph, enabling the language model to resolve references over distant or currently unobserved candidates such as "the second handle from the top." AFFORDMEM requires no model fine-tuning and no target-scene annotation, using a reusable memory bank built from source scenes. On SceneFun3D, our method improves AP50 over the prior training-free state of the art by 3.23 on Split 0 and 3.7 on Split 1. Ablation studies support complementary benefits: cross-scene affordance memory improves fine-grained localization, while in-scene spatial memory provides the larger gain on spatially qualified queries. The project homepage is available at the project page.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes AFFORDMEM, a training-free framework for 3D functional affordance grounding in unseen scenes. It maintains a reusable cross-scene category-level memory bank of RGB images with rendered affordance overlays to guide a frozen VLM toward small operable subregions, and builds an in-scene spatial memory as a structured scene graph to resolve references over distant or unobserved instances. On SceneFun3D, the method reports AP50 gains of 3.23 on Split 0 and 3.7 on Split 1 over prior training-free SOTA, with ablations attributing complementary benefits to the two memory components.

Significance. If the empirical results hold under a fully specified protocol, the work demonstrates that explicit cross-scene and in-scene memory mechanisms can measurably improve zero-shot fine-grained localization with frozen VLMs, without target-scene fine-tuning or annotation. The reusable memory bank and the ablation-supported split between localization and spatial-reference benefits constitute a clear, practical contribution to training-free 3D vision-language pipelines. The absence of invented parameters or self-referential derivations is a strength.

major comments (2)
  1. [Results section, Table 2 and associated text] The reported AP50 improvements of 3.23 and 3.7 are given as single-point estimates without error bars, standard deviations, or the number of independent runs. Because the central claim is an empirical benchmark advance on a fixed split, this omission makes it impossible to judge whether the gains exceed typical run-to-run variance and therefore weakens the quantitative conclusion.
  2. [Section 3.1, cross-scene memory retrieval] The procedure for selecting the 'most informative examples' from the memory bank is described at a high level but does not specify the exact similarity metric, top-k value, or rendering overlay format used at inference time. This detail is load-bearing for the claim that cross-scene memory supplies fine-grained localization cues that text-only prompting misses.
minor comments (3)
  1. [Abstract and Section 4.1] The abstract and Section 4.1 refer to 'prior training-free state of the art' without an explicit citation or short description of the strongest baseline method; adding this reference would improve reproducibility.
  2. [Figure 3] Figure 3 (qualitative examples) would benefit from an additional row or inset showing the exact memory-bank image that was retrieved for each query, to illustrate the cross-scene guidance mechanism.
  3. [Section 3.2] Notation for the scene-graph nodes and edges is introduced in Section 3.2 but never summarized in a single table; a compact legend would reduce reader effort.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work and for the constructive comments. We address each major comment point by point below. We agree that both points identify areas where the manuscript can be strengthened and will revise accordingly.

Point-by-point responses
  1. Referee: [Results section, Table 2 and associated text] The reported AP50 improvements of 3.23 and 3.7 are given as single-point estimates without error bars, standard deviations, or the number of independent runs. Because the central claim is an empirical benchmark advance on a fixed split, this omission makes it impossible to judge whether the gains exceed typical run-to-run variance and therefore weakens the quantitative conclusion.

    Authors: We thank the referee for this observation. AFFORDMEM is a training-free method whose core components (cross-scene memory retrieval via fixed embeddings and in-scene scene-graph construction) are fully deterministic; we employ greedy decoding in the frozen VLM, so run-to-run variance is not expected and a single execution on the fixed splits is reproducible. To address the concern and improve the presentation, we will add an explicit statement in the revised results section clarifying the deterministic nature of the pipeline and confirming that the reported AP50 figures come from a single, fully specified run. revision: partial

  2. Referee: [Section 3.1, cross-scene memory retrieval] The procedure for selecting the 'most informative examples' from the memory bank is described at a high level but does not specify the exact similarity metric, top-k value, or rendering overlay format used at inference time. This detail is load-bearing for the claim that cross-scene memory supplies fine-grained localization cues that text-only prompting misses.

    Authors: We agree that these implementation details are necessary for reproducibility and for readers to understand how the memory bank supplies the claimed localization cues. We will revise Section 3.1 to specify the exact similarity metric, the top-k value, and the rendering overlay format used at inference time. We will also consider adding a short algorithm box or pseudocode to make the retrieval procedure fully transparent. revision: yes
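To illustrate the kind of specification the referee asks for and the authors promise to add, the sketch below shows one common recipe for recalling the 'most informative examples': cosine similarity between a query embedding and the stored memory embeddings, followed by a top-k cut. The similarity metric, the value of k, and the embedding model are assumptions for illustration; the paper's actual choices are exactly what the revised Section 3.1 is meant to pin down.

```python
import numpy as np

def top_k_recall(query_emb: np.ndarray, memory_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Indices of the k memory entries most similar to the query (illustrative).

    query_emb:   (d,)   embedding of the current query view or instruction.
    memory_embs: (n, d) embeddings of the stored source-scene overlay images.
    Cosine similarity and k=3 are placeholder choices, not the paper's values.
    """
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    m = memory_embs / (np.linalg.norm(memory_embs, axis=1, keepdims=True) + 1e-8)
    scores = m @ q                      # cosine similarities, shape (n,)
    return np.argsort(-scores)[:k]      # indices of the top-k entries
```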

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical method (AFFORDMEM) for 3D affordance grounding via reusable cross-scene memory banks and in-scene scene graphs, evaluated on SceneFun3D with reported AP50 gains and ablations. There is no derivation chain, no fitted parameter renamed as a prediction, and no self-referential definition; the memory construction is independent of target scenes, and the results are benchmark measurements rather than quantities forced by construction or self-citation loops. The approach is judged against an external benchmark rather than against quantities it defines for itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard assumptions about VLM behavior under visual prompting and the utility of scene graphs for spatial reference resolution; no new free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption: Frozen VLMs can be effectively steered toward small operable regions by retrieved example images with affordance overlays.
    Central to the cross-scene memory component described in the abstract; a rough visual-prompting sketch follows this list.
  • domain assumption: A structured scene graph of 3D object relations enables resolution of spatial references across unobserved instances.
    Underpins the in-scene spatial memory mechanism.
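As a rough picture of what steering a frozen VLM with retrieved overlay exemplars can look like, the sketch below assembles a generic image-plus-text message from recalled memory entries. The chat-message schema, field names, and wording are assumptions for illustration, not the paper's prompt or any specific model's API.

```python
def build_visual_prompt(instruction: str, recalled_overlays: list, query_view: str) -> list:
    """Assemble a generic multimodal prompt for a frozen VLM (illustrative sketch).

    recalled_overlays: paths to memory-bank images with affordance overlays.
    query_view:        path to the current target-scene view.
    """
    content = [{"type": "image", "path": p} for p in recalled_overlays]
    content.append({"type": "text",
                    "text": "The highlighted regions above mark where this "
                            "affordance is operated on similar objects."})
    content.append({"type": "image", "path": query_view})
    content.append({"type": "text",
                    "text": f"Instruction: {instruction}. "
                            "Localize the corresponding operable region in the last image."})
    return [{"role": "user", "content": content}]
```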

pith-pipeline@v0.9.0 · 5599 in / 1418 out tokens · 34061 ms · 2026-05-13T01:51:18.489084+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

    Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...