pith. machine review for the scientific record.

arxiv: 2601.09211 · v2 · submitted 2026-01-14 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Affostruction: 3D Affordance Grounding with Generative Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 14:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D affordance grounding · generative reconstruction · RGBD images · shape completion · flow-based modeling · active view selection · affordance localization · partial observations

The pith

A generative model reconstructs full 3D object geometry from partial RGBD views and locates action-specific regions on both visible and hidden surfaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Affostruction as a way to ground affordances—surface areas matching a text-described action—from single or few RGBD images of an object. Prior approaches stop at visible surfaces, but this method first generates the complete shape and then predicts affordance distributions over the entire geometry. It achieves this through constant-complexity reconstruction that fuses multi-view features into sparse voxels, a flow model to represent uncertainty in where an action can occur, and a strategy that picks new views based on early affordance estimates. If successful, systems could reason about how to interact with objects even when large portions remain unseen in the input.

Core claim

Affostruction reconstructs complete object geometry from partial RGBD observations via sparse voxel fusion of multi-view features and grounds affordances on the full shape including unobserved regions, using a flow-based formulation to capture inherent ambiguity in affordance distributions together with active view selection guided by predicted affordances.

What carries the argument

Sparse voxel fusion of multi-view features for generative reconstruction, paired with flow-based modeling of affordance ambiguity and active view selection.
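
To make the fusion step concrete: a minimal sketch, assuming per-pixel features (e.g., upsampled DINOv2 maps [33]), metric depth, and pinhole camera parameters. Every name here is illustrative rather than the paper's API, and a dense grid stands in for the sparse voxel structure; the point it demonstrates is that averaging views into a fixed grid keeps memory constant as more views are added.

```python
# Hypothetical sketch of sparse voxel fusion: per-pixel features from each
# RGBD view are unprojected with depth and camera parameters, then averaged
# into a fixed-resolution voxel grid, so cost stays constant in view count.
import torch

def fuse_views_into_voxels(feats, depths, intrinsics, cam_to_world,
                           grid_res=64, bound=1.0):
    """feats: (V, C, H, W) per-pixel features; depths: (V, H, W) metric depth;
    intrinsics: (V, 3, 3); cam_to_world: (V, 4, 4). Returns a (C, R, R, R)
    feature grid and a per-voxel hit count (a sparse backend would keep
    only the nonzero voxels)."""
    V, C, H, W = feats.shape
    R = grid_res
    grid = torch.zeros(C, R * R * R)
    count = torch.zeros(R * R * R)
    v_px, u_px = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32), indexing="ij")
    for v in range(V):
        z = depths[v].reshape(-1)
        valid = z > 0                                  # drop holes in depth
        # Back-project pixels to camera space, then transform to world space.
        K_inv = torch.linalg.inv(intrinsics[v])
        pix = torch.stack([u_px.reshape(-1), v_px.reshape(-1),
                           torch.ones(H * W)], dim=0)  # (3, HW) homogeneous
        cam = (K_inv @ pix) * z                        # (3, HW) camera coords
        cam_h = torch.cat([cam, torch.ones(1, H * W)], dim=0)
        world = (cam_to_world[v] @ cam_h)[:3]          # (3, HW) world coords
        # Quantize points inside [-bound, bound]^3 to voxel indices.
        idx = ((world + bound) / (2 * bound) * R).long().clamp(0, R - 1)
        flat = ((idx[0] * R + idx[1]) * R + idx[2])[valid]
        f = feats[v].reshape(C, -1)[:, valid]
        grid.index_add_(1, flat, f)                    # accumulate features
        count.index_add_(0, flat,
                         torch.ones_like(flat, dtype=torch.float32))
    grid = grid / count.clamp(min=1)                   # mean over all views
    return grid.reshape(C, R, R, R), count.reshape(R, R, R)
```

Because each view is folded into the same fixed grid, adding views changes runtime linearly but leaves the fused representation's size unchanged, which is the property the constant-complexity claim rests on.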

If this is right

  • Affordance grounding extends to surfaces not observed in the input RGBD images.
  • Reconstruction maintains constant computational cost even as the number of input views grows.
  • Active view selection uses initial affordance predictions to choose additional observations that improve final grounding accuracy.
  • Flow-based modeling expresses the probabilistic nature of suitable action regions rather than a single deterministic map (see the sketch after this list).
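
A minimal sketch of what that last point could look like in practice, assuming a rectified-flow-style sampler [19, 23] over per-voxel affordance scores; `velocity_net` and its signature are hypothetical stand-ins, not the paper's model:

```python
# Hypothetical flow-based affordance sampler: integrate a learned velocity
# field from Gaussian noise to an affordance map, conditioned on shape
# features and a text query. Repeated draws yield diverse valid maps.
import torch

@torch.no_grad()
def sample_affordance(velocity_net, shape_feats, text_emb, n_voxels, steps=32):
    a = torch.randn(n_voxels)          # a_0 ~ N(0, I) over occupied voxels
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((1,), k * dt)   # current flow time in [0, 1)
        v = velocity_net(a, t, shape_feats, text_emb)
        a = a + dt * v                 # forward Euler step along the flow
    return a.sigmoid()                 # squash to per-voxel scores

# Sampling several times for one object-query pair (cf. Figure 3):
# maps = [sample_affordance(net, feats, query, n) for _ in range(4)]
```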

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robotic planning could use these completed shapes to simulate grasps or placements on back sides of objects before physical interaction.
  • The same reconstruction pipeline might support incremental mapping when an agent circles an object and collects new views over time.
  • Text queries could be refined by large language models to produce more precise affordance distributions within the generated geometry.

Load-bearing premise

The generative reconstruction accurately fills in geometry for unseen object parts so that affordance predictions on those parts stay reliable.

What would settle it

Objects with substantial occluded geometry where the reconstructed shape differs markedly from ground truth and produces affordance maps that disagree with human annotations on the hidden surfaces.

Figures

Figures reproduced from arXiv: 2601.09211 by Chunghyun Park, Minsu Cho, Seunghyeon Lee.

Figure 1. Affostruction. Given an initial RGBD observation where functional regions for an affordance query (e.g., “attach a light fixture”) are poorly visible, we perform generative reconstruction to complete the 3D geometry and predict affordances on the full shape including occluded surfaces using flow-based grounding. Our affordance-driven active view selection identifies optimal viewpoints (red) that maximize…

Figure 2. Affostruction overview. Our approach consists of three stages. (1) Generative multi-view reconstruction: DINOv2 [33] features from multiple RGBD views are fused into sparse voxels using depth and camera parameters. A Flow Transformer conditioned on these multi-view features and trained with stochastic multi-view training extrapolates complete 3D structure from partial observations, decoded via frozen spars…

Figure 3. Diverse affordance predictions. Four sampling iterations for the same object-query pair produce diverse valid affordance distributions, demonstrating that our generative approach effectively captures the inherent ambiguity in affordance.

Figure 4. Qualitative results on partial 3D affordance grounding. Affostruction reconstructs complete geometry and grounds affordances throughout entire objects from single RGBD views. Despite limited observations, our method predicts affordances on occluded regions, demonstrating the ability to reason about 3D functional interactions even when large portions of objects are unobserved.

Figure 5. Multi-view training impact. We compare IoU (geometric reconstruction accuracy) as a function of the number of input views for methods trained with and without multi-view supervision. Methods trained on single views show minimal improvement or even degradation when given multiple views at inference (left group), while multi-view trained methods show consistent improvements with additional views (right group).

Figure 6. Affordance-driven active view sampling. We compare affordance grounding quality (aIoU) as views are incrementally added using different sampling strategies on the Affogato test set. All methods start from the same challenging viewpoint with poor affordance visibility. Our affordance-driven active sampling (red) achieves the fastest improvement by prioritizing viewpoints that reveal important regions. Seque…
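
Figure 6's comparison implies a greedy selection loop. Below is a hedged sketch of affordance-driven next-view scoring, under the simplifying assumption that a view's value is the predicted affordance mass it newly reveals; `visibility_fn` and the gain rule are illustrative, not the paper's stated criterion.

```python
# Hypothetical greedy view selection: score each candidate camera by how
# much currently-predicted affordance mass it would newly reveal.
import torch

def pick_next_view(affordance_grid, visibility_fn, candidate_poses, seen_mask):
    """affordance_grid: (R, R, R) current prediction; seen_mask: (R, R, R)
    bool of voxels already observed; visibility_fn(pose) returns a bool
    (R, R, R) mask of voxels visible from that pose."""
    best_pose, best_gain = None, -1.0
    for pose in candidate_poses:
        vis = visibility_fn(pose)
        # Expected newly revealed affordance mass under the current estimate.
        gain = (affordance_grid * (vis & ~seen_mask)).sum().item()
        if gain > best_gain:
            best_pose, best_gain = pose, gain
    return best_pose
```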
Original abstract

This paper addresses the problem of affordance grounding from RGBD images of an object, which aims to localize surface regions corresponding to a text query that describes an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose Affostruction, a generative framework that reconstructs complete object geometry from partial RGBD observations and grounds affordances on the full shape including unobserved regions. Our approach introduces sparse voxel fusion of multi-view features for constant-complexity generative reconstruction, a flow-based formulation that captures the inherent ambiguity of affordance distributions, and an active view selection strategy guided by predicted affordances. Affostruction outperforms existing methods by large margins on challenging benchmarks, achieving 19.1 aIoU on affordance grounding and 32.67 IoU for 3D reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Affostruction, a generative framework for 3D affordance grounding from partial RGBD observations. It reconstructs complete object geometry via sparse voxel fusion of multi-view features, models affordance distributions with a flow-based formulation to handle ambiguity, and uses active view selection guided by predicted affordances. The central claim is that this enables reliable affordance grounding on the full shape including unobserved regions, yielding large gains over prior methods (19.1 aIoU on grounding, 32.67 IoU on reconstruction).

Significance. If the reconstruction of unobserved geometry proves sufficiently accurate and the affordance predictions on those regions are validated, the work would advance 3D scene understanding for interaction tasks by moving beyond visible-surface limitations. The constant-complexity sparse fusion and flow-based ambiguity modeling are technically interesting contributions that could influence downstream robotics and AR applications.

major comments (2)
  1. [Abstract and §4] The reported 19.1 aIoU improvement is presented as evidence that affordances are successfully grounded on unobserved reconstructed geometry, yet standard benchmarks annotate only visible surfaces from the input RGBD views. No direct GT annotations, per-region (visible vs. hallucinated) breakdown, or uncertainty-aware metrics are described for the unobserved voxels, despite the reconstruction IoU of 32.67 indicating non-negligible error.
  2. [§3.1] The sparse voxel fusion is claimed to produce complete geometry suitable for downstream affordance grounding, but the manuscript provides no ablation isolating the effect of reconstruction accuracy on affordance aIoU specifically in unobserved regions, leaving the load-bearing assumption untested.
minor comments (2)
  1. [§3.2] Notation for the flow-based affordance model could be clarified with an explicit equation for the conditional density in the methods section; one generic candidate form is sketched after this list.
  2. Figure captions should explicitly label which surfaces are input-visible versus reconstructed to aid reader interpretation of qualitative results.
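
On minor comment 1: one standard explicit form the notation could adopt, assuming conditional flow matching in the style of [19, 23]; this is the generic objective, not necessarily the paper's exact parameterization. With conditioning c (fused shape features plus the text query), a ground-truth affordance map a_1, noise a_0 ~ N(0, I), and the linear path a_t = (1 - t)a_0 + t a_1:

```latex
% Generic conditional flow matching objective, after [19, 23].
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; a_0 \sim \mathcal{N}(0, I),\; a_1}
    \left\| v_\theta(a_t,\, t \mid c) - (a_1 - a_0) \right\|^2,
\qquad a_t = (1 - t)\, a_0 + t\, a_1 .
```

The conditional density is then defined implicitly: a sample a ~ p(a | c) is obtained by integrating da_t/dt = v_θ(a_t, t | c) from a_0 at t = 0 to t = 1.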

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful review and constructive criticism. The points raised about validating performance on unobserved geometry are important, and we will revise the manuscript to better address them while acknowledging the limitations of available ground-truth data.

Point-by-point responses
  1. Referee: [Abstract and §4] The reported 19.1 aIoU improvement is presented as evidence that affordances are successfully grounded on unobserved reconstructed geometry, yet standard benchmarks annotate only visible surfaces from the input RGBD views. No direct GT annotations, per-region (visible vs. hallucinated) breakdown, or uncertainty-aware metrics are described for the unobserved voxels, despite the reconstruction IoU of 32.67 indicating non-negligible error.

    Authors: We agree that the reported aIoU is computed on visible surfaces as per the benchmarks, and there are no direct GT annotations for unobserved regions. This makes it difficult to directly quantify affordance grounding accuracy on hallucinated geometry. In the revised manuscript, we will clarify this in the abstract and experiments section, add qualitative results showing affordance predictions on reconstructed unobserved surfaces, and include uncertainty-aware analysis using the flow-based model to evaluate prediction confidence in those regions. We will also discuss how the reconstruction IoU impacts the reliability of these predictions. revision: partial

  2. Referee: [§3.1] The sparse voxel fusion is claimed to produce complete geometry suitable for downstream affordance grounding, but the manuscript provides no ablation isolating the effect of reconstruction accuracy on affordance aIoU specifically in unobserved regions, leaving the load-bearing assumption untested.

    Authors: We acknowledge that an ablation specifically for unobserved regions would be ideal but is limited by the absence of GT affordance labels there. We will add an ablation study measuring the effect of reconstruction accuracy (e.g., our method vs. ground-truth geometry where possible for the full shape) on the overall affordance aIoU, and analyze the correlation with reconstruction quality to indirectly support the assumption. This will be included in §4. revision: partial

standing simulated objections (unresolved)
  • Quantitative evaluation of affordance grounding specifically on unobserved regions due to lack of ground-truth annotations in the benchmarks.

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper introduces Affostruction as a new generative framework combining sparse voxel fusion for reconstruction, flow-based affordance modeling, and active view selection. No equations, definitions, or claims in the abstract or described components reduce by construction to fitted parameters, self-referential inputs, or load-bearing self-citations. Performance metrics (aIoU, IoU) are standard benchmarks applied to the outputs rather than being redefined within the method itself. The derivation chain remains self-contained as an independent technical proposal without the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that generative models can reliably infer unobserved geometry from partial RGBD input; no explicit free parameters or new invented entities are named in the abstract.

axioms (1)
  • domain assumption: Generative reconstruction from partial multi-view RGBD observations can produce accurate geometry for unobserved regions.
    This assumption underpins the claim that affordances can be grounded on the full shape.

pith-pipeline@v0.9.0 · 5442 in / 1309 out tokens · 28140 ms · 2026-05-16T14:48:55.782690+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 6 internal anchors

  1. [1] Abo: Dataset and benchmarks for real-world 3D object understanding
    Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F. Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al. Abo: Dataset and benchmarks for real-world 3D object understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21126–21136, 2022.

  2. [2] Bundlefusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration
    Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt. Bundlefusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (ToG), 36(4):1, 2017.

  3. [3] Objaverse-XL: A universe of 10M+ 3D objects
    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-XL: A universe of 10M+ 3D objects. Advances in Neural Information Processing Systems, 36:35799–35813, 2023.

  4. [4] 3D AffordanceNet: A benchmark for visual object affordance understanding
    Shengheng Deng, Xun Xu, Chaozheng Wu, Ke Chen, and Kui Jia. 3D AffordanceNet: A benchmark for visual object affordance understanding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1778–1787, 2021.

  5. [5] TransMVSNet: Global context-aware multi-view stereo network with transformers
    Yikang Ding, Wentao Yuan, Qingtian Zhu, Haotian Zhang, Xiangyue Liu, Yuanjiang Wang, and Xiao Liu. TransMVSNet: Global context-aware multi-view stereo network with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8585–8594, 2022.

  6. [6] 3D-FUTURE: 3D furniture shape with texture
    Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3D-FUTURE: 3D furniture shape with texture. International Journal of Computer Vision, 129(12):3313–3337, 2021.

  7. [7] Get3D: A generative model of high quality 3D textured shapes learned from images
    Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3D: A generative model of high quality 3D textured shapes learned from images. Advances in Neural Information Processing Systems, 35:31841–31854, 2022.

  8. [8] Cascade cost volume for high-resolution multi-view stereo and stereo matching
    Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2495–2504, 2020.

  9. [9] Classifier-free diffusion guidance
    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.

  10. [10] Zero-shot multi-object scene completion
    Shun Iwase, Katherine Liu, Vitor Guizilini, Adrien Gaidon, Kris Kitani, Rareş Ambruş, and Sergey Zakharov. Zero-shot multi-object scene completion. In European Conference on Computer Vision, pages 96–113. Springer, 2024.

  11. [11] Synergies between affordance and geometry: 6-DoF grasp detection via implicit representations
    Zhenyu Jiang, Yifeng Zhu, Maxwell Svetlik, Kuan Fang, and Yuke Zhu. Synergies between affordance and geometry: 6-DoF grasp detection via implicit representations. In Robotics: Science and Systems (RSS), 2021.

  12. [12] Shap-E: Generating conditional 3D implicit functions
    Heewoo Jun and Alex Nichol. Shap-E: Generating conditional 3D implicit functions. arXiv preprint arXiv:2305.02463, 2023.

  13. [13] Habitat synthetic scenes dataset (HSSD-200): An analysis of 3D scene scale and realism tradeoffs for ObjectGoal navigation
    Mukul Khanna, Yongsen Mao, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X Chang, and Manolis Savva. Habitat synthetic scenes dataset (HSSD-200): An analysis of 3D scene scale and realism tradeoffs for ObjectGoal navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern…

  14. [14] Chisel: Real time large scale 3D reconstruction onboard a mobile device using spatially hashed signed distance fields
    Matthew Klingensmith, Ivan Dryanovski, Siddhartha Srinivasa, and Jianxiong Xiao. Chisel: Real time large scale 3D reconstruction onboard a mobile device using spatially hashed signed distance fields. In Robotics: Science and Systems (RSS), 2015.

  15. [15] Affogato: Learning open-vocabulary affordance grounding with automated data generation at scale
    Junha Lee, Eunha Park, Chunghyun Park, Dahyun Kang, and Minsu Cho. Affogato: Learning open-vocabulary affordance grounding with automated data generation at scale. arXiv preprint arXiv:2506.12009, 2025.

  16. [16] Grounding image matching in 3D with MASt3R
    Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3D with MASt3R. arXiv preprint arXiv:2406.09756, 2024.

  17. [17] MVControl: Adding conditional control to multi-view diffusion for controllable text-to-3D generation
    Zhiqi Li, Yiming Chen, Lingzhe Zhao, and Peidong Liu. MVControl: Adding conditional control to multi-view diffusion for controllable text-to-3D generation. arXiv preprint arXiv:2311.14494, 2023.

  18. [18] Magic3D: High-resolution text-to-3D content creation
    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-resolution text-to-3D content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.

  19. [19] Flow matching for generative modeling
    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.

  20. [20] Structured 3D latents for scalable and versatile 3D generation
    Jianfeng Liu, Xiaoshui Zeng, Zeyuan Wu, Yujun Lu, Yuan Li, Ming-Hsuan Chen, and Song-Hai Zhang. Structured 3D latents for scalable and versatile 3D generation. arXiv preprint arXiv:2412.01506, 2024.

  21. [21] One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization
    Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36:22226–22246, 2023.

  22. [22] Zero-1-to-3: Zero-shot one image to 3D object
    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3D object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023.

  23. [23] Flow straight and fast: Learning to generate and transfer data with rectified flow
    Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023.

  24. [24] SyncDreamer: Generating multiview-consistent images from a single-view image
    Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023.

  25. [25] Decoupled weight decay regularization
    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

  26. [26] NEAT: Learning neural implicit surfaces with arbitrary topologies from multi-view images
    Xiaoxu Meng, Weikai Chen, and Bo Yang. NEAT: Learning neural implicit surfaces with arbitrary topologies from multi-view images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3034–3043, 2023.

  27. [27] V-Net: Fully convolutional neural networks for volumetric medical image segmentation
    Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571. IEEE, 2016.

  28. [28] Where2Act: From pixels to actions for articulated 3D objects
    Kaichun Mo, Leonidas J Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2Act: From pixels to actions for articulated 3D objects. In IEEE International Conference on Computer Vision (ICCV), pages 6813–6823, 2021.

  29. [29] KinectFusion: Real-time dense surface mapping and tracking
    Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohi, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 127–136. IEEE, 2011.

  30. [30] Open-vocabulary affordance detection in 3D point clouds
    Toan Nguyen, Minh Nhat Vu, An Vuong, Dzung Nguyen, Thieu Vo, Ngan Le, and Anh Nguyen. Open-vocabulary affordance detection in 3D point clouds. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5692–5698. IEEE, 2023.

  31. [31] Language-conditioned affordance-pose detection in 3D point clouds
    Toan Nguyen, Minh Nhat Vu, Baoru Huang, Tuan Van Vo, Vy Truong, Ngan Le, Thieu Vo, Bac Le, and Anh Nguyen. Language-conditioned affordance-pose detection in 3D point clouds. In IEEE International Conference on Robotics and Automation (ICRA), pages 4216–4223, 2024.

  32. [32] Voxblox: Incremental 3D Euclidean signed distance fields for on-board MAV planning
    Helen Oleynikova, Zachary Taylor, Marius Fehr, Roland Siegwart, and Juan Nieto. Voxblox: Incremental 3D Euclidean signed distance fields for on-board MAV planning. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1366–1373, 2017.

  33. [33] DINOv2: Learning robust visual features without supervision
    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.

  34. [34] AffordanceLLM: Grounding affordance from vision language models
    Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, and Li Erran Li. AffordanceLLM: Grounding affordance from vision language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 5627–5637, 2024.

  35. [35] Learning transferable visual models from natural language supervision
    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  36. [36] XCube: Large-scale 3D generative modeling using sparse voxel hierarchies
    Jiakai Ren, Zehuan Liang, Xiang Feng, Yu-Guan Hwang, Yan-Pei Chen, Zeqi Liu, Xin Zhou, Chen Cao, Pan Gao, and Tobias Ritschel. XCube: Large-scale 3D generative modeling using sparse voxel hierarchies. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7309–7318, 2024.

  37. [37] MVDream: Multi-view diffusion for 3D generation
    Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. arXiv preprint arXiv:2308.16512, 2023.

  38. [38] OVA-Fields: Weakly supervised open-vocabulary affordance fields for robot operational part detection
    Heng Su, Mengying Xie, Nieqing Cao, Yan Ding, Beichen Shao, Xianlei Long, Fuqiang Gu, and Chao Chen. OVA-Fields: Weakly supervised open-vocabulary affordance fields for robot operational part detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6385–6395, 2025.

  39. [39] NeuralRecon: Real-time coherent 3D reconstruction from monocular video
    Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, and Hujun Bao. NeuralRecon: Real-time coherent 3D reconstruction from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15598–15607, 2021.

  40. [40] LGM: Large multi-view Gaussian model for high-resolution 3D content creation
    Jiaxiang Tang, Zhaoxi Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. LGM: Large multi-view Gaussian model for high-resolution 3D content creation. In European Conference on Computer Vision (ECCV), pages 381–399, 2024.

  41. [41] Gemma 3 technical report
    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

  42. [42] NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction
    Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In Advances in Neural Information Processing Systems (NeurIPS), pages 27171–27183, 2021.

  43. [43] DUSt3R: Geometric 3D vision made easy
    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024.

  44. [44] AdaAfford: Learning to adapt manipulation affordance for 3D articulated objects via few-shot interactions
    Yian Wang, Ruihai Wu, Kaichun Mo, Jiaqi Ke, Qingnan Fan, Leonidas J Guibas, and Hao Dong. AdaAfford: Learning to adapt manipulation affordance for 3D articulated objects via few-shot interactions. In European Conference on Computer Vision (ECCV), pages 90–107, 2022.

  45. [45] Multiview compressive coding for 3D reconstruction
    Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, and Georgia Gkioxari. Multiview compressive coding for 3D reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9065–9075, 2023.

  46. [46] InstantMesh: Efficient 3D mesh generation from a single image with sparse-view large reconstruction models
    Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. InstantMesh: Efficient 3D mesh generation from a single image with sparse-view large reconstruction models. arXiv preprint arXiv:2404.07191, 2024.

  47. [47] DreamComposer: Controllable 3D object generation via multi-view conditions
    Yunhan Yang, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo, Song-Hai Zhang, Hengshuang Zhao, Tong He, and Xihui Liu. DreamComposer: Controllable 3D object generation via multi-view conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8111–8120, 2024.

  48. [48] MVSNet: Depth inference for unstructured multi-view stereo
    Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pages 767–783, 2018.

  49. [49] Recurrent MVSNet for high-resolution multi-view stereo depth inference
    Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent MVSNet for high-resolution multi-view stereo depth inference. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5525–5534, 2019.

  50. [50] Grounding 3D object affordance with language instructions, visual observations and interactions
    He Zhu, Quyu Kong, Kechun Xu, Xunlong Xia, Bing Deng, Jieping Ye, Rong Xiong, and Yue Wang. Grounding 3D object affordance with language instructions, visual observations and interactions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.