pith. machine review for the scientific record.

arxiv: 2601.09211 · v2 · submitted 2026-01-14 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Affostruction: 3D Affordance Grounding with Generative Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 14:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D affordance grounding · generative reconstruction · RGBD images · shape completion · flow-based modeling · active view selection · affordance localization · partial observations

The pith

A generative model reconstructs full 3D object geometry from partial RGBD views and locates action-specific regions on both visible and hidden surfaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Affostruction as a way to ground affordances—surface areas matching a text-described action—from single or few RGBD images of an object. Prior approaches stop at visible surfaces, but this method first generates the complete shape and then predicts affordance distributions over the entire geometry. It achieves this through constant-complexity reconstruction that fuses multi-view features into sparse voxels, a flow model to represent uncertainty in where an action can occur, and a strategy that picks new views based on early affordance estimates. If successful, systems could reason about how to interact with objects even when large portions remain unseen in the input.

Core claim

Affostruction reconstructs complete object geometry from partial RGBD observations via sparse voxel fusion of multi-view features and grounds affordances on the full shape including unobserved regions, using a flow-based formulation to capture inherent ambiguity in affordance distributions together with active view selection guided by predicted affordances.

What carries the argument

Sparse voxel fusion of multi-view features for generative reconstruction, paired with flow-based modeling of affordance ambiguity and active view selection.
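
To make the fusion step concrete: a minimal sketch, assuming per-pixel features (e.g., upsampled DINOv2 maps [33]), metric depth, and pinhole camera parameters. Every name here is illustrative rather than the paper's API, and a dense grid stands in for the sparse voxel structure; the point it demonstrates is that averaging views into a fixed grid keeps memory constant as more views are added.

```python
# Hypothetical sketch of sparse voxel fusion: per-pixel features from each
# RGBD view are unprojected with depth and camera parameters, then averaged
# into a fixed-resolution voxel grid, so cost stays constant in view count.
import torch

def fuse_views_into_voxels(feats, depths, intrinsics, cam_to_world,
                           grid_res=64, bound=1.0):
    """feats: (V, C, H, W) per-pixel features; depths: (V, H, W) metric depth;
    intrinsics: (V, 3, 3); cam_to_world: (V, 4, 4). Returns a (C, R, R, R)
    feature grid and a per-voxel hit count (a sparse backend would keep
    only the nonzero voxels)."""
    V, C, H, W = feats.shape
    R = grid_res
    grid = torch.zeros(C, R * R * R)
    count = torch.zeros(R * R * R)
    v_px, u_px = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32), indexing="ij")
    for v in range(V):
        z = depths[v].reshape(-1)
        valid = z > 0                                  # drop holes in depth
        # Back-project pixels to camera space, then transform to world space.
        K_inv = torch.linalg.inv(intrinsics[v])
        pix = torch.stack([u_px.reshape(-1), v_px.reshape(-1),
                           torch.ones(H * W)], dim=0)  # (3, HW) homogeneous
        cam = (K_inv @ pix) * z                        # (3, HW) camera coords
        cam_h = torch.cat([cam, torch.ones(1, H * W)], dim=0)
        world = (cam_to_world[v] @ cam_h)[:3]          # (3, HW) world coords
        # Quantize points inside [-bound, bound]^3 to voxel indices.
        idx = ((world + bound) / (2 * bound) * R).long().clamp(0, R - 1)
        flat = ((idx[0] * R + idx[1]) * R + idx[2])[valid]
        f = feats[v].reshape(C, -1)[:, valid]
        grid.index_add_(1, flat, f)                    # accumulate features
        count.index_add_(0, flat,
                         torch.ones_like(flat, dtype=torch.float32))
    grid = grid / count.clamp(min=1)                   # mean over all views
    return grid.reshape(C, R, R, R), count.reshape(R, R, R)
```

Because each view is folded into the same fixed grid, adding views changes runtime linearly but leaves the fused representation's size unchanged, which is the property the constant-complexity claim rests on.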

If this is right

  • Affordance grounding extends to surfaces not observed in the input RGBD images.
  • Reconstruction maintains constant computational cost even as the number of input views grows.
  • Active view selection uses initial affordance predictions to choose additional observations that improve final grounding accuracy.
  • Flow-based modeling expresses the probabilistic nature of suitable action regions rather than a single deterministic map (see the sketch after this list).
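
A minimal sketch of what that last point could look like in practice, assuming a rectified-flow-style sampler [19, 23] over per-voxel affordance scores; `velocity_net` and its signature are hypothetical stand-ins, not the paper's model:

```python
# Hypothetical flow-based affordance sampler: integrate a learned velocity
# field from Gaussian noise to an affordance map, conditioned on shape
# features and a text query. Repeated draws yield diverse valid maps.
import torch

@torch.no_grad()
def sample_affordance(velocity_net, shape_feats, text_emb, n_voxels, steps=32):
    a = torch.randn(n_voxels)          # a_0 ~ N(0, I) over occupied voxels
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((1,), k * dt)   # current flow time in [0, 1)
        v = velocity_net(a, t, shape_feats, text_emb)
        a = a + dt * v                 # forward Euler step along the flow
    return a.sigmoid()                 # squash to per-voxel scores

# Sampling several times for one object-query pair (cf. Figure 3):
# maps = [sample_affordance(net, feats, query, n) for _ in range(4)]
```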

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robotic planning could use these completed shapes to simulate grasps or placements on back sides of objects before physical interaction.
  • The same reconstruction pipeline might support incremental mapping when an agent circles an object and collects new views over time.
  • Text queries could be refined by large language models to produce more precise affordance distributions within the generated geometry.

Load-bearing premise

The generative reconstruction accurately fills in geometry for unseen object parts so that affordance predictions on those parts stay reliable.

What would settle it

Objects with substantial occluded geometry where the reconstructed shape differs markedly from ground truth and produces affordance maps that disagree with human annotations on the hidden surfaces.

Figures

Figures reproduced from arXiv: 2601.09211 by Chunghyun Park, Minsu Cho, Seunghyeon Lee.

Figure 1. Affostruction. Given an initial RGBD observation where functional regions for an affordance query (e.g., “attach a light fixture”) are poorly visible, we perform generative reconstruction to complete the 3D geometry and predict affordances on the full shape including occluded surfaces using flow-based grounding. Our affordance-driven active view selection identifies optimal viewpoints (red) that maximize…

Figure 2. Affostruction overview. Our approach consists of three stages. (1) Generative multi-view reconstruction: DINOv2 [33] features from multiple RGBD views are fused into sparse voxels using depth and camera parameters. A Flow Transformer conditioned on these multi-view features and trained with stochastic multi-view training extrapolates complete 3D structure from partial observations, decoded via frozen spars…

Figure 3. Diverse affordance predictions. Four sampling iterations for the same object-query pair produce diverse valid affordance distributions, demonstrating that our generative approach effectively captures the inherent ambiguity in affordance.

Figure 4. Qualitative results on partial 3D affordance grounding. Affostruction reconstructs complete geometry and grounds affordances throughout entire objects from single RGBD views. Despite limited observations, our method predicts affordances on occluded regions, demonstrating the ability to reason about 3D functional interactions even when large portions of objects are unobserved.

Figure 5. Multi-view training impact. We compare IoU (geometric reconstruction accuracy) as a function of the number of input views for methods trained with and without multi-view supervision. Methods trained on single views show minimal improvement or even degradation when given multiple views at inference (left group), while multi-view trained methods show consistent improvements with additional views (right group).

Figure 6. Affordance-driven active view sampling. We compare affordance grounding quality (aIoU) as views are incrementally added using different sampling strategies on the Affogato test set. All methods start from the same challenging viewpoint with poor affordance visibility. Our affordance-driven active sampling (red) achieves the fastest improvement by prioritizing viewpoints that reveal important regions. Seque…
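
Figure 6's comparison implies a greedy selection loop. Below is a hedged sketch of affordance-driven next-view scoring, under the simplifying assumption that a view's value is the predicted affordance mass it newly reveals; `visibility_fn` and the gain rule are illustrative, not the paper's stated criterion.

```python
# Hypothetical greedy view selection: score each candidate camera by how
# much currently-predicted affordance mass it would newly reveal.
import torch

def pick_next_view(affordance_grid, visibility_fn, candidate_poses, seen_mask):
    """affordance_grid: (R, R, R) current prediction; seen_mask: (R, R, R)
    bool of voxels already observed; visibility_fn(pose) returns a bool
    (R, R, R) mask of voxels visible from that pose."""
    best_pose, best_gain = None, -1.0
    for pose in candidate_poses:
        vis = visibility_fn(pose)
        # Expected newly revealed affordance mass under the current estimate.
        gain = (affordance_grid * (vis & ~seen_mask)).sum().item()
        if gain > best_gain:
            best_pose, best_gain = pose, gain
    return best_pose
```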
Original abstract

This paper addresses the problem of affordance grounding from RGBD images of an object, which aims to localize surface regions corresponding to a text query that describes an action on the object. While existing methods predict affordance regions only on visible surfaces, we propose Affostruction, a generative framework that reconstructs complete object geometry from partial RGBD observations and grounds affordances on the full shape including unobserved regions. Our approach introduces sparse voxel fusion of multi-view features for constant-complexity generative reconstruction, a flow-based formulation that captures the inherent ambiguity of affordance distributions, and an active view selection strategy guided by predicted affordances. Affostruction outperforms existing methods by large margins on challenging benchmarks, achieving 19.1 aIoU on affordance grounding and 32.67 IoU for 3D reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Affostruction, a generative framework for 3D affordance grounding from partial RGBD observations. It reconstructs complete object geometry via sparse voxel fusion of multi-view features, models affordance distributions with a flow-based formulation to handle ambiguity, and uses active view selection guided by predicted affordances. The central claim is that this enables reliable affordance grounding on the full shape including unobserved regions, yielding large gains over prior methods (19.1 aIoU on grounding, 32.67 IoU on reconstruction).

Significance. If the reconstruction of unobserved geometry proves sufficiently accurate and the affordance predictions on those regions are validated, the work would advance 3D scene understanding for interaction tasks by moving beyond visible-surface limitations. The constant-complexity sparse fusion and flow-based ambiguity modeling are technically interesting contributions that could influence downstream robotics and AR applications.

major comments (2)
  1. [Abstract and §4] The reported 19.1 aIoU improvement is presented as evidence that affordances are successfully grounded on unobserved reconstructed geometry, yet standard benchmarks annotate only visible surfaces from the input RGBD views. No direct GT annotations, per-region (visible vs. hallucinated) breakdown, or uncertainty-aware metrics are described for the unobserved voxels, despite the reconstruction IoU of 32.67 indicating non-negligible error.
  2. [§3.1] The sparse voxel fusion is claimed to produce complete geometry suitable for downstream affordance grounding, but the manuscript provides no ablation isolating the effect of reconstruction accuracy on affordance aIoU specifically in unobserved regions, leaving the load-bearing assumption untested.
minor comments (2)
  1. [§3.2] Notation for the flow-based affordance model could be clarified with an explicit equation for the conditional density in the methods section; one generic candidate form is sketched after this list.
  2. Figure captions should explicitly label which surfaces are input-visible versus reconstructed to aid reader interpretation of qualitative results.
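
On minor comment 1: one standard explicit form the notation could adopt, assuming conditional flow matching in the style of [19, 23]; this is the generic objective, not necessarily the paper's exact parameterization. With conditioning c (fused shape features plus the text query), a ground-truth affordance map a_1, noise a_0 ~ N(0, I), and the linear path a_t = (1 - t)a_0 + t a_1:

```latex
% Generic conditional flow matching objective, after [19, 23].
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; a_0 \sim \mathcal{N}(0, I),\; a_1}
    \left\| v_\theta(a_t,\, t \mid c) - (a_1 - a_0) \right\|^2,
\qquad a_t = (1 - t)\, a_0 + t\, a_1 .
```

The conditional density is then defined implicitly: a sample a ~ p(a | c) is obtained by integrating da_t/dt = v_θ(a_t, t | c) from a_0 at t = 0 to t = 1.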

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful review and constructive criticism. The points raised about validating performance on unobserved geometry are important, and we will revise the manuscript to better address them while acknowledging the limitations of available ground-truth data.

Point-by-point responses
  1. Referee: [Abstract and §4] The reported 19.1 aIoU improvement is presented as evidence that affordances are successfully grounded on unobserved reconstructed geometry, yet standard benchmarks annotate only visible surfaces from the input RGBD views. No direct GT annotations, per-region (visible vs. hallucinated) breakdown, or uncertainty-aware metrics are described for the unobserved voxels, despite the reconstruction IoU of 32.67 indicating non-negligible error.

    Authors: We agree that the reported aIoU is computed on visible surfaces as per the benchmarks, and there are no direct GT annotations for unobserved regions. This makes it difficult to directly quantify affordance grounding accuracy on hallucinated geometry. In the revised manuscript, we will clarify this in the abstract and experiments section, add qualitative results showing affordance predictions on reconstructed unobserved surfaces, and include uncertainty-aware analysis using the flow-based model to evaluate prediction confidence in those regions. We will also discuss how the reconstruction IoU impacts the reliability of these predictions. revision: partial

  2. Referee: [§3.1] The sparse voxel fusion is claimed to produce complete geometry suitable for downstream affordance grounding, but the manuscript provides no ablation isolating the effect of reconstruction accuracy on affordance aIoU specifically in unobserved regions, leaving the load-bearing assumption untested.

    Authors: We acknowledge that an ablation specifically for unobserved regions would be ideal but is limited by the absence of GT affordance labels there. We will add an ablation study measuring the effect of reconstruction accuracy (e.g., our method vs. ground-truth geometry where possible for the full shape) on the overall affordance aIoU, and analyze the correlation with reconstruction quality to indirectly support the assumption. This will be included in §4. revision: partial

standing simulated objections (unresolved)
  • Quantitative evaluation of affordance grounding specifically on unobserved regions due to lack of ground-truth annotations in the benchmarks.

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper introduces Affostruction as a new generative framework combining sparse voxel fusion for reconstruction, flow-based affordance modeling, and active view selection. No equations, definitions, or claims in the abstract or described components reduce by construction to fitted parameters, self-referential inputs, or load-bearing self-citations. Performance metrics (aIoU, IoU) are standard benchmarks applied to the outputs rather than being redefined within the method itself. The derivation chain remains self-contained as an independent technical proposal without the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that generative models can reliably infer unobserved geometry from partial RGBD input; no explicit free parameters or new invented entities are named in the abstract.

axioms (1)
  • domain assumption: Generative reconstruction from partial multi-view RGBD observations can produce accurate geometry for unobserved regions.
    This assumption underpins the claim that affordances can be grounded on the full shape.

pith-pipeline@v0.9.0 · 5442 in / 1309 out tokens · 28140 ms · 2026-05-16T14:48:55.782690+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 6 internal anchors

  1. [1] Abo: Dataset and benchmarks for real-world 3D object understanding
    Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F. Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al. Abo: Dataset and benchmarks for real-world 3D object understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21126–21136, 2022.

  2. [2] Bundlefusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration
    Angela Dai, Matthias Nießner, Michael Zollhöfer, Shahram Izadi, and Christian Theobalt. Bundlefusion: Real-time globally consistent 3D reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (ToG), 36(4):1, 2017.

  3. [3] Objaverse-XL: A universe of 10M+ 3D objects
    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-XL: A universe of 10M+ 3D objects. Advances in Neural Information Processing Systems, 36:35799–35813, 2023.

  4. [4] 3D AffordanceNet: A benchmark for visual object affordance understanding
    Shengheng Deng, Xun Xu, Chaozheng Wu, Ke Chen, and Kui Jia. 3D AffordanceNet: A benchmark for visual object affordance understanding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1778–1787, 2021.

  5. [5] TransMVSNet: Global context-aware multi-view stereo network with transformers
    Yikang Ding, Wentao Yuan, Qingtian Zhu, Haotian Zhang, Xiangyue Liu, Yuanjiang Wang, and Xiao Liu. TransMVSNet: Global context-aware multi-view stereo network with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8585–8594, 2022.

  6. [6] 3D-FUTURE: 3D furniture shape with texture
    Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3D-FUTURE: 3D furniture shape with texture. International Journal of Computer Vision, 129(12):3313–3337, 2021.

  7. [7] Get3D: A generative model of high quality 3D textured shapes learned from images
    Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3D: A generative model of high quality 3D textured shapes learned from images. Advances in Neural Information Processing Systems, 35:31841–31854, 2022.

  8. [8] Cascade cost volume for high-resolution multi-view stereo and stereo matching
    Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2495–2504, 2020.

  9. [9] Classifier-free diffusion guidance
    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.

  10. [10] Zero-shot multi-object scene completion
    Shun Iwase, Katherine Liu, Vitor Guizilini, Adrien Gaidon, Kris Kitani, Rareş Ambruş, and Sergey Zakharov. Zero-shot multi-object scene completion. In European Conference on Computer Vision, pages 96–113. Springer, 2024.

  11. [11] Synergies between affordance and geometry: 6-DoF grasp detection via implicit representations
    Zhenyu Jiang, Yifeng Zhu, Maxwell Svetlik, Kuan Fang, and Yuke Zhu. Synergies between affordance and geometry: 6-DoF grasp detection via implicit representations. In Robotics: Science and Systems (RSS), 2021.

  12. [12] Shap-E: Generating conditional 3D implicit functions
    Heewoo Jun and Alex Nichol. Shap-E: Generating conditional 3D implicit functions. arXiv preprint arXiv:2305.02463, 2023.

  13. [13] Habitat synthetic scenes dataset (HSSD-200): An analysis of 3D scene scale and realism tradeoffs for ObjectGoal navigation
    Mukul Khanna, Yongsen Mao, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X Chang, and Manolis Savva. Habitat synthetic scenes dataset (HSSD-200): An analysis of 3D scene scale and realism tradeoffs for ObjectGoal navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern…

  14. [14] Chisel: Real time large scale 3D reconstruction onboard a mobile device using spatially hashed signed distance fields
    Matthew Klingensmith, Ivan Dryanovski, Siddhartha Srinivasa, and Jianxiong Xiao. Chisel: Real time large scale 3D reconstruction onboard a mobile device using spatially hashed signed distance fields. In Robotics: Science and Systems (RSS), 2015.

  15. [15] Affogato: Learning open-vocabulary affordance grounding with automated data generation at scale
    Junha Lee, Eunha Park, Chunghyun Park, Dahyun Kang, and Minsu Cho. Affogato: Learning open-vocabulary affordance grounding with automated data generation at scale. arXiv preprint arXiv:2506.12009, 2025.

  16. [16] Grounding image matching in 3D with MASt3R
    Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3D with MASt3R. arXiv preprint arXiv:2406.09756, 2024.

  17. [17] MVControl: Adding conditional control to multi-view diffusion for controllable text-to-3D generation
    Zhiqi Li, Yiming Chen, Lingzhe Zhao, and Peidong Liu. MVControl: Adding conditional control to multi-view diffusion for controllable text-to-3D generation. arXiv preprint arXiv:2311.14494, 2023.

  18. [18] Magic3D: High-resolution text-to-3D content creation
    Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-resolution text-to-3D content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023.

  19. [19] Flow matching for generative modeling
    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.

  20. [20] Structured 3D latents for scalable and versatile 3D generation
    Jianfeng Liu, Xiaoshui Zeng, Zeyuan Wu, Yujun Lu, Yuan Li, Ming-Hsuan Chen, and Song-Hai Zhang. Structured 3D latents for scalable and versatile 3D generation. arXiv preprint arXiv:2412.01506, 2024.

  21. [21] One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization
    Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36:22226–22246, 2023.

  22. [22] Zero-1-to-3: Zero-shot one image to 3D object
    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3D object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023.

  23. [23] Flow straight and fast: Learning to generate and transfer data with rectified flow
    Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023.

  24. [24] SyncDreamer: Generating multiview-consistent images from a single-view image
    Yuan Liu, Cheng Lin, Zijiao Zeng, Xiaoxiao Long, Lingjie Liu, Taku Komura, and Wenping Wang. SyncDreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023.

  25. [25] Decoupled weight decay regularization
    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

  26. [26] NEAT: Learning neural implicit surfaces with arbitrary topologies from multi-view images
    Xiaoxu Meng, Weikai Chen, and Bo Yang. NEAT: Learning neural implicit surfaces with arbitrary topologies from multi-view images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3034–3043, 2023.

  27. [27] V-Net: Fully convolutional neural networks for volumetric medical image segmentation
    Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571. IEEE, 2016.

  28. [28] Where2Act: From pixels to actions for articulated 3D objects
    Kaichun Mo, Leonidas J Guibas, Mustafa Mukadam, Abhinav Gupta, and Shubham Tulsiani. Where2Act: From pixels to actions for articulated 3D objects. In IEEE International Conference on Computer Vision (ICCV), pages 6813–6823, 2021.

  29. [29] KinectFusion: Real-time dense surface mapping and tracking
    Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohi, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. KinectFusion: Real-time dense surface mapping and tracking. In IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pages 127–136. IEEE, 2011.

  30. [30] Open-vocabulary affordance detection in 3D point clouds
    Toan Nguyen, Minh Nhat Vu, An Vuong, Dzung Nguyen, Thieu Vo, Ngan Le, and Anh Nguyen. Open-vocabulary affordance detection in 3D point clouds. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5692–5698. IEEE, 2023.

  31. [31] Language-conditioned affordance-pose detection in 3D point clouds
    Toan Nguyen, Minh Nhat Vu, Baoru Huang, Tuan Van Vo, Vy Truong, Ngan Le, Thieu Vo, Bac Le, and Anh Nguyen. Language-conditioned affordance-pose detection in 3D point clouds. In IEEE International Conference on Robotics and Automation (ICRA), pages 4216–4223, 2024.

  32. [32] Voxblox: Incremental 3D Euclidean signed distance fields for on-board MAV planning
    Helen Oleynikova, Zachary Taylor, Marius Fehr, Roland Siegwart, and Juan Nieto. Voxblox: Incremental 3D Euclidean signed distance fields for on-board MAV planning. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1366–1373, 2017.

  33. [33] DINOv2: Learning robust visual features without supervision
    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024.

  34. [34] AffordanceLLM: Grounding affordance from vision language models
    Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, and Li Erran Li. AffordanceLLM: Grounding affordance from vision language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 5627–5637, 2024.

  35. [35] Learning transferable visual models from natural language supervision
    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  36. [36] XCube: Large-scale 3D generative modeling using sparse voxel hierarchies
    Jiakai Ren, Zehuan Liang, Xiang Feng, Yu-Guan Hwang, Yan-Pei Chen, Zeqi Liu, Xin Zhou, Chen Cao, Pan Gao, and Tobias Ritschel. XCube: Large-scale 3D generative modeling using sparse voxel hierarchies. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7309–7318, 2024.

  37. [37] MVDream: Multi-view diffusion for 3D generation
    Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. arXiv preprint arXiv:2308.16512, 2023.

  38. [38] OVA-Fields: Weakly supervised open-vocabulary affordance fields for robot operational part detection
    Heng Su, Mengying Xie, Nieqing Cao, Yan Ding, Beichen Shao, Xianlei Long, Fuqiang Gu, and Chao Chen. OVA-Fields: Weakly supervised open-vocabulary affordance fields for robot operational part detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6385–6395, 2025.

  39. [39] NeuralRecon: Real-time coherent 3D reconstruction from monocular video
    Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, and Hujun Bao. NeuralRecon: Real-time coherent 3D reconstruction from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15598–15607, 2021.

  40. [40] LGM: Large multi-view Gaussian model for high-resolution 3D content creation
    Jiaxiang Tang, Zhaoxi Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. LGM: Large multi-view Gaussian model for high-resolution 3D content creation. In European Conference on Computer Vision (ECCV), pages 381–399, 2024.

  41. [41] Gemma 3 technical report
    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.

  42. [42] NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction
    Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. In Advances in Neural Information Processing Systems (NeurIPS), pages 27171–27183, 2021.

  43. [43] DUSt3R: Geometric 3D vision made easy
    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024.

  44. [44] AdaAfford: Learning to adapt manipulation affordance for 3D articulated objects via few-shot interactions
    Yian Wang, Ruihai Wu, Kaichun Mo, Jiaqi Ke, Qingnan Fan, Leonidas J Guibas, and Hao Dong. AdaAfford: Learning to adapt manipulation affordance for 3D articulated objects via few-shot interactions. In European Conference on Computer Vision (ECCV), pages 90–107, 2022.

  45. [45] Multiview compressive coding for 3D reconstruction
    Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, and Georgia Gkioxari. Multiview compressive coding for 3D reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9065–9075, 2023.

  46. [46] InstantMesh: Efficient 3D mesh generation from a single image with sparse-view large reconstruction models
    Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, and Ying Shan. InstantMesh: Efficient 3D mesh generation from a single image with sparse-view large reconstruction models. arXiv preprint arXiv:2404.07191, 2024.

  47. [47] DreamComposer: Controllable 3D object generation via multi-view conditions
    Yunhan Yang, Yukun Huang, Xiaoyang Wu, Yuan-Chen Guo, Song-Hai Zhang, Hengshuang Zhao, Tong He, and Xihui Liu. DreamComposer: Controllable 3D object generation via multi-view conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8111–8120, 2024.

  48. [48] MVSNet: Depth inference for unstructured multi-view stereo
    Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision (ECCV), pages 767–783, 2018.

  49. [49] Recurrent MVSNet for high-resolution multi-view stereo depth inference
    Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, and Long Quan. Recurrent MVSNet for high-resolution multi-view stereo depth inference. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5525–5534, 2019.

  50. [50] Grounding 3D object affordance with language instructions, visual observations and interactions
    He Zhu, Quyu Kong, Kechun Xu, Xunlong Xia, Bing Deng, Jieping Ye, Rong Xiong, and Yue Wang. Grounding 3D object affordance with language instructions, visual observations and interactions. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.