Recognition: 2 theorem links · Lean Theorem
Structured 3D Latents for Scalable and Versatile 3D Generation
Pith reviewed 2026-05-16 15:05 UTC · model grok-4.3
The pith
A structured latent that merges sparse 3D grids with dense multiview features supports high-quality generation of 3D assets in multiple output formats from text or image input.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Integrating a sparsely-populated 3D grid with dense multiview visual features from a vision foundation model produces a flexible latent representation that captures both structural geometry and textural appearance, allowing a single trained generator to decode into radiance fields, 3D Gaussians, or meshes while scaling effectively to models of up to two billion parameters.
What carries the argument
The Structured LATent (SLAT) representation, formed by fusing a sparse 3D grid with dense multiview features extracted from a vision foundation model to jointly encode geometry and appearance while preserving decoding flexibility.
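To make the structure described above concrete, here is a minimal sketch of a SLAT-style latent in NumPy: a sparse set of active voxel coordinates, each carrying a dense feature vector. The grid resolution, feature dimension, and random data below are placeholders for illustration, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical illustration of a SLAT-style latent: a sparse set of active
# voxel coordinates, each paired with a dense feature vector (conceptually
# aggregated from multiview image features). All sizes are assumptions.
GRID = 64        # resolution of the sparse 3D grid (placeholder)
FEAT_DIM = 8     # per-voxel feature dimension (placeholder)

rng = np.random.default_rng(0)

# Only a small fraction of voxels are "active" (near the object surface).
num_active = 500
coords = rng.integers(0, GRID, size=(num_active, 3))    # (N, 3) voxel indices
features = rng.standard_normal((num_active, FEAT_DIM))  # (N, D) appearance features

# The latent is the pair (coords, features): structure lives in which voxels
# are active, appearance lives in the attached feature vectors.
def latent_size(coords, features):
    """Storage cost of the sparse latent vs. a dense grid at the same resolution."""
    sparse = coords.size + features.size
    dense = GRID ** 3 * features.shape[1]
    return sparse, dense

sparse, dense = latent_size(coords, features)
print(f"sparse latent entries: {sparse}, dense grid entries: {dense}")
```

The point of the sketch is the asymmetry it exposes: the sparse pair stores orders of magnitude fewer numbers than a dense grid at the same resolution, which is what makes decoding the same latent into several output formats tractable.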
If this is right
- A single generator can output radiance fields, 3D Gaussians, or meshes on demand after training.
- Local editing of generated 3D assets becomes feasible without retraining the model.
- Training remains stable at model scales up to two billion parameters on a 500K-object dataset.
- Performance surpasses prior methods at similar model sizes under both text and image conditioning.
Where Pith is reading between the lines
- The separation of sparse structural encoding from dense appearance features may simplify downstream tasks such as material editing or animation transfer.
- Because the latent supports multiple decoders, the same generator could be paired with new output representations developed after training.
- Scaling laws observed for the rectified-flow transformers on this latent may guide further increases in model size for even higher fidelity.
Load-bearing premise
Combining the sparse 3D grid with dense multiview features from a foundation model is sufficient to capture both geometry and appearance without restricting the range of possible output formats.
What would settle it
A controlled benchmark in which the SLAT model produces visibly lower-quality or less consistent 3D assets than recent comparable-scale baselines when evaluated on identical text-to-3D and image-to-3D prompts.
read the original abstract
We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
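The abstract's "rectified flow transformers" refer to a flow-matching objective that regresses a velocity field along straight paths between noise and data. The toy training step below sketches that objective under stated assumptions: the "network" is a placeholder linear map and the vectors are random stand-ins for latents, not the paper's architecture.

```python
import numpy as np

# A minimal, generic rectified-flow training step on toy data. This is NOT
# the paper's model; it only illustrates the shape of the objective.
rng = np.random.default_rng(0)
D = 16                                   # latent dimension (placeholder)
W = np.zeros((D + 1, D))                 # toy "velocity network": linear in (x_t, t)

def velocity(x_t, t):
    inp = np.concatenate([x_t, [t]])
    return inp @ W

x1 = rng.standard_normal(D)              # a data sample (e.g. a flattened latent)
x0 = rng.standard_normal(D)              # a noise sample
t = rng.uniform()                        # random time in [0, 1]

# Rectified flow interpolates linearly between noise and data...
x_t = (1.0 - t) * x0 + t * x1
# ...and regresses the network onto the straight-line velocity x1 - x0.
target = x1 - x0
loss = np.mean((velocity(x_t, t) - target) ** 2)
print(f"flow-matching loss: {loss:.4f}")
```

In practice the linear map would be a transformer over the sparse latent tokens and the loss would be averaged over a batch; the straight-line target is what distinguishes rectified flow from curved diffusion schedules.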
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Structured LATent (SLAT), a unified 3D representation formed by integrating a sparsely-populated 3D grid with dense multiview visual features from a vision foundation model. This latent supports decoding into multiple output formats including Radiance Fields, 3D Gaussians, and meshes. The authors train rectified flow transformers (up to 2B parameters) on a 500K-object dataset and claim that the resulting models produce high-quality 3D assets from text or image conditions, significantly outperforming prior methods at comparable scales while also enabling flexible format selection and local editing.
Significance. If the empirical claims are substantiated, the work would represent a meaningful advance in scalable 3D generation by offering a single latent that preserves both geometry and appearance while supporting multiple downstream decoders. The scale of training (2B parameters on 500K assets) and the promised public release of code, models, and data would further increase its utility for the community.
major comments (2)
- [Abstract] The central claim that the model is 'significantly surpassing existing methods, including recent ones at similar scales' is presented without any quantitative metrics (e.g., FID, PSNR, Chamfer distance), ablation tables, or comparative results. This absence makes the superiority assertion impossible to evaluate, and the claim is load-bearing for the paper's main contribution.
- [Abstract] The assertion that the SLAT representation is 'comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding' lacks any supporting analysis, reconstruction-error bounds, or ablation showing that the sparse-grid + multiview-feature combination is information-complete for all three target decoders. If the sparse grid under-samples fine geometry, or the vision features misalign with 3D structure, the claimed decoding versatility cannot hold.
minor comments (1)
- The acronym SLAT is defined on first use, but subsequent references to 'SLAT' would benefit from a brief reminder of its components when the architecture is first detailed.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting areas where the abstract could more clearly substantiate our claims. We address each major comment below with references to the full manuscript and commit to targeted revisions.
read point-by-point responses
-
Referee: [Abstract] The central claim that the model is 'significantly surpassing existing methods, including recent ones at similar scales' is presented without any quantitative metrics (e.g., FID, PSNR, Chamfer distance), ablation tables, or comparative results. This absence makes the superiority assertion impossible to evaluate, and the claim is load-bearing for the paper's main contribution.
Authors: We agree the abstract would be stronger with explicit metrics. Section 4 of the manuscript reports quantitative comparisons on standard benchmarks, including FID for generation quality, PSNR for novel-view rendering, and Chamfer distance for geometry accuracy, showing consistent gains over prior methods at comparable model scales (e.g., 1-2B parameters). We will revise the abstract to include concise references to these key metrics and point readers to the corresponding tables and figures. revision: yes
-
Referee: [Abstract] The assertion that the SLAT representation is 'comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding' lacks any supporting analysis, reconstruction-error bounds, or ablation showing that the sparse-grid + multiview-feature combination is information-complete for all three target decoders. If the sparse grid under-samples fine geometry, or the vision features misalign with 3D structure, the claimed decoding versatility cannot hold.
Authors: Sections 3.2 and 5.1 present reconstruction experiments and ablations that quantify geometry and appearance fidelity when decoding SLAT to Radiance Fields, 3D Gaussians, and meshes. These include per-decoder error metrics and ablation studies on grid sparsity and multiview feature alignment, demonstrating that the hybrid representation preserves the necessary information for high-quality outputs across formats. Formal information-theoretic bounds are not derived, as they are intractable for this learned hybrid latent; we will add a brief discussion of this point and of potential edge cases in sparse sampling. revision: partial
Circularity Check
No circularity: empirical representation and training on external data
full rationale
The paper introduces SLAT as a novel latent representation defined by the explicit integration of a sparse 3D grid and dense multiview features from an external vision foundation model. This is presented as an architectural choice trained end-to-end on a 500K-object dataset, with performance claims resting on empirical results rather than any derivation that reduces to fitted parameters, self-referential definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked that collapse the central claims back to the inputs by construction. The approach is self-contained against external benchmarks and does not rely on load-bearing self-citations for its core premise.
Axiom & Free-Parameter Ledger
invented entities (1)
-
Structured LATent (SLAT)
no independent evidence
Lean theorems connected to this paper
-
Foundation.DimensionForcing.alexander_duality_circle_linking · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats... by integrating a sparsely-populated 3D grid with dense multiview visual features... comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
-
Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch
A conditional diffusion model using proprioception and multi-contact touch produces metric-scale, physically consistent 3D object reconstructions under hand occlusion.
-
SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation
SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.
-
ATATA: One Algorithm to Align Them All
ATATA enables fast joint inference of structurally aligned pairs using Rectified Flow models via segment transport, improving state-of-the-art for image and video generation while matching 3D quality at much higher speed.
-
Affostruction: 3D Affordance Grounding with Generative Reconstruction
Affostruction reconstructs full 3D object geometry from partial RGBD views and grounds text-based affordances on both visible and unobserved surfaces, reporting large gains over prior methods.
-
PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World
PhysForge generates physics-grounded 3D assets via a VLM-planned Hierarchical Physical Blueprint and a KineVoxel Injection diffusion model, backed by the new PhysDB dataset of 150,000 annotated assets.
-
Velox: Learning Representations of 4D Geometry and Appearance
Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...
-
3D-ReGen: A Unified 3D Geometry Regeneration Framework
3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.
-
REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement
REVIVE 3D generates voluminous 3D assets from flat 2D images via an inflated prior construction followed by latent-space refinement, plus new metrics for volume and flatness validated by user study.
-
Pair2Scene: Learning Local Object Relations for Procedural Scene Generation
Pair2Scene generates complex 3D scenes beyond training data by recursively applying a learned model of local support and functional object-pair relations inside hierarchies, using collision-aware rejection sampling fo...
-
Pair2Scene: Learning Local Object Relations for Procedural Scene Generation
Pair2Scene generates complex 3D scenes beyond training data by training a network on local object-pair placement rules and applying them recursively with collision-aware sampling.
-
WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations
WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match tele...
-
ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment
ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
-
UniRecGen: Unifying Multi-View 3D Reconstruction and Generation
UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.
-
MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation
MV-SAM3D adds multi-view fusion via multi-diffusion with attention-entropy and visibility weighting plus physics-aware optimization to improve fidelity and physical plausibility in layout-aware 3D generation.
-
Depth Anything 3: Recovering the Visual Space from Any Views
DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
-
Syn4D: A Multiview Synthetic 4D Dataset
Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.
-
Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation
Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples spa...
-
Hunyuan3D 2.5: Towards High-Fidelity 3D Assets Generation with Ultimate Details
Hunyuan3D 2.5's LATTICE model with 10B parameters generates detailed 3D shapes from images and uses multi-view PBR for textures, outperforming prior methods in fidelity and mesh quality.
Reference graph
Works this paper leans on
-
[1]
Gpt-4o system card. 2024. 6, 16
work page 2024
-
[2]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Building normalizing flows with stochastic interpolants
Michael Samuel Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In ICLR, 2023.
-
[4]
Improving image generation with better captions
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023. 6
work page 2023
-
[5]
Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. In International Conference on Learning Representations, 2018. 8, 18
work page 2018
-
[6]
Efficient geometry-aware 3d generative adversarial networks
Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In IEEE/CVF International Conference on Computer Vision, 2022. 3
work page 2022
-
[7]
Tensorf: Tensorial radiance fields
Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European conference on computer vision, pages 333–350. Springer, 2022.
-
[8]
Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction
Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Hao Su. Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2416–2425, 2023. 3
work page 2023
-
[9]
Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis
Junsong Chen, YU Jincheng, GE Chongjian, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In ICLR, 2024. 2
work page 2024
-
[10]
Meshanything: Artist-created mesh generation with autoregressive transformers
Yiwen Chen, Tong He, Di Huang, Weicai Ye, Sijin Chen, Jiaxiang Tang, Xin Chen, Zhongang Cai, Lei Yang, Gang Yu, et al. Meshanything: Artist-created mesh generation with autoregressive transformers. arXiv preprint arXiv:2406.10163, 2024. 3
-
[11]
3dtopia-xl: Scaling high-quality 3d asset generation via primitive diffusion
Zhaoxi Chen, Jiaxiang Tang, Yuhao Dong, Ziang Cao, Fangzhou Hong, Yushi Lan, Tengfei Wang, Haozhe Xie, Tong Wu, Shunsuke Saito, et al. 3dtopia-xl: Scaling high-quality 3d asset generation via primitive diffusion. arXiv preprint arXiv:2409.12957, 2024. 3, 6, 17
-
[12]
Sdfusion: Multimodal 3d shape completion, reconstruction, and generation
Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, Alexander G Schwing, and Liang-Yan Gui. Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4456–4465, 2023. 2
work page 2023
-
[13]
Abo: Dataset and benchmarks for real-world 3d object understanding
Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al. Abo: Dataset and benchmarks for real-world 3d object understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21126–21136, 2022. 6, 16
work page 2022
-
[14]
Flashattention-2: Faster attention with better parallelism and work partitioning
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In ICLR, 2024. 14
work page 2024
-
[15]
Objaverse: A universe of annotated 3d objects
Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023. 16
work page 2023
-
[16]
Objaverse-xl: A universe of 10m+ 3d objects
Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems, 36, 2024. 6, 16
work page 2024
-
[17]
Gram: Generative radiance manifolds for 3d-aware image generation
Yu Deng, Jiaolong Yang, Jianfeng Xiang, and Xin Tong. Gram: Generative radiance manifolds for 3d-aware image generation. In IEEE/CVF International Conference on Computer Vision, 2022. 3
work page 2022
-
[18]
Probing the 3d awareness of visual foundation models
Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Probing the 3d awareness of visual foundation models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21795–21806, 2024. 2
work page 2024
-
[19]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024. 2, 3, 14
work page 2024
-
[20]
3d-future: 3d furniture shape with texture
Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3d-future: 3d furniture shape with texture. International Journal of Computer Vision, pages 1–25, 2021. 6, 16
work page 2021
-
[21]
Get3d: A generative model of high quality 3d textured shapes learned from images
Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. Advances In Neural Information Processing Systems, 35:31841–31854, 2022. 3
work page 2022
-
[22]
Strivec: Sparse tri-vector radiance fields
Quankai Gao, Qiangeng Xu, Hao Su, Ulrich Neumann, and Zexiang Xu. Strivec: Sparse tri-vector radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17569–17579, 2023. 2, 4
work page 2023
-
[23]
Visual fact checker: Enabling high-fidelity detailed caption generation
Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, and Yin Cui. Visual fact checker: Enabling high-fidelity detailed caption generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14033–14042, 2024. 16, 17
work page 2024
-
[24]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014. 2
work page 2014
-
[25]
3dgen: Triplane latent diffusion for textured mesh generation
Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 3dgen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371, 2023. 2, 3
-
[26]
Gvgen: Text-to-3d generation with volumetric representation
Xianglong He, Junyi Chen, Sida Peng, Di Huang, Yangguang Li, Xiaoshui Huang, Chun Yuan, Wanli Ouyang, and Tong He. Gvgen: Text-to-3d generation with volumetric representation. In ECCV, 2024. 3
work page 2024
-
[27]
Gans trained by a two time-scale update rule converge to a local nash equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. 8, 18
work page 2017
-
[28]
Classifier-free diffusion guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021. 6
work page 2021
-
[29]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020. 3
work page 2020
-
[30]
Lrm: Large reconstruction model for single image to 3d
Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. In ICLR, 2024. 3
work page 2024
-
[31]
Neural wavelet-domain diffusion for 3d shape generation
Ka-Hei Hui, Ruihui Li, Jingyu Hu, and Chi-Wing Fu. Neural wavelet-domain diffusion for 3d shape generation. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.
work page 2022
-
[32]
Shap-E: Generating Conditional 3D Implicit Functions
Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023. 3, 7
work page internal anchor Pith review arXiv 2023
-
[33]
3d gaussian splatting for real-time radiance field rendering
Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4), 2023.
-
[34]
Mukul Khanna*, Yongsen Mao*, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X. Chang, and Manolis Savva. Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for Object-Goal Navigation. arXiv preprint, 2023. 6, 16
work page 2023
-
[35]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 4015–4026, 2023. 19
work page 2023
-
[36]
Modular primitives for high-performance differentiable rendering
Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular primitives for high-performance differentiable rendering. ACM Transactions on Graphics (ToG), 39(6):1–14, 2020. 15
work page 2020
-
[37]
Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation
Yushi Lan, Fangzhou Hong, Shuai Yang, Shangchen Zhou, Xuyi Meng, Bo Dai, Xingang Pan, and Chen Change Loy. Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation. In ECCV, 2024. 2, 3, 7
work page 2024
-
[38]
xformers: A modular and hackable transformer modelling library
Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, Luca Wehrstedt, Jeremy Reizenstein, and Grigory Sizov. xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers, 2022. 14
work page 2022
-
[39]
Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model
Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, and Sai Bi. Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model. In ICLR, 2024. 3
work page 2024
-
[40]
Craftsman: High-fidelity mesh generation with 3d native generation and interactive geometry refiner
Weiyu Li, Jiarui Liu, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, and Xiaoxiao Long. Craftsman: High-fidelity mesh generation with 3d native generation and interactive geometry refiner. arXiv preprint arXiv:2405.14979, 2024. 3
-
[41]
Generalized deep 3d shape prior via part-discretized diffusion process
Yuhan Li, Yishun Dou, Xuanhong Chen, Bingbing Ni, Yilin Sun, Yutian Liu, and Fuzhen Wang. Generalized deep 3d shape prior via part-discretized diffusion process. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16784–16794, 2023. 2
work page 2023
-
[42]
Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching
Yixun Liang, Xin Yang, Jiantao Lin, Haodong Li, Xiaogang Xu, and Yingcong Chen. Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6517–6526, 2024. 3
work page 2024
-
[43]
Magic3d: High-resolution text-to-3d content creation
Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300–309, 2023. 3
work page 2023
-
[44]
Flow matching for generative modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023. 3, 5
work page 2023
-
[45]
Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. Advances in Neural Information Processing Systems, 33:15651–15663, 2020.
-
[46]
One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion
Minghua Liu, Ruoxi Shi, Linghao Chen, Zhuoyang Zhang, Chao Xu, Xinyue Wei, Hansheng Chen, Chong Zeng, Jiayuan Gu, and Hao Su. One-2-3-45++: Fast single image to 3d objects with consistent multi-view generation and 3d diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10072–10083, 2024. 3
work page 2024
-
[47]
Meshformer: High-quality mesh generation with 3d-guided reconstruction model
Minghua Liu, Chong Zeng, Xinyue Wei, Ruoxi Shi, Linghao Chen, Chao Xu, Mengqi Zhang, Zhaoning Wang, Xiaoshuai Zhang, Isabella Liu, et al. Meshformer: High-quality mesh generation with 3d-guided reconstruction model. arXiv preprint arXiv:2408.10198, 2024
-
[48]
Zero-1-to-3: Zero-shot one image to 3d object
Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023. 2, 3
work page 2023
-
[49]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023. 3
work page 2023
-
[50]
Swin transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 4
work page 2021
-
[51]
Mixture of volumetric primitives for efficient neural rendering
Stephen Lombardi, Tomas Simon, Gabriel Schwartz, Michael Zollhoefer, Yaser Sheikh, and Jason Saragih. Mixture of volumetric primitives for efficient neural rendering. ACM Transactions on Graphics (ToG), 40(4):1–13, 2021. 15
work page 2021
-
[52]
Wonder3d: Single image to 3d using cross-domain diffusion
Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9970–9980, 2024. 3
work page 2024
-
[53]
Decoupled Weight Decay Regularization
I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 6
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[54]
Scaffold-gs: Structured 3d gaussians for view-adaptive rendering
Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20654–20664, 2024. 2
-
[55]
Repaint: Inpainting using denoising diffusion probabilistic models
Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11461–11471, 2022. 5
-
[56]
Diffusion probabilistic models for 3d point cloud generation
Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2837–2845, 2021. 3
-
[57]
Scalable 3d captioning with pretrained models
Tiange Luo, Chris Rockwell, Honglak Lee, and Justin Johnson. Scalable 3d captioning with pretrained models. Advances in Neural Information Processing Systems, 36, 2023.
-
[58]
Occupancy networks: Learning 3d reconstruction in function space
Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4460–4470, 2019. 2
-
[59]
Nerf: Representing scenes as neural radiance fields for view synthesis
Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021. 2, 15
-
[60]
V-net: Fully convolutional neural networks for volumetric medical image segmentation
Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV), pages 565–571. IEEE, 2016. 15
-
[61]
Diffrf: Rendering-guided 3d radiance field diffusion
Norman Müller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4328–4338, 2023. 3
-
[62]
Polygen: An autoregressive generative model of 3d meshes
Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter Battaglia. Polygen: An autoregressive generative model of 3d meshes. In International conference on machine learning, pages 7220–7229. PMLR, 2020. 3
-
[63]
Point-E: A System for Generating 3D Point Clouds from Complex Prompts
Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022. 3, 18
-
[64]
Autodecoding latent 3d diffusion models
Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc Van Gool, and Sergey Tulyakov. Autodecoding latent 3d diffusion models. Advances in Neural Information Processing Systems, 36:67021–67047, 2023. 3
-
[65]
Dinov2: Learning robust visual features without supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Pa...
-
[66]
Deepsdf: Learning continuous signed distance functions for shape representation
Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 165–174, 2019. 2
-
[67]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023. 5, 8
-
[68]
Dreamfusion: Text-to-3d using 2d diffusion
Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In ICLR, 2023.
-
[69]
Pointnet++: Deep hierarchical feature learning on point sets in a metric space
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in Neural Information Processing Systems, 30, 2017. 8, 18
-
[70]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
-
[71]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 5, 8, 18
-
[72]
Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies
Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4209–4219, 2024. 2, 3
-
[73]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 2, 3, 4, 14
-
[74]
Flexible isosurface extraction for gradient-based mesh optimization
Tianchang Shen, Jacob Munkberg, Jon Hasselgren, Kangxue Yin, Zian Wang, Wenzheng Chen, Zan Gojcic, Sanja Fidler, Nicholas Sharp, and Jun Gao. Flexible isosurface extraction for gradient-based mesh optimization. ACM Trans. Graph., 42(4), 2023. 2, 5, 15
-
[75]
Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network
Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016. 14
-
[76]
Mvdream: Multi-view diffusion for 3d generation
Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. In ICLR, 2024. 3
-
[77]
3d neural field generation using triplane diffusion
J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field generation using triplane diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20875–20886, 2023. 3
-
[78]
3d generation on imagenet
Ivan Skorokhodov, Aliaksandr Siarohin, Yinghao Xu, Jian Ren, Hsin-Ying Lee, Peter Wonka, and Sergey Tulyakov. 3d generation on imagenet. arXiv preprint arXiv:2303.01416, 2023.
-
[79]
Deep unsupervised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
-
[80]
Using shape to categorize: Low-shot learning with an explicit shape bias
Stefan Stojanov, Anh Thai, and James M Rehg. Using shape to categorize: Low-shot learning with an explicit shape bias. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1798–1808, 2021.