Native and Compact Structured Latents for 3D Generation
Pith reviewed 2026-05-21 05:15 UTC · model grok-4.3
The pith
O-Voxel encodes geometry and appearance in a sparse structure to support higher-quality 3D generation from compact latents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper shows that an omni-voxel called O-Voxel can model arbitrary topology, including open, non-manifold, and fully enclosed surfaces, while also storing comprehensive surface attributes beyond color such as physically-based rendering parameters. A Sparse Compression VAE built on this structure delivers high spatial compression and a compact latent space. Large-scale flow-matching models with 4 billion parameters trained on diverse public 3D assets then produce outputs whose geometry and material quality exceed those of existing models, all while maintaining fast inference.
What carries the argument
O-Voxel, a sparse voxel representation that jointly encodes geometry and appearance attributes for arbitrary topologies.
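One way to picture a sparse voxel structure that stores geometry and appearance together is a map from occupied cell coordinates to an attribute record. The sketch below is illustrative only: the class names, the field layout (base color, metallic, roughness), and the 64-cell resolution are assumptions made for this example, not the paper's actual O-Voxel format, which this summary does not specify.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Coord = Tuple[int, int, int]


@dataclass
class VoxelAttributes:
    # Hypothetical per-voxel payload; not the paper's actual field layout.
    base_color: Tuple[float, float, float]
    metallic: float
    roughness: float


class SparseVoxelGrid:
    """Stores only the occupied cells of a resolution^3 grid."""

    def __init__(self, resolution: int):
        self.resolution = resolution
        self.cells: Dict[Coord, VoxelAttributes] = {}

    def set(self, coord: Coord, attrs: VoxelAttributes) -> None:
        # Reject coordinates that fall outside the grid.
        if not all(0 <= c < self.resolution for c in coord):
            raise ValueError(f"coordinate {coord} is outside the grid")
        self.cells[coord] = attrs

    def occupancy_ratio(self) -> float:
        # Fraction of the dense grid that is actually stored.
        return len(self.cells) / self.resolution ** 3


# An open, one-voxel-thick sheet at z = 32: no inside or outside is needed.
grid = SparseVoxelGrid(resolution=64)
for x in range(64):
    for y in range(64):
        grid.set((x, y, 32), VoxelAttributes((0.8, 0.8, 0.8), 0.0, 0.5))
```

In this toy case the sheet fills 4,096 of 262,144 cells, so only 1/64 of the dense grid is stored, and the open surface is represented without any closed-volume assumption.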
If this is right
- Generated 3D assets exhibit geometry and material quality that exceeds existing models.
- Inference stays highly efficient even for models with 4 billion parameters.
- The representation supports open, non-manifold, and enclosed surfaces without special handling.
- Surface attributes include physically-based rendering parameters in addition to color.
Where Pith is reading between the lines
- The compact latents could be reused for downstream tasks such as 3D editing or view synthesis without retraining the generator.
- Combining this voxel structure with existing mesh or implicit surface pipelines might reduce conversion errors in production workflows.
- If the compression rate holds at larger scales, similar latent designs could apply to 4D or animated asset generation.
- Training on diverse public datasets suggests the method may generalize across asset styles without heavy curation.
Load-bearing premise
O-Voxel can robustly model arbitrary topology including open, non-manifold, and fully-enclosed surfaces while capturing comprehensive surface attributes beyond texture color.
What would settle it
A direct test on 3D models containing non-manifold junctions or open surfaces where O-Voxel either fails to encode the topology correctly or omits the physically-based rendering parameters in the output assets.
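Selecting test meshes for such a check requires knowing which inputs actually contain open boundaries or non-manifold junctions. A minimal sketch of the standard edge-counting test follows, assuming meshes are given as lists of vertex-index triples; the function name and the toy meshes are made up for illustration and are not taken from the paper. It flags non-manifold edges only, not non-manifold vertices.

```python
from collections import Counter


def classify_edges(faces):
    """Count how many triangles share each undirected edge.

    An edge used by exactly one triangle lies on an open boundary;
    an edge used by more than two triangles is a non-manifold junction.
    """
    counts = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1
    boundary = sorted(e for e, n in counts.items() if n == 1)
    non_manifold = sorted(e for e, n in counts.items() if n > 2)
    return boundary, non_manifold


# A single triangle: an open surface with three boundary edges.
open_triangle = [(0, 1, 2)]
# Three triangles fanning around edge (0, 1): a non-manifold junction.
fan = [(0, 1, 2), (0, 1, 3), (0, 1, 4)]
# A tetrahedron: closed and manifold, so nothing is flagged.
tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
```

Meshes for which either returned list is non-empty are the cases where a representation that assumes closed manifold input would be expected to struggle.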
Original abstract
Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.
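For readers unfamiliar with flow matching, the training objective can be sketched in a few lines. The version below assumes a straight-line path between a noise sample and a data sample, which is one common choice; the abstract does not state which path, conditioning scheme, or network the paper uses, so every function here is a hypothetical illustration rather than the authors' implementation. The "model" is an analytic oracle for a single data point, standing in for a trained network.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(model, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    xt = interpolate(x0, x1, t)
    pred = model(xt, t)
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def euler_sample(model, x0, steps):
    """Integrate dx/dt = model(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = model(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setup: the "dataset" is one point, so the ideal velocity field is known.
DATA_POINT = [1.0, -2.0]


def oracle_model(x, t):
    # Exact velocity field when every path ends at DATA_POINT (valid for t < 1).
    return [(d - xi) / (1.0 - t) for d, xi in zip(DATA_POINT, x)]


noise = [0.3, 0.7]
loss = flow_matching_loss(oracle_model, noise, DATA_POINT, t=0.3)
sample = euler_sample(oracle_model, noise, steps=10)
```

With the oracle field the loss is zero up to rounding and the Euler integration lands on the data point. In a real system a neural network replaces `oracle_model` and the loss is averaged over random noise samples, data samples, and times.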
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces O-Voxel, a new sparse voxel representation that encodes both geometry (arbitrary topologies including open, non-manifold, and enclosed surfaces) and comprehensive appearance attributes (including PBR parameters beyond texture color). It builds a Sparse Compression VAE for high-rate compression into compact structured latents and trains 4B-parameter flow-matching models on public 3D datasets for generation, asserting that the resulting assets exhibit geometry and material quality that far exceeds existing models while maintaining efficient inference.
Significance. If the central claims hold, the work would advance 3D generative modeling by providing a native, topology-robust representation that supports full PBR attributes and high compression, potentially enabling higher-fidelity outputs from large-scale flow models trained on diverse assets.
major comments (2)
- [Abstract] The headline claim that 'the geometry and material quality of our generated assets far exceed those of existing models' is load-bearing for the paper's contribution yet is presented without any quantitative metrics, benchmark comparisons, or ablation results on topology handling or PBR attribute fidelity; this leaves the assertion that O-Voxel plus the VAE and flow model preserve and improve these properties unverified at the point required to support the quality superiority statement.
- [Abstract] The description of O-Voxel asserts robust modeling of open, non-manifold, and fully-enclosed surfaces together with full PBR parameters (roughness, metallic, etc.), but this premise is not secured by targeted validation; if experiments are limited to closed manifold objects or texture-only appearance, the Sparse Compression VAE and downstream flow-matching results cannot be shown to deliver the claimed quality gains without topological collapse or attribute loss.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We agree that the abstract claims require stronger anchoring in quantitative evidence and targeted validation. We address each major comment below and will incorporate revisions to improve clarity and support for our assertions.
Point-by-point responses
- Referee: [Abstract] The headline claim that 'the geometry and material quality of our generated assets far exceed those of existing models' is load-bearing for the paper's contribution yet is presented without any quantitative metrics, benchmark comparisons, or ablation results on topology handling or PBR attribute fidelity; this leaves the assertion that O-Voxel plus the VAE and flow model preserve and improve these properties unverified at the point required to support the quality superiority statement.
  Authors: We acknowledge that the abstract presents a strong claim without embedding supporting metrics directly within it. The full manuscript provides quantitative evaluations, benchmark comparisons against prior 3D generation methods, and ablations on geometry and material quality in the experiments section. To address the concern, we will revise the abstract to include key quantitative results and explicit references to the relevant experimental findings, thereby better verifying that O-Voxel, the Sparse Compression VAE, and the flow-matching model preserve and enhance these properties.
  Revision: yes
- Referee: [Abstract] The description of O-Voxel asserts robust modeling of open, non-manifold, and fully-enclosed surfaces together with full PBR parameters (roughness, metallic, etc.), but this premise is not secured by targeted validation; if experiments are limited to closed manifold objects or texture-only appearance, the Sparse Compression VAE and downstream flow-matching results cannot be shown to deliver the claimed quality gains without topological collapse or attribute loss.
  Authors: O-Voxel is explicitly constructed as a sparse voxel structure that does not presuppose closed or manifold topology, enabling representation of open, non-manifold, and fully-enclosed surfaces while encoding full PBR attributes including roughness and metallic values. Our training uses diverse public 3D asset datasets containing such topological and material variations. We agree that dedicated validation would strengthen the presentation; we will add a targeted subsection with examples, visualizations, and metrics demonstrating topology robustness and PBR attribute preservation in the revised manuscript.
  Revision: yes
Circularity Check
No circularity: new representation and trained models are independent of target claims
Full rationale
The paper introduces O-Voxel as a novel sparse voxel structure for encoding geometry and PBR attributes, followed by a Sparse Compression VAE and large-scale flow-matching models trained on public 3D datasets. The central quality claims rest on empirical outputs from these trained components rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or derivation steps in the provided text reduce the asserted robustness or quality gains to inputs by construction; the approach is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Flow-matching models trained on O-Voxel latents will produce high-quality 3D generation at 4B-parameter scale.
invented entities (1)
- O-Voxel: no independent evidence
Lean theorems connected to this paper
- Foundation/DimensionForcing · dimension_forced · echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters."
- Foundation/LedgerForcing · conservation_from_balance · unclear
  UNCLEAR: Relation between the paper passage and the cited Recognition theorem.
  "the geometry and material quality of our generated assets far exceed those of existing models"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 25 Pith papers
-
Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation
Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset ...
-
On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...
-
Stream3D: Sequential Multi-View 3D Generation via Evidential Memory
Stream3D is a training-free method that maintains temporal consistency in 3D generation from monocular streams by dynamically caching a fixed number of informative historical frames using an evidence score.
-
The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting
MixCount provides a scalable synthetic dataset for mixed-object counting that improves state-of-the-art models on real benchmarks, cutting MAE by 20.14% on FSC-147 and 18.3% on PairTally.
-
Count Anything at Any Granularity
Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...
-
Velocity-Space 3D Asset Editing
VS3D performs local 3D asset editing by injecting reconstruction-anchored source signals, partial-mean guidance, and twin-agreement residuals into the velocity sampler to control edit strength and preserve identity.
-
3D Generation for Embodied AI and Robotic Simulation: A Survey
3D generation for embodied AI is shifting from visual realism toward interaction readiness, organized into data generation, simulation environments, and sim-to-real bridging roles.
-
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...
-
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
-
PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects
PhysX-Omni unifies simulation-ready 3D asset generation across rigid, deformable, and articulated objects via a new geometry representation, the PhysXVerse dataset, and the PhysX-Bench evaluation suite.
-
ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation
ROAR-3D adds a token-wise view router and dual-stream attention to pretrained single-view 3D generators so they can use arbitrary unposed images for higher-fidelity output.
-
Pixal3D: Pixel-Aligned 3D Generation from Images
Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.
-
Generative 3D Gaussians with Learned Density Control
DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.
-
DVD: Discrete Voxel Diffusion for 3D Generation and Editing
DVD treats voxel occupancy as a discrete variable in a diffusion framework to generate, assess, and edit sparse 3D voxels without continuous thresholding.
-
LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows
LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.
-
CMAG: Concept-Scaffolded Retrieval for Marketplace Avatar Generation
CMAG combines 3D concept scaffolding, prompt decomposition, taxonomy routing, hybrid retrieval, and agentic VLM verification to assemble topologically consistent avatars from catalog assets given free-form text prompts.
-
EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers
EVA01 introduces a Mixture-of-Transformers model that natively adds 3D mesh understanding, generation, and multi-turn editing to MLLMs by decoupling understanding and generation experts with shared global self-attention.
-
Pose Tracking with a Foundation Pose Model and an Ensemble Directional Kalman Filter
EnDKF combines ensemble Kalman filtering with directional statistics and unit quaternions to achieve lower pose tracking error than raw measurements in synthetic constant-velocity tests and FoundationPose-based head tracking.
-
From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation
The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.
-
Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation
Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples spa...
-
Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation
Hitem3D 2.0 combines multi-view image synthesis with native 3D texture projection to improve completeness, cross-view consistency, and geometry alignment over prior methods.
-
From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation
The paper surveys 3D content generation literature using a taxonomy of asset types and production stages to evaluate progress toward engine-ready assets.
-
Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation
Seed3D 2.0 advances 3D content generation via a coarse-to-fine geometry pipeline, unified PBR material model, and simulation-ready scene tools, reporting 69-89.9% win rates over commercial systems in human studies.
-
3D Generation for Embodied AI and Robotic Simulation: A Survey
The survey organizes 3D generation for embodied AI into data generators for assets, simulation environments for interaction, and sim-to-real bridges, noting a shift toward interaction readiness and listing bottlenecks...
-
3D Generation for Embodied AI and Robotic Simulation: A Survey
The paper surveys 3D generation techniques for embodied AI and robotics, categorizing them into data generation, simulation environments, and sim-to-real bridging while identifying bottlenecks in physical validity and...
Reference graph
Works this paper leans on
-
[1]
Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffu- sion models.arXiv preprint arXiv:2410.10733, 2024. 2, 5, 1
-
[2]
Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis
Junsong Chen, YU Jincheng, GE Chongjian, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis. InICLR, 2024. 6, 1
work page 2024
-
[3]
Dora: Sampling and benchmarking for 3d shape varia- tional auto-encoders
Rui Chen, Jianfeng Zhang, Yixun Liang, Guan Luo, Weiyu Li, Jiarui Liu, Xiu Li, Xiaoxiao Long, Jiashi Feng, and Ping Tan. Dora: Sampling and benchmarking for 3d shape varia- tional auto-encoders. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16251–16261,
-
[4]
Yiwen Chen, Tong He, Di Huang, Weicai Ye, Sijin Chen, Ji- axiang Tang, Xin Chen, Zhongang Cai, Lei Yang, Gang Yu, et al. Meshanything: Artist-created mesh generation with au- toregressive transformers.arXiv preprint arXiv:2406.10163,
-
[5]
Yiwen Chen, Zhihao Li, Yikai Wang, Hu Zhang, Qin Li, Chi Zhang, and Guosheng Lin. Ultra3d: Efficient and high- fidelity 3d generation with part attention.arXiv preprint arXiv:2507.17745, 2025. 3
-
[6]
Neural dual contouring.ACM Transactions on Graphics (TOG), 41(4):1–13, 2022
Zhiqin Chen, Andrea Tagliasacchi, Thomas Funkhouser, and Hao Zhang. Neural dual contouring.ACM Transactions on Graphics (TOG), 41(4):1–13, 2022. 3
work page 2022
-
[7]
Zhaoxi Chen, Jiaxiang Tang, Yuhao Dong, Ziang Cao, Fangzhou Hong, Yushi Lan, Tengfei Wang, Haozhe Xie, Tong Wu, Shunsuke Saito, et al. 3dtopia-xl: Scaling high- quality 3d asset generation via primitive diffusion.arXiv preprint arXiv:2409.12957, 2024. 2, 3
-
[8]
Warpconvnet: High- performance 3d deep learning library.https://github
Chris Choy and NVIDIA Research. Warpconvnet: High- performance 3d deep learning library.https://github. com/NVlabs/warpconvnet, 2025. 4
work page 2025
-
[9]
Abo: Dataset and benchmarks for real-world 3d object un- derstanding
Jasmine Collins, Shubham Goel, Kenan Deng, Achlesh- war Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al. Abo: Dataset and benchmarks for real-world 3d object un- derstanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21126– 21136, 2022. 6, 4
work page 2022
-
[10]
Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018
Blender Online Community.Blender — a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018. 6, 4
work page 2018
-
[11]
Spconv: Spatially sparse convolu- tion library.https://github.com/traveller59/ spconv, 2022
Spconv Contributors. Spconv: Spatially sparse convolu- tion library.https://github.com/traveller59/ spconv, 2022. 3, 4
work page 2022
-
[12]
Objaverse: A universe of annotated 3d objects
Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023. 3
work page 2023
-
[13]
Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects.Advances in Neural Informa- tion Processing Systems, 36, 2024. 3, 6, 4
work page 2024
-
[14]
Deformed implicit field: Modeling 3d shapes with learned dense correspon- dence
Yu Deng, Jiaolong Yang, and Xin Tong. Deformed implicit field: Modeling 3d shapes with learned dense correspon- dence. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 10286–10296,
-
[15]
Scaling recti- fied flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InICML, 2024. 2
work page 2024
-
[16]
Introducing gemini 2.5 flash image: Our state-of-the-art image model.https://developers
Alisa Fortin, Guillaume Vernade, Kat Kampf, and Am- maar Reshi. Introducing gemini 2.5 flash image: Our state-of-the-art image model.https://developers. googleblog.com/en/introducing- gemini- 2- 5-flash-image/, 2025. Google Developer Blog. 6
work page 2025
-
[17]
Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3d-future: 3d fur- niture shape with texture.International Journal of Computer Vision, pages 1–25, 2021. 4 9
work page 2021
-
[18]
Submanifold Sparse Convolutional Networks
Benjamin Graham and Laurens Van der Maaten. Sub- manifold sparse convolutional networks.arXiv preprint arXiv:1706.01307, 2017. 6
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[19]
3dgen: Triplane latent diffusion for textured mesh generation
Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Bar- las O˘guz. 3dgen: Triplane latent diffusion for textured mesh generation.arXiv preprint arXiv:2303.05371, 2023. 3
-
[20]
Gvgen: Text-to-3d generation with volumetric rep- resentation
Xianglong He, Junyi Chen, Sida Peng, Di Huang, Yang- guang Li, Xiaoshui Huang, Chun Yuan, Wanli Ouyang, and Tong He. Gvgen: Text-to-3d generation with volumetric rep- resentation. InECCV, 2024. 2
work page 2024
-
[21]
Xianglong He, Zi-Xin Zou, Chia-Hao Chen, Yuan-Chen Guo, Ding Liang, Chun Yuan, Wanli Ouyang, Yan-Pei Cao, and Yangguang Li. Sparseflex: High-resolution and arbitrary-topology 3d shape modeling.arXiv preprint arXiv:2503.21732, 2025. 2, 3, 5, 6
-
[22]
Neural wavelet-domain diffusion for 3d shape generation
Ka-Hei Hui, Ruihui Li, Jingyu Hu, and Chi-Wing Fu. Neural wavelet-domain diffusion for 3d shape generation. InSIG- GRAPH Asia 2022 Conference Papers, pages 1–9, 2022. 2
work page 2022
-
[23]
Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material
Team Hunyuan3D, Shuhui Yang, Mingxin Yang, Yifei Feng, Xin Huang, Sheng Zhang, Zebin He, Di Luo, Haolin Liu, Yunfei Zhao, et al. Hunyuan3d 2.1: From images to high- fidelity 3d assets with production-ready pbr material.arXiv preprint arXiv:2506.15442, 2025. 3, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[24]
Perceiver IO: A General Architecture for Structured Inputs & Outputs
Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Kop- pula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver io: A general architecture for structured inputs & outputs.arXiv preprint arXiv:2107.14795, 2021. 3
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[25]
Dual contouring of hermite data
Tao Ju, Frank Losasso, Scott Schaefer, and Joe Warren. Dual contouring of hermite data. InProceedings of the 29th an- nual conference on Computer graphics and interactive tech- niques, pages 339–346, 2002. 3, 4
work page 2002
-
[26]
Shap-E: Generating Conditional 3D Implicit Functions
Heewoo Jun and Alex Nichol. Shap-e: Generat- ing conditional 3d implicit functions.arXiv preprint arXiv:2305.02463, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
3d gaussian splatting for real-time radiance field rendering.ACM Trans
Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,
-
[28]
Mukul Khanna*, Yongsen Mao*, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X. Chang, and Manolis Savva. Habitat Synthetic Scenes Dataset (HSSD-200): An Analy- sis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation.arXiv preprint, 2023. 6, 4
work page 2023
-
[29]
Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation
Yushi Lan, Fangzhou Hong, Shuai Yang, Shangchen Zhou, Xuyi Meng, Bo Dai, Xingang Pan, and Chen Change Loy. Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation. InECCV, 2024. 3
work page 2024
-
[30]
Gaussiananything: Interactive point cloud latent diffu- sion for 3d generation
Yushi Lan, Shangchen Zhou, Zhaoyang Lyu, Fangzhou Hong, Shuai Yang, Bo Dai, Xingang Pan, and Chen Change Loy. Gaussiananything: Interactive point cloud latent diffu- sion for 3d generation. InICLR, 2025. 3
work page 2025
-
[31]
2025.doi:10.48550/arXiv.2405.14979
Weiyu Li, Jiarui Liu, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, and Xiaoxiao Long. Craftsman: High-fidelity mesh generation with 3d native generation and interactive geometry refiner.arXiv preprint arXiv:2405.14979, 2024. 3
-
[32]
2025.doi:10.48550/arXiv.2505.07747
Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, et al. Step1x-3d: Towards high-fidelity and con- trollable generation of textured 3d assets.arXiv preprint arXiv:2505.07747, 2025. 3, 7
-
[33]
TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models
Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, et al. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models.arXiv preprint arXiv:2502.06608, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[34]
Zhihao Li, Yufei Wang, Heliang Zheng, Yihao Luo, and Bihan Wen. Sparc3d: Sparse representation and construc- tion for high-resolution 3d shapes modeling.arXiv preprint arXiv:2505.14521, 2025. 2, 3
-
[35]
Flow matching for generative modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matthew Le. Flow matching for generative modeling. InICLR, 2023. 6, 3
work page 2023
-
[36]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InICLR, 2023. 3
work page 2023
-
[37]
Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feicht- enhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 11976–11986,
-
[38]
Decoupled Weight Decay Regularization
I Loshchilov. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 6
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[39]
Diffusion probabilistic models for 3d point cloud generation
Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2837–2845, 2021. 2
work page 2021
-
[40]
Lt3sd: Latent trees for 3d scene diffusion
Quan Meng, Lei Li, Matthias Nießner, and Angela Dai. Lt3sd: Latent trees for 3d scene diffusion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 650–660, 2025. 3
work page 2025
-
[41]
Occupancy networks: Learning 3d reconstruction in function space
Lars Mescheder, Michael Oechsle, Michael Niemeyer, Se- bastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4460–4470, 2019. 2
work page 2019
-
[42]
Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 2
work page 2021
-
[43]
Diffrf: Rendering-guided 3d radiance field diffusion
Norman M ¨uller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4328–4338, 2023. 2
work page 2023
-
[44]
Extracting triangular 3d models, materials, and lighting from images
Jacob Munkberg, Jon Hasselgren, Tianchang Shen, Jun Gao, Wenzheng Chen, Alex Evans, Thomas M¨uller, and Sanja Fi- dler. Extracting triangular 3d models, materials, and lighting from images. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8280– 8290, 2022. 6 10
work page 2022
-
[45]
Polygen: An autoregressive generative model of 3d meshes
Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter Battaglia. Polygen: An autoregressive generative model of 3d meshes. InInternational conference on machine learning, pages 7220–7229. PMLR, 2020. 2
work page 2020
-
[46]
Point-E: A System for Generating 3D Point Clouds from Complex Prompts
Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generat- ing 3d point clouds from complex prompts.arXiv preprint arXiv:2212.08751, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[47]
Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc V Gool, and Sergey Tulyakov. Au- todecoding latent 3d diffusion models.Advances in Neural Information Processing Systems, 36:67021–67047, 2023. 3
work page 2023
-
[48]
Deepsdf: Learning con- tinuous signed distance functions for shape representation
Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning con- tinuous signed distance functions for shape representation. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 165–174, 2019. 2
work page 2019
-
[49]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 4195–4205,
-
[50]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 7, 6
work page 2021
-
[51]
Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4209–4219, 2024. 3, 5
-
[52]
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3, 5
-
[53]
Tianchang Shen, Jacob Munkberg, Jon Hasselgren, Kangxue Yin, Zian Wang, Wenzheng Chen, Zan Gojcic, Sanja Fidler, Nicholas Sharp, and Jun Gao. Flexible isosurface extraction for gradient-based mesh optimization. ACM Trans. Graph., 42(4), 2023. 2, 4
-
[54]
Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025. 6
-
[55]
Sketchfab. Sketchfab - the best 3d viewer on the web. https://sketchfab.com/, 2025. 6, 4
-
[56]
Stefan Stojanov, Anh Thai, and James M Rehg. Using shape to categorize: Low-shot learning with an explicit shape bias. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1798–1808, 2021. 6, 4
-
[57]
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063,
-
[58]
Haotian Tang, Shang Yang, Zhijian Liu, Ke Hong, Zhongming Yu, Xiuyu Li, Guohao Dai, Yu Wang, and Song Han. Torchsparse++: Efficient training and inference framework for sparse convolution on gpus. In IEEE/ACM International Symposium on Microarchitecture (MICRO), 2023. 4
-
[59]
Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, and Baining Guo. Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder. arXiv preprint arXiv:2312.11459, 2023. 2
-
[60]
Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019. 6, 3
-
[61]
Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, Karsten Kreis, et al. Lion: Latent point diffusion models for 3d shape generation. Advances in Neural Information Processing Systems, 35:10021–10039, 2022. 3
-
[62]
Francis Williams, Jiahui Huang, Jonathan Swartz, Gergely Klar, Vijay Thakkar, Matthew Cong, Xuanchi Ren, Ruilong Li, Clement Fuji-Tsang, Sanja Fidler, et al. fvdb: A deep-learning framework for sparse, large scale, and high performance spatial intelligence. ACM Transactions on Graphics (TOG), 43(4):1–15, 2024. 4
-
[63]
Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, and Yao Yao. Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer. arXiv preprint arXiv:2405.14832, 2024. 3
-
[64]
Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Yikang Yang, Yajie Bao, Jiachen Qian, Siyu Zhu, Xun Cao, Philip Torr, et al. Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention. arXiv preprint arXiv:2505.17412, 2025. 2, 3, 6, 7
-
[65]
Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21469–21480, 2025. 2, 3, 5, 6, 7, 4
-
[66]
Bojun Xiong, Si-Tong Wei, Xin-Yang Zheng, Yan-Pei Cao, Zhouhui Lian, and Peng-Shuai Wang. Octfusion: Octree-based diffusion models for 3d shape generation. In Computer Graphics Forum, page e70198. Wiley Online Library, 2025. 3
-
[67]
Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, et al. Ulip-2: Towards scalable multimodal pre-training for 3d understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27091–27101, 2024. 7, 6
-
[68]
Haitao Yang, Yuan Dong, Hanwen Jiang, Dejia Xu, Georgios Pavlakos, and Qixing Huang. Atlas gaussians diffusion for 3d generation with infinite number of points. arXiv preprint arXiv:2408.13055, 2024. 3
-
[69]
Jiayu Yang, Taizhang Shang, Weixuan Sun, Xibin Song, Ziang Cheng, Senbo Wang, Shenzhou Chen, Weizhe Liu, Hongdong Li, and Pan Ji. Pandora3d: A comprehensive framework for high-quality 3d shape and texture generation. arXiv preprint arXiv:2502.14247, 2025. 3
-
[70]
Chongjie Ye, Yushuang Wu, Ziteng Lu, Jiahao Chang, Xiaoyang Guo, Jiaqing Zhou, Hao Zhao, and Xiaoguang Han. Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging. arXiv preprint arXiv:2503.22236, 3:2,
-
[71]
Xin Yu, Ze Yuan, Yuan-Chen Guo, Ying-Tian Liu, Jianhui Liu, Yangguang Li, Yan-Pei Cao, Ding Liang, and Xiaojuan Qi. Texgen: a generative diffusion model for mesh textures. ACM Transactions on Graphics (TOG), 43(6):1–14, 2024. 7
-
[72]
Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19447–19456,
-
[73]
Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019. 2
-
[74]
Biao Zhang, Jiapeng Tang, Matthias Niessner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–16, 2023. 3
-
[75]
Bowen Zhang, Yiji Cheng, Jiaolong Yang, Chunyu Wang, Feng Zhao, Yansong Tang, Dong Chen, and Baining Guo. Gaussiancube: Structuring gaussian splatting using optimal transport for 3d generative modeling. arXiv preprint arXiv:2403.19655, 2024. 2
-
[76]
Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets. ACM Transactions on Graphics (TOG), 43(4):1–20, 2024. 2, 3
-
[77]
Yibo Zhang, Li Zhang, Rui Ma, and Nan Cao. Texverse: A universe of 3d objects with high-resolution textures. arXiv preprint arXiv:2508.10868, 2025. 3, 6, 4
-
[78]
Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, and Shenghua Gao. Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation. Advances in Neural Information Processing Systems, 36, 2024. 3
-
[79]
Xin-Yang Zheng, Hao Pan, Peng-Shuai Wang, Xin Tong, Yang Liu, and Heung-Yeung Shum. Locally attentional sdf diffusion for controllable 3d shape generation. ACM Transactions on Graphics (SIGGRAPH), 42(4), 2023. 2
-
[80]
Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3d: Exploring unified 3d representation at scale. arXiv preprint arXiv:2310.06773, 2023. doi: 10.48550/arXiv.2310.06773