arxiv: 2512.14692 · v1 · pith:MXB5A2GNnew · submitted 2025-12-16 · 💻 cs.CV · cs.AI

Native and Compact Structured Latents for 3D Generation

Pith reviewed 2026-05-21 05:15 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords 3D generative modeling · structured latent representation · O-Voxel · sparse voxel · flow-matching · Sparse Compression VAE · 3D assets · physically-based rendering

The pith

O-Voxel encodes geometry and appearance in a sparse structure to support higher-quality 3D generation from compact latents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a structured latent representation learned directly from native 3D data to overcome limits in capturing complex topologies and detailed appearance. Its central mechanism is O-Voxel, a sparse voxel form that records both shape and surface properties such as physically-based rendering parameters. A Sparse Compression VAE built on O-Voxel compresses the data into a compact latent space, on which 4-billion-parameter flow-matching models are then trained using public datasets. The paper reports generated assets whose geometry and material fidelity surpass prior models while inference stays efficient. A reader would care because current 3D generators still falter on open surfaces, non-manifold shapes, and realistic materials, which restricts their use in graphics and simulation.
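
To make the idea of spatial compression on a sparse grid concrete, here is a minimal sketch. It is not the paper's Sparse Compression VAE: the function name `downsample_coords` and the `factor` parameter are hypothetical, and only the coordinate bookkeeping is shown, not any learned features. It illustrates how grouping fine voxels into coarser cells reduces the number of occupied cells a generator has to handle.

```python
def downsample_coords(coords, factor):
    """Map fine voxel coordinates to the coarse cells that contain them.

    `factor` is a hypothetical per-axis downsampling factor, not a value
    taken from the paper.
    """
    return {tuple(c // factor for c in coord) for coord in coords}


# A flat 16 x 16 sheet of occupied voxels (an open surface) in a 16^3 grid.
sheet = {(i, j, 0) for i in range(16) for j in range(16)}

coarse = downsample_coords(sheet, factor=4)

# Ratio of occupied cells before and after downsampling.
token_ratio = len(sheet) / len(coarse)
print(len(sheet), len(coarse), token_ratio)
```

For a surface-like set of voxels the occupied-cell count shrinks roughly with the square of the factor rather than the cube, because only cells near the surface are stored in the first place.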

Core claim

The paper shows that an omni-voxel called O-Voxel can model arbitrary topology, including open, non-manifold, and fully enclosed surfaces, while also storing comprehensive surface attributes beyond color such as physically-based rendering parameters. A Sparse Compression VAE built on this structure delivers high spatial compression and a compact latent space. Large-scale flow-matching models with 4 billion parameters trained on diverse public 3D assets then produce outputs whose geometry and material quality exceed those of existing models, all while maintaining fast inference.
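
The summary above does not say which flow-matching formulation the paper uses. As an illustration only, the sketch below assumes the common straight-line path between a noise sample and a data sample, with a toy closed-form velocity field standing in for a trained network. All function names are hypothetical.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity at a random time."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    t = rng.random()  # uniform in [0, 1)
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)


def euler_sample(predict_velocity, x0, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setting: the "dataset" is a single latent vector, so the ideal
# velocity field has a closed form. A real model would be a neural network.
DATA = [1.0, -2.0]


def ideal_velocity(x, t):
    return [(d - xi) / (1.0 - t) for d, xi in zip(DATA, x)]


rng = random.Random(0)
loss = flow_matching_loss(ideal_velocity, DATA, rng)
sample = euler_sample(ideal_velocity, [0.3, 0.7], steps=8)
```

With the ideal field the loss is zero up to rounding, and Euler integration lands on the data point because the path is a straight line. A trained model only approximates this behaviour.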

What carries the argument

O-Voxel, a sparse voxel representation that jointly encodes geometry and appearance attributes for arbitrary topologies.
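
The page does not give the actual O-Voxel layout, so the following is only a sketch of the general idea of a sparse voxel structure that stores geometry and appearance together. The `SurfaceSample` fields, the averaging rule, and the function name are illustrative assumptions, not the paper's definition.

```python
from dataclasses import dataclass


@dataclass
class SurfaceSample:
    """One surface point with appearance attributes (hypothetical layout)."""
    position: tuple    # (x, y, z) inside the unit cube [0, 1)
    base_color: tuple  # (r, g, b)
    roughness: float
    metallic: float


def build_sparse_voxels(samples, resolution):
    """Bucket samples into occupied cells and average their attributes.

    Only cells that contain at least one sample are stored, so empty space
    costs nothing. Returns {(i, j, k): attribute dict}.
    """
    buckets = {}
    for s in samples:
        key = tuple(min(int(c * resolution), resolution - 1) for c in s.position)
        buckets.setdefault(key, []).append(s)

    voxels = {}
    for key, group in buckets.items():
        n = len(group)
        voxels[key] = {
            "base_color": tuple(sum(g.base_color[a] for g in group) / n for a in range(3)),
            "roughness": sum(g.roughness for g in group) / n,
            "metallic": sum(g.metallic for g in group) / n,
            "count": n,
        }
    return voxels


samples = [
    SurfaceSample((0.10, 0.10, 0.10), (1.0, 0.0, 0.0), 0.2, 0.0),
    SurfaceSample((0.12, 0.11, 0.10), (0.0, 0.0, 1.0), 0.4, 1.0),
    SurfaceSample((0.90, 0.90, 0.90), (0.0, 1.0, 0.0), 0.8, 0.0),
]
voxels = build_sparse_voxels(samples, resolution=4)
```

Because the structure is keyed by occupied cells rather than by an inside/outside field, nothing in it requires the underlying surface to be closed or manifold. This is the property the paper attributes to O-Voxel.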

If this is right

  • Generated 3D assets exhibit geometry and material quality that exceeds existing models.
  • Inference stays highly efficient even for models with 4 billion parameters.
  • The representation supports open, non-manifold, and enclosed surfaces without special handling.
  • Surface attributes include physically-based rendering parameters in addition to color.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The compact latents could be reused for downstream tasks such as 3D editing or view synthesis without retraining the generator.
  • Combining this voxel structure with existing mesh or implicit surface pipelines might reduce conversion errors in production workflows.
  • If the compression rate holds at larger scales, similar latent designs could apply to 4D or animated asset generation.
  • Public datasets used here suggest the method may generalize across asset styles without heavy curation.

Load-bearing premise

O-Voxel can robustly model arbitrary topology including open, non-manifold, and fully-enclosed surfaces while capturing comprehensive surface attributes beyond texture color.

What would settle it

A direct test on 3D models containing non-manifold junctions or open surfaces where O-Voxel either fails to encode the topology correctly or omits the physically-based rendering parameters in the output assets.
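
One way to select test meshes for such an experiment is to count how many triangles share each edge. The sketch below is a standard edge-incidence check and is not taken from the paper. The function name and the toy meshes are illustrative.

```python
from collections import Counter


def classify_edges(faces):
    """Count how many triangles share each undirected edge.

    Returns (boundary_edges, non_manifold_edges). Edges used by exactly one
    face mark an open surface; edges used by more than two faces mark a
    non-manifold junction.
    """
    edge_use = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edge_use[(min(u, v), max(u, v))] += 1
    boundary = sorted(e for e, n in edge_use.items() if n == 1)
    non_manifold = sorted(e for e, n in edge_use.items() if n > 2)
    return boundary, non_manifold


# A closed tetrahedron: every edge is shared by exactly two faces.
tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]

# Three triangles fanning out from the shared edge (0, 1).
fan = [(0, 1, 2), (0, 1, 3), (0, 1, 4)]

closed_result = classify_edges(tetrahedron)
fan_result = classify_edges(fan)
```

Meshes with a non-empty boundary or non-manifold list are the ones that would stress the topology claim. Running the same check on decoded outputs would show whether those features survive encoding and generation.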

Original abstract

Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces O-Voxel, a new sparse voxel representation that encodes both geometry (arbitrary topologies including open, non-manifold, and enclosed surfaces) and comprehensive appearance attributes (including PBR parameters beyond texture color). It builds a Sparse Compression VAE for high-rate compression into compact structured latents and trains 4B-parameter flow-matching models on public 3D datasets for generation, asserting that the resulting assets exhibit geometry and material quality that far exceeds existing models while maintaining efficient inference.

Significance. If the central claims hold, the work would advance 3D generative modeling by providing a native, topology-robust representation that supports full PBR attributes and high compression, potentially enabling higher-fidelity outputs from large-scale flow models trained on diverse assets.

major comments (2)
  1. [Abstract] The headline claim that 'the geometry and material quality of our generated assets far exceed those of existing models' is load-bearing for the paper's contribution, yet it is presented without quantitative metrics, benchmark comparisons, or ablation results on topology handling or PBR attribute fidelity. The assertion that O-Voxel, the VAE, and the flow model preserve and improve these properties is therefore unverified at exactly the point where the superiority statement needs support.
  2. [Abstract] The description of O-Voxel asserts robust modeling of open, non-manifold, and fully-enclosed surfaces together with full PBR parameters (roughness, metallic, etc.), but this premise is not secured by targeted validation; if experiments are limited to closed manifold objects or texture-only appearance, the Sparse Compression VAE and downstream flow-matching results cannot be shown to deliver the claimed quality gains without topological collapse or attribute loss.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We agree that the abstract claims require stronger anchoring in quantitative evidence and targeted validation. We address each major comment below and will incorporate revisions to improve clarity and support for our assertions.

  1. Referee: [Abstract] The headline claim that 'the geometry and material quality of our generated assets far exceed those of existing models' is load-bearing for the paper's contribution, yet it is presented without quantitative metrics, benchmark comparisons, or ablation results on topology handling or PBR attribute fidelity. The assertion that O-Voxel, the VAE, and the flow model preserve and improve these properties is therefore unverified at exactly the point where the superiority statement needs support.

    Authors: We acknowledge that the abstract presents a strong claim without embedding supporting metrics directly within it. The full manuscript provides quantitative evaluations, benchmark comparisons against prior 3D generation methods, and ablations on geometry and material quality in the experiments section. To address the concern, we will revise the abstract to include key quantitative results and explicit references to the relevant experimental findings, thereby better verifying that O-Voxel, the Sparse Compression VAE, and the flow-matching model preserve and enhance these properties. revision: yes

  2. Referee: [Abstract] The description of O-Voxel asserts robust modeling of open, non-manifold, and fully-enclosed surfaces together with full PBR parameters (roughness, metallic, etc.), but this premise is not secured by targeted validation; if experiments are limited to closed manifold objects or texture-only appearance, the Sparse Compression VAE and downstream flow-matching results cannot be shown to deliver the claimed quality gains without topological collapse or attribute loss.

    Authors: O-Voxel is explicitly constructed as a sparse voxel structure that does not presuppose closed or manifold topology, enabling representation of open, non-manifold, and fully-enclosed surfaces while encoding full PBR attributes including roughness and metallic values. Our training uses diverse public 3D asset datasets containing such topological and material variations. We agree that dedicated validation would strengthen the presentation; we will add a targeted subsection with examples, visualizations, and metrics demonstrating topology robustness and PBR attribute preservation in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: new representation and trained models are independent of target claims

The paper introduces O-Voxel as a novel sparse voxel structure for encoding geometry and PBR attributes, followed by a Sparse Compression VAE and large-scale flow-matching models trained on public 3D datasets. The central quality claims rest on empirical outputs from these trained components rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or derivation steps in the provided text reduce the asserted robustness or quality gains to inputs by construction; the approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the effectiveness of the newly introduced O-Voxel and VAE without independent evidence or derivations provided in the abstract; training assumptions for flow-matching on compressed latents are taken as given.

axioms (1)
  • domain assumption: Flow-matching models trained on O-Voxel latents will produce high-quality 3D generation at 4B parameter scale
    Invoked when stating that large-scale models achieve superior results on public datasets
invented entities (1)
  • O-Voxel (no independent evidence)
    purpose: Sparse voxel structure encoding geometry and appearance for arbitrary topologies and PBR attributes
    Newly proposed representation forming the core of the latent learning approach

pith-pipeline@v0.9.0 · 5752 in / 1315 out tokens · 40371 ms · 2026-05-21T05:15:08.308845+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation/DimensionForcing dimension_forced echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters.

  • Foundation/LedgerForcing conservation_from_balance unclear

    Relation between the paper passage and the cited Recognition theorem.

    the geometry and material quality of our generated assets far exceed those of existing models

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

    cs.GR 2026-05 unverdicted novelty 8.0

    Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset ...

  2. On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

    cs.CR 2026-05 conditional novelty 8.0

    Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...

  3. Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

    cs.CV 2026-05 unverdicted novelty 7.0

    Stream3D is a training-free method that maintains temporal consistency in 3D generation from monocular streams by dynamically caching a fixed number of informative historical frames using an evidence score.

  4. The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting

    cs.CV 2026-05 conditional novelty 7.0

    MixCount provides a scalable synthetic dataset for mixed-object counting that improves state-of-the-art models on real benchmarks, cutting MAE by 20.14% on FSC-147 and 18.3% on PairTally.

  5. Count Anything at Any Granularity

    cs.CV 2026-05 unverdicted novelty 7.0

    Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...

  6. Velocity-Space 3D Asset Editing

    cs.GR 2026-05 unverdicted novelty 7.0

    VS3D performs local 3D asset editing by injecting reconstruction-anchored source signals, partial-mean guidance, and twin-agreement residuals into the velocity sampler to control edit strength and preserve identity.

  7. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 accept novelty 7.0

    3D generation for embodied AI is shifting from visual realism toward interaction readiness, organized into data generation, simulation environments, and sim-to-real bridging roles.

  8. Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

    cs.CV 2026-04 unverdicted novelty 7.0

    A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...

  9. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  10. PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

    cs.CV 2026-05 unverdicted novelty 6.0

    PhysX-Omni unifies simulation-ready 3D asset generation across rigid, deformable, and articulated objects via a new geometry representation, the PhysXVerse dataset, and the PhysX-Bench evaluation suite.

  11. ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    ROAR-3D adds a token-wise view router and dual-stream attention to pretrained single-view 3D generators so they can use arbitrary unposed images for higher-fidelity output.

  12. Pixal3D: Pixel-Aligned 3D Generation from Images

    cs.CV 2026-05 unverdicted novelty 6.0

    Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.

  13. Generative 3D Gaussians with Learned Density Control

    cs.GR 2026-05 unverdicted novelty 6.0

    DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.

  14. DVD: Discrete Voxel Diffusion for 3D Generation and Editing

    cs.CV 2026-05 unverdicted novelty 6.0

    DVD treats voxel occupancy as a discrete variable in a diffusion framework to generate, assess, and edit sparse 3D voxels without continuous thresholding.

  15. LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

    cs.CV 2026-04 conditional novelty 6.0

    LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.

  16. CMAG: Concept-Scaffolded Retrieval for Marketplace Avatar Generation

    cs.CV 2026-05 unverdicted novelty 5.0

    CMAG combines 3D concept scaffolding, prompt decomposition, taxonomy routing, hybrid retrieval, and agentic VLM verification to assemble topologically consistent avatars from catalog assets given free-form text prompts.

  17. EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

    cs.CV 2026-05 unverdicted novelty 5.0

    EVA01 introduces a Mixture-of-Transformers model that natively adds 3D mesh understanding, generation, and multi-turn editing to MLLMs by decoupling understanding and generation experts with shared global self-attention.

  18. Pose Tracking with a Foundation Pose Model and an Ensemble Directional Kalman Filter

    cs.LG 2026-05 unverdicted novelty 5.0

    EnDKF combines ensemble Kalman filtering with directional statistics and unit quaternions to achieve lower pose tracking error than raw measurements in synthetic constant-velocity tests and FoundationPose-based head tracking.

  19. From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

    cs.GR 2026-04 unverdicted novelty 5.0

    The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.

  20. Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation

    cs.CV 2026-04 unverdicted novelty 5.0

    Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples spa...

  21. Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation

    cs.CV 2026-04 unverdicted novelty 5.0

    Hitem3D 2.0 combines multi-view image synthesis with native 3D texture projection to improve completeness, cross-view consistency, and geometry alignment over prior methods.

  22. From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

    cs.GR 2026-04 unverdicted novelty 4.0

    The paper surveys 3D content generation literature using a taxonomy of asset types and production stages to evaluate progress toward engine-ready assets.

  23. Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation

    cs.GR 2026-04 unverdicted novelty 4.0

    Seed3D 2.0 advances 3D content generation via a coarse-to-fine geometry pipeline, unified PBR material model, and simulation-ready scene tools, reporting 69-89.9% win rates over commercial systems in human studies.

  24. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 unverdicted novelty 3.0

    The survey organizes 3D generation for embodied AI into data generators for assets, simulation environments for interaction, and sim-to-real bridges, noting a shift toward interaction readiness and listing bottlenecks...

  25. 3D Generation for Embodied AI and Robotic Simulation: A Survey

    cs.RO 2026-04 unverdicted novelty 2.0

    The paper surveys 3D generation techniques for embodied AI and robotics, categorizing them into data generation, simulation environments, and sim-to-real bridging while identifying bottlenecks in physical validity and...

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 22 Pith papers · 8 internal anchors

  1. [1]

    Deep compression autoencoder for efficient high-resolution diffusion models

    Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733, 2024.

  2. [2]

    Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis

    Junsong Chen, YU Jincheng, GE Chongjian, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In ICLR, 2024.

  3. [3]

    Dora: Sampling and benchmarking for 3d shape variational auto-encoders

    Rui Chen, Jianfeng Zhang, Yixun Liang, Guan Luo, Weiyu Li, Jiarui Liu, Xiu Li, Xiaoxiao Long, Jiashi Feng, and Ping Tan. Dora: Sampling and benchmarking for 3d shape variational auto-encoders. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16251–16261,

  4. [4]

    Meshanything: Artist-created mesh generation with autoregressive transformers

    Yiwen Chen, Tong He, Di Huang, Weicai Ye, Sijin Chen, Jiaxiang Tang, Xin Chen, Zhongang Cai, Lei Yang, Gang Yu, et al. Meshanything: Artist-created mesh generation with autoregressive transformers. arXiv preprint arXiv:2406.10163, 2024.

  5. [5]

    Ultra3d: Efficient and high-fidelity 3d generation with part attention

    Yiwen Chen, Zhihao Li, Yikai Wang, Hu Zhang, Qin Li, Chi Zhang, and Guosheng Lin. Ultra3d: Efficient and high-fidelity 3d generation with part attention. arXiv preprint arXiv:2507.17745, 2025.

  6. [6]

    Neural dual contouring

    Zhiqin Chen, Andrea Tagliasacchi, Thomas Funkhouser, and Hao Zhang. Neural dual contouring. ACM Transactions on Graphics (TOG), 41(4):1–13, 2022.

  7. [7]

    3dtopia-xl: Scaling high-quality 3d asset generation via primitive diffusion

    Zhaoxi Chen, Jiaxiang Tang, Yuhao Dong, Ziang Cao, Fangzhou Hong, Yushi Lan, Tengfei Wang, Haozhe Xie, Tong Wu, Shunsuke Saito, et al. 3dtopia-xl: Scaling high-quality 3d asset generation via primitive diffusion. arXiv preprint arXiv:2409.12957, 2024.

  8. [8]

    Warpconvnet: High-performance 3d deep learning library

    Chris Choy and NVIDIA Research. Warpconvnet: High-performance 3d deep learning library. https://github.com/NVlabs/warpconvnet, 2025.

  9. [9]

    Abo: Dataset and benchmarks for real-world 3d object understanding

    Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al. Abo: Dataset and benchmarks for real-world 3d object understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21126–21136, 2022.

  10. [10]

    Blender - a 3D modelling and rendering package

    Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018.

  11. [11]

    Spconv: Spatially sparse convolution library

    Spconv Contributors. Spconv: Spatially sparse convolution library. https://github.com/traveller59/spconv, 2022.

  12. [12]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.

  13. [13]

    Objaverse-xl: A universe of 10m+ 3d objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems, 36, 2024.

  14. [14]

    Deformed implicit field: Modeling 3d shapes with learned dense correspondence

    Yu Deng, Jiaolong Yang, and Xin Tong. Deformed implicit field: Modeling 3d shapes with learned dense correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10286–10296,

  15. [15]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.

  16. [16]

    Introducing gemini 2.5 flash image: Our state-of-the-art image model

    Alisa Fortin, Guillaume Vernade, Kat Kampf, and Ammaar Reshi. Introducing gemini 2.5 flash image: Our state-of-the-art image model. https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/, 2025. Google Developer Blog.

  17. [17]

    3d-future: 3d furniture shape with texture

    Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3d-future: 3d furniture shape with texture. International Journal of Computer Vision, pages 1–25, 2021.

  18. [18]

    Submanifold Sparse Convolutional Networks

    Benjamin Graham and Laurens Van der Maaten. Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307, 2017.

  19. [19]

    3dgen: Triplane latent diffusion for textured mesh generation

    Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 3dgen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371, 2023.

  20. [20]

    Gvgen: Text-to-3d generation with volumetric representation

    Xianglong He, Junyi Chen, Sida Peng, Di Huang, Yangguang Li, Xiaoshui Huang, Chun Yuan, Wanli Ouyang, and Tong He. Gvgen: Text-to-3d generation with volumetric representation. In ECCV, 2024.

  21. [21]

    Sparseflex: High-resolution and arbitrary-topology 3d shape modeling

    Xianglong He, Zi-Xin Zou, Chia-Hao Chen, Yuan-Chen Guo, Ding Liang, Chun Yuan, Wanli Ouyang, Yan-Pei Cao, and Yangguang Li. Sparseflex: High-resolution and arbitrary-topology 3d shape modeling. arXiv preprint arXiv:2503.21732, 2025.

  22. [22]

    Neural wavelet-domain diffusion for 3d shape generation

    Ka-Hei Hui, Ruihui Li, Jingyu Hu, and Chi-Wing Fu. Neural wavelet-domain diffusion for 3d shape generation. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.

  23. [23]

    Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material

    Team Hunyuan3D, Shuhui Yang, Mingxin Yang, Yifei Feng, Xin Huang, Sheng Zhang, Zebin He, Di Luo, Haolin Liu, Yunfei Zhao, et al. Hunyuan3d 2.1: From images to high-fidelity 3d assets with production-ready pbr material. arXiv preprint arXiv:2506.15442, 2025.

  24. [24]

    Perceiver IO: A General Architecture for Structured Inputs & Outputs

    Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver io: A general architecture for structured inputs & outputs. arXiv preprint arXiv:2107.14795, 2021.

  25. [25]

    Dual contouring of hermite data

    Tao Ju, Frank Losasso, Scott Schaefer, and Joe Warren. Dual contouring of hermite data. In Proceedings of the 29th annual conference on Computer graphics and interactive techniques, pages 339–346, 2002.

  26. [26]

    Shap-E: Generating Conditional 3D Implicit Functions

    Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.

  27. [27]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

  28. [28]

    Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation

    Mukul Khanna*, Yongsen Mao*, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X. Chang, and Manolis Savva. Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation. arXiv preprint, 2023.

  29. [29]

    Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation

    Yushi Lan, Fangzhou Hong, Shuai Yang, Shangchen Zhou, Xuyi Meng, Bo Dai, Xingang Pan, and Chen Change Loy. Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation. In ECCV, 2024.

  30. [30]

    Gaussiananything: Interactive point cloud latent diffusion for 3d generation

    Yushi Lan, Shangchen Zhou, Zhaoyang Lyu, Fangzhou Hong, Shuai Yang, Bo Dai, Xingang Pan, and Chen Change Loy. Gaussiananything: Interactive point cloud latent diffusion for 3d generation. In ICLR, 2025.

  31. [31]

    Craftsman: High-fidelity mesh generation with 3d native generation and interactive geometry refiner

    Weiyu Li, Jiarui Liu, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, and Xiaoxiao Long. Craftsman: High-fidelity mesh generation with 3d native generation and interactive geometry refiner. arXiv preprint arXiv:2405.14979, 2024.

  32. [32]

    Step1x-3d: Towards high-fidelity and controllable generation of textured 3d assets

    Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, et al. Step1x-3d: Towards high-fidelity and controllable generation of textured 3d assets. arXiv preprint arXiv:2505.07747, 2025.

  33. [33]

    TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

    Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, et al. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models. arXiv preprint arXiv:2502.06608, 2025.

  34. [34]

    Sparc3d: Sparse representation and construction for high-resolution 3d shapes modeling

    Zhihao Li, Yufei Wang, Heliang Zheng, Yihao Luo, and Bihan Wen. Sparc3d: Sparse representation and construction for high-resolution 3d shapes modeling. arXiv preprint arXiv:2505.14521, 2025.

  35. [35]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023. 6, 3

  36. [36]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023. 3

  37. [37]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986,

  38. [38]

    Decoupled Weight Decay Regularization

    I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017. 6

  39. [39]

    Diffusion probabilistic models for 3d point cloud generation

    Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2837–2845, 2021. 2

  40. [40]

    Lt3sd: Latent trees for 3d scene diffusion

    Quan Meng, Lei Li, Matthias Nießner, and Angela Dai. Lt3sd: Latent trees for 3d scene diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 650–660, 2025. 3

  41. [41]

    Occupancy networks: Learning 3d reconstruction in function space

    Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4460–4470, 2019. 2

  42. [42]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021. 2

  43. [43]

    Diffrf: Rendering-guided 3d radiance field diffusion

    Norman Müller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4328–4338, 2023. 2

  44. [44]

    Extracting triangular 3d models, materials, and lighting from images

    Jacob Munkberg, Jon Hasselgren, Tianchang Shen, Jun Gao, Wenzheng Chen, Alex Evans, Thomas Müller, and Sanja Fidler. Extracting triangular 3d models, materials, and lighting from images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8280–8290, 2022. 6

  45. [45]

    Polygen: An autoregressive generative model of 3d meshes

    Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter Battaglia. Polygen: An autoregressive generative model of 3d meshes. In International conference on machine learning, pages 7220–7229. PMLR, 2020. 2

  46. [46]

    Point-E: A System for Generating 3D Point Clouds from Complex Prompts

    Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022. 2

  47. [47]

    Autodecoding latent 3d diffusion models

    Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc V Gool, and Sergey Tulyakov. Autodecoding latent 3d diffusion models. Advances in Neural Information Processing Systems, 36:67021–67047, 2023. 3

  48. [48]

    Deepsdf: Learning continuous signed distance functions for shape representation

    Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 165–174, 2019. 2

  49. [49]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205,

  50. [50]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 7, 6

  51. [51]

    Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies

    Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4209–4219, 2024. 3, 5

  52. [52]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3, 5

  53. [53]

    Flexible isosurface extraction for gradient-based mesh optimization

    Tianchang Shen, Jacob Munkberg, Jon Hasselgren, Kangxue Yin, Zian Wang, Wenzheng Chen, Zan Gojcic, Sanja Fidler, Nicholas Sharp, and Jun Gao. Flexible isosurface extraction for gradient-based mesh optimization. ACM Trans. Graph., 42(4), 2023. 2, 4

  54. [54]

    DINOv3

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025. 6

  55. [55]

    Sketchfab - the best 3d viewer on the web

    Sketchfab. Sketchfab - the best 3d viewer on the web. https://sketchfab.com/, 2025. 6, 4

  56. [56]

    Using shape to categorize: Low-shot learning with an explicit shape bias

    Stefan Stojanov, Anh Thai, and James M Rehg. Using shape to categorize: Low-shot learning with an explicit shape bias. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1798–1808, 2021. 6, 4

  57. [57]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063,

  58. [58]

    Torchsparse++: Efficient training and inference framework for sparse convolution on gpus

    Haotian Tang, Shang Yang, Zhijian Liu, Ke Hong, Zhongming Yu, Xiuyu Li, Guohao Dai, Yu Wang, and Song Han. Torchsparse++: Efficient training and inference framework for sparse convolution on gpus. In IEEE/ACM International Symposium on Microarchitecture (MICRO), 2023. 4

  59. [59]

    Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder

    Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, and Baining Guo. Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder. arXiv preprint arXiv:2312.11459, 2023. 2

  60. [60]

    Triton: an intermediate language and compiler for tiled neural network computations

    Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019. 6, 3

  61. [61]

    Lion: Latent point diffusion models for 3d shape generation

    Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, Karsten Kreis, et al. Lion: Latent point diffusion models for 3d shape generation. Advances in Neural Information Processing Systems, 35:10021–10039, 2022. 3

  62. [62]

    fvdb: A deep-learning framework for sparse, large scale, and high performance spatial intelligence

    Francis Williams, Jiahui Huang, Jonathan Swartz, Gergely Klar, Vijay Thakkar, Matthew Cong, Xuanchi Ren, Ruilong Li, Clement Fuji-Tsang, Sanja Fidler, et al. fvdb: A deep-learning framework for sparse, large scale, and high performance spatial intelligence. ACM Transactions on Graphics (TOG), 43(4):1–15, 2024. 4

  63. [63]

    Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer

    Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, and Yao Yao. Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer. arXiv preprint arXiv:2405.14832, 2024. 3

  64. [64]

    Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention

    Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Yikang Yang, Yajie Bao, Jiachen Qian, Siyu Zhu, Xun Cao, Philip Torr, et al. Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention. arXiv preprint arXiv:2505.17412, 2025. 2, 3, 6, 7

  65. [65]

    Structured 3d latents for scalable and versatile 3d generation

    Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21469–21480, 2025. 2, 3, 5, 6, 7, 4

  66. [66]

    Octfusion: Octree-based diffusion models for 3d shape generation

    Bojun Xiong, Si-Tong Wei, Xin-Yang Zheng, Yan-Pei Cao, Zhouhui Lian, and Peng-Shuai Wang. Octfusion: Octree-based diffusion models for 3d shape generation. In Computer Graphics Forum, page e70198. Wiley Online Library, 2025. 3

  67. [67]

    Ulip-2: Towards scalable multimodal pre-training for 3d understanding

    Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, et al. Ulip-2: Towards scalable multimodal pre-training for 3d understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27091–27101, 2024. 7, 6

  68. [68]

    Atlas gaussians diffusion for 3d generation with infinite number of points

    Haitao Yang, Yuan Dong, Hanwen Jiang, Dejia Xu, Georgios Pavlakos, and Qixing Huang. Atlas gaussians diffusion for 3d generation with infinite number of points. arXiv preprint arXiv:2408.13055, 2024. 3

  69. [69]

    Pandora3d: A comprehensive framework for high-quality 3d shape and texture generation

    Jiayu Yang, Taizhang Shang, Weixuan Sun, Xibin Song, Ziang Cheng, Senbo Wang, Shenzhou Chen, Weizhe Liu, Hongdong Li, and Pan Ji. Pandora3d: A comprehensive framework for high-quality 3d shape and texture generation. arXiv preprint arXiv:2502.14247, 2025. 3

  70. [70]

    Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging

    Chongjie Ye, Yushuang Wu, Ziteng Lu, Jiahao Chang, Xiaoyang Guo, Jiaqing Zhou, Hao Zhao, and Xiaoguang Han. Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging. arXiv preprint arXiv:2503.22236, 3:2,

  71. [71]

    Texgen: a generative diffusion model for mesh textures

    Xin Yu, Ze Yuan, Yuan-Chen Guo, Ying-Tian Liu, Jianhui Liu, Yangguang Li, Yan-Pei Cao, Ding Liang, and Xiaojuan Qi. Texgen: a generative diffusion model for mesh textures. ACM Transactions on Graphics (TOG), 43(6):1–14, 2024. 7

  72. [72]

    Mip-splatting: Alias-free 3d gaussian splatting

    Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19447–19456,

  73. [73]

    Root mean square layer normalization

    Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019. 2

  74. [74]

    3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models

    Biao Zhang, Jiapeng Tang, Matthias Niessner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–16, 2023. 3

  75. [75]

    Gaussiancube: Structuring gaussian splatting using optimal transport for 3d generative modeling

    Bowen Zhang, Yiji Cheng, Jiaolong Yang, Chunyu Wang, Feng Zhao, Yansong Tang, Dong Chen, and Baining Guo. Gaussiancube: Structuring gaussian splatting using optimal transport for 3d generative modeling. arXiv preprint arXiv:2403.19655, 2024. 2

  76. [76]

    Clay: A controllable large-scale generative model for creating high-quality 3d assets

    Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets. ACM Transactions on Graphics (TOG), 43(4):1–20, 2024. 2, 3

  77. [77]

    Texverse: A universe of 3d objects with high-resolution textures

    Yibo Zhang, Li Zhang, Rui Ma, and Nan Cao. Texverse: A universe of 3d objects with high-resolution textures. arXiv preprint arXiv:2508.10868, 2025. 3, 6, 4

  78. [78]

    Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation

    Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, and Shenghua Gao. Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation. Advances in Neural Information Processing Systems, 36, 2024. 3

  79. [79]

    Locally attentional sdf diffusion for controllable 3d shape generation

    Xin-Yang Zheng, Hao Pan, Peng-Shuai Wang, Xin Tong, Yang Liu, and Heung-Yeung Shum. Locally attentional sdf diffusion for controllable 3d shape generation. ACM Transactions on Graphics (SIGGRAPH), 42(4), 2023. 2

  80. [80]

    Uni3d: Exploring unified 3d representation at scale

    Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3d: Exploring unified 3d representation at scale. arXiv preprint arXiv:2310.06773,

Showing first 80 references.