Native and Compact Structured Latents for 3D Generation

Hao Zhao; Hongyuan Zhu; Jianfeng Xiang; Jiaolong Yang; Nicholas Jing Yuan; Ruicheng Wang; Sicheng Xu; Xiaoxue Chen; Yu Deng; Yue Dong

arxiv: 2512.14692 · v1 · pith:MXB5A2GNnew · submitted 2025-12-16 · 💻 cs.CV · cs.AI

Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng, Hongyuan Zhu, Yue Dong, Hao Zhao, Nicholas Jing Yuan, Jiaolong Yang

Pith reviewed 2026-05-21 05:15 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords 3D generative modeling · structured latent representation · O-Voxel · sparse voxel · flow-matching · Sparse Compression VAE · 3D assets · physically-based rendering

The pith

O-Voxel encodes geometry and appearance in a sparse structure to support higher-quality 3D generation from compact latents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a structured latent representation learned directly from native 3D data to overcome limits in capturing complex topologies and detailed appearance. Its central mechanism is O-Voxel, a sparse voxel form that records both shape and surface properties such as physically-based rendering parameters. This feeds into a Sparse Compression VAE that shrinks the data into a compact latent space, on which 4-billion-parameter flow-matching models are then trained using public datasets. The result, according to the paper, is generated assets whose geometry and material fidelity surpass prior models while inference stays efficient. A reader would care because current 3D generators still falter on open surfaces, non-manifold shapes, and realistic materials, restricting uses in graphics and simulation.

Core claim

The paper shows that an omni-voxel called O-Voxel can model arbitrary topology, including open, non-manifold, and fully enclosed surfaces, while also storing comprehensive surface attributes beyond color such as physically-based rendering parameters. A Sparse Compression VAE built on this structure delivers high spatial compression and a compact latent space. Large-scale flow-matching models with 4 billion parameters trained on diverse public 3D assets then produce outputs whose geometry and material quality exceed those of existing models, all while maintaining fast inference.
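For readers unfamiliar with flow matching, the training objective can be sketched in a few lines. This is a minimal illustration that assumes the common linear (rectified-flow style) path between Gaussian noise and data; the paper's exact path, conditioning, and network are not described in this summary, and the names `flow_matching_pair`, `flow_matching_loss`, and `model` are hypothetical.

```python
import numpy as np


def flow_matching_pair(x0, x1, t):
    """Return the point x_t on a straight path from x0 to x1, and its velocity.

    Assumes the linear path x_t = (1 - t) * x0 + t * x1, whose velocity is
    the constant x1 - x0. This is an illustrative choice, not the paper's.
    """
    # Reshape t from (batch,) to (batch, 1, ..., 1) so it broadcasts.
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target


def flow_matching_loss(model, x1, rng):
    """Mean squared error between predicted and target velocity.

    `model` is any callable (x_t, t) -> velocity with the same shape as x_t.
    `x1` is a batch of data samples (here: stand-ins for latent codes).
    """
    x0 = rng.standard_normal(x1.shape)           # noise endpoint
    t = rng.uniform(0.0, 1.0, size=x1.shape[0])  # one time per sample
    x_t, v_target = flow_matching_pair(x0, x1, t)
    v_pred = model(x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))
```

In this sketch the network only ever sees noisy points and times, and is regressed onto the straight-line velocity; at sampling time one would integrate the learned velocity field from noise toward data.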

What carries the argument

O-Voxel, a sparse voxel representation that jointly encodes geometry and appearance attributes for arbitrary topologies.
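To make "sparse voxel representation that jointly encodes geometry and appearance" concrete, here is a toy sketch. It is not the O-Voxel layout: geometry is reduced to which cells are occupied, and the per-voxel fields (`base_color`, `roughness`, `metallic`) are assumed for illustration only. All class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Coord = Tuple[int, int, int]


@dataclass
class VoxelAttributes:
    # Hypothetical appearance payload; the paper's actual fields may differ.
    base_color: Tuple[float, float, float]
    roughness: float
    metallic: float


@dataclass
class SparseVoxelGrid:
    """Stores attributes only for occupied cells of a resolution^3 grid."""
    resolution: int
    voxels: Dict[Coord, VoxelAttributes] = field(default_factory=dict)

    def set(self, coord: Coord, attrs: VoxelAttributes) -> None:
        if not all(0 <= c < self.resolution for c in coord):
            raise ValueError(f"coordinate {coord} outside the grid")
        self.voxels[coord] = attrs

    def get(self, coord: Coord) -> Optional[VoxelAttributes]:
        # Empty cells are simply absent from the dictionary.
        return self.voxels.get(coord)

    def occupancy_ratio(self) -> float:
        """Fraction of grid cells that are occupied."""
        return len(self.voxels) / self.resolution ** 3
```

Because only cells near the surface are stored, nothing in this structure requires the surface to be closed or manifold; that property is what the sparse design is meant to buy, though whether the real O-Voxel delivers it is the paper's claim to support.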

If this is right

Generated 3D assets exhibit geometry and material quality that exceeds existing models.
Inference stays highly efficient even for models with 4 billion parameters.
The representation supports open, non-manifold, and enclosed surfaces without special handling.
Surface attributes include physically-based rendering parameters in addition to color.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The compact latents could be reused for downstream tasks such as 3D editing or view synthesis without retraining the generator.
Combining this voxel structure with existing mesh or implicit surface pipelines might reduce conversion errors in production workflows.
If the compression rate holds at larger scales, similar latent designs could apply to 4D or animated asset generation.
Public datasets used here suggest the method may generalize across asset styles without heavy curation.

Load-bearing premise

O-Voxel can robustly model arbitrary topology including open, non-manifold, and fully-enclosed surfaces while capturing comprehensive surface attributes beyond texture color.

What would settle it

A direct test on 3D models containing non-manifold junctions or open surfaces where O-Voxel either fails to encode the topology correctly or omits the physically-based rendering parameters in the output assets.
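Selecting test meshes for such a check requires knowing which ones are open or non-manifold. A standard edge-counting heuristic can be sketched as follows; it is an editorial illustration, not part of the paper, and the function name `classify_edges` is hypothetical. It flags edges only, so non-manifold vertices (for example, two cones touching at a tip) would need a separate check.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Face = Tuple[int, int, int]
Edge = Tuple[int, int]


def classify_edges(faces: Iterable[Face]) -> Dict[str, List[Edge]]:
    """Group the edges of a triangle mesh by how many faces share them.

    1 face   -> boundary edge (the surface is open there)
    2 faces  -> ordinary manifold edge
    3+ faces -> non-manifold edge (a junction of several sheets)
    """
    counts: Counter = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            # Store each edge with sorted endpoints so direction is ignored.
            counts[(min(u, v), max(u, v))] += 1

    result: Dict[str, List[Edge]] = {
        "boundary": [], "manifold": [], "non_manifold": []
    }
    for edge, n in sorted(counts.items()):
        if n == 1:
            result["boundary"].append(edge)
        elif n == 2:
            result["manifold"].append(edge)
        else:
            result["non_manifold"].append(edge)
    return result
```

A mesh with any boundary edges is open, and a mesh with any non-manifold edges contains the junctions the test above calls for; a closed tetrahedron yields neither.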

Original abstract

Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

O-Voxel is a new structured latent that targets complex topologies and PBR attributes, but the claim of far superior output quality still needs quantitative checks.

The paper's main contribution is O-Voxel, a sparse omni-voxel representation that encodes both geometry and surface attributes like physically-based rendering parameters directly from native 3D data. It is designed to handle open, non-manifold, and enclosed surfaces without the topology problems that hit standard meshes or implicits. They pair it with a Sparse Compression VAE for compact latents and then train 4B-parameter flow-matching models on public datasets, keeping inference efficient despite the scale.

Referee Report

2 major / 0 minor

Summary. The paper introduces O-Voxel, a new sparse voxel representation that encodes both geometry (arbitrary topologies including open, non-manifold, and enclosed surfaces) and comprehensive appearance attributes (including PBR parameters beyond texture color). It builds a Sparse Compression VAE for high-rate compression into compact structured latents and trains 4B-parameter flow-matching models on public 3D datasets for generation, asserting that the resulting assets exhibit geometry and material quality that far exceeds existing models while maintaining efficient inference.

Significance. If the central claims hold, the work would advance 3D generative modeling by providing a native, topology-robust representation that supports full PBR attributes and high compression, potentially enabling higher-fidelity outputs from large-scale flow models trained on diverse assets.

major comments (2)

[Abstract] The headline claim that 'the geometry and material quality of our generated assets far exceed those of existing models' is load-bearing for the paper's contribution, yet it is presented without quantitative metrics, benchmark comparisons, or ablation results on topology handling or PBR attribute fidelity. As stated, the assertion that O-Voxel, the VAE, and the flow model preserve and improve these properties is unverified at exactly the point where the superiority statement depends on it.
[Abstract] The description of O-Voxel asserts robust modeling of open, non-manifold, and fully-enclosed surfaces together with full PBR parameters (roughness, metallic, etc.), but this premise is not secured by targeted validation; if experiments are limited to closed manifold objects or texture-only appearance, the Sparse Compression VAE and downstream flow-matching results cannot be shown to deliver the claimed quality gains without topological collapse or attribute loss.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We agree that the abstract claims require stronger anchoring in quantitative evidence and targeted validation. We address each major comment below and will incorporate revisions to improve clarity and support for our assertions.

Referee: [Abstract] The headline claim that 'the geometry and material quality of our generated assets far exceed those of existing models' is load-bearing for the paper's contribution, yet it is presented without quantitative metrics, benchmark comparisons, or ablation results on topology handling or PBR attribute fidelity. As stated, the assertion that O-Voxel, the VAE, and the flow model preserve and improve these properties is unverified at exactly the point where the superiority statement depends on it.

Authors: We acknowledge that the abstract presents a strong claim without embedding supporting metrics directly within it. The full manuscript provides quantitative evaluations, benchmark comparisons against prior 3D generation methods, and ablations on geometry and material quality in the experiments section. To address the concern, we will revise the abstract to include key quantitative results and explicit references to the relevant experimental findings, thereby better verifying that O-Voxel, the Sparse Compression VAE, and the flow-matching model preserve and enhance these properties. revision: yes
Referee: [Abstract] The description of O-Voxel asserts robust modeling of open, non-manifold, and fully-enclosed surfaces together with full PBR parameters (roughness, metallic, etc.), but this premise is not secured by targeted validation; if experiments are limited to closed manifold objects or texture-only appearance, the Sparse Compression VAE and downstream flow-matching results cannot be shown to deliver the claimed quality gains without topological collapse or attribute loss.

Authors: O-Voxel is explicitly constructed as a sparse voxel structure that does not presuppose closed or manifold topology, enabling representation of open, non-manifold, and fully-enclosed surfaces while encoding full PBR attributes including roughness and metallic values. Our training uses diverse public 3D asset datasets containing such topological and material variations. We agree that dedicated validation would strengthen the presentation; we will add a targeted subsection with examples, visualizations, and metrics demonstrating topology robustness and PBR attribute preservation in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: new representation and trained models are independent of target claims

The paper introduces O-Voxel as a novel sparse voxel structure for encoding geometry and PBR attributes, followed by a Sparse Compression VAE and large-scale flow-matching models trained on public 3D datasets. The central quality claims rest on empirical outputs from these trained components rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. No equations or derivation steps in the provided text reduce the asserted robustness or quality gains to inputs by construction; the approach is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the effectiveness of the newly introduced O-Voxel and VAE without independent evidence or derivations provided in the abstract; training assumptions for flow-matching on compressed latents are taken as given.

axioms (1)

domain assumption: Flow-matching models trained on O-Voxel latents will produce high-quality 3D generation at 4B parameter scale
Invoked when stating that large-scale models achieve superior results on public datasets

invented entities (1)

O-Voxel (no independent evidence)
purpose: Sparse voxel structure encoding geometry and appearance for arbitrary topologies and PBR attributes
Newly proposed representation forming the core of the latent learning approach

pith-pipeline@v0.9.0 · 5752 in / 1315 out tokens · 40371 ms · 2026-05-21T05:15:08.308845+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation/DimensionForcing dimension_forced echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters.
Foundation/LedgerForcing conservation_from_balance unclear

Relation between the paper passage and the cited Recognition theorem.

the geometry and material quality of our generated assets far exceed those of existing models

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation
cs.GR 2026-05 unverdicted novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset ...
On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models
cs.CR 2026-05 conditional novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...
Stream3D: Sequential Multi-View 3D Generation via Evidential Memory
cs.CV 2026-05 unverdicted novelty 7.0

Stream3D is a training-free method that maintains temporal consistency in 3D generation from monocular streams by dynamically caching a fixed number of informative historical frames using an evidence score.
The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting
cs.CV 2026-05 conditional novelty 7.0

MixCount provides a scalable synthetic dataset for mixed-object counting that improves state-of-the-art models on real benchmarks, cutting MAE by 20.14% on FSC-147 and 18.3% on PairTally.
Count Anything at Any Granularity
cs.CV 2026-05 unverdicted novelty 7.0

Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...
Velocity-Space 3D Asset Editing
cs.GR 2026-05 unverdicted novelty 7.0

VS3D performs local 3D asset editing by injecting reconstruction-anchored source signals, partial-mean guidance, and twin-agreement residuals into the velocity sampler to control edit strength and preserve identity.
3D Generation for Embodied AI and Robotic Simulation: A Survey
cs.RO 2026-04 accept novelty 7.0

3D generation for embodied AI is shifting from visual realism toward interaction readiness, organized into data generation, simulation environments, and sim-to-real bridging roles.
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
cs.CV 2026-04 unverdicted novelty 7.0

A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
cs.CV 2026-04 unverdicted novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.
PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects
cs.CV 2026-05 unverdicted novelty 6.0

PhysX-Omni unifies simulation-ready 3D asset generation across rigid, deformable, and articulated objects via a new geometry representation, the PhysXVerse dataset, and the PhysX-Bench evaluation suite.
ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation
cs.CV 2026-05 unverdicted novelty 6.0

ROAR-3D adds a token-wise view router and dual-stream attention to pretrained single-view 3D generators so they can use arbitrary unposed images for higher-fidelity output.
Pixal3D: Pixel-Aligned 3D Generation from Images
cs.CV 2026-05 unverdicted novelty 6.0

Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.
Generative 3D Gaussians with Learned Density Control
cs.GR 2026-05 unverdicted novelty 6.0

DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.
DVD: Discrete Voxel Diffusion for 3D Generation and Editing
cs.CV 2026-05 unverdicted novelty 6.0

DVD treats voxel occupancy as a discrete variable in a diffusion framework to generate, assess, and edit sparse 3D voxels without continuous thresholding.
LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows
cs.CV 2026-04 conditional novelty 6.0

LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.
CMAG: Concept-Scaffolded Retrieval for Marketplace Avatar Generation
cs.CV 2026-05 unverdicted novelty 5.0

CMAG combines 3D concept scaffolding, prompt decomposition, taxonomy routing, hybrid retrieval, and agentic VLM verification to assemble topologically consistent avatars from catalog assets given free-form text prompts.
EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers
cs.CV 2026-05 unverdicted novelty 5.0

EVA01 introduces a Mixture-of-Transformers model that natively adds 3D mesh understanding, generation, and multi-turn editing to MLLMs by decoupling understanding and generation experts with shared global self-attention.
Pose Tracking with a Foundation Pose Model and an Ensemble Directional Kalman Filter
cs.LG 2026-05 unverdicted novelty 5.0

EnDKF combines ensemble Kalman filtering with directional statistics and unit quaternions to achieve lower pose tracking error than raw measurements in synthetic constant-velocity tests and FoundationPose-based head tracking.
From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation
cs.GR 2026-04 unverdicted novelty 5.0

The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.
Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation
cs.CV 2026-04 unverdicted novelty 5.0

Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples spa...
Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation
cs.CV 2026-04 unverdicted novelty 5.0

Hitem3D 2.0 combines multi-view image synthesis with native 3D texture projection to improve completeness, cross-view consistency, and geometry alignment over prior methods.
From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation
cs.GR 2026-04 unverdicted novelty 4.0

The paper surveys 3D content generation literature using a taxonomy of asset types and production stages to evaluate progress toward engine-ready assets.
Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation
cs.GR 2026-04 unverdicted novelty 4.0

Seed3D 2.0 advances 3D content generation via a coarse-to-fine geometry pipeline, unified PBR material model, and simulation-ready scene tools, reporting 69-89.9% win rates over commercial systems in human studies.
3D Generation for Embodied AI and Robotic Simulation: A Survey
cs.RO 2026-04 unverdicted novelty 3.0

The survey organizes 3D generation for embodied AI into data generators for assets, simulation environments for interaction, and sim-to-real bridges, noting a shift toward interaction readiness and listing bottlenecks...
3D Generation for Embodied AI and Robotic Simulation: A Survey
cs.RO 2026-04 unverdicted novelty 2.0

The paper surveys 3D generation techniques for embodied AI and robotics, categorizing them into data generation, simulation environments, and sim-to-real bridging while identifying bottlenecks in physical validity and...

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 22 Pith papers · 8 internal anchors

[1]

Deep compression autoencoder for efficient high-resolution diffusion models.arXiv preprint arXiv:2410.10733, 2024

Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffu- sion models.arXiv preprint arXiv:2410.10733, 2024. 2, 5, 1

work page arXiv 2024
[2]

Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis

Junsong Chen, YU Jincheng, GE Chongjian, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis. InICLR, 2024. 6, 1

work page 2024
[3]

Dora: Sampling and benchmarking for 3d shape varia- tional auto-encoders

Rui Chen, Jianfeng Zhang, Yixun Liang, Guan Luo, Weiyu Li, Jiarui Liu, Xiu Li, Xiaoxiao Long, Jiashi Feng, and Ping Tan. Dora: Sampling and benchmarking for 3d shape varia- tional auto-encoders. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16251–16261,

work page
[4]

Meshanything: Artist-created mesh generation with autoregressive transformers.arXiv preprint arXiv:2406.10163, 2024

Yiwen Chen, Tong He, Di Huang, Weicai Ye, Sijin Chen, Ji- axiang Tang, Xin Chen, Zhongang Cai, Lei Yang, Gang Yu, et al. Meshanything: Artist-created mesh generation with au- toregressive transformers.arXiv preprint arXiv:2406.10163,

work page arXiv
[5]

Ultra3d: Efficient and high- fidelity 3d generation with part attention.arXiv preprint arXiv:2507.17745, 2025

Yiwen Chen, Zhihao Li, Yikai Wang, Hu Zhang, Qin Li, Chi Zhang, and Guosheng Lin. Ultra3d: Efficient and high- fidelity 3d generation with part attention.arXiv preprint arXiv:2507.17745, 2025. 3

work page arXiv 2025
[6]

Neural dual contouring.ACM Transactions on Graphics (TOG), 41(4):1–13, 2022

Zhiqin Chen, Andrea Tagliasacchi, Thomas Funkhouser, and Hao Zhang. Neural dual contouring.ACM Transactions on Graphics (TOG), 41(4):1–13, 2022. 3

work page 2022
[7]

3dtopia-xl: Scaling high-quality 3d asset generation via primitive diffusion.arXiv preprint arXiv:2409.12957, 2024

Zhaoxi Chen, Jiaxiang Tang, Yuhao Dong, Ziang Cao, Fangzhou Hong, Yushi Lan, Tengfei Wang, Haozhe Xie, Tong Wu, Shunsuke Saito, et al. 3dtopia-xl: Scaling high- quality 3d asset generation via primitive diffusion.arXiv preprint arXiv:2409.12957, 2024. 2, 3

work page arXiv 2024
[8]

Warpconvnet: High- performance 3d deep learning library.https://github

Chris Choy and NVIDIA Research. Warpconvnet: High- performance 3d deep learning library.https://github. com/NVlabs/warpconvnet, 2025. 4

work page 2025
[9]

Abo: Dataset and benchmarks for real-world 3d object un- derstanding

Jasmine Collins, Shubham Goel, Kenan Deng, Achlesh- war Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al. Abo: Dataset and benchmarks for real-world 3d object un- derstanding. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21126– 21136, 2022. 6, 4

work page 2022
[10]

Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018

Blender Online Community.Blender — a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam, 2018. 6, 4

work page 2018
[11]

Spconv: Spatially sparse convolu- tion library.https://github.com/traveller59/ spconv, 2022

Spconv Contributors. Spconv: Spatially sparse convolu- tion library.https://github.com/traveller59/ spconv, 2022. 3, 4

work page 2022
[12]

Objaverse: A universe of annotated 3d objects

Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023. 3

work page 2023
[13]

Objaverse-xl: A universe of 10m+ 3d objects.Advances in Neural Informa- tion Processing Systems, 36, 2024

Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram V oleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects.Advances in Neural Informa- tion Processing Systems, 36, 2024. 3, 6, 4

work page 2024
[14]

Deformed implicit field: Modeling 3d shapes with learned dense correspon- dence

Yu Deng, Jiaolong Yang, and Xin Tong. Deformed implicit field: Modeling 3d shapes with learned dense correspon- dence. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition, pages 10286–10296,

work page
[15]

Scaling recti- fied flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InICML, 2024. 2

work page 2024
[16]

Introducing gemini 2.5 flash image: Our state-of-the-art image model.https://developers

Alisa Fortin, Guillaume Vernade, Kat Kampf, and Am- maar Reshi. Introducing gemini 2.5 flash image: Our state-of-the-art image model.https://developers. googleblog.com/en/introducing- gemini- 2- 5-flash-image/, 2025. Google Developer Blog. 6

work page 2025
[17]

3d-future: 3d fur- niture shape with texture.International Journal of Computer Vision, pages 1–25, 2021

Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3d-future: 3d fur- niture shape with texture.International Journal of Computer Vision, pages 1–25, 2021. 4 9

work page 2021
[18]

Submanifold Sparse Convolutional Networks

Benjamin Graham and Laurens Van der Maaten. Sub- manifold sparse convolutional networks.arXiv preprint arXiv:1706.01307, 2017. 6

work page internal anchor Pith review Pith/arXiv arXiv 2017
[19]

3dgen: Triplane latent diffusion for textured mesh generation

Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Bar- las O˘guz. 3dgen: Triplane latent diffusion for textured mesh generation.arXiv preprint arXiv:2303.05371, 2023. 3

work page arXiv 2023
[20]

Gvgen: Text-to-3d generation with volumetric rep- resentation

Xianglong He, Junyi Chen, Sida Peng, Di Huang, Yang- guang Li, Xiaoshui Huang, Chun Yuan, Wanli Ouyang, and Tong He. Gvgen: Text-to-3d generation with volumetric rep- resentation. InECCV, 2024. 2

work page 2024
[21]

Sparseflex: High-resolution and arbitrary-topology 3d shape modeling.arXiv preprint arXiv:2503.21732, 2025

Xianglong He, Zi-Xin Zou, Chia-Hao Chen, Yuan-Chen Guo, Ding Liang, Chun Yuan, Wanli Ouyang, Yan-Pei Cao, and Yangguang Li. Sparseflex: High-resolution and arbitrary-topology 3d shape modeling.arXiv preprint arXiv:2503.21732, 2025. 2, 3, 5, 6

work page arXiv 2025
[22]

Neural wavelet-domain diffusion for 3d shape generation

Ka-Hei Hui, Ruihui Li, Jingyu Hu, and Chi-Wing Fu. Neural wavelet-domain diffusion for 3d shape generation. InSIG- GRAPH Asia 2022 Conference Papers, pages 1–9, 2022. 2

work page 2022
[23]

Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material

Team Hunyuan3D, Shuhui Yang, Mingxin Yang, Yifei Feng, Xin Huang, Sheng Zhang, Zebin He, Di Luo, Haolin Liu, Yunfei Zhao, et al. Hunyuan3d 2.1: From images to high- fidelity 3d assets with production-ready pbr material.arXiv preprint arXiv:2506.15442, 2025. 3, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Perceiver IO: A General Architecture for Structured Inputs & Outputs

Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Kop- pula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver io: A general architecture for structured inputs & outputs.arXiv preprint arXiv:2107.14795, 2021. 3

work page internal anchor Pith review Pith/arXiv arXiv 2021
[25]

Dual contouring of hermite data

Tao Ju, Frank Losasso, Scott Schaefer, and Joe Warren. Dual contouring of hermite data. InProceedings of the 29th an- nual conference on Computer graphics and interactive tech- niques, pages 339–346, 2002. 3, 4

work page 2002
[26]

Shap-E: Generating Conditional 3D Implicit Functions

Heewoo Jun and Alex Nichol. Shap-e: Generat- ing conditional 3d implicit functions.arXiv preprint arXiv:2305.02463, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,

work page
[28]

Chang, and Manolis Savva

Mukul Khanna*, Yongsen Mao*, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X. Chang, and Manolis Savva. Habitat Synthetic Scenes Dataset (HSSD-200): An Analy- sis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation.arXiv preprint, 2023. 6, 4

work page 2023
[29]

Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation

Yushi Lan, Fangzhou Hong, Shuai Yang, Shangchen Zhou, Xuyi Meng, Bo Dai, Xingang Pan, and Chen Change Loy. Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation. InECCV, 2024. 3

work page 2024
[30]

Gaussiananything: Interactive point cloud latent diffu- sion for 3d generation

Yushi Lan, Shangchen Zhou, Zhaoyang Lyu, Fangzhou Hong, Shuai Yang, Bo Dai, Xingang Pan, and Chen Change Loy. Gaussiananything: Interactive point cloud latent diffu- sion for 3d generation. InICLR, 2025. 3

work page 2025
[31]

2025.doi:10.48550/arXiv.2405.14979

Weiyu Li, Jiarui Liu, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, and Xiaoxiao Long. Craftsman: High-fidelity mesh generation with 3d native generation and interactive geometry refiner.arXiv preprint arXiv:2405.14979, 2024. 3

work page arXiv 2024
[32]

Step1x-3d: Towards high-fidelity and controllable generation of textured 3d assets

Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, et al. Step1x-3d: Towards high-fidelity and controllable generation of textured 3d assets. arXiv preprint arXiv:2505.07747, 2025.
[33]

TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, et al. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models. arXiv preprint arXiv:2502.06608, 2025.
[34]

Sparc3d: Sparse representation and construction for high-resolution 3d shapes modeling

Zhihao Li, Yufei Wang, Heliang Zheng, Yihao Luo, and Bihan Wen. Sparc3d: Sparse representation and construction for high-resolution 3d shapes modeling. arXiv preprint arXiv:2505.14521, 2025.
[35]

Flow matching for generative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In ICLR, 2023.
[36]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023.
[37]

A convnet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986,
[38]

Decoupled Weight Decay Regularization

I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[39]

Diffusion probabilistic models for 3d point cloud generation

Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2837–2845, 2021.
[40]

Lt3sd: Latent trees for 3d scene diffusion

Quan Meng, Lei Li, Matthias Nießner, and Angela Dai. Lt3sd: Latent trees for 3d scene diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 650–660, 2025.
[41]

Occupancy networks: Learning 3d reconstruction in function space

Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4460–4470, 2019.
[42]

Nerf: Representing scenes as neural radiance fields for view synthesis

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[43]

Diffrf: Rendering-guided 3d radiance field diffusion

Norman Müller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4328–4338, 2023.
[44]

Extracting triangular 3d models, materials, and lighting from images

Jacob Munkberg, Jon Hasselgren, Tianchang Shen, Jun Gao, Wenzheng Chen, Alex Evans, Thomas Müller, and Sanja Fidler. Extracting triangular 3d models, materials, and lighting from images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8280–8290, 2022.
[45]

Polygen: An autoregressive generative model of 3d meshes

Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter Battaglia. Polygen: An autoregressive generative model of 3d meshes. In International conference on machine learning, pages 7220–7229. PMLR, 2020.
[46]

Point-E: A System for Generating 3D Point Clouds from Complex Prompts

Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
[47]

Autodecoding latent 3d diffusion models

Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc V Gool, and Sergey Tulyakov. Autodecoding latent 3d diffusion models. Advances in Neural Information Processing Systems, 36:67021–67047, 2023.
[48]

Deepsdf: Learning continuous signed distance functions for shape representation

Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 165–174, 2019.
[49]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205,
[50]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[51]

Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies

Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4209–4219, 2024.
[52]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[53]

Flexible isosurface extraction for gradient-based mesh optimization

Tianchang Shen, Jacob Munkberg, Jon Hasselgren, Kangxue Yin, Zian Wang, Wenzheng Chen, Zan Gojcic, Sanja Fidler, Nicholas Sharp, and Jun Gao. Flexible isosurface extraction for gradient-based mesh optimization. ACM Trans. Graph., 42(4), 2023.
[54]

DINOv3

Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025.
[55]

Sketchfab - the best 3d viewer on the web

Sketchfab. Sketchfab - the best 3d viewer on the web. https://sketchfab.com/, 2025.
[56]

Using shape to categorize: Low-shot learning with an explicit shape bias

Stefan Stojanov, Anh Thai, and James M Rehg. Using shape to categorize: Low-shot learning with an explicit shape bias. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1798–1808, 2021.
[57]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063,
[58]

Torchsparse++: Efficient training and inference framework for sparse convolution on gpus

Haotian Tang, Shang Yang, Zhijian Liu, Ke Hong, Zhongming Yu, Xiuyu Li, Guohao Dai, Yu Wang, and Song Han. Torchsparse++: Efficient training and inference framework for sparse convolution on gpus. In IEEE/ACM International Symposium on Microarchitecture (MICRO), 2023.
[59]

Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder

Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, and Baining Guo. Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder. arXiv preprint arXiv:2312.11459, 2023.
[60]

Triton: an intermediate language and compiler for tiled neural network computations

Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019.
[61]

Lion: Latent point diffusion models for 3d shape generation

Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, Karsten Kreis, et al. Lion: Latent point diffusion models for 3d shape generation. Advances in Neural Information Processing Systems, 35:10021–10039, 2022.
[62]

fvdb: A deep-learning framework for sparse, large scale, and high performance spatial intelligence

Francis Williams, Jiahui Huang, Jonathan Swartz, Gergely Klar, Vijay Thakkar, Matthew Cong, Xuanchi Ren, Ruilong Li, Clement Fuji-Tsang, Sanja Fidler, et al. fvdb: A deep-learning framework for sparse, large scale, and high performance spatial intelligence. ACM Transactions on Graphics (TOG), 43(4):1–15, 2024.
[63]

Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer

Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, and Yao Yao. Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer. arXiv preprint arXiv:2405.14832, 2024.
[64]

Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention

Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Yikang Yang, Yajie Bao, Jiachen Qian, Siyu Zhu, Xun Cao, Philip Torr, et al. Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention. arXiv preprint arXiv:2505.17412, 2025.
[65]

Structured 3d latents for scalable and versatile 3d generation

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21469–21480, 2025.
[66]

Octfusion: Octree-based diffusion models for 3d shape generation

Bojun Xiong, Si-Tong Wei, Xin-Yang Zheng, Yan-Pei Cao, Zhouhui Lian, and Peng-Shuai Wang. Octfusion: Octree-based diffusion models for 3d shape generation. In Computer Graphics Forum, page e70198. Wiley Online Library, 2025.
[67]

Ulip-2: Towards scalable multimodal pre-training for 3d understanding

Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, et al. Ulip-2: Towards scalable multimodal pre-training for 3d understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27091–27101, 2024.
[68]

Atlas gaussians diffusion for 3d generation with infinite number of points

Haitao Yang, Yuan Dong, Hanwen Jiang, Dejia Xu, Georgios Pavlakos, and Qixing Huang. Atlas gaussians diffusion for 3d generation with infinite number of points. arXiv preprint arXiv:2408.13055, 2024.
[69]

Pandora3d: A comprehensive framework for high-quality 3d shape and texture generation

Jiayu Yang, Taizhang Shang, Weixuan Sun, Xibin Song, Ziang Cheng, Senbo Wang, Shenzhou Chen, Weizhe Liu, Hongdong Li, and Pan Ji. Pandora3d: A comprehensive framework for high-quality 3d shape and texture generation. arXiv preprint arXiv:2502.14247, 2025.
[70]

Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging

Chongjie Ye, Yushuang Wu, Ziteng Lu, Jiahao Chang, Xiaoyang Guo, Jiaqing Zhou, Hao Zhao, and Xiaoguang Han. Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging. arXiv preprint arXiv:2503.22236, 3:2,
[71]

Texgen: a generative diffusion model for mesh textures

Xin Yu, Ze Yuan, Yuan-Chen Guo, Ying-Tian Liu, Jianhui Liu, Yangguang Li, Yan-Pei Cao, Ding Liang, and Xiaojuan Qi. Texgen: a generative diffusion model for mesh textures. ACM Transactions on Graphics (TOG), 43(6):1–14, 2024.
[72]

Mip-splatting: Alias-free 3d gaussian splatting

Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19447–19456,
[73]

Root mean square layer normalization

Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
[74]

3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models

Biao Zhang, Jiapeng Tang, Matthias Niessner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–16, 2023.
[75]

Gaussiancube: Structuring gaussian splatting using optimal transport for 3d generative modeling

Bowen Zhang, Yiji Cheng, Jiaolong Yang, Chunyu Wang, Feng Zhao, Yansong Tang, Dong Chen, and Baining Guo. Gaussiancube: Structuring gaussian splatting using optimal transport for 3d generative modeling. arXiv preprint arXiv:2403.19655, 2024.
[76]

Clay: A controllable large-scale generative model for creating high-quality 3d assets

Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets. ACM Transactions on Graphics (TOG), 43(4):1–20, 2024.
[77]

Texverse: A universe of 3d objects with high-resolution textures

Yibo Zhang, Li Zhang, Rui Ma, and Nan Cao. Texverse: A universe of 3d objects with high-resolution textures. arXiv preprint arXiv:2508.10868, 2025.
[78]

Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation

Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, and Shenghua Gao. Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation. Advances in Neural Information Processing Systems, 36, 2024.
[79]

Locally attentional sdf diffusion for controllable 3d shape generation

Xin-Yang Zheng, Hao Pan, Peng-Shuai Wang, Xin Tong, Yang Liu, and Heung-Yeung Shum. Locally attentional sdf diffusion for controllable 3d shape generation. ACM Transactions on Graphics (SIGGRAPH), 42(4), 2023.
[80]

Uni3d: Exploring unified 3d representation at scale

Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3d: Exploring unified 3d representation at scale. arXiv preprint arXiv:2310.06773,



[26] [26]

Shap-E: Generating Conditional 3D Implicit Functions

Heewoo Jun and Alex Nichol. Shap-e: Generat- ing conditional 3d implicit functions.arXiv preprint arXiv:2305.02463, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023

[27] [27]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,

work page

[28] [28]

Chang, and Manolis Savva

Mukul Khanna*, Yongsen Mao*, Hanxiao Jiang, Sanjay Haresh, Brennan Shacklett, Dhruv Batra, Alexander Clegg, Eric Undersander, Angel X. Chang, and Manolis Savva. Habitat Synthetic Scenes Dataset (HSSD-200): An Analy- sis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation.arXiv preprint, 2023. 6, 4

work page 2023

[29] [29]

Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation

Yushi Lan, Fangzhou Hong, Shuai Yang, Shangchen Zhou, Xuyi Meng, Bo Dai, Xingang Pan, and Chen Change Loy. Ln3diff: Scalable latent neural fields diffusion for speedy 3d generation. InECCV, 2024. 3

work page 2024

[30] [30]

Gaussiananything: Interactive point cloud latent diffu- sion for 3d generation

Yushi Lan, Shangchen Zhou, Zhaoyang Lyu, Fangzhou Hong, Shuai Yang, Bo Dai, Xingang Pan, and Chen Change Loy. Gaussiananything: Interactive point cloud latent diffu- sion for 3d generation. InICLR, 2025. 3

work page 2025

[31] [31]

2025.doi:10.48550/arXiv.2405.14979

Weiyu Li, Jiarui Liu, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan, and Xiaoxiao Long. Craftsman: High-fidelity mesh generation with 3d native generation and interactive geometry refiner.arXiv preprint arXiv:2405.14979, 2024. 3

work page arXiv 2024

[32] [32]

2025.doi:10.48550/arXiv.2505.07747

Weiyu Li, Xuanyang Zhang, Zheng Sun, Di Qi, Hao Li, Wei Cheng, Weiwei Cai, Shihao Wu, Jiarui Liu, Zihao Wang, et al. Step1x-3d: Towards high-fidelity and con- trollable generation of textured 3d assets.arXiv preprint arXiv:2505.07747, 2025. 3, 7

work page arXiv 2025

[33] [33]

TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models

Yangguang Li, Zi-Xin Zou, Zexiang Liu, Dehu Wang, Yuan Liang, Zhipeng Yu, Xingchao Liu, Yuan-Chen Guo, Ding Liang, Wanli Ouyang, et al. Triposg: High-fidelity 3d shape synthesis using large-scale rectified flow models.arXiv preprint arXiv:2502.06608, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [34]

Sparc: Sparse representation and construc- tion for high-resolution 3d shapes modeling.arXiv preprint arXiv:2505.14521, 2025

Zhihao Li, Yufei Wang, Heliang Zheng, Yihao Luo, and Bihan Wen. Sparc3d: Sparse representation and construc- tion for high-resolution 3d shapes modeling.arXiv preprint arXiv:2505.14521, 2025. 2, 3

work page arXiv 2025

[35] [35]

Flow matching for generative modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximil- ian Nickel, and Matthew Le. Flow matching for generative modeling. InICLR, 2023. 6, 3

work page 2023

[36] [36]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InICLR, 2023. 3

work page 2023

[37] [37]

A convnet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feicht- enhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InProceedings of the IEEE/CVF conference on com- puter vision and pattern recognition, pages 11976–11986,

work page

[38] [38]

Decoupled Weight Decay Regularization

I Loshchilov. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 6

work page internal anchor Pith review Pith/arXiv arXiv 2017

[39] [39]

Diffusion probabilistic models for 3d point cloud generation

Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2837–2845, 2021. 2

work page 2021

[40] [40]

Lt3sd: Latent trees for 3d scene diffusion

Quan Meng, Lei Li, Matthias Nießner, and Angela Dai. Lt3sd: Latent trees for 3d scene diffusion. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 650–660, 2025. 3

work page 2025

[41] [41]

Occupancy networks: Learning 3d reconstruction in function space

Lars Mescheder, Michael Oechsle, Michael Niemeyer, Se- bastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4460–4470, 2019. 2

work page 2019

[42] [42]

Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 2

work page 2021

[43] [43]

Diffrf: Rendering-guided 3d radiance field diffusion

Norman M ¨uller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4328–4338, 2023. 2

work page 2023

[44] [44]

Extracting triangular 3d models, materials, and lighting from images

Jacob Munkberg, Jon Hasselgren, Tianchang Shen, Jun Gao, Wenzheng Chen, Alex Evans, Thomas M¨uller, and Sanja Fi- dler. Extracting triangular 3d models, materials, and lighting from images. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8280– 8290, 2022. 6 10

work page 2022

[45] [45]

Polygen: An autoregressive generative model of 3d meshes

Charlie Nash, Yaroslav Ganin, SM Ali Eslami, and Peter Battaglia. Polygen: An autoregressive generative model of 3d meshes. InInternational conference on machine learning, pages 7220–7229. PMLR, 2020. 2

work page 2020

[46] [46]

Point-E: A System for Generating 3D Point Clouds from Complex Prompts

Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generat- ing 3d point clouds from complex prompts.arXiv preprint arXiv:2212.08751, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[47] [47]

Au- todecoding latent 3d diffusion models.Advances in Neural Information Processing Systems, 36:67021–67047, 2023

Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc V Gool, and Sergey Tulyakov. Au- todecoding latent 3d diffusion models.Advances in Neural Information Processing Systems, 36:67021–67047, 2023. 3

work page 2023

[48] [48]

Deepsdf: Learning con- tinuous signed distance functions for shape representation

Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning con- tinuous signed distance functions for shape representation. InProceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 165–174, 2019. 2

work page 2019

[49]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205,

[50]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 7, 6

[51]

Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies

Xuanchi Ren, Jiahui Huang, Xiaohui Zeng, Ken Museth, Sanja Fidler, and Francis Williams. Xcube: Large-scale 3d generative modeling using sparse voxel hierarchies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4209–4219, 2024. 3, 5

[52]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 3, 5

[53]

Flexible isosurface extraction for gradient-based mesh optimization

Tianchang Shen, Jacob Munkberg, Jon Hasselgren, Kangxue Yin, Zian Wang, Wenzheng Chen, Zan Gojcic, Sanja Fidler, Nicholas Sharp, and Jun Gao. Flexible isosurface extraction for gradient-based mesh optimization. ACM Trans. Graph., 42(4), 2023. 2, 4

[54]

DINOv3

Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025. 6

[55]

Sketchfab - the best 3d viewer on the web

Sketchfab. Sketchfab - the best 3d viewer on the web. https://sketchfab.com/, 2025. 6, 4

[56]

Using shape to categorize: Low-shot learning with an explicit shape bias

Stefan Stojanov, Anh Thai, and James M Rehg. Using shape to categorize: Low-shot learning with an explicit shape bias. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1798–1808, 2021. 6, 4

[57]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063,

[58]

Torchsparse++: Efficient training and inference framework for sparse convolution on gpus

Haotian Tang, Shang Yang, Zhijian Liu, Ke Hong, Zhongming Yu, Xiuyu Li, Guohao Dai, Yu Wang, and Song Han. Torchsparse++: Efficient training and inference framework for sparse convolution on gpus. In IEEE/ACM International Symposium on Microarchitecture (MICRO), 2023. 4

[59]

Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder

Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, and Baining Guo. Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder. arXiv preprint arXiv:2312.11459, 2023. 2

[60]

Triton: an intermediate language and compiler for tiled neural network computations

Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019. 6, 3

[61]

Lion: Latent point diffusion models for 3d shape generation

Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, Karsten Kreis, et al. Lion: Latent point diffusion models for 3d shape generation. Advances in Neural Information Processing Systems, 35:10021–10039, 2022. 3

[62]

fvdb: A deep-learning framework for sparse, large scale, and high performance spatial intelligence

Francis Williams, Jiahui Huang, Jonathan Swartz, Gergely Klar, Vijay Thakkar, Matthew Cong, Xuanchi Ren, Ruilong Li, Clement Fuji-Tsang, Sanja Fidler, et al. fvdb: A deep-learning framework for sparse, large scale, and high performance spatial intelligence. ACM Transactions on Graphics (TOG), 43(4):1–15, 2024. 4

[63]

Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer

Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Jingxi Xu, Philip Torr, Xun Cao, and Yao Yao. Direct3d: Scalable image-to-3d generation via 3d latent diffusion transformer. arXiv preprint arXiv:2405.14832, 2024. 3

[64]

Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention

Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Yikang Yang, Yajie Bao, Jiachen Qian, Siyu Zhu, Xun Cao, Philip Torr, et al. Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention. arXiv preprint arXiv:2505.17412, 2025. 2, 3, 6, 7

[65]

Structured 3d latents for scalable and versatile 3d generation

Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21469–21480, 2025. 2, 3, 5, 6, 7, 4

[66]

Octfusion: Octree-based diffusion models for 3d shape generation

Bojun Xiong, Si-Tong Wei, Xin-Yang Zheng, Yan-Pei Cao, Zhouhui Lian, and Peng-Shuai Wang. Octfusion: Octree-based diffusion models for 3d shape generation. In Computer Graphics Forum, page e70198. Wiley Online Library, 2025. 3

[67]

Ulip-2: Towards scalable multimodal pre-training for 3d understanding

Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, et al. Ulip-2: Towards scalable multimodal pre-training for 3d understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27091–27101, 2024. 7, 6

[68]

Atlas gaussians diffusion for 3d generation with infinite number of points

Haitao Yang, Yuan Dong, Hanwen Jiang, Dejia Xu, Georgios Pavlakos, and Qixing Huang. Atlas gaussians diffusion for 3d generation with infinite number of points. arXiv preprint arXiv:2408.13055, 2024. 3

[69]

Pandora3d: A comprehensive framework for high-quality 3d shape and texture generation

Jiayu Yang, Taizhang Shang, Weixuan Sun, Xibin Song, Ziang Cheng, Senbo Wang, Shenzhou Chen, Weizhe Liu, Hongdong Li, and Pan Ji. Pandora3d: A comprehensive framework for high-quality 3d shape and texture generation. arXiv preprint arXiv:2502.14247, 2025. 3

[70]

Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging

Chongjie Ye, Yushuang Wu, Ziteng Lu, Jiahao Chang, Xiaoyang Guo, Jiaqing Zhou, Hao Zhao, and Xiaoguang Han. Hi3dgen: High-fidelity 3d geometry generation from images via normal bridging. arXiv preprint arXiv:2503.22236, 3:2,

[71]

Texgen: a generative diffusion model for mesh textures

Xin Yu, Ze Yuan, Yuan-Chen Guo, Ying-Tian Liu, Jianhui Liu, Yangguang Li, Yan-Pei Cao, Ding Liang, and Xiaojuan Qi. Texgen: a generative diffusion model for mesh textures. ACM Transactions on Graphics (TOG), 43(6):1–14, 2024. 7

[72]

Mip-splatting: Alias-free 3d gaussian splatting

Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19447–19456,

[73]

Root mean square layer normalization

Biao Zhang and Rico Sennrich. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019. 2

[74]

3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models

Biao Zhang, Jiapeng Tang, Matthias Niessner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neural fields and generative diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–16, 2023. 3

[75]

Gaussiancube: Structuring gaussian splatting using optimal transport for 3d generative modeling

Bowen Zhang, Yiji Cheng, Jiaolong Yang, Chunyu Wang, Feng Zhao, Yansong Tang, Dong Chen, and Baining Guo. Gaussiancube: Structuring gaussian splatting using optimal transport for 3d generative modeling. arXiv preprint arXiv:2403.19655, 2024. 2

[76]

Clay: A controllable large-scale generative model for creating high-quality 3d assets

Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu, and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets. ACM Transactions on Graphics (TOG), 43(4):1–20, 2024. 2, 3

[77]

Texverse: A universe of 3d objects with high-resolution textures

Yibo Zhang, Li Zhang, Rui Ma, and Nan Cao. Texverse: A universe of 3d objects with high-resolution textures. arXiv preprint arXiv:2508.10868, 2025. 3, 6, 4

[78]

Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation

Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu, and Shenghua Gao. Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation. Advances in Neural Information Processing Systems, 36, 2024. 3

[79]

Locally attentional sdf diffusion for controllable 3d shape generation

Xin-Yang Zheng, Hao Pan, Peng-Shuai Wang, Xin Tong, Yang Liu, and Heung-Yeung Shum. Locally attentional sdf diffusion for controllable 3d shape generation. ACM Transactions on Graphics (SIGGRAPH), 42(4), 2023. 2

[80]

Uni3d: Exploring unified 3d representation at scale

Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3d: Exploring unified 3d representation at scale. arXiv preprint arXiv:2310.06773, 2023. doi: 10.48550/arXiv.2310.06773