Native and Compact Structured Latents for 3D Generation

Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng · 2025 · cs.CV · arXiv 2512.14692

Canonical reference. 80% of classified citations from Pith papers (8 of 10) cite this work as background.

28 Pith papers citing it

Background 80% of classified citations

abstract

Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper presents an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.

citation-role summary

background 8 method 2

citation-polarity summary

background 8 use method 2

representative citing papers

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

cs.CR · 2026-05-10 · conditional · novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

Feedforward 3D Editing Learns from Semantic-Part Transformation

cs.CV · 2026-05-26 · unverdicted · novelty 7.0 · 2 refs

Pxform provides 100K semantic-part 3D edit pairs; PartFlow uses them to deliver feedforward 3D editing with improved fidelity and preservation over prior methods.

GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

GenRecon lifts object-level generative priors to scene-scale reconstruction by chunking scenes and using projection-based conditioning on multi-view features, claiming 16% better results than prior methods.

The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting

cs.CV · 2026-05-18 · conditional · novelty 7.0

MixCount provides a scalable synthetic dataset for mixed-object counting that improves state-of-the-art models on real benchmarks, cutting MAE by 20.14% on FSC-147 and 18.3% on PairTally.

Count Anything at Any Granularity

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for improved accuracy.

Velocity-Space 3D Asset Editing

cs.GR · 2026-05-08 · unverdicted · novelty 7.0

VS3D performs local 3D asset editing by injecting reconstruction-anchored source signals, partial-mean guidance, and twin-agreement residuals into the velocity sampler to control edit strength and preserve identity.

Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

Helix4D: Complex 4D Mesh Generation

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

Helix4D generates high-quality dynamic 4D meshes from videos by extending Trellis2 with sliding-window cross-frame attention anchored on the first frame and a repurposed 4D temporal encoding.

PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

PhysX-Omni unifies simulation-ready 3D asset generation across rigid, deformable, and articulated objects via a new geometry representation, the PhysXVerse dataset, and the PhysX-Bench evaluation suite.

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

cs.CV · 2026-05-20 · unverdicted · novelty 6.0 · 2 refs

Stream3D is a training-free method that maintains a fixed-size evidential memory of past frames to convert frozen view-conditioned 3D generators into consistent streaming generators.

ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

ROAR-3D adds a token-wise view router and dual-stream attention to pretrained single-view 3D generators so they can use arbitrary unposed images for higher-fidelity output.

Pixal3D: Pixel-Aligned 3D Generation from Images

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.

Generative 3D Gaussians with Learned Density Control

cs.GR · 2026-05-08 · unverdicted · novelty 6.0

DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.

LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

cs.CV · 2026-04-06 · conditional · novelty 6.0

LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.

VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching

cs.CV · 2026-05-29 · unverdicted · novelty 5.0

VolFill uses a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into latent space and a latent Diffusion Transformer to denoise complete scenes, conditioned on geometry foundation models, outperforming baselines on SCRREAM and NRGB-D datasets.

SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation

cs.CV · 2026-05-28 · unverdicted · novelty 5.0

SuperVoxelGPT creates shape-adaptive, deterministically ordered supervoxel tokens via saliency-guided CVT, cutting sequence length to 12.8% of uniform voxels while claiming SOTA quality and 10x speedup on Trellis-500K.

AssetGen: Deployable 3D Asset Generation at Interactive Speed

cs.GR · 2026-05-22 · unverdicted · novelty 5.0

AssetGen is a system that produces deployable 3D assets including meshes, baked normals, and textures from a single reference image in under 30 seconds via a coarse-to-refine VecSet pipeline and co-designed optimizations.

CMAG: Concept-Scaffolded Retrieval for Marketplace Avatar Generation

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

CMAG combines 3D concept scaffolding, prompt decomposition, taxonomy routing, hybrid retrieval, and agentic VLM verification to assemble topologically consistent avatars from catalog assets given free-form text prompts.

EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

cs.CV · 2026-05-16 · unverdicted · novelty 5.0

EVA01 introduces a Mixture-of-Transformers model that natively adds 3D mesh understanding, generation, and multi-turn editing to MLLMs by decoupling understanding and generation experts with shared global self-attention.

Pose Tracking with a Foundation Pose Model and an Ensemble Directional Kalman Filter

cs.LG · 2026-05-04 · unverdicted · novelty 5.0

EnDKF combines ensemble Kalman filtering with directional statistics and unit quaternions to achieve lower pose tracking error than raw measurements in synthetic constant-velocity tests and FoundationPose-based head tracking.

From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

cs.GR · 2026-04-26 · unverdicted · novelty 5.0 · 2 refs

The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.

Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation

cs.CV · 2026-04-20 · unverdicted · novelty 5.0

Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples sparse-view multiview generation with 3D Gaussian lifting.

citing papers explorer

Showing 24 of 24 citing papers after filters.

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · none · ref 11 · internal anchor

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

Feedforward 3D Editing Learns from Semantic-Part Transformation

cs.CV · 2026-05-26 · unverdicted · none · ref 7 · 2 links · internal anchor

Pxform provides 100K semantic-part 3D edit pairs; PartFlow uses them to deliver feedforward 3D editing with improved fidelity and preservation over prior methods.

GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

cs.CV · 2026-05-22 · unverdicted · none · ref 25 · internal anchor

GenRecon lifts object-level generative priors to scene-scale reconstruction by chunking scenes and using projection-based conditioning on multi-view features, claiming 16% better results than prior methods.

Count Anything at Any Granularity

cs.CV · 2026-05-11 · unverdicted · none · ref 74 · internal anchor

Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for improved accuracy.

Velocity-Space 3D Asset Editing

cs.GR · 2026-05-08 · unverdicted · none · ref 7 · internal anchor

VS3D performs local 3D asset editing by injecting reconstruction-anchored source signals, partial-mean guidance, and twin-agreement residuals into the velocity sampler to control edit strength and preserve identity.

Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

cs.CV · 2026-04-15 · unverdicted · none · ref 51 · internal anchor

A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

cs.CV · 2026-04-13 · unverdicted · none · ref 77 · internal anchor

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

Helix4D: Complex 4D Mesh Generation

cs.CV · 2026-05-25 · unverdicted · none · ref 30 · internal anchor

Helix4D generates high-quality dynamic 4D meshes from videos by extending Trellis2 with sliding-window cross-frame attention anchored on the first frame and a repurposed 4D temporal encoding.

PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

cs.CV · 2026-05-20 · unverdicted · none · ref 6 · internal anchor

PhysX-Omni unifies simulation-ready 3D asset generation across rigid, deformable, and articulated objects via a new geometry representation, the PhysXVerse dataset, and the PhysX-Bench evaluation suite.

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

cs.CV · 2026-05-20 · unverdicted · none · ref 78 · 2 links · internal anchor

Stream3D is a training-free method that maintains a fixed-size evidential memory of past frames to convert frozen view-conditioned 3D generators into consistent streaming generators.

ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation

cs.CV · 2026-05-20 · unverdicted · none · ref 62 · internal anchor

ROAR-3D adds a token-wise view router and dual-stream attention to pretrained single-view 3D generators so they can use arbitrary unposed images for higher-fidelity output.

Pixal3D: Pixel-Aligned 3D Generation from Images

cs.CV · 2026-05-11 · unverdicted · none · ref 12 · internal anchor

Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.

Generative 3D Gaussians with Learned Density Control

cs.GR · 2026-05-08 · unverdicted · none · ref 62 · internal anchor

DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.

VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching

cs.CV · 2026-05-29 · unverdicted · none · ref 84 · internal anchor

VolFill uses a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into latent space and a latent Diffusion Transformer to denoise complete scenes, conditioned on geometry foundation models, outperforming baselines on SCRREAM and NRGB-D datasets.

SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation

cs.CV · 2026-05-28 · unverdicted · none · ref 41 · internal anchor

SuperVoxelGPT creates shape-adaptive, deterministically ordered supervoxel tokens via saliency-guided CVT, cutting sequence length to 12.8% of uniform voxels while claiming SOTA quality and 10x speedup on Trellis-500K.

AssetGen: Deployable 3D Asset Generation at Interactive Speed

cs.GR · 2026-05-22 · unverdicted · none · ref 23 · internal anchor

AssetGen is a system that produces deployable 3D assets including meshes, baked normals, and textures from a single reference image in under 30 seconds via a coarse-to-refine VecSet pipeline and co-designed optimizations.

CMAG: Concept-Scaffolded Retrieval for Marketplace Avatar Generation

cs.CV · 2026-05-18 · unverdicted · none · ref 17 · internal anchor

CMAG combines 3D concept scaffolding, prompt decomposition, taxonomy routing, hybrid retrieval, and agentic VLM verification to assemble topologically consistent avatars from catalog assets given free-form text prompts.

EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

cs.CV · 2026-05-16 · unverdicted · none · ref 64 · internal anchor

EVA01 introduces a Mixture-of-Transformers model that natively adds 3D mesh understanding, generation, and multi-turn editing to MLLMs by decoupling understanding and generation experts with shared global self-attention.

Pose Tracking with a Foundation Pose Model and an Ensemble Directional Kalman Filter

cs.LG · 2026-05-04 · unverdicted · none · ref 30 · internal anchor

EnDKF combines ensemble Kalman filtering with directional statistics and unit quaternions to achieve lower pose tracking error than raw measurements in synthetic constant-velocity tests and FoundationPose-based head tracking.

From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

cs.GR · 2026-04-26 · unverdicted · none · ref 34 · 2 links · internal anchor

The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.

Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation

cs.CV · 2026-04-20 · unverdicted · none · ref 38 · internal anchor

Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples sparse-view multiview generation with 3D Gaussian lifting.

Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation

cs.CV · 2026-04-10 · unverdicted · none · ref 49 · internal anchor

Hitem3D 2.0 combines multi-view image synthesis with native 3D texture projection to improve completeness, cross-view consistency, and geometry alignment over prior methods.

Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation

cs.GR · 2026-04-22 · unverdicted · none · ref 21 · internal anchor

Seed3D 2.0 advances 3D content generation via a coarse-to-fine geometry pipeline, unified PBR material model, and simulation-ready scene tools, reporting 69-89.9% win rates over commercial systems in human studies.

3D Generation for Embodied AI and Robotic Simulation: A Survey

cs.RO · 2026-04-29 · unverdicted · none · ref 102 · 3 links · internal anchor

The paper surveys 3D generation techniques for embodied AI and robotics, categorizing them into data generation, simulation environments, and sim-to-real bridging while identifying bottlenecks in physical validity and transfer.
