hub Canonical reference

Native and Compact Structured Latents for 3D Generation

Jianfeng Xiang, Xiaoxue Chen, Sicheng Xu, Ruicheng Wang, Zelong Lv, Yu Deng · 2025 · cs.CV · arXiv 2512.14692

Canonical reference. 80% of citing Pith papers cite this work as background.

25 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 25 citing papers arXiv PDF

abstract

Recent advancements in 3D generative modeling have significantly improved the generation realism, yet the field is still hampered by existing representations, which struggle to capture assets with complex topologies and detailed appearance. This paper present an approach for learning a structured latent representation from native 3D data to address this challenge. At its core is a new sparse voxel structure called O-Voxel, an omni-voxel representation that encodes both geometry and appearance. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, while capturing comprehensive surface attributes beyond texture color, such as physically-based rendering parameters. Based on O-Voxel, we design a Sparse Compression VAE which provides a high spatial compression rate and a compact latent space. We train large-scale flow-matching models comprising 4B parameters for 3D generation using diverse public 3D asset datasets. Despite their scale, inference remains highly efficient. Meanwhile, the geometry and material quality of our generated assets far exceed those of existing models. We believe our approach offers a significant advancement in 3D generative modeling.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 method 2

citation-polarity summary

background 8 use method 2

representative citing papers

Rigel3D: Rig-aware Latents for Animation-Ready 3D Asset Generation

cs.GR · 2026-05-13 · unverdicted · novelty 8.0

Rigel3D jointly generates rigged 3D meshes with geometry, skeleton topology, joint positions, and skinning weights using coupled surface and skeleton latent representations for image-conditioned animation-ready asset synthesis.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

cs.CR · 2026-05-10 · conditional · novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.

GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

GenRecon lifts object-level generative priors to scene-scale reconstruction by chunking scenes and using projection-based conditioning on multi-view features, claiming 16% better results than prior methods.

The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting

cs.CV · 2026-05-18 · conditional · novelty 7.0

MixCount provides a scalable synthetic dataset for mixed-object counting that improves state-of-the-art models on real benchmarks, cutting MAE by 20.14% on FSC-147 and 18.3% on PairTally.

Count Anything at Any Granularity

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for improved accuracy.

Velocity-Space 3D Asset Editing

cs.GR · 2026-05-08 · unverdicted · novelty 7.0

VS3D performs local 3D asset editing by injecting reconstruction-anchored source signals, partial-mean guidance, and twin-agreement residuals into the velocity sampler to control edit strength and preserve identity.

Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

cs.CV · 2026-04-15 · unverdicted · novelty 7.0

A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in realism and consistency.

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

PhysX-Omni unifies simulation-ready 3D asset generation across rigid, deformable, and articulated objects via a new geometry representation, the PhysXVerse dataset, and the PhysX-Bench evaluation suite.

ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

ROAR-3D adds a token-wise view router and dual-stream attention to pretrained single-view 3D generators so they can use arbitrary unposed images for higher-fidelity output.

Pixal3D: Pixel-Aligned 3D Generation from Images

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.

Generative 3D Gaussians with Learned Density Control

cs.GR · 2026-05-08 · unverdicted · novelty 6.0

DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.

LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

cs.CV · 2026-04-06 · conditional · novelty 6.0

LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.

VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching

cs.CV · 2026-05-29 · unverdicted · novelty 5.0

VolFill uses a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into latent space and a latent Diffusion Transformer to denoise complete scenes, conditioned on geometry foundation models, outperforming baselines on SCRREAM and NRGB-D datasets.

SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation

cs.CV · 2026-05-28 · unverdicted · novelty 5.0

SuperVoxelGPT creates shape-adaptive, deterministically ordered supervoxel tokens via saliency-guided CVT, cutting sequence length to 12.8% of uniform voxels while claiming SOTA quality and 10x speedup on Trellis-500K.

CMAG: Concept-Scaffolded Retrieval for Marketplace Avatar Generation

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

CMAG combines 3D concept scaffolding, prompt decomposition, taxonomy routing, hybrid retrieval, and agentic VLM verification to assemble topologically consistent avatars from catalog assets given free-form text prompts.

EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers

cs.CV · 2026-05-16 · unverdicted · novelty 5.0

EVA01 introduces a Mixture-of-Transformers model that natively adds 3D mesh understanding, generation, and multi-turn editing to MLLMs by decoupling understanding and generation experts with shared global self-attention.

Pose Tracking with a Foundation Pose Model and an Ensemble Directional Kalman Filter

cs.LG · 2026-05-04 · unverdicted · novelty 5.0

EnDKF combines ensemble Kalman filtering with directional statistics and unit quaternions to achieve lower pose tracking error than raw measurements in synthetic constant-velocity tests and FoundationPose-based head tracking.

From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation

cs.GR · 2026-04-26 · unverdicted · novelty 5.0 · 2 refs

The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.

Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation

cs.CV · 2026-04-20 · unverdicted · novelty 5.0

Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples sparse-view multiview generation with 3D Gaussian lifting.

Hitem3D 2.0: Multi-View Guided Native 3D Texture Generation

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

Hitem3D 2.0 combines multi-view image synthesis with native 3D texture projection to improve completeness, cross-view consistency, and geometry alignment over prior methods.

Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation

cs.GR · 2026-04-22 · unverdicted · novelty 4.0

Seed3D 2.0 advances 3D content generation via a coarse-to-fine geometry pipeline, unified PBR material model, and simulation-ready scene tools, reporting 69-89.9% win rates over commercial systems in human studies.

3D Generation for Embodied AI and Robotic Simulation: A Survey

cs.RO · 2026-04-29 · unverdicted · novelty 2.0 · 3 refs

The paper surveys 3D generation techniques for embodied AI and robotics, categorizing them into data generation, simulation environments, and sim-to-real bridging while identifying bottlenecks in physical validity and transfer.

Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

cs.CV · 2026-05-20

citing papers explorer

Showing 3 of 3 citing papers after filters.

On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models cs.CR · 2026-05-10 · conditional · none · ref 47 · internal anchor
Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting cs.CV · 2026-05-18 · conditional · none · ref 55 · internal anchor
MixCount provides a scalable synthetic dataset for mixed-object counting that improves state-of-the-art models on real benchmarks, cutting MAE by 20.14% on FSC-147 and 18.3% on PairTally.
LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows cs.CV · 2026-04-06 · conditional · none · ref 66 · internal anchor
LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.

Native and Compact Structured Latents for 3D Generation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer