TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models
Canonical reference. 90% of citing Pith papers cite this work as background.
abstract
Recent advancements in diffusion techniques have propelled image and video generation to unprecedented levels of quality, significantly accelerating the deployment and application of generative AI. However, 3D shape generation technology has so far lagged behind, constrained by limitations in 3D data scale, complexity of 3D data processing, and insufficient exploration of advanced techniques in the 3D domain. Current approaches to 3D shape generation face substantial challenges in terms of output quality, generalization capability, and alignment with input conditions. We present TripoSG, a new streamlined shape diffusion paradigm capable of generating high-fidelity 3D meshes with precise correspondence to input images. Specifically, we propose: 1) A large-scale rectified flow transformer for 3D shape generation, achieving state-of-the-art fidelity through training on extensive, high-quality data. 2) A hybrid supervised training strategy combining SDF, normal, and eikonal losses for 3D VAE, achieving high-quality 3D reconstruction performance. 3) A data processing pipeline to generate 2 million high-quality 3D samples, highlighting the crucial rules for data quality and quantity in training 3D generative models. Through comprehensive experiments, we have validated the effectiveness of each component in our new framework. The seamless integration of these parts has enabled TripoSG to achieve state-of-the-art performance in 3D shape generation. The resulting 3D shapes exhibit enhanced detail due to high-resolution capabilities and demonstrate exceptional fidelity to input images. Moreover, TripoSG demonstrates improved versatility in generating 3D models from diverse image styles and contents, showcasing strong generalization capabilities. To foster progress and innovation in the field of 3D generation, we will make our model publicly available.
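The abstract names a hybrid supervision that combines SDF, normal, and eikonal losses for the 3D VAE but does not give the formulas. The following is a minimal NumPy sketch of what such a combined objective could look like, not the authors' implementation. The function names, the weights `w_normal` and `w_eik`, the cosine form of the normal term, and the finite-difference gradient are all assumptions made for illustration; only the eikonal term, which penalizes deviation of the SDF gradient norm from 1, is the standard definition.

```python
import numpy as np

def sphere_sdf(p, radius=0.5):
    """Signed distance from points p (N, 3) to a sphere centred at the origin."""
    return np.linalg.norm(p, axis=-1) - radius

def sdf_gradient(sdf, p, eps=1e-4):
    """Central finite-difference gradient of sdf at points p, shape (N, 3)."""
    grads = np.zeros_like(p)
    for axis in range(3):
        step = np.zeros(3)
        step[axis] = eps
        grads[:, axis] = (sdf(p + step) - sdf(p - step)) / (2.0 * eps)
    return grads

def hybrid_loss(pred_sdf, target_sdf, target_normals, p,
                w_normal=1.0, w_eik=0.1):
    """Hypothetical combined loss: SDF regression + normal alignment + eikonal.

    pred_sdf is a callable mapping (N, 3) points to (N,) distances.
    The weights are illustrative and are not taken from the paper.
    """
    pred = pred_sdf(p)
    grad = sdf_gradient(pred_sdf, p)
    grad_norm = np.linalg.norm(grad, axis=-1)

    # SDF term: mean squared error against the ground-truth distances.
    sdf_term = np.mean((pred - target_sdf) ** 2)

    # Normal term: 1 - cosine similarity between the normalized predicted
    # gradient and the reference normal.
    unit_grad = grad / np.maximum(grad_norm, 1e-8)[:, None]
    normal_term = np.mean(1.0 - np.sum(unit_grad * target_normals, axis=-1))

    # Eikonal term: a valid SDF has a gradient of unit length everywhere.
    eik_term = np.mean((grad_norm - 1.0) ** 2)

    total = sdf_term + w_normal * normal_term + w_eik * eik_term
    return total, sdf_term, normal_term, eik_term

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dirs = rng.normal(size=(128, 3))
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    pts = dirs * rng.uniform(0.2, 1.0, size=(128, 1))
    # For a sphere the reference normal at any point is its radial direction.
    print(hybrid_loss(sphere_sdf, sphere_sdf(pts), dirs, pts))
```

On an exact SDF such as the sphere all three terms are close to zero. Scaling the predicted field by a constant leaves the normal term unchanged and is penalized through the SDF and eikonal terms, which is the kind of degeneracy the eikonal regularizer is meant to discourage.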
citing papers explorer
- On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models
  Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-positive cost.
- VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation
  VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.
- CelloCut: Constructive Watertight Remeshing via Tetrahedral Cell Cuts
  CelloCut formulates watertight remeshing as binary labeling on a Delaunay tetrahedral partition solved by graph-cut minimization with one-sided constraints to guarantee volumetrically consistent solids.
- Towards Realistic and Consistent Orbital Video Generation via 3D Foundation Priors
  A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
- Make-It-Poseable: Feed-forward Latent Posing Model for 3D Characters
  A latent-space transformer framework poses 3D characters without skinning or fixed topologies, outperforming baselines and generalizing zero-shot to quadrupeds.
- ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation
  ROAR-3D adds a token-wise view router and dual-stream attention to pretrained single-view 3D generators so they can use arbitrary unposed images for higher-fidelity output.
- Fast 4D Mesh Generation by Spatio-Temporal Attention Chains
  A training-free Spatio-Temporal Attention Chain framework accelerates 4D mesh generation 13x, improves quality, scales to 16x longer videos, and supports downstream tracking and camera estimation.
- Pixal3D: Pixel-Aligned 3D Generation from Images
  Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.
- Generative 3D Gaussians with Learned Density Control
  DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.
- PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World
  PhysForge generates physics-grounded 3D assets via a VLM-planned Hierarchical Physical Blueprint and a KineVoxel Injection diffusion model, backed by the new PhysDB dataset of 150,000 annotated assets.
- Velox: Learning Representations of 4D Geometry and Appearance
  Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth simulation.
- MeshReGen: A Unified 3D Geometry Regeneration Framework
  MeshReGen introduces a conditioned 3D geometry regenerator with VecSet that learns a regeneration prior via self-supervision and reports state-of-the-art results on controllable generation tasks.
- Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data
  BVE framework enables text-guided 3D editing beyond voxel limits by combining self-constructed data, lightweight semantic injection, and annotation-free masking to preserve local invariance.
- ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment
  ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
- UniRecGen: Unifying Multi-View 3D Reconstruction and Generation
  UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.
- SegviGen: Repurposing 3D Generative Model for Part Segmentation
  SegviGen shows pretrained 3D generative models can be repurposed for part segmentation via voxel colorization, beating prior methods by 40% interactively and 15% on full segmentation using only 0.32% of labeled data.
- MV-SAM3D: Adaptive Multi-View Fusion for Layout-Aware 3D Generation
  MV-SAM3D adds multi-view fusion via multi-diffusion with attention-entropy and visibility weighting plus physics-aware optimization to improve fidelity and physical plausibility in layout-aware 3D generation.
- Learn2Fold: Structured Origami Generation with World Model Planning
  Learn2Fold generates physically valid origami folding sequences from text prompts by decoupling LLM-based program proposals from verification in a learned graph-structured world model.
- Native and Compact Structured Latents for 3D Generation
  Introduces O-Voxel omni-voxel representation and Sparse Compression VAE for structured native 3D latents, enabling efficient training of large flow-matching models that produce higher-quality geometry and materials than prior methods.
- Demystifying Transition Matching: When and Why It Can Beat Flow Matching
  TM outperforms FM for well-separated modes with non-negligible variance by preserving covariance via stochastic latent updates, with the gap closing as variance approaches zero.
- ReScene: Structured Indoor Scene Reconstruction from Multi-View Captures
  ReScene introduces HierView for view prioritization and Relation-Aware Assembly for scene graph fusion, reporting 17% lower Chamfer Distance and 26% lower LPIPS than prior baselines on ScanNet while running faster.
- TelePhysics: Physics-Grounded Multi-Object Scene Generation from a Single Image with Real-Time Interaction
  TelePhysics is a training-free pipeline that builds a unified 3D scene model from one photo and then runs decoupled physics simulation to produce controllable, penetration-free multi-object videos.
- Pose-Aware Diffusion for 3D Generation
  PAD synthesizes 3D geometry in observation space via depth unprojection as anchor to eliminate pose ambiguity in image-to-3D generation.
- From Visual Synthesis to Interactive Worlds: Toward Production-Ready 3D Asset Generation
  The paper surveys 3D asset generation methods and organizes them around the full production pipeline to assess which outputs meet engine-level requirements for interactive applications.
- DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation
  LGAA is a modular adapter framework that lifts multi-view diffusion models to produce 2D Gaussian Splats with PBR channels for high-quality relightable 3D mesh extraction using data-efficient finetuning on 69k instances.
- AnimateAnyMesh++: A Flexible 4D Foundation Model for High-Fidelity Text-Driven Mesh Animation
  AnimateAnyMesh++ animates arbitrary 3D meshes from text using an expanded 300K-identity DyMesh-XL dataset, a power-law topology-aware DyMeshVAE-Flex, and a variable-length rectified-flow generator to produce semantically accurate, temporally coherent animations in seconds.
- Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation
  Seed3D 2.0 advances 3D content generation via a coarse-to-fine geometry pipeline, unified PBR material model, and simulation-ready scene tools, reporting 69-89.9% win rates over commercial systems in human studies.
- Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material
  Hunyuan3D 2.1 is a two-part system with DiT for shape generation and Paint for texture synthesis that produces high-fidelity 3D assets with PBR materials.
- DVD: Discrete Voxel Diffusion for 3D Generation and Editing