Garment Particles is a 5D point cloud representation jointly encoding 2D sewing patterns and 3D geometry, supporting rectified flow generation from high-level inputs and diffusion-based editing of patterns or shapes.
hub Mixed citations
Structured 3D Latents for Scalable and Versatile 3D Generation
Mixed citation behavior. Most common role is background (40%).
abstract
We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
GenRecon lifts object-level generative priors to scene-scale reconstruction by chunking scenes and using projection-based conditioning on multi-view features, claiming 16% better results than prior methods.
CAdam reinterprets densification in generative 3DGS as signal verification via gradient-moment interference, quantile context, and SNR gating to achieve large reductions in primitive count with comparable quality.
A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
A conditional diffusion model using proprioception and multi-contact touch produces metric-scale, physically consistent 3D object reconstructions under hand occlusion.
SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.
MeshTailor is a mesh-native generative model that uses ChainingSeams serialization and a dual-stream transformer with pointer layers to trace coherent seams vertex-by-vertex on 3D surfaces.
ATATA enables fast joint inference of structurally aligned pairs using Rectified Flow models via segment transport, improving state-of-the-art for image and video generation while matching 3D quality at much higher speed.
Affostruction reconstructs full 3D object geometry from partial RGBD views and grounds text-based affordances on both visible and unobserved surfaces, reporting large gains over prior methods.
Voxify3D generates voxel art from 3D meshes via orthographic pixel supervision, patch-based CLIP alignment, and palette-constrained Gumbel-Softmax quantization, achieving 37.12 CLIP-IQA and 77.90% user preference.
SVG360 lifts a single SVG to a view-conditioned representation, uses spatial memory to propagate consistent parts across views, and applies structure-aware vectorization to produce editable multiview SVGs.
GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D inpainting plus optimization, and then feeding them to pre-trained video diffusion.
Diffusion-based per-view harmonization for lighting-consistent object transfer between 3DGS scenes, using heterogeneous training data and final 3D consolidation.
PhyGenHOI couples a motion diffusion model for humans with material point method simulation for objects on 3D Gaussians, using attraction loss, contact re-simulation, and masked video-SDS to produce physically consistent dynamic interactions from text.
PhysX-Omni unifies simulation-ready 3D asset generation across rigid, deformable, and articulated objects via a new geometry representation, the PhysXVerse dataset, and the PhysX-Bench evaluation suite.
ROAR-3D adds a token-wise view router and dual-stream attention to pretrained single-view 3D generators so they can use arbitrary unposed images for higher-fidelity output.
PhysForge generates physics-grounded 3D assets via a VLM-planned Hierarchical Physical Blueprint and a KineVoxel Injection diffusion model, backed by the new PhysDB dataset of 150,000 annotated assets.
Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth simulation.
MeshReGen introduces a conditioned 3D geometry regenerator with VecSet that learns a regeneration prior via self-supervision and reports state-of-the-art results on controllable generation tasks.
REVIVE 3D generates voluminous 3D assets from flat 2D images via an inflated prior construction followed by latent-space refinement, plus new metrics for volume and flatness validated by user study.
Pair2Scene generates complex 3D scenes beyond training data by training a network on local object-pair placement rules and applying them recursively with collision-aware sampling.
WARPED synthesizes realistic wrist-view observations from monocular egocentric human videos via foundation models, hand-object tracking, retargeting, and Gaussian Splatting to train visuomotor policies that match teleoperation success rates on five tabletop tasks with 5-8x less collection effort.
ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
UniRecGen unifies reconstruction and generation via shared canonical space and disentangled cooperative learning to produce complete, consistent 3D models from sparse views.
citing papers explorer
-
Voxify3D: Pixel Art Meets Volumetric Rendering
Voxify3D generates voxel art from 3D meshes via orthographic pixel supervision, patch-based CLIP alignment, and palette-constrained Gumbel-Softmax quantization, achieving 37.12 CLIP-IQA and 77.90% user preference.
-
Depth Anything 3: Recovering the Visual Space from Any Views
DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.