Arbor attaches constraint mesh tokens to a frozen text-to-3D denoiser to enable controllable generation obeying hull, avoidance, and touch constraints.
Structured 3D Latents for Scalable and Versatile 3D Generation
Mixed citation behavior. Most common role is background (40%).
abstract
We introduce a novel 3D generation method for versatile and high-quality 3D asset creation. The cornerstone is a unified Structured LATent (SLAT) representation which allows decoding to different output formats, such as Radiance Fields, 3D Gaussians, and meshes. This is achieved by integrating a sparsely-populated 3D grid with dense multiview visual features extracted from a powerful vision foundation model, comprehensively capturing both structural (geometry) and textural (appearance) information while maintaining flexibility during decoding. We employ rectified flow transformers tailored for SLAT as our 3D generation models and train models with up to 2 billion parameters on a large 3D asset dataset of 500K diverse objects. Our model generates high-quality results with text or image conditions, significantly surpassing existing methods, including recent ones at similar scales. We showcase flexible output format selection and local 3D editing capabilities which were not offered by previous models. Code, model, and data will be released.
representative citing papers
AdaVoMP predicts accurate dense spatially-varying Young's modulus, Poisson's ratio and density for 3D objects using an adaptive sparse voxel structure generated by a sparse transformer encoder-decoder at 16^3 higher resolution than prior fixed-voxel methods.
Garment Particles is a 5D point cloud representation jointly encoding 2D sewing patterns and 3D geometry, supporting rectified flow generation from high-level inputs and diffusion-based editing of patterns or shapes.
GenRecon lifts object-level generative priors to scene-scale reconstruction by chunking scenes and using projection-based conditioning on multi-view features, claiming 16% better results than prior methods.
CAdam reinterprets densification in generative 3DGS as signal verification via gradient-moment interference, quantile context, and SNR gating to achieve large reductions in primitive count with comparable quality.
A video generation approach conditions a base model with multi-scale 3D latent features and a cross-attention adapter to produce geometrically realistic and consistent orbital videos from one image.
SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.
MeshTailor is a mesh-native generative model that uses ChainingSeams serialization and a dual-stream transformer with pointer layers to trace coherent seams vertex-by-vertex on 3D surfaces.
ATATA enables fast joint inference of structurally aligned pairs using Rectified Flow models via segment transport, improving state-of-the-art for image and video generation while matching 3D quality at much higher speed.
Affostruction reconstructs full 3D object geometry from partial RGBD views and grounds text-based affordances on both visible and unobserved surfaces, reporting large gains over prior methods.
Voxify3D generates voxel art from 3D meshes via orthographic pixel supervision, patch-based CLIP alignment, and palette-constrained Gumbel-Softmax quantization, achieving 37.12 CLIP-IQA and 77.90% user preference.
SVG360 lifts a single SVG to a view-conditioned representation, uses spatial memory to propagate consistent parts across views, and applies structure-aware vectorization to produce editable multiview SVGs.
GenHSI is a training-free three-stage pipeline that turns a scene image, character image, and complex HSI prompt into long videos with plausible chained interactions by generating atomic actions, 3D keyframes via 2D inpainting plus optimization, and then feeding them to pre-trained video diffusion.
A single-stage pixel-space diffusion model for direct 3D Gaussian Splat generation that bypasses latent compression and adds geometric supervisions to outperform prior multi-stage methods.
HiFiVe is a training-free framework using an auto-regressive texture refinement pipeline with depth-based warping, multi-view fusion, and symmetry to enhance both texture and geometry fidelity in vehicle generation from 2D priors.
GRA combines UV-space material optimization and physics rendering with feed-forward texture refinement and a fine-tuned video-to-video diffusion model to achieve controllable, high-detail relighting of full-body avatars.
Diffusion-based per-view harmonization for lighting-consistent object transfer between 3DGS scenes, using heterogeneous training data and final 3D consolidation.
A de-biased VLM judge protocol is applied to adapt TRELLIS for single-image furniture 3D generation but yields no improvement over the strong public base across six methods.
DO AS I DO reconstructs and retargets hand-object interactions from in-the-wild monocular RGB videos to produce dexterous robot manipulation trajectories, outperforming prior methods on ground-truth and online video datasets.
Surflo compresses unposed RGB views into K global latent tokens and uses flow matching with photometric guidance to decode consistent arbitrary-resolution 3D surface points in one forward pass.
MeshFlow uses a contrastive MeshVAE for compact mesh latents and a flow transformer for parallel generation, claiming 18x speedup over autoregressive methods with high accuracy on standard metrics.
PerceptTwin creates interactive simulations from open-vocabulary object maps for verifying and refining LLM robot plans, reporting ~39% higher success rates and up to 18% better human verification.
PhyGenHOI couples a motion diffusion model for humans with material point method simulation for objects on 3D Gaussians, using attraction loss, contact re-simulation, and masked video-SDS to produce physically consistent dynamic interactions from text.
Fishbone introduces a unified rib-spine representation computed via adaptive heat method, iso-contour ribs, and geometry-aware spine that enables real-time parametric deformation, reduced-space simulation, and animation on general meshes.
citing papers explorer
- ReplicateAnyScene: Zero-Shot Video-to-3D Composition via Textual-Visual-Spatial Alignment
ReplicateAnyScene performs fully automated zero-shot video-to-compositional-3D reconstruction by cascading alignments of generic priors from vision foundation models across textual, visual, and spatial dimensions.
- Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch