SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation

Alexander Schwing; Hsin-Ying Lee; Liangyan Gui; Sergey Tulyakov; Yen-Chi Cheng

arxiv: 2212.04493 · v2 · pith:PJ3MDSJOnew · submitted 2022-12-08 · 💻 cs.CV · cs.LG


Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, Alexander Schwing, Liangyan Gui


keywords: generation, shape, input, model, shapes, variety, approach, completion


Abstract

In this work, we present a novel framework built to simplify 3D asset generation for amateur users. To enable interactive generation, our method supports a variety of input modalities that can be easily provided by a human, including images, text, partially observed shapes, and combinations of these, and further allows the user to adjust the strength of each input. At the core of our approach is an encoder-decoder that compresses 3D shapes into a compact latent representation, upon which a diffusion model is learned. To enable a variety of multi-modal inputs, we employ task-specific encoders with dropout followed by a cross-attention mechanism. Due to its flexibility, our model naturally supports a variety of tasks, outperforming prior works on shape completion, image-based 3D reconstruction, and text-to-3D. Most interestingly, our model can combine all these tasks into one Swiss-army-knife tool, enabling the user to perform shape generation using incomplete shapes, images, and textual descriptions at the same time, providing the relative weights for each input and facilitating interactivity. Although our approach is shape-only, we further show an efficient method to texture the generated shape using large-scale text-to-image models.
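The abstract mentions two mechanisms that a small example can make concrete: dropout applied to the per-modality conditions, and user-supplied relative weights for each input. The sketch below is an illustration under stated assumptions, not the authors' implementation. It assumes a classifier-free-guidance-style combination, in which each modality's conditional noise prediction is blended with an unconditional prediction using its own weight. The abstract does not give this formula, and the function names `drop_conditions` and `combine_guidance` are hypothetical.

```python
import random
from typing import Dict, List, Optional

Vector = List[float]


def drop_conditions(
    conditions: Dict[str, Vector],
    drop_prob: float,
    rng: random.Random,
) -> Dict[str, Optional[Vector]]:
    """Independently replace each modality's embedding with None.

    Assumption: None stands in for a "null" condition, so a model trained
    this way can also predict noise with any subset of inputs missing.
    """
    return {
        name: (None if rng.random() < drop_prob else emb)
        for name, emb in conditions.items()
    }


def combine_guidance(
    eps_uncond: Vector,
    eps_cond: Dict[str, Vector],
    weights: Dict[str, float],
) -> Vector:
    """Blend per-modality noise predictions with per-modality weights.

    Assumed rule (classifier-free-guidance style, not taken from the source):
        eps = eps_uncond + sum_i w_i * (eps_cond_i - eps_uncond)
    A weight of 0 ignores that modality; larger weights follow it more closely.
    """
    out = list(eps_uncond)
    for name, eps in eps_cond.items():
        w = weights.get(name, 0.0)  # modalities without a weight are ignored
        for k in range(len(out)):
            out[k] += w * (eps[k] - eps_uncond[k])
    return out


if __name__ == "__main__":
    # Toy 2-dimensional "noise predictions" for illustration only.
    eps_u = [0.0, 1.0]
    eps_c = {"image": [1.0, 1.0], "text": [0.0, 3.0]}
    print(combine_guidance(eps_u, eps_c, {"image": 2.0, "text": 0.5}))
```

In this reading, raising the `image` weight while lowering the `text` weight pulls each denoising step toward the image condition, which is one plausible way to realize the adjustable input strength that the abstract describes.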



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models
cs.CV 2026-01 unverdicted novelty 5.0

CG-MLLM is a multimodal LLM using a Mixture-of-Transformer architecture with separate TokenAR and BlockAR components integrated with a pre-trained vision-language backbone and 3D VAE to enable 3D captioning and high-f...