IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation

Andrea Vedaldi; Christian Rupprecht; Filippos Kokkinos; Iro Laina; Luke Melas-Kyriazi; Natalia Neverova; Oran Gafni

arXiv: 2402.08682 · v1 · pith:7DSGQMBInew · submitted 2024-02-13 · 💻 cs.CV · cs.AI · cs.LG


Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni, Filippos Kokkinos


keywords: reconstruction, combined, directly, distillation, generation, generator, generators, high-quality


Most text-to-3D generators build upon off-the-shelf text-to-image models trained on billions of images. They use variants of Score Distillation Sampling (SDS), which is slow, somewhat unstable, and prone to artifacts. A mitigation is to fine-tune the 2D generator to be multi-view aware, which can help distillation or can be combined with reconstruction networks to output 3D objects directly. In this paper, we further explore the design space of text-to-3D models. We significantly improve multi-view generation by considering video instead of image generators. Combined with a 3D reconstruction algorithm which, by using Gaussian splatting, can optimize a robust image-based loss, we directly produce high-quality 3D outputs from the generated views. Our new method, IM-3D, reduces the number of evaluations of the 2D generator network 10-100x, resulting in a much more efficient pipeline, better quality, fewer geometric inconsistencies, and higher yield of usable 3D assets.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation
cs.CV · 2024-06 · unverdicted · novelty 6.0

CamCo equips image-to-video generators with Plücker-coordinate camera inputs and epipolar attention to improve 3D consistency and camera controllability.