Pixel-Aligned Multi-View Generation with Depth Guided Decoder

Alexander Schwing; Aliaksandr Siarohin; Chaoyang Wang; Hsin-Ying Lee; Peiye Zhuang; Sergey Tulyakov; Yash Kant; Zhenggang Tang

arxiv: 2408.14016 · v1 · pith:6Y5IEVSTnew · submitted 2024-08-26 · 💻 cs.CV · cs.AI


Zhenggang Tang, Peiye Zhuang, Chaoyang Wang, Aliaksandr Siarohin, Yash Kant, Alexander Schwing, Sergey Tulyakov, Hsin-Ying Lee


keywords depth, multi-view, generation, model, across, attention, depth-truncated, diffusion


The task of image-to-multi-view generation refers to generating novel views of an instance from a single image. Recent methods achieve this by extending text-to-image latent diffusion models, which contain a VAE image encoder and a U-Net diffusion model, to a multi-view version. Specifically, these generation methods usually freeze the VAE and fine-tune only the U-Net. However, the significant downscaling of the latent vectors computed from the input images, together with independent decoding of each view, leads to notable pixel-level misalignment across multiple views. To address this, we propose a novel method for pixel-level image-to-multi-view generation. Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model. Specifically, we introduce a depth-truncated epipolar attention, enabling the model to focus on spatially adjacent regions while remaining memory efficient. Applying depth-truncated attention is challenging during inference, as ground-truth depth is usually difficult to obtain and pre-trained depth estimation models struggle to provide accurate depth. Thus, to improve generalization to inaccurate depth when ground-truth depth is missing, we perturb depth inputs during training. During inference, we employ a rapid multi-view-to-3D reconstruction approach, NeuS, to obtain coarse depth for the depth-truncated epipolar attention. Our model enables better pixel alignment across multi-view images. Moreover, we demonstrate the efficacy of our approach in improving downstream multi-view-to-3D reconstruction tasks.
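
The abstract does not spell out how the depth-truncated epipolar attention selects its candidates or how depth is perturbed, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes a pinhole camera model, a symmetric depth window around the coarse depth, uniform sampling inside that window, and multiplicative Gaussian noise for the perturbation. The function names (`truncated_epipolar_samples`, `perturb_depth`) and the parameters `half_width`, `num_samples`, and `rel_sigma` are hypothetical.

```python
import numpy as np

def truncated_epipolar_samples(pixel, depth, K_src, K_tgt, R, t,
                               half_width=0.1, num_samples=8):
    """Project depth candidates near `depth` from a source pixel into a target view.

    Assumed convention: X_tgt = R @ X_src + t maps source-camera coordinates
    to target-camera coordinates. Returns an array of shape (num_samples, 2)
    holding target-view pixel coordinates on a short epipolar line segment.
    """
    u, v = pixel
    # Back-project the pixel to a ray with unit depth in the source camera.
    ray = np.linalg.inv(K_src) @ np.array([u, v, 1.0])
    # Restrict candidates to a window around the coarse depth (the "truncation").
    depths = np.linspace(depth - half_width, depth + half_width, num_samples)
    depths = np.clip(depths, 1e-6, None)  # keep candidates in front of the camera
    pts_src = ray[None, :] * depths[:, None]   # (num_samples, 3)
    pts_tgt = pts_src @ R.T + t                # (num_samples, 3)
    proj = pts_tgt @ K_tgt.T                   # homogeneous pixel coordinates
    return proj[:, :2] / proj[:, 2:3]

def perturb_depth(depth, rng, rel_sigma=0.05):
    """Apply multiplicative Gaussian noise to a depth map (assumed noise model)."""
    depth = np.asarray(depth, dtype=float)
    noise = rng.normal(0.0, rel_sigma, size=depth.shape)
    return depth * (1.0 + noise)

# Usage: two cameras with identity intrinsics, target shifted along x.
K = np.eye(3)
uv = truncated_epipolar_samples((0.0, 0.0), 2.0, K, K, np.eye(3),
                                np.array([1.0, 0.0, 0.0]),
                                half_width=0.5, num_samples=3)
noisy = perturb_depth(np.full((4, 4), 2.0), np.random.default_rng(0))
```

In this sketch, attention for a source pixel would only look at the target-view locations in `uv` rather than the full epipolar line, which is where the memory saving would come from; training with `perturb_depth` is meant to mimic the coarse depth available at inference time.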



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
cs.CV 2026-04 unverdicted novelty 7.0

3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.

Lyra 2.0: Explorable Generative 3D Worlds
cs.CV 2026-04 unverdicted novelty 6.0

Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.