ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Discrete Diffusion Models

Hongyuan Zhang; Jiacheng Sun; Ping Luo; Ruishu Zhu; Xuelong Li; Zhihao Huang

arxiv: 2512.14099 · v3 · pith:ARMWIOIFnew · submitted 2025-12-16 · 💻 cs.CV


Ruishu Zhu, Zhihao Huang, Jiacheng Sun, Ping Luo, Hongyuan Zhang, Xuelong Li

classification 💻 cs.CV

keywords: generation, diffusion, discrete, multi-view, multimodal, token, continuous, 3d-future


Motivated by discrete diffusion's success in language-vision modeling, we explore its potential for multi-view generation, a task dominated by continuous approaches. We introduce ViewMask-1-to-3, formulating multi-view generation as a discrete sequence modeling problem where each viewpoint is represented as visual tokens from MAGVIT-v2. Through discrete diffusion via masked token prediction, our approach enables progressive multi-view generation via iterative token unmasking, unifying language and vision in a shared token space. Importantly, simple random masking combined with self-attention naturally encourages cross-view consistency without specialized architectures or 3D geometric priors. Our method outperforms the baseline on the GSO and 3D-FUTURE benchmarks, ranking first on average across standard image metrics, and achieving a 10.6% higher IoU than continuous diffusion models on 3D-FUTURE. Furthermore, the proposed framework can be naturally extended to support text-to-image generation and multimodal understanding, highlighting its potential toward a more unified paradigm for multimodal understanding and generation.
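
The iterative token unmasking described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and is not taken from the paper: the `MASK` sentinel, the `toy_predictor` stand-in (a random placeholder for a trained masked-token model), the cosine reveal schedule, and the confidence-based ranking, which follows the general style of masked generative samplers rather than the authors' confirmed procedure. The target sequence stands for the visual tokens of the views to be generated, and the condition sequence for the tokens of the input view.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a still-masked token


def toy_predictor(tokens, vocab_size, rng):
    """Placeholder for a trained model: one (token, confidence) pair per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_unmask(condition_tokens, num_target, vocab_size, steps=8, seed=0):
    """Fill a fully masked target sequence over several unmasking steps."""
    rng = random.Random(seed)
    target = [MASK] * num_target
    offset = len(condition_tokens)
    for step in range(1, steps + 1):
        masked = [i for i, tok in enumerate(target) if tok == MASK]
        if not masked:
            break
        # Assumed cosine schedule: how many tokens stay masked after this step.
        keep_masked = math.floor(num_target * math.cos(math.pi / 2 * step / steps))
        num_to_reveal = max(1, len(masked) - keep_masked)
        # The predictor sees condition and target tokens in one sequence,
        # loosely mirroring joint self-attention across views.
        preds = toy_predictor(condition_tokens + target, vocab_size, rng)
        # Reveal the masked positions with the highest confidence first.
        ranked = sorted(masked, key=lambda i: preds[offset + i][1], reverse=True)
        for i in ranked[:num_to_reveal]:
            target[i] = preds[offset + i][0]
    return target


if __name__ == "__main__":
    print(iterative_unmask([1, 2, 3], num_target=16, vocab_size=32))
```

With a real model in place of `toy_predictor`, the same loop would commit a few confident tokens early and fill in the rest as context accumulates; the schedule and step count are tunable choices in this sketch, not reported settings.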


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Safeguarding Text-to-Image Generative Models Against Unauthorized Knowledge Distillation
cs.CR 2026-05 unverdicted novelty 5.0

WaveGuard injects user-controlled, imperceptible frequency-based perturbations into synthetic images to reduce their value as training data for unauthorized knowledge distillation while preserving perceptual quality.