DiffX: Guide Your Layout to Cross-Modal Generative Modeling

Bosong Chai; Cunjian Chen; Jingyu Lin; Juncan Deng; Kejie Huang; Lan Du; Qu Yang; Shicen Tian; Yifei Qian; Yi Huang

arxiv: 2407.15488 · v5 · pith:AJCWOPLWnew · submitted 2024-07-22 · 💻 cs.CV

Zeyu Wang, Jingyu Lin, Yifei Qian, Yi Huang, Shicen Tian, Bosong Chai, Juncan Deng, Qu Yang, Lan Du, Cunjian Chen, Kejie Huang

classification 💻 cs.CV

keywords cross-modal, diffx, generation, image, datasets, diffusion, layout, model

Diffusion models have made significant strides in language-driven and layout-driven image generation. However, most diffusion models are limited to visible RGB image generation. In fact, human perception of the world is enriched by diverse viewpoints, such as chromatic contrast, thermal illumination, and depth information. In this paper, we introduce a novel diffusion model for general layout-guided cross-modal generation, called DiffX. Notably, DiffX presents a compact and effective cross-modal generative modeling pipeline, which conducts the diffusion and denoising processes in a modality-shared latent space. Moreover, we introduce the Joint-Modality Embedder (JME) to enhance the interaction between layout and text conditions by incorporating a gated attention mechanism. To facilitate user-instructed training, we construct cross-modal image datasets with detailed text captions using a Large Multimodal Model (LMM) and our human-in-the-loop refinement. Through extensive experiments, DiffX demonstrates robustness in cross-modal "RGB+X" image generation on the FLIR, MFNet, and COME15K datasets, guided by various layout conditions. It also shows strong potential for the adaptive generation of "RGB+X+Y(+Z)" images or more diverse modalities on the FLIR, MFNet, COME15K, and MCXFace datasets. To our knowledge, DiffX is the first model for layout-guided cross-modal image generation. Our code and constructed cross-modal image datasets are available at https://github.com/zeyuwang-zju/DiffX.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HiFi-Inpaint: Towards High-Fidelity Reference-Based Inpainting for Generating Detail-Preserving Human-Product Images
cs.CV 2026-03 unverdicted novelty 6.0

HiFi-Inpaint delivers state-of-the-art detail-preserving human-product images by adding Shared Enhancement Attention and Detail-Aware Loss to reference-based inpainting on a new 40K dataset.

Composing People Together: Iterative Pose-Image Generation for Multi-Person Interaction Scenes
cs.CV 2026-05 unverdicted novelty 5.0

Introduces dual pose-image representation, cross-modal alignment, and iterative construction to improve prompt alignment and diversity in multi-person text-to-image generation.