Multimodal ELBO with Diffusion Decoders

Daniel Wesego; Pedram Rooshenas

arxiv: 2408.16883 · v2 · pith:7C3AG4V7new · submitted 2024-08-29 · 💻 cs.LG · cs.CV

keywords: multimodal, modalities, model, different, decoder, diffusion, generation, VAEs

Multimodal variational autoencoders (VAEs) have demonstrated their ability to learn the relationships between different modalities by mapping them into a latent representation. Their design and capacity to perform any-to-any conditional and unconditional generation make them appealing. However, different variants of multimodal VAEs often generate low-quality output, particularly when complex modalities such as images are involved. They also frequently exhibit low coherence among the generated modalities when sampling from the joint distribution. To address these limitations, we propose a new variant of the multimodal VAE ELBO that incorporates a better decoder using a diffusion generative model. The diffusion decoder enables the model to learn complex modalities and generate high-quality outputs. The multimodal model can also integrate seamlessly with a standard feed-forward decoder for other types of modality, facilitating end-to-end training and inference. Furthermore, we introduce an auxiliary score-based model to enhance the unconditional generation capabilities of our proposed approach. This approach addresses the limitations imposed by conventional multimodal VAEs and opens up new possibilities to improve multimodal generation tasks. Our model provides state-of-the-art results compared to other multimodal VAEs on different datasets, with higher coherence and superior quality in the generated modalities.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs
cs.LG · 2026-06 · unverdicted · novelty 6.0

Hölder++ improves the quality-coherence trade-off in multimodal VAEs via exact Hölder pooling, shared-private latent modeling, and hierarchical inference.