A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
[2021]: LV AE= L1 + LLPIPS + 0.5LGAN + 0.2LID + 0.000001LKL where L1 is L1 loss in pixel space, LLPIPS is perceptual loss based on LPIPS similarity Zhang et al
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2024 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.