{"paper":{"title":"Emu: Generative Pretraining in Multimodality","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A single Transformer model generates images and text by autoregressively predicting the next token or visual embedding from interleaved inputs.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Fan Zhang, Hongcheng Gao, Jingjing Liu, Qiying Yu, Quan Sun, Tiejun Huang, Xiaosong Zhang, Xinlong Wang, Yueze Wang, Yufeng Cui","submitted_at":"2023-07-11T12:45:39Z","abstract_excerpt":"We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal se"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That encoding visual signals into embeddings and training with a unified next-token or next-embedding objective will produce coherent multimodal generation without modality-specific losses or architectures.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Emu is a multimodal foundation model that unifies image and text generation via autoregressive pretraining on interleaved multimodal data, showing strong zero-shot performance on captioning, VQA, and text-to-image tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A single Transformer model generates images and text by autoregressively predicting the next token or visual embedding from interleaved inputs.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5072c2a8cf7419115bf0b0457f33763a63dff986b79e3cb73c852a5e32e703b3"},"source":{"id":"2307.05222","kind":"arxiv","version":2},"verdict":{"id":"eb77e0f9-4408-4b15-900f-eb5ebba5a48b","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T20:18:23.073660Z","strongest_claim":"Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models.","one_line_summary":"Emu is a multimodal foundation model that unifies image and text generation via autoregressive pretraining on interleaved multimodal data, showing strong zero-shot performance on captioning, VQA, and text-to-image tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That encoding visual signals into embeddings and training with a unified next-token or next-embedding objective will produce coherent multimodal generation without modality-specific losses or architectures.","pith_extraction_headline":"A single Transformer model generates images and text by autoregressively predicting the next token or visual embedding from interleaved inputs."},"references":{"count":22,"sample":[{"doi":"","year":2022,"title":"and contains large-scale image-text pairs data. LAION-COCO (lai, b) is captioned 600M images from LAION-2B with an ensemble of BLIP (Li et al., 2022) and CLIP (Radford et al., 2021) models. Whereas th","work_id":"5714861e-873c-4f7f-8805-0a782e92c494","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Make sure to check the weather forecast before your visit and pack appropriate clothing and gear","work_id":"41abc708-2bee-41a2-87bb-3f33e004ff03","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Make sure to stay on designated trails and keep your distance from any wildlife you encounter","work_id":"2c556f4a-378a-4f7c-9de4-e1b4e017403e","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Make sure to check with local authorities before swimming or boating in the lake to ensure it is safe to do so","work_id":"22cae4bf-9c5d-4c68-b95d-0795eb7d7dae","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Make sure to familiarize yourself with the lake's layout and any potential hazards before venturing out on the water","work_id":"eeed9ff6-fa73-4fcf-9e71-e074f699acd3","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":22,"snapshot_sha256":"ffebb0f1446b0e62db4715f8ca1ae85467d04d8d19f93362818d69444e5803d5","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"455f3d72a31ca83ed0371ac902779939b0be8d90400205b164a5df3a1801ccea"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}