{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:ID7PWW53VUCANRI32L4VIQM256","short_pith_number":"pith:ID7PWW53","schema_version":"1.0","canonical_sha256":"40fefb5bbbad0406c51bd2f954419aefbb8537aa8f92f967a5d8353a016028c6","source":{"kind":"arxiv","id":"2307.05222","version":2},"attestation_state":"computed","paper":{"title":"Emu: Generative Pretraining in Multimodality","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A single Transformer model generates images and text by autoregressively predicting the next token or visual embedding from interleaved inputs.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Fan Zhang, Hongcheng Gao, Jingjing Liu, Qiying Yu, Quan Sun, Tiejun Huang, Xiaosong Zhang, Xinlong Wang, Yueze Wang, Yufeng Cui","submitted_at":"2023-07-11T12:45:39Z","abstract_excerpt":"We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal se"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2307.05222","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2023-07-11T12:45:39Z","cross_cats_sorted":[],"title_canon_sha256":"0bc9d5de0efbc6d14eda4e338377e6441a0a00c832fa1e567e5af7edca242882","abstract_canon_sha256":"46600ae81117b6ccc061a48ecab7a9f32c222fa5a904f09d6e372d0982f5c569"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.701161Z","signature_b64":"vIeUDVmv2CUaRTXZbmXI07Y9a3Q5tJvEi8+NCQVVWx3PHznBslRdGFpgXXE6thD1/hqzgAG2w6stsJwtu34RBQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"40fefb5bbbad0406c51bd2f954419aefbb8537aa8f92f967a5d8353a016028c6","last_reissued_at":"2026-05-17T23:38:46.700545Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.700545Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Emu: Generative Pretraining in Multimodality","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A single Transformer model generates images and text by autoregressively predicting the next token or visual embedding from interleaved inputs.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Fan Zhang, Hongcheng Gao, Jingjing Liu, Qiying Yu, Quan Sun, Tiejun Huang, Xiaosong Zhang, Xinlong Wang, Yueze Wang, Yufeng Cui","submitted_at":"2023-07-11T12:45:39Z","abstract_excerpt":"We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal se"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That encoding visual signals into embeddings and training with a unified next-token or next-embedding objective will produce coherent multimodal generation without modality-specific losses or architectures.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Emu is a multimodal foundation model that unifies image and text generation via autoregressive pretraining on interleaved multimodal data, showing strong zero-shot performance on captioning, VQA, and text-to-image tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A single Transformer model generates images and text by autoregressively predicting the next token or visual embedding from interleaved inputs.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5072c2a8cf7419115bf0b0457f33763a63dff986b79e3cb73c852a5e32e703b3"},"source":{"id":"2307.05222","kind":"arxiv","version":2},"verdict":{"id":"eb77e0f9-4408-4b15-900f-eb5ebba5a48b","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T20:18:23.073660Z","strongest_claim":"Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models.","one_line_summary":"Emu is a multimodal foundation model that unifies image and text generation via autoregressive pretraining on interleaved multimodal data, showing strong zero-shot performance on captioning, VQA, and text-to-image tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That encoding visual signals into embeddings and training with a unified next-token or next-embedding objective will produce coherent multimodal generation without modality-specific losses or architectures.","pith_extraction_headline":"A single Transformer model generates images and text by autoregressively predicting the next token or visual embedding from interleaved inputs."},"references":{"count":22,"sample":[{"doi":"","year":2022,"title":"and contains large-scale image-text pairs data. LAION-COCO (lai, b) is captioned 600M images from LAION-2B with an ensemble of BLIP (Li et al., 2022) and CLIP (Radford et al., 2021) models. Whereas th","work_id":"5714861e-873c-4f7f-8805-0a782e92c494","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Make sure to check the weather forecast before your visit and pack appropriate clothing and gear","work_id":"41abc708-2bee-41a2-87bb-3f33e004ff03","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Make sure to stay on designated trails and keep your distance from any wildlife you encounter","work_id":"2c556f4a-378a-4f7c-9de4-e1b4e017403e","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Make sure to check with local authorities before swimming or boating in the lake to ensure it is safe to do so","work_id":"22cae4bf-9c5d-4c68-b95d-0795eb7d7dae","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Make sure to familiarize yourself with the lake's layout and any potential hazards before venturing out on the water","work_id":"eeed9ff6-fa73-4fcf-9e71-e074f699acd3","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":22,"snapshot_sha256":"ffebb0f1446b0e62db4715f8ca1ae85467d04d8d19f93362818d69444e5803d5","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"455f3d72a31ca83ed0371ac902779939b0be8d90400205b164a5df3a1801ccea"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2307.05222","created_at":"2026-05-17T23:38:46.700652+00:00"},{"alias_kind":"arxiv_version","alias_value":"2307.05222v2","created_at":"2026-05-17T23:38:46.700652+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2307.05222","created_at":"2026-05-17T23:38:46.700652+00:00"},{"alias_kind":"pith_short_12","alias_value":"ID7PWW53VUCA","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"ID7PWW53VUCANRI3","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"ID7PWW53","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":23,"internal_anchor_count":23,"sample":[{"citing_arxiv_id":"2601.01593","citing_title":"Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23606","citing_title":"Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2506.04565","citing_title":"From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems","ref_index":165,"is_internal_anchor":true},{"citing_arxiv_id":"2402.15852","citing_title":"NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation","ref_index":93,"is_internal_anchor":true},{"citing_arxiv_id":"2403.18814","citing_title":"Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2505.05472","citing_title":"Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2602.12286","citing_title":"Mind the Gap No More: Achieving Zero-Gap Multimodal Integration via One Tokenizer","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2306.13549","citing_title":"A Survey on Multimodal Large Language Models","ref_index":175,"is_internal_anchor":true},{"citing_arxiv_id":"2404.14396","citing_title":"SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2410.13848","citing_title":"Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation","ref_index":75,"is_internal_anchor":true},{"citing_arxiv_id":"2503.12605","citing_title":"Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey","ref_index":203,"is_internal_anchor":true},{"citing_arxiv_id":"2505.15809","citing_title":"MMaDA: Multimodal Large Diffusion Language Models","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2312.14238","citing_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","ref_index":132,"is_internal_anchor":true},{"citing_arxiv_id":"2311.07919","citing_title":"Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2307.16125","citing_title":"SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2406.06525","citing_title":"Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22989","citing_title":"CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2403.05525","citing_title":"DeepSeek-VL: Towards Real-World Vision-Language Understanding","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09088","citing_title":"Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation","ref_index":83,"is_internal_anchor":true},{"citing_arxiv_id":"2501.17811","citing_title":"Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04058","citing_title":"MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07753","citing_title":"Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2505.14683","citing_title":"Emerging Properties in Unified Multimodal Pretraining","ref_index":68,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/ID7PWW53VUCANRI32L4VIQM256","json":"https://pith.science/pith/ID7PWW53VUCANRI32L4VIQM256.json","graph_json":"https://pith.science/api/pith-number/ID7PWW53VUCANRI32L4VIQM256/graph.json","events_json":"https://pith.science/api/pith-number/ID7PWW53VUCANRI32L4VIQM256/events.json","paper":"https://pith.science/paper/ID7PWW53"},"agent_actions":{"view_html":"https://pith.science/pith/ID7PWW53VUCANRI32L4VIQM256","download_json":"https://pith.science/pith/ID7PWW53VUCANRI32L4VIQM256.json","view_paper":"https://pith.science/paper/ID7PWW53","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2307.05222&json=true","fetch_graph":"https://pith.science/api/pith-number/ID7PWW53VUCANRI32L4VIQM256/graph.json","fetch_events":"https://pith.science/api/pith-number/ID7PWW53VUCANRI32L4VIQM256/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/ID7PWW53VUCANRI32L4VIQM256/action/timestamp_anchor","attest_storage":"https://pith.science/pith/ID7PWW53VUCANRI32L4VIQM256/action/storage_attestation","attest_author":"https://pith.science/pith/ID7PWW53VUCANRI32L4VIQM256/action/author_attestation","sign_citation":"https://pith.science/pith/ID7PWW53VUCANRI32L4VIQM256/action/citation_signature","submit_replication":"https://pith.science/pith/ID7PWW53VUCANRI32L4VIQM256/action/replication_record"}},"created_at":"2026-05-17T23:38:46.700652+00:00","updated_at":"2026-05-17T23:38:46.700652+00:00"}