{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:4OXKXQ4PBCCPHOCVZK2PKPZTA2","short_pith_number":"pith:4OXKXQ4P","schema_version":"1.0","canonical_sha256":"e3aeabc38f0884f3b855cab4f53f330696ad3beecbe15677a1497d065c6c83d6","source":{"kind":"arxiv","id":"2412.14164","version":1},"attestation_state":"computed","paper":{"title":"MetaMorph: Multimodal Understanding and Generation via Instruction Tuning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Visual generation ability emerges as a natural byproduct of improved visual understanding in instruction-tuned LLMs.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"David Fan, Jiachen Zhu, Koustuv Sinha, Michael Rabbat, Saining Xie, Shengbang Tong, Xinlei Chen, Yann LeCun, Yunyang Xiong, Zhuang Liu","submitted_at":"2024-12-18T18:58:50Z","abstract_excerpt":"In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into an unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data curated in an instruction-following format. Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual under"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2412.14164","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2024-12-18T18:58:50Z","cross_cats_sorted":[],"title_canon_sha256":"20c025291f77e7cde39c6acc6ed7ebea9dd3abbcaa26f3b99b95e26f8161a30c","abstract_canon_sha256":"76dfbffcf154571b952855eb877a4fa7e54ec04dfe3914dcbd04f4a201adbe57"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:14.674979Z","signature_b64":"L95J5m+1uDsuxFWQxqYMncs8QOqZFMcD0B0gT0zFw2gyY4jaQBew2aFkx+kbPi+B9HS2SvIl5eWexzi8tGEfBA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"e3aeabc38f0884f3b855cab4f53f330696ad3beecbe15677a1497d065c6c83d6","last_reissued_at":"2026-05-17T23:38:14.674381Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:14.674381Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"MetaMorph: Multimodal Understanding and Generation via Instruction Tuning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Visual generation ability emerges as a natural byproduct of improved visual understanding in instruction-tuned LLMs.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"David Fan, Jiachen Zhu, Koustuv Sinha, Michael Rabbat, Saining Xie, Shengbang Tong, Xinlei Chen, Yann LeCun, Yunyang Xiong, Zhuang Liu","submitted_at":"2024-12-18T18:58:50Z","abstract_excerpt":"In this work, we propose Visual-Predictive Instruction Tuning (VPiT) - a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into an unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data curated in an instruction-following format. Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual under"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the curated instruction-following multimodal datasets are sufficient to reveal general emergence of generation from understanding and that results will transfer beyond the specific models and data mixtures tested.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Visual generation ability emerges as a natural byproduct of improved visual understanding in instruction-tuned LLMs.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"05e3e8b72041543185682f573e78aa9d0dadf9912ae8421d658cbfec816c7586"},"source":{"id":"2412.14164","kind":"arxiv","version":1},"verdict":{"id":"63a59b41-11c4-4055-bd70-76d4fd244fa8","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T07:46:11.714221Z","strongest_claim":"visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data","one_line_summary":"VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the curated instruction-following multimodal datasets are sufficient to reveal general emergence of generation from understanding and that results will transfer beyond the specific models and data mixtures tested.","pith_extraction_headline":"Visual generation ability emerges as a natural byproduct of improved visual understanding in instruction-tuned LLMs."},"references":{"count":282,"sample":[{"doi":"","year":2024,"title":"Llama 3 model card","work_id":"ac426759-fc95-4576-90e9-cad354462c5a","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Flamingo: a visual language model for few-shot learning","work_id":"90ed68c9-b335-4721-acb0-6953c1542432","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"ICML 2024 Tutorial: Physics of Language Models , 2024","work_id":"0cc1e582-245f-4283-bff2-fda235cc7dda","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Anthropic. Claude, 2024","work_id":"99efe9ce-d918-4dfa-8967-d5987496475d","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2016,"title":"Jimmy Lei Ba, Jamie Kiros, and Geoffrey E. Hinton. Layer normalization. In NeurIPS, 2016","work_id":"47de663f-c464-4369-8f56-87c0dc0e9e56","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":282,"snapshot_sha256":"3e2432c0bd6d74a623fc5e755abdc806532744a2b623f563cb874029a29bc76c","internal_anchors":32},"formal_canon":{"evidence_count":2,"snapshot_sha256":"0132b43be650f9339225ad3466de67150b3c9e69708ec744be1ef730980b8430"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2412.14164","created_at":"2026-05-17T23:38:14.674474+00:00"},{"alias_kind":"arxiv_version","alias_value":"2412.14164v1","created_at":"2026-05-17T23:38:14.674474+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2412.14164","created_at":"2026-05-17T23:38:14.674474+00:00"},{"alias_kind":"pith_short_12","alias_value":"4OXKXQ4PBCCP","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"4OXKXQ4PBCCPHOCV","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"4OXKXQ4P","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":22,"internal_anchor_count":22,"sample":[{"citing_arxiv_id":"2605.21642","citing_title":"Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18160","citing_title":"Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18714","citing_title":"Semantic Generative Tuning for Unified Multimodal Models","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23606","citing_title":"Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2509.21912","citing_title":"Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching","ref_index":75,"is_internal_anchor":true},{"citing_arxiv_id":"2505.05472","citing_title":"Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation","ref_index":76,"is_internal_anchor":true},{"citing_arxiv_id":"2505.16933","citing_title":"LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2512.00993","citing_title":"PhotoFramer: Multi-modal Image Composition Instruction","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2503.07265","citing_title":"WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2506.15564","citing_title":"Show-o2: Improved Native Unified Multimodal Models","ref_index":106,"is_internal_anchor":true},{"citing_arxiv_id":"2506.03147","citing_title":"UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10564","citing_title":"DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving","ref_index":118,"is_internal_anchor":true},{"citing_arxiv_id":"2505.09568","citing_title":"BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2506.21539","citing_title":"WorldVLA: Towards Autoregressive Action World Model","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24625","citing_title":"Meta-CoT: Enhancing Granularity and Generalization in Image Editing","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05781","citing_title":"Steering Visual Generation in Unified Multimodal Models with Understanding Supervision","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2501.17811","citing_title":"Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08029","citing_title":"STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06339","citing_title":"Evolution of Video Generative Foundations","ref_index":146,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04746","citing_title":"Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2505.14683","citing_title":"Emerging Properties in Unified Multimodal Pretraining","ref_index":74,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17375","citing_title":"When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models","ref_index":58,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/4OXKXQ4PBCCPHOCVZK2PKPZTA2","json":"https://pith.science/pith/4OXKXQ4PBCCPHOCVZK2PKPZTA2.json","graph_json":"https://pith.science/api/pith-number/4OXKXQ4PBCCPHOCVZK2PKPZTA2/graph.json","events_json":"https://pith.science/api/pith-number/4OXKXQ4PBCCPHOCVZK2PKPZTA2/events.json","paper":"https://pith.science/paper/4OXKXQ4P"},"agent_actions":{"view_html":"https://pith.science/pith/4OXKXQ4PBCCPHOCVZK2PKPZTA2","download_json":"https://pith.science/pith/4OXKXQ4PBCCPHOCVZK2PKPZTA2.json","view_paper":"https://pith.science/paper/4OXKXQ4P","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2412.14164&json=true","fetch_graph":"https://pith.science/api/pith-number/4OXKXQ4PBCCPHOCVZK2PKPZTA2/graph.json","fetch_events":"https://pith.science/api/pith-number/4OXKXQ4PBCCPHOCVZK2PKPZTA2/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/4OXKXQ4PBCCPHOCVZK2PKPZTA2/action/timestamp_anchor","attest_storage":"https://pith.science/pith/4OXKXQ4PBCCPHOCVZK2PKPZTA2/action/storage_attestation","attest_author":"https://pith.science/pith/4OXKXQ4PBCCPHOCVZK2PKPZTA2/action/author_attestation","sign_citation":"https://pith.science/pith/4OXKXQ4PBCCPHOCVZK2PKPZTA2/action/citation_signature","submit_replication":"https://pith.science/pith/4OXKXQ4PBCCPHOCVZK2PKPZTA2/action/replication_record"}},"created_at":"2026-05-17T23:38:14.674474+00:00","updated_at":"2026-05-17T23:38:14.674474+00:00"}