{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:2KOQ3SMADRV4XAYY3DLSLG76FH","short_pith_number":"pith:2KOQ3SMA","schema_version":"1.0","canonical_sha256":"d29d0dc9801c6bcb8318d8d7259bfe29e407418722613cd139c0b9faa3e3b0fc","source":{"kind":"arxiv","id":"2403.09611","version":4},"attestation_state":"computed","paper":{"title":"MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A careful mix of image-caption, interleaved image-text, and text-only data during pre-training is crucial for state-of-the-art few-shot results in multimodal large language models.","cross_cats":["cs.CL","cs.LG"],"primary_cat":"cs.CV","authors_text":"Alexander Toshev, Ankur Jain, Anton Belyi, Aonan Zhang, Bowen Zhang, Brandon McKinzie, Chong Wang, Dhruti Shah, Doug Kang, Floris Weers, Futang Peng, Guoli Yin, Haotian Zhang, Hongyu H\\`e, Jean-Philippe Fauconnier, Jianyu Wang, Karanjeet Singh, Mark Lee, Max Schwarzer, Nan Du, Peter Grasch, Philipp Dufter, Ruoming Pang, Sam Dodge, Sam Wiseman, Tao Lei, Tom Gunter, Xiang Kong, Xianzhi Du, Yinfei Yang, Zhe Gan, Zirui Wang","submitted_at":"2024-03-14T17:51:32Z","abstract_excerpt":"In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple ben"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2403.09611","kind":"arxiv","version":4},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2024-03-14T17:51:32Z","cross_cats_sorted":["cs.CL","cs.LG"],"title_canon_sha256":"98612f0506b0805073aeaaeaf93f8af49f3f2ccba777087e6dd48a1edd8d0f0a","abstract_canon_sha256":"923a9976303f3c648273dba6d0d92803fad89135dc3e2e95942bff3913bb9ceb"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:49.148244Z","signature_b64":"L+c/1XOwsz1+wEq1yU5lcDksSh9P63OzsYGSDTUmiqWanjwyTdzMXeINbqp/ye3XR/jEy/nzb71Um2EHYgP8AA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"d29d0dc9801c6bcb8318d8d7259bfe29e407418722613cd139c0b9faa3e3b0fc","last_reissued_at":"2026-05-17T23:38:49.147551Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:49.147551Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A careful mix of image-caption, interleaved image-text, and text-only data during pre-training is crucial for state-of-the-art few-shot results in multimodal large language models.","cross_cats":["cs.CL","cs.LG"],"primary_cat":"cs.CV","authors_text":"Alexander Toshev, Ankur Jain, Anton Belyi, Aonan Zhang, Bowen Zhang, Brandon McKinzie, Chong Wang, Dhruti Shah, Doug Kang, Floris Weers, Futang Peng, Guoli Yin, Haotian Zhang, Hongyu H\\`e, Jean-Philippe Fauconnier, Jianyu Wang, Karanjeet Singh, Mark Lee, Max Schwarzer, Nan Du, Peter Grasch, Philipp Dufter, Ruoming Pang, Sam Dodge, Sam Wiseman, Tao Lei, Tom Gunter, Xiang Kong, Xianzhi Du, Yinfei Yang, Zhe Gan, Zirui Wang","submitted_at":"2024-03-14T17:51:32Z","abstract_excerpt":"In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple ben"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"For large-scale multimodal pre-training, a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art few-shot results across multiple benchmarks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the ablations performed are comprehensive enough to isolate the true importance of data composition and image encoder choices without confounding effects from untested interactions or hyperparameter choices.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A careful mix of image-caption, interleaved image-text, and text-only data during pre-training is crucial for state-of-the-art few-shot results in multimodal large language models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"bcd878f5465a8e225a182212643fabbd3819a335f32611ac0995bb0b3e6616af"},"source":{"id":"2403.09611","kind":"arxiv","version":4},"verdict":{"id":"e8a6560f-96b3-48fc-89ba-e052b67b750a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T04:01:36.939836Z","strongest_claim":"For large-scale multimodal pre-training, a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art few-shot results across multiple benchmarks.","one_line_summary":"MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the ablations performed are comprehensive enough to isolate the true importance of data composition and image encoder choices without confounding effects from untested interactions or hyperparameter choices.","pith_extraction_headline":"A careful mix of image-caption, interleaved image-text, and text-only data during pre-training is crucial for state-of-the-art few-shot results in multimodal large language models."},"references":{"count":137,"sample":[{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":1,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2019,"title":"In: ICCV (2019)","work_id":"7c1b8382-b9e1-44a0-a966-63773acaec5c","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro,","work_id":"bce27169-4fab-4916-967e-1f87eeac9fdb","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models","work_id":"87bfa84a-e663-4165-806f-93ef439d88d0","ref_index":4,"cited_arxiv_id":"2308.01390","is_internal_anchor":true},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":5,"cited_arxiv_id":"2308.12966","is_internal_anchor":true}],"resolved_work":137,"snapshot_sha256":"7ef4e9ade704b04fc25f5181b2f44cfb9f6bf1df6c26572b44364f03eed55d5a","internal_anchors":47},"formal_canon":{"evidence_count":1,"snapshot_sha256":"4bbefda8724716d1fdebfb7f51abf7fef21ba16a801f3cb605ec55ff1bf66c1a"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2403.09611","created_at":"2026-05-17T23:38:49.147663+00:00"},{"alias_kind":"arxiv_version","alias_value":"2403.09611v4","created_at":"2026-05-17T23:38:49.147663+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2403.09611","created_at":"2026-05-17T23:38:49.147663+00:00"},{"alias_kind":"pith_short_12","alias_value":"2KOQ3SMADRV4","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"2KOQ3SMADRV4XAYY","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"2KOQ3SMA","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2412.18158","citing_title":"Semantics Disentanglement and Composition for Universal Image Coding with Efficiently LLM Reasoning and Generative Diffusion","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2504.09925","citing_title":"FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2408.04840","citing_title":"mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models","ref_index":178,"is_internal_anchor":true},{"citing_arxiv_id":"2412.14164","citing_title":"MetaMorph: Multimodal Understanding and Generation via Instruction Tuning","ref_index":172,"is_internal_anchor":true},{"citing_arxiv_id":"2406.16860","citing_title":"Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs","ref_index":100,"is_internal_anchor":true},{"citing_arxiv_id":"2602.00104","citing_title":"R3G: A Reasoning--Retrieval--Reranking Framework for Vision-Centric Answer Generation","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2408.13257","citing_title":"MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2306.13549","citing_title":"A Survey on Multimodal Large Language Models","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2505.15809","citing_title":"MMaDA: Multimodal Large Diffusion Language Models","ref_index":69,"is_internal_anchor":true},{"citing_arxiv_id":"2412.03555","citing_title":"PaliGemma 2: A Family of Versatile VLMs for Transfer","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2409.17146","citing_title":"Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models","ref_index":85,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11405","citing_title":"20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14198","citing_title":"MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11405","citing_title":"20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2404.16821","citing_title":"How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites","ref_index":84,"is_internal_anchor":true},{"citing_arxiv_id":"2408.12528","citing_title":"Show-o: One Single Transformer to Unify Multimodal Understanding and Generation","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2407.07726","citing_title":"PaliGemma: A versatile 3B VLM for transfer","ref_index":96,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10985","citing_title":"Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2407.07895","citing_title":"LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2408.01800","citing_title":"MiniCPM-V: A GPT-4V Level MLLM on Your Phone","ref_index":76,"is_internal_anchor":true},{"citing_arxiv_id":"2404.14219","citing_title":"Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2502.14786","citing_title":"SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2406.09246","citing_title":"OpenVLA: An Open-Source Vision-Language-Action Model","ref_index":87,"is_internal_anchor":true},{"citing_arxiv_id":"2408.03326","citing_title":"LLaVA-OneVision: Easy Visual Task Transfer","ref_index":104,"is_internal_anchor":true},{"citing_arxiv_id":"2412.05271","citing_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","ref_index":185,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/2KOQ3SMADRV4XAYY3DLSLG76FH","json":"https://pith.science/pith/2KOQ3SMADRV4XAYY3DLSLG76FH.json","graph_json":"https://pith.science/api/pith-number/2KOQ3SMADRV4XAYY3DLSLG76FH/graph.json","events_json":"https://pith.science/api/pith-number/2KOQ3SMADRV4XAYY3DLSLG76FH/events.json","paper":"https://pith.science/paper/2KOQ3SMA"},"agent_actions":{"view_html":"https://pith.science/pith/2KOQ3SMADRV4XAYY3DLSLG76FH","download_json":"https://pith.science/pith/2KOQ3SMADRV4XAYY3DLSLG76FH.json","view_paper":"https://pith.science/paper/2KOQ3SMA","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2403.09611&json=true","fetch_graph":"https://pith.science/api/pith-number/2KOQ3SMADRV4XAYY3DLSLG76FH/graph.json","fetch_events":"https://pith.science/api/pith-number/2KOQ3SMADRV4XAYY3DLSLG76FH/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/2KOQ3SMADRV4XAYY3DLSLG76FH/action/timestamp_anchor","attest_storage":"https://pith.science/pith/2KOQ3SMADRV4XAYY3DLSLG76FH/action/storage_attestation","attest_author":"https://pith.science/pith/2KOQ3SMADRV4XAYY3DLSLG76FH/action/author_attestation","sign_citation":"https://pith.science/pith/2KOQ3SMADRV4XAYY3DLSLG76FH/action/citation_signature","submit_replication":"https://pith.science/pith/2KOQ3SMADRV4XAYY3DLSLG76FH/action/replication_record"}},"created_at":"2026-05-17T23:38:49.147663+00:00","updated_at":"2026-05-17T23:38:49.147663+00:00"}