{"paper":{"title":"The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"GPT-4V processes arbitrarily interleaved multimodal inputs to function as a multimodal generalist system","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Chung-Ching Lin, Jianfeng Wang, Kevin Lin, Lijuan Wang, Linjie Li, Zhengyuan Yang, Zicheng Liu","submitted_at":"2023-09-29T17:34:51Z","abstract_excerpt":"Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the understanding of LMMs. The analysis focuses on the intriguing tasks that GPT-4V can perform, containing test samples to probe the quality and genericity of GPT-4V's capabilities, its supported inputs and working modes, and the effective ways to prompt the model. In our approach to exploring GPT-4V, we curate and organize a collection of carefully designed qua"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Observations from these samples demonstrate that GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the authors' hand-curated qualitative samples are representative enough to establish genericity and quality without quantitative benchmarks or controlled comparisons.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI 
interaction.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"GPT-4V processes arbitrarily interleaved multimodal inputs to function as a multimodal generalist system","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"2edf31b0bc266fe59db181c18996c7d14a2d29c1d0e200067180a07d22e88480"},"source":{"id":"2309.17421","kind":"arxiv","version":2},"verdict":{"id":"f2725e2c-7377-4b1e-b5b2-bc56b54ce40e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T23:22:38.619369Z","strongest_claim":"Observations from these samples demonstrate that GPT-4V's unprecedented ability in processing arbitrarily interleaved multimodal inputs and the genericity of its capabilities together make GPT-4V a powerful multimodal generalist system.","one_line_summary":"GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the authors' hand-curated qualitative samples are representative enough to establish genericity and quality without quantitative benchmarks or controlled comparisons.","pith_extraction_headline":"GPT-4V processes arbitrarily interleaved multimodal inputs to function as a multimodal generalist system"},"references":{"count":160,"sample":[{"doi":"","year":2023,"title":"https://openai.com/blog/chatgpt-can-now-see-hear-and-speak, 2023","work_id":"35ab23ac-eb48-4ba1-aed5-48df69df9ff0","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Deepfloyd if. https://github.com/deep-floyd/IF, 2023","work_id":"2d3a7b46-a054-458f-8990-ac38d4c99efd","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Guidance. 
https://github.com/microsoft/guidance/, 2023","work_id":"98a2d882-61bf-4897-88b5-c13ba133dedc","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Midjourney. https://www.midjourney.com/, 2023","work_id":"862286be-a5ac-41c3-9be2-5bcc00f44797","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2011,"title":"Building rome in a day","work_id":"e957ed10-08b0-4b1d-a9fd-2d84342d4e2c","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":160,"snapshot_sha256":"eecd12b7c90b7da3920833d6dc1b5ea6731b7ec3b61fc942968ade52ba2bdbb9","internal_anchors":35},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}