{"paper":{"title":"MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A textual prompt design lets ChatGPT collaborate with vision experts to handle advanced multimodal reasoning and action in zero-shot settings.","cross_cats":["cs.CL","cs.LG"],"primary_cat":"cs.CV","authors_text":"Ce Liu, Ehsan Azarnasab, Faisal Ahmed, Jianfeng Wang, Kevin Lin, Lijuan Wang, Linjie Li, Michael Zeng, Zhengyuan Yang, Zicheng Liu","submitted_at":"2023-03-20T18:31:47Z","abstract_excerpt":"We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action. In this paper, we define and explore a comprehensive list of advanced vision tasks that are intriguing to solve, but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. 
MM-REACT's prompt design allows lan"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interests and its wide application in different scenarios that require advanced visual understanding.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The textual prompt design can faithfully represent and allow language models to process dense visual signals such as images and videos without loss of critical information.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A textual prompt design lets ChatGPT collaborate with vision experts to handle advanced multimodal reasoning and action in zero-shot settings.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"9d7f11d0f7de4153634e5044445836d36d2ec36322404c5ff59d92a988d63b19"},"source":{"id":"2303.11381","kind":"arxiv","version":1},"verdict":{"id":"b74c6a6f-c277-4064-92f1-c74bac347ebb","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T01:12:43.672888Z","strongest_claim":"Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interests and its wide application in different scenarios that require advanced visual understanding.","one_line_summary":"MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual 
tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The textual prompt design can faithfully represent and allow language models to process dense visual signals such as images and videos without loss of critical information.","pith_extraction_headline":"A textual prompt design lets ChatGPT collaborate with vision experts to handle advanced multimodal reasoning and action in zero-shot settings."},"references":{"count":60,"sample":[{"doi":"","year":2022,"title":"Do As I Can, Not As I Say: Grounding Language in Robotic Affordances","work_id":"037320f1-b0a9-4cbe-a639-bfb25409ce71","ref_index":1,"cited_arxiv_id":"2204.01691","is_internal_anchor":true},{"doi":"","year":2022,"title":"Flamingo: a Visual Language Model for Few-Shot Learning","work_id":"a110f764-38dc-41b2-a802-53744ecea1fc","ref_index":2,"cited_arxiv_id":"2204.14198","is_internal_anchor":true},{"doi":"","year":2020,"title":"Language models are few-shot learners","work_id":"9d80ea01-55d6-44ef-a69e-0ca87cabb940","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"End-to-end object detection with transformers","work_id":"8376a687-7c64-4343-88ee-9ab3141e93a7","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Harrison Chase. Langchain. https://langchain.readthedocs.io/, 2023.","work_id":"28eb406b-9a34-43fa-94df-ca72fd959644","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":60,"snapshot_sha256":"0c88837ac336b5a608d05b739f13655b39808a38ed12b8adc144911be5f17fbb","internal_anchors":18},"formal_canon":{"evidence_count":2,"snapshot_sha256":"eaed9143d2045250ee98f9ffdd88129e7a4f5d96c1f76f99d18c8a3e7d5d81dd"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}