{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:RFRSJ5YLLMIJLAMZFAPJE43BNM","short_pith_number":"pith:RFRSJ5YL","schema_version":"1.0","canonical_sha256":"896324f70b5b10958199281e9273616b3b1c9cba067746a5a493e96d395ec151","source":{"kind":"arxiv","id":"2305.16355","version":1},"attestation_state":"computed","paper":{"title":"PandaGPT: One Model To Instruction-Follow Them All","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A single model trained only on image-text pairs can follow instructions on video, audio, depth, and thermal inputs by composing their meanings in a shared embedding space.","cross_cats":["cs.CV"],"primary_cat":"cs.CL","authors_text":"Deng Cai, Huayang Li, Jialu Xu, Tian Lan, Yan Wang, Yixuan Su","submitted_at":"2023-05-25T04:16:07Z","abstract_excerpt":"We present PandaGPT, an approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can perform complex tasks such as detailed image description generation, writing stories inspired by videos, and answering questions about audios. More interestingly, PandaGPT can take multimodal inputs simultaneously and compose their semantics naturally. For example, PandaGPT can connect how objects look in an image/video and how they sound in an audio. 
To do so, PandaGPT combines the multimodal encoders from ImageBind and the"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2305.16355","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2023-05-25T04:16:07Z","cross_cats_sorted":["cs.CV"],"title_canon_sha256":"bbbc8f4530482ee4a7ed90c8764467b8790d3a9d3a102881526d4cdb5d5655bd","abstract_canon_sha256":"a33eb1f75754eee664c50c5a05cfef2ea5b7c32e181ab55bcafee2f43fdb58d5"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:48.430747Z","signature_b64":"B/QM9uBoWogcXjnkmqhnXlQoQ3sxDA9hnLvh1Ust1Pyu1jRxeScz2UZLFfxOAU+CMA8FEljZD6rUdRtT+IO2DA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"896324f70b5b10958199281e9273616b3b1c9cba067746a5a493e96d395ec151","last_reissued_at":"2026-05-17T23:38:48.430132Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:48.430132Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"PandaGPT: One Model To Instruction-Follow Them All","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A single model trained only on image-text pairs can follow instructions on video, audio, depth, and thermal inputs by composing their meanings in a shared embedding space.","cross_cats":["cs.CV"],"primary_cat":"cs.CL","authors_text":"Deng Cai, Huayang Li, Jialu Xu, Tian Lan, Yan Wang, Yixuan Su","submitted_at":"2023-05-25T04:16:07Z","abstract_excerpt":"We present PandaGPT, an 
approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can perform complex tasks such as detailed image description generation, writing stories inspired by videos, and answering questions about audios. More interestingly, PandaGPT can take multimodal inputs simultaneously and compose their semantics naturally. For example, PandaGPT can connect how objects look in an image/video and how they sound in an audio. To do so, PandaGPT combines the multimodal encoders from ImageBind and the"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"PandaGPT displays emergent, i.e. zero-shot, cross-modal behaviors for data other than image and text (e.g., video, audio, depth, thermal, and IMU) and can take multimodal inputs simultaneously and compose their semantics naturally.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That ImageBind's embedding space is already semantically rich enough for the language model to compose meanings across modalities without any further alignment training on those modalities.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A single model trained only on image-text pairs gains instruction-following ability across images, video, and audio by routing all modalities through ImageBind's shared embedding space into Vicuna.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A single model trained only on image-text pairs can follow instructions on video, audio, depth, and thermal inputs by composing their meanings in a shared embedding 
space.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"925da561fe9ffe37f1187f7250a02b718a7486712ccb9d1f1189f9c333c3bc4e"},"source":{"id":"2305.16355","kind":"arxiv","version":1},"verdict":{"id":"5e0eb80b-0917-4302-b4b4-90bbb09d44c1","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T08:58:06.902090Z","strongest_claim":"PandaGPT displays emergent, i.e. zero-shot, cross-modal behaviors for data other than image and text (e.g., video, audio, depth, thermal, and IMU) and can take multimodal inputs simultaneously and compose their semantics naturally.","one_line_summary":"A single model trained only on image-text pairs gains instruction-following ability across images, video, and audio by routing all modalities through ImageBind's shared embedding space into Vicuna.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That ImageBind's embedding space is already semantically rich enough for the language model to compose meanings across modalities without any further alignment training on those modalities.","pith_extraction_headline":"A single model trained only on image-text pairs can follow instructions on video, audio, depth, and thermal inputs by composing their meanings in a shared embedding space."},"references":{"count":32,"sample":[{"doi":"","year":2022,"title":"Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model","work_id":"99c9825d-d8aa-4d56-9301-b5cac88e2bb4","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. 2020. 
Self-supervised multimodal","work_id":"f80752c3-199b-4209-a52f-91a90fb91770","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot lear","work_id":"0ee6a9fc-348e-411b-945a-c5820c50b0b1","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Gonzalez, Ion Stoica, and Eric P","work_id":"eb89ea20-08e6-4a9e-9f97-6330cab3e994","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the N","work_id":"62467c4c-275a-48fb-a48c-2d95503573e0","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":32,"snapshot_sha256":"9348f680c6e4af2af79bde822d07cb39360040d5299e0413d722c64c593c6c5f","internal_anchors":8},"formal_canon":{"evidence_count":2,"snapshot_sha256":"b65acc071b63869295fd48ec3472d576acaea9f112fd4298df924112f4d67867"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2305.16355","created_at":"2026-05-17T23:38:48.430224+00:00"},{"alias_kind":"arxiv_version","alias_value":"2305.16355v1","created_at":"2026-05-17T23:38:48.430224+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2305.16355","created_at":"2026-05-17T23:38:48.430224+00:00"},{"alias_kind":"pith_short_12","alias_value":"RFRSJ5YLLMIJ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"RFRSJ5YLLMIJLAMZ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"p
ith_short_8","alias_value":"RFRSJ5YL","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":29,"internal_anchor_count":29,"sample":[{"citing_arxiv_id":"2605.17360","citing_title":"Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19950","citing_title":"AffectVerse: Emotional World Models for Multimodal Affective Computing","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2506.18962","citing_title":"UniMind: Unleashing the Power of LLMs for Unified Multi-Task Brain Decoding","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2510.15148","citing_title":"XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2310.13289","citing_title":"SALMONN: Towards Generic Hearing Abilities for Large Language Models","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2311.07575","citing_title":"SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2403.00476","citing_title":"TempCompass: Do Video LLMs Really Understand Videos?","ref_index":119,"is_internal_anchor":true},{"citing_arxiv_id":"2512.02231","citing_title":"See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2403.14624","citing_title":"MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2407.01284","citing_title":"We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2602.12286","citing_title":"Mind the Gap No 
More: Achieving Zero-Gap Multimodal Integration via One Tokenizer","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23747","citing_title":"Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2306.13549","citing_title":"A Survey on Multimodal Large Language Models","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2604.00013","citing_title":"C2F-Thinker: Coarse-to-Fine Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2603.17980","citing_title":"Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02605","citing_title":"Do Audio-Visual Large Language Models Really See and Hear?","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03995","citing_title":"A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2307.06281","citing_title":"MMBench: Is Your Multi-modal Model an All-around Player?","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2307.16125","citing_title":"SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27968","citing_title":"ClimateVID -- Social Media Videos Analysis and Challenges Involved","ref_index":68,"is_internal_anchor":true},{"citing_arxiv_id":"2404.18930","citing_title":"Hallucination of Multimodal Large Language Models: A Survey","ref_index":151,"is_internal_anchor":true},{"citing_arxiv_id":"2309.07864","citing_title":"The Rise and Potential of Large Language Model Based Agents: A Survey","ref_index":292,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12735","citing_title":"AffectAgent: Collaborative 
Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11283","citing_title":"Empowering Video Translation using Multimodal Large Language Models","ref_index":178,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07490","citing_title":"Cross-Modal Backdoors in Multimodal Large Language Models","ref_index":5,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/RFRSJ5YLLMIJLAMZFAPJE43BNM","json":"https://pith.science/pith/RFRSJ5YLLMIJLAMZFAPJE43BNM.json","graph_json":"https://pith.science/api/pith-number/RFRSJ5YLLMIJLAMZFAPJE43BNM/graph.json","events_json":"https://pith.science/api/pith-number/RFRSJ5YLLMIJLAMZFAPJE43BNM/events.json","paper":"https://pith.science/paper/RFRSJ5YL"},"agent_actions":{"view_html":"https://pith.science/pith/RFRSJ5YLLMIJLAMZFAPJE43BNM","download_json":"https://pith.science/pith/RFRSJ5YLLMIJLAMZFAPJE43BNM.json","view_paper":"https://pith.science/paper/RFRSJ5YL","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2305.16355&json=true","fetch_graph":"https://pith.science/api/pith-number/RFRSJ5YLLMIJLAMZFAPJE43BNM/graph.json","fetch_events":"https://pith.science/api/pith-number/RFRSJ5YLLMIJLAMZFAPJE43BNM/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/RFRSJ5YLLMIJLAMZFAPJE43BNM/action/timestamp_anchor","attest_storage":"https://pith.science/pith/RFRSJ5YLLMIJLAMZFAPJE43BNM/action/storage_attestation","attest_author":"https://pith.science/pith/RFRSJ5YLLMIJLAMZFAPJE43BNM/action/author_attestation","sign_citation":"https://pith.science/pith/RFRSJ5YLLMIJLAMZFAPJE43BNM/action/citation_signature","submit_replication":"https://pith.science/pith/RFRSJ5YLLMIJLAMZFAPJE43BNM/action/replication_record"}},"created_at":"2026-05-17T23:38:48.430224+00:00","updated_at":"2026-05-17T23:38:48.430224+00:00"}