pith:RFRSJ5YL
PandaGPT: One Model To Instruction-Follow Them All
A single model trained only on image-text pairs can follow instructions on video, audio, depth, and thermal inputs by composing their meanings in a shared embedding space.
arxiv:2305.16355 v1 · 2023-05-25 · cs.CL · cs.CV
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{RFRSJ5YLLMIJLAMZFAPJE43BNM}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
PandaGPT displays emergent (i.e., zero-shot) cross-modal behaviors for data other than image and text (e.g., video, audio, depth, thermal, and IMU), and it can take multimodal inputs simultaneously and compose their semantics naturally.
It assumes that ImageBind's embedding space is already semantically rich enough for the language model to compose meanings across modalities without any further alignment training on those modalities.
A single model trained only on image-text pairs gains instruction-following ability across images, video, and audio by routing all modalities through ImageBind's shared embedding space into Vicuna.
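The routing these claims describe (modality encoder, shared embedding space, language model) can be sketched as below. This is a minimal illustration under stated assumptions, not PandaGPT's code: the single linear projection, the rule of summing unit-normalised embeddings to combine modalities, the single prepended "modality token", and the toy dimensions are all assumptions, and every function name is hypothetical.

```python
import math
import random

# Toy widths chosen for illustration only (assumptions, not PandaGPT's sizes).
EMBED_DIM = 4   # stands in for the width of ImageBind's shared embedding space
HIDDEN_DIM = 6  # stands in for the width of Vicuna's input token embeddings


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def compose(embeddings):
    """Hypothetical composition rule: sum unit-normalised embeddings from
    several modalities, then renormalise the result."""
    unit = [l2_normalize(e) for e in embeddings]
    summed = [sum(column) for column in zip(*unit)]
    return l2_normalize(summed)


def make_projection(seed=0):
    """Randomly initialised linear layer (weight, bias); in a trained model
    these parameters would be learned from image-text instruction data."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.1) for _ in range(EMBED_DIM)] for _ in range(HIDDEN_DIM)]
    bias = [0.0] * HIDDEN_DIM
    return weight, bias


def project(embedding, weight, bias):
    """Map one shared-space embedding to one LLM input vector: W @ x + b."""
    return [sum(w * x for w, x in zip(row, embedding)) + b for row, b in zip(weight, bias)]


def build_llm_inputs(modal_embeddings, prompt_vectors, weight, bias):
    """Prepend a single projected 'modality token' to the embedded text prompt."""
    modality_token = project(compose(modal_embeddings), weight, bias)
    return [modality_token] + prompt_vectors


# Usage: an audio clip and an image, assumed already encoded into the shared space.
audio_emb = [0.2, 0.9, 0.1, 0.0]
image_emb = [0.7, 0.1, 0.6, 0.3]
weight, bias = make_projection()
prompt = [[0.0] * HIDDEN_DIM for _ in range(3)]  # three placeholder prompt-token vectors
sequence = build_llm_inputs([audio_emb, image_emb], prompt, weight, bias)
print(len(sequence), len(sequence[0]))  # 4 6
```

Because the projection only ever sees vectors from the shared space, the same learned layer applies whichever encoder produced them; that is the property the zero-shot claim relies on.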
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-17T23:38:48.430132Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
896324f70b5b10958199281e9273616b3b1c9cba067746a5a493e96d395ec151
Verify this Pith Number yourself
```shell
curl -sH 'Accept: application/ld+json' https://pith.science/pith/RFRSJ5YLLMIJLAMZFAPJE43BNM \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 896324f70b5b10958199281e9273616b3b1c9cba067746a5a493e96d395ec151
```
Canonical record JSON
```json
{
  "metadata": {
    "abstract_canon_sha256": "a33eb1f75754eee664c50c5a05cfef2ea5b7c32e181ab55bcafee2f43fdb58d5",
    "cross_cats_sorted": [
      "cs.CV"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-05-25T04:16:07Z",
    "title_canon_sha256": "bbbc8f4530482ee4a7ed90c8764467b8790d3a9d3a102881526d4cdb5d5655bd"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2305.16355",
    "kind": "arxiv",
    "version": 1
  }
}
```
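The same canonicalisation can be run offline against the record shown above, without fetching it. This sketch mirrors the serialisation rules of the Python one-liner in the verification command (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes). It assumes the displayed record is byte-for-byte identical in content to the served `canonical_record`; if it is not, the digest will differ, so the script reports whether it matches rather than asserting it. The helper name `canonical_sha256` is hypothetical.

```python
import hashlib
import json

# Expected value copied from the "Canonical hash" field above.
EXPECTED = "896324f70b5b10958199281e9273616b3b1c9cba067746a5a493e96d395ec151"


def canonical_sha256(record):
    """Hash a JSON value the same way as the one-liner above: sorted keys,
    no insignificant whitespace, non-ASCII kept as-is, UTF-8 bytes."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The canonical record as displayed on this page (assumed identical to the
# served canonical_record; any difference changes the hash).
record = {
    "metadata": {
        "abstract_canon_sha256": "a33eb1f75754eee664c50c5a05cfef2ea5b7c32e181ab55bcafee2f43fdb58d5",
        "cross_cats_sorted": ["cs.CV"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2023-05-25T04:16:07Z",
        "title_canon_sha256": "bbbc8f4530482ee4a7ed90c8764467b8790d3a9d3a102881526d4cdb5d5655bd",
    },
    "schema_version": "1.0",
    "source": {"id": "2305.16355", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators makes the hash independent of how the server or `jq` happened to order or space the fields, so two honest serialisations of the same record always agree.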