{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:SZRBHSLRNVZDJUEZMF5UBKC4W5","short_pith_number":"pith:SZRBHSLR","schema_version":"1.0","canonical_sha256":"966213c9716d7234d099617b40a85cb77984fc3acbebb3591451c6e67aa9b5b8","source":{"kind":"arxiv","id":"2305.03726","version":2},"attestation_state":"computed","paper":{"title":"Otter: A Multi-Modal Model with In-Context Instruction Tuning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Otter improves multi-modal instruction following by training on in-context examples from both text and images or videos.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Bo Li, Fanyi Pu, Jinghao Wang, Jingkang Yang, Joshua Adrian Cahyono, Liangyu Chen, Yuanhan Zhang, Ziwei Liu","submitted_at":"2023-05-05T17:59:46Z","abstract_excerpt":"Recent advances in Large Multimodal Models (LMMs) have unveiled great potential as visual assistants. However, most existing works focus on responding to individual instructions or using previous dialogues for contextual understanding. There is little discussion on employing both images and text as in-context examples to enhance the instruction following capability.\n  To bridge this gap, we introduce the \\textbf{Otter} model to leverage both textual and visual in-context examples for instruction tuning. Specifically, Otter builds upon Flamingo with Perceiver architecture, and has been instruct"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2305.03726","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2023-05-05T17:59:46Z","cross_cats_sorted":["cs.CL"],"title_canon_sha256":"7f1ada7a3f996e919f83d304f27b98700115314307b314a7a04bb90566b62030","abstract_canon_sha256":"f66dfa86f6dda71fb473e6f79af6934e8e44dc9ecb4cee8807ff533506874aec"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:53.781728Z","signature_b64":"eMjFiTgs12kZGeXvSJlbD6Nv1vtjYyOdEbx8Xrkch9QgR0CF/0qY+qRkgA4yZAlJ8o+zjEwdV2B0Y+w7gzxmDQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"966213c9716d7234d099617b40a85cb77984fc3acbebb3591451c6e67aa9b5b8","last_reissued_at":"2026-05-17T23:38:53.781237Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:53.781237Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Otter: A Multi-Modal Model with In-Context Instruction Tuning","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Otter improves multi-modal instruction following by training on in-context examples from both text and images or videos.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Bo Li, Fanyi Pu, Jinghao Wang, Jingkang Yang, Joshua Adrian Cahyono, Liangyu Chen, Yuanhan Zhang, Ziwei Liu","submitted_at":"2023-05-05T17:59:46Z","abstract_excerpt":"Recent advances in Large Multimodal Models (LMMs) have unveiled great potential as visual assistants. However, most existing works focus on responding to individual instructions or using previous dialogues for contextual understanding. There is little discussion on employing both images and text as in-context examples to enhance the instruction following capability.\n  To bridge this gap, we introduce the \\textbf{Otter} model to leverage both textual and visual in-context examples for instruction tuning. Specifically, Otter builds upon Flamingo with Perceiver architecture, and has been instruct"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities. Notably, the extensive scenario coverage provided by the MIMIC-IT dataset empowers the Otter model to excel in tasks involving complex video and multi-image understanding.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the MIMIC-IT dataset's curation of diverse in-context examples across images and videos produces genuine generalization gains rather than dataset-specific improvements, and that the base Flamingo Perceiver architecture seamlessly supports the added multi-modal in-context inputs without hidden limitations.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Otter improves multi-modal instruction following by training on in-context examples from both text and images or videos.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b33bb1d522d9f2c58fe10ea2b497e82c8fde62136b7e893a1742fa13ee0fa660"},"source":{"id":"2305.03726","kind":"arxiv","version":2},"verdict":{"id":"7d51ca55-25cd-4603-b554-e9732f3645ac","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T02:40:09.150748Z","strongest_claim":"instruction tuning with these in-context examples substantially enhances model convergence and generalization capabilities. Notably, the extensive scenario coverage provided by the MIMIC-IT dataset empowers the Otter model to excel in tasks involving complex video and multi-image understanding.","one_line_summary":"Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the MIMIC-IT dataset's curation of diverse in-context examples across images and videos produces genuine generalization gains rather than dataset-specific improvements, and that the base Flamingo Perceiver architecture seamlessly supports the added multi-modal in-context inputs without hidden limitations.","pith_extraction_headline":"Otter improves multi-modal instruction following by training on in-context examples from both text and images or videos."},"references":{"count":104,"sample":[{"doi":"","year":2023,"title":"https://commoncrawl.org/","work_id":"eec7545e-c896-4ba3-8e13-6b19deb355f9","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"What learning algorithm is in-context learning? Investigations with linear models","work_id":"c7ff11dd-6785-4052-a878-ceb418d6f000","ref_index":2,"cited_arxiv_id":"2211.15661","is_internal_anchor":true},{"doi":"","year":2022,"title":"Flamingo: a visual language model for few-shot learning","work_id":"80bfdb3e-04fe-4388-9591-7b8e6f9665a0","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Flamingo: a visual language model for few-shot learning","work_id":"bc11415c-9fcc-43cb-862c-c2b57acb82e5","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2015,"title":"Vqa: Visual question answering","work_id":"752d0e17-6dc9-4e26-8e28-8f32abff46ed","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":104,"snapshot_sha256":"bb3474d6af19c9157873aba7f10d5f51bceaea826f6f12a04c018a6b61ff6a58","internal_anchors":30},"formal_canon":{"evidence_count":2,"snapshot_sha256":"3cc461b4eaae413be7fbf6729bfeace250673f643c90a82d275852facc34917b"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2305.03726","created_at":"2026-05-17T23:38:53.781328+00:00"},{"alias_kind":"arxiv_version","alias_value":"2305.03726v2","created_at":"2026-05-17T23:38:53.781328+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2305.03726","created_at":"2026-05-17T23:38:53.781328+00:00"},{"alias_kind":"pith_short_12","alias_value":"SZRBHSLRNVZD","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"SZRBHSLRNVZDJUEZ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"SZRBHSLR","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":35,"internal_anchor_count":35,"sample":[{"citing_arxiv_id":"2308.12067","citing_title":"MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2504.07148","citing_title":"Q-Agent: Quality-Driven Chain-of-Thought Image Restoration Agent through Robust Multimodal Large Language Model","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15711","citing_title":"EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19950","citing_title":"AffectVerse: Emotional World Models for Multimodal Affective Computing","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15300","citing_title":"Deep Pre-Alignment for VLMs","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2311.04257","citing_title":"mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2310.00754","citing_title":"Analyzing and Mitigating Object Hallucination in Large Vision-Language Models","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2311.17005","citing_title":"MVBench: A Comprehensive Multi-modal Video Understanding Benchmark","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2401.10935","citing_title":"SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents","ref_index":82,"is_internal_anchor":true},{"citing_arxiv_id":"2408.13257","citing_title":"MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2403.09611","citing_title":"MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2306.13549","citing_title":"A Survey on Multimodal Large Language Models","ref_index":182,"is_internal_anchor":true},{"citing_arxiv_id":"2309.14525","citing_title":"Aligning Large Multimodal Models with Factually Augmented RLHF","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2312.17090","citing_title":"Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels","ref_index":224,"is_internal_anchor":true},{"citing_arxiv_id":"2311.16502","citing_title":"MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2409.17146","citing_title":"Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2303.16199","citing_title":"LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2603.27507","citing_title":"Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2406.04264","citing_title":"MLVU: Benchmarking Multi-task Long Video Understanding","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2311.10122","citing_title":"Video-LLaVA: Learning United Visual Representation by Alignment Before Projection","ref_index":106,"is_internal_anchor":true},{"citing_arxiv_id":"2306.14565","citing_title":"Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2308.01390","citing_title":"OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2409.02813","citing_title":"MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2312.14238","citing_title":"InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks","ref_index":80,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03231","citing_title":"CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning","ref_index":33,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/SZRBHSLRNVZDJUEZMF5UBKC4W5","json":"https://pith.science/pith/SZRBHSLRNVZDJUEZMF5UBKC4W5.json","graph_json":"https://pith.science/api/pith-number/SZRBHSLRNVZDJUEZMF5UBKC4W5/graph.json","events_json":"https://pith.science/api/pith-number/SZRBHSLRNVZDJUEZMF5UBKC4W5/events.json","paper":"https://pith.science/paper/SZRBHSLR"},"agent_actions":{"view_html":"https://pith.science/pith/SZRBHSLRNVZDJUEZMF5UBKC4W5","download_json":"https://pith.science/pith/SZRBHSLRNVZDJUEZMF5UBKC4W5.json","view_paper":"https://pith.science/paper/SZRBHSLR","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2305.03726&json=true","fetch_graph":"https://pith.science/api/pith-number/SZRBHSLRNVZDJUEZMF5UBKC4W5/graph.json","fetch_events":"https://pith.science/api/pith-number/SZRBHSLRNVZDJUEZMF5UBKC4W5/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/SZRBHSLRNVZDJUEZMF5UBKC4W5/action/timestamp_anchor","attest_storage":"https://pith.science/pith/SZRBHSLRNVZDJUEZMF5UBKC4W5/action/storage_attestation","attest_author":"https://pith.science/pith/SZRBHSLRNVZDJUEZMF5UBKC4W5/action/author_attestation","sign_citation":"https://pith.science/pith/SZRBHSLRNVZDJUEZMF5UBKC4W5/action/citation_signature","submit_replication":"https://pith.science/pith/SZRBHSLRNVZDJUEZMF5UBKC4W5/action/replication_record"}},"created_at":"2026-05-17T23:38:53.781328+00:00","updated_at":"2026-05-17T23:38:53.781328+00:00"}