{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:UMTFWY7FF7LPB4AG4LC3PPN4NN","short_pith_number":"pith:UMTFWY7F","schema_version":"1.0","canonical_sha256":"a3265b63e52fd6f0f006e2c5b7bdbc6b59ff9c3187e6dd8a99d4b5d12ca0d596","source":{"kind":"arxiv","id":"2311.10122","version":3},"attestation_state":"computed","paper":{"title":"Video-LLaVA: Learning United Visual Representation by Alignment Before Projection","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"By aligning images and videos into the language feature space before projection, a single LLM processes both modalities and lets them improve each other.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Bin Lin, Bin Zhu, Jiaxi Cui, Li Yuan, Munan Ning, Peng Jin, Yang Ye","submitted_at":"2023-11-16T10:59:44Z","abstract_excerpt":"The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the found"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2311.10122","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2023-11-16T10:59:44Z","cross_cats_sorted":[],"title_canon_sha256":"169b24f471d2208db1ce36173b5691902e0fd44518285d76760c7236864b0685","abstract_canon_sha256":"c80295a5762ecdeb6e65c5f49691842dea7b0fe27da82d7476d300d10c866324"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:22.232498Z","signature_b64":"Mv5IzXuboGLvQpANpwzEHBK9a0MKyollJFy2nrJXcRvXtzJS2UTaxRQIE3PObq5hhWM/S8YcNpdVIX3txITnDQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"a3265b63e52fd6f0f006e2c5b7bdbc6b59ff9c3187e6dd8a99d4b5d12ca0d596","last_reissued_at":"2026-05-17T23:39:22.231807Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:22.231807Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Video-LLaVA: Learning United Visual Representation by Alignment Before Projection","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"By aligning images and videos into the language feature space before projection, a single LLM processes both modalities and lets them improve each other.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Bin Lin, Bin Zhu, Jiaxi Cui, Li Yuan, Munan Ning, Peng Jin, Yang Ye","submitted_at":"2023-11-16T10:59:44Z","abstract_excerpt":"The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the found"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"By aligning images and videos into the language feature space before projection, a single LLM processes both modalities and lets them improve each other.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"f2b3bac15ee6a45f3d1362e8edf3ba5003a44ec70f4824434b8a35147eb74b6c"},"source":{"id":"2311.10122","kind":"arxiv","version":3},"verdict":{"id":"5a5bd3af-90b1-4325-aaba-e920b4087d41","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T18:00:44.719539Z","strongest_claim":"we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other.","one_line_summary":"Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers.","pith_extraction_headline":"By aligning images and videos into the language feature space before projection, a single LLM processes both modalities and lets them improve each other."},"references":{"count":87,"sample":[{"doi":"","year":2022,"title":"Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model","work_id":"af714d03-fb34-46cc-9760-9ea257b01f78","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Max Bain, Arsha Nagrani, G \\\"u l Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on","work_id":"22f85f0f-8960-47a7-905b-bcc8a2bd58d4","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot lear","work_id":"50684699-ce18-4086-8bac-7cecd178fad0","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2011,"title":"David Chen and William B Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th annual meeting of the association for computational linguistics: human langu","work_id":"a7625cc6-f851-46ec-8b30-c2ba62f4d93a","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 wi","work_id":"32e12fcf-bb8e-4e6a-a249-13cd0e7d6e3f","ref_index":8,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":87,"snapshot_sha256":"dbcc5bd6d4f0f7258d0fe8ec6e440f6997594d8719a2176a8137e66cc8a4a412","internal_anchors":28},"formal_canon":{"evidence_count":1,"snapshot_sha256":"3c56d0259743582676dd476e1e36786cab235bdfc345a80f6bb28cfddd36bd3d"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2311.10122","created_at":"2026-05-17T23:39:22.231917+00:00"},{"alias_kind":"arxiv_version","alias_value":"2311.10122v3","created_at":"2026-05-17T23:39:22.231917+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2311.10122","created_at":"2026-05-17T23:39:22.231917+00:00"},{"alias_kind":"pith_short_12","alias_value":"UMTFWY7FF7LP","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"UMTFWY7FF7LPB4AG","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"UMTFWY7F","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":69,"internal_anchor_count":69,"sample":[{"citing_arxiv_id":"2504.09583","citing_title":"AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2312.11805","citing_title":"Gemini: A Family of Highly Capable Multimodal Models","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2407.08101","citing_title":"What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2412.00131","citing_title":"Open-Sora Plan: Open-Source Large Video Generation Model","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2501.05067","citing_title":"LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2503.09158","citing_title":"FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2503.14075","citing_title":"Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2505.15269","citing_title":"LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21625","citing_title":"Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22269","citing_title":"MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23747","citing_title":"Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2509.03526","citing_title":"Enhancing Speech Large Language Models through Reinforced Behavior Alignment","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2601.01593","citing_title":"Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20950","citing_title":"Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18904","citing_title":"Dynamic Model Merging Made Slim","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17283","citing_title":"OProver: A Unified Framework for Agentic Formal Theorem Proving","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19950","citing_title":"AffectVerse: Emotional World Models for Multimodal Affective Computing","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2408.04840","citing_title":"mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models","ref_index":226,"is_internal_anchor":true},{"citing_arxiv_id":"2509.15602","citing_title":"TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2501.01957","citing_title":"VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2407.03320","citing_title":"InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output","ref_index":78,"is_internal_anchor":true},{"citing_arxiv_id":"2502.04326","citing_title":"WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2511.21998","citing_title":"Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2511.22125","citing_title":"GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuningin Video-Language Models","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2511.19972","citing_title":"Boosting Reasoning in Large Multimodal Models via Activation Replay","ref_index":21,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/UMTFWY7FF7LPB4AG4LC3PPN4NN","json":"https://pith.science/pith/UMTFWY7FF7LPB4AG4LC3PPN4NN.json","graph_json":"https://pith.science/api/pith-number/UMTFWY7FF7LPB4AG4LC3PPN4NN/graph.json","events_json":"https://pith.science/api/pith-number/UMTFWY7FF7LPB4AG4LC3PPN4NN/events.json","paper":"https://pith.science/paper/UMTFWY7F"},"agent_actions":{"view_html":"https://pith.science/pith/UMTFWY7FF7LPB4AG4LC3PPN4NN","download_json":"https://pith.science/pith/UMTFWY7FF7LPB4AG4LC3PPN4NN.json","view_paper":"https://pith.science/paper/UMTFWY7F","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2311.10122&json=true","fetch_graph":"https://pith.science/api/pith-number/UMTFWY7FF7LPB4AG4LC3PPN4NN/graph.json","fetch_events":"https://pith.science/api/pith-number/UMTFWY7FF7LPB4AG4LC3PPN4NN/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/UMTFWY7FF7LPB4AG4LC3PPN4NN/action/timestamp_anchor","attest_storage":"https://pith.science/pith/UMTFWY7FF7LPB4AG4LC3PPN4NN/action/storage_attestation","attest_author":"https://pith.science/pith/UMTFWY7FF7LPB4AG4LC3PPN4NN/action/author_attestation","sign_citation":"https://pith.science/pith/UMTFWY7FF7LPB4AG4LC3PPN4NN/action/citation_signature","submit_replication":"https://pith.science/pith/UMTFWY7FF7LPB4AG4LC3PPN4NN/action/replication_record"}},"created_at":"2026-05-17T23:39:22.231917+00:00","updated_at":"2026-05-17T23:39:22.231917+00:00"}