{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:GPX7CXXA6TXOGXF3CZZDBISFOU","short_pith_number":"pith:GPX7CXXA","schema_version":"1.0","canonical_sha256":"33eff15ee0f4eee35cbb167230a245751ccd247696963e60d90fa53cff881e9d","source":{"kind":"arxiv","id":"2408.10188","version":6},"attestation_state":"computed","paper":{"title":"LongVILA: Scaling Long-Context Visual Language Models for Long Videos","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"LongVILA scales visual-language models from 8 to 2048 video frames while reaching 99.8 percent accuracy on million-token needle-in-a-haystack retrieval.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Dacheng Li, Ethan He, Fuzhao Xue, Haotian Tang, Hongxu Yin, Jan Kautz, Ligeng Zhu, Linxi Fan, Pavlo Molchanov, Qinghao Hu, Shang Yang, Song Han, Xiuyu Li, Yao Lu, Yukang Chen, Yuke Zhu, Yunhao Fang, Zhijian Liu","submitted_at":"2024-08-19T17:48:08Z","abstract_excerpt":"Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models by co-designing the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long video supervised fine-tuning. However, training on long video is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently para"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2408.10188","kind":"arxiv","version":6},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2024-08-19T17:48:08Z","cross_cats_sorted":["cs.CL"],"title_canon_sha256":"6d52f40ddef7d6f16dcc45e3011f7dd835c2e5bebf78daa88a6e7266486b643d","abstract_canon_sha256":"cb75cb75920d79568c4cce055daca563fbae53b13cea3879ed77e41e712417df"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:15.213291Z","signature_b64":"Y2KHMrLtSz4geF9kNWMIgRhpEoLItyVgUnEHqNKLmEvbHZLNDlTwY0ZMnUeNQBnzEulLUlpY9ZjkclrtI719Cw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"33eff15ee0f4eee35cbb167230a245751ccd247696963e60d90fa53cff881e9d","last_reissued_at":"2026-05-17T23:38:15.212631Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:15.212631Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"LongVILA: Scaling Long-Context Visual Language Models for Long Videos","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"LongVILA scales visual-language models from 8 to 2048 video frames while reaching 99.8 percent accuracy on million-token needle-in-a-haystack retrieval.","cross_cats":["cs.CL"],"primary_cat":"cs.CV","authors_text":"Dacheng Li, Ethan He, Fuzhao Xue, Haotian Tang, Hongxu Yin, Jan Kautz, Ligeng Zhu, Linxi Fan, Pavlo Molchanov, Qinghao Hu, Shang Yang, Song Han, Xiuyu Li, Yao Lu, Yukang Chen, Yuke Zhu, Yunhao Fang, Zhijian Liu","submitted_at":"2024-08-19T17:48:08Z","abstract_excerpt":"Long-context capability is critical for multi-modal foundation models, especially for long video understanding. We introduce LongVILA, a full-stack solution for long-context visual-language models by co-designing the algorithm and system. For model training, we upgrade existing VLMs to support long video understanding by incorporating two additional stages, i.e., long context extension and long video supervised fine-tuning. However, training on long video is computationally and memory intensive. We introduce the long-context Multi-Modal Sequence Parallelism (MM-SP) system that efficiently para"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, achieving 99.8% accuracy in 6,000-frame (more than 1 million tokens) video needle-in-a-haystack.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the two-stage training process (long context extension followed by long video supervised fine-tuning) combined with MM-SP will scale to long videos while preserving accuracy and efficiency without hidden performance regressions or unstated data selection effects.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"LongVILA scales visual-language models from 8 to 2048 video frames while reaching 99.8 percent accuracy on million-token needle-in-a-haystack retrieval.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"67b1878b85f3b1f0ba695de0098c1a1014469d1851d1b8a5cb92b5f5118dd038"},"source":{"id":"2408.10188","kind":"arxiv","version":6},"verdict":{"id":"3f48e89f-bade-42fe-a94e-6f11795971f8","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T03:47:16.925127Z","strongest_claim":"LongVILA efficiently extends the number of video frames of VILA from 8 to 2048, achieving 99.8% accuracy in 6,000-frame (more than 1 million tokens) video needle-in-a-haystack.","one_line_summary":"LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the two-stage training process (long context extension followed by long video supervised fine-tuning) combined with MM-SP will scale to long videos while preserving accuracy and efficiency without hidden performance regressions or unstated data selection effects.","pith_extraction_headline":"LongVILA scales visual-language models from 8 to 2048 video frames while reaching 99.8 percent accuracy on million-token needle-in-a-haystack retrieval."},"references":{"count":32,"sample":[{"doi":"","year":null,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":1,"cited_arxiv_id":"2308.12966","is_internal_anchor":true},{"doi":"","year":null,"title":"RT-1: Robotics Transformer for Real-World Control at Scale","work_id":"e11bda85-8531-46bc-a07f-d0ade3643ab1","ref_index":2,"cited_arxiv_id":"2212.06817","is_internal_anchor":true},{"doi":"","year":null,"title":"RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control","work_id":"ff438a8a-8003-4fae-9131-acd418b3597b","ref_index":3,"cited_arxiv_id":"2307.15818","is_internal_anchor":true},{"doi":"","year":1901,"title":"Language models are few-shot learners","work_id":"f93ff324-f230-4a46-97b9-6b103c35585d","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Sharegpt4video: Improving video understanding and generation with better captions","work_id":"22138421-9fc7-4d3e-8a5b-90fd9a0a6c97","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":32,"snapshot_sha256":"6e341a9383a2d1f5a6f14726919cb95a735bbd18228c5ba8bdfe90f04957e5c1","internal_anchors":17},"formal_canon":{"evidence_count":2,"snapshot_sha256":"ad10d04b43aa7112a1f191c98eba5ea4cb5cdcb28dfc11c3efeee13a5087e9c8"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2408.10188","created_at":"2026-05-17T23:38:15.212725+00:00"},{"alias_kind":"arxiv_version","alias_value":"2408.10188v6","created_at":"2026-05-17T23:38:15.212725+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2408.10188","created_at":"2026-05-17T23:38:15.212725+00:00"},{"alias_kind":"pith_short_12","alias_value":"GPX7CXXA6TXO","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"GPX7CXXA6TXOGXF3","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"GPX7CXXA","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":24,"internal_anchor_count":24,"sample":[{"citing_arxiv_id":"2412.04468","citing_title":"NVILA: Efficient Frontier Visual Language Models","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2412.14171","citing_title":"Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces","ref_index":91,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22678","citing_title":"Swift Sampling: Selecting Temporal Surprises via Taylor Series","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23747","citing_title":"Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2501.00574","citing_title":"VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2501.12386","citing_title":"InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2601.15724","citing_title":"VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23747","citing_title":"Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2507.08128","citing_title":"Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2603.27437","citing_title":"SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13831","citing_title":"Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02371","citing_title":"Internalized Reasoning for Long-Context Visual Document Understanding","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2505.13211","citing_title":"MAGI-1: Autoregressive Video Generation at Scale","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09223","citing_title":"CATS: Curvature Aware Temporal Selection for efficient long video understanding","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03276","citing_title":"VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03276","citing_title":"VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05848","citing_title":"VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19564","citing_title":"EgoSelf: From Memory to Personalized Egocentric Assistant","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05848","citing_title":"VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08120","citing_title":"Small Vision-Language Models are Smart Compressors for Long Video Understanding","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2501.03575","citing_title":"Cosmos World Foundation Model Platform for Physical AI","ref_index":227,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04372","citing_title":"Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14149","citing_title":"One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17087","citing_title":"EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling","ref_index":6,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/GPX7CXXA6TXOGXF3CZZDBISFOU","json":"https://pith.science/pith/GPX7CXXA6TXOGXF3CZZDBISFOU.json","graph_json":"https://pith.science/api/pith-number/GPX7CXXA6TXOGXF3CZZDBISFOU/graph.json","events_json":"https://pith.science/api/pith-number/GPX7CXXA6TXOGXF3CZZDBISFOU/events.json","paper":"https://pith.science/paper/GPX7CXXA"},"agent_actions":{"view_html":"https://pith.science/pith/GPX7CXXA6TXOGXF3CZZDBISFOU","download_json":"https://pith.science/pith/GPX7CXXA6TXOGXF3CZZDBISFOU.json","view_paper":"https://pith.science/paper/GPX7CXXA","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2408.10188&json=true","fetch_graph":"https://pith.science/api/pith-number/GPX7CXXA6TXOGXF3CZZDBISFOU/graph.json","fetch_events":"https://pith.science/api/pith-number/GPX7CXXA6TXOGXF3CZZDBISFOU/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/GPX7CXXA6TXOGXF3CZZDBISFOU/action/timestamp_anchor","attest_storage":"https://pith.science/pith/GPX7CXXA6TXOGXF3CZZDBISFOU/action/storage_attestation","attest_author":"https://pith.science/pith/GPX7CXXA6TXOGXF3CZZDBISFOU/action/author_attestation","sign_citation":"https://pith.science/pith/GPX7CXXA6TXOGXF3CZZDBISFOU/action/citation_signature","submit_replication":"https://pith.science/pith/GPX7CXXA6TXOGXF3CZZDBISFOU/action/replication_record"}},"created_at":"2026-05-17T23:38:15.212725+00:00","updated_at":"2026-05-17T23:38:15.212725+00:00"}