{"paper":{"title":"LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding","license":"http://creativecommons.org/licenses/by/4.0/","headline":"LongVU adaptively compresses long videos by removing redundant frames and tokens to fit hour-long clips into limited LLM context.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Balakrishnan Varadarajan, Bilge Soran, Changsheng Zhao, Chenchen Zhu, Fanyi Xiao, Florian Bordes, Hu Xu, Hyunwoo J. Kim, Jun Chen, Lemeng Wu, Mohamed Elhoseiny, Raghuraman Krishnamoorthi, Vikas Chandra, Xiaoqian Shen, Yunyang Xiong, Zechun Liu, Zhuang Liu","submitted_at":"2024-10-22T21:21:37Z","abstract_excerpt":"Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism thats reduces the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redun"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our LongVU consistently surpass existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. 
Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that DINOv2 similarity reliably identifies redundant frames without discarding task-relevant visual information and that text-guided cross-modal queries plus temporal dependency reduction preserve all necessary details for downstream understanding.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"LongVU adaptively compresses long videos by removing redundant frames and tokens to fit hour-long clips into limited LLM context.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c51cbd96859f5bbe81add0bfe675c69ffcfdc4971d503858cb493ede14a7dc69"},"source":{"id":"2410.17434","kind":"arxiv","version":1},"verdict":{"id":"60cef8cc-c0ca-4d12-bce1-aae87594e1e2","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T13:49:27.580289Z","strongest_claim":"Our LongVU consistently surpass existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. 
Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.","one_line_summary":"LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that DINOv2 similarity reliably identifies redundant frames without discarding task-relevant visual information and that text-guided cross-modal queries plus temporal dependency reduction preserve all necessary details for downstream understanding.","pith_extraction_headline":"LongVU adaptively compresses long videos by removing redundant frames and tokens to fit hour-long clips into limited LLM context."},"references":{"count":35,"sample":[{"doi":"","year":null,"title":"Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone","work_id":"feef9556-a016-493c-abd2-0c97a23a7ebf","ref_index":1,"cited_arxiv_id":"2404.14219","is_internal_anchor":true},{"doi":"","year":null,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":2,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":null,"title":"Minigpt4-video: Advancing multimodal llms for video understanding with interleaved visual-textual tokens","work_id":"cc937528-86d1-430f-bb5d-4980dbaadd72","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Token Merging: Your ViT But Faster","work_id":"528509bc-2611-4e7f-a772-ea14d25b6dae","ref_index":4,"cited_arxiv_id":"2210.09461","is_internal_anchor":true},{"doi":"","year":2020,"title":"Language Models are Few-Shot 
Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","ref_index":5,"cited_arxiv_id":"2005.14165","is_internal_anchor":true}],"resolved_work":35,"snapshot_sha256":"894c6c4e8b922ab6362c19ac20437904ea9d062a18a2368a8d607314411962f1","internal_anchors":25},"formal_canon":{"evidence_count":2,"snapshot_sha256":"c9d7cc26ff704e4bbf66c68b949bf9454bb467a5b19c2f37b2fc6203f2d0419a"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}