{"work":{"id":"82f3cbed-22bb-41b5-b9c9-67304154cd52","openalex_id":null,"doi":null,"arxiv_id":"2410.17434","raw_key":null,"title":"LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding","authors":null,"authors_text":"Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu","year":2024,"venue":"cs.CV","abstract":"Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within given context length. Our LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. 
Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.","external_url":"https://arxiv.org/abs/2410.17434","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-23T07:42:43.107074+00:00","pith_arxiv_id":"2410.17434","created_at":"2026-05-10T00:49:48.945282+00:00","updated_at":"2026-05-23T07:42:43.107074+00:00","title_quality_ok":true,"display_title":"LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding","render_title":"LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding"},"hub":{"state":{"work_id":"82f3cbed-22bb-41b5-b9c9-67304154cd52","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":33,"external_cited_by_count":null,"distinct_field_count":5,"first_pith_cited_at":"2024-12-05T18:59:55+00:00","last_pith_cited_at":"2026-05-21T16:20:31+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-29T20:00:18.830063+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":9},{"context_role":"baseline","n":1}],"polarity_counts":[{"context_polarity":"background","n":9},{"context_polarity":"baseline","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}