hub Canonical reference

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu · 2024 · cs.CV · arXiv 2410.17434

Canonical reference. 90% of citing Pith papers cite this work as background.

33 Pith papers citing it

Background 90% of classified citations

open full Pith review browse 33 citing papers arXiv PDF

abstract

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism thats reduces the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within given context length. Our LongVU consistently surpass existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9 baseline 1

citation-polarity summary

background 9 baseline 1

representative citing papers

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

EventPrune prunes 80% of visual tokens in Video-LLMs using event camera motion cues, yielding 1.89x speedup, 52% fewer GFLOPs, and slightly higher accuracy than full-token baselines on first-person dynamic spatial reasoning.

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

RotateK uses online PCA-based rotation to align token-dependent key channel importance into a shared subspace, enabling accurate head-wise structured pruning and faster decoding in VLMs compared to prior token or channel methods.

LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

cs.CV · 2026-05-07 · conditional · novelty 7.0

LookWhen factorizes video recognition into learning when, where, and what to compute via uniqueness-based token selection and dual-teacher distillation, achieving better accuracy-FLOPs trade-offs than baselines on multiple datasets.

Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning, and MCQA benchmarks.

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.

Mosaic: Cross-Modal Clustering for Efficient Video Understanding

cs.PF · 2026-04-11 · unverdicted · novelty 7.0

Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

cs.CV · 2026-02-24 · unverdicted · novelty 7.0

LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.

Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

cs.CV · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.

VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding

cs.CV · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.

Small Vision-Language Models are Smart Compressors for Long Video Understanding

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

cs.CV · 2026-01-21 · unverdicted · novelty 6.0

HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.

Streaming Video Instruction Tuning

cs.CV · 2025-12-24 · unverdicted · novelty 6.0

Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.

Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval

cs.CV · 2025-12-09 · unverdicted · novelty 6.0

OneClip-RAG enables MLLMs to handle long videos via one-shot clip retrieval and unified chunking-retrieval, delivering performance gains like matching GPT-5 level on MLVU with high efficiency on standard GPUs.

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

cs.CV · 2025-11-17 · unverdicted · novelty 6.0

REVISOR adds multimodal visual-text reflection and a Dual Attribution Decoupled Reward to improve long-form video reasoning in MLLMs without extra supervised fine-tuning.

One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory

cs.CV · 2025-05-29 · unverdicted · novelty 6.0

TrajViT tokenizes videos via panoptic sub-object trajectories, achieving 10x token reduction and outperforming ViT3D by 6% on retrieval and 5.2% on VideoQA tasks with faster training and inference.

VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling

cs.CV · 2024-12-31 · unverdicted · novelty 6.0

VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.

Swift Sampling: Selecting Temporal Surprises via Taylor Series

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

Swift Sampling is a training-free frame selection method that uses Taylor expansions on video latent trajectories to pick temporally surprising frames, outperforming uniform sampling on long-video QA tasks.

citing papers explorer

Showing 33 of 33 citing papers.

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding cs.CV · 2026-01-15 · unverdicted · none · ref 129 · internal anchor
Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.
EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning cs.CV · 2026-05-19 · unverdicted · none · ref 31 · internal anchor
EventPrune prunes 80% of visual tokens in Video-LLMs using event camera motion cues, yielding 1.89x speedup, 52% fewer GFLOPs, and slightly higher accuracy than full-token baselines on first-person dynamic spatial reasoning.
Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference cs.CV · 2026-05-19 · unverdicted · none · ref 11 · internal anchor
RotateK uses online PCA-based rotation to align token-dependent key channel importance into a shared subspace, enabling accurate head-wise structured pruning and faster decoding in VLMs compared to prior token or channel methods.
LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute cs.CV · 2026-05-07 · conditional · none · ref 51 · internal anchor
LookWhen factorizes video recognition into learning when, where, and what to compute via uniqueness-based token selection and dual-teacher distillation, achieving better accuracy-FLOPs trade-offs than baselines on multiple datasets.
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs cs.LG · 2026-04-22 · unverdicted · none · ref 36 · internal anchor
Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning, and MCQA benchmarks.
OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video cs.CV · 2026-04-13 · unverdicted · none · ref 26 · internal anchor
OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.
Mosaic: Cross-Modal Clustering for Efficient Video Understanding cs.PF · 2026-04-11 · unverdicted · none · ref 31 · internal anchor
Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration cs.CV · 2026-04-06 · unverdicted · none · ref 32 · internal anchor
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding cs.CV · 2026-02-24 · unverdicted · none · ref 39 · internal anchor
LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.
Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning cs.CV · 2026-05-21 · unverdicted · none · ref 33 · internal anchor
CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding cs.CV · 2026-05-18 · unverdicted · none · ref 57 · internal anchor
SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
OProver: A Unified Framework for Agentic Formal Theorem Proving cs.CL · 2026-05-17 · unverdicted · none · ref 29 · internal anchor
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context cs.CV · 2026-05-13 · unverdicted · none · ref 46 · internal anchor
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding cs.CV · 2026-05-08 · unverdicted · none · ref 58 · 2 links · internal anchor
Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.
VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding cs.CV · 2026-05-07 · unverdicted · none · ref 48 · 2 links · internal anchor
VideoRouter uses dual semantic and image routers for query-adaptive token compression in long-video models, delivering up to 67.9% reduction while outperforming the InternVL baseline on VideoMME, MLVU, and LongVideoBench.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding cs.CV · 2026-04-15 · unverdicted · none · ref 53 · internal anchor
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.
Small Vision-Language Models are Smart Compressors for Long Video Understanding cs.CV · 2026-04-09 · unverdicted · none · ref 10 · internal anchor
Tempo uses a 6B SVLM as a local temporal compressor with training-free adaptive token allocation to achieve SOTA long-video understanding at 0.5-16 tokens per frame, scoring 52.3 on 4101s LVBench under 8K budget.
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding cs.CV · 2026-01-21 · unverdicted · none · ref 39 · internal anchor
HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.
Streaming Video Instruction Tuning cs.CV · 2025-12-24 · unverdicted · none · ref 43 · internal anchor
Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval cs.CV · 2025-12-09 · unverdicted · none · ref 40 · internal anchor
OneClip-RAG enables MLLMs to handle long videos via one-shot clip retrieval and unified chunking-retrieval, delivering performance gains like matching GPT-5 level on MLVU with high efficiency on standard GPUs.
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding cs.CV · 2025-11-17 · unverdicted · none · ref 37 · internal anchor
REVISOR adds multimodal visual-text reflection and a Dual Attribution Decoupled Reward to improve long-form video reasoning in MLLMs without extra supervised fine-tuning.
One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory cs.CV · 2025-05-29 · unverdicted · none · ref 51 · internal anchor
TrajViT tokenizes videos via panoptic sub-object trajectories, achieving 10x token reduction and outperforming ViT3D by 6% on retrieval and 5.2% on VideoQA tasks with faster training and inference.
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling cs.CV · 2024-12-31 · unverdicted · none · ref 47 · internal anchor
VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.
Swift Sampling: Selecting Temporal Surprises via Taylor Series cs.CV · 2026-05-21 · unverdicted · none · ref 28 · internal anchor
Swift Sampling is a training-free frame selection method that uses Taylor expansions on video latent trajectories to pick temporally surprising frames, outperforming uniform sampling on long-video QA tasks.
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering cs.CV · 2026-05-21 · conditional · none · ref 33 · internal anchor
MuKV adds multi-grained KV cache compression at patch-frame-segment levels plus semi-hierarchical retrieval to raise accuracy and cut memory in long video question-answering.
OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models cs.AI · 2026-05-12 · unverdicted · none · ref 40 · internal anchor
OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.
EgoSelf: From Memory to Personalized Egocentric Assistant cs.CV · 2026-04-21 · unverdicted · none · ref 45 · internal anchor
EgoSelf uses graph-based memory of user interactions to derive personalized profiles and predict future behaviors for egocentric assistants.
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models cs.CV · 2026-03-02 · unverdicted · none · ref 48 · internal anchor
AOT reduces visual tokens in VLLMs via intra-frame and inter-frame anchors with local-global optimal transport, delivering competitive benchmark performance and efficiency gains in a training-free way.
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling cs.CV · 2025-01-21 · unverdicted · none · ref 23 · internal anchor
InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding better benchmark performance, 6x longer video memory, and new capabilities likeobject
NVILA: Efficient Frontier Visual Language Models cs.CV · 2024-12-05 · unverdicted · none · ref 62 · internal anchor
NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding cs.CV · 2025-01-22 · unverdicted · none · ref 144 · internal anchor
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs cs.CV · 2026-05-17 · unreviewed · ref 9 · internal anchor
CREST: Curvature-Regulated Event-Centric Sampling for Efficient Long-Video Understanding cs.CV · 2026-05-09 · unreviewed · ref 8 · internal anchor

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer