
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu · 2024 · cs.CV · arXiv 2410.17434

Canonical reference. 75% of citing Pith papers cite this work as background.

50 Pith papers citing it

Background 75% of classified citations


abstract

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge, constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving the visual details of long videos. Our idea is to leverage cross-modal queries and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we use DINOv2 features to remove redundant frames that exhibit high similarity. We then apply a text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a lightweight LLM, LongVU also scales effectively to a smaller size with state-of-the-art video understanding performance.
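The first step the abstract describes, removing frames whose features are highly similar, can be illustrated with a minimal sketch. This is not the authors' implementation: the greedy comparison against the last kept frame, the function name `drop_redundant_frames`, and the `threshold` value are all assumptions made for illustration, and the small hand-written vectors stand in for real DINOv2 frame features.

```python
import numpy as np


def drop_redundant_frames(features, threshold=0.9):
    """Return indices of frames kept by greedy cosine-similarity filtering.

    Hypothetical helper: a frame is kept only if its cosine similarity to
    the most recently kept frame is below `threshold` (an assumed value).
    """
    feats = np.asarray(features, dtype=np.float64)
    if feats.shape[0] == 0:
        return []
    # Normalize each feature vector so a dot product equals cosine similarity.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    kept = [0]  # always keep the first frame
    for i in range(1, unit.shape[0]):
        similarity = float(unit[i] @ unit[kept[-1]])
        if similarity < threshold:
            kept.append(i)
    return kept


# Toy stand-ins for per-frame features (real DINOv2 features are much larger).
frames = np.array([
    [1.0, 0.0],
    [0.99, 0.05],   # near-duplicate of frame 0
    [0.0, 1.0],
    [0.05, 0.99],   # near-duplicate of frame 2
    [-1.0, 0.0],
])
kept_indices = drop_redundant_frames(frames, threshold=0.9)
print(kept_indices)  # [0, 2, 4]
```

Comparing against the last kept frame rather than the immediately preceding one is a design choice of this sketch: it prevents a slow drift of small changes from slipping through unnoticed. The text-guided and spatial reduction steps from the abstract are not covered here.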


citation-role summary

background 10 · baseline 2

citation-polarity summary

background 9 · baseline 2 · unclear 1

representative citing papers

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

cs.CV · 2026-01-15 · unverdicted · novelty 8.0

Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

ReQuest: Rethinking-based Question-Aware Frame Selection for Long-Form Video QA

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

ReQuest introduces an uncertainty-driven question-adaptive keyframe selector with rethinking routing and adaptive NMS that boosts long-form video QA accuracy on Video-MME, MLVU, and LongVideoBench without fine-tuning the base MLLM.

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.

EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

EventPrune prunes 80% of visual tokens in Video-LLMs using event camera motion cues, yielding 1.89x speedup, 52% fewer GFLOPs, and slightly higher accuracy than full-token baselines on first-person dynamic spatial reasoning.

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

RotateK uses online PCA-based rotation to align token-dependent key channel importance into a shared subspace, enabling accurate head-wise structured pruning and faster decoding in VLMs compared to prior token or channel methods.

LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute

cs.CV · 2026-05-07 · conditional · novelty 7.0

LookWhen factorizes video recognition into learning when, where, and what to compute via uniqueness-based token selection and dual-teacher distillation, achieving better accuracy-FLOPs trade-offs than baselines on multiple datasets.

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.

Mosaic: Cross-Modal Clustering for Efficient Video Understanding

cs.PF · 2026-04-11 · unverdicted · novelty 7.0

Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding

cs.CV · 2026-02-24 · unverdicted · novelty 7.0

LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.

InduceKV: Fixed-Footprint Continual Adaptation of Multimodal LLMs via Inducing KV Memories

cs.AI · 2026-07-02 · unverdicted · novelty 6.0

InduceKV is a retrieval-based continual adaptation method that uses bilevel selection to build a compact set of inducing KV memories for fixed-footprint updates to multimodal LLMs.

VisReflect: Latent Visual Reflection for Fine-Grained Perception in Long Visual Context

cs.CV · 2026-06-29 · unverdicted · novelty 6.0

VisReflect generates continuous latent visual reflections to emphasize relevant visual features and guide attention in LVLMs, yielding 4.1% gains on image benchmarks and 1.8% on video benchmarks with 44% less inference time than zooming methods.

ViTexQA: A Multi-Frame Temporal Perception Dataset for Video Text Question Answering

cs.CV · 2026-06-23 · unverdicted · novelty 6.0

ViTexQA is a dataset forcing multi-frame text fusion for all questions, with FrameThinker achieving 6.3% ROUGE-L gain over baselines via CoT SFT and temporally-grounded RL.

HPP: Hierarchical Programmatic Probing for Long Video Understanding by Decoupling Perception and Reasoning

cs.CV · 2026-06-19 · unverdicted · novelty 6.0

HPP decouples perception from reasoning in long-video VLMs by having an LLM run iterative programmatic probes on hierarchically segmented video, reporting gains on LongVideoBench, EgoSchema, VideoMME, and MLVU.

CoCoSI: Collaborative Cognitive Map Construction for Spatial Intelligence

cs.CV · 2026-06-09 · unverdicted · novelty 6.0

CoCoSI is a training-free multi-agent system for collaborative cognitive map construction that improves spatial understanding in arbitrary pretrained MLLMs.

StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset

cs.CV · 2026-06-04 · unverdicted · novelty 6.0

StoryVideoQA provides the largest auto-generated deep video understanding dataset to date with 363K QAs across TV and movies, paired with the PlotTree agent for hierarchical plot-based reasoning that existing VideoQA models struggle to match.

GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning

cs.CV · 2026-06-03 · unverdicted · novelty 6.0

GOPAgen proposes integrating video codec GOPs with a motion agent, GOP tree reasoning, structural memory, and motion vector database to improve efficiency and motion detail in agentic long-video VQA, reporting gains on MotionBench and EgoSchema.

Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

KeyVT improves zero-shot 3D question answering by hierarchically selecting semantically and geometrically relevant views and using optimal transport to extract representative tokens from them.

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.

LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs

cs.CV · 2026-05-17 · unverdicted · novelty 6.0 · 2 refs

LiteFrame is an efficient vision encoder backbone trained with Compressed Token Distillation and Language Model Adaptation to scale frame count in Video LLMs while cutting latency and raising accuracy.

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.

Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding

cs.CV · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

Response-G1 uses query-guided scene graphs, memory retrieval, and augmented prompting to improve when Video-LLMs decide to respond during streaming videos.

citing papers explorer

Showing 1 of 1 citing paper after filters.

Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs cs.LG · 2026-04-22 · unreviewed · ref 36 · internal anchor
