EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
hub Canonical reference
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
Canonical reference. 71% of citing Pith papers cite this work as background.
abstract
This paper aims to improve the performance of video multimodal large language models (MLLM) via long and rich context (LRC) modeling. As a result, we develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details and capture long-form temporal structure in videos. Specifically, our approach incorporates dense vision task annotations into MLLMs using direct preference optimization and develops compact spatiotemporal representations through adaptive hierarchical token compression. Experimental results demonstrate this unique design of LRC greatly improves the results of video MLLM in mainstream video understanding benchmarks (short & long), enabling the MLLM to memorize significantly longer video inputs (at least 6x longer than the original), and master specialized vision capabilities like object tracking and segmentation. Our work highlights the importance of multimodal context richness (length and fineness) in empowering MLLM's innate abilites (focus and memory), providing new insights for future research on video MLLM. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5
hub tools
citation-role summary
citation-polarity summary
representative citing papers
CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robustness and weak spatial performance.
InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.
Introduces TennisTV benchmark for evaluating 17 MLLMs on tennis video understanding from stroke-level to rally-level tasks with automated pipelines and human verification.
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.
A unified synthetic data generation pipeline produces unlimited annotated multimodal video data across multiple tasks, enabling models trained mostly on synthetic data to generalize effectively to real-world video understanding benchmarks.
DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.
G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.
SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
MuKV adds multi-grained KV cache compression at patch-frame-segment levels plus semi-hierarchical retrieval to raise accuracy and cut memory in long video question-answering.
VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.
Higher temporal resolution in video significantly improves zero-shot semantic understanding of high-speed human actions like kendo.
A controlled study on compact video LLMs finds that continuous temporal decoding delivers the strongest accuracy-efficiency trade-off for video temporal grounding across three benchmarks.
OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.
Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.
citing papers explorer
-
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
-
CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding
CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
-
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
-
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
-
Grounding Video Reasoning in Physical Signals
A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robustness and weak spatial performance.
-
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding
InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
-
Adapting MLLMs for Nuanced Video Retrieval
Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.
-
TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?
Introduces TennisTV benchmark for evaluating 17 MLLMs on tennis video understanding from stroke-level to rally-level tasks with automated pipelines and human verification.
-
OProver: A Unified Framework for Agentic Formal Theorem Proving
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
-
From Priors to Perception: Grounding Video-LLMs in Physical Reality
Video-LLMs fail physical reasoning due to semantic prior dominance rather than perception deficits; a new programmatic adversarial curriculum and visual-anchored reasoning chain enable substantial gains via standard LoRA fine-tuning.
-
All in One: A Unified Synthetic Data Pipeline for Multimodal Video Understanding
A unified synthetic data generation pipeline produces unlimited annotated multimodal video data across multiple tasks, enabling models trained mostly on synthetic data to generalize effectively to real-world video understanding benchmarks.
-
Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models
DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.
-
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
-
Streaming Video Instruction Tuning
Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
-
Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents
CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.
-
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
-
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
-
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
MuKV adds multi-grained KV cache compression at patch-frame-segment levels plus semi-hierarchical retrieval to raise accuracy and cut memory in long video question-answering.
-
VISD: Enhancing Video Reasoning via Structured Self-Distillation
VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.
-
High-Speed Vision Improves Zero-Shot Semantic Understanding of Human Actions
Higher temporal resolution in video significantly improves zero-shot semantic understanding of high-speed human actions like kendo.
-
How Should Video LLMs Output Time? An Analysis of Efficient Temporal Grounding Paradigms
A controlled study on compact video LLMs finds that continuous temporal decoding delivers the strongest accuracy-efficiency trade-off for video temporal grounding across three benchmarks.
-
OneThinker: All-in-one Reasoning Model for Image and Video
OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.
-
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.
- LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs