Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

Caiming Xiong; Honglu Zhou; Juan Carlos Niebles; Junnan Li; Michael S. Ryoo; Mohit Bansal; Shijie Wang; Silvio Savarese; Ziyang Wang

Not yet reviewed by Pith; the record is open.

arXiv 2512.05774 v2 pith:67CMYWP3 submitted 2025-12-05 cs.CV cs.AI cs.CL

Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, Juan Carlos Niebles

classification cs.CV cs.AI cs.CL

keywords video, evidence, active, agentic, perception, accuracy, agents, answer

Abstract

Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest overall accuracy with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average overall accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
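
The plan-observe-reflect loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `active_perception_loop`, `Evidence`, `Verdict`, the three `toy_*` agents, the fixed 900-second windows, and the `max_rounds` cap are all assumptions made for illustration, and plain functions over a hand-written event list stand in for the MLLM agents and the video.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


@dataclass
class Evidence:
    start_s: float  # time stamp where the observation begins
    end_s: float    # time stamp where the observation ends
    note: str       # short description of what was observed


@dataclass
class Verdict:
    sufficient: bool
    answer: Optional[str] = None


def active_perception_loop(
    query: str,
    planner: Callable[[str, List[Evidence], List[Segment]], List[Segment]],
    observer: Callable[[str, Segment], Optional[Evidence]],
    reflector: Callable[[str, List[Evidence]], Verdict],
    max_rounds: int = 5,  # assumed budget, not a value from the paper
):
    """Iterate plan -> observe -> reflect until the evidence is judged sufficient."""
    evidence: List[Evidence] = []
    visited: List[Segment] = []
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        plan = planner(query, evidence, visited)      # plan: where to look next
        if not plan:
            break                                     # nothing left to inspect
        for segment in plan:
            visited.append(segment)
            found = observer(query, segment)          # observe: extract evidence
            if found is not None:
                evidence.append(found)
        verdict = reflector(query, evidence)          # reflect: enough to answer?
        if verdict.sufficient:
            return verdict.answer, evidence, rounds
    return None, evidence, rounds


# Toy stand-ins for the agents (hypothetical; real agents would be MLLMs on pixels).
TOY_EVENTS = [(2400.0, 2430.0, "red car parks by the gate")]


def toy_planner(query, evidence, visited):
    # Propose the next unvisited 900-second window of a one-hour video.
    windows = [(i * 900.0, (i + 1) * 900.0) for i in range(4)]
    return [w for w in windows if w not in visited][:1]


def toy_observer(query, segment):
    # Return time-stamped evidence if a matching event lies inside the segment.
    lo, hi = segment
    for start, end, note in TOY_EVENTS:
        if start >= lo and end <= hi and "red car" in note:
            return Evidence(start, end, note)
    return None


def toy_reflector(query, evidence):
    # Halt with an answer as soon as any evidence has been collected.
    if evidence:
        return Verdict(True, f"{evidence[0].start_s:.0f}s")
    return Verdict(False)


if __name__ == "__main__":
    result = active_perception_loop(
        "When does the red car park?", toy_planner, toy_observer, toy_reflector
    )
    print(result[0], result[2])
```

In this toy run the loop stops once the reflector judges the evidence sufficient, so the last window of the video is never inspected. That early halt is the intuition behind the reduced inference cost reported above, although the sketch says nothing about the actual numbers.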

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
cs.CV 2026-05 unverdicted novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

Benchmarking Visual State Tracking in Multimodal Video Understanding
cs.CV 2026-06 unverdicted novelty 7.0

VSTAT benchmark shows state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines on visual state tracking, failing at visual perception despite correct textual reasoning.

Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
cs.CV 2026-05 unverdicted novelty 7.0

ToolMerge decomposes queries into LLM-planned tool calls merged by boolean operators for long-video keyframe retrieval and introduces the M2M benchmark, showing competitive results with 5% gains on caption retrieval.

Rethinking RAG in Long Videos: What to Retrieve and How to Use It?
cs.AI 2026-06 unverdicted novelty 6.0

Introduces V-RAGBench benchmark and CARVE method that selects per-chunk retrieval configurations via parallel retrievers and adaptive reranking, outperforming eight VideoRAG baselines.

Personal Visual Context Learning in Large Multimodal Models
cs.CV 2026-05 unverdicted novelty 6.0

Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.

Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
cs.CV 2026-05 conditional novelty 5.0

An LLM planner decomposes long-video queries into tool calls and boolean merge rules, yielding competitive keyframe retrieval and a 5% gain on caption retrieval on the new M2M benchmark.

Watch, Remember, Reason: Human-View Video Understanding with MLLMs
cs.CV 2026-06 unverdicted novelty 4.0

This is a survey that frames video MLLM research via a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open problems.