VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

Caifeng Shan; Chaoyou Fu; Chu Wu; Ran He; Ruoliu Yang

arxiv: 2603.22285 · v2 · submitted 2026-03-23 · 💻 cs.CV

Ruoliu Yang, Chu Wu, Caifeng Shan, Ran He, Chaoyou Fu

Pith reviewed 2026-05-15 00:19 UTC · model grok-4.3

classification 💻 cs.CV

keywords: long video understanding, video question answering, multimodal large language models, clue localization, affinity graph, relevance propagation, sparse observation, VideoMME

The pith

VideoDetective finds relevant segments in long videos by combining query matching with the video's own segment-to-segment affinities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VideoDetective to help multimodal large language models answer questions about long videos by locating only the most relevant segments instead of processing the entire sequence. It divides the video into segments and connects them in a graph using visual similarity and time proximity. A Hypothesis-Verification-Refinement loop then estimates how each observed segment relates to the query and spreads those scores to unobserved segments. The resulting global relevance map directs the model to sample just the critical parts for the final answer. This approach yields consistent accuracy gains across mainstream MLLMs on long-video benchmarks.

Core claim

VideoDetective represents video segments as a visual-temporal affinity graph built from visual similarity and temporal proximity, then applies a Hypothesis-Verification-Refinement loop to estimate relevance scores for observed segments, propagate them to unseen segments, and produce a global relevance distribution that localizes the most critical segments for sparse-observation answering.
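The graph construction in this claim can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the blend weight `alpha`, the exponential decay scale `tau`, the use of cosine similarity, and the function names are all assumptions, since this page states only that edges come from visual similarity and temporal proximity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def build_affinity(features, alpha=0.5, tau=2.0):
    """Blend visual similarity with temporal proximity (both forms assumed).

    features: one feature vector per segment, in temporal order.
    alpha:    hypothetical weight on visual similarity vs. temporal proximity.
    tau:      hypothetical decay scale, measured in segment indices.
    """
    k = len(features)
    W = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            visual = max(0.0, cosine(features[i], features[j]))
            temporal = math.exp(-abs(i - j) / tau)
            W[i][j] = W[j][i] = alpha * visual + (1 - alpha) * temporal
    return W


# Toy features: segments 0 and 3 look alike, segment 2 looks different.
segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]
W = build_affinity(segments)
```

In the toy input, segments 0 and 3 have near-identical features, so they receive a stronger edge than segments 0 and 2 even though they are further apart in time. That is exactly the kind of edge the load-bearing premise depends on.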

What carries the argument

Visual-temporal affinity graph plus Hypothesis-Verification-Refinement loop that estimates and propagates query relevance across segments.
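One way to read "propagates query relevance" is as diffusion over the graph. The sketch below swaps in a damped neighbour-averaging rule in the style of classic label propagation. The paper's actual update rule is not given on this page, so `beta`, the iteration count, and the clamping of observed segments are assumptions.

```python
def propagate(W, observed, beta=0.8, iters=50):
    """Diffuse relevance from observed segments over affinity matrix W.

    W:        symmetric affinity matrix as a list of lists.
    observed: dict {segment index: relevance in [0, 1]} from direct scoring.
    beta:     hypothetical damping applied to the neighbour average.
    """
    k = len(W)
    belief = [observed.get(i, 0.0) for i in range(k)]
    for _ in range(iters):
        updated = []
        for i in range(k):
            if i in observed:
                updated.append(observed[i])  # keep directly scored segments fixed
                continue
            total = sum(W[i])
            avg = sum(W[i][j] * belief[j] for j in range(k)) / total if total > 0 else 0.0
            updated.append(beta * avg)
        belief = updated
    return belief


# Toy chain 0-1-2-3 with unit edge weights; only segment 0 has been observed.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
belief = propagate(chain, {0: 1.0})
```

On the chain, belief decays with graph distance from the observed segment. On a graph with strong visual-similarity shortcuts, the same rule would carry relevance across those shortcuts as well.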

Load-bearing premise

A graph built only from visual similarity and temporal proximity between segments will correctly spread query relevance to the rest of the video without creating false connections or overlooking important content shifts.

What would settle it

Run the method on a long video containing visually similar but semantically unrelated segments; if accuracy falls below a query-only baseline, the propagation mechanism is failing.

Figures

Figures reproduced from arXiv: 2603.22285 by Caifeng Shan, Chaoyou Fu, Chu Wu, Ran He, Ruoliu Yang.

**Figure 1.** Overview of VideoDetective. Given a query, we (1) divide the video into segments and construct a spatio-temporal affinity graph from visual similarity and temporal proximity; (2) iteratively observe video segments and propagate the relevance scores over the graph to update a global belief field, guiding next observation via a hypothesis–verification–refinement loop to recover missing clues; and (3) aggrega…

**Figure 2.** A qualitative example of VideoDetective. It illustrates how VideoDetective processes the input video and query to derive the correct answer. Original video: https://www.youtube.com/watch?v=B6tQyCH5hQM

Algorithm 1. Overall pipeline of VideoDetective
Require: Video $V$, Question $q$, Iteration steps budget $B$
Ensure: Answer $a$
1: Preprocessing:
2: Chunk $V$ into $K$ segments $\{c_i\}_{i=1}^{K}$ with features $\{h_i\}$
3: Generate …

**Figure 3.** Performance improvements across different backbones on VideoMME-long w/o subtitle. VideoDetective consistently enhances various MLLMs across different architectures and parameter scales, demonstrating its plug-and-play capability. For this backbone comparison, VideoXL2 and Oryx-1.5 use 16 sampled frames, InternVL-2.5 uses 8 sampled frames, and all other models use 32 sampled frames; GLM, SeedVL, and Qwen3-V…

**Figure 4.** Token Efficiency. Comparison of accuracy versus average token consumption (whole pipeline). VideoDetective achieves the optimal position on the Pareto frontier. A comprehensive evaluation of long-video understanding methods requires considering both the computational budget (token consumption) and the actual end-to-end processing time. As illustrated in Figures 4 and 5, we report the average token consumpti…

**Figure 5.** Figure 5 evaluates the complete time, which includes all pipeline stages: offline preprocessing (sampling, graph construction, OCR/ASR), the iterative VLM observation loop, and final answer generation. While straightforward single-step or pure retrieval methods (e.g., LVNet, DVD) exhibit lower latency (< 30s), their accuracy is limited (< 43%)

Original abstract

Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
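To make "sparse observation" concrete, the sketch below turns a relevance distribution into a per-segment frame allocation. The top-k cut-off, the proportional split, and the largest-remainder rounding are illustrative assumptions. The abstract does not say how frames are distributed once the critical segments are localized.

```python
def allocate_frames(relevance, frame_budget, top_k):
    """Pick the top_k segments and split frame_budget in proportion to relevance.

    relevance:    one score per segment (the global relevance distribution).
    frame_budget: total frames the backbone may see (e.g. 32).
    top_k:        hypothetical number of segments to keep.
    """
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    chosen = sorted(ranked[:top_k])  # restore temporal order
    mass = sum(relevance[i] for i in chosen)
    if mass > 0:
        shares = {i: relevance[i] / mass for i in chosen}
    else:
        shares = {i: 1 / len(chosen) for i in chosen}  # even split as a fallback
    raw = {i: shares[i] * frame_budget for i in chosen}
    frames = {i: int(raw[i]) for i in chosen}
    # Largest-remainder rounding so the allocation sums to the budget.
    leftover = frame_budget - sum(frames.values())
    by_remainder = sorted(chosen, key=lambda i: raw[i] - frames[i], reverse=True)
    for i in by_remainder[:leftover]:
        frames[i] += 1
    return frames


relevance = [0.05, 0.6, 0.1, 0.3, 0.02]
frames = allocate_frames(relevance, frame_budget=32, top_k=2)
```

With these toy scores, segments 1 and 3 are kept and the 32-frame budget is split 21 to 11 between them. The remaining segments are never shown to the model.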

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VideoDetective adds a relevance propagation loop over a visual-temporal graph to localize clues in long videos, delivering small but consistent gains on existing MLLMs.

The main contribution is a concrete procedure that scores some segments directly against the query, then diffuses those scores across a graph whose edges come from visual similarity and temporal adjacency. The Hypothesis-Verification-Refinement loop is the part that is new relative to plain query-only localization. They report up to 7.5% accuracy lift on VideoMME-long when plugged into several mainstream MLLMs, and the code is released, which is useful for anyone who wants to try it on their own long-video pipeline.

Referee Report

1 major / 1 minor

Summary. The paper presents VideoDetective, a framework for long-video question answering with MLLMs. Videos are segmented and modeled as a visual-temporal affinity graph constructed from visual similarity and temporal proximity. A Hypothesis-Verification-Refinement loop estimates query relevance on observed segments and propagates scores across the graph to produce a global relevance distribution, which is then used to localize critical segments for sparse observation and final answering. Experiments report consistent accuracy gains across mainstream MLLMs, reaching up to 7.5% on VideoMME-long.

Significance. If the gains are reproducible and the propagation mechanism is shown to be reliable, the work could improve long-video understanding by combining extrinsic query signals with intrinsic video structure, mitigating context-window limits without dense sampling. The public code release is a clear strength for reproducibility.

major comments (1)

[Method (Hypothesis-Verification-Refinement loop)] The relevance-propagation step in the Hypothesis-Verification-Refinement loop (described in the abstract and method) is load-bearing for both the 'global relevance distribution' claim and the reported 7.5% gains. The graph is built solely from visual similarity and temporal proximity with no mention of query-conditioned edge weighting or explicit handling of semantic shifts; in long videos containing repeated scenes or cutaways this can assign spurious scores to unseen segments. Targeted ablations or failure-case analysis on this assumption are required.

minor comments (1)

[Abstract] The abstract states accuracy improvements but supplies no information on the number of MLLMs evaluated, the baselines compared, or any statistical testing; a one-sentence summary of these elements would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work's potential and for highlighting the importance of reproducibility through code release. We provide a point-by-point response to the major comment below.

Referee: The relevance-propagation step in the Hypothesis-Verification-Refinement loop (described in the abstract and method) is load-bearing for both the 'global relevance distribution' claim and the reported 7.5% gains. The graph is built solely from visual similarity and temporal proximity with no mention of query-conditioned edge weighting or explicit handling of semantic shifts; in long videos containing repeated scenes or cutaways this can assign spurious scores to unseen segments. Targeted ablations or failure-case analysis on this assumption are required.

Authors: We agree that the relevance propagation is crucial to achieving the global relevance distribution and the observed performance gains. The affinity graph is indeed constructed using only visual similarity and temporal proximity, without query-conditioned edge weights. Nevertheless, query conditioning is incorporated through the Hypothesis step, which assigns initial relevance scores to observed segments based on their alignment with the query. These scores are then propagated to unseen segments via the graph structure. The Verification and Refinement stages further incorporate query feedback by selecting additional segments for observation and updating the distribution accordingly. This iterative process helps reduce the impact of potential spurious connections arising from repeated scenes. Although the original manuscript does not include dedicated ablations for semantic shift scenarios, the method's effectiveness is supported by results on challenging long-video benchmarks. To directly address this concern, we will include targeted ablations and failure-case analyses on videos with repeated scenes and cutaways in the revised version.

Revision: yes
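The loop the authors describe can be sketched as below, under loudly stated assumptions: `score_fn` is a hypothetical stand-in for an MLLM scoring a segment against the query, the propagation is a simple affinity-weighted average of observed scores rather than the paper's rule, and the next segment is chosen greedily by current belief.

```python
import math


def hvr_loop(W, score_fn, seeds, budget):
    """Iteratively observe segments and refresh a belief over all of them.

    W:        affinity matrix (list of lists).
    score_fn: hypothetical callable, segment index -> relevance in [0, 1].
    seeds:    segment indices observed in the first round.
    budget:   maximum number of segments that may be observed.
    """
    k = len(W)
    observed = {}
    belief = [0.0] * k
    to_observe = list(seeds)
    while to_observe and len(observed) < budget:
        # Hypothesis: score the newly observed segments against the query.
        for i in to_observe:
            if len(observed) < budget:
                observed[i] = score_fn(i)
        # Propagation (assumed form): affinity-weighted average of observed scores.
        for i in range(k):
            if i in observed:
                belief[i] = observed[i]
                continue
            weight = sum(W[i][j] for j in observed)
            spread = sum(W[i][j] * s for j, s in observed.items())
            belief[i] = spread / weight if weight > 0 else 0.0
        # Verification and refinement: queue the most promising unseen segment.
        unseen = [i for i in range(k) if i not in observed]
        to_observe = [max(unseen, key=lambda i: belief[i])] if unseen else []
    return belief, observed


# Toy setup: purely temporal affinities and a fixed "true" relevance per segment.
W = [[math.exp(-abs(i - j)) if i != j else 0.0 for j in range(5)] for i in range(5)]
truth = [0.1, 0.2, 0.9, 0.3, 0.0]
belief, observed = hvr_loop(W, lambda i: truth[i], seeds=[0, 4], budget=3)
```

In this toy run the loop spends its third observation on segment 1 and never reaches segment 2, the truly relevant one. This illustrates how the outcome depends on both the seeds and the graph, which is the sensitivity the referee's request for ablations targets.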

Circularity Check

0 steps flagged

No circularity: algorithmic framework with explicit propagation steps

The paper describes VideoDetective as a procedural pipeline: segment the video, construct an affinity graph from visual similarity and temporal proximity, then run a Hypothesis-Verification-Refinement loop to estimate observed relevance and propagate to unseen segments. No equations, fitted parameters, or self-citations are shown that would reduce the final relevance distribution to the inputs by construction. The propagation is an explicit algorithmic step rather than a tautological redefinition or renamed fit. This is self-contained against external benchmarks and matches the expected non-finding for a method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on standard video-processing assumptions rather than new postulates; no free parameters or invented entities are named in the abstract.

axioms (2)

domain assumption: A video can be partitioned into segments whose pairwise visual similarity and temporal proximity form a useful affinity graph for relevance propagation.
Explicitly stated as the basis for the graph construction step.

domain assumption: Relevance scores estimated on observed segments can be reliably propagated to unobserved segments through the affinity graph.
Core premise of the Hypothesis-Verification-Refinement loop.

pith-pipeline@v0.9.0 · 5500 in / 1382 out tokens · 42124 ms · 2026-05-15T00:19:42.961715+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1] Read the question and all options

[2] For each option, check frames and attached evidence lines

[3] Prefer explicit evidence over vague impressions

[4] For order/time questions, compare early vs late frames; for text, use OCR evidence

[5] NO EVIDENCE

If evidence is weak, choose most plausible option and state low confidence. Rules: MUST output an option LETTER (A/B/C/D). DO NOT output “NO EVIDENCE”. Response Format: Analysis: <your reasoning> Final Answer: <ONE LETTER> Reason: <one short sentence> User Prompt: Based on these video frames, answer the following question: Frame Information: {frame info str} ...

[6] Official sampling rates (frames per video)

[7] Per-frame token counts specified in official API documentation

[8] semantic drift

Standard video resolution settings. Important Notes: • These estimates include only image tokens and exclude text prompts, system instructions, and other textual overhead. • This makes them conservative baselines; the actual token consumption of these models would be higher in practice. • All measurements are averaged across all videos in the VideoMME-long be...

[9] Text preprocessing: Lowercase conversion, stopword removal, and lemmatization

[10] IDF computation: Pre-computed on a large corpus, with out-of-vocabulary words assigned a default IDF value

[11] Score computation: For evidence text $e$ and keyword set $K_r$: $s_{\mathrm{lex}}(e, f_r) = \min\left(1.0,\ \frac{\sum_{t \in e \cap K_r} \mathrm{IDF}(t)}{Z_{\mathrm{lex}}}\right)$, where $Z_{\mathrm{lex}} = 3.0$ is a normalization constant

[12] Normalization: We clip scores to [0, 1] via the min(·) term above. D.4.3. Embedding-based Semantic Similarity: For semantic matching, we use SigLIP text encoder with cosine similarity:

[13] Text encoding: $\psi(e) = \text{SigLIP-Text}(e) \in \mathbb{R}^d$, $\|\psi(e)\|_2 = 1$

[14] Score computation: For evidence text $e$ and semantic query set $P_r$, we compute: $s_{\mathrm{sem}}(e, f_r) = \max_{p \in P_r} \langle \psi(e), \psi(p) \rangle$, where $p$ represents semantic queries (event descriptions) that capture the contextual meaning of each facet

[15] Batch encoding: All semantic queries are pre-encoded for efficiency. D.5. Source-aware Fusion: Different evidence sources have different signal-to-noise characteristics: • OCR text: High precision, low recall. Weight: $\lambda_{\mathrm{ocr}} = 0.7$ (trust lexical more). $s_{\mathrm{ocr}}(e, f_r) = 0.7 \cdot s_{\mathrm{lex}}(e, f_r) + 0.3 \cdot s_{\mathrm{sem}}(e, f_r)$ • ASR text: Balanced. Weight: $\lambda_{\mathrm{asr}} = 0.5$ (equal trust). sa...

[16] We reuse the same frame sampling number F as the final answer generation

Uniform sampling: Extract F frames uniformly distributed across the entire video (not per-node). We reuse the same frame sampling number F as the final answer generation

[17] VLM generation (time-stamped event timeline): Use the VLM to generate a coarse event timeline based on these F frames, capturing the overall narrative and key events. Concretely, the VLM outputs a list of event items, each with an approximate temporal span (e.g., start/end timestamps or the corresponding frame indices among the F sampled frames) plus a sh...

[18] A person explains X before demonstrating Y

Deterministic node-level assignment: Each node corresponds to a video chunk with a temporal interval $[s_i, e_i]$. We assign to node $i$ all event items whose temporal spans overlap with $[s_i, e_i]$ (or whose associated sampled-frame indices fall within the node's interval), and concatenate their descriptions to form $e_i$. If no event item overlaps, we assign th...

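The lexical, semantic, and OCR fusion scores quoted in entries [11], [14], and [15] can be sketched as follows. The constants $Z_{\mathrm{lex}} = 3.0$ and $\lambda_{\mathrm{ocr}} = 0.7$ come from the extracted text. The IDF table, the default IDF value, the toy embeddings, and all function names are hypothetical, and the embeddings are passed in as plain unit vectors rather than produced by a SigLIP encoder.

```python
Z_LEX = 3.0  # normalization constant quoted in the extracted text


def lexical_score(evidence_tokens, keywords, idf, default_idf=1.0):
    """s_lex = min(1, sum of IDF over tokens shared with the keyword set / Z_LEX).

    Tokens are assumed to be already lowercased, stopword-filtered, and lemmatized.
    default_idf is a hypothetical value for out-of-vocabulary tokens.
    """
    shared = set(evidence_tokens) & set(keywords)
    total = sum(idf.get(t, default_idf) for t in shared)
    return min(1.0, total / Z_LEX)


def semantic_score(evidence_vec, query_vecs):
    """s_sem = max inner product between unit-norm evidence and query embeddings."""
    return max(sum(a * b for a, b in zip(evidence_vec, q)) for q in query_vecs)


def fuse(s_lex, s_sem, lam):
    """Convex blend: lam weights the lexical score, 1 - lam the semantic score."""
    return lam * s_lex + (1.0 - lam) * s_sem


idf = {"recipe": 1.2, "oven": 1.5}  # hypothetical IDF table
s_lex = lexical_score(["preheat", "oven", "recipe"], {"oven", "recipe", "timer"}, idf)
s_sem = semantic_score([1.0, 0.0], [[0.6, 0.8], [0.0, 1.0]])  # toy unit vectors
s_ocr = fuse(s_lex, s_sem, lam=0.7)  # OCR weighting from entry [15]
```

The ASR formula is truncated in the extracted text. Reusing `fuse` with `lam=0.5` for ASR would be an assumption by analogy with the OCR case, not something this page confirms.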