VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding
Pith reviewed 2026-05-15 00:19 UTC · model grok-4.3
The pith
VideoDetective finds relevant segments in long videos by combining query matching with the video's own segment-to-segment affinities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VideoDetective represents video segments as a visual-temporal affinity graph built from visual similarity and temporal proximity, then applies a Hypothesis-Verification-Refinement loop to estimate relevance scores for observed segments, propagate them to unseen segments, and produce a global relevance distribution that localizes the most critical segments for sparse-observation answering.
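To make the graph construction concrete, here is a minimal sketch under stated assumptions. The function name `build_affinity_graph`, the Gaussian temporal kernel, its bandwidth `sigma_t`, and the equal-weight blend of the two terms are all illustrative choices; this page does not state how the paper combines visual similarity with temporal proximity.

```python
import numpy as np

def build_affinity_graph(features, times, sigma_t=30.0):
    """Hypothetical affinity graph: visual cosine similarity blended with temporal proximity."""
    feats = np.asarray(features, dtype=float)
    # Unit-normalize so the dot product is a cosine similarity (assumes no all-zero rows).
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    visual = np.clip(feats @ feats.T, 0.0, 1.0)  # negative similarities are clipped to 0

    t = np.asarray(times, dtype=float)
    # Gaussian kernel on the gap between segment timestamps (an assumed form).
    temporal = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2.0 * sigma_t ** 2))

    affinity = 0.5 * visual + 0.5 * temporal  # assumed equal-weight blend
    np.fill_diagonal(affinity, 0.0)           # no self-loops
    return affinity

# Three segments: 0 and 1 look alike and are 10 s apart; 2 is different and far away.
W = build_affinity_graph([[1, 0], [1, 0], [0, 1]], [0.0, 10.0, 1000.0])
```

In this toy call the edge between segments 0 and 1 is much stronger than either edge to segment 2, which is the behavior the rest of the pipeline relies on.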
What carries the argument
Visual-temporal affinity graph plus Hypothesis-Verification-Refinement loop that estimates and propagates query relevance across segments.
Load-bearing premise
A graph built only from visual similarity and temporal proximity between segments will correctly spread query relevance to the rest of the video without creating false connections or overlooking important content shifts.
What would settle it
Run the method on a long video containing visually similar but semantically unrelated segments; if accuracy falls below a query-only baseline, the propagation mechanism is failing.
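The failure mode this test targets can be seen on a toy graph. The sketch below uses hand-written affinities and a single step of affinity-weighted averaging; both are assumptions made for illustration and are not the paper's propagation rule or data.

```python
import numpy as np

# Invented affinities for 4 segments. Segment 3 is a visual look-alike of
# segment 0 but, in this toy, is semantically unrelated to the query.
W = np.array([
    [0.0, 0.6, 0.1, 0.9],
    [0.6, 0.0, 0.6, 0.1],
    [0.1, 0.6, 0.0, 0.6],
    [0.9, 0.1, 0.6, 0.0],
])

# Only segment 0 has been observed, and it was judged relevant.
observed = np.array([1.0, 0.0, 0.0, 0.0])

# One propagation step: each segment takes the affinity-weighted mean of its
# neighbors' scores.
propagated = (W @ observed) / W.sum(axis=1)
```

Here the look-alike segment 3 receives the largest propagated score even though nothing ties it to the query, which is the kind of spurious transfer a query-only baseline would not suffer from.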
Original abstract
Long video understanding remains challenging for multimodal large language models (MLLMs) due to limited context windows, which necessitate identifying sparse query-relevant video segments. However, existing methods predominantly localize clues based solely on the query, overlooking the video's intrinsic structure and varying relevance across segments. To address this, we propose VideoDetective, a framework that integrates query-to-segment relevance and inter-segment affinity for effective clue hunting in long-video question answering. Specifically, we divide a video into various segments and represent them as a visual-temporal affinity graph built from visual similarity and temporal proximity. We then perform a Hypothesis-Verification-Refinement loop to estimate relevance scores of observed segments to the query and propagate them to unseen segments, yielding a global relevance distribution that guides the localization of the most critical segments for final answering with sparse observation. Experiments show our method consistently achieves substantial gains across a wide range of mainstream MLLMs on representative benchmarks, with accuracy improvements of up to 7.5% on VideoMME-long. Our code is available at https://videodetective.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents VideoDetective, a framework for long-video question answering with MLLMs. Videos are segmented and modeled as a visual-temporal affinity graph constructed from visual similarity and temporal proximity. A Hypothesis-Verification-Refinement loop estimates query relevance on observed segments and propagates scores across the graph to produce a global relevance distribution, which is then used to localize critical segments for sparse observation and final answering. Experiments report consistent accuracy gains across mainstream MLLMs, reaching up to 7.5% on VideoMME-long.
Significance. If the gains are reproducible and the propagation mechanism is shown to be reliable, the work could improve long-video understanding by combining extrinsic query signals with intrinsic video structure, mitigating context-window limits without dense sampling. The public code release is a clear strength for reproducibility.
major comments (1)
- [Method (Hypothesis-Verification-Refinement loop)] The relevance-propagation step in the Hypothesis-Verification-Refinement loop (described in the abstract and method) is load-bearing for both the 'global relevance distribution' claim and the reported 7.5% gains. The graph is built solely from visual similarity and temporal proximity with no mention of query-conditioned edge weighting or explicit handling of semantic shifts; in long videos containing repeated scenes or cutaways this can assign spurious scores to unseen segments. Targeted ablations or failure-case analysis on this assumption are required.
minor comments (1)
- [Abstract] The abstract states accuracy improvements but supplies no information on the number of MLLMs evaluated, the baselines compared, or any statistical testing; a one-sentence summary of these elements would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work's potential and for highlighting the importance of reproducibility through code release. We provide a point-by-point response to the major comment below.
- Referee: The relevance-propagation step in the Hypothesis-Verification-Refinement loop (described in the abstract and method) is load-bearing for both the 'global relevance distribution' claim and the reported 7.5% gains. The graph is built solely from visual similarity and temporal proximity with no mention of query-conditioned edge weighting or explicit handling of semantic shifts; in long videos containing repeated scenes or cutaways this can assign spurious scores to unseen segments. Targeted ablations or failure-case analysis on this assumption are required.
  Authors: We agree that the relevance propagation is crucial to achieving the global relevance distribution and the observed performance gains. The affinity graph is indeed constructed using only visual similarity and temporal proximity, without query-conditioned edge weights. Nevertheless, query conditioning is incorporated through the Hypothesis step, which assigns initial relevance scores to observed segments based on their alignment with the query. These scores are then propagated to unseen segments via the graph structure. The Verification and Refinement stages further incorporate query feedback by selecting additional segments for observation and updating the distribution accordingly. This iterative process helps reduce the impact of potential spurious connections arising from repeated scenes. Although the original manuscript does not include dedicated ablations for semantic shift scenarios, the method's effectiveness is supported by results on challenging long-video benchmarks. To directly address this concern, we will include targeted ablations and failure-case analyses on videos with repeated scenes and cutaways in the revised version.
  Revision: yes
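The loop described in this exchange can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function `hvr_loop`, the callback `score_segment`, the damping factor `alpha`, the observation `budget`, and the choice to always observe the currently top-ranked unseen segment are all assumptions.

```python
import numpy as np

def hvr_loop(affinity, score_segment, budget=3, alpha=0.5, n_iter=20):
    """Hypothetical observe-score-propagate loop over an affinity graph."""
    n = affinity.shape[0]
    row_sums = affinity.sum(axis=1, keepdims=True)
    P = affinity / np.where(row_sums > 0, row_sums, 1.0)  # row-normalized affinities

    observed = {}                     # segment index -> query relevance score
    relevance = np.full(n, 1.0 / n)   # start from a uniform guess
    for _ in range(budget):
        candidates = [i for i in range(n) if i not in observed]
        if not candidates:
            break
        # Observe the unseen segment the current distribution ranks highest.
        nxt = max(candidates, key=lambda i: relevance[i])
        observed[nxt] = float(score_segment(nxt))  # query-conditioned score

        # Spread observed scores over the graph, keeping observed values fixed.
        seed = np.zeros(n)
        for i, s in observed.items():
            seed[i] = s
        r = seed.copy()
        for _ in range(n_iter):
            r = alpha * (P @ r) + (1.0 - alpha) * seed
            for i, s in observed.items():
                r[i] = s
        relevance = r
    return relevance, observed

# Chain of 5 segments; only segment 0 is relevant in this toy scorer.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
relevance, observed = hvr_loop(A, lambda i: 1.0 if i == 0 else 0.0, budget=1)
```

Note that the query enters only through `score_segment`; the edges never see it. That is exactly the design the referee questions, since a strong visual edge will carry a score regardless of semantic fit.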
Circularity Check
No circularity: algorithmic framework with explicit propagation steps
The paper describes VideoDetective as a procedural pipeline: segment the video, construct an affinity graph from visual similarity and temporal proximity, then run a Hypothesis-Verification-Refinement loop to estimate observed relevance and propagate to unseen segments. No equations, fitted parameters, or self-citations are shown that would reduce the final relevance distribution to the inputs by construction. The propagation is an explicit algorithmic step rather than a tautological redefinition or renamed fit. This is self-contained against external benchmarks and matches the expected non-finding for a method paper.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption A video can be partitioned into segments whose pairwise visual similarity and temporal proximity form a useful affinity graph for relevance propagation.
- domain assumption Relevance scores estimated on observed segments can be reliably propagated to unobserved segments through the affinity graph.
Reference graph
Works this paper leans on
-
[1]
Read the question and all options
-
[2]
For each option, check frames and attached evidence lines
-
[3]
Prefer explicit evidence over vague impressions
-
[4]
For order/time questions, compare early vs late frames; for text, use OCR evidence
-
[5]
If evidence is weak, choose most plausible option and state low confidence. Rules: MUST output an option LETTER (A/B/C/D). DO NOT output “NO EVIDENCE”. Response Format: Analysis: <your reasoning> Final Answer: <ONE LETTER> Reason: <one short sentence> User Prompt: Based on these video frames, answer the following question: Frame Information: {frame info str} ...
-
[6]
Official sampling rates (frames per video)
-
[7]
Per-frame token counts specified in official API documentation
-
[8]
Standard video resolution settings
Important Notes:
• These estimates include only image tokens and exclude text prompts, system instructions, and other textual overhead.
• This makes them conservative baselines; the actual token consumption of these models would be higher in practice.
• All measurements are averaged across all videos in the VideoMME-long be...
-
[9]
Text preprocessing: Lowercase conversion, stopword removal, and lemmatization
-
[10]
IDF computation: Pre-computed on a large corpus, with out-of-vocabulary words assigned a default IDF value
-
[11]
Score computation: For evidence text $e$ and keyword set $K_r$: $s_{\text{lex}}(e, f_r) = \min\left(1.0,\ \frac{\sum_{t \in e \cap K_r} \mathrm{IDF}(t)}{Z_{\text{lex}}}\right)$, where $Z_{\text{lex}} = 3.0$ is a normalization constant
-
[12]
Normalization: We clip scores to [0,1] via the min(·) term above.
D.4.3. Embedding-Based Semantic Similarity
For semantic matching, we use the SigLIP text encoder with cosine similarity:
-
[13]
Text encoding: $\psi(e) = \text{SigLIP-Text}(e) \in \mathbb{R}^d$, $\|\psi(e)\|_2 = 1$
-
[14]
Score computation: For evidence text $e$ and semantic query set $P_r$, we compute: $s_{\text{sem}}(e, f_r) = \max_{p \in P_r} \langle \psi(e), \psi(p) \rangle$, where $p$ represents semantic queries (event descriptions) that capture the contextual meaning of each facet
-
[15]
Batch encoding: All semantic queries are pre-encoded for efficiency.
D.5. Source-aware Fusion
Different evidence sources have different signal-to-noise characteristics:
• OCR text: High precision, low recall. Weight: $\lambda_{\text{ocr}} = 0.7$ (trust lexical more). $s_{\text{ocr}}(e, f_r) = 0.7 \cdot s_{\text{lex}}(e, f_r) + 0.3 \cdot s_{\text{sem}}(e, f_r)$
• ASR text: Balanced. Weight: $\lambda_{\text{asr}} = 0.5$ (equal trust). sa...
-
[16]
Uniform sampling: Extract F frames uniformly distributed across the entire video (not per-node). We reuse the same frame sampling number F as the final answer generation
-
[17]
VLM generation (time-stamped event timeline): Use the VLM to generate a coarse event timeline based on these F frames, capturing the overall narrative and key events. Concretely, the VLM outputs a list of event items, each with an approximate temporal span (e.g., start/end timestamps or the corresponding frame indices among the F sampled frames) plus a sh...
-
[18]
A person explains X before demonstrating Y
Deterministic node-level assignment: Each node corresponds to a video chunk with a temporal interval $[s_i, e_i]$. We assign to node $i$ all event items whose temporal spans overlap with $[s_i, e_i]$ (or whose associated sampled-frame indices fall within the node’s interval), and concatenate their descriptions to form $e_i$. If no event item overlaps, we assign th...
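The fragments in entries [11] to [18] describe two small procedures: scoring evidence text against a facet, and attaching timeline events to graph nodes. A minimal sketch of both follows, under stated assumptions. All function names are hypothetical, `DEFAULT_IDF` is an invented placeholder for the unspecified out-of-vocabulary default, embeddings are passed in as precomputed unit vectors instead of calling SigLIP, the midpoint rule for frame sampling is one possible reading of "uniformly distributed", and the no-overlap fallback is left as an empty string because the source text is truncated at that point.

```python
import numpy as np

Z_LEX = 3.0        # normalization constant quoted in entry [11]
DEFAULT_IDF = 1.0  # hypothetical default IDF for out-of-vocabulary tokens

def lexical_score(evidence_tokens, keywords, idf):
    """min(1, sum of IDF over shared tokens / Z_LEX), as in entry [11]."""
    shared = set(evidence_tokens) & set(keywords)
    return min(1.0, sum(idf.get(t, DEFAULT_IDF) for t in shared) / Z_LEX)

def semantic_score(evidence_vec, query_vecs):
    """Max inner product between one evidence vector and all query vectors.
    Assumes every vector is already unit-normalized."""
    q = np.asarray(query_vecs, dtype=float)
    return float(np.max(q @ np.asarray(evidence_vec, dtype=float)))

def fuse(s_lex, s_sem, lam):
    """Convex blend; lam = 0.7 reproduces the OCR formula in entry [15]."""
    return lam * s_lex + (1.0 - lam) * s_sem

def uniform_frame_indices(num_frames, F):
    """F indices spread over the whole video (midpoint rule, an assumption)."""
    F = min(F, num_frames)
    if F <= 0:
        return []
    return [int((k + 0.5) * num_frames / F) for k in range(F)]

def assign_events_to_nodes(nodes, events):
    """nodes: list of (start, end); events: list of (start, end, description).
    Closed intervals, so spans that merely touch count as overlapping."""
    assigned = []
    for s_i, e_i in nodes:
        texts = [d for (a, b, d) in events if a <= e_i and b >= s_i]
        assigned.append(" ".join(texts))  # "" when nothing overlaps
    return assigned

s_lex = lexical_score(["red", "car", "the"], {"red", "car"}, {"red": 1.2, "car": 0.9})
s_sem = semantic_score([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
s_ocr = fuse(s_lex, s_sem, lam=0.7)
node_text = assign_events_to_nodes(
    [(0, 10), (10, 20), (20, 30)], [(0, 8, "A"), (8, 15, "B")]
)
```

Applying the same blend to other evidence sources with a different `lam` is a guess at the pattern; only the OCR formula is fully visible above.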