arxiv: 2601.23224 · v2 · pith:RXKT4C55new · submitted 2026-01-30 · 💻 cs.CV

Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

Pith reviewed 2026-05-22 11:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords long video understanding · multi-hop reasoning · tool invocation · multimodal large language models · attention masking · trajectory guided reward · evidence seeking · video question answering

The pith

Video-o3 enables iterative visual clue seeking in long videos using native tool invocation for multi-hop reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Video-o3 to overcome the limits of uniform sampling and single-turn inference in multimodal models processing long videos. It shows that models can actively discover salient visual clues, inspect key segments in detail, and stop once they have enough evidence for complex questions. This matters because extensive redundancy in long videos makes sparse critical information hard to locate without repeated, focused looks. The work introduces Task-Decoupled Attention Masking to keep attention focused during mixed reasoning and tool steps while holding onto global context, plus a Verifiable Trajectory-Guided Reward to manage growing context length during multi-turn use. Training relies on a synthesized set of 173K trajectories, and the resulting model records 72.1 percent accuracy on MLVU and 46.5 percent on Video-Holmes.
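
The seek, inspect, stop cycle described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `seek_answer`, `propose_segment`, `inspect_segment`, `has_enough`, and `answer` are hypothetical names, and the toy "video" is a list of hand-written segment descriptions standing in for a real model and tool.

```python
from dataclasses import dataclass, field


@dataclass
class Clue:
    start: float  # segment start in seconds
    end: float    # segment end in seconds
    note: str     # what the inspection returned


@dataclass
class SeekState:
    question: str
    clues: list = field(default_factory=list)


def seek_answer(question, propose_segment, inspect_segment, has_enough, answer, max_turns=8):
    """Alternate reasoning and tool calls until evidence suffices or the turn budget ends."""
    state = SeekState(question)
    for _ in range(max_turns):
        if has_enough(state):                # adaptive termination
            break
        start, end = propose_segment(state)  # reasoning step: choose where to look
        note = inspect_segment(start, end)   # tool step: look closely at that segment
        state.clues.append(Clue(start, end, note))
    return answer(state), state


def demo():
    # Hypothetical toy video: (start, end, description). Not data from the paper.
    segments = [
        (0, 60, "street scene"),
        (60, 120, "man picks up red bag"),
        (120, 180, "man boards bus"),
        (180, 240, "credits"),
    ]

    def propose_segment(state):
        start, end, _ = segments[len(state.clues)]
        return start, end

    def inspect_segment(start, end):
        return next(d for s, e, d in segments if (s, e) == (start, end))

    def has_enough(state):
        notes = " ".join(c.note for c in state.clues)
        return "red bag" in notes and "bus" in notes

    def answer(state):
        return "bus" if has_enough(state) else "unknown"

    return seek_answer("Where does the man with the red bag go?",
                       propose_segment, inspect_segment, has_enough, answer)
```

In the toy run, the loop stops after the third segment and never inspects the fourth, which is the behaviour "adaptive termination" refers to.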

Core claim

Video-o3 supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination once sufficient evidence is acquired. Task-Decoupled Attention Masking isolates per-step concentration while preserving shared global context to address attention dispersion from heterogeneous reasoning and tool-calling. The Verifiable Trajectory-Guided Reward balances exploration coverage with reasoning efficiency to limit context growth in multi-turn interactions. A data synthesis pipeline builds the Seeker-173K dataset of 173K high-quality tool-interaction trajectories for supervised and reinforcement learning, yielding 72.1 percent accuracy on MLVU and 46.5 percent on Video-Holmes.

What carries the argument

Native interleaved tool invocation supported by Task-Decoupled Attention Masking, which separates per-step attention flows while retaining global context, and Verifiable Trajectory-Guided Reward, which steers efficient multi-turn trajectories.
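
One way to picture "separate per-step attention flows with retained global context" is a block-structured causal mask. The sketch below is an assumption about what such a mask could look like, not the paper's definition: `task_decoupled_mask` and the `segment_ids` encoding (0 for shared context, k > 0 for step k) are hypothetical.

```python
def task_decoupled_mask(segment_ids):
    """Build a boolean attention mask from per-token segment labels.

    segment_ids[i] == 0 marks shared global context (assumed encoding);
    segment_ids[i] == k > 0 marks a token belonging to step k.
    mask[q][k] is True where query token q may attend to key token k.
    """
    n = len(segment_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: only current and earlier positions
            is_global = segment_ids[k] == 0
            same_step = segment_ids[k] == segment_ids[q]
            mask[q][k] = is_global or same_step
    return mask


# Two global tokens, then two tokens each for step 1 and step 2.
example = task_decoupled_mask([0, 0, 1, 1, 2, 2])
```

Under this assumed layout, a step-2 token sees the two global tokens and its own step, but not the step-1 tokens. Whether the paper lets later steps read earlier steps' outputs is not stated on this page.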

If this is right

  • Long-video tasks improve when models can target and inspect only the segments that matter instead of sampling frames uniformly.
  • Multi-turn reasoning stays tractable because the reward system limits unnecessary context expansion.
  • Training scales through large numbers of automatically generated tool-use trajectories rather than hand-labeled data.
  • Adaptive stopping reduces wasted computation once the model judges that enough evidence has been gathered.
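
The paper's reward formula is not reproduced on this page, so the following is only one plausible shape for a reward that trades exploration coverage against efficiency. Everything here is an assumption made for illustration: the function name `trajectory_reward`, the weights `alpha` and `beta`, and the idea of checking inspected segments against annotated clue intervals.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def trajectory_reward(inspected, gold_clues, correct, alpha=0.5, beta=0.1):
    """Hypothetical reward: answer correctness + coverage bonus - wasted-turn penalty.

    inspected:  list of (start, end) segments the model chose to inspect
    gold_clues: list of (start, end) annotated clue intervals (assumed available)
    correct:    whether the final answer was right
    """
    covered = sum(any(overlaps(g, s) for s in inspected) for g in gold_clues)
    coverage = covered / len(gold_clues) if gold_clues else 0.0
    wasted = sum(not any(overlaps(s, g) for g in gold_clues) for s in inspected)
    return float(correct) + alpha * coverage - beta * wasted


gold = [(60, 120), (200, 240)]
focused = trajectory_reward([(50, 130), (190, 250)], gold, correct=True)
wandering = trajectory_reward([(0, 30), (50, 130), (300, 330), (190, 250)], gold, correct=True)
```

Both toy trajectories find every clue and answer correctly, but the wandering one scores lower because two of its four inspections hit nothing. That is the sense in which a reward can discourage unnecessary context expansion.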

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interleaved seeking pattern might help models handle other long-form inputs where key details are scattered, such as extended audio or document collections.
  • Practical systems for video search or monitoring could adopt active inspection to lower the cost of reviewing hours of footage.
  • Further tests on unscripted real-world videos would show whether the synthesized trajectories transfer to noisy, non-benchmark settings.

Load-bearing premise

Task-Decoupled Attention Masking isolates per-step focus without discarding essential shared context, and the Verifiable Trajectory-Guided Reward generates trajectories that generalize past the synthesized Seeker-173K dataset.

What would settle it

An ablation that removes Task-Decoupled Attention Masking and measures the resulting accuracy drop on MLVU or Video-Holmes against uniform-sampling baselines would test whether the masking step is required for the reported gains.
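
To make "measure the resulting accuracy drop" concrete, a paired bootstrap over per-question outcomes gives both a point estimate and an interval. This is a generic statistical sketch, not a procedure taken from the paper. The function name `paired_bootstrap_drop` is hypothetical, and the inputs below are made-up outcomes, not results from the paper.

```python
import random


def paired_bootstrap_drop(full_correct, ablated_correct, n_resamples=2000, seed=0):
    """Estimate the accuracy drop (full minus ablated) with a 95% percentile interval.

    Both inputs are lists of 0/1 outcomes on the same questions, in the same order.
    """
    if len(full_correct) != len(ablated_correct):
        raise ValueError("outcome lists must cover the same questions")
    n = len(full_correct)
    rng = random.Random(seed)
    diffs = [f - a for f, a in zip(full_correct, ablated_correct)]
    point = sum(diffs) / n
    samples = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        samples.append(sum(diffs[rng.randrange(n)] for _ in range(n)) / n)
    samples.sort()
    low = samples[int(0.025 * n_resamples)]
    high = samples[int(0.975 * n_resamples) - 1]
    return point, low, high


# Made-up toy outcomes on 100 questions: 70 correct with masking, 60 without.
toy_full = [1] * 70 + [0] * 30
toy_ablated = [1] * 60 + [0] * 40
drop, low, high = paired_bootstrap_drop(toy_full, toy_ablated)
```

If the interval for the drop excludes zero on MLVU or Video-Holmes, that would support the claim that the masking step contributes to the reported gains. An interval straddling zero would not.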

Original abstract

Existing multimodal large language models for long-video understanding predominantly rely on uniform sampling and single-turn inference, limiting their ability to identify sparse yet critical evidence amid extensive redundancy. We introduce Video-o3, a novel framework that supports iterative discovery of salient visual clues, fine-grained inspection of key segments, and adaptive termination once sufficient evidence is acquired. Technically, we address two core challenges in interleaved tool invocation. First, to mitigate attention dispersion induced by the heterogeneity of reasoning and tool-calling, we propose Task-Decoupled Attention Masking, which isolates per-step concentration while preserving shared global context. Second, to control context length growth in multi-turn interactions, we introduce a Verifiable Trajectory-Guided Reward that balances exploration coverage with reasoning efficiency. To support training at scale, we further develop a data synthesis pipeline and construct Seeker-173K, comprising 173K high-quality tool-interaction trajectories for effective supervised and reinforcement learning. Extensive experiments show that Video-o3 substantially outperforms state-of-the-art methods, achieving 72.1% accuracy on MLVU and 46.5% on Video-Holmes. These results demonstrate Video-o3's strong multi-hop evidence-seeking and reasoning capabilities, and validate the effectiveness of native tool invocation in long-video scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Video-o3, a framework for long-video multi-hop reasoning that replaces uniform sampling with iterative native tool invocation for discovering and inspecting sparse visual clues. It proposes Task-Decoupled Attention Masking to prevent attention dispersion between reasoning and tool-calling steps while preserving global context, and a Verifiable Trajectory-Guided Reward to manage context growth by balancing exploration coverage against reasoning efficiency. A data-synthesis pipeline produces the Seeker-173K dataset of tool-interaction trajectories for supervised and reinforcement learning. Experiments report 72.1% accuracy on MLVU and 46.5% on Video-Holmes, claimed to demonstrate the effectiveness of native interleaved tool use.

Significance. If the performance gains and causal attribution to the two proposed mechanisms hold after proper validation, the work would be significant for the field of long-video understanding. It directly targets the redundancy and evidence-sparsity problems that limit current MLLMs, offering a concrete path toward adaptive, multi-turn evidence-seeking agents rather than single-pass inference.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): The headline results (72.1% MLVU, 46.5% Video-Holmes) are presented without any baseline tables, ablation studies, error bars, or statistical tests. Because these numbers are the sole empirical support for the central claim that native tool invocation yields substantial gains, the absence of comparative evidence makes the contribution impossible to evaluate at present.
  2. [§3.2] §3.2 (Verifiable Trajectory-Guided Reward): The reward is described as balancing exploration coverage with reasoning efficiency, yet no equations, reward formulation, or fitting procedure are supplied. It is therefore impossible to determine whether the reward function contains free parameters fitted on Seeker-173K trajectories or whether it truly generalizes to real long-video distributions whose clue statistics were never seen during synthesis.
  3. [§3.1] §3.1 (Task-Decoupled Attention Masking): The claim that the masking isolates per-step concentration without loss of shared global context is load-bearing for the attention-dispersion argument, but no attention-map visualizations, ablation on context preservation, or quantitative measure of information loss is provided to support it.
minor comments (2)
  1. [Abstract] The abstract and introduction repeatedly use the phrase 'native interleaved clue seeking' without a concise definition or contrast to prior tool-use paradigms; a short clarifying sentence would improve readability.
  2. [§3.3] Dataset construction details for Seeker-173K (video sources, clue annotation protocol, trajectory filtering criteria) are referenced but not quantified; adding a table of dataset statistics would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on our manuscript. We address each of the major comments in detail below. We are committed to improving the clarity and completeness of the paper based on these suggestions.

  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The headline results (72.1% MLVU, 46.5% Video-Holmes) are presented without any baseline tables, ablation studies, error bars, or statistical tests. Because these numbers are the sole empirical support for the central claim that native tool invocation yields substantial gains, the absence of comparative evidence makes the contribution impossible to evaluate at present.

    Authors: We appreciate this observation. The experiments section does present comparisons to existing methods, but we acknowledge that the presentation could be strengthened with more structured tables and additional analyses. In the revised manuscript, we will include comprehensive baseline comparison tables, detailed ablation studies on the proposed components, error bars from multiple experimental runs, and appropriate statistical significance tests to better support the reported performance gains. revision: yes

  2. Referee: [§3.2] §3.2 (Verifiable Trajectory-Guided Reward): The reward is described as balancing exploration coverage with reasoning efficiency, yet no equations, reward formulation, or fitting procedure are supplied. It is therefore impossible to determine whether the reward function contains free parameters fitted on Seeker-173K trajectories or whether it truly generalizes to real long-video distributions whose clue statistics were never seen during synthesis.

    Authors: We agree that formalizing the reward is essential for reproducibility. We will add the mathematical formulation of the Verifiable Trajectory-Guided Reward, including the specific equations that balance the exploration coverage term and the reasoning efficiency penalty. Additionally, we will detail the procedure used to fit any parameters on the Seeker-173K dataset and include a discussion on its potential generalization to unseen long-video distributions, supported by qualitative analysis of clue statistics. revision: yes

  3. Referee: [§3.1] §3.1 (Task-Decoupled Attention Masking): The claim that the masking isolates per-step concentration without loss of shared global context is load-bearing for the attention-dispersion argument, but no attention-map visualizations, ablation on context preservation, or quantitative measure of information loss is provided to support it.

    Authors: Thank you for highlighting this. To substantiate the effectiveness of Task-Decoupled Attention Masking, we will incorporate attention map visualizations comparing the masked and unmasked attention patterns. We will also add ablation experiments measuring context preservation and quantitative metrics for information loss, such as attention entropy or token retention rates, to empirically validate that per-step concentration is achieved without compromising the global context. revision: yes
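
The attention-entropy metric named in the third response is the Shannon entropy of one query's attention distribution: low values mean concentrated attention, high values mean dispersed attention. A minimal sketch follows. The function name `attention_entropy` is hypothetical, and how the authors would aggregate across heads and layers is not stated.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one query's attention weights.

    Weights are normalised to sum to 1 first; zero weights contribute nothing.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


dispersed = attention_entropy([0.25, 0.25, 0.25, 0.25])     # uniform over 4 keys
concentrated = attention_entropy([0.97, 0.01, 0.01, 0.01])  # mostly on one key
```

Uniform attention over four keys reaches the maximum, log 4, while the peaked distribution scores lower. Comparing such values with and without masking is one way to quantify the dispersion the referee asks about.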

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces Task-Decoupled Attention Masking and Verifiable Trajectory-Guided Reward as novel mechanisms, supported by a data synthesis pipeline yielding Seeker-173K trajectories for supervised and reinforcement learning. Performance is demonstrated empirically on external benchmarks (MLVU at 72.1%, Video-Holmes at 46.5%) rather than derived by construction from the synthesis procedure. No equations or claims reduce a central result to fitted inputs or self-citations; the framework remains self-contained against independent test distributions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard multimodal LLM assumptions about attention behavior and introduces new mechanisms without postulating new physical entities or unverified constants.

axioms (2)
  • domain assumption: Heterogeneity of reasoning and tool-calling induces attention dispersion in existing MLLMs for long videos.
    Identified as the first core challenge addressed by Task-Decoupled Attention Masking.
  • domain assumption: Multi-turn interactions require explicit control of context length growth to maintain efficiency.
    Identified as the second core challenge addressed by the Verifiable Trajectory-Guided Reward.

pith-pipeline@v0.9.0 · 5803 in / 1332 out tokens · 37913 ms · 2026-05-22T11:24:19.897652+00:00 · methodology

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.

  2. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 7.0

    VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...

  3. ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six b...

  4. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.

  5. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...