VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation

Boqiang Duan; Boyuan Tong; Jiedong Zhuang; Jingdong Wang; Ming Dai; Sen Yang; Wankou Yang

arxiv: 2606.06819 · v1 · pith:3LCP4KS3new · submitted 2026-06-05 · 💻 cs.CV

Ming Dai, Sen Yang, Boqiang Duan, Boyuan Tong, Jiedong Zhuang, Wankou Yang, Jingdong Wang

Pith reviewed 2026-06-27 22:55 UTC · model grok-4.3

classification 💻 cs.CV

keywords reasoning video object segmentation · multi-turn reinforcement learning · temporal-spatial chain-of-thought · video segmentation · reinforcement learning for vision · coarse-to-fine reasoning · keyframe selection

The pith

VideoSEG-O3 introduces the first multi-turn reinforcement learning framework for reasoning video object segmentation that iteratively refines outputs by selecting critical time intervals and keyframes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a new approach to reasoning video object segmentation by framing it as a multi-turn reinforcement learning task that mimics human coarse-to-fine visual search. Instead of processing a fixed initial input once, the system generates a temporal-spatial chain-of-thought that repeatedly identifies promising video segments and keyframes to gather additional evidence. A SEG-aware logit calibration step feeds pixel-level segmentation quality back into the model's token decisions during training, and a decoupled thinking trace organizes reasoning along separate temporal, spatial, and linguistic axes. A new cold-start dataset called VTS-CoT supplies the detailed trajectories needed to bootstrap this process.
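The coarse-to-fine loop described above can be sketched as a small rollout function. This is a minimal illustration, not the authors' algorithm: the `Action` and `Observation` types, the `run_episode` name, the turn budget, and the clamping of requested indices are all assumptions, and `toy_policy` merely stands in for the multimodal model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    kind: str                                   # "select" or "answer" (hypothetical encoding)
    interval: Optional[Tuple[int, int]] = None  # inclusive frame range for "select"
    keyframe: Optional[int] = None              # frame index for "select"


@dataclass
class Observation:
    interval: Tuple[int, int]   # frames the model looks at in this turn
    keyframe: Optional[int]     # None on the initial, global turn


def run_episode(policy: Callable[[List[Observation]], Action],
                num_frames: int, max_turns: int = 4):
    """Roll out one coarse-to-fine episode; returns (history, turns_used)."""
    history = [Observation(interval=(0, num_frames - 1), keyframe=None)]
    for turn in range(1, max_turns + 1):
        action = policy(history)
        if action.kind == "answer":
            return history, turn
        # Clamp the requested interval and keyframe to valid frame indices.
        start = max(0, min(action.interval[0], num_frames - 1))
        end = max(start, min(action.interval[1], num_frames - 1))
        key = max(start, min(action.keyframe, end))
        history.append(Observation(interval=(start, end), keyframe=key))
    return history, max_turns  # turn budget exhausted without an answer


def toy_policy(history: List[Observation]) -> Action:
    # Stand-in for the model: zoom in once, then answer.
    if len(history) == 1:
        return Action("select", interval=(10, 30), keyframe=20)
    return Action("answer")
```

Calling `run_episode(toy_policy, num_frames=100)` yields a two-turn episode: one global look, one zoomed-in look, then the answer. A single-turn baseline corresponds to a policy that answers immediately.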

Core claim

VideoSEG-O3 is the first multi-turn reinforcement learning framework for RVOS that emulates the human coarse-to-fine cognitive process through a multi-turn temporal-spatial chain-of-thought, SEG-aware logit calibration that integrates pixel-wise segmentation feedback into token-level logits, a decoupled thinking trace that decomposes reasoning into temporal, spatial, and linguistic dimensions, and the VTS-CoT cold-start dataset of comprehensive reasoning trajectories.

What carries the argument

A multi-turn temporal-spatial chain-of-thought combined with SEG-aware logit calibration, which lets the policy adjust token probabilities using direct pixel-wise segmentation feedback rather than text-only signals.
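To make the idea of folding pixel-level feedback into a token logit concrete, here is one possible shape such a calibration could take. The text available here gives no formula, so everything below is an assumption for illustration: the confidence measure (mean per-pixel certainty), the additive log-confidence shift, and the `alpha` scale are hypothetical and should not be read as the paper's update rule.

```python
import math
from typing import List


def mask_confidence(probs: List[List[float]]) -> float:
    """Mean per-pixel certainty max(p, 1 - p): 0.5 is undecided, 1.0 is certain."""
    flat = [p for row in probs for p in row]
    return sum(max(p, 1.0 - p) for p in flat) / len(flat)


def calibrate_seg_logit(logits: List[float], seg_id: int,
                        confidence: float, alpha: float = 1.0) -> List[float]:
    """Shift only the [SEG] logit by alpha * log(confidence) (hypothetical rule)."""
    out = list(logits)
    out[seg_id] += alpha * math.log(max(confidence, 1e-6))
    return out


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Under this sketch a blurry mask lowers the probability of the `[SEG]` token more than a sharp one does, so a policy-gradient update on token probabilities would see a signal tied to mask quality. Whether the actual method works this way is exactly what the referee report asks the authors to spell out.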

If this is right

The policy can now perceive and act on segmentation quality beyond the text probability of the [SEG] token during reinforcement learning.
Reasoning trajectories decompose hierarchically into temporal, spatial, and linguistic components via the decoupled thinking trace.
Training can begin from a specialized cold-start dataset VTS-CoT that supplies complete multi-turn reasoning paths.
The framework supports active acquisition of visual evidence in long or intricate videos rather than relying solely on fixed initial inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same iterative selection of intervals and keyframes could extend to other video tasks that require progressive refinement of spatial or temporal focus.
If the logit calibration generalizes, similar feedback loops might improve reinforcement learning in domains where output quality can be measured at a finer granularity than the action tokens themselves.
The decoupled trace structure suggests a template for making chain-of-thought reasoning more interpretable by separating concerns that are currently entangled in single-pass models.

Load-bearing premise

That feeding pixel-wise segmentation quality back into token-level logits through SEG-aware logit calibration will let the reinforcement learning policy improve its decisions on the basis of actual segmentation accuracy.

What would settle it

A controlled comparison on long videos with ambiguous references showing that a single-turn baseline achieves equal or higher segmentation accuracy without the iterative interval and keyframe selection steps.

Figures

Figures reproduced from arXiv: 2606.06819 by Boqiang Duan, Boyuan Tong, Jiedong Zhuang, Jingdong Wang, Ming Dai, Sen Yang, Wankou Yang.

**Figure 1.** Motivation of VideoSEG-O3. (a) Explicit Text: Coordinate-based RL methods (e.g., box, point) allow rewards to directly optimize the policy. (b) Implicit Embedding: Latent mask representations (e.g., [SEG]) suffer from implicit disconnection between token probability and mask quality. (c) Limited Visual Cues: Fixed sampling strategies fail to actively explore key temporal segments required for precise loca…

**Figure 2.** Overview and statistical analysis of VideoSEG-O3. Top: The MLLM architecture employs a Decoupled Thinking Trace to analyze motion context, spatial details, and expressions. A specialized <select> token is utilized to orchestrate iterative temporal-spatial exploration. Bottom: Performance gains (∆J&F) and turn distributions across four representative benchmarks. Results highlight the performance evolution …

**Figure 3.** Overall pipeline of VideoSEG-O3. (1) Initial Observation: The model ingests global temporal frames and uniformly sampled spatial frames as visual inputs. (2) Iterative Exploration: The model iteratively refines its reasoning by taking the temporal interval and keyframe index selected via <select> in the previous turn as input. (3) Termination: Upon generating the <answer> token, the process terminates, and…

**Figure 4.** The overall framework of VideoSEG-O3. (a) Multi-turn temporal-spatial exploration (Sec. 3.1). (b) A composite reward design guiding reinforcement learning via format (R_f), temporal (R_t), segmentation (R_m), and progressive (R_p) rewards (Sec. 3.3). (c) A calibration strategy that aligns the latent [SEG] token representation with pixel-level mask confidence (Sec. 3.2).

**Figure 5.** Quantitative comparison of RL on keyframe selection. We compare keyframe mIoU (mask quality for initialization) and keyframe empty ratio (frequency of selecting frames without the target) across four representative benchmarks.

**Figure 6.** Reinforcement learning training curves. (a) The smoothed reward curve demonstrates consistent convergence and performance improvement. (b) The average turns indicate the model’s transition from fixed reasoning patterns to adaptive multi-turn exploration. (c) The completion length reflects the refinement and stabilization of the generated reasoning trajectories.

**Figure 7.** Pipeline of our VTS-CoT: Data selection → Temporal labeling → Candidate generation → Chain-of-Thought construction.

**Figure 8.** Ablation study on auxiliary segmentation loss (λ_seg).

**Figure 9.** Visualization of a two-round interactive reasoning and segmentation process.

**Figure 10.** Visualization of a three-round iterative refinement process for complex target localization (Sample 1).

**Figure 11.** Visualization of a three-round iterative refinement process for complex target localization (Sample 2).
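The Figure 4 caption names four reward terms (format, temporal, segmentation, progressive) without defining them. As a rough sketch of how such a composite could be assembled, the code below uses temporal IoU over inclusive frame ranges as a stand-in for the temporal term and a plain weighted sum for the total. Both choices, the function names, and the default weights are assumptions, not the paper's definitions.

```python
from typing import Sequence, Tuple


def temporal_iou(pred: Tuple[int, int], gt: Tuple[int, int]) -> float:
    """IoU of two inclusive frame intervals (start, end)."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (gt[1] - gt[0] + 1) - inter
    return inter / union if union > 0 else 0.0


def composite_reward(r_format: float, r_temporal: float,
                     r_mask: float, r_progress: float,
                     weights: Sequence[float] = (1.0, 1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the four terms; the weights are hypothetical."""
    terms = (r_format, r_temporal, r_mask, r_progress)
    return sum(w * r for w, r in zip(weights, terms))
```

For example, a predicted interval of frames 10 to 19 against a ground-truth interval of frames 15 to 24 shares 5 of 15 frames, giving a temporal IoU of 1/3.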

Original abstract

Reasoning Video Object Segmentation (RVOS) demands a sophisticated integration of temporal dynamics, spatial details, and linguistic reasoning to achieve precise pixel-level localization. Existing methods are limited to reasoning over fixed initial inputs and lack the capacity to actively acquire further visual evidence, which is often essential for resolving complex references in long or intricate videos. To address this, we propose VideoSEG-O3, the first multi-turn reinforcement learning framework for RVOS that emulates the human "coarse-to-fine" cognitive process. It employs a multi-turn temporal-spatial chain-of-thought to capture fine-grained details by iteratively pinpointing critical intervals and keyframes. Additionally, to enable the policy to perceive segmentation quality beyond mere text probability of [SEG] during the RL stage, we introduce SEG-aware logit calibration, which integrates pixel-wise segmentation feedback directly into the token-level logits. Furthermore, we design a decoupled thinking trace to hierarchically decompose the reasoning process into temporal, spatial, and linguistic dimensions, and construct VTS-CoT, a specialized cold-start dataset featuring comprehensive reasoning trajectories. The code and models will be released at https://github.com/Dmmm1997/VideoSEG-O3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces a multi-turn RL framework for reasoning video object segmentation with several named components, but the abstract supplies no results or ablations to assess whether they work.

The main thing to know is that this paper describes VideoSEG-O3 as the first multi-turn reinforcement learning approach for reasoning video object segmentation. It adds a temporal-spatial chain-of-thought that runs iteratively over turns to locate key intervals and frames, a SEG-aware logit calibration meant to inject pixel-level segmentation feedback into token logits, a decoupled thinking trace that splits reasoning into temporal, spatial, and linguistic parts, and a new VTS-CoT cold-start dataset.

These pieces target a genuine limit in prior RVOS work: models that cannot actively gather extra visual evidence when the initial input is ambiguous or the video is long. The design tries to emulate a coarse-to-fine human process, and the calibration step is a direct attempt to let the policy see segmentation quality rather than just the probability of a [SEG] token.

The description of how the components fit together is clear enough on its own terms. The decoupled trace and the calibration mechanism are concrete enough that a reader could imagine implementing them.

The clear weakness is the total lack of numbers. The abstract gives no benchmark scores, no ablations on the new pieces, and no comparison to single-turn baselines. Without that evidence it is impossible to tell whether the calibration actually improves the policy or whether the multi-turn setup reduces error on hard cases. The assumption that pixel feedback can be usefully folded into token logits therefore remains untested in the provided text.

This is for people already working on RL or chain-of-thought methods inside video segmentation. A reader who wants to see a new framework sketched out could get ideas from it, but anyone needing proven gains would have to wait for the experiments. It is worth sending to referees so the authors can supply the missing results and the community can judge the calibration and multi-turn claims on data.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes VideoSEG-O3 as the first multi-turn reinforcement learning framework for Reasoning Video Object Segmentation (RVOS). It introduces a multi-turn temporal-spatial chain-of-thought process to emulate human coarse-to-fine cognition by iteratively identifying critical intervals and keyframes in videos. Additional contributions include SEG-aware logit calibration to incorporate pixel-wise segmentation feedback into token-level logits during RL training, a decoupled thinking trace that decomposes reasoning hierarchically across temporal, spatial, and linguistic dimensions, and the VTS-CoT cold-start dataset containing comprehensive reasoning trajectories. The work targets limitations in prior RVOS methods that rely on fixed initial inputs without active visual evidence acquisition.

Significance. If validated, the framework could meaningfully advance RVOS research by enabling iterative, evidence-seeking reasoning in complex or long videos, moving beyond single-pass approaches. The SEG-aware logit calibration and decoupled thinking trace offer potentially reusable ideas for integrating dense visual signals into RL policies for vision-language tasks, and the VTS-CoT dataset could serve as a useful resource for training multi-turn models. Code and model release supports reproducibility.

major comments (3)

[Abstract] Abstract/Contributions paragraph: the central claim that SEG-aware logit calibration 'integrates pixel-wise segmentation feedback directly into the token-level logits' to allow perception of segmentation quality beyond [SEG] text probability is load-bearing for the RL stage, yet the abstract (and any corresponding method description) provides no formulation, pseudocode, or derivation showing how the calibration is computed or why it avoids reducing to a fitted parameter from the same data.
[Method] Method section on multi-turn temporal-spatial chain-of-thought: the description of iteratively pinpointing intervals and keyframes is presented without an explicit algorithm, state transition, or reward formulation; this makes it impossible to verify whether the multi-turn process is internally consistent or reduces to standard single-turn RL with added turns.
[Experiments] Experiments (or lack thereof): the abstract and provided text contain no quantitative results, ablation studies, or baseline comparisons; without these, the effectiveness claims for the new components cannot be assessed and the soundness of the overall contribution remains unverified.

minor comments (2)

[Abstract] Abstract: the phrase 'the first multi-turn reinforcement learning framework' should be supported by an explicit literature comparison in §2 rather than asserted.
[Abstract] Notation: 'VTS-CoT' and 'decoupled thinking trace' are introduced without immediate expansion or reference to their definitions in later sections.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additions.

Referee: [Abstract] Abstract/Contributions paragraph: the central claim that SEG-aware logit calibration 'integrates pixel-wise segmentation feedback directly into the token-level logits' to allow perception of segmentation quality beyond [SEG] text probability is load-bearing for the RL stage, yet the abstract (and any corresponding method description) provides no formulation, pseudocode, or derivation showing how the calibration is computed or why it avoids reducing to a fitted parameter from the same data.

Authors: We agree that the abstract and method description as presented lack the explicit formulation. In the revised manuscript we will add the full mathematical derivation of SEG-aware logit calibration, including the precise update rule that combines pixel-wise segmentation metrics with token logits, along with pseudocode and an explanation of why the mechanism is not reducible to a data-fitted scalar.
revision: yes
Referee: [Method] Method section on multi-turn temporal-spatial chain-of-thought: the description of iteratively pinpointing intervals and keyframes is presented without an explicit algorithm, state transition, or reward formulation; this makes it impossible to verify whether the multi-turn process is internally consistent or reduces to standard single-turn RL with added turns.

Authors: We acknowledge the need for a formal specification. The revised manuscript will include an explicit algorithm box, state-transition equations, and the per-turn reward formulation that distinguishes the multi-turn evidence-acquisition loop from single-turn RL.
revision: yes
Referee: [Experiments] Experiments (or lack thereof): the abstract and provided text contain no quantitative results, ablation studies, or baseline comparisons; without these, the effectiveness claims for the new components cannot be assessed and the soundness of the overall contribution remains unverified.

Authors: This observation is correct for the text provided to the referee. The revised version will contain a full experimental section with quantitative results on standard RVOS benchmarks, ablations isolating each proposed component (multi-turn CoT, SEG-aware calibration, decoupled trace), and comparisons against relevant baselines.
revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The manuscript presents VideoSEG-O3 as a newly constructed multi-turn RL framework for RVOS, together with auxiliary components (multi-turn temporal-spatial CoT, SEG-aware logit calibration, decoupled thinking trace, VTS-CoT dataset). No equations, fitted parameters, predictions, or uniqueness theorems appear in the supplied text. All load-bearing elements are introduced as design choices rather than derived from quantities defined inside the same paper or from self-citations that reduce the central claim to an input. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review yields no explicit free parameters, standard axioms, or invented physical entities; the new methodological components are treated as introduced techniques rather than postulated entities with independent evidence.

invented entities (2)

SEG-aware logit calibration (no independent evidence)
purpose: Integrate pixel-wise segmentation feedback into token-level logits during RL
New technique described in abstract to address limitation of text-only probability

decoupled thinking trace (no independent evidence)
purpose: Hierarchically decompose reasoning into temporal, spatial, and linguistic dimensions
New design element introduced to support the multi-turn process

Reference graph

Works this paper leans on

17 extracted references

[1]

Stage II: Chain-of-Thought Cold-Start

(0.6K), Ref-SAV (Yuan et al., 2025a) (37K), and Long-RVOS (Liang et al., 2025a) (2.2K). • Stage II: Chain-of-Thought Cold-Start. – Objective: To elicit multi-step reasoning, enable tool usage, and standardize the format for multi-turn interactions. – Datasets (VTS-CoT): We constructed a proprietary dataset (VTS-CoT, 6K samples) via GPT-assisted labeling. So...

2024
[2]

(In-Domain), alongside zero-shot benchmarks on ReasonVOS (Bai et al., 2024) and GroundMoRe (Deng et al.,

2024
[3]

Performance is measured using standard metrics: J (average Intersection over Union, IoU), F (boundary F-measure), and J&F (the average of J and F)

(Out-Domain). Performance is measured using standard metrics: J (average Intersection over Union, IoU), F (boundary F-measure), and J&F (the average of J and F). Table 6. The three-stage training pipeline of VideoSEG-O3, detailing the specific capabilities developed and the datasets utilized at each stage. Stage Capability Training Datasets Stage I: SFT Video QA...

2024
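The metrics named in the entry above can be illustrated with a short sketch. It computes region similarity J as mask IoU on binary masks given as nested lists, averages it over frames, and combines it with a boundary score F that is assumed to be computed elsewhere, since the boundary F-measure requires contour matching that is out of scope here. Scoring two empty masks as 1.0 is a convention assumed for this sketch, and the function names are illustrative.

```python
from typing import List

Mask = List[List[int]]  # binary mask, rows of 0/1


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J for one frame: intersection over union of two binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 1.0  # both masks empty: assumed perfect


def mean_region_similarity(preds: List[Mask], gts: List[Mask]) -> float:
    """J averaged over the frames of one video."""
    scores = [region_similarity(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


def j_and_f(j: float, f: float) -> float:
    """J&F is the plain average of J and F."""
    return (j + f) / 2.0
```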
[4]

(1.7K) A.2. Training Details. The training of VideoSEG-O3 is hierarchically structured into three stages: supervised fine-tuning (SFT), Chain-of-Thought (CoT) cold-start, and Reinforcement Learning (RL). This section provides the exhaustive hyperparameter configurations for each phase. A.2.1. Stage I (SFT) and Stage II (CoT Cold-Start). The transition from Stag...

2025
[5]

Each dataset introduces unique challenges to ensure the complexity and generalization capability of VTS-CoT

Diverse Data Curation. We aggregate a diverse collection of video clips from three representative benchmarks: ReVOS, Long-RVOS, and MeViS. Each dataset introduces unique challenges to ensure the complexity and generalization capability of VTS-CoT. Specifically, ReVOS focuses on reasoning-intensive scenarios, MeViS emphasizes complex motion and expression u...
[6]

Following the initial annotation, we implement a rigorous filtering protocol to retain high-quality samples

Temporal Labeling and Mask Quality Assessment. We design a specialized prompt (see Table 10) to guide the Qwen3VL-235B-A30B-Instruct (Bai et al., 2025) model in performing precise temporal labeling and evaluating mask fidelity. Following the initial annotation, we implement a rigorous filtering protocol to retain high-quality samples. To ensure annotation ...

2025
[7]

This process is executed by the ERNIE-45-VL-424B-A47B model using the prompt detailed in Table 11

Temporal Candidation. Leveraging the refined temporal annotations from the previous stage, we generate three top-ranked candidate temporal intervals. This process is executed by the ERNIE-45-VL-424B-A47B model using the prompt detailed in Table 11. To encourage the generation of diverse yet plausible temporal proposals, we deliberately employ a low-resolut...
[8]

Decoupled CoT Construction. With the high-quality temporal annotations and candidate intervals established, we utilize the prompt presented in Table 12 to instruct the Qwen3VL-235B-A30B-Thinking (Bai et al., 2025) model in synthesizing the final reasoning chains. To maximize the diversity and coverage of the generated Chain-of-Thought (CoT) data, we apply ...

2025
[9]

Temporal Localization: Identify the most semantic-relevant time interval [t_start, t_end] and a representative keyframe t_key that best aligns with the Target Query
[10]

start frame

Mask Fidelity Assessment: Evaluate the Candidate Masks. Determine if the green contours accurately and consistently capture the target object described in the query. Constraints & Rules: • Temporal Duration: The selected window must capture the core event without being excessive. The duration ∆t = t_end − t_start must satisfy: 5 < ∆t < 0.5 × T_total • Index Validity:...
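The temporal duration rule quoted above is simple enough to check mechanically. The sketch below encodes only the stated inequality on the window length; the index validity rule is cut off in the text and is deliberately not reconstructed. The function name is illustrative, and treating the times as frame indices is an assumption.

```python
def duration_is_valid(t_start: int, t_end: int, t_total: int) -> bool:
    """True when 5 < (t_end - t_start) < 0.5 * t_total, as the prompt requires."""
    delta_t = t_end - t_start
    return 5 < delta_t < 0.5 * t_total
```

With a 100-frame video, a window from frame 10 to frame 30 passes (length 20), a window from 10 to 14 is too short, and a window from 0 to 60 is too long.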
[11]

Primary Selection (S1): Identify the single most relevant segment in the entire video
[12]

Find the best matching segment in the remaining video parts (V \ S1)

Secondary Selection (S2): Mask the time range of S1. Find the best matching segment in the remaining video parts (V \ S1)
[13]

start frame

Tertiary Selection (S3): Mask the time ranges of both S1 and S2. Find the best match in (V \ (S1 ∪ S2)). Constraints & Rules: • Relevance Ranking: Output must be sorted by semantic relevance: Rel(S1) > Rel(S2) > Rel(S3). • Non-Overlapping: Segments must be mutually exclusive: Si ∩ Sj = ∅, ∀ i ≠ j • Duration Constraints: For any segment Si with duration ∆ti: 0.1×T to...
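Two of the rules above, strict relevance ordering and mutual exclusivity, can be verified with a few lines of code. This sketch assumes segments are inclusive `(start, end)` frame ranges and that each comes with a numeric relevance score, which the prompt itself does not expose; the duration rule is truncated in the text and is therefore left out.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # inclusive (start, end) frame range


def overlaps(a: Segment, b: Segment) -> bool:
    """True when two inclusive ranges share at least one frame."""
    return a[0] <= b[1] and b[0] <= a[1]


def candidates_are_valid(segments: List[Segment],
                         relevance: Sequence[float]) -> bool:
    """Check Rel(S1) > Rel(S2) > ... and pairwise disjointness."""
    strictly_ranked = all(x > y for x, y in zip(relevance, relevance[1:]))
    disjoint = not any(overlaps(a, b) for a, b in combinations(segments, 2))
    return strictly_ranked and disjoint
```

For three candidates the pairwise check costs three comparisons, so a quadratic scan is fine; sorting by start time would only matter for much longer candidate lists.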
[14]

Phase 1 (Initialization): Analyze global context and text to form a hypothesis. Constraint: You cannot access specific segment details yet. Action: Request Segment 0
[15]

Verify if the target matches

Phase 2 (Verification Loop): Analyze specific frames requested in the previous step. Verify if the target matches. Action: Request next segment OR Confirm final target. Input Data: • Global: Referring Expression, Total Video Length, Sampled Frame Indices. • Input Segments: Sequence of temporal segments (start/end/keyframe). • Visual Cues: Red Numbers (Frame Indic...
[16]

Step 2 (Local Spatial): Inspect Sampled Frames

First Item (Global Analysis): - Step 1 (Global Temporal): Analyze low-res global context for candidates. - Step 2 (Local Spatial): Inspect Sampled Frames. - Step 3 (Alignment): Correlate visual cues with text constraints. - Action: Request Input Segment [0]
[17]

The car at the very front at the beginning

Subsequent Items (Verification Loop for Segment k): - Step 1 (Segment Temporal): Analyze motion in the requested interval. - Step 2 (Segment Spatial): Verify visual attributes in the requested key frame. - Step 3 (Refinement): Compare specific evidence against text. - Action: Justify checking Segment [k+1] OR Confirm Final Segment. Constraints & Rules: • Information Is...

2025
