Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions

Changlin Li; Haihong Hao; Junhan Zhao; Kecheng Zhang; Mingfei Han; Xiaojun Chang; Yunzhi Zhuge; Zhihui Li; Zongxin Yang

arxiv: 2604.18459 · v1 · submitted 2026-04-20 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-10 04:20 UTC · model grok-4.3


keywords online video understanding · streaming video · evidence-aligned timing · transparent decision making · visual agents · progress metrics · hierarchical semantic integration · memory integration


The pith

A framework decouples reasoning control from memory integration so video AI can time responses exactly to the first sufficient evidence while streaming transparent decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that visual agents processing streaming video can respond to queries at the precise moment enough evidence appears, rather than after the full clip or with delayed offline processing. It separates the control of when and how to reason from the ongoing integration of visual memory, using observable metrics to decide timing and special tokens to keep a global state efficient. A reader would care because real applications like live surveillance or interactive analysis require both accurate timing and explainable outputs, which current video models trained on complete videos cannot provide. The approach externalizes progress and confidence scores so the system can align its answer moment with the evidence arrival and stream its thinking to users. This yields higher accuracy on benchmarks built specifically for online, streaming evaluation.

Core claim

We propose a novel framework that decouples reasoning control from memory integration. We introduce Thinking-QwenVL with ATDM, which externalizes its decision process using observable progress (ρ) and confidence (c) metrics to time its response tr to match the first-sufficient-evidence timestamp t⋆, and an HPSI module with learnable multi-level aggregation tokens that maintain a global cognitive state, achieving 71.6% on StreamingBench and 46.9% on OVOBench.

What carries the argument

Active Thinking Decision Maker (ATDM) that uses progress (ρ) and confidence (c) metrics for evidence-aligned response timing, paired with Hierarchical Progressive Semantic Integration (HPSI) that propagates learnable multi-level aggregation tokens across clips.

Load-bearing premise

Progress and confidence metrics can reliably identify the first-sufficient-evidence timestamp using only information available up to the current point in the stream, without future frames.
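
To make the premise concrete, here is a minimal sketch of what a causal timing rule could look like, assuming a joint threshold on ρ and c as the simulated rebuttal describes. The function name `first_response_time` and the threshold values `rho_min` and `c_min` are hypothetical illustrations, not taken from the paper.

```python
from typing import Iterable, Optional, Tuple

# One streamed observation: (timestamp, progress rho, confidence c).
Step = Tuple[float, float, float]


def first_response_time(
    stream: Iterable[Step],
    rho_min: float = 0.9,  # hypothetical progress threshold
    c_min: float = 0.8,    # hypothetical confidence threshold
) -> Optional[float]:
    """Return the first timestamp at which both thresholds are met.

    The stream is consumed strictly in order and the loop stops at the
    first hit, so the decision never depends on later steps.
    """
    for t, rho, c in stream:
        if not (0.0 <= rho <= 1.0 and 0.0 <= c <= 1.0):
            raise ValueError("rho and c must lie in [0, 1]")
        if rho >= rho_min and c >= c_min:
            return t
    return None  # evidence never became sufficient
```

Passing a generator as `stream` shows the causal property directly: items after the chosen timestamp are never pulled from it. Whether the chosen timestamp coincides with t⋆ is exactly what the premise leaves open.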

What would settle it

Human annotators mark the earliest timestamp at which sufficient evidence appears for each query in a held-out streaming video set; if the model's chosen response times deviate systematically from those marks, the timing alignment claim is false.
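
The check described above could be scored along the following lines. This is a sketch under stated assumptions: the function name `alignment_report`, the `tolerance` parameter, and the choice of summary statistics are illustrative and do not come from the paper or its benchmarks.

```python
from statistics import mean
from typing import Dict, Sequence


def alignment_report(
    response_times: Sequence[float],
    annotated_times: Sequence[float],
    tolerance: float = 0.0,  # hypothetical slack, in the same time unit
) -> Dict[str, float]:
    """Summarize how model response times t_r deviate from annotated t*."""
    if len(response_times) != len(annotated_times) or not response_times:
        raise ValueError("need two non-empty sequences of equal length")
    signed = [tr - ts for tr, ts in zip(response_times, annotated_times)]
    n = len(signed)
    return {
        # A mean signed error far from zero indicates a systematic bias.
        "mean_signed_error": mean(signed),
        "mean_abs_error": mean([abs(e) for e in signed]),
        # Share of queries answered before / after the annotated moment.
        "early_rate": sum(e < -tolerance for e in signed) / n,
        "late_rate": sum(e > tolerance for e in signed) / n,
    }
```

A consistently positive `mean_signed_error` would mean the model tends to answer late, and a high `early_rate` would mean it commits before annotators consider the evidence sufficient.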

Figures

Figures reproduced from arXiv: 2604.18459 by Changlin Li, Haihong Hao, Junhan Zhao, Kecheng Zhang, Mingfei Han, Xiaojun Chang, Yunzhi Zhuge, Zhihui Li, Zongxin Yang.

**Figure 1.** Comparing paradigms vs. ours. Given a query Q, offline VLLMs answer only after the full video is available (t = T), while streaming models answer at the query moment (t = tq); neither ensures evidence-aligned timing with the earliest evidence time t⋆. Our method decomposes Q into sub-goals and maintains a progress estimate ρ, emitting real-time, stage-wise feedback at every step and selecting a response …

**Figure 2.** (a) Visual information aggregation flow diagram. (b) The dynamic integration operation in LLM with a single clip as an example. The aggregation tokens are initialized in layer 1, layer L/3 and layer 2L/3 according to the aggregation tokens of the previous level that can support dynamic resolution style, and these tokens are passed forward layer by layer within the LLM to aggregate the visual information …

**Figure 3.** Pipeline of Thinking-QwenVL. Given streamed clips and a query Q, ATDM generates question-guided caption instructions, decomposes Q into sub-questions, and iteratively extracts evidence from each clip (with progressive visual integration using HPSI), updating sub-answers with progress ρ ∈ [0, 1] and confidence c ∈ [0, 1]. This process runs in parallel across clips and permits to trigger active reflection a…

**Figure 4.** Visualization of qualitative example showcasing how our ATDM framework achieves …

**Figure 5.** Accuracy improvements over our baseline model on sub-tasks of the RTVBench under the same experimental conditions as the RTVBench paper. The overall accuracy of our model increased from 32.75% to 35.87%.

**Figure 7.** Impact of ATDM components. "All" represents the complete model performance when using ATDM. Each column beyond this represents the ablation of the corresponding part of ATDM.

**Figure 8.** A real example of the attention mask in our final …

**Figure 9.** An example illustrating the outputs of each ATDM component in Thinking-QwenVL; in …

**Figure 10.** An example illustrating the outputs of each ATDM component in Flash-VStream. The …

**Figure 11.** Comparison of model-generated captions for the same clip. Our caption explicitly en…

Original abstract

Visual agents operating in the wild must respond to queries precisely when sufficient evidence first appears in a video stream, a critical capability that is overlooked by conventional video LLMs evaluated in offline settings. The shift to an online, streaming paradigm introduces significant challenges: a lack of decision transparency, the difficulty of aligning response timing with visual evidence, and the need to maintain a global, causally consistent understanding under tight computational budgets. To address these issues, we propose a novel framework that decouples reasoning control from memory integration. We introduce Thinking-QwenVL, an instantiation of this framework with two core components. First, the Active Thinking Decision Maker (ATDM) is a transparent reasoning controller that externalizes its decision process using observable progress (ρ) and confidence (c) metrics. This allows it to precisely time its response tr to match the first-sufficient-evidence timestamp t⋆ while streaming its reasoning to the user. Second, the Hierarchical Progressive Semantic Integration (HPSI) module acts as an efficient memory system. It employs a set of learnable, multi-level aggregation tokens that are propagated across clips to build a rich, global cognitive state without exceeding token budgets. Extensive experiments demonstrate the effectiveness of ATDM and HPSI, e.g., Thinking-QwenVL improves the accuracy of the previous state-of-the-art from 67.63% to 71.60% on the StreamingBench benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper adds ATDM for timing responses via progress and confidence metrics plus HPSI memory tokens to online video LLMs, but lacks direct tests confirming the timing works causally from past frames only.


The main things to know are that this work targets online streaming video understanding by decoupling reasoning control from memory. It uses an ATDM controller that outputs observable progress (ρ) and confidence (c) scores to set the response time tr at the first-sufficient-evidence point t⋆, and an HPSI module that propagates learnable multi-level aggregation tokens across clips to keep a global state under token limits. With Thinking-QwenVL, the authors report lifting StreamingBench accuracy from 67.63% to 71.6% and reaching 46.9% on OVOBench.

That combination of explicit timing metrics and hierarchical memory tokens is not in the prior offline video LLM work referenced in the abstract, so the specific modules count as new. The framing of the problem (transparent decisions and evidence-aligned timing in a true stream) is useful for anyone building visual agents that must act without future frames. The benchmark numbers suggest the additions move the needle on those datasets.

The soft spot is the missing validation for the timing claim. The stress-test point holds: the abstract and reported gains do not include an ablation that freezes the model, removes any possible future-clip leakage, and measures alignment error against human-annotated t⋆ collected under strict online rules. Without that, it is hard to know whether ρ and c actually detect the first-sufficient point causally or rely on training signals that would not be available at inference. HPSI looks like a reasonable efficiency device, but the same causal-consistency check would strengthen it.

This is for people working on streaming video agents or real-time visual reasoning rather than general video understanding. A reader focused on practical deployment in robotics or surveillance would get concrete module ideas and benchmark numbers to build on. The paper deserves peer review because the problem is real, the modules are specific, and the reported gains are measurable even if the central timing evidence needs tightening.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes a framework for online streaming video understanding that decouples reasoning control from memory integration. It introduces Thinking-QwenVL with two components: the Active Thinking Decision Maker (ATDM), which externalizes decisions via observable progress (ρ) and confidence (c) metrics to set response time tr equal to the first-sufficient-evidence timestamp t⋆ while streaming reasoning, and the Hierarchical Progressive Semantic Integration (HPSI) module, which uses learnable multi-level aggregation tokens propagated across clips to maintain a global causal cognitive state within token budgets. The work reports benchmark results of 71.6% on StreamingBench (improving prior SOTA from 67.63%) and 46.9% on OVOBench, with claims that extensive experiments validate the effectiveness of ATDM and HPSI for evidence-aligned, transparent online video analysis.

Significance. If ATDM's ρ and c metrics can causally identify t⋆ from past frames only and HPSI maintains consistent global understanding without lookahead or leakage, the framework would advance streaming video LLMs by addressing decision transparency and precise timing—capabilities overlooked in offline evaluations. The reported gains on dedicated online benchmarks (StreamingBench, OVOBench) suggest practical value for real-world agents operating under computational constraints, provided the timing alignment is empirically isolated from non-causal signals.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: The central claim that ATDM times tr to match t⋆ using only observable ρ and c presupposes causal detection of the first-sufficient-evidence point from past frames alone. The reported accuracy lift (67.63% to 71.60% on StreamingBench) provides no ablation, temporal alignment error metric, or validation against human-annotated t⋆ collected under strict online constraints with no future-frame access or post-hoc labels; this is load-bearing for the evidence-aligned timing contribution.
[HPSI module] HPSI module description: The learnable multi-level aggregation tokens are presented as building a rich global cognitive state while respecting token budgets and causal consistency across clips. No equation, propagation rule, or analysis demonstrates how information flow remains strictly causal (no implicit future-clip leakage) in a true streaming regime, which underpins the claim of maintaining consistent understanding under tight budgets.

minor comments (3)

[Abstract] Abstract: Performance is reported as both 71.6% and 71.60% on StreamingBench; standardize decimal precision for consistency.
[Abstract] Abstract: The model is named Thinking-QwenVL but the title and framework description use 'Progressive Online Video Understanding'; clarify the exact relationship between the instantiated model and the general framework.
[ATDM description] Notation: ρ and c are typeset in bold math mode (suggesting vectors), but their scalar or per-token usage in timing decisions is not explicitly defined; add a brief clarification.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and detailed comments, which highlight important aspects of causal timing and memory consistency in our online video understanding framework. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.


Referee: [Abstract / Experiments] Abstract and Experiments section: The central claim that ATDM times tr to match t⋆ using only observable ρ and c presupposes causal detection of the first-sufficient-evidence point from past frames alone. The reported accuracy lift (67.63% to 71.60% on StreamingBench) provides no ablation, temporal alignment error metric, or validation against human-annotated t⋆ collected under strict online constraints with no future-frame access or post-hoc labels; this is load-bearing for the evidence-aligned timing contribution.

Authors: We agree that isolating the causal contribution of ATDM's timing mechanism is essential. Both ρ and c are computed exclusively from past and current frames in a streaming fashion, with tr determined as the first timestamp where the joint threshold on these metrics is met (see Section 3.2). To address the concern, we will add an ablation in the revised Experiments section that disables the progressive timing logic (replacing it with a fixed-delay baseline) and reports the resulting drop in StreamingBench accuracy, thereby quantifying how much of the 3.97-percentage-point gain is attributable to evidence-aligned decisions. We will also introduce a temporal alignment error metric that measures the average deviation between tr and the model's internal evidence-sufficiency point. However, human-annotated t⋆ labels collected under strict online constraints (no future frames or post-hoc adjustment) are not present in StreamingBench or OVOBench; creating such a dataset would require a separate, large-scale annotation effort outside the scope of this work. We will add an explicit limitations paragraph discussing this gap and outlining it as future work. revision: partial
Referee: [HPSI module] HPSI module description: The learnable multi-level aggregation tokens are presented as building a rich global cognitive state while respecting token budgets and causal consistency across clips. No equation, propagation rule, or analysis demonstrates how information flow remains strictly causal (no implicit future-clip leakage) in a true streaming regime, which underpins the claim of maintaining consistent understanding under tight budgets.

Authors: We thank the referee for emphasizing the need for explicit causal guarantees. In the revised manuscript we will insert the formal propagation equations for the multi-level aggregation tokens: at each clip t, the level-l token A_l^t is computed via a unidirectional cross-attention update A_l^t = CrossAttn(A_l^{t-1}, ClipTokens^t; causal_mask), where the mask prevents any access to future clips. We will add a new subsection with a flow diagram and a short proof that the recurrence is strictly causal (information from clip t+k never influences state at t). This analysis will also quantify the token-budget savings while preserving global consistency, directly supporting the claims under streaming constraints. revision: yes
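
The recurrence in the response above can be illustrated with a toy example. The sketch below is a deliberate simplification and not the paper's HPSI: it assumes a single aggregation token, single-head dot-product attention with no learned projections, and a residual update, and the names `cross_attend` and `propagate` are hypothetical. It only demonstrates the structural point that a state built as A^t = f(A^{t-1}, clip_t) cannot depend on later clips.

```python
import math
from typing import List

Vector = List[float]


def cross_attend(query: Vector, tokens: List[Vector]) -> Vector:
    """Softmax-weighted average of clip tokens, scored against the query."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) / total for i in range(d)]


def propagate(init_state: Vector, clips: List[List[Vector]]) -> List[Vector]:
    """Return the aggregation state after each clip, in arrival order.

    Each update reads only the previous state and the current clip's
    tokens, so states[t] is a function of clips[0..t] alone.
    """
    states: List[Vector] = []
    state = list(init_state)
    for clip_tokens in clips:
        attended = cross_attend(state, clip_tokens)
        state = [s + a for s, a in zip(state, attended)]  # residual update
        states.append(state)
    return states
```

Running `propagate` on two streams that differ only in their final clip yields identical states for every earlier clip, which is the property a formal causality argument would need to establish for the real module.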

standing simulated objections not resolved

We cannot supply human-annotated first-sufficient-evidence timestamps (t⋆) collected under strict online constraints, because no such labels exist in the current benchmarks and generating them would constitute an independent data-collection project beyond the scope of a manuscript revision.

Circularity Check

0 steps flagged

No significant circularity; framework modules presented as independent additions


The provided abstract and text describe a framework proposal introducing ATDM (with observable ρ and c metrics) and HPSI (with learnable aggregation tokens) as novel components to address streaming video challenges. No equations, derivation chains, or self-referential reductions appear in the text. Performance numbers (71.6% on StreamingBench) are reported as empirical outcomes rather than predictions forced by construction from fitted parameters. The central claims rest on the described modules' design and benchmark results without load-bearing self-citations or definitional equivalences that collapse the argument to its inputs. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

Based on abstract only: introduces learnable aggregation tokens and decision metrics without specifying their training or validation; assumes decoupling of control and memory solves streaming issues.

free parameters (1)

multi-level aggregation tokens
Learnable tokens propagated across clips to build global state, number and levels not specified.

axioms (1)

domain assumption: Decoupling reasoning control from memory integration addresses online video challenges effectively
Stated as the core of the novel framework to handle timing, transparency, and budgets.

invented entities (2)

Active Thinking Decision Maker (ATDM) · no independent evidence
purpose: Transparent reasoning controller using progress and confidence metrics
New module to externalize decisions and align timing with evidence.
Hierarchical Progressive Semantic Integration (HPSI) · no independent evidence
purpose: Efficient memory system with multi-level aggregation tokens
New module for global cognitive state under token limits.

pith-pipeline@v0.9.0 · 5649 in / 1452 out tokens · 72829 ms · 2026-05-10T04:20:04.813270+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video
cs.CV · 2026-05 · unverdicted · novelty 7.0

StreamPro introduces a benchmark and training method using CB-Stream Loss and GRPO to enable proactive decision-making in streaming videos, achieving 41.5 on StreamPro-Bench compared to 10.4 previously.

Reference graph

Works this paper leans on

16 extracted references · cited by 1 Pith paper

[1]

P1: Remove P1 (caption instructions CI_q); require P2 to give captions directly
[2]

P2: Disable P2 (question decomposition); retain a single query Q and require P4 to answer Q at each step while still emitting per-step confidence c and progress ρ
[3]

Table 7: Comparison with current online video understanding LMMs on OVOBench

P3: Remove P3 (streaming captioning) to test the value of the textual intermediary; P4 is switched from text-only consumption to multimodal extraction, directly retrieving evidence from the current visual stream to fill sub-answers. Table 7: Comparison with current online video understanding LMMs on OVOBench. The subtasks are: i) Real-Time Visual Perception (OC...
[4]

P4: Replace the graded (ρ, c) update in P4 (progressive tracking of sub-question status) with a single binary answerable flag (0/1), eliminating accumulated progress and confidence smoothing
[5]

P5: Remove P5 (self-triggered reflection) to assess the benefit of cross-clip causal revision under low confidence or major semantic shifts. A.3 SUMMARY OF HYPERPARAMETER SETTINGS The training process of our Thinking-QwenVL is structured into three distinct phases. 1) Integration Pre-training. We pretrain the model on LLaVA-Video-178k (Li et al., 2024a)...

2025
[6]

type": "action

Interaction-Focused QA Fine-Tuning. We further fine-tune the model using general QA-style dialog data to enhance its interaction ability and improve alignment with user queries in a streaming setting. Throughout all stages, only the intermediate Merge layers and the LLM backbone are fine-tuned, while the visual encoder remains frozen. All experiments are run o...

2024
[7]

Limit to 500 words max

Base your caption only on what is clearly visible. Limit to 500 words max
[8]

Be specific and concrete: describe actions, hand use, counts, object states, etc
[9]

No speculation, no vague summaries

Use short, factual sentences. No speculation, no vague summaries
[10]

three apples

Precision first if required: "three apples" NOT "some fruits", "thrusting downward at 45° angle" NOT "attacking", "2.3m left of tree" NOT "near tree", etc. <CONSTRAINTS> Return the caption in valid JSON format: { "clip_timestamp": "{timestamp}", "caption": "detailed caption that fulfills the requirements" } <CONSTRAINTS> ▶Output: { "clip_timestamp": "0:07:46-0:08...
[11]

Read the main user question and the list of required subquestions (from Part-1)
[12]

Read the caption of the current video clip
[13]

value": “?

For each subquestion, determine whether the caption provides enough information to answer it: - If yes: provide an appropriate answer ("value") and a confidence score between 0 and 1. - If no or uncertain: set "value": "?" and "confidence": 0.0. <INPUT> Main Question: <|Question|> Required Subquestions (from Part-2 or latest output from Part-4): <|Required ...
[14]

Clip X→[supports/contradicts/provides evidence for] [attribute] because [exact caption text]

Cross-clip causal reasoning - Analyze each new clip caption for direct evidence related to each attribute. - Build an explicit, ordered chain only for attributes with relevant evidence. Use arrow notation: "Clip X → [supports/contradicts/provides evidence for] [attribute] because [exact caption text]". If a clip provides no relevant evidence for any attribute, ...
[15]

relevant evidence found

Evidence relevance check - For each attribute, explicitly check whether the captions contain relevant information. Mark attributes as “relevant evidence found” or “no relevant evidence”
[16]

causal_chain

Update the attribute list - Preserve original values and confidences for attributes without relevant evidence. Modify attributes only where direct, explicit evidence is found; quote the exact caption text that supports the change. <INPUT> Question: <|Question|> Latest reasoning state (attribute list + confidences): <|Past CoT State|> Past clip captions when co...
