StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video

Ao Li; Boshen Xu; Jian Luan; Jianzhong Ju; Jiaze Li; Linli Yao; Pei Fu; Qin Jin; Zihan Xiao; Zihao Yue

arxiv: 2605.16381 · v1 · pith:HK7KQUJ2new · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-20 23:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords streaming video understanding · proactive decision-making · video benchmark · temporal reasoning · reinforcement learning · supervised fine-tuning

The pith

A new benchmark and two-stage training let video models decide when to respond under partial observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current streaming video benchmarks reduce the task to delayed perception, because they let models wait for explicit evidence before answering. This misses the need for early yet reliable decisions in ongoing video streams, such as live monitoring or commentary. To address the gap, the authors build StreamPro-Bench, which adds a Proactive Agency dimension alongside perception and temporal reasoning. They then introduce a two-stage training method: supervised fine-tuning that corrects the severe imbalance between silence and response signals, followed by reinforcement learning with rewards at both the turn and the trajectory level. If the approach works, models can act on incomplete information without losing accuracy on standard streaming tests.

Core claim

StreamPro-Bench evaluates streaming models from three perspectives, including Proactive Agency, which measures early yet reliable decisions under partial observations. The StreamPro framework applies CB-Stream Loss during supervised fine-tuning and then Group Relative Policy Optimization with multi-grained rewards to jointly optimize correctness and timing. It reaches 41.5 on the new benchmark, against a prior best of 10.4, while scoring 78.9 on StreamingBench-RTVU.

What carries the argument

The two-stage training framework: CB-Stream Loss addresses supervision imbalance during supervised fine-tuning, and Group Relative Policy Optimization then applies turn-level and trajectory-level rewards.
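
As a rough illustration of the second stage, the sketch below computes group-relative advantages in the style of GRPO from a blend of turn-level and trajectory-level rewards. The blending weight `w_turn`, the averaging of turn rewards, and the function names are assumptions made for illustration; the summary here does not state how the paper combines the two reward granularities.

```python
from statistics import mean, pstdev

def blended_reward(turn_rewards, trajectory_reward, w_turn=0.5):
    """Blend per-turn rewards with one trajectory-level reward (hypothetical weighting)."""
    turn_part = mean(turn_rewards) if turn_rewards else 0.0
    return w_turn * turn_part + (1.0 - w_turn) * trajectory_reward

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical rollouts sampled for the same streaming query.
group = [
    blended_reward([1.0, 1.0, 0.0], 1.0),
    blended_reward([1.0, 0.0, 0.0], 0.0),
    blended_reward([0.0, 0.0, 0.0], 0.0),
    blended_reward([1.0, 1.0, 1.0], 1.0),
]
advantages = group_relative_advantages(group)
```

Because advantages are computed relative to the group, a rollout is rewarded only for doing better than its siblings on the same query, which is what removes the need for a separate value model in GRPO.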

Load-bearing premise

The new benchmark accurately captures a model's ability to make early yet reliable decisions under partial observations, and the training framework effectively balances response correctness and timing without introducing biases from the imbalance in silence and response signals.

What would settle it

A direct test in which models must output responses at fixed early fractions of video clips containing only partial evidence, checking whether accuracy remains high on StreamPro-Bench but drops sharply on real-time streaming tasks.
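
One way to picture such a test is sketched below: each clip is truncated to a fixed fraction of its length, the model is forced to answer, and accuracy is tabulated per fraction. The `predict` callable, the sample fields, and the chosen fractions are hypothetical placeholders rather than part of StreamPro-Bench.

```python
def accuracy_at_fractions(samples, predict, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Force an answer after seeing only the first `fraction` of each clip's frames."""
    results = {}
    for fraction in fractions:
        correct = 0
        for sample in samples:
            frames = sample["frames"]
            # Always show at least one frame, even for very short clips.
            visible = frames[: max(1, int(len(frames) * fraction))]
            if predict(visible, sample["query"]) == sample["answer"]:
                correct += 1
        results[fraction] = correct / len(samples) if samples else 0.0
    return results

# Toy stand-in for a model: it answers correctly only once the evidence frame is visible.
def toy_predict(visible_frames, query):
    return "yes" if "evidence" in visible_frames else "unknown"

toy_samples = [
    {"frames": ["a", "b", "evidence", "d"], "query": "q1", "answer": "yes"},
    {"frames": ["evidence", "b", "c", "d"], "query": "q2", "answer": "yes"},
]
report = accuracy_at_fractions(toy_samples, toy_predict)
```

Running the same routine on both StreamPro-Bench and a real-time streaming benchmark would show whether accuracy at early fractions diverges between the two.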

Figures

Figures reproduced from arXiv: 2605.16381 by Ao Li, Boshen Xu, Jian Luan, Jianzhong Ju, Jiaze Li, Linli Yao, Pei Fu, Qin Jin, Zihan Xiao, Zihao Yue.

**Figure 2.** Task case illustration of StreamPro-Bench. It contains 7 tasks categorized into three major …

**Figure 3.** StreamPro-Bench Statistics: the number of tasks and the average video length. To generate high-quality data across the 7 tasks, we design a pipeline based on a two-agent verification loop, followed by thorough human refinement for all samples. Given the inherent complexity of Risk Forecasting task, we rely entirely on human annotation and verification to guarantee data quality. Further details are provid…

**Figure 4.** Overview of the StreamPro training framework.

**Figure 5.** Bootstrap rank stability heatmap.

**Figure 6.** Estimated Bradley-Terry scores and 95% confidence intervals via bootstrap resampling.

**Figure 7.** Dataset Statistics. Left: StreamPro-SFT-63K; Right: StreamPro-RL-3K.

**Figure 8.** Benchmark samples of Temporal Reasoning tasks.

**Figure 9.** Benchmark samples of Perceptual Understanding tasks.

**Figure 10.** Benchmark samples of Proactive Agency tasks.

**Figure 11.** Comparison of different models on object understanding tasks.

**Figure 12.** Comparison of different models on temporal grounding tasks.

**Figure 13.** Comparison of different models on risk forecasting tasks.

**Figure 14.** Comparison of different models on risk forecasting tasks.

Original abstract

Proactive streaming video understanding requires models to continuously process video streams and decide when to respond, rather than merely what to respond. This naturally introduces a decision-making problem under partial observations, where models must balance early prediction against sufficient evidence. However, existing benchmarks largely follow a "see-then-answer" paradigm, where responses are triggered only after explicit evidence appears, effectively reducing proactive reasoning to delayed perception. As a result, they fail to evaluate a model's ability to make timely and reliable decisions under incomplete observations. Moreover, training proactive models is inherently challenging due to the extreme imbalance between silence and response signals in streaming trajectories, as well as the need to jointly optimize response correctness and timing. To address these challenges, we introduce StreamPro-Bench, a new benchmark that evaluates streaming models from three complementary perspectives: Perception Understanding, Temporal Reasoning, and Proactive Agency, where the last measures a model's ability to make early yet reliable decisions under partial observations. We further propose StreamPro, a two-stage training framework for proactive learning. First, we introduce CB-Stream Loss to mitigate the severe supervision imbalance during supervised fine-tuning (SFT). Then, we apply Group Relative Policy Optimization (GRPO) with a multi-grained reward design that involves both turn-level and trajectory-level rewards. Experiments show that StreamPro significantly improves proactive performance. On StreamPro-Bench, it achieves 41.5, substantially outperforming the previous best (10.4), while also maintaining strong performance on real-time streaming benchmarks, achieving 78.9 on StreamingBench-RTVU.
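
To make the supervision-imbalance problem concrete, the sketch below computes class-balanced weights from the effective number of samples (Cui et al., 2019), a standard reweighting scheme for skewed label distributions. Whether CB-Stream Loss uses this exact formula is not stated in the abstract, so the formula choice, the value of `beta`, and the turn counts are all assumptions.

```python
def class_balanced_weights(counts, beta=0.999):
    """Weight each class by the inverse of its effective number of samples.

    Effective number for a class with n samples: (1 - beta**n) / (1 - beta).
    Every count must be a positive integer.
    """
    raw = {label: (1.0 - beta) / (1.0 - beta ** n) for label, n in counts.items()}
    scale = len(raw) / sum(raw.values())  # normalize so the weights sum to the class count
    return {label: w * scale for label, w in raw.items()}

# Hypothetical counts: silence turns vastly outnumber response turns in a stream.
weights = class_balanced_weights({"silence": 9500, "response": 500})
```

With these toy counts the rare `response` class receives the larger weight, so a per-turn loss multiplied by these weights would penalize missed responses more heavily than unnecessary silence.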

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

StreamPro adds a benchmark for when to answer in video streams plus a training fix for silence imbalance, but the timing ground truth needs explicit independent definition before the 41.5 score can be read as clear progress.

The paper's core move is to treat streaming video as a decision problem under partial observations rather than a delayed perception task. It introduces StreamPro-Bench with three axes (Perception Understanding, Temporal Reasoning, and Proactive Agency) and pairs it with a two-stage method: CB-Stream Loss during SFT to counter the extreme silence-response imbalance, followed by GRPO using both turn-level and trajectory-level rewards. That combination is the actual new piece; prior work stayed in the see-then-answer setup, so this directly targets the timing-correctness trade-off that existing benchmarks ignore.

The reported 78.9 on StreamingBench-RTVU shows the method does not simply sacrifice accuracy on established real-time tasks, which is a useful check. The 41.5 versus 10.4 gap on the new benchmark is the headline number, but it rests on how the reference timestamps for correct early responses are produced. The stress-test concern is worth taking seriously: if those timestamps are derived from the same trajectories or encode dataset regularities that the reward design exploits, the margin could shrink under a different labeling protocol. The full paper needs to show the annotation rules or detection procedure in enough detail that someone else could reproduce the benchmark without guessing.

Minor implementation details around the multi-grained rewards could also use more space, but they are not load-bearing. This work is aimed at researchers building real-time video agents that must act before all evidence arrives. It is worth a serious referee to verify the benchmark construction and to test whether the gains hold under alternative timing definitions.

Referee Report

1 major / 2 minor

Summary. The paper claims that existing streaming video benchmarks follow a reactive 'see-then-answer' paradigm and fail to evaluate proactive decision-making under partial observations. To address this, it introduces StreamPro-Bench, which evaluates models on Perception Understanding, Temporal Reasoning, and Proactive Agency (the last of which measures early yet reliable decisions). It further proposes StreamPro, a two-stage framework: CB-Stream Loss during SFT to handle the severe silence/response imbalance, followed by GRPO with multi-grained (turn-level and trajectory-level) rewards to jointly optimize correctness and timing. Experiments report StreamPro scoring 41.5 on StreamPro-Bench (vs. a prior best of 10.4) while achieving 78.9 on StreamingBench-RTVU.

Significance. If the benchmark's Proactive Agency metric is shown to be independently and reproducibly defined, the work would meaningfully advance proactive streaming understanding by providing both a new evaluation axis and a training recipe that balances earliness against reliability. The empirical margin on the new benchmark and retention of performance on existing real-time benchmarks would constitute a concrete step beyond reactive perception models.

major comments (1)

[StreamPro-Bench description] StreamPro-Bench section (description of Proactive Agency metric): The paper does not specify how reference response timestamps are generated (human annotation protocol, automated event detection, or otherwise). Without an explicit, reproducible definition of ground-truth timing under partial observations, the 41.5 vs. 10.4 gap cannot be unambiguously interpreted as evidence that the CB-Stream Loss + multi-grained GRPO framework 'effectively balances response correctness and timing.' This is load-bearing because the central contribution is the joint introduction of benchmark and method.

minor comments (2)

[Experiments] The abstract and experimental section would benefit from a brief statement of the number of videos, average stream length, and annotation protocol used to construct StreamPro-Bench so readers can assess its scale and independence from the training trajectories.
[Training framework] Notation for the multi-grained reward components (turn-level vs. trajectory-level) should be introduced with explicit equations or pseudocode to clarify how they are combined during GRPO.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The point regarding the reproducibility of the Proactive Agency metric in StreamPro-Bench is well-taken and directly impacts the interpretability of our central results. We address it in detail below and will incorporate the requested clarifications in the revised version.

Referee: StreamPro-Bench section (description of Proactive Agency metric): The paper does not specify how reference response timestamps are generated (human annotation protocol, automated event detection, or otherwise). Without an explicit, reproducible definition of ground-truth timing under partial observations, the 41.5 vs. 10.4 gap cannot be unambiguously interpreted as evidence that the CB-Stream Loss + multi-grained GRPO framework 'effectively balances response correctness and timing.' This is load-bearing because the central contribution is the joint introduction of benchmark and method.

Authors: We agree that an explicit, reproducible protocol for generating reference response timestamps is necessary to support claims about the balance between earliness and reliability. The current manuscript provides only a high-level description of the Proactive Agency metric. In the revision we will add a dedicated subsection (likely 3.2.3 or equivalent) that details the human annotation protocol: (1) the exact guidelines provided to annotators for identifying the earliest timestamp at which a reliable proactive decision can be made under partial observations, (2) the number of annotators per video and the aggregation method (e.g., majority vote or consensus), (3) inter-annotator agreement statistics (Cohen’s kappa or equivalent), and (4) quality-control procedures such as pilot studies and adjudication of disagreements. These additions will make the ground-truth timing definition fully reproducible and allow readers to assess whether the reported 41.5 score genuinely reflects improved timing-aware decision making rather than benchmark artifacts.

Revision: yes
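
For reference, the agreement statistic named in item (3) can be computed as in the sketch below, which implements Cohen's kappa for two annotators. The example labels (whether a response is warranted at each candidate timestamp) are invented for illustration and are not taken from the benchmark.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)

annotator_1 = ["respond", "wait", "wait", "respond", "wait", "wait"]
annotator_2 = ["respond", "wait", "respond", "respond", "wait", "wait"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for chance, which matters here because "wait" labels dominate streaming trajectories and would inflate a plain agreement rate.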

Circularity Check

0 steps flagged

No significant circularity; empirical results on independent benchmarks

The paper introduces StreamPro-Bench and the StreamPro training framework (CB-Stream Loss + GRPO with multi-grained rewards) as separate contributions, then reports empirical scores (41.5 on StreamPro-Bench, 78.9 on StreamingBench-RTVU) against prior baselines. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described derivation. The benchmark and method are presented as addressing distinct challenges (partial-observation timing and supervision imbalance), with performance claims resting on external evaluation rather than reducing to inputs by construction. This is the common honest case of a self-contained empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard ML assumptions about data imbalance in streaming tasks and introduces new loss and reward designs without new physical entities or unproven axioms beyond domain assumptions.

axioms (1)

Domain assumption: The extreme imbalance between silence and response signals in streaming trajectories requires special handling in training. Invoked in the description of training challenges for proactive models.

[1] [1]

Videollm- online: Online video large language model for streaming video,

J. Chen, Z. Lv, S. Wu, K. Q. Lin, C. Song, D. Gao, J.-W. Liu, Z. Gao, D. Mao, and M. Z. Shou, “Videollm- online: Online video large language model for streaming video,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 18 407–18 418

work page 2024

[2] Y. Wang, X. Meng, Y. Wang, J. Liang, J. Wei, H. Zhang, and D. Zhao, “Videollm knows when to speak: Enhancing time-sensitive video comprehension with video-text duet interaction format,” arXiv preprint arXiv:2411.17991, vol. 1, no. 3, p. 5, 2024.

[3] Y. Wang, S. Liu, D. Wang, N. Xu, G. Wan, H. Zhang, and D. Zhao, “Mmduet2: Enhancing proactive interaction of video mllms with multi-turn reinforcement learning,” arXiv preprint arXiv:2512.06810, 2025.

[4] R. Qian, S. Ding, X. Dong, P. Zhang, Y. Zang, Y. Cao, D. Lin, and J. Wang, “Dispider: Enabling video llms with active real-time interaction via disentangled perception, decision, and reaction,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 24045–24055.

[5] H. Yang, F. Tang, L. Zhao, X. An, M. Hu, H. Li, X. Zhuang, Y. Lu, X. Zhang, A. Swikir et al., “Streamagent: Towards anticipatory agents for streaming video understanding,” arXiv preprint arXiv:2508.01875, 2025.

[6] X. Zeng, K. Qiu, Q. Zhang, X. Li, J. Wang, J. Li, Z. Yan, K. Tian, M. Tian, X. Zhao et al., “Streamforest: Efficient online video understanding with persistent event memory,” arXiv preprint arXiv:2509.24871, 2025.

[7] H. Wang, B. Feng, Z. Lai, M. Xu, S. Li, W. Ge, A. Dehghan, M. Cao, and P. Huang, “Streambridge: Turning your offline video large language model into a proactive streaming assistant,” arXiv preprint arXiv:2505.05467, 2025.

[8] H. Zhang, Y. Wang, Y. Tang, Y. Liu, J. Feng, and X. Jin, “Flash-vstream: Efficient real-time understanding for long video streams,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 21059–21069.

[9] R. Xu, G. Xiao, Y. Chen, L. He, K. Peng, Y. Lu, and S. Han, “Streamingvlm: Real-time understanding for infinite video streams,” arXiv preprint arXiv:2510.09608, 2025.

[10] S. Fu, Q. Yang, Y.-M. Li, Y.-X. Peng, K.-Y. Lin, X. Wei, J.-F. Hu, X. Xie, and W.-S. Zheng, “Vispeak: Visual instruction feedback in streaming videos,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 21778–21788.

[11] J. Xia, P. Chen, M. Zhang, X. Sun, and K. Zhou, “Streaming video instruction tuning,” arXiv preprint arXiv:2512.21334, 2025.

[12] K. Zhang, Z. Yang, B. Wang, S. Qian, and C. Xu, “Querystream: Advancing streaming video understanding with query-aware pruning and proactive response,” in The Fourteenth International Conference on Learning Representations, 2026.

[13] K. Zhang, Z. Yang, M. Han, H. Hao, Y. Zhuge, C. Li, J. Zhao, Z. Li, and X. Chang, “Progressive online video understanding with evidence-aligned timing and transparent decisions,” arXiv preprint arXiv:2604.18459, 2026.

[14] L. Yao, Y. Li, Y. Wei, L. Li, S. Ren, Y. Liu, K. Ouyang, L. Wang, S. Li, S. Li et al., “Timechat-online: 80% visual tokens are naturally redundant in streaming videos,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 10807–10816.

[15] J. Lin, Z. Fang, C. Chen, H. Cheng, Z. Wan, F. Luo, Z. Wang, P. Li, Y. Liu, and M. Sun, “Streamingbench: Assessing the gap for mllms to achieve streaming video understanding,” in ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2026, pp. 12147–12151.

[16] J. Niu, Y. Li, Z. Miao, C. Ge, Y. Zhou, Q. He, X. Dong, H. Duan, S. Ding, R. Qian et al., “Ovo-bench: How far is your video-llms from real-world online video understanding?” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 18902–18913.

[17] Y. Wang, X. Meng, Y. Wang, H. Zhang, and D. Zhao, “Proactivevideoqa: A comprehensive benchmark evaluating proactive interactions in video large language models,” arXiv preprint arXiv:2507.09313, 2025.

[18] Y. Wang, Y. Wang, B. Chen, T. Wu, D. Zhao, and Z. Zheng, “Omnimmi: A comprehensive multi-modal interaction benchmark in streaming video contexts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 18925–18935.

[19] X. Ding, H. Wu, Y. Yang, S. Jiang, Q. Zhang, D. Bai, Z. Chen, and T. Cao, “Streammind: Unlocking full frame rate streaming video dialogue through event-gated cognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 13448–13459.

[20] Y. Zhang, C. Shi, Y. Wang, and S. Yang, “Eyes wide open: Ego proactive video-llm for streaming video,” arXiv preprint arXiv:2510.14560, 2025.

[21] Z. Liu, L. Guo, H. Li, R. Zhen, X. He, R. Ji, X. Ren, Y. Zhang, H. Lu, and J. Liu, “Thinking in streaming video,” arXiv preprint arXiv:2603.12938, 2026.

[22] J. Qian, H. Du, G. Nan, S. Huang, J. Yu, H. Wang, J. Chen, M. Cai, M. Yang, J. Li, Z. Li, H. Wang, J. Liu, X. Jiang, and S. Leng, “Learning to respond: A large-scale benchmark and progressive learning framework for trigger-centric online video understanding,” https://openreview.net/pdf?id=gmpnSSiJt7, 2025.

[23] OpenBMB, “Minicpm-o 4.5 technical report,” https://github.com/OpenBMB/MiniCPM-o/blob/main/docs/MiniCPM_o_45_technical_report.pdf, 2026, GitHub technical report.

[24] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y. Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y. Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-vl technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2502.13923

[25] S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge et al., “Qwen3-vl technical report,” arXiv preprint arXiv:2511.21631, 2025.

[26] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie, “Class-balanced loss based on effective number of samples,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9268–9277.

[27] D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025.

[28] X. Li, Y. Wang, J. Yu, X. Zeng, Y. Zhu, H. Huang, J. Gao, K. Li, Y. He, C. Wang et al., “Videochat-flash: Hierarchical compression for long-context video modeling,” arXiv preprint arXiv:2501.00574, 2024.

[29] C. Fu, Y. Dai, Y. Luo, L. Li, S. Ren, R. Zhang, Z. Wang, C. Zhou, Y. Shen, M. Zhang et al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 24108–24118.

[30] H. Wu, D. Li, B. Chen, and J. Li, “Longvideobench: A benchmark for long-context interleaved video-language understanding,” Advances in Neural Information Processing Systems, vol. 37, pp. 28828–28857, 2024.

[31] G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu, “Hybridflow: A flexible and efficient rlhf framework,” arXiv preprint arXiv:2409.19256, 2024.

[32] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

[33] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025.

[34] X. Lu, Y. Bo, J. Chen, S. Li, X. Guo, H. Guan, F. Liu, D. Xu, P. Sun, H. Sun et al., “Aura: Always-on understanding and real-time assistance via video streams,” arXiv preprint arXiv:2604.04184, 2026.

[35] Y. Liu, Z. Ma, Z. Qi, Y. Wu, Y. Shan, and C. W. Chen, “Et bench: Towards open-ended event-level video-language understanding,” Advances in Neural Information Processing Systems, vol. 37, pp. 32076–32110, 2024.

[36] Y. Zhang, J. Wu, W. Li, B. Li, Z. Ma, Z. Liu, and C. Li, “Llava-video: Video instruction tuning with synthetic data,” arXiv preprint arXiv:2410.02713, 2024.

[37] L. Zhu, L. Wang, A. Raj, T. Gedeon, and C. Chen, “Advancing video anomaly detection: A concise review and a new dataset,” in The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.

[38] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez et al., “Chatbot arena: An open platform for evaluating llms by human preference,” arXiv preprint arXiv:2403.04132, 2024.

[39] R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. the method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952.

[40] B. Efron and R. J. Tibshirani, An introduction to the bootstrap. Chapman and Hall/CRC, 1994.

[41] G. Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” arXiv preprint arXiv:2403.05530, 2024.

[42] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.

[43] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu et al., “Llava-onevision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024.

[44] P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge et al., “Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution,” arXiv preprint arXiv:2409.12191, 2024.

[45] W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao et al., “Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency,” arXiv preprint arXiv:2508.18265, 2025.
