
arxiv: 2605.16381 · v1 · pith:HK7KQUJ2new · submitted 2026-05-11 · 💻 cs.CV · cs.AI

StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video

Pith reviewed 2026-05-20 23:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords streaming video understanding · proactive decision-making · video benchmark · temporal reasoning · reinforcement learning · supervised fine-tuning

The pith

A new benchmark and two-stage training let video models decide when to respond under partial observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that current streaming video benchmarks reduce the task to delayed perception, because they require models to wait for explicit evidence before answering. This misses the real need for early yet reliable decisions in ongoing video streams, such as live monitoring or commentary. To fix this, the authors create StreamPro-Bench, which adds a proactive agency dimension alongside perception and temporal reasoning. They then introduce a training method that first corrects the severe imbalance between silence and response signals during fine-tuning and then uses reinforcement learning with rewards at both the individual-turn and full-trajectory levels. If the approach works, models can act on incomplete information without losing accuracy on standard streaming tests.

Core claim

StreamPro-Bench evaluates models from three perspectives (Perception Understanding, Temporal Reasoning, and Proactive Agency), the last of which measures early yet reliable decisions under partial observations. The StreamPro framework applies CB-Stream Loss in supervised fine-tuning, followed by Group Relative Policy Optimization with multi-grained rewards, to jointly optimize correctness and timing. It reaches 41.5 on the new benchmark versus a prior best of 10.4, while scoring 78.9 on StreamingBench-RTVU.

What carries the argument

The two-stage training framework that applies CB-Stream Loss to address supervision imbalance in supervised fine-tuning then Group Relative Policy Optimization with turn-level and trajectory-level rewards.
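The summary does not give the form of CB-Stream Loss. As a rough illustration of the kind of imbalance correction involved, the sketch below uses the class-balanced weighting of Cui et al. (2019), which scales each class by the inverse of its effective number of samples. This is a swapped-in, well-known technique and not the authors' loss; the function names, the `beta` value, and the turn counts are all assumptions made for the example.

```python
import math


def class_balanced_weights(counts, beta=0.999):
    """Per-class weights from the effective number of samples (Cui et al., 2019).

    The effective number for a class with n samples is (1 - beta**n) / (1 - beta);
    each class is weighted by its inverse. `beta` is a hypothetical setting.
    """
    raw = {c: (1.0 - beta) / (1.0 - beta ** n) for c, n in counts.items()}
    # Normalize so that the weights sum to the number of classes.
    scale = len(raw) / sum(raw.values())
    return {c: w * scale for c, w in raw.items()}


def weighted_nll(turn_labels, turn_log_probs, weights):
    """Weighted mean negative log-likelihood over the turns of one trajectory."""
    total = sum(weights[y] * -lp for y, lp in zip(turn_labels, turn_log_probs))
    norm = sum(weights[y] for y in turn_labels)
    return total / norm


# Hypothetical trajectory statistics: 950 silent turns, 50 response turns.
weights = class_balanced_weights({"silence": 950, "response": 50})
loss = weighted_nll(
    ["silence", "response"],
    [math.log(0.5), math.log(0.5)],
    weights,
)
```

With these counts the rare response turns receive a larger weight than the silent ones, so the loss is no longer dominated by the majority "stay silent" signal.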

Load-bearing premise

The new benchmark accurately captures a model's ability to make early yet reliable decisions under partial observations, and the training framework effectively balances response correctness and timing without introducing biases from the imbalance in silence and response signals.

What would settle it

A direct test in which models must output responses at fixed early fractions of video clips containing only partial evidence, checking whether accuracy remains high on StreamPro-Bench but drops sharply on real-time streaming tasks.
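A test of this kind could be sketched as follows, assuming each sample is a dict with `frames`, `query`, and `answer` fields and that the model is exposed as a `predict(frames, query)` callable. Both the data layout and the function names are hypothetical and are not taken from the paper.

```python
import math


def truncate_stream(frames, fraction):
    """Return the prefix of `frames` covering the first `fraction` of the clip."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, math.ceil(len(frames) * fraction))
    return frames[:keep]


def accuracy_at_fractions(samples, predict, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Accuracy when the model must answer after seeing only a prefix of each clip."""
    results = {}
    for fraction in fractions:
        correct = sum(
            predict(truncate_stream(s["frames"], fraction), s["query"]) == s["answer"]
            for s in samples
        )
        results[fraction] = correct / len(samples)
    return results


# Toy stand-in for a model: it answers "yes" only once frame 5 has been observed.
toy_samples = [{"frames": list(range(8)), "query": "q", "answer": "yes"}]
toy_predict = lambda frames, query: "yes" if 5 in frames else "no"
curve = accuracy_at_fractions(toy_samples, toy_predict)
```

Plotting such a curve for the same model on StreamPro-Bench and on a real-time streaming benchmark would show whether early-decision accuracy transfers between the two.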

Figures

Figures reproduced from arXiv: 2605.16381 by Ao Li, Boshen Xu, Jian Luan, Jianzhong Ju, Jiaze Li, Linli Yao, Pei Fu, Qin Jin, Zihan Xiao, Zihao Yue.

Figure 1: Overview of streaming video paradigms and our contributions.
Figure 2: Task case illustration of StreamPro-Bench. It contains 7 tasks categorized into three major …
Figure 3: StreamPro-Bench Statistics: the number of tasks and the average video length. To generate high-quality data across the 7 tasks, we design a pipeline based on a two-agent verification loop, followed by thorough human refinement for all samples. Given the inherent complexity of the Risk Forecasting task, we rely entirely on human annotation and verification to guarantee data quality. Further details are provid…
Figure 4: Overview of the StreamPro training framework.
Figure 5: Bootstrap rank stability heatmap.
Figure 6: Estimated Bradley-Terry scores and 95% confidence intervals via bootstrap resampling.
Figure 7: Dataset Statistics. Left: StreamPro-SFT-63K; Right: StreamPro-RL-3K.
Figure 8: Benchmark samples of Temporal Reasoning tasks.
Figure 9: Benchmark samples of Perceptual Understanding tasks.
Figure 10: Benchmark samples of Proactive Agency tasks.
Figure 11: Comparison of different models on object understanding tasks.
Figure 12: Comparison of different models on temporal grounding tasks.
Figure 13: Comparison of different models on risk forecasting tasks.
Figure 14: Comparison of different models on risk forecasting tasks.
read the original abstract

Proactive streaming video understanding requires models to continuously process video streams and decide when to respond, rather than merely what to respond. This naturally introduces a decision-making problem under partial observations, where models must balance early prediction against sufficient evidence. However, existing benchmarks largely follow a "see-then-answer" paradigm, where responses are triggered only after explicit evidence appears, effectively reducing proactive reasoning to delayed perception. As a result, they fail to evaluate a model's ability to make timely and reliable decisions under incomplete observations. Moreover, training proactive models is inherently challenging due to the extreme imbalance between silence and response signals in streaming trajectories, as well as the need to jointly optimize response correctness and timing. To address these challenges, we introduce StreamPro-Bench, a new benchmark that evaluates streaming models from three complementary perspectives: Perception Understanding, Temporal Reasoning, and Proactive Agency, where the last measures a model's ability to make early yet reliable decisions under partial observations. We further propose StreamPro, a two-stage training framework for proactive learning. First, we introduce CB-Stream Loss to mitigate the severe supervision imbalance during supervised fine-tuning (SFT). Then, we apply Group Relative Policy Optimization (GRPO) with a multi-grained reward design that involves both turn-level and trajectory-level rewards. Experiments show that StreamPro significantly improves proactive performance. On StreamPro-Bench, it achieves 41.5, substantially outperforming the previous best (10.4), while also maintaining strong performance on real-time streaming benchmarks, achieving 78.9 on StreamingBench-RTVU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that existing streaming video benchmarks follow a reactive 'see-then-answer' paradigm and fail to evaluate proactive decision-making under partial observations. To address this, it introduces StreamPro-Bench, which evaluates models on Perception Understanding, Temporal Reasoning, and Proactive Agency (the last measuring early yet reliable decisions). It further proposes StreamPro, a two-stage framework: CB-Stream Loss during SFT to handle the severe silence/response imbalance, followed by GRPO with multi-grained (turn-level and trajectory-level) rewards to jointly optimize correctness and timing. Experiments report StreamPro scoring 41.5 on StreamPro-Bench (vs. a prior best of 10.4) while achieving 78.9 on StreamingBench-RTVU.

Significance. If the benchmark's Proactive Agency metric is shown to be independently and reproducibly defined, the work would meaningfully advance proactive streaming understanding by providing both a new evaluation axis and a training recipe that balances earliness against reliability. The empirical margin on the new benchmark and retention of performance on existing real-time benchmarks would constitute a concrete step beyond reactive perception models.

major comments (1)
  1. [StreamPro-Bench description] StreamPro-Bench section (description of Proactive Agency metric): The paper does not specify how reference response timestamps are generated (human annotation protocol, automated event detection, or otherwise). Without an explicit, reproducible definition of ground-truth timing under partial observations, the 41.5 vs. 10.4 gap cannot be unambiguously interpreted as evidence that the CB-Stream Loss + multi-grained GRPO framework 'effectively balances response correctness and timing.' This is load-bearing because the central contribution is the joint introduction of benchmark and method.
minor comments (2)
  1. [Experiments] The abstract and experimental section would benefit from a brief statement of the number of videos, average stream length, and annotation protocol used to construct StreamPro-Bench so readers can assess its scale and independence from the training trajectories.
  2. [Training framework] Notation for the multi-grained reward components (turn-level vs. trajectory-level) should be introduced with explicit equations or pseudocode to clarify how they are combined during GRPO.
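
To make the second minor comment concrete, here is one way the two reward granularities might be combined before the group-relative normalization used in GRPO. The linear blend and its `alpha` coefficient are assumptions for illustration only, since the material summarized here does not state how the paper combines them. The normalization step (subtract the group mean, divide by the group standard deviation) is the standard GRPO advantage; the population standard deviation is used here, and implementations differ on that detail.

```python
import statistics


def trajectory_reward(turn_rewards, traj_reward, alpha=0.5):
    """Blend the mean turn-level reward with a trajectory-level reward.

    The linear blend and `alpha` are hypothetical choices for this sketch.
    """
    mean_turn = sum(turn_rewards) / len(turn_rewards)
    return alpha * mean_turn + (1.0 - alpha) * traj_reward


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts sampled for the same input."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of three rollouts: (turn-level rewards, trajectory-level reward).
group = [
    ([1.0, 1.0, 0.0], 1.0),
    ([0.0, 1.0, 0.0], 0.0),
    ([1.0, 1.0, 1.0], 1.0),
]
rewards = [trajectory_reward(turns, traj) for turns, traj in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the third rollout gets the largest advantage and the second a negative one, which is the ordering the policy update would reinforce.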

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The point regarding the reproducibility of the Proactive Agency metric in StreamPro-Bench is well-taken and directly impacts the interpretability of our central results. We address it in detail below and will incorporate the requested clarifications in the revised version.

read point-by-point responses
  1. Referee: StreamPro-Bench section (description of Proactive Agency metric): The paper does not specify how reference response timestamps are generated (human annotation protocol, automated event detection, or otherwise). Without an explicit, reproducible definition of ground-truth timing under partial observations, the 41.5 vs. 10.4 gap cannot be unambiguously interpreted as evidence that the CB-Stream Loss + multi-grained GRPO framework 'effectively balances response correctness and timing.' This is load-bearing because the central contribution is the joint introduction of benchmark and method.

    Authors: We agree that an explicit, reproducible protocol for generating reference response timestamps is necessary to support claims about the balance between earliness and reliability. The current manuscript provides only a high-level description of the Proactive Agency metric. In the revision we will add a dedicated subsection (likely 3.2.3 or equivalent) that details the human annotation protocol: (1) the exact guidelines provided to annotators for identifying the earliest timestamp at which a reliable proactive decision can be made under partial observations, (2) the number of annotators per video and the aggregation method (e.g., majority vote or consensus), (3) inter-annotator agreement statistics (Cohen’s kappa or equivalent), and (4) quality-control procedures such as pilot studies and adjudication of disagreements. These additions will make the ground-truth timing definition fully reproducible and allow readers to assess whether the reported 41.5 score genuinely reflects improved timing-aware decision making rather than benchmark artifacts. revision: yes
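
For reference, the inter-annotator agreement statistic named in item (3) can be computed as in the sketch below. It assumes the timing annotations have first been discretized into per-turn categorical labels (here `wait` and `respond`), which is an assumption of this example and not a protocol described in the rebuttal. Cohen's kappa applies to two annotators.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical per-turn labels from two annotators for one video.
annotator_1 = ["wait", "wait", "respond", "respond"]
annotator_2 = ["wait", "respond", "respond", "respond"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four turns (observed agreement 0.75) against a chance agreement of 0.5, giving a kappa of 0.5.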

Circularity Check

0 steps flagged

No significant circularity; empirical results on independent benchmarks

full rationale

The paper introduces StreamPro-Bench and the StreamPro training framework (CB-Stream Loss + GRPO with multi-grained rewards) as separate contributions, then reports empirical scores (41.5 on StreamPro-Bench, 78.9 on StreamingBench-RTVU) against prior baselines. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described derivation. The benchmark and method are presented as addressing distinct challenges (partial-observation timing and supervision imbalance), with performance claims resting on external evaluation rather than reducing to inputs by construction. This is the common honest case of a self-contained empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on standard ML assumptions about data imbalance in streaming tasks and introduces new loss and reward designs without new physical entities or unproven axioms beyond domain assumptions.

axioms (1)
  • domain assumption The extreme imbalance between silence and response signals in streaming trajectories requires special handling in training.
    Invoked in the description of training challenges for proactive models.

pith-pipeline@v0.9.0 · 5837 in / 1247 out tokens · 44738 ms · 2026-05-20T23:29:15.741933+00:00 · methodology

