OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Fengyun Rao; Jie Yang; Jing Lyu; Ruixiang Zhao; Tianyi Wang; Xirong Li; Zijie Xin

arxiv: 2605.18577 · v1 · pith:6JM7MGYCnew · submitted 2026-05-18 · 💻 cs.CV

Ruixiang Zhao, Jie Yang, Zijie Xin, Tianyi Wang, Fengyun Rao, Jing Lyu, Xirong Li

Pith reviewed 2026-05-20 11:03 UTC · model grok-4.3

keywords: OmniPro, benchmark, omni-proactive, streaming video, multimodal perception, proactive responding, video understanding, dual-mode evaluation


The pith

OmniPro is the first benchmark to jointly test omni-modal perception, proactive response timing, and diverse video understanding in streaming audio-visual inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces OmniPro as a benchmark specifically built to measure how well omni-modal models can decide both when to speak and what to say while processing continuous video streams. Existing tests fall short because they focus mostly on visuals, use fixed or polling-based queries instead of true autonomous timing, and cover too narrow a set of tasks. The new benchmark supplies 2,700 human-checked samples across nine sub-tasks at three cognitive levels and six core video capabilities, with modality-isolation labels and heavy use of audio in 84 percent of cases. A dual-mode protocol lets evaluators separate content understanding from full proactive behavior: Probe mode queries models before and after known trigger points, while Online mode requires models to initiate responses on their own in a streaming setting. Tests on eleven models show audio helps but is used unevenly, performance falls sharply over longer sequences, and non-speech audio remains the hardest element to handle.

Core claim

OmniPro is the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals, each sample carries modality-isolation labels, and a dual-mode protocol separates Probe-mode content checks from Online-mode autonomous timing in streaming input.

What carries the argument

The dual-mode evaluation protocol consisting of Probe mode (queries before and after ground-truth triggers) and Online mode (autonomous decision of response timing in continuous streaming input).
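A minimal sketch of how a single Probe-mode check around one trigger might be scored. The summary above only states that the model is queried before and after each ground-truth trigger; the probe offset, the `answer_fn` callback, and the pass rule (withhold before the trigger, answer correctly after it) are assumptions made for illustration, not the paper's procedure.

```python
from typing import Callable, Optional

# Hypothetical model interface: given the time (in seconds) up to which the
# stream has been shown, return an answer string, or None if the model declines.
AnswerFn = Callable[[float], Optional[str]]


def probe_sample(answer_fn: AnswerFn, trigger_s: float, expected: str,
                 offset_s: float = 2.0) -> dict:
    """Query once before and once after the trigger (assumed scoring rule)."""
    before = answer_fn(max(0.0, trigger_s - offset_s))
    after = answer_fn(trigger_s + offset_s)
    withheld = before is None
    # Assumption: exact string match stands in for whatever judge the paper uses.
    correct = (after is not None
               and after.strip().lower() == expected.strip().lower())
    return {"withheld_before": withheld,
            "correct_after": correct,
            "passed": withheld and correct}


def oracle(t: float) -> Optional[str]:
    # Toy model that only "hears" the doorbell from t = 10 s onward.
    return "The doorbell rang." if t >= 10.0 else None


result = probe_sample(oracle, trigger_s=10.0, expected="the doorbell rang.")
```

Under this assumed rule, a model that always answers would fail the "before" probe even when its content is right, which is one way to separate content understanding from timing.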

If this is right

Audio signals produce consistent performance gains across tasks but models vary widely in how effectively they exploit them.
Model accuracy drops substantially as input length grows, revealing limited robustness over extended time horizons.
Non-speech audio remains the weakest perceptual dimension for current models.
Modality-isolation labels enable fine-grained diagnosis of which input channels drive success or failure on each sample.
The nine sub-tasks and three cognitive levels together allow differentiation of models across basic perception through higher-level reasoning.
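As an illustration of the per-label diagnosis mentioned above, the sketch below groups per-sample outcomes by a modality label. The field names `audio_dependency` and `trigger_type` mirror those in the annotation prompt excerpts on this page; the `correct` field, the record layout, and the toy data are hypothetical.

```python
from collections import defaultdict


def accuracy_by_label(samples, key):
    """Return {label: accuracy} for the given label field."""
    totals = defaultdict(lambda: [0, 0])  # label -> [num_correct, num_total]
    for s in samples:
        bucket = totals[s[key]]
        bucket[0] += int(s["correct"])
        bucket[1] += 1
    return {label: c / n for label, (c, n) in totals.items()}


# Toy records; real results would come from an evaluation run.
samples = [
    {"audio_dependency": "required", "trigger_type": "sound", "correct": False},
    {"audio_dependency": "required", "trigger_type": "speech", "correct": True},
    {"audio_dependency": "none", "trigger_type": "visual", "correct": True},
    {"audio_dependency": "helpful", "trigger_type": "visual+sound", "correct": True},
]
by_audio = accuracy_by_label(samples, "audio_dependency")
by_trigger = accuracy_by_label(samples, "trigger_type")
```

The same helper can be pointed at any other label column, such as a sub-task or cognitive-level field, if the released annotations expose one.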

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed long-horizon degradation points to a need for training regimes that explicitly reward sustained coherence across minutes of streaming input.
The benchmark's emphasis on autonomous timing could encourage architectures that maintain an internal state for deciding response initiation rather than relying on external triggers.
Poor non-speech audio results suggest that pairing general multimodal models with dedicated sound-event detectors might close the largest remaining gap.
The 2,700-sample scale and human verification make the benchmark suitable for tracking progress as new omni-modal models are released.

Load-bearing premise

The dual-mode evaluation protocol accurately measures true proactive ability in streaming input without introducing biases from the specific querying or annotation process.

What would settle it

A set of models that achieve high scores in Online mode yet still require external prompts or fixed timestamps to match human-verified response times would show the protocol does not isolate genuine proactive behavior.

Figures

Figures reproduced from arXiv: 2605.18577 by Fengyun Rao, Jie Yang, Jing Lyu, Ruixiang Zhao, Tianyi Wang, Xirong Li, Zijie Xin.

**Figure 1.** Overview of OMNIPRO. The benchmark comprises 9 sub-tasks organized into three cognitive levels, collectively covering 6 basic video understanding capabilities. Each panel shows a representative sample with its video frames, time-aligned triggers (marked by red triangles), user instruction (Q), and expected proactive responses (A). Audio-dependent triggers are prevalent across tasks, requiring models to per…

**Figure 2.** Dataset statistics of OMNIPRO. Text captured alongside the caption: "[…] or discarded those of unacceptable quality. In the second round, annotators swapped sub-tasks for cross-validation, ensuring consistent standards across tasks. After both rounds, approximately 30% of samples were retained, yielding 2,700 samples across 1,262 videos." (followed by the section heading "3.1.5 Dataset Statistics").

**Figure 3.** Performance grouped by where the GT trigger is located along the video timeline.

**Figure 4.** Performance breakdown by the modality signals required to perceive the trigger event.

**Figure 5.** Tolerance window ablation (Online mode). Performance of online-mode models under varying temporal matching tolerances.
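To make the idea of a temporal matching tolerance concrete, here is a rough sketch of how Online-mode response times could be matched to ground-truth triggers within a ±tolerance window. The greedy one-to-one matching, the precision/recall/F1 summary, and all timestamps and tolerance values are assumptions for illustration; the paper's actual matching rule and metric are not given on this page.

```python
def match_triggers(pred_times, gt_times, tol_s):
    """Greedy one-to-one matching of predicted times to GT triggers (seconds)."""
    unmatched = sorted(pred_times)
    hits = 0
    for gt in sorted(gt_times):
        # Take the closest still-unused prediction inside the tolerance window.
        candidates = [p for p in unmatched if abs(p - gt) <= tol_s]
        if candidates:
            best = min(candidates, key=lambda p: abs(p - gt))
            unmatched.remove(best)
            hits += 1
    precision = hits / len(pred_times) if pred_times else 0.0
    recall = hits / len(gt_times) if gt_times else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Toy timestamps; a sweep over tolerances mimics an ablation of this kind.
gt = [12.0, 47.0, 90.0]
pred = [13.5, 52.0, 89.0, 120.0]
sweep = {tol: match_triggers(pred, gt, tol) for tol in (1, 2, 3, 5, 10)}
```

With these toy numbers recall rises as the window widens, while the unmatched response at 120 s keeps precision below 1 at every tolerance.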

Original abstract

Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input. Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

OmniPro gives the field a first dedicated benchmark for omni-proactive streaming video understanding with useful task coverage and modality labels, but the dual-mode protocol's dependence on ground-truth triggers needs tighter validation to support the proactive claims.

The main point is that this paper puts forward the first benchmark aimed at omni-modal models that must both understand streaming video and decide on their own when to respond. It includes 2,700 human-verified samples across nine sub-tasks and three cognitive levels, with 84 percent of them requiring audio and each carrying modality-isolation labels. The dual-mode setup separates content checks in Probe mode from autonomous timing in Online mode, and the runs on eleven models surface consistent patterns: audio helps but unevenly, performance falls off over longer streams, and non-speech audio stays the weakest area. Those are concrete observations that prior visual-only or fixed-protocol benchmarks did not organize this way.

Referee Report

3 major / 2 minor

Summary. The paper introduces OmniPro, the first benchmark for omni-proactive streaming video understanding in omni-modal LLMs. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals, and each sample carries modality-isolation labels. A dual-mode evaluation protocol is introduced: Probe mode queries models before and after ground-truth triggers to assess content understanding, while Online mode requires autonomous timing decisions in streaming input. Evaluation of 11 models yields three findings: audio provides gains with variable utilization, performance degrades significantly over time, and non-speech audio perception is weakest.

Significance. If the trigger annotations and protocol prove robust, this benchmark would meaningfully advance evaluation of emerging omni-modal proactive capabilities by addressing gaps in visual-only, polling-based prior work. Strengths include the scale of human-verified samples, explicit modality-isolation labels for fine-grained analysis, and empirical identification of model limitations such as long-horizon robustness. These elements could help standardize assessment and guide development in streaming video understanding.

major comments (3)

[Dual-mode Evaluation Protocol] The central claim that Online mode measures true proactive ability rests on the assumption that ground-truth trigger annotations are objective and reproducible markers. The manuscript provides no inter-annotator agreement metrics, details on whether triggers were annotated from full videos or streaming cues only, or validation against streaming simulation artifacts, which directly risks the protocol measuring annotation biases instead of model proactivity.
[Results and Analysis] The key finding that performance degrades significantly over time lacks any statistical significance tests, p-values, confidence intervals, or effect-size reporting on the degradation trends across the 11 models, weakening the load-bearing claim of limited long-horizon robustness.
[Dataset Construction] Sample selection criteria for the 2,700 samples are not described, and no quantitative validation (e.g., agreement scores or error analysis) is given for the modality-isolation labels or the 84% audio-requirement statistic, limiting interpretability of the fine-grained multimodal findings.
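One common way to report the kind of agreement score the referee asks for is Cohen's kappa between two annotators. The sketch below computes it from scratch on hypothetical `audio_dependency` labels; the choice of kappa, the two-annotator setup, and the label lists are assumptions, and nothing here reflects how the authors actually validated their annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators on six samples.
ann_a = ["required", "required", "helpful", "none", "required", "helpful"]
ann_b = ["required", "helpful", "helpful", "none", "required", "required"]
kappa = cohens_kappa(ann_a, ann_b)
```

For trigger timestamps, which are continuous rather than categorical, a tolerance-based agreement rate or an interval-overlap measure would be a more natural fit than kappa.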

minor comments (2)

[Abstract] The abstract and introduction would benefit from a brief definition or example of the three cognitive levels to help readers quickly grasp the benchmark's scope.
[Evaluation Protocol] Figure captions or the evaluation protocol description could more explicitly note how streaming input is simulated in Online mode to preempt questions about implementation biases.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that will improve the transparency and rigor of the manuscript.

Referee: Dual-mode Evaluation Protocol section: The central claim that Online mode measures true proactive ability rests on the assumption that ground-truth trigger annotations are objective and reproducible markers. The manuscript provides no inter-annotator agreement metrics, details on whether triggers were annotated from full videos or streaming cues only, or validation against streaming simulation artifacts, which directly risks the protocol measuring annotation biases instead of model proactivity.

Authors: We appreciate this critical observation on the foundation of our dual-mode protocol. We will revise the manuscript to include inter-annotator agreement metrics for the trigger annotations, provide explicit details on the annotation process (including whether full videos or streaming cues were used), and add any available validation against streaming simulation artifacts. These additions will strengthen the claim that Online mode evaluates genuine proactivity.

revision: yes

Referee: Results and Analysis section: The key finding that performance degrades significantly over time lacks any statistical significance tests, p-values, confidence intervals, or effect-size reporting on the degradation trends across the 11 models, weakening the load-bearing claim of limited long-horizon robustness.

Authors: We agree that the degradation finding requires stronger statistical support. In the revised manuscript, we will add appropriate statistical significance tests, p-values, confidence intervals, and effect sizes for the performance trends over time across the evaluated models.

revision: yes

Referee: Dataset Construction section: Sample selection criteria for the 2,700 samples are not described, and no quantitative validation (e.g., agreement scores or error analysis) is given for the modality-isolation labels or the 84% audio-requirement statistic, limiting interpretability of the fine-grained multimodal findings.

Authors: We thank the referee for pointing out this lack of detail. We will expand the Dataset Construction section to describe the sample selection criteria and include quantitative validation such as agreement scores and error analysis for the modality-isolation labels as well as the 84% audio-requirement statistic.

revision: yes
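As one possible form of the statistical support discussed in the second exchange, the sketch below computes a percentile-bootstrap confidence interval for the accuracy gap between samples with early and late triggers. The early/late split, the 0/1 outcome encoding, the toy data, and the bootstrap itself are assumptions for illustration; the authors have not said which tests they will use.

```python
import random


def bootstrap_diff_ci(early, late, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(early) - mean(late) on 0/1 outcomes."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping group sizes fixed.
        early_rs = [rng.choice(early) for _ in early]
        late_rs = [rng.choice(late) for _ in late]
        diffs.append(sum(early_rs) / len(early_rs) - sum(late_rs) / len(late_rs))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy outcomes: 70% correct on early triggers, 40% on late triggers.
early = [1] * 70 + [0] * 30
late = [1] * 40 + [0] * 60
ci_low, ci_high = bootstrap_diff_ci(early, late)
```

An interval that excludes zero would support a real early-versus-late gap for that model; repeating this per model gives an effect-size view across all eleven.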

Circularity Check

0 steps flagged

No circularity: benchmark definition and empirical evaluation are self-contained

The paper presents a new benchmark (OmniPro) with 2,700 samples, 9 sub-tasks, and a dual-mode evaluation protocol (Probe and Online), and reports results on 11 models. No mathematical derivations, equations, parameter fitting, or predictions that reduce to inputs by construction appear in the provided text or abstract. The dual-mode protocol is explicitly defined as a novel construction for assessing proactive ability rather than derived from prior fitted quantities or self-citations. Central claims rest on human-verified annotations and empirical observations (e.g., audio gains, performance degradation), which are externally falsifiable via the benchmark itself and do not rely on load-bearing self-citations or ansatzes smuggled from prior author work. This is the expected outcome for a benchmark paper with no derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution rests on the new definitions of proactive evaluation and human-verified sample curation rather than prior fitted parameters or invented physical entities.

axioms (1)

domain assumption: Human verification of samples ensures quality and reliability for model differentiation.
The benchmark construction explicitly relies on human-verified samples as the foundation for trustworthy evaluation.

pith-pipeline@v0.9.0 · 5796 in / 1255 out tokens · 41099 ms · 2026-05-20T11:03:32.604261+00:00 · methodology

Prompt excerpts captured from the paper's appendix

- caption: A detailed, information-dense paragraph integrating visual, audio, and speech into one coherent description. Include: Who (appearance, actions), What (objects, text), Action (specific verbs, direction), Change (differences from previous segment), Audio-visual correlation.
- visual: Exhaustive visual details, including scene, lighting, colors, objects, people, camera work, and on-screen text verbatim.
- audio: Precise sound description, covering music (genre, tempo, instruments), sound effects, ambient sounds, and voice quality. Note onset and cessation of sounds.
- speech: Detailed summary of what is said: key claims, names, numbers, facts. If no speech, write "None".

Return a JSON array: [ { "start": "MM:SS", "end": "MM:SS", "caption": "...", "visual": "...", "audio": "...", "speech": "..." } ]

Rules:
- Segments must cover the entire video from 00:00 to {duration_mmss} with no gaps or overlaps.
- Timestamps in MM:...
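The rule that segments must cover the entire video with no gaps or overlaps can be checked mechanically. The following is a minimal sketch, assuming segments arrive in chronological order as dictionaries with `start` and `end` in MM:SS form; the helper names and error messages are hypothetical and not part of the paper's tooling.

```python
def mmss_to_seconds(ts: str) -> int:
    """Parse an MM:SS timestamp into whole seconds."""
    minutes, seconds = ts.split(":")
    if len(seconds) != 2 or not (minutes.isdigit() and seconds.isdigit()):
        raise ValueError(f"not MM:SS: {ts!r}")
    if int(seconds) >= 60:
        raise ValueError(f"seconds out of range: {ts!r}")
    return int(minutes) * 60 + int(seconds)


def coverage_errors(segments, duration_s):
    """Return a list of problems; an empty list means full, gap-free coverage."""
    errors = []
    cursor = 0  # end time of the coverage seen so far, in seconds
    for i, seg in enumerate(segments):
        start = mmss_to_seconds(seg["start"])
        end = mmss_to_seconds(seg["end"])
        if start > cursor:
            errors.append(f"gap before segment {i}: {cursor}s to {start}s")
        elif start < cursor:
            errors.append(f"overlap at segment {i}: starts at {start}s, "
                          f"previous coverage ended at {cursor}s")
        if end <= start:
            errors.append(f"segment {i} has non-positive length")
        cursor = max(cursor, end)
    if cursor != duration_s:
        errors.append(f"segments end at {cursor}s, video lasts {duration_s}s")
    return errors


good = [{"start": "00:00", "end": "00:30"}, {"start": "00:30", "end": "01:15"}]
bad = [{"start": "00:00", "end": "00:30"}, {"start": "00:40", "end": "01:15"}]
good_errors = coverage_errors(good, 75)
bad_errors = coverage_errors(bad, 75)
```

Returning a list of messages rather than raising on the first problem makes it easier to log every defect in a generated caption before deciding whether to regenerate it.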
Timestamped dense caption (supplementary reference). Duration: {duration_mmss} ({duration_sec:.0f}s).

## Steps

- Choose an event (AUDIO-FIRST priority):
  - Audio-required (best): ONLY detectable by listening (doorbell, whistle, spoken phrase, alarm, glass break)
  - Audio-helpful: visible but audio confirms
  - Visual-only (last resort): no meaningful audio
- Write the question: One natural standing instruction at 00:00. Must sound like a real person talking to a smart assistant. No spoilers, no timestamps, everyday language.
- Write response(s), one per event occurrence:
  - State what happened, briefly and naturally.
  - Include accurate trigger_time (MM:SS).
  - Conversational tone, not robotic.
- Classify each response:
  - trigger_type: "visual"|"sound"|"speech"|combined (e.g., "visual+sound", "visual+speech")
  - audio_dependency: "required"|"helpful"|"none"
  - trigger_type_reason: brief explanation

## Output (single JSON object, no markdown)

Fields: status, question, question_time ("00:00"), audio_dependency, responses[] with: trigger_time, response...
- Find a trigger-target pair (AUDIO-FIRST):
  - TRIGGER: instantaneous, real-time confirmable (NO "finishes/ends/completes"), unambiguous (maps to precise frame), naturally paired with target. OK: whistle->ball, "Maria" called->Maria
  - TARGET: fits in ONE grid cell (highest priority). Never: full person, large vehicle, close-up face. Preferred: small held objects...
- Write the question: One natural instruction at 00:00. Specify BOTH trigger and target. Ask for position "in the frame"/"on screen".
- Write response(s), default exactly ONE (max 4):
  - Describe trigger event. State target location.
  - position: one of 9 grid cells ("top-left"| "top-center"|...|"bottom-right")
  - trigger_time (MM:SS). Under 20 words.
- Classify: trigger_type, audio_dependency.

## Output (single JSON object, no markdown)

Fields: status, question, question_time, audio_dependency, responses[] with: trigger_time, response, position, trigger_type, trigger_type_reason, event_description. If no pair: {"status":"skip","reason":"..."}

## Rules

- Timestamps MM:SS in [00:00,{duration_mmss}].
- Pos...
- Choose a SPECIFIC PHYSICAL DIMENSION:
  - Audio scan (MANDATORY) first:
    - Speakers taking turns? -> "who is speaking"
    - Music starts/stops? -> "whether music is playing"
    - Alternating sound sources?
    - If any audio dimension works, use it.
  - Visual scan (only if no audio): State must be: specific (ONE property), discrete, about main subject, changes 2+ times, obje...
- Write the question: One natural monitoring instruction at 00:00. UNAMBIGUOUS. No spoilers. No state value lists.
- Write responses, ONLY at transitions (2-5):
  - Name previous AND new state (from X -> to Y).
  - trigger_time (MM:SS), after 00:00. Under 15 words.
  - Do NOT report initial state. Chronological.
- Classify: trigger_type, audio_dependency. Include audio_scan field.

## Output (single JSON object, no markdown)

Fields: status, audio_scan, question, question_time, audio_dependency, responses[] with: trigger_time, response, trigger_type, trigger_type_reason, event_description. If no suitable state: {"status":"skip","reason":"..."}

## Rules

- Timestamps M...
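The "responses only at transitions" rule above can be illustrated with a short sketch that turns a chronological list of observed states into transition responses, skipping the initial state. The `(time_s, state)` input format, the response wording, and the toy timeline are assumptions for illustration only.

```python
def transitions(timeline):
    """timeline: chronological list of (time_s, state) observations."""
    events = []
    prev = None
    for t, state in timeline:
        # Emit only when the state changes; the first observation is skipped.
        if prev is not None and state != prev:
            events.append({
                "trigger_time": f"{t // 60:02d}:{t % 60:02d}",
                "response": f"Changed from {prev} to {state}.",
            })
        prev = state
    return events


# Toy timeline for a "who is speaking" dimension.
timeline = [(0, "host speaking"), (5, "host speaking"), (12, "guest speaking"),
            (30, "guest speaking"), (41, "host speaking")]
events = transitions(timeline)
within_limits = 2 <= len(events) <= 5  # the prompt asks for 2-5 transitions
```

Each emitted response names both the previous and the new state, and every `trigger_time` falls after 00:00 because the initial observation never produces an event.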
- Find trigger + counting target (LISTEN FIRST):
  - Good audio trigger->target pairs (naturally connected):
    - Whistle blows -> count players on field
    - Applause starts -> count performers on stage
    - "everyone ready?" -> count people in room
    - Timer buzzes -> count dishes on counter
  - Bad (artificially forced):
    - "Hey" -> count people on sofa (no connection)
  - Visu...
- Write the question: One natural counting instruction at 00:00. Specifies BOTH trigger and counting target. Natural language. No expected count revealed.
- Write ONE response at trigger moment:
  - Note trigger occurred. State exact count.
  - count field with integer (for evaluation).
  - trigger_time (MM:SS), after 00:00. Under 15 words.
- Classify: trigger_type, audio_dependency. Include audio_scan field.

## Output (single JSON object, no markdown)

Fields: status, audio_scan, question, question_time, audio_dependency, responses[] (exactly one) with: trigger_time, response, count, trigger_type, trigger_type_reason, event_description. If no pair: {"status":"skip","reason":"..."}

## Rules

...
[61]

Duration: {duration_mmss} ({duration_sec:.0f}s)

Timestamped dense caption (reference). Duration: {duration_mmss} ({duration_sec:.0f}s). ## Steps

work page
[62]

Could a detector+classifier handle this?

Find a natural condition (AUDIO-FIRST): Must satisfy ALL: A. Realistic — a real person would want this alert. B. Requires semantic understanding (NOT perception): Test: "Could a detector+classifier handle this?" BAD: "when audience cheers" (sound classification) GOOD: "when speaker provides a statistic as evidence" C. Unambiguous (9/10 people flag same mo...

work page
[63]

Describes condition clearly

Write the question: One natural monitoring instruction at 00:00. Describes condition clearly. No spoilers

work page
[64]

- Under 25 words

Write responses — one per occurrence: - State what happened AND why it satisfies condition. - Under 25 words. trigger_time (MM:SS), after 00:00. - Speech: timestamp = when sentence ends

work page
[65]

status":

Classify: trigger_type, audio_dependency. ## Output (single JSON object, no markdown) Fields: status, question, question_time, audio_dependency, responses[] with: trigger_time, response, trigger_type, trigger_type_reason, event_description. If no condition: {"status":"skip","reason":"..."} ## Rules - Timestamps MM:SS in [00:00,{duration_mmss}]. 17 - At le...

work page
[67]

Duration: {duration_mmss} ({duration_sec:.0f}s)

Timestamped dense caption (reference). Duration: {duration_mmss} ({duration_sec:.0f}s). Preferred Category: {preferred_category} ## Event Categories A — Discrete non-speech sounds: impact, signals, instrument hits, animal/body sounds. B — Speech acts: questions, instructions, jokes, laughter bursts. Each = one complete act. C — Word/phrase repetitions: me...

work page
[68]

Find event (try {preferred_category} first): Discrete/separable (10 people agree), repeats 3+ times (aim 3-8), non-overlapping, unambiguous

work page
[69]

Specifies event clearly

Write the question: One natural counting instruction at 00:00. Specifies event clearly. No count revealed

work page
[70]

Count: X

Write responses — one per occurrence: - Natural notification (NOT "Count: X"). - count: cumulative integer (1, 2, 3, ...). - trigger_time (MM:SS), after 00:00. Under 20 words. - Chronological, incrementing by exactly 1

work page
[71]

status":

Classify: trigger_type, audio_dependency. Include chosen_category field. ## Output (single JSON object, no markdown) Fields: status, chosen_category, question, question_time, audio_dependency, responses[] with: trigger_time, response, count, trigger_type, trigger_type_reason, event_description. If <3 occurrences: {"status":"skip","reason":"..."} ## Rules ...

work page
[73]

concludes/climax/final/wraps up/ends with

Timestamped dense caption (reference). Duration: {duration_mmss} ({duration_sec:.0f}s). ## Streaming Constraint Real-time — NO future knowledge. Each update describes only what happened UP TO that point. NEVER use "concludes/climax/final/wraps up/ends with." ## Steps

work page
[74]

describe everything

Find a natural narration focus (satisfy ALL): A. Specific and constrained (NOT "describe everything") B. Multiple natural breakpoints (3+ stages). C. Grounded in observable, verifiable facts. D. Realistic. E. Integrates visual AND audio

work page
[75]

Specifies focus, implies ongoing updates

Write the question: One natural instruction at 00:00. Specifies focus, implies ongoing updates. No spoilers

work page
[76]

- Specific verifiable details (names, quantities)

Write responses — one per breakpoint (aim 3-6): - Factual summary since last update, within focus. - Specific verifiable details (names, quantities). - trigger_time (MM:SS). Under 40 words. - Chronological. Each adds NEW information. - Distributed across video (max 2 in first quarter)

work page
[77]

status":

Classify: trigger_type, audio_dependency. ## Output (single JSON object, no markdown) Fields: status, question, question_time, audio_dependency, responses[] with: trigger_time, response, trigger_type, trigger_type_reason, event_description. If no focus: {"status":"skip","reason":"..."} ## Rules - Timestamps MM:SS in [00:00,{duration_mmss}]. - 3-6 response...

work page
[79]

Duration: {duration_mmss} ({duration_sec:.0f}s)

Timestamped dense caption (rough reference). Duration: {duration_mmss} ({duration_sec:.0f}s). ## Steps

work page
[80]

Check required appear/disappear/reappear pattern: - Targets appear at spread-out times? - At least one disappears and reappears later? - At least 3 unique targets? If no reappear pattern, return skip

work page
[81]

people interviewed on camera

Find target category (must satisfy ALL): - Distinct identities. Appear-disappear-reappear. - 3+ targets, min 15s span. Unambiguous (9/10 agree). - Precisely scoped with qualifier when noisy: GOOD: "people interviewed on camera", "products picked up and demonstrated" BAD: "different scenes" (vague)

work page
[82]

Emphasizes unique/ different

Write the question: One natural instruction at 00:00. Emphasizes unique/ different. No expected count revealed

work page
[83]

- count: cumulative unique count (1, 2, 3,...)

Write responses — one per NEW unique target: - Describe what distinguishes from prior targets. - count: cumulative unique count (1, 2, 3,...). - trigger_time = first appearance (MM:SS). - Under 20 words. NEVER count re-appearances

work page
[84]

status":

Classify: trigger_type, audio_dependency. ## Output (single JSON object, no markdown) Fields: status, question, question_time, audio_dependency, responses[] with: trigger_time, response, count, trigger_type, trigger_type_reason, event_description. If no dedup pattern or <3 targets: {"status":"skip",...} ## Rules - Timestamps MM:SS in [00:00,{duration_mmss...

work page
[85]

Original video (ground truth)

work page
[86]

Duration: {duration_mmss} ({duration_sec:.0f}s)

Timestamped dense caption (reference). Duration: {duration_mmss} ({duration_sec:.0f}s). 20 ## Constraints - Real-time — no future knowledge. Instructions based on observations + domain knowledge. - ONLY tutorials: cooking, DIY, repair, beauty, exercise. NOT: interviews, vlogs, news, reviews, sports. If not a replicable process, return skip. ## Steps

work page
[87]

Determine suitability: Clear goal? Sequential steps? Observable? If any = NO, return skip

work page
[88]

User wants to follow along

Write the question: One natural instruction at 00:00. User wants to follow along. States learning goal. No spoilers

work page
[89]

Now add

Write responses — one per step transition: Timing: previous step completed, next not started. - Actionable instruction (WHAT + HOW). - Key parameters (quantities, temps, times). - Instructional language ("Now add...", "Next...") NOT descriptive ("He is adding..."). - Verified by video. trigger_time (MM:SS). - Under 40 words. Chronological

work page
[90]

status":

Classify: trigger_type, audio_dependency. ## Output (single JSON object, no markdown) Fields: status, question, question_time, audio_dependency, responses[] with: trigger_time, response, trigger_type, trigger_type_reason, event_description. If not tutorial: {"status":"skip","reason":"..."} ## Rules - Timestamps MM:SS in [00:00,{duration_mmss}]. - At least...

work page

[1] Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, et al. Phi-4-Mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs. arXiv preprint arXiv:2503.01743, 2025.
[2] Shehreen Azad, Vibhav Vineet, and Yogesh Singh Rawat. StreamReady: Learning what to answer and when in long streaming videos. In CVPR, 2026.
[3] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
[4] Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, and Mike Zheng Shou. VideoLLM-online: Online video large language model for streaming video. In CVPR, 2024.
[5] Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in Video-LLMs. arXiv preprint arXiv:2406.07476, 2024.
[6] Junbo Cui, Bokai Xu, Chongyi Wang, Tianyu Yu, Weiyue Sun, Yingjing Xu, Tianran Wang, Zhihui He, Wenshuo Ma, Tianchi Cai, et al. MiniCPM-o 4.5: Towards real-time full-duplex omni-modal interaction. arXiv preprint arXiv:2604.27393, 2026.
[7] Xin Ding, Hao Wu, Yifan Yang, Shiqi Jiang, Qianxi Zhang, Donglin Bai, Zhibo Chen, and Ting Cao. StreamMind: Unlocking full frame rate streaming video dialogue through event-gated cognition. In ICCV, 2025.
[8] Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, and Feng Zheng. LongVALE: Vision-audio-language-event benchmark towards time-aware omni-modal perception of long videos. In CVPR, 2025.
[9] Hyolim Kang, Yunsu Park, Youngbeom Yoo, Yeeun Choi, and Seon Joo Kim. Open-ended hierarchical streaming video understanding with vision language models. In ICCV, 2025.
[10] Junho Kim, Hosu Lee, James M. Rehg, Minsu Kim, and Yong Man Ro. STRIDE: When to speak meets sequence denoising for streaming video understanding. arXiv preprint arXiv:2603.27593, 2026.

[11] Wei Li, Bing Hu, Rui Shao, Leyang Shen, and Liqiang Nie. LION-FS: Fast & slow video-language thinker as online video assistant. In CVPR, 2025.
[12] Yifei Li, Junbo Niu, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, and Jiaqi Wang. OVO-Bench: How far is your Video-LLMs from real-world online video understanding? In CVPR, 2025.
[13] Junming Lin, Zheng Fang, Chi Chen, Haoxuan Cheng, Zihao Wan, Fuwen Luo, Ziyue Wang, Peng Li, Yang Liu, and Maosong Sun. StreamingBench: Assessing the gap for MLLMs to achieve streaming video understanding. In ICASSP, 2026.
[14] Zikang Liu, Longteng Guo, Handong Li, Ru Zhen, Xingjian He, Ruyi Ji, Xiaoming Ren, Yanhao Zhang, Haonan Lu, and Jing Liu. Thinking in streaming video. arXiv preprint arXiv:2603.12938, 2026.
[15] Rui Qian, Shuangrui Ding, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Dispider: Enabling video LLMs with active real-time interaction via disentangled perception, decision, and reaction. In CVPR, 2025.
[16] Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. video-SALMONN 2: Caption-enhanced audio-visual large language models. arXiv preprint arXiv:2506.15220, 2025.
[17] Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. COIN: A large-scale dataset for comprehensive instructional video analysis. In CVPR, 2019.
[18] Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, and Ping Huang. StreamBridge: Turning your offline video large language model into a proactive streaming assistant. In NeurIPS, 2025.
[19] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
[20] Yueqian Wang, Songxiang Liu, Disong Wang, Nuo Xu, Guanglu Wan, Huishuai Zhang, and Dongyan Zhao. MMDuet2: Enhancing proactive interaction of video MLLMs with multi-turn reinforcement learning. In ICLR, 2026.

[21] Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, and Zilong Zheng. OmniMMI: A comprehensive multi-modal interaction benchmark in streaming video contexts. In CVPR, 2025.
[22] Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, and Mike Zheng Shou. VideoLLM-MoD: Efficient video-language streaming with mixture-of-depths vision computation. In NeurIPS, 2024.
[23] Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, and Kaiyang Zhou. Streaming video instruction tuning. In CVPR, 2026.
[24] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215, 2025.
[25] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-Omni technical report. arXiv preprint arXiv:2509.17765, 2025.
[26] Haolin Yang, Feilong Tang, Lingxiao Zhao, Xiang An, Ming Hu, Huifa Li, Xinlin Zhuang, Yifan Lu, Xiaofeng Zhang, Abdalla Swikir, et al. StreamAgent: Towards anticipatory agents for streaming video understanding. arXiv preprint arXiv:2508.01875, 2025.
[27] Zhenyu Yang, Kairui Zhang, Yuhang Hu, Bing Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Weiming Dong, and Changsheng Xu. LiveStar: Live streaming assistant for real-world online video understanding. In NeurIPS, 2025.
[28] Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, and Xu Sun. TimeChat-Online: 80% visual tokens are naturally redundant in streaming videos. In MM, 2025.
[29] Yichi Zhang, Xin Luna Dong, Zhaojiang Lin, Andrea Madotto, Anuj Kumar, Babak Damavandi, Joyce Chai, and Seungwhan Moon. Proactive assistant dialogue generation from streaming egocentric videos. In EMNLP, 2025.
[30] Yulin Zhang, Cheng Shi, Yang Wang, and Sibei Yang. Eyes Wide Open: Ego proactive Video-LLM for streaming video. In NeurIPS, 2025.

[31] Yikai Zheng, Xin Ding, Yifan Yang, Shiqi Jiang, Hao Wu, Qianxi Zhang, Weijun Wang, Ting Cao, and Yunxin Liu. Em-Garde: A propose-match framework for proactive streaming video understanding. arXiv preprint arXiv:2603.19054, 2026.

A More Experimental Results

A.1 Tolerance Window Ablation

[Figure: Score (%) vs. tolerance (±s) for ±1, ±2, ±3, ±5, ±10; values: 15.4 18....]
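To make the idea of a tolerance window concrete, the sketch below counts how many ground-truth trigger times are matched by a predicted response within ±s seconds. This is only an illustration under stated assumptions: the function name `hits_within_tolerance`, the greedy one-to-one matching, and the example timestamps are all hypothetical, and the benchmark's actual scoring rule is not reproduced here.

```python
def hits_within_tolerance(predicted, ground_truth, tolerance):
    """Count ground-truth triggers matched by a prediction within +/- tolerance seconds.

    Assumption: matching is greedy and one-to-one, so a prediction can be
    used for at most one ground-truth trigger.
    """
    unmatched = sorted(predicted)
    hits = 0
    for gt in sorted(ground_truth):
        candidates = [p for p in unmatched if abs(p - gt) <= tolerance]
        if candidates:
            # Claim the closest still-unmatched prediction.
            best = min(candidates, key=lambda p: abs(p - gt))
            unmatched.remove(best)
            hits += 1
    return hits


# Hypothetical timestamps in seconds, for illustration only.
predicted = [4.0, 21.5, 58.0]
ground_truth = [5.0, 20.0, 50.0]
for tol in (1, 2, 3, 5, 10):
    print(f"+/-{tol}s -> {hits_within_tolerance(predicted, ground_truth, tol)} hits")
```

Widening the window can only keep or increase the number of hits under this rule, which is why an ablation over several window sizes is informative about timing precision.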

[32] [32]

Include: Who (appearance, actions), What (objects, text), Action (specific verbs, direction), Change (differences from previous segment), Audio-visual correlation

caption: A detailed, information-dense paragraph integrating visual, audio, and speech into one coherent description. Include: Who (appearance, actions), What (objects, text), Action (specific verbs, direction), Change (differences from previous segment), Audio-visual correlation

work page

[33] [33]

visual: Exhaustive visual details — scene, lighting, colors, objects, people, camera work, on-screen text verbatim

work page

[34] [34]

Note onset and cessation of sounds

audio: Precise sound description — music (genre, tempo, instruments), sound effects, ambient sounds, voice quality. Note onset and cessation of sounds

work page

[35] [35]

None". Return a JSON array: [ {

speech: Detailed summary of what is said — key claims, names, numbers, facts. If no speech, write "None". Return a JSON array: [ { "start": "MM:SS", "end": "MM:SS", "caption": "...", "visual": "...", "audio": "...", "speech": "..." } ] Rules: - Segments must cover the entire video from 00:00 to {duration_mmss} with no gaps or overlaps. - Timestamps in MM:...

work page
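The coverage rule for dense-caption segments (spanning 00:00 to {duration_mmss} with no gaps or overlaps) lends itself to an automatic check. The following is a minimal sketch rather than the authors' tooling: the helper names `mmss_to_seconds` and `coverage_problems` are hypothetical, and it assumes each segment is a dict with "start" and "end" in MM:SS form.

```python
def mmss_to_seconds(ts):
    """Convert an "MM:SS" timestamp to whole seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + int(seconds)


def coverage_problems(segments, duration_mmss):
    """Return a list of problems; an empty list means the segments tile
    [00:00, duration_mmss] with no gaps or overlaps."""
    problems = []
    expected_start = 0
    for i, seg in enumerate(segments):
        start = mmss_to_seconds(seg["start"])
        end = mmss_to_seconds(seg["end"])
        if end <= start:
            problems.append(f"segment {i}: end is not after start")
        if start > expected_start:
            problems.append(f"segment {i}: gap before {seg['start']}")
        elif start < expected_start:
            problems.append(f"segment {i}: overlap at {seg['start']}")
        expected_start = end
    if expected_start != mmss_to_seconds(duration_mmss):
        problems.append("segments do not end at the video duration")
    return problems


# Hypothetical segments: the third one starts 5 seconds late.
segments = [
    {"start": "00:00", "end": "00:12"},
    {"start": "00:12", "end": "00:30"},
    {"start": "00:35", "end": "01:00"},
]
print(coverage_problems(segments, "01:00"))
```

Returning a list of problems instead of a single boolean makes it easier to see which segment to repair when a generated caption file is rejected.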

Timestamped dense caption (supplementary reference). Duration: {duration_mmss} ({duration_sec:.0f}s).

## Steps

Choose an event (AUDIO-FIRST priority):
- Audio-required (best): ONLY detectable by listening (doorbell, whistle, spoken phrase, alarm, glass break)
- Audio-helpful: visible but audio confirms
- Visual-only (last resort): no meaningful audio

Write the question: One natural standing instruction at 00:00. Must sound like a real person talking to a smart assistant. No spoilers, no timestamps, everyday language

Write response(s) - one per event occurrence:
- State what happened, briefly and naturally.
- Include accurate trigger_time (MM:SS).
- Conversational tone, not robotic

Classify each response:
- trigger_type: "visual"|"sound"|"speech"|combined (e.g., "visual+sound", "visual+speech")
- audio_dependency: "required"|"helpful"|"none"
- trigger_type_reason: brief explanation

## Output (single JSON object, no markdown)

Fields: status, question, question_time ("00:00"), audio_dependency, responses[] with: trigger_time, response...
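The classification fields listed in this template have closed vocabularies, so a generated annotation can be sanity-checked mechanically. The sketch below is illustrative only: `classification_errors` is a hypothetical helper, and it assumes that combined trigger types are written as base types joined with "+", as in the "visual+sound" example.

```python
BASE_TRIGGERS = {"visual", "sound", "speech"}
AUDIO_DEPENDENCY = {"required", "helpful", "none"}


def classification_errors(response):
    """Return the names of classification fields that look invalid."""
    errors = []
    # Assumption: combined types are "+"-joined base types without repeats.
    parts = response.get("trigger_type", "").split("+")
    if any(p not in BASE_TRIGGERS for p in parts) or len(set(parts)) != len(parts):
        errors.append("trigger_type")
    if response.get("audio_dependency") not in AUDIO_DEPENDENCY:
        errors.append("audio_dependency")
    if not response.get("trigger_type_reason", "").strip():
        errors.append("trigger_type_reason")
    return errors


# Hypothetical annotation for illustration.
good = {
    "trigger_type": "visual+sound",
    "audio_dependency": "required",
    "trigger_type_reason": "doorbell rings while the door is visible",
}
print(classification_errors(good))
```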

Find a trigger-target pair (AUDIO-FIRST):
- TRIGGER: instantaneous, real-time confirmable (NO "finishes/ends/completes"), unambiguous (maps to precise frame), naturally paired with target. OK: whistle->ball, "Maria" called->Maria
- TARGET: fits in ONE grid cell (highest priority). Never: full person, large vehicle, close-up face. Preferred: small held objects...

Write the question: One natural instruction at 00:00. Specify BOTH trigger and target. Ask for position "in the frame"/"on screen"

Write response(s) - default exactly ONE (max 4):
- Describe trigger event. State target location.
- position: one of 9 grid cells ("top-left"|"top-center"|...|"bottom-right")
- trigger_time (MM:SS). Under 20 words

Classify: trigger_type, audio_dependency.

## Output (single JSON object, no markdown)

Fields: status, question, question_time, audio_dependency, responses[] with: trigger_time, response, position, trigger_type, trigger_type_reason, event_description. If no pair: {"status":"skip","reason":"..."}

## Rules

- Timestamps MM:SS in [00:00,{duration_mmss}].
- Pos...

Choose a SPECIFIC PHYSICAL DIMENSION:

Audio scan (MANDATORY) first:
- Speakers taking turns? -> "who is speaking"
- Music starts/stops? -> "whether music is playing"
- Alternating sound sources?
If any audio dimension works, use it.

Visual scan (only if no audio): State must be: specific (ONE property), discrete, about main subject, changes 2+ times, obje...

Write the question: One natural monitoring instruction at 00:00. UNAMBIGUOUS. No spoilers. No state value lists

Write responses - ONLY at transitions (2-5):
- Name previous AND new state (from X -> to Y).
- trigger_time (MM:SS), after 00:00. Under 15 words.
- Do NOT report initial state. Chronological

Classify: trigger_type, audio_dependency. Include audio_scan field.

## Output (single JSON object, no markdown)

Fields: status, audio_scan, question, question_time, audio_dependency, responses[] with: trigger_time, response, trigger_type, trigger_type_reason, event_description. If no suitable state: {"status":"skip","reason":"..."}

## Rules

- Timestamps M...

Find trigger + counting target (LISTEN FIRST):

Good audio trigger->target pairs (naturally connected):
- Whistle blows -> count players on field
- Applause starts -> count performers on stage
- "everyone ready?" -> count people in room
- Timer buzzes -> count dishes on counter

Bad (artificially forced):
- "Hey" -> count people on sofa (no connection)

Visu...

Write the question: One natural counting instruction at 00:00. Specifies BOTH trigger and counting target. Natural language. No expected count revealed

Write ONE response at trigger moment:
- Note trigger occurred. State exact count.
- count field with integer (for evaluation).
- trigger_time (MM:SS), after 00:00. Under 15 words

Classify: trigger_type, audio_dependency. Include audio_scan field.

## Output (single JSON object, no markdown)

Fields: status, audio_scan, question, question_time, audio_dependency, responses[] (exactly one) with: trigger_time, response, count, trigger_type, trigger_type_reason, event_description. If no pair: {"status":"skip","reason":"..."}

## Rules ...

Timestamped dense caption (reference). Duration: {duration_mmss} ({duration_sec:.0f}s).

## Steps

Find a natural condition (AUDIO-FIRST): Must satisfy ALL:
A. Realistic - a real person would want this alert.
B. Requires semantic understanding (NOT perception): Test: "Could a detector+classifier handle this?" BAD: "when audience cheers" (sound classification) GOOD: "when speaker provides a statistic as evidence"
C. Unambiguous (9/10 people flag same mo...

Write the question: One natural monitoring instruction at 00:00. Describes condition clearly. No spoilers

Write responses - one per occurrence:
- State what happened AND why it satisfies condition.
- Under 25 words. trigger_time (MM:SS), after 00:00.
- Speech: timestamp = when sentence ends

Classify: trigger_type, audio_dependency.

## Output (single JSON object, no markdown)

Fields: status, question, question_time, audio_dependency, responses[] with: trigger_time, response, trigger_type, trigger_type_reason, event_description. If no condition: {"status":"skip","reason":"..."}

## Rules

- Timestamps MM:SS in [00:00,{duration_mmss}].
- At le...

Timestamped dense caption (reference). Duration: {duration_mmss} ({duration_sec:.0f}s). Preferred Category: {preferred_category}

## Event Categories

A - Discrete non-speech sounds: impact, signals, instrument hits, animal/body sounds.
B - Speech acts: questions, instructions, jokes, laughter bursts. Each = one complete act.
C - Word/phrase repetitions: me...

Find event (try {preferred_category} first): Discrete/separable (10 people agree), repeats 3+ times (aim 3-8), non-overlapping, unambiguous

Write the question: One natural counting instruction at 00:00. Specifies event clearly. No count revealed

Write responses - one per occurrence:
- Natural notification (NOT "Count: X").
- count: cumulative integer (1, 2, 3, ...).
- trigger_time (MM:SS), after 00:00. Under 20 words.
- Chronological, incrementing by exactly 1

Classify: trigger_type, audio_dependency. Include chosen_category field.

## Output (single JSON object, no markdown)

Fields: status, chosen_category, question, question_time, audio_dependency, responses[] with: trigger_time, response, count, trigger_type, trigger_type_reason, event_description. If <3 occurrences: {"status":"skip","reason":"..."}

## Rules ...
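The response constraints for this counting template (cumulative counts incrementing by exactly 1, chronological trigger times after 00:00, under 20 words, at least 3 occurrences) can be expressed as a small validator. This is a sketch under assumptions, not the benchmark's verification code: `counting_errors` is a hypothetical name, and "chronological" is read as non-decreasing at MM:SS resolution.

```python
def mmss_to_seconds(ts):
    """Convert an "MM:SS" timestamp to whole seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + int(seconds)


def counting_errors(responses):
    """Check a responses[] list against the counting constraints."""
    errors = []
    if len(responses) < 3:
        errors.append("fewer than 3 occurrences: sample should be skipped")
    previous_time = 0
    for index, resp in enumerate(responses, start=1):
        t = mmss_to_seconds(resp["trigger_time"])
        if t <= 0:
            errors.append(f"response {index}: trigger_time must be after 00:00")
        if t < previous_time:
            errors.append(f"response {index}: not chronological")
        if resp["count"] != index:
            errors.append(f"response {index}: count should be {index}")
        if len(resp["response"].split()) >= 20:
            errors.append(f"response {index}: 20 words or more")
        previous_time = t
    return errors


# Hypothetical annotation: the third count skips from 2 to 4.
responses = [
    {"trigger_time": "00:07", "response": "First bell ring just now.", "count": 1},
    {"trigger_time": "00:19", "response": "The bell rang again.", "count": 2},
    {"trigger_time": "00:41", "response": "Another bell ring.", "count": 4},
]
print(counting_errors(responses))
```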

Timestamped dense caption (reference). Duration: {duration_mmss} ({duration_sec:.0f}s).

## Streaming Constraint

Real-time - NO future knowledge. Each update describes only what happened UP TO that point. NEVER use "concludes/climax/final/wraps up/ends with."

## Steps

Find a natural narration focus (satisfy ALL):
A. Specific and constrained (NOT "describe everything")
B. Multiple natural breakpoints (3+ stages).
C. Grounded in observable, verifiable facts.
D. Realistic.
E. Integrates visual AND audio

Write the question: One natural instruction at 00:00. Specifies focus, implies ongoing updates. No spoilers

Write responses - one per breakpoint (aim 3-6):
- Factual summary since last update, within focus.
- Specific verifiable details (names, quantities).
- trigger_time (MM:SS). Under 40 words.
- Chronological. Each adds NEW information.
- Distributed across video (max 2 in first quarter)

Classify: trigger_type, audio_dependency.

## Output (single JSON object, no markdown)

Fields: status, question, question_time, audio_dependency, responses[] with: trigger_time, response, trigger_type, trigger_type_reason, event_description. If no focus: {"status":"skip","reason":"..."}

## Rules

- Timestamps MM:SS in [00:00,{duration_mmss}].
- 3-6 response...
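The streaming constraint in this narration template (no wording that presupposes knowledge of the ending, under 40 words, at most 2 responses in the first quarter) could be screened with a simple filter. The sketch below is an assumption-laden illustration: `narration_errors` is a hypothetical helper, and the plain substring match is deliberately crude, so a word like "finally" would also be flagged.

```python
BANNED_PHRASES = ("concludes", "climax", "final", "wraps up", "ends with")


def mmss_to_seconds(ts):
    """Convert an "MM:SS" timestamp to whole seconds."""
    minutes, seconds = ts.split(":")
    return int(minutes) * 60 + int(seconds)


def narration_errors(responses, duration_sec):
    """Flag narration updates that break the streaming constraints."""
    errors = []
    early = 0
    for index, resp in enumerate(responses, start=1):
        text = resp["response"].lower()
        for phrase in BANNED_PHRASES:
            # Crude substring match; may over-flag (e.g. "finally").
            if phrase in text:
                errors.append(f"response {index}: uses '{phrase}'")
        if len(text.split()) >= 40:
            errors.append(f"response {index}: 40 words or more")
        if mmss_to_seconds(resp["trigger_time"]) < duration_sec / 4:
            early += 1
    if early > 2:
        errors.append("more than 2 responses in the first quarter")
    return errors


# Hypothetical narration for a 120-second video.
updates = [
    {"trigger_time": "00:20", "response": "The chef dices two onions."},
    {"trigger_time": "01:00", "response": "Onions are now frying in butter."},
    {"trigger_time": "01:50", "response": "The chef concludes by plating the dish."},
]
print(narration_errors(updates, 120))
```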

Timestamped dense caption (rough reference). Duration: {duration_mmss} ({duration_sec:.0f}s).

## Steps

Check required appear/disappear/reappear pattern:
- Targets appear at spread-out times?
- At least one disappears and reappears later?
- At least 3 unique targets?
If no reappear pattern, return skip

Find target category (must satisfy ALL):
- Distinct identities. Appear-disappear-reappear.
- 3+ targets, min 15s span. Unambiguous (9/10 agree).
- Precisely scoped with qualifier when noisy: GOOD: "people interviewed on camera", "products picked up and demonstrated" BAD: "different scenes" (vague)

Write the question: One natural instruction at 00:00. Emphasizes unique/different. No expected count revealed

Write responses - one per NEW unique target:
- Describe what distinguishes from prior targets.
- count: cumulative unique count (1, 2, 3,...).
- trigger_time = first appearance (MM:SS).
- Under 20 words. NEVER count re-appearances

Classify: trigger_type, audio_dependency.

## Output (single JSON object, no markdown)

Fields: status, question, question_time, audio_dependency, responses[] with: trigger_time, response, count, trigger_type, trigger_type_reason, event_description. If no dedup pattern or <3 targets: {"status":"skip",...}

## Rules

- Timestamps MM:SS in [00:00,{duration_mmss}...
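The deduplicated counting rule in this template (one response per NEW unique target, timed at first appearance, never counting re-appearances) amounts to a first-occurrence filter. The following sketch shows that logic under assumptions: `first_appearances` and the identity labels are hypothetical, and real samples would rely on a model or annotator to decide whether two sightings are the same target.

```python
def first_appearances(observations):
    """Return (trigger_time, cumulative_count, identity) for each new target.

    observations: (trigger_time "MM:SS", identity) pairs in chronological order.
    """
    seen = set()
    results = []
    for trigger_time, identity in observations:
        if identity in seen:
            continue  # re-appearance: never counted again
        seen.add(identity)
        results.append((trigger_time, len(seen), identity))
    return results


# Hypothetical sightings: target "A" disappears and reappears at 00:40.
observations = [
    ("00:05", "A"),
    ("00:20", "B"),
    ("00:40", "A"),
    ("01:10", "C"),
]
for row in first_appearances(observations):
    print(row)
```

With fewer than 3 unique targets in the result, the template's skip condition would apply instead of emitting responses.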

Original video (ground truth)

Timestamped dense caption (reference). Duration: {duration_mmss} ({duration_sec:.0f}s).

## Constraints

- Real-time - no future knowledge. Instructions based on observations + domain knowledge.
- ONLY tutorials: cooking, DIY, repair, beauty, exercise. NOT: interviews, vlogs, news, reviews, sports. If not a replicable process, return skip.

## Steps

Determine suitability: Clear goal? Sequential steps? Observable? If any = NO, return skip

Write the question: One natural instruction at 00:00. User wants to follow along. States learning goal. No spoilers

Write responses - one per step transition: Timing: previous step completed, next not started.
- Actionable instruction (WHAT + HOW).
- Key parameters (quantities, temps, times).
- Instructional language ("Now add...", "Next...") NOT descriptive ("He is adding...").
- Verified by video. trigger_time (MM:SS).
- Under 40 words. Chronological

Classify: trigger_type, audio_dependency.

## Output (single JSON object, no markdown)

Fields: status, question, question_time, audio_dependency, responses[] with: trigger_time, response, trigger_type, trigger_type_reason, event_description. If not tutorial: {"status":"skip","reason":"..."}

## Rules

- Timestamps MM:SS in [00:00,{duration_mmss}].
- At least...
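The templates share a header of the form `Duration: {duration_mmss} ({duration_sec:.0f}s)`, which uses Python format-field syntax. As a hedged illustration of how such a header might be filled in, the sketch below derives the MM:SS string from a duration in seconds; the function name `render_duration` is hypothetical and the authors' actual rendering code is not shown in the source.

```python
def render_duration(duration_sec):
    """Fill the Duration header from a duration given in seconds."""
    minutes, seconds = divmod(int(duration_sec), 60)
    duration_mmss = f"{minutes:02d}:{seconds:02d}"
    template = "Duration: {duration_mmss} ({duration_sec:.0f}s)"
    return template.format(duration_mmss=duration_mmss, duration_sec=duration_sec)


print(render_duration(125.4))  # Duration: 02:05 (125s)
```

Note that this sketch truncates fractional seconds for the MM:SS part while `.0f` rounds, so the two fields can differ by one second for durations such as 125.6.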