OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
Pith reviewed 2026-05-20 11:03 UTC · model grok-4.3
The pith
OmniPro is the first benchmark to jointly test omni-modal perception, proactive response timing, and diverse video understanding in streaming audio-visual inputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OmniPro is the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals, each sample carries modality-isolation labels, and a dual-mode protocol separates Probe-mode content checks from Online-mode autonomous timing in streaming input.
What carries the argument
The dual-mode evaluation protocol: Probe mode queries the model before and after each ground-truth trigger, while Online mode requires the model to decide autonomously when to respond in continuous streaming input.
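To make the Online-mode idea concrete, the following is a minimal sketch of how autonomously timed responses could be matched against ground-truth triggers. The function names `match_responses` and `timing_f1`, the greedy one-to-one matching, and the `tolerance_s` window are assumptions for illustration only; the paper's actual scoring rule is not reproduced here.

```python
from typing import List, Tuple


def match_responses(pred_times: List[float],
                    trigger_times: List[float],
                    tolerance_s: float = 2.0) -> Tuple[int, int, int]:
    """Greedily match predicted response times to ground-truth triggers.

    Hypothetical rule: a response is a hit if it lies within +/- tolerance_s
    seconds of a trigger that has not been matched yet.
    Returns (hits, missed_triggers, spurious_responses).
    """
    unmatched = sorted(trigger_times)
    hits = 0
    spurious = 0
    for t in sorted(pred_times):
        # Closest trigger that is still unmatched (None if none remain).
        best = min(unmatched, key=lambda g: abs(g - t), default=None)
        if best is not None and abs(best - t) <= tolerance_s:
            unmatched.remove(best)
            hits += 1
        else:
            spurious += 1
    return hits, len(unmatched), spurious


def timing_f1(hits: int, missed: int, spurious: int) -> float:
    """F1 over response timing; 0.0 when there are no hits."""
    if hits == 0:
        return 0.0
    precision = hits / (hits + spurious)
    recall = hits / (hits + missed)
    return 2 * precision * recall / (precision + recall)
```

Under this sketch, a model that speaks too often is penalized through precision and a model that stays silent is penalized through recall, which is one plausible way to separate timing quality from content quality.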
If this is right
- Audio signals produce consistent performance gains across tasks but models vary widely in how effectively they exploit them.
- Model accuracy drops substantially as input length grows, revealing limited robustness over extended time horizons.
- Non-speech audio remains the weakest perceptual dimension for current models.
- Modality-isolation labels enable fine-grained diagnosis of which input channels drive success or failure on each sample.
- The nine sub-tasks and three cognitive levels together allow differentiation of models across basic perception through higher-level reasoning.
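As an illustration of the fine-grained diagnosis that modality-isolation labels allow, the sketch below groups per-sample results by a label and reports accuracy per group, plus the share of samples that require audio. The record fields `audio_dependency` and `correct`, and the label value `"required"`, are hypothetical names chosen for this example and are not taken from a released OmniPro file format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def accuracy_by_label(samples: Iterable[Mapping],
                      label_key: str = "audio_dependency") -> Dict[str, float]:
    """Return accuracy per label value (e.g. per audio-dependency class)."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for s in samples:
        label = s[label_key]
        totals[label] += 1
        correct[label] += int(bool(s["correct"]))
    return {label: correct[label] / totals[label] for label in totals}


def audio_required_share(samples: List[Mapping],
                         label_key: str = "audio_dependency",
                         required_value: str = "required") -> float:
    """Fraction of samples whose label marks audio as required."""
    if not samples:
        return 0.0
    return sum(s[label_key] == required_value for s in samples) / len(samples)


# Toy records, invented for illustration only.
toy = [
    {"audio_dependency": "required", "correct": True},
    {"audio_dependency": "required", "correct": False},
    {"audio_dependency": "helpful", "correct": True},
    {"audio_dependency": "none", "correct": True},
]
per_label = accuracy_by_label(toy)
share = audio_required_share(toy)
```

Comparing the per-label accuracies of the same model with and without the audio track would show whether audio is actually being used on the samples where it is needed.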
Where Pith is reading between the lines
- The observed long-horizon degradation points to a need for training regimes that explicitly reward sustained coherence across minutes of streaming input.
- The benchmark's emphasis on autonomous timing could encourage architectures that maintain an internal state for deciding response initiation rather than relying on external triggers.
- Poor non-speech audio results suggest that pairing general multimodal models with dedicated sound-event detectors might close the largest remaining gap.
- The 2,700-sample scale and human verification make the benchmark suitable for tracking progress as new omni-modal models are released.
Load-bearing premise
The dual-mode evaluation protocol accurately measures true proactive ability in streaming input without introducing biases from the specific querying or annotation process.
What would settle it
A set of models that achieve high scores in Online mode yet still require external prompts or fixed timestamps to match human-verified response times would show the protocol does not isolate genuine proactive behavior.
Original abstract
Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input. Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.
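The Probe-mode description in the abstract can be sketched as a pair of queries around one ground-truth trigger. This is a hedged reading, not the authors' implementation: the model interface `ModelFn`, the `offset_s` gap, and the convention that returning `None` means "nothing to report yet" are all assumptions introduced for this example.

```python
from typing import Callable, Dict, Optional

# Hypothetical interface: given the stream truncated at time t (seconds) and
# the standing question, return an answer, or None for "nothing to report yet".
ModelFn = Callable[[float, str], Optional[str]]


def probe_sample(model: ModelFn,
                 question: str,
                 trigger_time: float,
                 is_correct: Callable[[str], bool],
                 offset_s: float = 1.0) -> Dict[str, bool]:
    """Query the model shortly before and shortly after one trigger."""
    before = model(max(0.0, trigger_time - offset_s), question)
    after = model(trigger_time + offset_s, question)
    return {
        # Before the trigger, the model should have nothing to report.
        "silent_before": before is None,
        # After the trigger, the model should report the right content.
        "correct_after": after is not None and is_correct(after),
    }


def toy_model(t: float, question: str) -> Optional[str]:
    """Toy stand-in that 'hears' a doorbell at t = 10 s."""
    return "The doorbell just rang." if t >= 10.0 else None


result = probe_sample(toy_model, "Tell me when the doorbell rings.",
                      trigger_time=10.0,
                      is_correct=lambda a: "doorbell" in a.lower())
```

Because the query times are fixed by the evaluator, a check of this kind isolates content understanding from the timing decision that Online mode is meant to test.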
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OmniPro, the first benchmark for omni-proactive streaming video understanding in omni-modal LLMs. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals, and every sample carries modality-isolation labels. A dual-mode evaluation protocol is introduced: Probe mode queries models before and after ground-truth triggers to assess content understanding, while Online mode requires autonomous timing decisions in streaming input. Evaluation of 11 models yields three findings: audio provides consistent gains but is utilized very unevenly across models, performance degrades significantly over time, and non-speech audio perception is the weakest dimension.
Significance. If the trigger annotations and protocol prove robust, this benchmark would meaningfully advance evaluation of emerging omni-modal proactive capabilities by addressing gaps in visual-only, polling-based prior work. Strengths include the scale of human-verified samples, explicit modality-isolation labels for fine-grained analysis, and empirical identification of model limitations such as long-horizon robustness. These elements could help standardize assessment and guide development in streaming video understanding.
major comments (3)
- [Dual-mode Evaluation Protocol] Dual-mode Evaluation Protocol section: The central claim that Online mode measures true proactive ability rests on the assumption that ground-truth trigger annotations are objective and reproducible markers. The manuscript provides no inter-annotator agreement metrics, details on whether triggers were annotated from full videos or streaming cues only, or validation against streaming simulation artifacts, which directly risks the protocol measuring annotation biases instead of model proactivity.
- [Results and Analysis] Results and Analysis section: The key finding that performance degrades significantly over time lacks any statistical significance tests, p-values, confidence intervals, or effect-size reporting on the degradation trends across the 11 models, weakening the load-bearing claim of limited long-horizon robustness.
- [Dataset Construction] Dataset Construction section: Sample selection criteria for the 2,700 samples are not described, and no quantitative validation (e.g., agreement scores or error analysis) is given for the modality-isolation labels or the 84% audio-requirement statistic, limiting interpretability of the fine-grained multimodal findings.
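The agreement metrics requested in the first and third major comments could take several forms. One common choice for categorical labels, such as a per-sample audio-dependency class, is Cohen's kappa between two annotators. The sketch below is a plain-Python implementation of that standard statistic; the label values used in the example are hypothetical, and nothing here reflects agreement figures from the paper, which reports none.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Invented labels for illustration only.
annotator_1 = ["required", "required", "none", "helpful"]
annotator_2 = ["required", "none", "none", "helpful"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

For trigger timestamps, which are continuous, an analogous check would be the fraction of triggers on which two annotators agree within a fixed number of seconds.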
minor comments (2)
- [Abstract] The abstract and introduction would benefit from a brief definition or example of the three cognitive levels to help readers quickly grasp the benchmark's scope.
- [Evaluation Protocol] Figure captions or the evaluation protocol description could more explicitly note how streaming input is simulated in Online mode to preempt questions about implementation biases.
Simulated Author's Rebuttal
We sincerely thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that will improve the transparency and rigor of the manuscript.
- Referee: Dual-mode Evaluation Protocol section: The central claim that Online mode measures true proactive ability rests on the assumption that ground-truth trigger annotations are objective and reproducible markers. The manuscript provides no inter-annotator agreement metrics, details on whether triggers were annotated from full videos or streaming cues only, or validation against streaming simulation artifacts, which directly risks the protocol measuring annotation biases instead of model proactivity.
  Authors: We appreciate this critical observation on the foundation of our dual-mode protocol. We will revise the manuscript to include inter-annotator agreement metrics for the trigger annotations, provide explicit details on the annotation process (including whether full videos or streaming cues were used), and add any available validation against streaming simulation artifacts. These additions will strengthen the claim that Online mode evaluates genuine proactivity.
  Revision: yes
- Referee: Results and Analysis section: The key finding that performance degrades significantly over time lacks any statistical significance tests, p-values, confidence intervals, or effect-size reporting on the degradation trends across the 11 models, weakening the load-bearing claim of limited long-horizon robustness.
  Authors: We agree that the degradation finding requires stronger statistical support. In the revised manuscript, we will add appropriate statistical significance tests, p-values, confidence intervals, and effect sizes for the performance trends over time across the evaluated models.
  Revision: yes
- Referee: Dataset Construction section: Sample selection criteria for the 2,700 samples are not described, and no quantitative validation (e.g., agreement scores or error analysis) is given for the modality-isolation labels or the 84% audio-requirement statistic, limiting interpretability of the fine-grained multimodal findings.
  Authors: We thank the referee for pointing out this lack of detail. We will expand the Dataset Construction section to describe the sample selection criteria and include quantitative validation such as agreement scores and error analysis for the modality-isolation labels as well as the 84% audio-requirement statistic.
  Revision: yes
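One simple way to give the degradation finding the statistical support discussed above is a percentile bootstrap confidence interval on the accuracy gap between early and late parts of the stream. The sketch below is an assumption-laden illustration: the early/late split, the function name `bootstrap_diff_ci`, and the toy outcome vectors are all invented here and do not come from the paper's results.

```python
import random
from typing import Sequence, Tuple


def bootstrap_diff_ci(early: Sequence[int],
                      late: Sequence[int],
                      n_boot: int = 2000,
                      alpha: float = 0.05,
                      seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for mean(early) - mean(late).

    `early` and `late` hold per-sample outcomes (1 = correct, 0 = wrong).
    Returns (point_estimate, ci_low, ci_high).
    """
    rng = random.Random(seed)

    def mean(xs: Sequence[int]) -> float:
        return sum(xs) / len(xs)

    point = mean(early) - mean(late)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping group sizes fixed.
        e = rng.choices(early, k=len(early))
        l = rng.choices(late, k=len(late))
        diffs.append(mean(e) - mean(l))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_boot)]
    high = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, low, high


# Invented outcomes: 80% correct early in the stream, 50% correct late.
early_outcomes = [1] * 80 + [0] * 20
late_outcomes = [1] * 50 + [0] * 50
point, low, high = bootstrap_diff_ci(early_outcomes, late_outcomes)
```

An interval that excludes zero would support a real drop over time for that model; repeating the computation per model would also expose how much the effect size varies across the 11 systems.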
Circularity Check
No circularity: benchmark definition and empirical evaluation are self-contained
The paper presents a new benchmark (OmniPro) with 2,700 samples, 9 sub-tasks, and a dual-mode evaluation protocol (Probe and Online), and reports results on 11 models. No mathematical derivations, equations, parameter fitting, or predictions that reduce to inputs by construction appear in the provided text or abstract. The dual-mode protocol is explicitly defined as a novel construction for assessing proactive ability rather than derived from prior fitted quantities or self-citations. Central claims rest on human-verified annotations and empirical observations (e.g., audio gains, performance degradation), which are externally falsifiable via the benchmark itself and do not rely on load-bearing self-citations or ansatzes smuggled from prior author work. This is the expected outcome for a benchmark paper with no derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human verification of samples ensures quality and reliability for model differentiation.