arxiv: 2605.17360 · v1 · pith:D7D26HBYnew · submitted 2026-05-17 · 💻 cs.CV

Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction

Pith reviewed 2026-05-20 13:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords real-time duplex interaction · multimodal large language models · benchmark · proactive reminder · LLM judge evaluation · omni-modal · streaming inputs · video annotation

The pith

Omni-DuplexEval shows that the best state-of-the-art duplex MLLM reaches only 39.6 percent overall in real-time interaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Omni-DuplexEval, a new benchmark for assessing real-time duplex omni-modal interaction in multimodal large language models. It addresses a gap left by offline evaluation by introducing two scenarios over streaming video: continuous description and proactive responses to salient events. Experiments show that even the best-performing model scores just 39.6 percent overall and 20.0 percent on Proactive Reminder, pointing to difficulties with response timing and content coherence.

Core claim

The authors establish Omni-DuplexEval consisting of 660 videos with fine-grained annotations across nine real-world tasks, evaluated via an LLM-as-a-Judge framework that assesses both response content and timing. This reveals substantial limitations in current duplex MLLMs, where the best model reaches only 39.6 percent overall performance and 20.0 percent on the Proactive Reminder scenario.

What carries the argument

Omni-DuplexEval benchmark with Real-Time Description and Proactive Reminder scenarios, using LLM-as-a-Judge for timestamp-aware evaluation of open-ended queries on 660 annotated videos.
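
To make the timing half of timestamp-aware evaluation concrete, here is a minimal sketch. It is not the paper's implementation: it assumes a hypothetical annotation format in which each salient event has a start and end time in seconds and each model response carries the stream time at which it was emitted. The names `EventWindow`, `TimedResponse`, and `responses_in_window`, and the example data, are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventWindow:
    """Hypothetical annotation: an event and the interval in which a response counts as timely."""
    label: str
    start_s: float
    end_s: float


@dataclass(frozen=True)
class TimedResponse:
    """Hypothetical model output: text plus the stream time at which it was emitted."""
    time_s: float
    text: str


def responses_in_window(window, responses):
    """Return the responses emitted inside the window (bounds inclusive)."""
    return [r for r in responses if window.start_s <= r.time_s <= window.end_s]


# Invented example data, not taken from the benchmark.
window = EventWindow(label="kettle boils", start_s=12.0, end_s=15.0)
responses = [
    TimedResponse(time_s=5.0, text="Nothing has happened yet."),
    TimedResponse(time_s=13.5, text="The kettle is boiling."),
    TimedResponse(time_s=20.0, text="The kettle boiled a while ago."),
]
timely = responses_in_window(window, responses)
print([r.text for r in timely])
```

According to the abstract, the LLM judge evaluates content and timing jointly; the filter above only illustrates the timing side and says nothing about whether a timely response is also correct.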

If this is right

  • Models need to better balance timely responses with holistic content generation in streaming settings.
  • Determining both when to respond and what content to produce remains a core challenge for MLLMs.
  • The automatic evaluation framework provides a scalable way to assess alignment with human judgments on timing and content.
  • Progress in real-time duplex capabilities will require addressing these identified limitations in future model designs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could use this benchmark to prioritize temporal reasoning in training data for interactive AI.
  • Extending the approach to non-video modalities might reveal similar gaps in audio or text streaming scenarios.
  • Low scores suggest that architectural changes beyond current MLLM designs may be necessary for true real-time interaction.

Load-bearing premise

The selected 660 videos with human annotations are taken to represent the full range of real-world streaming scenarios without major selection bias or coverage issues.

What would settle it

Re-evaluating the same models on a significantly expanded and independently collected video dataset that yields substantially higher performance scores would challenge the finding of substantial limitations.

Figures

Figures reproduced from arXiv: 2605.17360 by Bokai Xu, Chaoqun He, Jie Zhou, Junbo Cui, Lijie Wen, Mingyang Xiang, Yingjing Xu, Yuan Yao.

Figure 1: Comparison between Omni-DuplexEval and offline evaluation paradigms. Offline settings…
Figure 2: (1) Counting (CT) assesses the model's capacity for incremental tallying and temporal consistency as it tracks the entry, exit, or occlusion of objects (e.g., fluctuating pedestrian counts) in a fluid scene. (2) Interaction Relation (IR) examines the model's understanding of the social or physical connections between multiple entities. It requires describing how people or objects interact as those relation…
Figure 2: Example of each task in Real-Time Description.
Figure 3: Example of each task in Proactive Reminder.
Figure 4: Overview of the dataset characteristics: (a) Distribution of video durations; (b) Distribution…
Figure 5: The automatic evaluation pipeline for Real-Time Description. The framework assesses two dimensions: Content Consistency for global quality, and Temporal Sensitivity for streaming alignment. The final score is computed as a weighted combination of the two.
    3.3.1 Real-Time Description. Real-Time Description requires models to generate continuous, streaming descriptions synchronized with evolving video content…
Figure 6: Example of model predictions in Real-Time Description.
    Models Excel at Perception but Struggle with Structured Reasoning. Fine-grained analysis reveals a clear gap between perception and reasoning abilities. While models perform relatively well on low-level tasks such as OCR and fine-grained motion (e.g., MiniCPM-o 4.5 achieves 68.6 on OCR), performance drops on tasks requiring structured reasoning. In p…
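
The Figure 5 caption says the final Real-Time Description score is a weighted combination of Content Consistency and Temporal Sensitivity, but the weights are not given in this excerpt. The sketch below shows the general shape of such a combination. The function name `combine_scores`, the default weight `w_content=0.5`, and the assumption that both inputs are already on the same scale are placeholders, not values from the paper.

```python
def combine_scores(content: float, temporal: float, w_content: float = 0.5) -> float:
    """Weighted combination of two scores that share the same scale.

    w_content is a placeholder: the weight used in the paper is not stated
    in this excerpt. The temporal score receives the remaining weight.
    """
    if not 0.0 <= w_content <= 1.0:
        raise ValueError("w_content must lie in [0, 1]")
    return w_content * content + (1.0 - w_content) * temporal


# Invented inputs: equal weights, then content weighted more heavily.
print(combine_scores(0.8, 0.4))
print(combine_scores(0.8, 0.4, w_content=0.75))
```

A convex combination like this keeps the final score within the range of its two inputs, which makes scores comparable across models as long as the same weight is used everywhere.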
Abstract

Real-time duplex interaction is essential for multimodal AI systems operating in real-world scenarios, where models must continuously process streaming inputs and respond at appropriate moments. However, most existing multimodal large language models (MLLMs) are evaluated in offline settings, where the entire video input is processed before any response is generated. While recent work has started to explore real-time duplex MLLMs, there is still no comprehensive benchmark or automatic evaluation method for this setting. To address this gap, we propose Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction. The benchmark consists of two complementary scenarios: (1) Real-Time Description, which evaluates the ability to generate continuous, time-aligned responses that track evolving multimodal inputs, and (2) Proactive Reminder, which evaluates the ability to identify salient events and respond at appropriate moments. Omni-DuplexEval contains 660 videos with fine-grained, human-annotated labels and precise temporal metadata, spanning 9 tasks grounded in real-world scenarios, where all questions are formulated as open-ended queries. We further introduce an automatic evaluation framework based on LLM-as-a-Judge, which enables systematic assessment by jointly evaluating response-content alignment and response timing through timestamp-aware and sequential reasoning, achieving strong alignment with human judgments. Experiments on state-of-the-art duplex MLLMs reveal substantial limitations. The best-performing model achieves only 39.6% overall, while scoring only 20.0% on Proactive Reminder. Our analysis identifies two key challenges: models struggle to balance timely responses with coherent, holistic content generation, and they often fail to determine both when to respond and what to produce. We hope our work facilitates further progress in MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Omni-DuplexEval, a benchmark for real-time duplex omni-modal interaction in MLLMs consisting of 660 videos with human-annotated temporal labels across 9 tasks in two scenarios: Real-Time Description (continuous time-aligned responses) and Proactive Reminder (identifying salient events and responding at appropriate moments). It proposes an LLM-as-a-Judge automatic evaluation framework that jointly assesses response content and timing via timestamp-aware reasoning, and reports strong alignment with human judgments. The best-performing SOTA duplex MLLM achieves only 39.6% overall and 20.0% on Proactive Reminder, and the authors identify challenges in balancing timely responses with coherent content generation.

Significance. If the benchmark videos and annotations are shown to be representative, this work fills a clear gap in evaluating streaming duplex capabilities that offline MLLM benchmarks overlook. The human-annotated temporal labels and reported alignment between the LLM judge and human judgments provide useful grounding for the performance claims. The identification of specific failure modes (timing vs. content trade-offs) offers concrete directions for future model development.

major comments (1)
  1. [Dataset construction / Experiments] The central claim that SOTA duplex MLLMs exhibit substantial limitations in real-time interaction rests on the 660-video benchmark being representative of diverse streaming dynamics (event salience, temporal density, multimodal asynchrony, task-specific patterns). The manuscript grounds the videos in 'real-world scenarios' but provides no quantitative diversity statistics, sourcing methodology, or coverage analysis across the nine tasks; without these, the low scores (39.6% overall, 20.0% on Proactive Reminder) could reflect benchmark construction rather than intrinsic model shortcomings.
minor comments (2)
  1. [Abstract / Evaluation framework] The abstract and evaluation section should report exact metrics used by the LLM judge, inter-annotator agreement statistics for the human labels, and any controls for judge-model bias to strengthen the claim of strong human alignment.
  2. [Benchmark design] Clarify how the nine tasks were selected and whether any overlap or redundancy exists between Real-Time Description and Proactive Reminder scenarios.
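
The first minor comment asks for inter-annotator agreement statistics. The excerpt does not say which statistic the authors would report; one common choice for two annotators is Cohen's kappa, sketched below on invented labels. The function name `cohens_kappa` and the example data are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented binary judgments (1 = reminder judged successful) on six items.
annotator_a = [1, 1, 0, 0, 1, 0]
annotator_b = [1, 0, 0, 0, 1, 1]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than a plain match rate when one label dominates.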

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully reviewed the major comment concerning the representativeness of the Omni-DuplexEval benchmark and provide a point-by-point response below. We believe addressing this point will further strengthen the paper.

Point-by-point responses
  1. Referee: The central claim that SOTA duplex MLLMs exhibit substantial limitations in real-time interaction rests on the 660-video benchmark being representative of diverse streaming dynamics (event salience, temporal density, multimodal asynchrony, task-specific patterns). The manuscript grounds the videos in 'real-world scenarios' but provides no quantitative diversity statistics, sourcing methodology, or coverage analysis across the nine tasks; without these, the low scores (39.6% overall, 20.0% on Proactive Reminder) could reflect benchmark construction rather than intrinsic model shortcomings.

    Authors: We agree that explicit quantitative evidence of diversity is necessary to robustly support the claim that the observed performance limitations are intrinsic to current models rather than artifacts of benchmark construction. While the manuscript describes the 660 videos as spanning 9 tasks grounded in real-world scenarios with human-annotated temporal labels, it does not include the requested statistics or methodology details. In the revised manuscript, we will add a dedicated subsection under Dataset Construction that reports: (1) sourcing methodology, including selection criteria from diverse real-world video sources to ensure variation in lengths, event densities, and multimodal asynchrony; (2) quantitative diversity statistics such as distributions and summary metrics for event salience, temporal density, and asynchrony measures across the full set; and (3) coverage analysis with per-task video counts and percentages to confirm balanced representation. These additions will directly address the concern and provide stronger grounding for the reported scores. revision: yes
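
The coverage analysis promised in the rebuttal (per-task video counts and percentages) could be computed along the following lines. This is a sketch under stated assumptions: the function name `task_coverage` is hypothetical, the labels are invented, and only the abbreviations CT and IR come from the Figure 2 text; "OTHER" is a placeholder for the remaining tasks.

```python
from collections import Counter


def task_coverage(task_labels):
    """Return {task: (count, percent)} for a list of per-video task labels."""
    counts = Counter(task_labels)
    total = sum(counts.values())
    return {task: (n, 100.0 * n / total) for task, n in sorted(counts.items())}


# Invented labels for ten videos, not the real benchmark distribution.
labels = ["CT"] * 4 + ["IR"] * 3 + ["OTHER"] * 3
coverage = task_coverage(labels)
for task, (n, pct) in coverage.items():
    print(f"{task}: {n} videos ({pct:.1f}%)")
```

Reporting both counts and percentages makes it easy to spot a task that is under-represented among the 660 videos.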

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or evaluation

full rationale

The paper introduces Omni-DuplexEval as a new benchmark with 660 human-annotated videos spanning 9 tasks and proposes an LLM-as-a-Judge framework whose alignment with human judgments is presented as an external validation step. The reported model scores (39.6% overall, 20.0% on Proactive Reminder) are direct empirical measurements obtained by running existing SOTA duplex MLLMs on this benchmark; they are not fitted parameters, self-defined quantities, or predictions that reduce to the benchmark inputs by construction. The argument contains no load-bearing self-citation, smuggled ansatz, or uniqueness theorem imported from the authors' prior work. The central claim of model limitations follows from applying the benchmark experimentally, and it is checked against external human judgments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is an empirical benchmark rather than a theoretical derivation, so the ledger contains only standard assumptions about annotation quality and judge reliability with no free parameters or invented entities.

axioms (1)
  • domain assumption: Human annotations provide reliable ground truth for response timing and content in open-ended video tasks.
    The benchmark and LLM-judge validation rest on this without further justification in the abstract.

pith-pipeline@v0.9.0 · 5856 in / 1150 out tokens · 39105 ms · 2026-05-20T13:32:43.593258+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Omni-DuplexEval contains 660 videos with fine-grained, human-annotated labels and precise temporal metadata, spanning 9 tasks... automatic evaluation framework based on LLM-as-a-Judge... jointly evaluating response-content alignment and response timing through timestamp-aware and sequential reasoning

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

A Detailed Evaluation Protocols

This section provides the complete evaluation protocols for the Real-Time Description and Proactive Reminder. A.1 Content C...

  • The evaluator starts from a perfect score of 3.00.
  • For each error identified, a specific penalty is deducted according to Table 5.
  • The final score is the maximum of the calculated result and 0.01, unless the response is completely empty or entirely irrelevant, in which case the score is 0.00.

A.1.2 Penalty Table

Table 5: Content Consistency Penalty Values

    Error Category | Severity | Penalty
    Critical Factual Error (wrong object/action/color/count) | High | -1.00
    Critical Factual Error (partiall...

  • Deduct penalties for each error
  • Output ONLY JSON with "content_score" and "content_reasoning"

A.2 Temporal Sensitivity

Temporal Sensitivity measures the alignment between the model-generated text and the video's temporal windows: specifically, whether the model describes the corresponding video content at the appropriate time.

A.2.1 Evaluation Process

The metric evaluates a timestampe...

  • Clearly refer to the target event described in the instruction
  • Express an intention to remind or inform that the event has occurred
  • Not be vague or unrelated to the event
  • If the output is ambiguous, misidentifies the event, or does not mention the event, it is considered a failure.

Scoring:
  • 1 = Successful reminder (explicitly mentions the event and completes the reminder)
  • 0 = Unsuccessful reminder (vague / incorrect / event not mentioned)

Output Format: Only output JSON: { "success_score": <0 or 1>, "reasoning": "<expl...

  • Compare the user instruction with the ground truth answer to identify the error(s)
  • Check whether the model output corrects these error(s) consistent with the ground truth
  • The correction must maintain correct context (e.g., subject, object) consistent with both instruction and answer
  • Extra information unrelated to correction should be ignored, unless it contradicts the instruction or answer.

Scoring:
  • 1 = Successful correction (all errors corrected with consistent context)
  • 0 = Unsuccessful correction (missing errors, inconsistent correction, or context mismatch)

Output Format: Only output JSON: { "success_score": <0 or 1>, "reasoni...
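
The two scoring rules quoted from the appendix can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code. Only the first row of Table 5 is legible in this excerpt, so the `PENALTIES` table has a single entry under a hypothetical key name. The function names, and the assumption that the judge's reply is a bare JSON object, are also hypothetical.

```python
import json

# Only the first Table 5 row survives in this excerpt; the key name is hypothetical.
PENALTIES = {"critical_factual_error_high": 1.00}


def content_consistency_score(errors, empty_or_irrelevant=False):
    """Start from 3.00, subtract one penalty per error, and floor at 0.01.

    A completely empty or entirely irrelevant response scores 0.00.
    """
    if empty_or_irrelevant:
        return 0.00
    score = 3.00 - sum(PENALTIES[e] for e in errors)
    return max(score, 0.01)


def parse_success_score(judge_output: str) -> int:
    """Read the binary success_score from a judge reply that is a bare JSON object."""
    score = json.loads(judge_output)["success_score"]
    if isinstance(score, bool) or score not in (0, 1):
        raise ValueError(f"success_score must be 0 or 1, got {score!r}")
    return score


# Invented inputs for illustration.
print(content_consistency_score([]))
print(content_consistency_score(["critical_factual_error_high"] * 4))
print(parse_success_score('{"success_score": 1, "reasoning": "mentions the event"}'))
```

The 0.01 floor distinguishes a response that was attempted but heavily penalized from one that was empty or irrelevant, which is the only case scored exactly 0.00.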