Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction
Pith reviewed 2026-05-20 13:32 UTC · model grok-4.3
The pith
The Omni-DuplexEval benchmark shows that the best state-of-the-art duplex MLLM reaches only 39.6 percent overall in real-time interaction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce Omni-DuplexEval, a benchmark of 660 videos with fine-grained annotations across nine real-world tasks, evaluated via an LLM-as-a-Judge framework that assesses both response content and timing. The benchmark reveals substantial limitations in current duplex MLLMs: the best model reaches only 39.6 percent overall and 20.0 percent on the Proactive Reminder scenario.
What carries the argument
Omni-DuplexEval benchmark with Real-Time Description and Proactive Reminder scenarios, using LLM-as-a-Judge for timestamp-aware evaluation of open-ended queries on 660 annotated videos.
If this is right
- Models need to better balance timely responses with holistic content generation in streaming settings.
- Determining both when to respond and what content to produce remains a core challenge for MLLMs.
- The automatic evaluation framework offers a scalable way to assess response timing and content, and is reported to align strongly with human judgments.
- Progress in real-time duplex capabilities will require addressing these identified limitations in future model designs.
Where Pith is reading between the lines
- Developers could use this benchmark to prioritize temporal reasoning in training data for interactive AI.
- Extending the approach to non-video modalities might reveal similar gaps in audio or text streaming scenarios.
- Low scores suggest that architectural changes beyond current MLLM designs may be necessary for true real-time interaction.
Load-bearing premise
The selected 660 videos with human annotations are taken to represent the full range of real-world streaming scenarios without major selection bias or coverage issues.
What would settle it
If the same models scored substantially higher on a significantly expanded, independently collected video dataset, that result would challenge the finding of substantial limitations.
Original abstract
Real-time duplex interaction is essential for multimodal AI systems operating in real-world scenarios, where models must continuously process streaming inputs and respond at appropriate moments. However, most existing multimodal large language models (MLLMs) are evaluated in offline settings, where the entire video input is processed before any response is generated. While recent work has started to explore real-time duplex MLLMs, there is still no comprehensive benchmark or automatic evaluation method for this setting. To address this gap, we propose Omni-DuplexEval, a benchmark for systematically evaluating real-time duplex interaction. The benchmark consists of two complementary scenarios: (1) Real-Time Description, which evaluates the ability to generate continuous, time-aligned responses that track evolving multimodal inputs, and (2) Proactive Reminder, which evaluates the ability to identify salient events and respond at appropriate moments. Omni-DuplexEval contains 660 videos with fine-grained, human-annotated labels and precise temporal metadata, spanning 9 tasks grounded in real-world scenarios, where all questions are formulated as open-ended queries. We further introduce an automatic evaluation framework based on LLM-as-a-Judge, which enables systematic assessment by jointly evaluating response-content alignment and response timing through timestamp-aware and sequential reasoning, achieving strong alignment with human judgments. Experiments on state-of-the-art duplex MLLMs reveal substantial limitations. The best-performing model achieves only 39.6% overall, while scoring only 20.0% on Proactive Reminder. Our analysis identifies two key challenges: models struggle to balance timely responses with coherent, holistic content generation, and they often fail to determine both when to respond and what to produce. We hope our work facilitates further progress in MLLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Omni-DuplexEval, a benchmark for real-time duplex omni-modal interaction in MLLMs consisting of 660 videos with human-annotated temporal labels across 9 tasks in two scenarios: Real-Time Description (continuous time-aligned responses) and Proactive Reminder (identifying salient events and responding at appropriate moments). It proposes an LLM-as-a-Judge automatic evaluation framework that jointly assesses response content and timing via timestamp-aware reasoning, and reports strong alignment with human judgments. The best-performing SOTA duplex MLLM achieves only 39.6% overall and 20.0% on Proactive Reminder, and the analysis identifies challenges in balancing timely responses with coherent content generation.
Significance. If the benchmark videos and annotations are shown to be representative, this work fills a clear gap in evaluating streaming duplex capabilities that offline MLLM benchmarks overlook. The human-annotated temporal labels and reported alignment between the LLM judge and human judgments provide useful grounding for the performance claims. The identification of specific failure modes (timing vs. content trade-offs) offers concrete directions for future model development.
major comments (1)
- [Dataset construction / Experiments] The central claim that SOTA duplex MLLMs exhibit substantial limitations in real-time interaction rests on the 660-video benchmark being representative of diverse streaming dynamics (event salience, temporal density, multimodal asynchrony, task-specific patterns). The manuscript grounds the videos in 'real-world scenarios' but provides no quantitative diversity statistics, sourcing methodology, or coverage analysis across the nine tasks; without these, the low scores (39.6% overall, 20.0% on Proactive Reminder) could reflect benchmark construction rather than intrinsic model shortcomings.
minor comments (2)
- [Abstract / Evaluation framework] The abstract and evaluation section should report exact metrics used by the LLM judge, inter-annotator agreement statistics for the human labels, and any controls for judge-model bias to strengthen the claim of strong human alignment.
- [Benchmark design] Clarify how the nine tasks were selected and whether any overlap or redundancy exists between Real-Time Description and Proactive Reminder scenarios.
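To make the request for agreement statistics concrete: one common chance-corrected measure is Cohen's kappa, which could be reported between two human annotators or between the LLM judge and a human rater. The material summarized here does not say which statistic the paper uses, so the sketch below is illustrative only. The function name `cohen_kappa` and the toy labels are assumptions, not the authors' code or data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's label
    frequencies.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty item set")

    n = len(rater_a)
    # Fraction of items on which the two raters give the same label.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary success labels (1 = successful reminder, 0 = not).
human = [1, 1, 0, 0]
judge = [1, 0, 1, 0]
print(cohen_kappa(human, human))  # identical ratings
print(cohen_kappa(human, judge))  # agreement at chance level
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance; reporting it per task would also show where the judge and humans diverge.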
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully reviewed the major comment concerning the representativeness of the Omni-DuplexEval benchmark and provide a point-by-point response below. We believe addressing this point will further strengthen the paper.
Point-by-point responses
Referee: The central claim that SOTA duplex MLLMs exhibit substantial limitations in real-time interaction rests on the 660-video benchmark being representative of diverse streaming dynamics (event salience, temporal density, multimodal asynchrony, task-specific patterns). The manuscript grounds the videos in 'real-world scenarios' but provides no quantitative diversity statistics, sourcing methodology, or coverage analysis across the nine tasks; without these, the low scores (39.6% overall, 20.0% on Proactive Reminder) could reflect benchmark construction rather than intrinsic model shortcomings.
Authors: We agree that explicit quantitative evidence of diversity is necessary to robustly support the claim that the observed performance limitations are intrinsic to current models rather than artifacts of benchmark construction. While the manuscript describes the 660 videos as spanning 9 tasks grounded in real-world scenarios with human-annotated temporal labels, it does not include the requested statistics or methodology details. In the revised manuscript, we will add a dedicated subsection under Dataset Construction that reports: (1) sourcing methodology, including selection criteria from diverse real-world video sources to ensure variation in lengths, event densities, and multimodal asynchrony; (2) quantitative diversity statistics such as distributions and summary metrics for event salience, temporal density, and asynchrony measures across the full set; and (3) coverage analysis with per-task video counts and percentages to confirm balanced representation. These additions will directly address the concern and provide stronger grounding for the reported scores.
Revision: yes
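The coverage analysis promised in item (3), per-task video counts and percentages, is simple to compute once each video carries a task label. The following is a minimal sketch under that assumption. The function name `task_coverage` and the task labels are hypothetical, since the nine task names are not listed in the material reviewed here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def task_coverage(task_labels: Iterable[str]) -> Dict[str, Tuple[int, float]]:
    """Map each task name to (video count, percentage of all videos)."""
    counts = Counter(task_labels)
    total = sum(counts.values())
    # Sort by task name so the report is stable across runs.
    return {
        task: (n, round(100.0 * n / total, 1))
        for task, n in sorted(counts.items())
    }


# Hypothetical labels, one per video; the real benchmark has 660 videos.
labels = ["task_a"] * 3 + ["task_b"] * 1
for task, (count, pct) in task_coverage(labels).items():
    print(f"{task}: {count} videos ({pct}%)")
```

A table of this form makes imbalance visible at a glance, for example if one of the nine tasks accounts for a disproportionate share of the Proactive Reminder videos.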
Circularity Check
No significant circularity in benchmark construction or evaluation
The paper introduces Omni-DuplexEval as a new benchmark with 660 human-annotated videos spanning 9 tasks and proposes an LLM-as-a-Judge framework whose alignment with human judgments is presented as an external validation step. The reported model scores (39.6% overall, 20.0% on Proactive Reminder) are direct empirical measurements obtained by running existing SOTA duplex MLLMs on this benchmark, not fitted parameters, self-defined quantities, or predictions that reduce to the benchmark inputs by construction. No load-bearing self-citation, smuggled ansatz, or uniqueness theorem imported from the authors' prior work appears in the derivation. The central claim of model limitations follows from the experimental application of the benchmark, which is validated against external human judgments.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human annotations provide reliable ground truth for response timing and content in open-ended video tasks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction [unclear]
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Omni-DuplexEval contains 660 videos with fine-grained, human-annotated labels and precise temporal metadata, spanning 9 tasks... automatic evaluation framework based on LLM-as-a-Judge... jointly evaluating response-content alignment and response timing through timestamp-aware and sequential reasoning"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Detailed Evaluation Protocols
This section provides the complete evaluation protocols for the Real-Time Description and Proactive Reminder.
A.1 Content C...
- The evaluator starts from a perfect score of 3.00.
- For each error identified, a specific penalty is deducted according to Table 5.
- The final score is the maximum of the calculated result and 0.01, unless the response is completely empty or entirely irrelevant, in which case the score is 0.00.
A.1.2 Penalty Table
Table 5: Content Consistency Penalty Values
Error Category | Severity | Penalty
Critical Factual Error (wrong object/action/color/count) | High | -1.00
Critical Factual Error (partiall...
- Deduct penalties for each error
- Output ONLY JSON with "content_score" and "content_reasoning"
A.2 Temporal Sensitivity
Temporal Sensitivity measures the alignment between the model-generated text and the video's temporal windows, specifically whether the model describes the corresponding video content at the appropriate time.
A.2.1 Evaluation Process
The metric evaluates a timestampe...
- Clearly refer to the target event described in the instruction
- Express an intention to remind or inform that the event has occurred
- Not be vague or unrelated to the event
- If the output is ambiguous, misidentifies the event, or does not mention the event, it is considered a failure.
Scoring:
- 1 = Successful reminder (explicitly mentions the event and completes the reminder)
- 0 = Unsuccessful reminder (vague / incorrect / event not mentioned)
Output Format: Only output JSON: { "success_score": <0 or 1>, "reasoning": "<expl...
- Compare the user instruction with the ground truth answer to identify the error(s)
- Check whether the model output corrects these error(s) consistent with the ground truth
- The correction must maintain correct context (e.g., subject, object) consistent with both instruction and answer
- Extra information unrelated to correction should be ignored, unless it contradicts the instruction or answer.
Scoring:
- 1 = Successful correction (all errors corrected with consistent context)
- 0 = Unsuccessful correction (missing errors, inconsistent correction, or context mismatch)
Output Format: Only output JSON: { "success_score": <0 or 1>, "reasoni...
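The Content Consistency rule quoted in the appendix excerpt (start from 3.00, deduct a penalty per identified error, floor the result at 0.01 unless the response is empty or entirely irrelevant, in which case the score is 0.00) and the binary judge output with a "success_score" field can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and only the single -1.00 penalty visible in the truncated Table 5 is used.

```python
import json
from typing import Iterable

PERFECT_SCORE = 3.00  # starting score stated in the protocol
SCORE_FLOOR = 0.01    # minimum for a non-empty, relevant response


def content_consistency_score(
    penalties: Iterable[float], empty_or_irrelevant: bool = False
) -> float:
    """Apply the deduct-from-perfect rule described in the protocol.

    Penalties may be given as negative values (as in Table 5) or as
    positive magnitudes; only their absolute value is deducted.
    """
    if empty_or_irrelevant:
        return 0.00
    remaining = PERFECT_SCORE - sum(abs(p) for p in penalties)
    return round(max(remaining, SCORE_FLOOR), 2)


def parse_success_score(raw_judge_output: str) -> int:
    """Read the binary "success_score" from a judge's JSON reply."""
    reply = json.loads(raw_judge_output)
    score = reply["success_score"]
    # Reject booleans and anything other than the integers 0 or 1.
    if type(score) is not int or score not in (0, 1):
        raise ValueError(f"success_score must be 0 or 1, got {score!r}")
    return score


# One high-severity factual error (-1.00 in Table 5).
print(content_consistency_score([-1.00]))
# Many errors: the score is floored rather than going negative.
print(content_consistency_score([-1.00] * 4))
# Hypothetical judge reply for a Proactive Reminder item.
print(parse_success_score('{"success_score": 1, "reasoning": "mentions the event"}'))
```

Validating the judge's JSON strictly, instead of coercing malformed replies to 0, keeps judge formatting failures separate from genuine model failures when scores are aggregated.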