pith. sign in

arxiv: 2605.30256 · v1 · pith:JVPKRD5Vnew · submitted 2026-05-28 · 💻 cs.CV · cs.CL· cs.HC

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents

Pith reviewed 2026-06-29 08:26 UTC · model grok-4.3

classification 💻 cs.CV cs.CLcs.HC
keywords full-duplex conversationaudiovisual agentsvision-speech modelsnonverbal cuesmultimodal benchmarkdyadic interactionconversational evaluation
0
0 comments X

The pith

Current vision-speech agents fail at full-duplex audiovisual conversation by treating vision as optional except for explicit questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VideoFDB as the first benchmark for full-duplex audio-visual-to-audio-visual conversational agents. It supplies 237 real dyadic clips covering nonverbal dynamics, a taxonomy that separates perception from generation, and a rubric-based LM-as-judge framework. Tests across open- and closed-source models reveal consistent failures: captioning collapse, visual-stream ignorance, and inability to perform streaming joint audiovisual grounding. Cascaded speech-to-avatar pipelines are shown to be structurally unable to produce simultaneous nonverbal cues. The work positions the benchmark as a foundation for measuring and improving multimodal conversational systems.

Core claim

VideoFDB establishes that existing vision-speech agents exploit vision for explicit visual question answering but not for the streaming joint audiovisual grounding required in natural conversation. Systematic failure modes include captioning collapse and visual-stream ignorance. Cascaded architectures fundamentally preclude production of full-duplex nonverbal cues.

What carries the argument

VideoFDB benchmark with 237 dyadic clips from video calls, a perception-generation taxonomy, and rubric-based LM-as-judge evaluation of nonverbal conversational dynamics.

If this is right

  • Agents require continuous visual interpretation during speech rather than vision as an on-demand module.
  • Cascaded speech-to-avatar designs cannot support simultaneous nonverbal production and must be replaced by integrated architectures.
  • Evaluation of conversational agents must shift from isolated visual QA tasks to joint audiovisual streaming scenarios.
  • Development of next-generation multimodal agents can now be tracked against concrete failure modes identified in the benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models trained end-to-end on synchronized audiovisual streams may overcome the observed visual-stream ignorance.
  • Extending the benchmark to measure cultural variation in nonverbal dynamics could expose additional failure modes.
  • Combining VideoFDB scores with existing speech-only full-duplex benchmarks might reveal whether audio failures compound visual ones.
  • Live deployment tests with the same clips could serve as an external check on the LM judge's reliability.

Load-bearing premise

The 237 dyadic clips and the rubric-based LM-as-judge framework are assumed to provide a representative and reliable measure of full-duplex AV2AV conversational quality without further human validation or inter-rater checks.

What would settle it

A new agent that scores high on the VideoFDB rubrics yet produces mismatched nonverbal timing and responses when tested in live human video calls would falsify the claim that the benchmark captures required capabilities.

Figures

Figures reproduced from arXiv: 2605.30256 by Amrita Mazumdar, Julia Wang, Koki Nagano, Nikhil Srihari, Rajarshi Roy, Seonwook Park, Shalini De Mello, Shengze Wang, Yuhao Zhou.

Figure 1
Figure 1. Figure 1: VideoFDB (a) curates evaluation samples from natural conversations, where visual cues [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Annotation workflow. Human annotators select and curate clips over multiple passes [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Evaluation flow. We feed agent-generated audio and/or video with annotated captions into [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Mini-Omni2 [51] and VITA-1.5 [14] often respond to vision-speech input with visual captions or blanket disclaimers. We find that no model achieves simultaneous AV2A gains over its audio-only counterpart on all perception axes. Our results indicate current systems primarily use visual input for explicit vi￾sual question answering rather than for robust joint audiovisual grounding in natural conversation. As… view at source ↗
read the original abstract

Natural human conversation is full-duplex and audio-visual: people simultaneously speak and listen while continuously interpreting and producing nonverbal cues, such as nods, smiles, and gestures. To support successful human-agent interaction, agents must model full-duplex audiovisual conversation; however, existing full-duplex benchmarks evaluate only speech. In this work, we present VideoFDB, the first benchmark to evaluate full-duplex audio-visual-to-audio-visual (AV2AV) conversational agents. VideoFDB contributes (i) 237 dyadic clips spanning 11 nonverbal conversational dynamics from real-world video calls, (ii) a taxonomy separating perception from generation behaviors, and (iii) a rubric-based LM-as-judge evaluation framework with interpretable axes for assessing conversational quality with respect to nonverbal conversational dynamics. Across open- and closed-source vision-speech agents, we find systematic failure modes: captioning collapse and visual-stream ignorance, and we show that current systems exploit vision for explicit visual question answering but not for the streaming joint audiovisual grounding required in natural conversation. We further evaluate cascaded speech-to-avatar systems and find that their architecture fundamentally precludes the production of full-duplex nonverbal cues. As the first benchmark for full-duplex AV2AV interaction, VideoFDB establishes a foundation for systematic evaluation and, we hope, will accelerate the advancement and development of next-generation multimodal conversational agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VideoFDB, the first benchmark for full-duplex AV2AV conversational agents. It contributes 237 dyadic clips spanning 11 nonverbal dynamics from real-world video calls, a taxonomy separating perception from generation behaviors, and a rubric-based LM-as-judge evaluation framework. Across open- and closed-source vision-speech agents, it reports systematic failure modes including captioning collapse and visual-stream ignorance, while also evaluating cascaded speech-to-avatar systems.

Significance. If the evaluation framework proves reliable, VideoFDB would be a significant contribution as the first dedicated benchmark for full-duplex audiovisual conversation, highlighting gaps between explicit VQA use of vision and streaming joint audiovisual grounding. The taxonomy and real-world clips provide a useful foundation for future work on multimodal agents.

major comments (2)
  1. [Evaluation Framework] Evaluation Framework section: The claims of systematic captioning collapse and visual-stream ignorance across agents rest entirely on the rubric-based LM-as-judge correctly assessing the 11 nonverbal dynamics in the 237 clips. No human validation, inter-rater agreement statistics, or calibration of the LM judge against human raters is reported, which is load-bearing for interpreting all failure-mode results.
  2. [Dataset Construction] Dataset section: The manuscript does not specify clip selection criteria or inter-annotator agreement for assigning the 11 dynamics to the 237 dyadic clips. This undermines assessment of whether the benchmark is representative for the reported failure modes.
minor comments (2)
  1. [Abstract] The abstract states quantitative findings on failure modes but the results section would benefit from explicit tables or metrics reporting the per-dynamics scores.
  2. [Taxonomy] Ensure the taxonomy is illustrated with concrete examples of perception vs. generation behaviors in a table or figure for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on VideoFDB. We address each major comment below and outline planned revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Evaluation Framework] Evaluation Framework section: The claims of systematic captioning collapse and visual-stream ignorance across agents rest entirely on the rubric-based LM-as-judge correctly assessing the 11 nonverbal dynamics in the 237 clips. No human validation, inter-rater agreement statistics, or calibration of the LM judge against human raters is reported, which is load-bearing for interpreting all failure-mode results.

    Authors: We agree that human validation of the LM-as-judge is essential for establishing the reliability of the reported failure modes. The original submission did not include such calibration. In the revised manuscript we will add a human validation study on a representative subset of clips, reporting inter-rater agreement (e.g., Cohen's kappa) and agreement between the LM judge and human raters across the 11 dynamics. revision: yes

  2. Referee: [Dataset Construction] Dataset section: The manuscript does not specify clip selection criteria or inter-annotator agreement for assigning the 11 dynamics to the 237 dyadic clips. This undermines assessment of whether the benchmark is representative for the reported failure modes.

    Authors: We acknowledge that explicit clip selection criteria and inter-annotator agreement were not reported. The revised manuscript will include a new subsection detailing the selection process from real-world video calls (e.g., duration, quality, and diversity filters) and will report inter-annotator agreement statistics for the labeling of the 11 nonverbal dynamics. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark construction and empirical evaluation contain no derivations or self-referential reductions

full rationale

The paper introduces a benchmark (237 clips, taxonomy of 11 dynamics, rubric-based LM-as-judge) and applies it to existing agents to report failure modes. No equations, first-principles derivations, fitted parameters, or predictions appear anywhere in the manuscript. Claims rest on direct empirical measurement via the new benchmark rather than any chain that reduces to its own inputs by construction. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz. This is a standard benchmark paper whose central contribution is the dataset and evaluation protocol itself; no circularity patterns from the enumerated list apply.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper with no mathematical derivations, fitted parameters, or postulated entities.

pith-pipeline@v0.9.1-grok · 5811 in / 1023 out tokens · 22341 ms · 2026-06-29T08:26:43.947394+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 19 canonical work pages · 8 internal anchors

  1. [1]

    Parakeet tdt 0.6b v2 (en).https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2, 2024

  2. [2]

    Effects of visibility between speaker and listener on gesture production: Some gestures are meant to be seen.Journal of Memory and Language, 44(2): 169–188, 2001

    Martha W Alibali, Dana C Heath, and Heather J Myers. Effects of visibility between speaker and listener on gesture production: Some gestures are meant to be seen.Journal of Memory and Language, 44(2): 169–188, 2001

  3. [3]

    Vqa: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015

  4. [4]

    Anam cara-3: Why ai needs a face

    Ben Carr. Anam cara-3: Why ai needs a face. https://anam.ai/blog/ cara-3-interactive-avatars, feb 2026. Accessed: 2026-05-05

  5. [5]

    Avere: Improving audiovisual emotion reasoning with preference optimization

    Ashutosh Chaubey, Jiacheng Pang, Maksim Siniukov, and Mohammad Soleymani. Avere: Improving audiovisual emotion reasoning with preference optimization. InInternational Conference on Learning Representations (ICLR), 2026. URLhttps://openreview.net/forum?id=td682AAuPr

  6. [6]

    Talking-head generation with rhythmic head motion

    Lele Chen, Guofeng Cui, Celong Liu, Zhong Li, Ziyi Kou, Yi Xu, and Chenliang Xu. Talking-head generation with rhythmic head motion. InEuropean conference on computer vision, pages 35–51. Springer, 2020

  7. [7]

    Savvy: Spatial awareness via audio-visual llms through seeing and hearing.NeurIPS, 2025

    Mingfei Chen, Zijun Cui, Xiulong Liu, Jinlin Xiang, Caleb Zheng, Jingyuan Li, and Eli Shlizerman. Savvy: Spatial awareness via audio-visual llms through seeing and hearing.NeurIPS, 2025

  8. [8]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  9. [9]

    MiniCPM-o 4.5: Towards real-time full-duplex omni-modal interaction, 2025

    Junbo Cui, Bokai Xu, Chongyi Wang, Tianyu Yu, Weiyue Sun, Yingjing Xu, Tianran Wang, Zhihui He, Wenshuo Ma, Tianchi Cai, Jiancheng Gui, Luoyuan Zhang, Xian Sun, Fuwei Huang, Moye Chen, Changlin Liu, Hanyu Liu, Ziyang Wang, Qingxin Gui, Qingzhe Han, Yuyang Wen, Huiping Liu, Rongkang Wang, Yaqi Zhang, Hongliang Wei, Chi Chen, You Li, Kechen Fang, Jie Zhou, ...

  10. [10]

    Moshi: a speech-text foundation model for real-time dialogue

    Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037, 2024

  11. [11]

    Pearson Education, Inc, 16th edition, 01 2019

    Joseph A DeVito.The Interpersonal Communication Book. Pearson Education, Inc, 16th edition, 01 2019

  12. [12]

    Kling-avatar: Grounding multimodal instructions for cascaded long- duration avatar animation synthesis.arXiv preprint arXiv:2509.09595, 2025

    Yikang Ding, Jiwen Liu, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, et al. Kling-avatar: Grounding multimodal instructions for cascaded long- duration avatar animation synthesis.arXiv preprint arXiv:2509.09595, 2025

  13. [13]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025. 10

  14. [14]

    VITA-1.5: Towards GPT-4o level real-time vision and speech interaction

    Chaoyou Fu, Haojia Lin, Xiong Wang, YiFan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long MA, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, and Ran He. VITA-1.5: Towards GPT-4o level real-time vision and speech interaction. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://...

  15. [15]

    Charles Goodwin.Conversational Organization: Interaction Between Speakers and Hearers. 01 1981

  16. [16]

    Gemini live api

    Google DeepMind. Gemini live api. https://ai.google.dev/gemini-api/docs/live-api, 2025. Accessed: 2026-04-30

  17. [17]

    Salm-duplex: Efficient and direct duplex modeling for speech-to-speech language model.Interspeech, 2025

    Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr ˙Zelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, and Boris Ginsburg. Salm-duplex: Efficient and direct duplex modeling for speech-to-speech language model.Interspeech, 2025

  18. [18]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  19. [19]

    Keyframe: Persona-1-live

    Keyframe. Keyframe: Persona-1-live. https://www.keyframelabs.com/blog/persona-1-live, may 2026. Accessed: 2026-05-05

  20. [20]

    A guideline of selecting and reporting intraclass correlation coefficients for reliability research.Journal of chiropractic medicine, 15 2:155–63, 2016

    Terry K Koo and Ma Li. A guideline of selecting and reporting intraclass correlation coefficients for reliability research.Journal of chiropractic medicine, 15 2:155–63, 2016. URL https://api. semanticscholar.org/CorpusID:1837377

  21. [21]

    Perceptions as hypotheses: Saccades as experiments.Frontiers in Psychology, V olume 3 - 2012, 2012

    Stephen C. Levinson and Francisco Torreira. Timing in turn-taking and its implications for processing models of language.Frontiers in Psychology, V olume 6 - 2015, 2015. ISSN 1664-1078. doi: 10.3389/fpsyg. 2015.00731. URL https://www.frontiersin.org/journals/psychology/articles/10.3389/ fpsyg.2015.00731

  22. [22]

    Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models

    Guan-Ting Lin, Shih-Yun Shan Kuan, Qirui Wang, Jiachen Lian, Tingle Li, and Hung-yi Lee. Full-duplex- bench v1. 5: Evaluating overlap handling for full-duplex speech models.arXiv preprint arXiv:2507.23159, 2025

  23. [23]

    Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities.arXiv preprint arXiv:2503.04721, 2025

    Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H Liu, and Hung- yi Lee. Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities.arXiv preprint arXiv:2503.04721, 2025

  24. [24]

    Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner

    Guan-Ting Lin, Shih-Yun Shan Kuan, Jiatong Shi, Kai-Wei Chang, Siddhant Arora, Shinji Watanabe, and Hung-yi Lee. Full-duplex-bench-v2: A multi-turn evaluation framework for duplex dialogue systems with an automated examiner.arXiv preprint arXiv:2510.07838, 2026

  25. [25]

    Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis

    Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, and Bo Zheng. Beat: A large-scale semantic and emotional multi-modal dataset for conversational gestures synthesis. InEuropean conference on computer vision, pages 612–630. Springer, 2022

  26. [26]

    Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling

    Haiyang Liu, Zihao Zhu, Giorgio Becherini, Yichen Peng, Mingyang Su, You Zhou, Xuefei Zhe, Naoya Iwamoto, Bo Zheng, and Michael J Black. Emage: Towards unified holistic co-speech gesture generation via expressive masked audio gesture modeling. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1144–1154, 2024

  27. [27]

    Emo-reasoning: Benchmarking emotional reasoning capabilities in spoken dialogue systems

    Jingwen Liu, Kan Jen Cheng, Jiachen Lian, Akshay Anand, Rishi Jain, Faith Qiao, Robin Netzorg, Huang- Cheng Chou, Tingle Li, Guan-Ting Lin, and Gopala Anumanchipalli. Emo-reasoning: Benchmarking emotional reasoning capabilities in spoken dialogue systems. 2025

  28. [28]

    Livekit cloud

    LiveKit, Inc. Livekit cloud. https://docs.livekit.io/intro/cloud/, 2026. Fully managed plat- form for building, hosting, and operating AI agent applications at scale

  29. [29]

    Omniresponse: Online multimodal conversational response generation in dyadic interactions.NeurIPS, 2025

    Cheng Luo, Jianghui Wang, Bing Li, Siyang Song, and Bernard Ghanem. Omniresponse: Online multimodal conversational response generation in dyadic interactions.NeurIPS, 2025

  30. [30]

    Antje S. Meyer. Timing in conversation.Journal of Cognition, Apr 2023. doi: 10.5334/joc.268

  31. [31]

    Examining gesture production in the presence of communication challenges.JoVE (Journal of Visualized Experiments), (203):e66256, 2024

    Laura M Morett. Examining gesture production in the presence of communication challenges.JoVE (Journal of Visualized Experiments), (203):e66256, 2024

  32. [32]

    See, hear, and understand: Benchmarking audiovisual human speech understanding in multimodal large language models.CVPR Findings, 2026

    Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, et al. See, hear, and understand: Benchmarking audiovisual human speech understanding in multimodal large language models.CVPR Findings, 2026. 11

  33. [33]

    Generative spoken dialogue language modeling.Transactions of the Association for Computational Linguistics, 11:250–266, 2023

    Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, et al. Generative spoken dialogue language modeling.Transactions of the Association for Computational Linguistics, 11:250–266, 2023

  34. [34]

    Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

    NVIDIA, :, Amala Sanjay Deshmukh, Kateryna Chumachenko, Tuomas Rintamaki, Matthieu Le, Tyler Poon, Danial Mohseni Taheri, Ilia Karmanov, Guilin Liu, Jarno Seppanen, Arushi Goel, Mike Ranzinger, Greg Heinrich, Guo Chen, Lukas V oegtle, Philipp Fischer, Timo Roman, Karan Sapra, Collin McCarthy, Shaokun Zhang, Fuxiao Liu, Hanrong Ye, Yi Dong, Mingjie Liu, Yi...

  35. [35]

    Openai realtime api

    OpenAI. Openai realtime api. https://developers.openai.com/api/docs/guides/realtime,

  36. [36]

    Accessed: 2026-04-30

  37. [37]

    Fd- bench: A full-duplex benchmarking pipeline designed for full duplex spoken dialogue systems, 2025

    Yizhou Peng, Yi-Wen Chao, Dianwen Ng, Yukun Ma, Chongjia Ni, Bin Ma, and Eng Siong Chng. Fd- bench: A full-duplex benchmarking pipeline designed for full duplex spoken dialogue systems, 2025. URL https://arxiv.org/abs/2507.19040

  38. [38]

    Dualtalk: Dual-speaker interaction for 3d talking head conversations

    Ziqiao Peng, Yanbo Fan, Haoyu Wu, Xuan Wang, Hongyan Liu, Jun He, and Zhaoxin Fan. Dualtalk: Dual-speaker interaction for 3d talking head conversations. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21055–21064, 2025

  39. [39]

    Can vision-language models answer face to face questions in the real-world? InThe Fourteenth International Conference on Learning Representations, 2026

    Reza Pourreza, Rishit Dagli, Apratim Bhattacharyya, Sunny Panchal, Guillaume Berger, and Roland Memisevic. Can vision-language models answer face to face questions in the real-world? InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview. net/forum?id=I3dPEvbp8o

  40. [40]

    Face-human-bench: A comprehensive benchmark of face and human understanding for multi-modal assistants.NeurIPS, 2025

    Lixiong Qin, Shilong Ou, Miaoxuan Zhang, Jiangning Wei, Yuhang Zhang, Xiaoshuai Song, Yuchen Liu, Mei Wang, and Weiran Xu. Face-human-bench: A comprehensive benchmark of face and human understanding for multi-modal assistants.NeurIPS, 2025

  41. [41]

    Qwen2.5-Omni Technical Report

    Qwen Team. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215, 2025

  42. [42]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/ blog?id=qwen3.5

  43. [43]

    Blaschko, and Tinne Tuytelaars

    Gorjan Radevski, Teodora Popordanoska, Matthew B. Blaschko, and Tinne Tuytelaars. DA VE: Diagnostic benchmark for audio visual evaluation. InThe Thirty-ninth Annual Conference on Neural Information 12 Processing Systems Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum? id=4ZAX1NT0ms

  44. [44]

    Personaplex: V oice and role control for full duplex conversational speech models, 2026

    Rajarshi Roy, Jonathan Raiman, Sang gil Lee, Teodor-Dumitru Ene, Robert Kirby, Sungwon Kim, Jaehyeon Kim, and Bryan Catanzaro. Personaplex: V oice and role control for full duplex conversational speech models, 2026. URLhttps://arxiv.org/abs/2602.06053

  45. [45]

    Vision-speech models: Teaching speech models to converse about images.arXiv preprint arXiv:2503.15633, 2025

    Amélie Royer, Moritz Böhle, Gabriel de Marmiesse, Laurent Mazaré, Neil Zeghidour, Alexandre Défossez, and Patrick Pérez. Vision-speech models: Teaching speech models to converse about images.arXiv preprint arXiv:2503.15633, 2025

  46. [46]

    video-SALMONN: Speech-enhanced audio-visual large language models

    Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun MA, Yuxuan Wang, and Chao Zhang. video-SALMONN: Speech-enhanced audio-visual large language models. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/ forum?id=nYsh5GFIqX

  47. [47]

    Yunlong Tang, Pinxin Liu, Zhangyun Tan, Mingqian Feng, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, et al. Mmperspective: Do mllms understand perspective? a comprehensive benchmark for perspective perception, reasoning, and robustness.The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks...

  48. [48]

    Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier.https://github.com/snakers4/silero-vad, 2024

    Silero Team. Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier.https://github.com/snakers4/silero-vad, 2024

  49. [49]

    Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents

    Bandhav Veluri, Benjamin N Peloquin, Bokai Yu, Hongyu Gong, and Shyamnath Gollakota. Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21390–21402, Miami, Florida, USA, Nov...

  50. [50]

    Omnimmi: A comprehensive multi-modal interaction benchmark in streaming video contexts, 2025

    Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, and Zilong Zheng. Omnimmi: A comprehensive multi-modal interaction benchmark in streaming video contexts, 2025

  51. [51]

    Jackson.Pragmatics of Human Communication: A Study of Interactional Patterns, Pathologies, and Paradoxes

    Paul Watzlawick, Janet Helmick Beavin, and Don D. Jackson.Pragmatics of Human Communication: A Study of Interactional Patterns, Pathologies, and Paradoxes. W. W. Norton, New York, NY , 1967

  52. [52]

    Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities.ArXiv, abs/2410.11190, 2024

    Zhifei Xie and Changqiao Wu. Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities.ArXiv, abs/2410.11190, 2024

  53. [53]

    Codetalker: Speech-driven 3d facial animation with discrete motion prior

    Jinbo Xing, Menghan Xia, Yuechen Zhang, Xiaodong Cun, Jue Wang, and Tien-Tsin Wong. Codetalker: Speech-driven 3d facial animation with discrete motion prior. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12780–12790, 2023

  54. [54]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025

  55. [55]

    Rtv-bench: Benchmarking mllm continuous perception, understanding and reasoning through real-time video

    Shuhang Xun, Sicheng Tao, Jungang Li, Yibo Shi, Zhixin Lin, Zhanhui Zhu, Yibo Yan, Hanqian Li, Linghao Zhang, Shikang Wang, Yixin Liu, Hanbo Zhang, Ying Ma, and Xuming Hu. Rtv-bench: Benchmarking mllm continuous perception, understanding and reasoning through real-time video. InAdvances in Neural Information Processing Systems, volume 38. NeurIPS, 2025

  56. [56]

    Victor H. Yngve. On getting a word in edgewise. 1970. URL https://api.semanticscholar.org/ CorpusID:143317921

  57. [57]

    SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities

    Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15757–15773, Singapore, December 2023. Asso...

  58. [58]

    OmniFlatten: An end-to-end GPT model for seamless voice conversation

    Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chao- Hong Tan, Zhihao Du, and ShiLiang Zhang. OmniFlatten: An end-to-end GPT model for seamless voice conversation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association...

  59. [59]

    I am an AI robot named VITA

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023. 14 Appendix A Dataset Reviewer Dataset Access.We provide access to the full VideoFDB Evalua...