pith. machine review for the scientific record.

arxiv: 2603.21664 · v2 · submitted 2026-03-23 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

HumanOmni-Speaker: Identifying Who said What and When

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords speaker diarization, multimodal large language models, visual token compression, lip reading, spatio-temporal identity binding, video understanding, who said what

The pith

Compressing 25 fps video motion into six tokens per frame enables accurate end-to-end speaker identification in multi-person conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current omni-modal models fail at determining who said what and when because they rely on low frame rates and visual shortcuts that bypass real cross-modal alignment. The paper introduces the VR-SDR benchmark to remove those shortcuts and force genuine spatio-temporal identity binding through natural language queries. HumanOmni-Speaker addresses the gap with a Visual Delta Encoder that samples raw video at 25 fps and compresses inter-frame motion residuals into six tokens per frame. This approach preserves fine-grained visemes and speaker trajectories without token explosion or cropping. The result is native support for lip reading, precise speaker localization, and stronger performance on speaker-centric tasks.

Core claim

HumanOmni-Speaker, powered by a Visual Delta Encoder, samples raw video at 25 fps and explicitly compresses inter-frame motion residuals into just 6 tokens per frame to capture fine-grained visemes and speaker trajectories, enabling true end-to-end spatio-temporal identity binding and accurate answers to who said what and when using only natural language queries.
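For scale, a back-of-the-envelope token budget makes the "no token explosion" half of the claim concrete. The 25 fps and 6-tokens-per-frame figures come from the claim itself; the one-minute clip length and the 256 patch tokens per frame assumed for a conventional dense encoder are illustrative numbers, not values from the paper.

```python
# Rough token budget: delta encoder vs. a conventional per-frame patch encoder.
# fps and tokens_per_frame come from the paper's claim; clip_seconds and
# baseline_patch_tokens are illustrative assumptions, not figures from the paper.
fps = 25
tokens_per_frame = 6
clip_seconds = 60
baseline_patch_tokens = 256  # e.g. a 16x16 patch grid; assumed, not stated in the paper

delta_tokens = fps * clip_seconds * tokens_per_frame          # 9,000 tokens
baseline_tokens = fps * clip_seconds * baseline_patch_tokens  # 384,000 tokens

print(f"delta encoder:  {delta_tokens:,} tokens per minute of video")
print(f"dense baseline: {baseline_tokens:,} tokens per minute of video")
print(f"compression factor: {baseline_tokens / delta_tokens:.0f}x")
```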

What carries the argument

The Visual Delta Encoder, which compresses inter-frame motion residuals from 25 fps video into six tokens per frame to retain viseme and trajectory details.
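The review does not spell out the encoder internals, so the following is only a minimal sketch of how such a module could work: take the residual between consecutive frames and let a small fixed set of learned queries pool it into six tokens per frame. The conv stem, layer sizes, and query-pooling scheme are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class VisualDeltaEncoderSketch(nn.Module):
    """Illustrative sketch only: pool inter-frame motion residuals into a fixed token budget."""

    def __init__(self, tokens_per_frame: int = 6, embed_dim: int = 256):
        super().__init__()
        # Small conv stem over the frame-to-frame residual (assumed design, not from the paper).
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3), nn.GELU(),
            nn.Conv2d(64, embed_dim, kernel_size=3, stride=2, padding=1),
        )
        # Learned queries cross-attend to the residual feature map, yielding exactly
        # `tokens_per_frame` tokens regardless of spatial resolution.
        self.queries = nn.Parameter(torch.randn(tokens_per_frame, embed_dim) * 0.02)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, 3, H, W), sampled at 25 fps
        deltas = video[:, 1:] - video[:, :-1]                  # (B, T-1, 3, H, W) motion residuals
        b, t, c, h, w = deltas.shape
        feats = self.stem(deltas.reshape(b * t, c, h, w))      # (B*(T-1), D, h', w')
        feats = feats.flatten(2).transpose(1, 2)               # (B*(T-1), h'*w', D)
        queries = self.queries.unsqueeze(0).expand(b * t, -1, -1)
        tokens, _ = self.attn(queries, feats, feats)           # pool residuals into 6 tokens
        return tokens.reshape(b, t, -1, tokens.shape[-1])      # (B, T-1, 6, D)

# Two seconds of 25 fps video at 224x224 -> 49 frame transitions x 6 motion tokens each.
frames = torch.randn(1, 50, 3, 224, 224)
print(VisualDeltaEncoderSketch()(frames).shape)  # torch.Size([1, 49, 6, 256])
```

The point of the fixed query budget is that sequence length grows linearly with frames rather than with patches, which is what keeps 25 fps sampling tractable.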

If this is right

  • Natively supports end-to-end lip reading from full video without separate modules.
  • Achieves high-precision spatial localization of speakers without intrusive cropping.
  • Delivers superior performance across speaker diarization, recognition, and related tasks.
  • Requires genuine cross-modal alignment instead of exploiting benchmark biases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The token-compression approach may extend to other high-frequency visual tasks like gesture recognition or action prediction.
  • Longer untrimmed videos could test whether the six-token representation scales without losing identity binding over time.
  • Integrating the encoder with audio-only or text-only queries might reveal how much visual motion is strictly necessary.

Load-bearing premise

That sampling at 25 fps and compressing to six tokens per frame preserves all high-frequency viseme and trajectory information without leaving any visual shortcuts.
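One hedged way to make the premise concrete is to read it through the Nyquist criterion. The sampling bound itself is standard; the figure of roughly 10 Hz for articulatory lip motion is an assumption this reviewer adds, not a result from the paper.

```latex
% Nyquist reading of the 25 fps premise (illustrative, not an argument from the paper):
% sampling at f_s = 25\,\mathrm{Hz} can only represent motion content up to f_s / 2.
f_{\max} \le \frac{f_s}{2} = \frac{25\,\mathrm{Hz}}{2} = 12.5\,\mathrm{Hz}
% If articulatory lip motion is essentially band-limited below roughly 10 Hz (an assumption),
% 25 fps sampling preserves visemes, and the six-token bottleneck, not the frame rate,
% becomes the binding constraint on what survives compression.
```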

What would settle it

Run the model on a modified version of the VR-SDR benchmark where high-frequency lip motion is removed or altered while low-frequency visuals remain unchanged, then measure whether accuracy on who-said-what queries drops sharply.
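A minimal sketch of that probe, under assumptions: VR-SDR clips load as frame tensors, a temporal moving-average filter stands in for whatever high-frequency suppression the benchmark authors would actually use, and the model interface and scorer below are placeholders rather than a real API.

```python
import torch
import torch.nn.functional as F

def remove_high_frequency_motion(video: torch.Tensor, window: int = 9) -> torch.Tensor:
    """Temporal moving average over frames: keeps low-frequency visuals (pose, identity,
    scene layout) but blurs away high-frequency dynamics such as lip motion."""
    b, t, c, h, w = video.shape
    x = video.permute(0, 2, 3, 4, 1).reshape(b, c * h * w, t)   # time as the 1D signal axis
    kernel = torch.ones(c * h * w, 1, window) / window          # per-pixel box filter
    x = F.conv1d(F.pad(x, (window // 2, window // 2), mode="replicate"),
                 kernel, groups=c * h * w)
    return x.reshape(b, c, h, w, t).permute(0, 4, 1, 2, 3)

def who_said_what_accuracy(model, clips, queries, answers) -> float:
    """Placeholder scorer: fraction of natural-language 'who said what' queries answered
    correctly; `model.answer` is a hypothetical interface, not the paper's API."""
    correct = sum(model.answer(clip, q) == a for clip, q, a in zip(clips, queries, answers))
    return correct / len(answers)

# Hypothetical harness (clips assumed to be (1, T, 3, H, W) tensors): compare accuracy on
# original vs. motion-suppressed VR-SDR clips. A sharp drop would indicate the model reads
# high-frequency lip motion rather than low-frequency visual shortcuts.
# acc_orig = who_said_what_accuracy(model, clips, queries, answers)
# acc_blur = who_said_what_accuracy(model, [remove_high_frequency_motion(c) for c in clips],
#                                   queries, answers)
# print(f"original {acc_orig:.3f} -> motion-suppressed {acc_blur:.3f}")
```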

Figures

Figures reproduced from arXiv: 2603.21664 by Detao Bai, Jingren Zhou, Shimin Yao, Weixuan Chen, Xihan Wei, Zhiheng Ma.

Figure 1: Speaker-centric examples in the HumanOmni-Speaker benchmark. (top) Visual-Registered …
Figure 2: HumanOmni-Speaker benchmark sample characteristics and statistics. (a) Samples with …
Figure 3: Overview of the HumanOmni-Speaker architecture for human-centric speaking scenarios. It …
Figure 4: Attention maps generated by Grad-CAM show that the Visual Delta Encoder successfully …
Figure 5: The progressive training pipeline of HumanOmni-Speaker.
Figure 6: (a) Effect of token number on the SL task. (b) Effect of FPS on the VSR task.
Original abstract

While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence" -- they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics like lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip-reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that current omni-modal LLMs suffer from visual biases and low-frame-rate sampling that prevent genuine spatio-temporal identity binding in multi-speaker conversations. It introduces the VR-SDR benchmark to eliminate such shortcuts and proposes HumanOmni-Speaker, which uses a Visual Delta Encoder to sample raw video at 25 fps and compress inter-frame motion residuals into 6 tokens per frame. This architecture is asserted to natively support end-to-end lip-reading and high-precision spatial localization without cropping while delivering superior performance on speaker diarization, recognition, and related tasks.

Significance. If the empirical claims are substantiated, the work would provide a concrete architectural solution to the perception gap in multimodal LLMs for conversational dynamics, along with a benchmark that enforces true cross-modal alignment rather than shortcut exploitation. The explicit compression strategy for high-frequency visual dynamics could influence future token-efficient video-language models.

major comments (2)
  1. [Abstract] The central performance claims ('superior performance across a wide spectrum of speaker-centric tasks', 'strong multimodal synergy', 'natively enabling end-to-end lip-reading and high-precision spatial localization') are stated without any quantitative metrics, benchmark scores, ablation tables, or baseline comparisons, leaving the primary contribution unsupported by evidence in the manuscript.
  2. [Abstract] Visual Delta Encoder description: The assertion that sampling at 25 fps and compressing motion residuals to exactly 6 tokens per frame 'captures fine-grained visemes and speaker trajectories' without 'catastrophic token explosion' or residual visual shortcuts is load-bearing for the 'strictly eliminating visual shortcuts' guarantee, yet the text supplies no reconstruction-error bounds, viseme-classification accuracy from tokens alone, or ablation on token count / frame rate to validate information preservation.
minor comments (1)
  1. [Abstract] The phrase 'illusion of competence' is used without a concrete example of the visual bias it refers to; a brief illustrative case would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where the abstract could more clearly substantiate our claims. We agree that strengthening the presentation of quantitative evidence and validation details will improve the manuscript and will incorporate the suggested revisions.

Point-by-point responses
  1. Referee: [Abstract] The central performance claims ('superior performance across a wide spectrum of speaker-centric tasks', 'strong multimodal synergy', 'natively enabling end-to-end lip-reading and high-precision spatial localization') are stated without any quantitative metrics, benchmark scores, ablation tables, or baseline comparisons, leaving the primary contribution unsupported by evidence in the manuscript.

    Authors: We agree that the abstract would benefit from explicit quantitative support. The full manuscript contains detailed results, baseline comparisons, and ablation tables in Sections 4 and 5 demonstrating superior performance on VR-SDR and related tasks. We will revise the abstract to include key metrics (e.g., diarization error rates and recognition accuracy) that directly support the stated claims. revision: yes

  2. Referee: [Abstract] Visual Delta Encoder description: The assertion that sampling at 25 fps and compressing motion residuals to exactly 6 tokens per frame 'captures fine-grained visemes and speaker trajectories' without 'catastrophic token explosion' or residual visual shortcuts is load-bearing for the 'strictly eliminating visual shortcuts' guarantee, yet the text supplies no reconstruction-error bounds, viseme-classification accuracy from tokens alone, or ablation on token count / frame rate to validate information preservation.

    Authors: We acknowledge that direct validation of the compression strategy (reconstruction error, viseme accuracy from tokens, and ablations on token count/frame rate) is not currently in the abstract. The manuscript reports end-to-end task performance that implicitly validates the 6-token design, but we agree additional targeted metrics would strengthen the claim. We will add a new ablation subsection with reconstruction-error bounds, viseme-classification accuracy, and token-count/frame-rate sweeps in the revised version. revision: yes
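For concreteness, the promised sweep could take the shape below. Only the 25 fps / 6-tokens-per-frame point comes from the paper; the other grid values and the evaluation stub are illustrative assumptions, not anything the authors describe.

```python
import itertools

# Hypothetical ablation grid over the two free parameters flagged in the ledger below.
# 25 fps and 6 tokens/frame are the paper's settings; every other value is illustrative.
FRAME_RATES = [1, 5, 12.5, 25]
TOKENS_PER_FRAME = [1, 2, 6, 16]

def evaluate(fps: float, n_tokens: int) -> dict:
    """Placeholder: re-encode (or retrain) with this setting and report, e.g., diarization
    error rate on VR-SDR and viseme-classification accuracy probed from the tokens alone."""
    raise NotImplementedError("wire up the actual training/evaluation harness here")

results = {}
for fps, n_tokens in itertools.product(FRAME_RATES, TOKENS_PER_FRAME):
    try:
        results[(fps, n_tokens)] = evaluate(fps, n_tokens)
    except NotImplementedError:
        results[(fps, n_tokens)] = None  # filled in once the harness exists

for (fps, n_tokens), metrics in sorted(results.items()):
    print(f"fps={fps:>5}, tokens/frame={n_tokens:>2}: {metrics}")
```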

Circularity Check

0 steps flagged

No significant circularity; architecture and benchmark are independent proposals

Full rationale

The paper text contains no equations, derivations, or load-bearing self-citations. The Visual Delta Encoder is introduced as a new component with explicit design choices (25 fps sampling, 6-token compression of motion residuals), but these are presented as architectural proposals rather than quantities derived from or fitted to the target performance metrics. No step reduces the claimed end-to-end lip-reading or spatio-temporal binding to a self-defined input by construction. Performance superiority is asserted as an empirical outcome on the new VR-SDR benchmark, not as a prediction forced by the same data or prior self-citations. This matches the default expectation for non-circular papers.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified assumption that the chosen frame rate and token compression suffice for identity binding, plus the new encoder component itself.

free parameters (2)
  • tokens per frame = 6
    Fixed at 6 to avoid token explosion while capturing motion residuals
  • frame rate = 25
    Set to 25 fps to capture high-frequency dynamics
axioms (1)
  • domain assumption: 25 fps video sampling is sufficient to capture visemes and speaker trajectories for identity binding
    Invoked when stating that high-frame-rate sampling overcomes low-frame-rate limitations
invented entities (1)
  • Visual Delta Encoder: no independent evidence
    purpose: Compress inter-frame motion residuals into a small number of tokens
    New architectural module introduced to bridge the perception gap

pith-pipeline@v0.9.0 · 5532 in / 1328 out tokens · 45266 ms · 2026-05-15T00:54:07.441144+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

    cs.SD · 2026-05 · unverdicted · novelty 8.0

    Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
