pith. machine review for the scientific record.

arxiv: 2603.21664 · v2 · submitted 2026-03-23 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

HumanOmni-Speaker: Identifying Who said What and When

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:54 UTC · model grok-4.3

classification 💻 cs.CV
keywords speaker diarization, multimodal large language models, visual token compression, lip reading, spatio-temporal identity binding, video understanding, who said what

The pith

Compressing 25 fps video motion into six tokens per frame enables accurate end-to-end speaker identification in multi-person conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current omni-modal models fail at determining who said what and when because they rely on low frame rates and visual shortcuts that bypass real cross-modal alignment. The paper introduces the VR-SDR benchmark to remove those shortcuts and force genuine spatio-temporal identity binding through natural language queries. HumanOmni-Speaker addresses the gap with a Visual Delta Encoder that samples raw video at 25 fps and compresses inter-frame motion residuals into six tokens per frame. This approach preserves fine-grained visemes and speaker trajectories without token explosion or cropping. The result is native support for lip reading, precise speaker localization, and stronger performance on speaker-centric tasks.

Core claim

HumanOmni-Speaker, powered by a Visual Delta Encoder, samples raw video at 25 fps and explicitly compresses inter-frame motion residuals into just 6 tokens per frame to capture fine-grained visemes and speaker trajectories, enabling true end-to-end spatio-temporal identity binding and accurate answers to who said what and when using only natural language queries.
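For scale, a back-of-the-envelope token budget makes the "no token explosion" half of the claim concrete. The 25 fps and 6-tokens-per-frame figures come from the claim itself; the one-minute clip length and the 256 patch tokens per frame assumed for a conventional dense encoder are illustrative numbers, not values from the paper.

```python
# Rough token budget: delta encoder vs. a conventional per-frame patch encoder.
# fps and tokens_per_frame come from the paper's claim; clip_seconds and
# baseline_patch_tokens are illustrative assumptions, not figures from the paper.
fps = 25
tokens_per_frame = 6
clip_seconds = 60
baseline_patch_tokens = 256  # e.g. a 16x16 patch grid; assumed, not stated in the paper

delta_tokens = fps * clip_seconds * tokens_per_frame          # 9,000 tokens
baseline_tokens = fps * clip_seconds * baseline_patch_tokens  # 384,000 tokens

print(f"delta encoder:  {delta_tokens:,} tokens per minute of video")
print(f"dense baseline: {baseline_tokens:,} tokens per minute of video")
print(f"compression factor: {baseline_tokens / delta_tokens:.0f}x")
```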

What carries the argument

The Visual Delta Encoder, which compresses inter-frame motion residuals from 25 fps video into six tokens per frame to retain viseme and trajectory details.
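The review does not spell out the encoder internals, so the following is only a minimal sketch of how such a module could work: take the residual between consecutive frames and let a small fixed set of learned queries pool it into six tokens per frame. The conv stem, layer sizes, and query-pooling scheme are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class VisualDeltaEncoderSketch(nn.Module):
    """Illustrative sketch only: pool inter-frame motion residuals into a fixed token budget."""

    def __init__(self, tokens_per_frame: int = 6, embed_dim: int = 256):
        super().__init__()
        # Small conv stem over the frame-to-frame residual (assumed design, not from the paper).
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3), nn.GELU(),
            nn.Conv2d(64, embed_dim, kernel_size=3, stride=2, padding=1),
        )
        # Learned queries cross-attend to the residual feature map, yielding exactly
        # `tokens_per_frame` tokens regardless of spatial resolution.
        self.queries = nn.Parameter(torch.randn(tokens_per_frame, embed_dim) * 0.02)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, 3, H, W), sampled at 25 fps
        deltas = video[:, 1:] - video[:, :-1]                  # (B, T-1, 3, H, W) motion residuals
        b, t, c, h, w = deltas.shape
        feats = self.stem(deltas.reshape(b * t, c, h, w))      # (B*(T-1), D, h', w')
        feats = feats.flatten(2).transpose(1, 2)               # (B*(T-1), h'*w', D)
        queries = self.queries.unsqueeze(0).expand(b * t, -1, -1)
        tokens, _ = self.attn(queries, feats, feats)           # pool residuals into 6 tokens
        return tokens.reshape(b, t, -1, tokens.shape[-1])      # (B, T-1, 6, D)

# Two seconds of 25 fps video at 224x224 -> 49 frame transitions x 6 motion tokens each.
frames = torch.randn(1, 50, 3, 224, 224)
print(VisualDeltaEncoderSketch()(frames).shape)  # torch.Size([1, 49, 6, 256])
```

The point of the fixed query budget is that sequence length grows linearly with frames rather than with patches, which is what keeps 25 fps sampling tractable.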

If this is right

  • Natively supports end-to-end lip reading from full video without separate modules.
  • Achieves high-precision spatial localization of speakers without intrusive cropping.
  • Delivers superior performance across speaker diarization, recognition, and related tasks.
  • Requires genuine cross-modal alignment instead of exploiting benchmark biases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The token-compression approach may extend to other high-frequency visual tasks like gesture recognition or action prediction.
  • Longer untrimmed videos could test whether the six-token representation scales without losing identity binding over time.
  • Integrating the encoder with audio-only or text-only queries might reveal how much visual motion is strictly necessary.

Load-bearing premise

That sampling at 25 fps and compressing to six tokens per frame preserves all high-frequency viseme and trajectory information without leaving any visual shortcuts.
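One hedged way to make the premise concrete is to read it through the Nyquist criterion. The sampling bound itself is standard; the figure of roughly 10 Hz for articulatory lip motion is an assumption this reviewer adds, not a result from the paper.

```latex
% Nyquist reading of the 25 fps premise (illustrative, not an argument from the paper):
% sampling at f_s = 25\,\mathrm{Hz} can only represent motion content up to f_s / 2.
f_{\max} \le \frac{f_s}{2} = \frac{25\,\mathrm{Hz}}{2} = 12.5\,\mathrm{Hz}
% If articulatory lip motion is essentially band-limited below roughly 10 Hz (an assumption),
% 25 fps sampling preserves visemes, and the six-token bottleneck, not the frame rate,
% becomes the binding constraint on what survives compression.
```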

What would settle it

Run the model on a modified version of the VR-SDR benchmark where high-frequency lip motion is removed or altered while low-frequency visuals remain unchanged, then measure whether accuracy on who-said-what queries drops sharply.
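A minimal sketch of that probe, under assumptions: VR-SDR clips load as frame tensors, a temporal moving-average filter stands in for whatever high-frequency suppression the benchmark authors would actually use, and the model interface and scorer below are placeholders rather than a real API.

```python
import torch
import torch.nn.functional as F

def remove_high_frequency_motion(video: torch.Tensor, window: int = 9) -> torch.Tensor:
    """Temporal moving average over frames: keeps low-frequency visuals (pose, identity,
    scene layout) but blurs away high-frequency dynamics such as lip motion."""
    b, t, c, h, w = video.shape
    x = video.permute(0, 2, 3, 4, 1).reshape(b, c * h * w, t)   # time as the 1D signal axis
    kernel = torch.ones(c * h * w, 1, window) / window          # per-pixel box filter
    x = F.conv1d(F.pad(x, (window // 2, window // 2), mode="replicate"),
                 kernel, groups=c * h * w)
    return x.reshape(b, c, h, w, t).permute(0, 4, 1, 2, 3)

def who_said_what_accuracy(model, clips, queries, answers) -> float:
    """Placeholder scorer: fraction of natural-language 'who said what' queries answered
    correctly; `model.answer` is a hypothetical interface, not the paper's API."""
    correct = sum(model.answer(clip, q) == a for clip, q, a in zip(clips, queries, answers))
    return correct / len(answers)

# Hypothetical harness (clips assumed to be (1, T, 3, H, W) tensors): compare accuracy on
# original vs. motion-suppressed VR-SDR clips. A sharp drop would indicate the model reads
# high-frequency lip motion rather than low-frequency visual shortcuts.
# acc_orig = who_said_what_accuracy(model, clips, queries, answers)
# acc_blur = who_said_what_accuracy(model, [remove_high_frequency_motion(c) for c in clips],
#                                   queries, answers)
# print(f"original {acc_orig:.3f} -> motion-suppressed {acc_blur:.3f}")
```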

Figures

Figures reproduced from arXiv: 2603.21664 by Detao Bai, Jingren Zhou, Shimin Yao, Weixuan Chen, Xihan Wei, Zhiheng Ma.

Figure 1: Speaker-centric examples in the HumanOmni-Speaker benchmark. (top) Visual-Registered …
Figure 2: HumanOmni-Speaker benchmark sample characteristics and statistics. (a) Samples with …
Figure 3: Overview of the HumanOmni-Speaker architecture for human-centric speaking scenarios. It …
Figure 4: Attention maps generated by Grad-CAM show that the Visual Delta Encoder successfully …
Figure 5: The progressive training pipeline of HumanOmni-Speaker.
Figure 6: (a) Effect of token number on the SL task. (b) Effect of FPS on the VSR task.
Original abstract

While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence" -- they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency dynamics like lip movements. To shatter this illusion, we introduce Visual-Registered Speaker Diarization and Recognition (VR-SDR) and the HumanOmni-Speaker Benchmark. By strictly eliminating visual shortcuts, this rigorous paradigm demands true end-to-end spatio-temporal identity binding using only natural language queries. To overcome the underlying architectural perception gap, we propose HumanOmni-Speaker, powered by a Visual Delta Encoder. By sampling raw video at 25 fps and explicitly compressing inter-frame motion residuals into just 6 tokens per frame, it captures fine-grained visemes and speaker trajectories without triggering a catastrophic token explosion. Ultimately, HumanOmni-Speaker demonstrates strong multimodal synergy, natively enabling end-to-end lip-reading and high-precision spatial localization without intrusive cropping, and achieving superior performance across a wide spectrum of speaker-centric tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that current omni-modal LLMs suffer from visual biases and low-frame-rate sampling that prevent genuine spatio-temporal identity binding in multi-speaker conversations. It introduces the VR-SDR benchmark to eliminate such shortcuts and proposes HumanOmni-Speaker, which uses a Visual Delta Encoder to sample raw video at 25 fps and compress inter-frame motion residuals into 6 tokens per frame. This architecture is asserted to natively support end-to-end lip-reading and high-precision spatial localization without cropping while delivering superior performance on speaker diarization, recognition, and related tasks.

Significance. If the empirical claims are substantiated, the work would provide a concrete architectural solution to the perception gap in multimodal LLMs for conversational dynamics, along with a benchmark that enforces true cross-modal alignment rather than shortcut exploitation. The explicit compression strategy for high-frequency visual dynamics could influence future token-efficient video-language models.

major comments (2)
  1. [Abstract] The central performance claims ('superior performance across a wide spectrum of speaker-centric tasks', 'strong multimodal synergy', 'natively enabling end-to-end lip-reading and high-precision spatial localization') are stated without any quantitative metrics, benchmark scores, ablation tables, or baseline comparisons, leaving the primary contribution unsupported by evidence in the manuscript.
  2. [Abstract] Visual Delta Encoder description: The assertion that sampling at 25 fps and compressing motion residuals to exactly 6 tokens per frame 'captures fine-grained visemes and speaker trajectories' without 'catastrophic token explosion' or residual visual shortcuts is load-bearing for the 'strictly eliminating visual shortcuts' guarantee, yet the text supplies no reconstruction-error bounds, viseme-classification accuracy from tokens alone, or ablation on token count / frame rate to validate information preservation.
minor comments (1)
  1. [Abstract] The phrase 'illusion of competence' is used without a concrete example of the visual bias it refers to; a brief illustrative case would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where the abstract could more clearly substantiate our claims. We agree that strengthening the presentation of quantitative evidence and validation details will improve the manuscript and will incorporate the suggested revisions.

Point-by-point responses
  1. Referee: [Abstract] The central performance claims ('superior performance across a wide spectrum of speaker-centric tasks', 'strong multimodal synergy', 'natively enabling end-to-end lip-reading and high-precision spatial localization') are stated without any quantitative metrics, benchmark scores, ablation tables, or baseline comparisons, leaving the primary contribution unsupported by evidence in the manuscript.

    Authors: We agree that the abstract would benefit from explicit quantitative support. The full manuscript contains detailed results, baseline comparisons, and ablation tables in Sections 4 and 5 demonstrating superior performance on VR-SDR and related tasks. We will revise the abstract to include key metrics (e.g., diarization error rates and recognition accuracy) that directly support the stated claims. revision: yes

  2. Referee: [Abstract] Visual Delta Encoder description: The assertion that sampling at 25 fps and compressing motion residuals to exactly 6 tokens per frame 'captures fine-grained visemes and speaker trajectories' without 'catastrophic token explosion' or residual visual shortcuts is load-bearing for the 'strictly eliminating visual shortcuts' guarantee, yet the text supplies no reconstruction-error bounds, viseme-classification accuracy from tokens alone, or ablation on token count / frame rate to validate information preservation.

    Authors: We acknowledge that direct validation of the compression strategy (reconstruction error, viseme accuracy from tokens, and ablations on token count/frame rate) is not currently in the abstract. The manuscript reports end-to-end task performance that implicitly validates the 6-token design, but we agree additional targeted metrics would strengthen the claim. We will add a new ablation subsection with reconstruction-error bounds, viseme-classification accuracy, and token-count/frame-rate sweeps in the revised version. revision: yes
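For concreteness, the promised sweep could take the shape below. Only the 25 fps / 6-tokens-per-frame point comes from the paper; the other grid values and the evaluation stub are illustrative assumptions, not anything the authors describe.

```python
import itertools

# Hypothetical ablation grid over the two free parameters flagged in the ledger below.
# 25 fps and 6 tokens/frame are the paper's settings; every other value is illustrative.
FRAME_RATES = [1, 5, 12.5, 25]
TOKENS_PER_FRAME = [1, 2, 6, 16]

def evaluate(fps: float, n_tokens: int) -> dict:
    """Placeholder: re-encode (or retrain) with this setting and report, e.g., diarization
    error rate on VR-SDR and viseme-classification accuracy probed from the tokens alone."""
    raise NotImplementedError("wire up the actual training/evaluation harness here")

results = {}
for fps, n_tokens in itertools.product(FRAME_RATES, TOKENS_PER_FRAME):
    try:
        results[(fps, n_tokens)] = evaluate(fps, n_tokens)
    except NotImplementedError:
        results[(fps, n_tokens)] = None  # filled in once the harness exists

for (fps, n_tokens), metrics in sorted(results.items()):
    print(f"fps={fps:>5}, tokens/frame={n_tokens:>2}: {metrics}")
```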

Circularity Check

0 steps flagged

No significant circularity; architecture and benchmark are independent proposals

Full rationale

The paper text contains no equations, derivations, or load-bearing self-citations. The Visual Delta Encoder is introduced as a new component with explicit design choices (25 fps sampling, 6-token compression of motion residuals), but these are presented as architectural proposals rather than quantities derived from or fitted to the target performance metrics. No step reduces the claimed end-to-end lip-reading or spatio-temporal binding to a self-defined input by construction. Performance superiority is asserted as an empirical outcome on the new VR-SDR benchmark, not as a prediction forced by the same data or prior self-citations. This matches the default expectation for non-circular papers.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified assumption that the chosen frame rate and token compression suffice for identity binding, plus the new encoder component itself.

free parameters (2)
  • tokens per frame = 6
    Fixed at 6 to avoid token explosion while capturing motion residuals
  • frame rate = 25
    Set to 25 fps to capture high-frequency dynamics
axioms (1)
  • domain assumption: 25 fps video sampling is sufficient to capture visemes and speaker trajectories for identity binding
    Invoked when stating that high-frame-rate sampling overcomes low-frame-rate limitations
invented entities (1)
  • Visual Delta Encoder: no independent evidence
    purpose: Compress inter-frame motion residuals into a small number of tokens
    New architectural module introduced to bridge the perception gap

pith-pipeline@v0.9.0 · 5532 in / 1328 out tokens · 45266 ms · 2026-05-15T00:54:07.441144+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Omni-DeepSearch: A Benchmark for Audio-Driven Omni-Modal Deep Search

    cs.SD · 2026-05 · unverdicted · novelty 8.0

    Omni-DeepSearch is a 640-sample benchmark for audio-driven omni-modal search where the best model reaches only 43.44% accuracy, exposing bottlenecks in audio inference, tool use, and cross-modal reasoning.
