arxiv: 2512.02231 · v2 · submitted 2025-12-01 · 💻 cs.CV · cs.AI· cs.LG

Recognition: 2 theorem links

· Lean Theorem

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Le Thien Phuc Nguyen , Zhuoran Yu , Samuel Low Yu Hang , Subin An , Jeongik Lee , Yohan Ban , SeungEun Chung , Thanh-Huy Nguyen

show 3 more authors

JuWan Maeng Soochahn Lee Yong Jae Lee

Authors on Pith no claims yet

Pith reviewed 2026-05-17 02:08 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG

keywords multimodal large language modelsaudiovisual fusionspeaker reasoningspeech understandingvideo benchmarkcross-modal evaluation

0 comments

The pith

A new benchmark shows Gemini models lead in audiovisual human speech understanding while open models lag in fusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AV-SpeakerBench, a benchmark consisting of 3,212 multiple-choice questions aimed at testing multimodal large language models on speaker-centric reasoning that requires combining visual, audio, and language cues in real-world videos. It addresses limitations in existing benchmarks by using a speaker-centered approach and fusion-grounded questions that embed audiovisual dependencies, along with expert-curated annotations for precision. Evaluations across models reveal that the Gemini family, with Gemini 2.5 Pro at the top, outperforms open-source systems. Open models such as Qwen3-Omni-30B get closer to Gemini 2.0 Flash but fall short of Gemini 2.5 Pro mainly because of weaker audiovisual fusion capabilities. Sympathetic readers would care as this pinpoints where current MLLMs need improvement to handle real human speech in videos effectively.

Core claim

We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features a speaker-centered formulation that treats speakers as the core reasoning unit, fusion-grounded question design embedding audiovisual dependencies, and expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather.

What carries the argument

AV-SpeakerBench benchmark with speaker-centered formulation and fusion-grounded questions that force integration of who speaks, what is said, and when.

If this is right

Gemini 2.5 Pro represents the current best performance on fine-grained audiovisual speech tasks.
Open-source multimodal models require advances in audiovisual fusion to approach closed model capabilities.
Benchmarks for MLLMs should incorporate questions that cannot be solved by single modalities or coarse information.
Development of future systems should emphasize temporal alignment between visual and audio speech elements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Enhanced fusion techniques could improve model utility in applications like automated video summarization or accessibility tools.
The benchmark design could inspire similar tests for other multimodal challenges involving timing and identity.
Training open models with more synchronized audiovisual data might reduce the observed performance gap.

Load-bearing premise

The expert-curated questions and fusion-grounded design embed true audiovisual dependencies that prevent solving them with visual cues or coarse speech information alone.

What would settle it

A test where models are evaluated on the same questions but with audio removed or timing disrupted; high performance in that case would falsify the claim that the benchmark measures fusion.

Figures

Figures reproduced from arXiv: 2512.02231 by Jeongik Lee, JuWan Maeng, Le Thien Phuc Nguyen, Samuel Low Yu Hang, SeungEun Chung, Soochahn Lee, Subin An, Thanh-Huy Nguyen, Yohan Ban, Yong Jae Lee, Zhuoran Yu.

**Figure 1.** Figure 1: Motivation of AV-SpeakerBench. Existing video benchmarks often contain visually solvable questions—such as counting visible people—where state-of-the-art multimodal models can answer correctly even when the audio stream is muted (left; examples from Video-MME [13]). In contrast, questions in AVSpeakerBench (right) are explicitly designed to require audiovisual fusion: the correct answer depends on who s… view at source ↗

**Figure 2.** Figure 2: Top: Examples of audiovisual reasoning questions in AV-SpeakerBench. Each question illustrates a distinct way in which audiovisual dependency is enforced—through spoken-phrase grounding, visual event conditioning, cross-modal temporal localization, or multi-speaker coordination—ensuring that the correct answer cannot be inferred from a single modality. Bottom: Dataset Distribution. We present the distribut… view at source ↗

**Figure 3.** Figure 3: Multimodal ablation and error analysis. der vision-only and audiovisual input settings (Figure 3a). Gemini 2.5 Pro exhibits consistent gains of roughly 10–20 percentage points across all tasks when both modalities are available, indicating stable and effective fusion. In contrast, Qwen3-Omni-30B achieves much smaller gains—and in some tasks, even negative differences—suggesting that audio input does not r… view at source ↗

**Figure 4.** Figure 4: Qualitative examples of Gemini 2.5 Pro reasoning traces on AV-SpeakerBench. Green and red highlight colors indicate the model’s correct and incorrect reasoning, respectively. (a) Vision-only example answered correctly: the model identifies the correct speaker by tracking the duration and consistency of mouth movement and conversational gestures, which serve as natural visual cues for inferring who is speak… view at source ↗

**Figure 5.** Figure 5: Annotation interface for rate–comparison tasks. The interface presents annotators with the video clip, metadata (video ID, category, task type), the question, all answer choices, and the selected response. Annotators also specify the temporal window used for judgment and provide a brief justification. The examples shown correspond to (left) lowest rate of speech, (middle) highest rate of speech, and (right… view at source ↗

**Figure 6.** Figure 6: Examples of trivially solvable questions removed during filtering. (Top) A moment-specific visibility question becomes trivial because only one person is visible throughout the entire clip, making the answer recoverable without grounding to the referenced utterance. (Bottom) A speech-content question becomes trivial because the spoken line appears as burned-in captions, allowing the answer to be selected… view at source ↗

**Figure 8.** Figure 8: Human evaluation interface. Evaluators watch the video clip, then answer the corresponding multiple-choice question (A–D). No transcript or subtitle is provided. The interface also includes an optional refinement tag and a control question asking for the total number of people visible in the video. This setup ensures that human performance is independent of annotation and directly comparable to model out… view at source ↗

**Figure 9.** Figure 9: Qualitative examples of Gemini 2.5 Pro reasoning traces on AV-SpeakerBench. Green and red highlight colors indicate the model’s correct and incorrect reasoning, respectively. The figure above contains representative failure cases spanning four key error patterns: (a) cross-modality attribution, (b) audio and visual perception, (c) temporal grounding, and (d) temporal localization. Detailed analyses are pro… view at source ↗

**Figure 10.** Figure 10: Visualization of speaker tasks. Top: Speaker Detection. Middle: Speaker Recognition. Bottom: Speaker Counting. 7 [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: Visualization of speaker-visual tasks. Top: Activity Recognition. Middle: Visual Counting. Bottom: Attribute Recognition. 8 [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: Visualization of speech tasks. Top: Speech Counting. Middle: Speech Duration. Bottom: Speech Recognition. 9 [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗

**Figure 13.** Figure 13: Visualization of speech attribute tasks. Top: Speech Intensity. Middle: Speech Pitch. Bottom: Speech Rate. 10 [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗

read the original abstract

Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers-not scenes-as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AV-SpeakerBench adds a speaker-focused audiovisual benchmark but the claim that open models lag mainly from fusion deficits lacks the modality controls needed to back it up.

read the letter

The paper's main point is a new benchmark called AV-SpeakerBench with 3,212 expert-curated multiple-choice questions that target fine-grained reasoning about speakers in real videos. It emphasizes who is talking, what they say, and timing, using a speaker-centered framing and questions meant to require audiovisual fusion rather than letting models get by on visuals or loose audio alone. That setup is new relative to most existing video benchmarks, which often stay visual or test speech only coarsely. The evaluations are straightforward and show the Gemini family ahead, with Gemini 2.5 Pro on top and Qwen3-Omni-30B closer to Gemini 2.0 Flash than to the leader. The curation by experts and the focus on temporal precision are clear strengths and give the benchmark a practical feel for testing real multimodal understanding. The results are presented cleanly enough to let readers see the performance ordering. The soft spot sits in the interpretation of those gaps. The paper says open models trail mainly because of weaker audiovisual fusion rather than visual perception, yet the abstract and available details do not include the obvious checks: video-only runs, audio-masked runs, or any quantitative measure showing that single-modality accuracy stays near chance. Without those, the diagnostic claim rests on the assumption that the questions truly embed cross-modal dependencies, and that assumption is not yet demonstrated. Minor details like inter-annotator agreement numbers or explicit validation steps for question hardness would also help, but the missing ablations are the larger issue because they directly affect how much weight to give the fusion explanation. This work is aimed at researchers building or evaluating multimodal models for video with speech content, especially in areas like accessibility or dialogue understanding. A reader who needs a targeted test set for audiovisual speech reasoning will get something usable here even before revisions. The core benchmark idea holds up and the paper engages honestly with the literature on existing video evaluations. It deserves a serious referee who can ask for the modality controls and any additional validation stats.

Referee Report

2 major / 2 minor

Summary. The paper introduces AV-SpeakerBench, a benchmark of 3,212 expert-curated multiple-choice questions targeting fine-grained audiovisual reasoning about human speech in real-world videos. It emphasizes a speaker-centered formulation and fusion-grounded question design that embeds dependencies between speaker identity, speech content, and timing. Comprehensive evaluations across models conclude that the Gemini family leads, with Gemini 2.5 Pro strongest overall, while open-source models such as Qwen3-Omni-30B trail primarily due to weaker audiovisual fusion rather than deficits in visual perception.

Significance. If the benchmark questions are shown to require genuine cross-modal fusion, the work would provide a useful diagnostic tool for multimodal LLMs and help quantify gaps between proprietary and open systems in speaker-centric audiovisual understanding. The scale, expert curation, and focus on temporal and cross-modal validity are positive contributions that could guide future model development.

major comments (2)

[Benchmark Construction] The fusion-grounded question design and speaker-centered formulation are presented as ensuring audiovisual dependencies, yet no validation details (inter-annotator agreement, question difficulty controls, or checks that questions cannot be solved unimodally) are provided. This directly undermines the load-bearing claim in the results that performance gaps reflect audiovisual fusion deficits rather than visual or coarse-audio shortcuts.
[Results and Analysis] The diagnostic conclusion that Qwen3-Omni-30B approaches Gemini 2.0 Flash but lags Gemini 2.5 Pro 'primarily due to weaker audiovisual fusion rather than visual perception' lacks supporting evidence. No unimodal baselines (video-only, audio-masked) or ablation results are reported to confirm that single-modality accuracy remains near chance on the 3,212 questions.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a brief explicit statement of the total number of models evaluated and the precise accuracy metric used for all comparisons.
[Figures and Tables] Figure captions and table headers could more clearly distinguish between overall accuracy and per-category breakdowns to aid quick interpretation of the fusion-related claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our work introducing AV-SpeakerBench. We address each of the major comments in detail below, providing clarifications and outlining the revisions we plan to incorporate to strengthen the manuscript.

read point-by-point responses

Referee: [Benchmark Construction] The fusion-grounded question design and speaker-centered formulation are presented as ensuring audiovisual dependencies, yet no validation details (inter-annotator agreement, question difficulty controls, or checks that questions cannot be solved unimodally) are provided. This directly undermines the load-bearing claim in the results that performance gaps reflect audiovisual fusion deficits rather than visual or coarse-audio shortcuts.

Authors: We agree that providing explicit validation details is important for substantiating the benchmark's design. The expert curation process involved multiple annotators to ensure temporal precision and cross-modal validity, as described in the manuscript. In the revised version, we will report inter-annotator agreement metrics, details on question difficulty controls via iterative review, and preliminary unimodal checks demonstrating that questions are not solvable from single modalities alone. This will directly address the concern and reinforce that performance differences stem from audiovisual fusion capabilities. revision: yes
Referee: [Results and Analysis] The diagnostic conclusion that Qwen3-Omni-30B approaches Gemini 2.0 Flash but lags Gemini 2.5 Pro 'primarily due to weaker audiovisual fusion rather than visual perception' lacks supporting evidence. No unimodal baselines (video-only, audio-masked) or ablation results are reported to confirm that single-modality accuracy remains near chance on the 3,212 questions.

Authors: We acknowledge that the current manuscript relies on comparative model performances and error analysis to infer the role of audiovisual fusion. To provide stronger evidence, we will add unimodal baseline experiments in the revision, including video-only and audio-masked settings across the evaluated models. These results are expected to show near-chance performance on the benchmark questions, thereby supporting our diagnostic conclusion regarding fusion deficits in open-source models like Qwen3-Omni-30B. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation is self-contained

full rationale

The paper introduces AV-SpeakerBench as a new expert-curated benchmark of 3,212 questions with speaker-centered and fusion-grounded design, then reports direct empirical evaluations of existing MLLMs such as Gemini and Qwen3-Omni on it. No equations, parameter fitting, derivations, or predictions appear in the provided text. The performance gap attribution follows from observed results and the benchmark's stated design rather than reducing to any self-definitional loop, fitted input renamed as prediction, or load-bearing self-citation chain. The derivation chain is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces a new benchmark without deriving results from first principles or fitting parameters; it relies on the assumption that expert curation produces questions requiring true audiovisual fusion.

axioms (1)

domain assumption Expert-curated annotations ensure temporal precision and cross-modal validity of questions.
Invoked in the abstract description of benchmark construction.

pith-pipeline@v0.9.0 · 5562 in / 1194 out tokens · 26003 ms · 2026-05-17T02:08:52.074118+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos... fusion-grounded question design embedding audiovisual dependencies into question semantics
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Gemini 2.5 Pro... remains far behind... primarily due to weaker audiovisual fusion rather than visual perception

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 10 internal anchors

[1]

Lrs3-ted: a large-scale dataset for visual speech recog- nition, 2018

Triantafyllos Afouras, Joon Son Chung, and Andrew Zisser- man. Lrs3-ted: a large-scale dataset for visual speech recog- nition, 2018. 2

work page 2018
[2]

Vqa: Visual question answering

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE international conference on computer vision, pages 2425– 2433, 2015. 2

work page 2015
[3]

Temporalbench: Benchmarking fine- grained temporal understanding for multimodal video mod- els.arXiv preprint arXiv:2410.10818, 2024

Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, et al. Temporalbench: Benchmarking fine- grained temporal understanding for multimodal video mod- els.arXiv preprint arXiv:2410.10818, 2024. 2

work page arXiv 2024
[4]

Vggsound: A large-scale audio-visual dataset, 2020

Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zis- serman. Vggsound: A large-scale audio-visual dataset, 2020. 2

work page 2020
[5]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, and Lidong Bing. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video- llms.arXiv preprint arXiv:2406.07476, 2024. 3, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

J. S. Chung, A. Nagrani, and A. Zisserman. V oxceleb2: Deep speaker recognition. InINTERSPEECH, 2018. 2

work page 2018
[7]

Instructblip: Towards general-purpose vision- language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision- language models with instruction tuning.Advances in neural information processing systems, 36:49250–49267, 2023. 1, 3

work page 2023
[8]

Scaling egocentric vision: The epic-kitchens dataset

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European conference on computer vision (ECCV), pages 720–736, 2018. 2

work page 2018
[9]

Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing

Xinyu Fang, Kangrui Mao, Haodong Duan, Xiangyu Zhao, Yining Li, Dahua Lin, and Kai Chen. Mmbench-video: A long-form multi-shot benchmark for holistic video under- standing. InAdvances in Neural Information Processing Sys- tems, pages 89098–89124. Curran Associates, Inc., 2024. 3

work page 2024
[10]

Vita: Towards open-source interactive omni multimodal llm.arXiv preprint arXiv:2408.05211, 2024

Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Shaoqi Dong, Xiong Wang, Di Yin, Long Ma, et al. Vita: Towards open-source interactive omni multimodal llm.arXiv preprint arXiv:2408.05211, 2024. 3, 5, 6

work page arXiv 2024
[11]

Mme: A comprehensive evaluation bench- mark for multimodal large language models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation bench- mark for multimodal large language models. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. 2

work page 2025
[12]

Video- mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shao- hui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, En- hong Chen, Caifeng Shan, Ran He, and Xing Sun. Video- mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis...

work page 2025
[13]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 1, 3

work page 2025
[14]

Vita-1.5: Towards gpt-4o level real-time vision and speech interaction

Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yun- hang Shen, Xiaoyu Liu, Yangze Li, Zuwei Long, Het- ing Gao, Ke Li, et al. Vita-1.5: Towards gpt-4o level real-time vision and speech interaction.arXiv preprint arXiv:2501.01957, 2025. 5, 6

work page arXiv 2025
[15]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

Google Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. 2, 6, 7

work page 2025
[16]

Gemini: A family of highly capable multimodal models, 2025

Google Gemini Team. Gemini: A family of highly capable multimodal models, 2025. 3, 5, 6, 7

work page 2025
[17]

Gemmeke, Daniel P

Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human- labeled dataset for audio events. InProc. IEEE ICASSP 2017, New Orleans, LA, 2017. 2

work page 2017
[18]

Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities,

Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Ki- ran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, and Dinesh Manocha. Gama: A large audio-language model with advanced audio under- standing and complex reasoning abilities.arXiv preprint arXiv:2406.11768, 2024. 1

work page arXiv 2024
[19]

Av-odyssey bench: Can your multimodal llms really understand audio-visual in- formation?, 2024

Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mo- fan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, and Xiangyu Yue. Av-odyssey bench: Can your multimodal llms really understand audio-visual in- formation?, 2024. 2, 3

work page 2024
[20]

Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Ba- tra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answer- ing. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017. 2

work page 2017
[21]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In 9 Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18995–19012, 2022. 2

work page 2022
[22]

Onellm: One framework to align all modalities with language

Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, and Xi- angyu Yue. Onellm: One framework to align all modalities with language. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26584– 26595, 2024. 1, 6

work page 2024
[23]

Onellm: One framework to align all modalities with language, 2025

Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, and Xi- angyu Yue. Onellm: One framework to align all modalities with language, 2025. 3, 5

work page 2025
[24]

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omni- modal understanding for multimodal llms.arXiv preprint arXiv:2502.04326, 2025. 2

work page internal anchor Pith review arXiv 2025
[25]

Worldsense: Evaluating real-world omni- modal understanding for multimodal llms, 2025

Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omni- modal understanding for multimodal llms, 2025. 3

work page 2025
[26]

Talknce: Improving active speaker detection with talk-aware contrastive learning

Chaeyoung Jung, Suyeon Lee, Kihyun Nam, Kyeongha Rho, You Jin Kim, Youngjoon Jang, and Joon Son Chung. Talknce: Improving active speaker detection with talk-aware contrastive learning. InICASSP 2024-2024 IEEE Interna- tional Conference on Acoustics, Speech and Signal Process- ing (ICASSP), pages 8391–8395. IEEE, 2024. 1, 2

work page 2024
[27]

Look who’s talking: Active speaker detec- tion in the wild.arXiv preprint arXiv:2108.07640, 2021

You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon, Bong-Jin Lee, Youngki Kwon, and Joon Son Chung. Look who’s talking: Active speaker detec- tion in the wild.arXiv preprint arXiv:2108.07640, 2021. 1, 2, 5

work page arXiv 2021
[28]

Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities.arXiv preprint arXiv:2402.01831, 2024

Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities.arXiv preprint arXiv:2402.01831, 2024. 1

work page arXiv 2024
[29]

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix- iao Ge, and Ying Shan. Seed-bench: Benchmarking mul- timodal llms with generative comprehension.arXiv preprint arXiv:2307.16125, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Learning to answer questions in dynamic audio-visual scenarios

Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji- Rong Wen, and Di Hu. Learning to answer questions in dynamic audio-visual scenarios. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19108–19118, 2022. 1, 3

work page 2022
[31]

Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InIn- ternational conference on machine learning, pages 19730– 19742. PMLR, 2023. 3

work page 2023
[32]

Mvbench: A comprehensive multi- modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multi- modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22195–22206, 2024. 2, 3

work page 2024
[33]

Omnibench: Towards the future of univer- sal omni-language models, 2025

Yizhi Li, Ge Zhang, Yinghao Ma, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Zekun Wang, Jian Yang, Siwei Wu, Xingwei Qu, Jinjie Shi, Xinyue Zhang, Zhenzhu Yang, Xiangzhou Wang, Zhaoxiang Zhang, Zachary Liu, Emmanouil Benetos, Wenhao Huang, and Chenghua Lin. Omnibench: Towards the future of univer- sal omni-language models, 2025. 3

work page 2025
[34]

Video-llava: Learning united visual repre- sentation by alignment before projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual repre- sentation by alignment before projection. InProceedings of the 2024 Conference on Empirical Methods in Natural Lan- guage Processing, pages 5971–5984, 2024. 1

work page 2024
[35]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 1, 3

work page 2023
[36]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InCVPR,

work page
[37]

Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vi- sion, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vi- sion, pages 216–233. Springer, 2024. 2

work page 2024
[38]

TempCompass: Do Video LLMs Really Understand Videos?

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcom- pass: Do video llms really understand videos?arXiv preprint arXiv:2403.00476, 2024. 2

work page internal anchor Pith review arXiv 2024
[39]

Ola: Pushing the frontiers of omni-modal language model, 2025

Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Ola: Pushing the frontiers of omni-modal language model, 2025. 3, 5, 6

work page 2025
[40]

Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action, 2023

Jiasen Lu, Christopher Clark, Sangho Lee, Zichen Zhang, Savya Khosla, Ryan Marten, Derek Hoiem, and Aniruddha Kembhavi. Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action, 2023. 3, 5, 6

work page 2023
[41]

Egoschema: A diagnostic benchmark for very long- form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long- form video language understanding. InAdvances in Neural Information Processing Systems, pages 46212–46244. Cur- ran Associates, Inc., 2023. 2, 3

work page 2023
[42]

Chartqa: A benchmark for question answer- ing about charts with visual and logical reasoning

Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answer- ing about charts with visual and logical reasoning. InFind- ings of the association for computational linguistics: ACL 2022, pages 2263–2279, 2022. 2

work page 2022
[43]

Docvqa: A dataset for vqa on document images

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. InProceed- ings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021. 2

work page 2021
[44]

Phi-4-mini technical report: Compact yet power- ful multimodal language models via mixture-of-loras, 2025

Microsoft. Phi-4-mini technical report: Compact yet power- ful multimodal language models via mixture-of-loras, 2025. 3, 5, 6

work page 2025
[45]

Nagrani, J

A. Nagrani, J. S. Chung, and A. Zisserman. V oxceleb: a large-scale speaker identification dataset. InINTER- SPEECH, 2017. 2

work page 2017
[46]

Unitalk: Towards universal active speaker detection in real world scenarios.arXiv preprint arXiv:2505.21954, 2025

Le Thien Phuc Nguyen, Zhuoran Yu, Khoa Quang Nhat Cao, Yuwei Guo, Tu Ho Manh Pham, Tuan Tai Nguyen, Toan Ngo Duc V o, Lucas Poon, Soochahn Lee, and Yong Jae Lee. Unitalk: Towards universal active speaker detection in real world scenarios.arXiv preprint arXiv:2505.21954, 2025. 1, 2, 5 10

work page arXiv 2025
[47]

Laser: Lip landmark assisted speaker detection for robust- ness.arXiv preprint arXiv:2501.11899, 2025

Le Thien Phuc Nguyen, Zhuoran Yu, and Yong Jae Lee. Laser: Lip landmark assisted speaker detection for robust- ness.arXiv preprint arXiv:2501.11899, 2025. 1, 2

work page arXiv 2025
[48]

Video-bench: A com- prehensive benchmark and toolkit for evaluating video-based large language models, 2023

Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, and Li Yuan. Video-bench: A com- prehensive benchmark and toolkit for evaluating video-based large language models, 2023. 3

work page 2023
[49]

Ava active speaker: An audio-visual dataset for active speaker detection

Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Rad- hika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, et al. Ava active speaker: An audio-visual dataset for active speaker detection. InICASSP 2020-2020 IEEE international conference on acoustics, speech and sig- nal processing (ICASSP), pages...

work page 2020
[50]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019. 2

work page 2019
[51]

PandaGPT: One Model To Instruction-Follow Them All

Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all.arXiv preprint arXiv:2305.16355, 2023. 1, 3, 5, 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

video-salmonn: Speech-enhanced audio-visual large language models.arXiv preprint arXiv:2406.15704, 2024

Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-salmonn: Speech-enhanced audio-visual large language models.arXiv preprint arXiv:2406.15704, 2024. 1

work page arXiv 2024
[53]

Coin: A large-scale dataset for comprehensive instructional video analysis

Yansong Tang, Dajun Ding, Yongming Rao, Yu Zheng, Danyang Zhang, Lili Zhao, Jiwen Lu, and Jie Zhou. Coin: A large-scale dataset for comprehensive instructional video analysis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1207– 1216, 2019. 2

work page 2019
[54]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[55]

Loconet: Long-short context network for active speaker detection

Xizi Wang, Feng Cheng, and Gedas Bertasius. Loconet: Long-short context network for active speaker detection. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 18462–18472, 2024. 1, 2

work page 2024
[56]

Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural In- formation Processing Systems, 37:113569–113697, 2024

Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sad- hika Malladi, et al. Charxiv: Charting gaps in realistic chart understanding in multimodal llms.Advances in Neural In- formation Processing Systems, 37:113569–113697, 2024. 2

work page 2024
[57]

Next-qa: Next phase of question-answering to explaining temporal actions

Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa: Next phase of question-answering to explaining temporal actions. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 9777–9786, 2021. 2

work page 2021
[58]

Video question answer- ing via gradually refined attention over appearance and mo- tion

Dejing Xu, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. Video question answer- ing via gradually refined attention over appearance and mo- tion. InProceedings of the 25th ACM International Confer- ence on Multimedia, page 1645–1653, New York, NY , USA,

work page
[59]

Association for Computing Machinery. 3

work page
[60]

Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2. 5-omni technical report.arXiv preprint arXiv:2503.20215, 2025. 1, 5, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[61]

Qwen3-Omni Technical Report

Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025. 1, 2, 3, 5, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[62]

Avqa: A dataset for audio-visual question answering on videos

Pinci Yang, Xin Wang, Xuguang Duan, Hong Chen, Runze Hou, Cong Jin, and Wenwu Zhu. Avqa: A dataset for audio-visual question answering on videos. InProceedings of the 30th ACM International Conference on Multimedia, page 3480–3491, New York, NY , USA, 2022. Association for Computing Machinery. 1, 3

work page 2022
[63]

Activitynet-qa: A dataset for understanding complex web videos via question answer- ing.Proceedings of the AAAI Conference on Artificial Intel- ligence, 33(01):9127–9134, 2019

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yuet- ing Zhuang, and Dacheng Tao. Activitynet-qa: A dataset for understanding complex web videos via question answer- ing.Proceedings of the AAAI Conference on Artificial Intel- ligence, 33(01):9127–9134, 2019. 3

work page 2019
[64]

Anygpt: Unified multimodal llm with discrete sequence modeling, 2025

Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yu- Gang Jiang, and Xipeng Qiu. Anygpt: Unified multimodal llm with discrete sequence modeling, 2025. 3, 5, 6

work page 2025
[65]

Speechgpt: Empowering large language models with intrinsic cross- modal conversational abilities,

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Empow- ering large language models with intrinsic cross-modal con- versational abilities.arXiv preprint arXiv:2305.11000, 2023. 1

work page arXiv 2023
[67]

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video un- derstanding.arXiv preprint arXiv:2306.02858, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[68]

Stream-omni: Simultaneous multimodal in- teractions with large language-vision-speech model.arXiv preprint arXiv:2506.13642, 2025

Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, and Yang Feng. Stream-omni: Simultaneous multimodal in- teractions with large language-vision-speech model.arXiv preprint arXiv:2506.13642, 2025. 3

work page arXiv 2025
[69]

Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities, 2025

Ziwei Zhou, Rui Wang, and Zuxuan Wu. Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities, 2025. 3

work page 2025
[70]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mo- hamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models.arXiv preprint arXiv:2304.10592, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[71]

reason for the answer

Daniil Zverev, Thadd ¨aus Wiedemer, Ameya Prabhu, Matthias Bethge, Wieland Brendel, and A Koepke. Vg- gsounder: Audio-visual evaluations for foundation models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 1027–1037, 2025. 1, 2, 3 11 See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multim...

work page 2025