XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

Alan Yuille; Chao Huang; Emad Barsoum; Jialian Wu; Jiang Liu; Xiaodong Yu; Ximeng Sun; Xingrui Wang; Ze Wang; Zicheng Liu

arxiv: 2510.15148 · v2 · submitted 2025-10-16 · 💻 cs.CV · cs.AI

Xingrui Wang, Jiang Liu, Chao Huang, Xiaodong Yu, Ze Wang, Ximeng Sun, Jialian Wu, Alan Yuille, Emad Barsoum, Zicheng Liu

Pith reviewed 2026-05-18 05:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords omni-modal large language models · cross-modal consistency · benchmark · modality disparity · directional imbalance · spatial temporal reasoning · Gemini 2.5 Pro

The pith

Current omni-modal models show clear biases and fail to reason consistently across text, vision, and audio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents XModBench, a benchmark containing over 60,000 questions that tests whether omni-modal language models can maintain the same level of understanding and accuracy no matter which combination of text, vision, or audio carries the question and answer. It targets three specific problems: weak performance on spatial and temporal tasks, drops in accuracy when the same content shifts from text to audio, and uneven consistency depending on which modality serves as the reference context. A reader would care because these gaps mean models cannot yet be trusted for reliable reasoning in mixed-media settings where inputs and outputs cross modalities freely.

Core claim

XModBench shows that even Gemini 2.5 Pro scores below 60 percent accuracy on spatial and temporal reasoning, loses substantial accuracy when the same content is conveyed through audio rather than text, and is less consistent when vision rather than text serves as context, indicating that present OLLMs have not reached modality-invariant reasoning.

What carries the argument

XModBench, a tri-modal benchmark of 60,828 multiple-choice questions across five task families that covers every possible question-answer modality pair to isolate consistency, disparity, and directional imbalance.
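
As a rough illustration of what "every possible question-answer modality pair" amounts to, the sketch below enumerates ordered pairs of distinct modalities. It assumes that the context and the answer candidates always use different modalities, which is what yields six configurations; the names `MODALITIES` and `modality_configurations` are hypothetical and not taken from the benchmark's tooling.

```python
from itertools import permutations

# Assumed modality labels; the benchmark's own identifiers may differ.
MODALITIES = ("text", "vision", "audio")


def modality_configurations(modalities=MODALITIES):
    """Return every ordered (context, candidates) pair of distinct modalities."""
    return list(permutations(modalities, 2))


configs = modality_configurations()
for context, candidates in configs:
    print(f"{context} -> {candidates}")
```

Three modalities give 3 × 2 = 6 ordered pairs, so each underlying instance can be posed once per configuration.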

If this is right

Spatial and temporal reasoning must be strengthened as a distinct capability.
Audio-to-text performance gaps need direct attention to reduce modality disparity.
Consistency must be raised when vision provides context to match text-context levels.
Development focus should shift from general question answering toward explicit modality-invariance tests.
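
One way to picture an explicit modality-invariance test is to score each instance across all of its configurations rather than averaging every question independently. The sketch below is an assumption about how such a check could look, not the paper's metric: `all_config_accuracy` and `mean_accuracy` are hypothetical helpers, and the `toy` results are invented for illustration.

```python
def mean_accuracy(results):
    """Plain accuracy over every (instance, configuration) answer."""
    total = sum(len(per_config) for per_config in results.values())
    if total == 0:
        raise ValueError("results contains no answers")
    correct = sum(sum(per_config.values()) for per_config in results.values())
    return correct / total


def all_config_accuracy(results):
    """Fraction of instances answered correctly under every configuration."""
    if not results:
        raise ValueError("results is empty")
    solved = sum(all(per_config.values()) for per_config in results.values())
    return solved / len(results)


# Invented toy data: instance id -> {(context, candidates): answered correctly?}
toy = {
    "q1": {("text", "audio"): True, ("audio", "text"): True},
    "q2": {("text", "audio"): True, ("audio", "text"): False},
    "q3": {("text", "audio"): False, ("audio", "text"): False},
    "q4": {("text", "audio"): True, ("audio", "text"): True},
}

plain = mean_accuracy(toy)
strict = all_config_accuracy(toy)
```

A gap between the two numbers points to instances the model solves in one configuration but not another, which is the kind of behavior a modality-invariance test is meant to expose.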

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could serve as a repeated test during model training to track reductions in modality bias over time.
Similar consistency checks might apply to other multi-modal systems that combine vision, audio, and text outside language models.
Closing these gaps would support more dependable AI tools for real-world tasks that mix audio, images, and text.

Load-bearing premise

The benchmark questions carry identical meaning and difficulty across every modality so that any measured differences come from the model rather than from changes in how the questions are presented.

What would settle it

A model that scores comparably high accuracy on every spatial, temporal, and cross-modality subset of XModBench would show the reported disparities do not hold.

Figures

Figures reproduced from arXiv: 2510.15148 by Alan Yuille, Chao Huang, Emad Barsoum, Jialian Wu, Jiang Liu, Xiaodong Yu, Ximeng Sun, Xingrui Wang, Ze Wang, Zicheng Liu.

**Figure 1.** Overview of XModBench. (a) Instances are built from aligned text–image–audio triplets; (b) instantiated into six modality configurations by permuting context and candidate modalities; (c) spanning five domains with 17 subtasks and 60,828 question–answer pairs; and (d) illustrated with example multiple-choice questions under balanced modality settings.

**Figure 2.** Distribution of XModBench's questions across five task families with specific subtasks. XModBench covers five task families with seventeen subtasks, spanning perception, spatial reasoning, temporal reasoning, linguistic understanding, and external knowledge (see …

**Figure 3.** XModBench task examples. We show sample questions from six subtasks in the bench…

**Figure 4.** Modality disparity across different configura…

**Figure 5.** Directional imbalance: accuracy gaps between paired inverse settings among audio, vision…

**Figure 6.** Failure cases. (a) Gemini 2.5 Pro correctly identifies a didgeridoo in text but fails to match…

**Figure 7.** … showing a screenshot of the interface and example questions. For each subtask, we collected responses from 10 valid participants per modality configuration.

Original abstract

Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions in question-answer pairs, enabling fine-grained diagnosis of an OLLM's modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) reveals persistent modality disparities, with performance dropping substantially when the same semantic content is conveyed through audio rather than text, and (iii) shows systematic directional imbalance, exhibiting lower consistency when vision serves as context compared to text. These findings indicate that current OLLMs remain far from truly modality-invariant reasoning and position XModBench as a fundamental diagnostic tool for evaluating and improving cross-modal competence. All data and evaluation tools will be available at https://xingruiwang.github.io/projects/XModBench/.
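
The two gap measures named in the abstract can be sketched from a table of per-configuration accuracies. This is a minimal sketch under stated assumptions: the accuracy values are hypothetical placeholders rather than results from the paper, configurations are keyed as (context modality, candidate modality), and the function names are illustrative.

```python
# Hypothetical accuracies in percent, keyed by (context modality, candidate modality).
accuracy = {
    ("text", "vision"): 90.0,
    ("vision", "text"): 84.0,
    ("text", "audio"): 80.0,
    ("audio", "text"): 82.0,
    ("audio", "vision"): 76.0,
    ("vision", "audio"): 70.0,
}


def directional_imbalance(acc, a, b):
    """Accuracy gap between a configuration and its inverse: (a -> b) minus (b -> a)."""
    return acc[(a, b)] - acc[(b, a)]


def context_disparity(acc, a, b, candidates):
    """Accuracy gap when the context modality changes from a to b, candidates fixed."""
    return acc[(a, candidates)] - acc[(b, candidates)]


# Same candidates (vision), context conveyed as text versus audio.
text_vs_audio = context_disparity(accuracy, "text", "audio", candidates="vision")
# Text as context with vision candidates versus the inverse setting.
text_vision_gap = directional_imbalance(accuracy, "text", "vision")
```

Both quantities are signed differences in percentage points, so a value of zero would correspond to the modality-invariant behavior the benchmark is probing for.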

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

XModBench gives a practical new way to measure modality consistency gaps in omni-models, but the headline claims rest on unverified semantic equivalence in how questions were built across text, audio, and vision.

The paper's main contribution is a benchmark that hits all six modality combinations in question-answer pairs and tracks consistency, disparity, and directional imbalance on top of standard accuracy. That setup is new and lets them report concrete patterns like Gemini 2.5 Pro staying below 60 percent on spatial and temporal tasks, clear drops when content moves to audio, and lower consistency when vision is the context rather than text. Those numbers are useful for anyone trying to diagnose whether current models are actually modality-invariant or just biased toward text-like inputs. The work is straightforward empirical work with no circular math or invented parameters, and it ships the data and tools, which counts as real value.

The soft spot is exactly the one the stress test flags. The results attribute performance gaps to model limitations only if every question carries identical meaning, difficulty, and information load no matter the modality. The abstract describes the coverage but does not show quantitative checks such as human equivalence ratings or information-loss scores for the conversions. If the full paper has only procedural descriptions without those controls, then some of the reported disparities could trace back to artifacts in speech clarity or visual ambiguity instead of reasoning limits. That makes the central interpretation moderate rather than ironclad.

This is for researchers who build or evaluate multimodal models and want a diagnostic that goes beyond generic QA scores. A reader working on training methods or evaluation standards will find the empirical patterns worth looking at. It deserves a serious referee because the benchmark scale and coverage are novel enough to warrant scrutiny, even if the validation details need tightening. I would send it to review with a specific request for the equivalence checks and any difficulty parity metrics they collected.

Referee Report

2 major / 2 minor

Summary. The paper introduces XModBench, a tri-modal benchmark with 60,828 multiple-choice questions spanning five task families and all six modality compositions (text/vision/audio for questions and answers). It evaluates omni-language models on cross-modal consistency, reporting that even Gemini 2.5 Pro achieves <60% accuracy on spatial/temporal reasoning, shows substantial performance drops when content is presented via audio versus text, and exhibits lower consistency when vision is the context modality compared to text. The work positions the benchmark as a diagnostic tool revealing that current OLLMs lack modality-invariant reasoning.

Significance. If the benchmark questions are shown to maintain semantic equivalence and comparable difficulty across modalities, the results would provide a valuable, fine-grained diagnostic for identifying specific limitations in OLLM cross-modal reasoning. The systematic coverage of all modality pairs and the reported imbalances offer concrete directions for model improvement, with the public release of data and tools strengthening reproducibility.

major comments (2)

§3 (Benchmark Construction): The description of question generation and modality conversion (e.g., text-to-audio via TTS, text-to-vision) provides only procedural steps without quantitative validation such as human equivalence ratings, difficulty parity scores, or information-content metrics between modalities. This directly undermines the central claims of modality disparity and directional imbalance, as performance gaps could arise from modality-specific artifacts rather than model reasoning limits.
§4 (Experiments and Results): The reported accuracy drops (e.g., audio vs. text) and consistency imbalances are presented without controls or ablations that isolate question difficulty or presentation effects from model behavior; for instance, no comparison of model performance on the same questions in their original versus converted forms is shown to confirm equivalence.

minor comments (2)

Table 1 or the dataset statistics section would benefit from an explicit breakdown of question counts per task family and per modality composition to allow readers to assess balance.
The abstract and §5 could more precisely define 'directional imbalance' with a short formal statement or equation rather than relying solely on descriptive text.
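
For illustration, one possible formal statement of the kind the last comment asks for is given below. The notation is an assumption introduced here, not the paper's: $\mathrm{Acc}(X \to Y)$ stands for accuracy when the context is given in modality $X$ and the answer candidates in modality $Y$. The first quantity follows the description of paired inverse settings in the Figure 5 caption; the second is a matching guess at modality disparity.

```latex
% Requires amsmath for \text. Notation is illustrative, not the paper's.
\[
\Delta_{\mathrm{dir}}(X, Y) = \mathrm{Acc}(X \to Y) - \mathrm{Acc}(Y \to X),
\qquad X \neq Y,\quad X, Y \in \{\text{text}, \text{vision}, \text{audio}\}
\]
\[
\Delta_{\mathrm{mod}}(X, X'; Y) = \mathrm{Acc}(X \to Y) - \mathrm{Acc}(X' \to Y),
\qquad X, X' \neq Y
\]
```

Under this reading, a modality-invariant model would have $\Delta_{\mathrm{dir}} = 0$ and $\Delta_{\mathrm{mod}} = 0$ for every choice of modalities.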

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas for strengthening the validation of XModBench. We address each major comment below and will incorporate the suggested quantitative validations and controls in the revised manuscript to better support our claims on modality-invariant reasoning.

Referee: §3 (Benchmark Construction): The description of question generation and modality conversion (e.g., text-to-audio via TTS, text-to-vision) provides only procedural steps without quantitative validation such as human equivalence ratings, difficulty parity scores, or information-content metrics between modalities. This directly undermines the central claims of modality disparity and directional imbalance, as performance gaps could arise from modality-specific artifacts rather than model reasoning limits.

Authors: We agree that quantitative validation is essential to rule out conversion artifacts. In the revised §3, we will add results from a human study with 50 annotators providing equivalence and difficulty ratings (5-point scale) on 500 sampled questions across all modality pairs, along with inter-annotator agreement. We will also report information-content metrics using cross-modal embedding similarities (e.g., via CLIP for vision-text and audio-text models) and difficulty parity scores. These additions will confirm high semantic equivalence and support that observed disparities reflect model limitations.
Revision: yes

Referee: §4 (Experiments and Results): The reported accuracy drops (e.g., audio vs. text) and consistency imbalances are presented without controls or ablations that isolate question difficulty or presentation effects from model behavior; for instance, no comparison of model performance on the same questions in their original versus converted forms is shown to confirm equivalence.

Authors: We concur that explicit controls strengthen the results. The revised §4 will include ablations evaluating the same questions in original text form versus converted audio and vision forms for multiple models. This directly compares performance to isolate modality effects from content difficulty. We will also add controls for presentation effects through standardized input formatting and prompting. These will provide clearer evidence that the reported drops and imbalances arise from model behavior rather than artifacts.
Revision: yes
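
A paired comparison of the kind described in the §4 response could be analyzed with an exact McNemar test, which looks only at questions where the two presentation forms disagree. This is a sketch of one standard option, not a method the paper or the rebuttal commits to; `mcnemar_exact` is a hypothetical helper and the correctness vectors are invented.

```python
from math import comb


def mcnemar_exact(text_correct, audio_correct):
    """Two-sided exact McNemar test on paired per-question correctness flags.

    Returns (b, c, p) where b counts questions right in text but wrong in audio,
    c counts the reverse, and p is the exact binomial p-value under H0: b and c
    are equally likely.
    """
    if len(text_correct) != len(audio_correct):
        raise ValueError("both lists must cover the same questions")
    b = sum(t and not a for t, a in zip(text_correct, audio_correct))
    c = sum(a and not t for t, a in zip(text_correct, audio_correct))
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, nothing to test
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)


# Invented example: ten questions, all right in text, six missed in audio.
text_flags = [True] * 10
audio_flags = [True] * 4 + [False] * 6
b, c, p = mcnemar_exact(text_flags, audio_flags)
```

A small p-value only says that accuracy differs between the two forms of the same questions. Separating model behavior from conversion artifacts would still need the human equivalence checks discussed in the §3 exchange.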

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct model evaluations

The paper introduces XModBench, a new tri-modal dataset with 60,828 questions, and reports empirical accuracy results on existing models such as Gemini 2.5 Pro. No derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the provided abstract or described structure. Central claims rest on direct measurements of model performance across modality compositions rather than any reduction to inputs by construction. The analysis is self-contained as standard benchmark construction and evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper. The central claims rest on the design and validation of the question set rather than on mathematical axioms, free parameters, or new postulated entities.

pith-pipeline@v0.9.0 · 5794 in / 1139 out tokens · 41411 ms · 2026-05-18T05:41:53.488739+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

struggles with spatial and temporal reasoning, achieving less than 60% accuracy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
cs.CV · 2026-04 · unverdicted · novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...

Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
cs.AI · 2026-04 · unverdicted · novelty 6.0

Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.

[1] [1]

Algazi, R.O

V .R. Algazi, R.O. Duda, D.M. Thompson, and C. Avendano. The cipic hrtf database. InProceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No.01TH8575), pp. 99–102,

work page 2001

[2] [2]

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia

doi: 10.1109/ASPAA.2001.969552. Boyuan Chen, Zhuo Xu, Sean Kirmani, Brain Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14455–14465,

work page doi:10.1109/aspaa.2001.969552 2001

[3] [3]

EmotionLines: An Emotion Corpus of Multi-Party Conversations

Ssu-Yen Chen, Chao-Chun Hsu, Chuan-Chun Kuo, and Lun-Wei Ku. Emotionlines: An emotion corpus of multi-party conversations.arXiv preprint arXiv:1802.08379,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

VidHal: Benchmarking Temporal Hallucinations in Vision LLMs

Wey Yeh Choong, Yangyang Guo, and Mohan Kankanhalli. Vidhal: Benchmarking temporal hallu- cinations in vision llms.arXiv preprint arXiv:2411.16771,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capa- bilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024a. URLhttps://arxiv.org/abs/ 2306.13394. Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, C...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[8] [8]

Av-odyssey bench: Can your multimodal llms really understand audio-visual information?arXiv preprint arXiv:2412.02611,

MIT License, Accessed: YYYY-MM-DD. Kaixiong Gong, Kaituo Feng, Bohao Li, Yibing Wang, Mofan Cheng, Shijia Yang, Jiaming Han, Benyou Wang, Yutong Bai, Zhuoran Yang, et al. Av-odyssey bench: Can your multimodal llms really understand audio-visual information?arXiv preprint arXiv:2412.02611,

work page arXiv

[9] [9]

Fireredtts-1s: An upgraded streamable foundation text-to-speech system.arXiv preprint arXiv:2503.20499,

Hao-Han Guo, Yao Hu, Fei-Yu Shen, Xu Tang, Yi-Chen Wu, Feng-Long Xie, and Kun Xie. Fireredtts-1s: An upgraded streamable foundation text-to-speech system.arXiv preprint arXiv:2503.20499,

work page arXiv

[10] [10]

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

10 Preprint Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms.arXiv preprint arXiv:2502.04326,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Baichuan-omni-1.5 technical report

Yadong Li, Jun Liu, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, et al. Baichuan-omni-1.5 technical report.arXiv preprint arXiv:2501.15368,

work page arXiv

[12] Yizhi Li, Ge Zhang, Yinghao Ma, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Zekun Wang, Jian Yang, et al. Omnibench: Towards the future of universal omni-language models. arXiv preprint arXiv:2409.15272, 2024b.

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. M...

[13] Lidong Lu, Guo Chen, Zhiqi Li, Yicheng Liu, and Tong Lu. Av-reasoner: Improving and benchmarking clue-grounded audio-visual counting for mllms. arXiv preprint arXiv:2506.05328,

[14] Juan F. Montesinos, Olga Slizovskaia, and Gloria Haro. Solos: A dataset for audio-visual music analysis. 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP), pp. 1–6,

[15] S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. Mmau: A massive multi-task audio understanding and reasoning benchmark. arXiv preprint arXiv:2410.19168,

[16] Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355,

[17] Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, and Chao Zhang. video-salmonn: Speech-enhanced audio-visual large language models. arXiv preprint arXiv:2406.15704,

[18] Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, and Tae-Hyun Oh. Avhbench: A cross-modal hallucination benchmark for audio-visual large language models. arXiv preprint arXiv:2410.18325,

[19] Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530,

[20] Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy F Chen. Audiobench: A universal benchmark for audio large language models. arXiv preprint arXiv:2406.16020,

[21] Zhenghao Xing, Xiaowei Hu, Chi-Wing Fu, Wenhai Wang, Jifeng Dai, and Pheng-Ann Heng. Echoink-r1: Exploring audio-visual reasoning in multimodal llms via reinforcement learning. arXiv preprint arXiv:2505.04623,

[22] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215,

[23] Yudong Yang, Jimin Zhuang, Guangzhi Sun, Changli Tang, Yixuan Li, Peihan Li, Yifan Jiang, Wei Li, Zejun Ma, and Chao Zhang. Acvubench: Audio-centric video understanding benchmark. arXiv preprint arXiv:2503.19951,

[24] Heeseung Yun, Youngjae Yu, Wonsuk Yang, Kangil Lee, and Gunhee Kim. Pano-avqa: Grounded audio-visual question answering on 360° videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2031–2041,

[25] Xiang Zhang, Senyu Li, Ning Shi, Bradley Hauer, Zijun Wu, Grzegorz Kondrak, Muhammad Abdul-Mageed, and Laks VS Lakshmanan. Cross-modal consistency in multimodal large language models. arXiv preprint arXiv:2411.09273,

[26] Hao Zhong, Muzhi Zhu, Zongze Du, Zheng Huang, Canyu Zhao, Mingyu Liu, Wen Wang, Hao Chen, and Chunhua Shen. Omni-r1: Reinforcement learning for omnimodal reasoning via two-system collaboration. arXiv preprint arXiv:2505.20256, 2025.

[27]

Mlvu: Benchmarking multi-task long video understanding

Ziwei Zhou, Rui Wang, and Zuxuan Wu. Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities. arXiv preprint arXiv:2505.17862,

[28]

Yet, real-world multimodal scenarios are more complex: information from multiple modalities often arrives simultaneously and must be processed in an integrated manner

APPENDIX

A TASK-SPECIFIC MODEL PERFORMANCE

A.1 TASK 1: PERCEPTUAL TASK

Table 3: T1 (Perception) Results

                              Perception Task
Model           Task          General  General - Hard  Scene  Instruments  Instruments-multi
Gemini 2.5 Pro  Audio→Text    81.05    71.39           67.20  47.75        44.09
                Audio→Vision  76.26    65.25           64.60  44.30        36.60
                Text→Audio    79.95    79.22           75.05  59.05        49.30
                Text→V...

[29]

Event A→Event B→Event C

filter each instance on whether the audio and video frame are clear enough to be heard and seen, and whether the image frame and audio both match the category name.

Fine-grained Categories. This subtask uses the same pool of video clips as the General Categories setting. The difference lies in reorganizing the activity classes into eight fine-grained clusters: Animal sounds, Musical ins...
