XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
Pith reviewed 2026-05-18 05:41 UTC · model grok-4.3
The pith
Current omni-modal models show clear biases and fail to reason consistently across text, vision, and audio.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
XModBench shows that even Gemini 2.5 Pro reaches under 60 percent accuracy on spatial and temporal reasoning, drops substantially when the same content is presented as audio rather than text, and is less consistent when vision rather than text serves as context, indicating that current omni-modal large language models (OLLMs) have not reached modality-invariant reasoning.
What carries the argument
XModBench, a tri-modal benchmark of 60,828 multiple-choice questions across five task families that covers every possible question-answer modality pair to isolate consistency, disparity, and directional imbalance.
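The "every possible question-answer modality pair" coverage can be enumerated directly. The following is a minimal sketch, assuming the six compositions named in the abstract are the ordered pairs of distinct modalities (one modality for the context, another for the candidate answers). The modality names and the `modality_compositions` helper are illustrative and not taken from the paper.

```python
from itertools import permutations

# Assumed modality names; the paper's own identifiers may differ.
MODALITIES = ("text", "vision", "audio")

def modality_compositions(modalities=MODALITIES):
    """Return every ordered (context, answer) pair of distinct modalities."""
    return list(permutations(modalities, 2))

compositions = modality_compositions()
for context, answer in compositions:
    print(f"{context} -> {answer}")
```

Under this reading, three modalities yield 3 x 2 = 6 ordered pairs, which matches the count stated in the abstract.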
If this is right
- Spatial and temporal reasoning must be strengthened as a distinct capability.
- Audio-to-text performance gaps need direct attention to reduce modality disparity.
- Consistency must be raised when vision provides context to match text-context levels.
- Development focus should shift from general question answering toward explicit modality-invariance tests.
Where Pith is reading between the lines
- The benchmark could serve as a repeated test during model training to track reductions in modality bias over time.
- Similar consistency checks might apply to other multi-modal systems that combine vision, audio, and text outside language models.
- Closing these gaps would support more dependable AI tools for real-world tasks that mix audio, images, and text.
Load-bearing premise
The benchmark questions carry identical meaning and difficulty across every modality so that any measured differences come from the model rather than from changes in how the questions are presented.
What would settle it
A model that achieves comparably high accuracy on every spatial, temporal, and cross-modality subset of XModBench would show that the reported disparities do not hold.
Original abstract
Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions in question-answer pairs, enabling fine-grained diagnosis of an OLLM's modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) reveals persistent modality disparities, with performance dropping substantially when the same semantic content is conveyed through audio rather than text, and (iii) shows systematic directional imbalance, exhibiting lower consistency when vision serves as context compared to text. These findings indicate that current OLLMs remain far from truly modality-invariant reasoning and position XModBench as a fundamental diagnostic tool for evaluating and improving cross-modal competence. All data and evaluation tools will be available at https://xingruiwang.github.io/projects/XModBench/.
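To make the "modality disparity" diagnostic concrete, here is a small sketch of how per-composition accuracy and an accuracy gap between two context modalities might be computed. The record layout, the function names, and the gap definition are assumptions made for illustration, not the paper's evaluation code, and the toy outcomes are made up.

```python
from collections import defaultdict

def accuracy_by_composition(records):
    """Map each (context, answer) modality pair to its accuracy.

    `records` is an iterable of (context, answer, is_correct) tuples;
    this record layout is an assumption made for the sketch.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for context, answer, is_correct in records:
        totals[(context, answer)] += 1
        hits[(context, answer)] += int(is_correct)
    return {pair: hits[pair] / totals[pair] for pair in totals}

def modality_disparity(acc, modality_a, modality_b, answer):
    """Accuracy gap when the context is in modality_a rather than
    modality_b, with the answer modality held fixed."""
    return acc[(modality_a, answer)] - acc[(modality_b, answer)]

# Made-up toy outcomes, not results from the paper.
records = [
    ("text", "vision", True), ("text", "vision", True),
    ("text", "vision", True), ("text", "vision", False),
    ("audio", "vision", True), ("audio", "vision", False),
    ("audio", "vision", False), ("audio", "vision", False),
]
acc = accuracy_by_composition(records)
gap = modality_disparity(acc, "text", "audio", "vision")
print(acc, gap)
```

A gap near zero would be what modality-invariant reasoning predicts; a large positive gap corresponds to the text-over-audio disparity the abstract describes.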
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces XModBench, a tri-modal benchmark with 60,828 multiple-choice questions spanning five task families and all six modality compositions (text/vision/audio for questions and answers). It evaluates omni-language models on cross-modal consistency, reporting that even Gemini 2.5 Pro achieves <60% accuracy on spatial/temporal reasoning, shows substantial performance drops when content is presented via audio versus text, and exhibits lower consistency when vision is the context modality compared to text. The work positions the benchmark as a diagnostic tool revealing that current OLLMs lack modality-invariant reasoning.
Significance. If the benchmark questions are shown to maintain semantic equivalence and comparable difficulty across modalities, the results would provide a valuable, fine-grained diagnostic for identifying specific limitations in OLLM cross-modal reasoning. The systematic coverage of all modality pairs and the reported imbalances offer concrete directions for model improvement, with the public release of data and tools strengthening reproducibility.
major comments (2)
- [§3] §3 (Benchmark Construction): The description of question generation and modality conversion (e.g., text-to-audio via TTS, text-to-vision) provides only procedural steps without quantitative validation such as human equivalence ratings, difficulty parity scores, or information-content metrics between modalities. This directly undermines the central claims of modality disparity and directional imbalance, as performance gaps could arise from modality-specific artifacts rather than model reasoning limits.
- [§4] §4 (Experiments and Results): The reported accuracy drops (e.g., audio vs. text) and consistency imbalances are presented without controls or ablations that isolate question difficulty or presentation effects from model behavior; for instance, no comparison of model performance on the same questions in their original versus converted forms is shown to confirm equivalence.
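The control requested in the §4 comment, comparing the same questions in original and converted form, amounts to a paired comparison. A minimal sketch under stated assumptions follows: each question has one boolean outcome per form, and the `paired_discordance` helper and its output keys are hypothetical names, not part of the paper's tooling. The example outcomes are made up.

```python
def paired_discordance(original_correct, converted_correct):
    """Compare per-question outcomes for the same questions in two forms.

    Both arguments are equal-length lists of booleans, aligned by question.
    Returns both accuracies plus the counts of questions answered correctly
    in only one of the two forms (the discordant pairs).
    """
    if len(original_correct) != len(converted_correct) or not original_correct:
        raise ValueError("need two non-empty, equal-length outcome lists")
    pairs = list(zip(original_correct, converted_correct))
    n = len(pairs)
    return {
        "acc_original": sum(o for o, _ in pairs) / n,
        "acc_converted": sum(c for _, c in pairs) / n,
        "only_original": sum(1 for o, c in pairs if o and not c),
        "only_converted": sum(1 for o, c in pairs if c and not o),
    }

# Made-up outcomes for five questions, not results from the paper.
original = [True, True, True, False, True]
converted = [True, False, True, False, False]
summary = paired_discordance(original, converted)
print(summary)
```

Reporting the discordant counts alongside the two accuracies shows whether a drop comes from a consistent subset of questions, which is the kind of evidence that could help separate conversion artifacts from model behavior.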
minor comments (2)
- [Table 1] Table 1 or the dataset statistics section would benefit from an explicit breakdown of question counts per task family and per modality composition to allow readers to assess balance.
- [§5] The abstract and §5 could more precisely define 'directional imbalance' with a short formal statement or equation rather than relying solely on descriptive text.
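As an illustration of the kind of short formal statement the last comment asks for, one possible formalization is sketched below. It is not the paper's definition: the symbol \(\Delta_{\mathrm{dir}}\) and the notation \(\mathrm{Acc}(m_1 \to m_2)\), meaning accuracy when the context is given in modality \(m_1\) and the candidate answers in modality \(m_2\), are assumptions introduced here.

```latex
\[
\Delta_{\mathrm{dir}}(m_1, m_2)
  \;=\; \mathrm{Acc}(m_1 \to m_2) \;-\; \mathrm{Acc}(m_2 \to m_1),
\qquad
m_1, m_2 \in \{\text{text}, \text{vision}, \text{audio}\},\; m_1 \neq m_2 .
\]
```

Under this reading, a model with no directional imbalance would have \(\Delta_{\mathrm{dir}}(m_1, m_2) = 0\) for every pair, and the sign indicates which direction is easier.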
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify key areas for strengthening the validation of XModBench. We address each major comment below and will incorporate the suggested quantitative validations and controls in the revised manuscript to better support our claims on modality-invariant reasoning.
Point-by-point responses
- Referee: [§3] §3 (Benchmark Construction): The description of question generation and modality conversion (e.g., text-to-audio via TTS, text-to-vision) provides only procedural steps without quantitative validation such as human equivalence ratings, difficulty parity scores, or information-content metrics between modalities. This directly undermines the central claims of modality disparity and directional imbalance, as performance gaps could arise from modality-specific artifacts rather than model reasoning limits.
  Authors: We agree that quantitative validation is essential to rule out conversion artifacts. In the revised §3, we will add results from a human study with 50 annotators providing equivalence and difficulty ratings (5-point scale) on 500 sampled questions across all modality pairs, along with inter-annotator agreement. We will also report information-content metrics using cross-modal embedding similarities (e.g., via CLIP for vision-text and audio-text models) and difficulty parity scores. These additions will confirm high semantic equivalence and support that observed disparities reflect model limitations.
  Revision: yes
- Referee: [§4] §4 (Experiments and Results): The reported accuracy drops (e.g., audio vs. text) and consistency imbalances are presented without controls or ablations that isolate question difficulty or presentation effects from model behavior; for instance, no comparison of model performance on the same questions in their original versus converted forms is shown to confirm equivalence.
  Authors: We concur that explicit controls strengthen the results. The revised §4 will include ablations evaluating the same questions in original text form versus converted audio and vision forms for multiple models. This directly compares performance to isolate modality effects from content difficulty. We will also add controls for presentation effects through standardized input formatting and prompting. These will provide clearer evidence that the reported drops and imbalances arise from model behavior rather than artifacts.
  Revision: yes
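The response to the §3 comment promises inter-annotator agreement for the human study but does not name a statistic. One common choice for more than two annotators is Fleiss' kappa, sketched below in plain Python as an assumption about how such a number could be computed. Fleiss' kappa treats ratings as unordered categories, so for a 5-point ordinal scale an ordinal-aware statistic such as Krippendorff's alpha may be the better fit. The example ratings are made up.

```python
import math

def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of annotators who gave item i the rating j;
    every item must be rated by the same number of annotators (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    # Observed agreement, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the overall share of each rating category.
    n_categories = len(counts[0])
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(s * s for s in shares)
    if p_e == 1.0:
        return math.nan  # only one category was ever used; kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)

# Made-up ratings: 2 items, 3 annotators, 2 rating categories.
kappa_perfect = fleiss_kappa([[3, 0], [0, 3]])
kappa_split = fleiss_kappa([[2, 1], [1, 2]])
print(kappa_perfect, kappa_split)
```

For the study described in the response, `counts` would have one row per sampled question and one column per point on the rating scale, provided every question receives the same number of ratings.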
Circularity Check
No circularity: empirical benchmark with direct model evaluations
The paper introduces XModBench, a new tri-modal dataset with 60,828 questions, and reports empirical accuracy results on existing models such as Gemini 2.5 Pro. No derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the provided abstract or described structure. Central claims rest on direct measurements of model performance across modality compositions rather than any reduction to inputs by construction. The analysis is self-contained as standard benchmark construction and evaluation.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "struggles with spatial and temporal reasoning, achieving less than 60% accuracy"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
  XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...
- Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
  Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
Appendix A: Task-Specific Model Performance
A.1 Task 1: Perceptual Task
Table 3: T1 (Perception) results
Model | Task | General | General - Hard | Scene | Instruments | Instruments-multi
Gemini 2.5 Pro | Audio → Text | 81.05 | 71.39 | 67.20 | 47.75 | 44.09
Gemini 2.5 Pro | Audio → Vision | 76.26 | 65.25 | 64.60 | 44.30 | 36.60
Gemini 2.5 Pro | Text → Audio | 79.95 | 79.22 | 75.05 | 59.05 | 49.30
Text → V...
... filter each instance on whether the audio is clearly audible, the video frame is clear, and both the image frame and the audio match the category name.
Fine-grained Categories. This subtask uses the same pool of video clips as the General Categories setting. The difference lies in reorganizing the activity classes into eight fine-grained clusters: Animal sounds, Musical ins...