
arxiv: 2510.15148 · v2 · submitted 2025-10-16 · 💻 cs.CV · cs.AI

XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models

Pith reviewed 2026-05-18 05:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords omni-modal large language models · cross-modal consistency · benchmark · modality disparity · directional imbalance · spatial temporal reasoning · Gemini 2.5 Pro

The pith

Current omni-modal models show clear biases and fail to reason consistently across text, vision, and audio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents XModBench, a benchmark containing over 60,000 questions that tests whether omni-modal language models can maintain the same level of understanding and accuracy no matter which combination of text, vision, or audio carries the question and answer. It targets three specific problems: weak performance on spatial and temporal tasks, drops in accuracy when the same content shifts from text to audio, and uneven consistency depending on which modality serves as the reference context. A reader would care because these gaps mean models cannot yet be trusted for reliable reasoning in mixed-media settings where inputs and outputs cross modalities freely.

Core claim

XModBench shows that even Gemini 2.5 Pro reaches under 60 percent accuracy on spatial and temporal reasoning, drops substantially when content moves from text to audio, and displays lower consistency when vision rather than text acts as context, indicating that present OLLMs have not reached modality-invariant reasoning.

What carries the argument

XModBench, a tri-modal benchmark of 60,828 multiple-choice questions across five task families that covers every possible question-answer modality pair to isolate consistency, disparity, and directional imbalance.
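The "every possible question-answer modality pair" coverage can be sketched in a few lines. This is only an illustration, assuming (as the Figure 1 caption suggests) that a configuration is an ordered pair of distinct context and candidate modalities; the names `MODALITIES` and `modality_configurations` are hypothetical and not taken from the paper's tooling.

```python
from itertools import permutations

# Assumed modality labels; the paper's own identifiers may differ.
MODALITIES = ("text", "vision", "audio")

def modality_configurations(modalities=MODALITIES):
    """Return every ordered (context, candidate) pair of distinct modalities."""
    return list(permutations(modalities, 2))

configs = modality_configurations()
for context, candidate in configs:
    # Each line is one configuration, e.g. "audio -> text"
    print(f"{context} -> {candidate}")
```

Three modalities give 3 × 2 = 6 ordered pairs, which matches the six configurations the benchmark reports.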

If this is right

  • Spatial and temporal reasoning must be strengthened as a distinct capability.
  • Audio-to-text performance gaps need direct attention to reduce modality disparity.
  • Consistency must be raised when vision provides context to match text-context levels.
  • Development focus should shift from general question answering toward explicit modality-invariance tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could serve as a repeated test during model training to track reductions in modality bias over time.
  • Similar consistency checks might apply to other multi-modal systems that combine vision, audio, and text outside language models.
  • Closing these gaps would support more dependable AI tools for real-world tasks that mix audio, images, and text.

Load-bearing premise

The benchmark questions carry identical meaning and difficulty across every modality so that any measured differences come from the model rather than from changes in how the questions are presented.

What would settle it

A model that scores comparably high accuracy on every spatial, temporal, and cross-modality subset of XModBench would show the reported disparities do not hold.
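A rough sketch of how such a check might be run: group per-question results by configuration, compute accuracy for each, and look at the gap between the best and worst configuration. The record layout, the function names, and the toy data below are all assumptions made for illustration; they are not the benchmark's evaluation code or its results.

```python
from collections import defaultdict

def accuracy_by_configuration(records):
    """records: iterable of (context, candidate, is_correct) tuples."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for context, candidate, is_correct in records:
        key = (context, candidate)
        totals[key] += 1
        hits[key] += int(is_correct)
    return {key: hits[key] / totals[key] for key in totals}

def accuracy_spread(acc):
    """Gap between the best and worst configuration accuracy."""
    return max(acc.values()) - min(acc.values())

# Toy records, invented purely for illustration.
records = [
    ("text", "audio", True), ("text", "audio", True),
    ("audio", "text", True), ("audio", "text", False),
    ("vision", "text", False), ("vision", "text", False),
]
acc = accuracy_by_configuration(records)
spread = accuracy_spread(acc)
```

A spread near zero, together with high accuracy in every configuration, is the pattern that would contradict the reported disparities.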

Figures

Figures reproduced from arXiv: 2510.15148 by Alan Yuille, Chao Huang, Emad Barsoum, Jialian Wu, Jiang Liu, Xiaodong Yu, Ximeng Sun, Xingrui Wang, Ze Wang, Zicheng Liu.

Figure 1. Overview of XModBench. (a) Instances are built from aligned text–image–audio triplets; (b) instantiated into six modality configurations by permuting context and candidate modalities; (c) spanning five domains with 17 subtasks and 60,828 question–answer pairs; and (d) illustrated with example multiple-choice questions under balanced modality settings.
Figure 2. Distribution of XModBench’s questions across five task families with specific subtasks. XModBench covers five task families with seventeen subtasks, spanning perception, spatial reasoning, temporal reasoning, linguistic understanding, and external knowledge.
Figure 3. XModBench task examples. We show sample questions from six subtasks in the bench…
Figure 4. Modality disparity across different configura…
Figure 5. Directional imbalance: accuracy gaps between paired inverse settings among audio, vision…
Figure 6. Failure cases. (a) Gemini 2.5 Pro correctly identifies a didgeridoo in text but fails to match…
Figure 7. Screenshot of the interface and example questions. For each subtask, we collected responses from 10 valid participants per modality configuration.
Original abstract

Omni-modal large language models (OLLMs) aim to unify audio, vision, and text understanding within a single framework. While existing benchmarks primarily evaluate general cross-modal question-answering ability, it remains unclear whether OLLMs achieve modality-invariant reasoning or exhibit modality-specific biases. We introduce XModBench, a large-scale tri-modal benchmark explicitly designed to measure cross-modal consistency. XModBench comprises 60,828 multiple-choice questions spanning five task families and systematically covers all six modality compositions in question-answer pairs, enabling fine-grained diagnosis of an OLLM's modality-invariant reasoning, modality disparity, and directional imbalance. Experiments show that even the strongest model, Gemini 2.5 Pro, (i) struggles with spatial and temporal reasoning, achieving less than 60% accuracy, (ii) reveals persistent modality disparities, with performance dropping substantially when the same semantic content is conveyed through audio rather than text, and (iii) shows systematic directional imbalance, exhibiting lower consistency when vision serves as context compared to text. These findings indicate that current OLLMs remain far from truly modality-invariant reasoning and position XModBench as a fundamental diagnostic tool for evaluating and improving cross-modal competence. All data and evaluation tools will be available at https://xingruiwang.github.io/projects/XModBench/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces XModBench, a tri-modal benchmark with 60,828 multiple-choice questions spanning five task families and all six modality compositions (text/vision/audio for questions and answers). It evaluates omni-language models on cross-modal consistency, reporting that even Gemini 2.5 Pro achieves <60% accuracy on spatial/temporal reasoning, shows substantial performance drops when content is presented via audio versus text, and exhibits lower consistency when vision is the context modality compared to text. The work positions the benchmark as a diagnostic tool revealing that current OLLMs lack modality-invariant reasoning.

Significance. If the benchmark questions are shown to maintain semantic equivalence and comparable difficulty across modalities, the results would provide a valuable, fine-grained diagnostic for identifying specific limitations in OLLM cross-modal reasoning. The systematic coverage of all modality pairs and the reported imbalances offer concrete directions for model improvement, with the public release of data and tools strengthening reproducibility.

major comments (2)
  1. [§3] §3 (Benchmark Construction): The description of question generation and modality conversion (e.g., text-to-audio via TTS, text-to-vision) provides only procedural steps without quantitative validation such as human equivalence ratings, difficulty parity scores, or information-content metrics between modalities. This directly undermines the central claims of modality disparity and directional imbalance, as performance gaps could arise from modality-specific artifacts rather than model reasoning limits.
  2. [§4] §4 (Experiments and Results): The reported accuracy drops (e.g., audio vs. text) and consistency imbalances are presented without controls or ablations that isolate question difficulty or presentation effects from model behavior; for instance, no comparison of model performance on the same questions in their original versus converted forms is shown to confirm equivalence.
minor comments (2)
  1. [Table 1] Table 1 or the dataset statistics section would benefit from an explicit breakdown of question counts per task family and per modality composition to allow readers to assess balance.
  2. [§5] The abstract and §5 could more precisely define 'directional imbalance' with a short formal statement or equation rather than relying solely on descriptive text.
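As an illustration of what such a formal statement could look like, here is one candidate form. It assumes that directional imbalance is the accuracy gap between a configuration and its inverse, which is how the Figure 5 caption describes it; the symbols are this page's notation, not the paper's.

```latex
\Delta_{A \leftrightarrow B} \;=\; \mathrm{Acc}(A \to B) \;-\; \mathrm{Acc}(B \to A),
\qquad A, B \in \{\text{text}, \text{vision}, \text{audio}\},\; A \neq B
```

Here $\mathrm{Acc}(A \to B)$ denotes accuracy when the context is given in modality $A$ and the answer candidates in modality $B$. A value of zero for every pair would mean no directional imbalance under this reading.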

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas for strengthening the validation of XModBench. We address each major comment below and will incorporate the suggested quantitative validations and controls in the revised manuscript to better support our claims on modality-invariant reasoning.

Point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): The description of question generation and modality conversion (e.g., text-to-audio via TTS, text-to-vision) provides only procedural steps without quantitative validation such as human equivalence ratings, difficulty parity scores, or information-content metrics between modalities. This directly undermines the central claims of modality disparity and directional imbalance, as performance gaps could arise from modality-specific artifacts rather than model reasoning limits.

    Authors: We agree that quantitative validation is essential to rule out conversion artifacts. In the revised §3, we will add results from a human study with 50 annotators providing equivalence and difficulty ratings (5-point scale) on 500 sampled questions across all modality pairs, along with inter-annotator agreement. We will also report information-content metrics using cross-modal embedding similarities (e.g., via CLIP for vision-text and audio-text models) and difficulty parity scores. These additions will confirm high semantic equivalence and support that observed disparities reflect model limitations. revision: yes

  2. Referee: [§4] §4 (Experiments and Results): The reported accuracy drops (e.g., audio vs. text) and consistency imbalances are presented without controls or ablations that isolate question difficulty or presentation effects from model behavior; for instance, no comparison of model performance on the same questions in their original versus converted forms is shown to confirm equivalence.

    Authors: We concur that explicit controls strengthen the results. The revised §4 will include ablations evaluating the same questions in original text form versus converted audio and vision forms for multiple models. This directly compares performance to isolate modality effects from content difficulty. We will also add controls for presentation effects through standardized input formatting and prompting. These will provide clearer evidence that the reported drops and imbalances arise from model behavior rather than artifacts. revision: yes
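The two promised checks can be sketched as follows. The rebuttal does not say which agreement statistic or significance test would be used, so Fleiss' kappa (for inter-annotator agreement on the rating scale) and an exact McNemar test (for paired original-versus-converted correctness) are stand-ins chosen here; the function names and input layouts are hypothetical.

```python
from math import comb

def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who gave item i the rating j (same annotator count per item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    # Observed agreement, averaged over items
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the marginal rating frequencies
    n_categories = len(counts[0])
    p_e = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_categories)
    )
    if p_e == 1.0:
        return float("nan")  # all ratings in one category: kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)

def mcnemar_exact(text_correct, audio_correct):
    """Two-sided exact McNemar p-value on paired per-question correctness."""
    # Discordant pairs: right in one form, wrong in the other
    b = sum(t and not a for t, a in zip(text_correct, audio_correct))
    c = sum(a and not t for t, a in zip(text_correct, audio_correct))
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

A kappa close to 1 on the equivalence ratings, combined with a small McNemar p-value on the paired ablation, would support the claim that the gaps come from the model rather than from the conversion. Only the discordant pairs enter the McNemar test, so questions the model gets right (or wrong) in both forms do not affect it.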

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct model evaluations

full rationale

The paper introduces XModBench, a new tri-modal dataset with 60,828 questions, and reports empirical accuracy results on existing models such as Gemini 2.5 Pro. No derivations, equations, fitted parameters presented as predictions, or self-citation chains appear in the provided abstract or described structure. Central claims rest on direct measurements of model performance across modality compositions rather than any reduction to inputs by construction. The analysis is self-contained as standard benchmark construction and evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark paper. The central claims rest on the design and validation of the question set rather than on mathematical axioms, free parameters, or new postulated entities.

pith-pipeline@v0.9.0 · 5794 in / 1139 out tokens · 41411 ms · 2026-05-18T05:41:53.488739+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 7.0

    XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning...

  2. Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
