pith. machine review for the scientific record.

arxiv: 2604.16902 · v3 · submitted 2026-04-18 · 💻 cs.AI

Recognition: unknown

Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:58 UTC · model grok-4.3

classification 💻 cs.AI
keywords omni-modal LLMs · modality preference · visual preference · conflict-based benchmark · layer-wise probing · cross-modal hallucinations · modality selection rate

The pith

Omni-modal LLMs prefer visual input over text; the bias emerges in mid-to-late layers, and the same internal signals can be used to spot hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a conflict-based benchmark to measure how omni-modal large language models choose between modalities when inputs disagree. Unlike older vision-language models that default to text, the tested OLLMs favor visual information. Layer-by-layer analysis shows this preference is not present from the start but builds in the middle and later layers. The same internal signals then serve as a detector for cross-modal hallucinations on standard benchmarks, performing competitively with specialized methods without any task-specific fine-tuning.

Core claim

Native omni-modal LLMs exhibit a pronounced visual preference measured by modality selection rate on a new conflict benchmark, in contrast to the text-dominance of traditional VLMs. Layer-wise probing reveals that this preference emerges progressively through the mid-to-late layers rather than being fixed at the input stage. Internal activation patterns from these layers can be used directly to diagnose cross-modal hallucinations, yielding competitive results on three downstream multi-modal benchmarks without requiring task-specific training data.
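
To make the headline metric concrete, here is a minimal sketch of how a modality selection rate could be computed over conflict trials. The trial layout and field names ("answers", "model_answer") are illustrative assumptions, not the authors' released format.

```python
from collections import Counter

def modality_selection_rate(trials):
    """Per-modality selection rate over conflict trials.

    Each trial is assumed (illustratively) to carry one candidate answer per
    modality plus the model's final answer, e.g.
        {"answers": {"text": "red", "image": "blue", "audio": "green"},
         "model_answer": "blue"}
    The rate for a modality is the fraction of trials in which the model's
    answer matches the answer supported by that modality.
    """
    counts, n = Counter(), 0
    for trial in trials:
        n += 1
        model_answer = trial["model_answer"].strip().lower()
        for modality, answer in trial["answers"].items():
            if model_answer == answer.strip().lower():
                counts[modality] += 1
    return {m: c / n for m, c in counts.items()} if n else {}

# Toy example: two tri-modal conflicts, both resolved in favour of the image.
trials = [
    {"answers": {"text": "red", "image": "blue", "audio": "green"},
     "model_answer": "blue"},
    {"answers": {"text": "cat", "image": "dog", "audio": "bird"},
     "model_answer": "dog"},
]
print(modality_selection_rate(trials))  # {'image': 1.0}
```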

What carries the argument

A conflict-based benchmark paired with the modality selection rate metric, plus layer-wise probing of internal signals to track preference emergence and hallucination indicators.
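
The probing half can be pictured with an equally small sketch: fit one linear classifier per layer on pooled hidden states and check at which depth "which modality did the model follow" becomes decodable. The array shapes, the pooling choice, and the use of scikit-learn are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_per_layer(hidden_states, labels, seed=0):
    """Held-out probe accuracy at every layer.

    hidden_states: (n_examples, n_layers, d_model) activations, e.g.
    mean-pooled over answer tokens (an assumed pooling choice).
    labels: which modality the model followed on each conflict example.
    """
    _, n_layers, _ = hidden_states.shape
    accuracies = []
    for layer in range(n_layers):
        X = hidden_states[:, layer, :]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.3, random_state=seed, stratify=labels)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        accuracies.append(probe.score(X_te, y_te))
    return accuracies

# Synthetic demo: a preference signal injected only from layer 20 onwards.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)           # 0 = followed text, 1 = image
H = rng.normal(size=(400, 32, 64))
H[:, 20:, 0] += labels[:, None] * 2.0           # signal appears mid-to-late
print(np.round(probe_per_layer(H, labels), 2))  # accuracy jumps after layer 20
```

In a synthetic demo like this, probe accuracy sits near chance until the layer where the signal is injected, which is the shape of evidence the progressive-emergence claim rests on.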

If this is right

  • Modality preference can be monitored during inference by inspecting mid-to-late layer activations.
  • Hallucination detection becomes possible without collecting task-specific labeled data.
  • Unified omni-modal training produces different default behaviors than pipeline vision-language models.
  • Preference is dynamic across network depth rather than a fixed property of the input embedding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Training objectives could be adjusted to counteract the visual bias if balanced modality use is desired.
  • The same probing approach might reveal similar layer-wise shifts in other multi-modal architectures.
  • Benchmark conflicts might be extended to audio-text or other modality pairs to test generality.

Load-bearing premise

The conflict-based benchmark accurately captures genuine modality preference without selection bias, and the layer-wise signals directly reflect hallucinations without overfitting to the specific models examined.

What would settle it

A test on new OLLMs or on balanced non-conflict inputs in which the visual preference disappears, or held-out data on which the same layer-wise signals show no correlation with hallucination labels.
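
One way that falsification test could be operationalised: score held-out examples with a frozen probe's interfering-modality probability and check whether it separates hallucination labels at all, e.g. by AUROC. The probe interface and the synthetic scores below are assumptions, not the paper's pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def hallucination_signal_check(probe_probs, hallucination_labels):
    """AUROC of a layer-wise probe signal against hallucination labels.

    probe_probs: per-example probability, from a frozen probe, that the
    answer follows the interfering modality (assumed interface).
    hallucination_labels: 1 if the held-out example is judged a cross-modal
    hallucination, else 0. An AUROC near 0.5 is the negative result that
    would settle the question against the paper.
    """
    return roc_auc_score(hallucination_labels, probe_probs)

# Toy illustration with synthetic, weakly informative scores.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=200)
scores = 0.4 * labels + rng.normal(0.3, 0.15, size=200)
print(round(hallucination_signal_check(scores, labels), 3))
```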

Figures

Figures reproduced from arXiv: 2604.16902 by Boxi Cao, Hongyu Lin, Le Sun, Weixiang Zhou, Xianpei Han, Xinru Yan, Yaojie Lu.

Figure 1. Illustration of a tri-modal conflict input sample …
Figure 2. MSR (%) results of all evaluated OLLMs on …
Figure 3. Pairwise MSR (%) of all evaluated OLLMs under three bi-modal conflict settings. From top to bottom: text+image, image+audio, and text+audio. Qwen3-Omni refers to Qwen3-Omni-30B-A3B-Instruct.
Figure 4. Illustration of the layer-wise linear probe …
Figure 6. Illustration of the four-phase decomposition …
Figure 7. SVD projections of hidden states onto the top …
Figure 8. Density distributions of interfering modality prediction probabilities from layer-wise linear probes on …
Figure 9. Representative cases of the linear probe detecting hallucinations by predicting the interfering modality …
Original abstract

Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify modality preference of OLLMs using a newly-curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the ``text-dominance'' of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid-to-late layers. Building upon these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that native Omni-modal LLMs (OLLMs) exhibit a pronounced visual preference (unlike text-dominance in traditional VLMs), quantified via a new conflict-based benchmark and modality selection rate metric across ten models. Layer-wise probing shows this preference emerges progressively in mid-to-late layers rather than being static. The authors then use these internal signals to diagnose cross-modal hallucinations, reporting competitive performance on three downstream multimodal benchmarks without task-specific training data. Code is released publicly.

Significance. If the benchmark isolates genuine modality preference without curation artifacts, the work provides mechanistic insight into unified multimodal representations and a practical, data-efficient tool for hallucination detection. Public code release is a clear strength for reproducibility.

major comments (2)
  1. [§4] §4 (Benchmark Construction and Modality Selection Rate): The central claim of visual preference in most OLLMs rests on modality selection rates computed from the newly-curated conflict-based benchmark. However, the manuscript provides insufficient detail on conflict generation, balancing of difficulty/salience/contradiction strength across modalities, and controls for encoder or prompt biases. Without these, selection rates may reflect benchmark artifacts rather than internal model properties, undermining the paradigm-shift conclusion relative to VLMs.
  2. [§5.2] §5.2 (Layer-wise Probing and Hallucination Diagnosis): The claim that modality preference emerges in mid-to-late layers and can be leveraged for hallucination diagnosis inherits the same benchmark dependency. The probing methodology and transfer to three downstream benchmarks lack explicit ablations (e.g., random layer signals, alternative probes) or cross-model validation to rule out overfitting to the ten evaluated OLLMs.
minor comments (2)
  1. [Abstract] Abstract: The three downstream benchmarks are not named, which reduces immediate clarity for readers.
  2. [Figures/Tables] Figure captions and tables: Some results (e.g., selection rates per model) would benefit from explicit error bars or statistical significance tests to support the 'most OLLMs' generalization.
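
As one concrete form of the statistics the second minor comment asks for, a per-item bootstrap over selection outcomes would put an interval around each model's rate; the resampling unit, confidence level, and item counts in this sketch are editorial assumptions.

```python
import numpy as np

def bootstrap_msr_ci(selected_flags, n_boot=10_000, alpha=0.05, seed=0):
    """Point estimate and percentile CI for a modality selection rate.

    selected_flags: 1 if the model followed the visual modality on an item,
    else 0 (per-item resampling is an assumed choice).
    """
    flags = np.asarray(selected_flags, dtype=float)
    rng = np.random.default_rng(seed)
    boots = rng.choice(flags, size=(n_boot, flags.size), replace=True).mean(axis=1)
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return flags.mean(), (lo, hi)

# Hypothetical example: 82% visual selection over 500 conflict items.
flags = np.r_[np.ones(410), np.zeros(90)]
print(bootstrap_msr_ci(flags))  # ~0.82 with a 95% percentile interval
```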

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback on our work. The comments highlight important areas for clarification and strengthening, particularly around benchmark details and experimental validation. We address each major comment point-by-point below, outlining specific revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Benchmark Construction and Modality Selection Rate): The central claim of visual preference in most OLLMs rests on modality selection rates computed from the newly-curated conflict-based benchmark. However, the manuscript provides insufficient detail on conflict generation, balancing of difficulty/salience/contradiction strength across modalities, and controls for encoder or prompt biases. Without these, selection rates may reflect benchmark artifacts rather than internal model properties, undermining the paradigm-shift conclusion relative to VLMs.

    Authors: We agree that the original manuscript provided insufficient detail on the benchmark construction process, which is a fair critique that could lead to concerns about artifacts. In the revised version, we will substantially expand §4 to include a step-by-step description of conflict generation (including the use of paired contradictory statements across modalities), explicit balancing criteria for difficulty, salience, and contradiction strength (via both automated metrics and human verification on a subset), and additional controls such as neutral prompt variants, encoder-only baselines, and single-modality ablation tests. These new controls demonstrate that the visual preference persists consistently across models, supporting that the selection rates reflect internal model properties rather than curation artifacts. We believe this strengthens rather than undermines the paradigm-shift claim relative to traditional VLMs. revision: yes

  2. Referee: [§5.2] §5.2 (Layer-wise Probing and Hallucination Diagnosis): The claim that modality preference emerges in mid-to-late layers and can be leveraged for hallucination diagnosis inherits the same benchmark dependency. The probing methodology and transfer to three downstream benchmarks lack explicit ablations (e.g., random layer signals, alternative probes) or cross-model validation to rule out overfitting to the ten evaluated OLLMs.

    Authors: We acknowledge that additional ablations and validation would improve rigor, particularly to address potential overfitting concerns. In the revision, we will add explicit ablations in §5.2, including random layer signal controls and comparisons with alternative probes (e.g., linear classifiers versus MLP heads), confirming that the progressive emergence in mid-to-late layers is not an artifact of the probe design. We also performed and will report leave-one-model-out cross-validation across the ten OLLMs, where probes trained on nine models generalize to the held-out model with consistent layer-wise patterns. While the initial preference quantification relies on the conflict benchmark, the hallucination diagnosis applies the resulting internal signals to independent downstream benchmarks without task-specific training, and we will clarify this separation in the text to avoid any implication of direct dependency. revision: yes
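
The leave-one-model-out protocol described in the response above can be sketched in a few lines: for each held-out OLLM, a probe is fit on the remaining models' activations and scored on the held-out one. The data layout (a fixed layer's pooled activations per model) and the linear probe are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def leave_one_model_out(features_by_model, labels_by_model):
    """Held-out probe accuracy for each model.

    features_by_model / labels_by_model: dicts mapping a model name to
    (n_examples, d) activations at a fixed layer and to modality labels
    (an assumed layout).
    """
    results = {}
    for held_out in features_by_model:
        X_train = np.vstack([v for k, v in features_by_model.items() if k != held_out])
        y_train = np.concatenate([v for k, v in labels_by_model.items() if k != held_out])
        probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        results[held_out] = probe.score(features_by_model[held_out],
                                        labels_by_model[held_out])
    return results

# Toy demo: three hypothetical models sharing a weak common signal.
rng = np.random.default_rng(2)
features, labels = {}, {}
for name in ["model_a", "model_b", "model_c"]:
    y = rng.integers(0, 2, size=120)
    X = rng.normal(size=(120, 16))
    X[:, 0] += y * 1.5                      # shared, transferable direction
    features[name], labels[name] = X, y
print(leave_one_model_out(features, labels))
```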

Circularity Check

0 steps flagged

No circularity: empirical benchmark and probing results are self-contained

full rationale

The paper quantifies modality preference via a new conflict-based benchmark and modality selection rate, then uses layer-wise probing to show progressive emergence in mid-to-late layers, and applies the resulting internal signals to hallucination diagnosis on downstream benchmarks. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; all claims rest on direct measurement against external data rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the work is empirical.

pith-pipeline@v0.9.0 · 5500 in / 904 out tokens · 40529 ms · 2026-05-10T06:58:05.389852+00:00 · methodology


Reference graph

Works this paper leans on

49 extracted references · 28 canonical work pages · 12 internal anchors


  3. [3]

    Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, and 1 others. 2025. Ming-omni: A unified multimodal model for perception and generation. arXiv preprint arXiv:2506.09344

  4. [4]

    Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644

  5. [5]

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966

  6. [6]

    Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. 2024. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930

  7. [7]

    Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207--219

  8. [8]

    Meiqi Chen, Yixin Cao, Yan Zhang, and Chaochao Lu. 2024a. Quantifying and mitigating unimodal biases in multimodal large language models: A causal perspective. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16449--16469

  9. [9]

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, and 1 others. 2024b. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185--24198

  10. [10]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261

  11. [11]

    Ailin Deng, Tri Cao, Zhirui Chen, and Bryan Hooi. 2025. Words or vision: Do vision-language models have blind faith in text? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3867--3876

  12. [12]

    David Dukić and Jan Šnajder. 2024. Looking right is sometimes right: Investigating the capabilities of decoder-only llms for sequence labeling. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14168--14181

  13. [13]

    Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Yuhang Dai, Meng Zhao, Yi-Fan Zhang, Shaoqi Dong, Yangze Li, Xiong Wang, and 1 others. 2024. Vita: Towards open-source interactive omni multimodal llm. arXiv preprint arXiv:2408.05211

  14. [14]

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531

  15. [15]

    Tianze Hua, Tian Yun, and Ellie Pavlick. 2025. How do vision-language models process conflicting information across modalities? arXiv preprint arXiv:2507.01790

  16. [16]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

  17. [17]

    Shixin Jiang, Jiafeng Liang, Jiyuan Wang, Xuan Dong, Heng Chang, Weijiang Yu, Jinhua Du, Ming Liu, and Bing Qin. 2025. From specific-mllms to omni-mllms: a survey on mllms aligned with multi-modalities. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8617--8652

  18. [18]

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of neural network representations revisited. In International conference on machine learning, pages 3519--3529. PMLR

  19. [19]

    Chun-Yi Kuan and Hung-yi Lee. 2025. Can large audio-language models truly hear? tackling hallucinations with multi-task assessment and stepwise audio reasoning. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE

  20. [20]

    JuneHyoung Kwon, MiHyeon Kim, Eunju Lee, Juhwan Choi, and YoungBin Kim. 2025. See-saw modality balance: See gradient, and sew impaired vision-language balance to mitigate dominant modality bias. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume...

  21. [21]

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. 2024. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13872--13882

  22. [22]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023a. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730--19742. PMLR

  23. [23]

    Yadong Li, Jun Liu, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, and 1 others. 2025. Baichuan-omni-1.5 technical report. arXiv preprint arXiv:2501.15368

  24. [24]

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023b. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 292--305

  25. [25]

    Yizhi Li, Yinghao Ma, Ge Zhang, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Zekun Wang, Jian Yang, and 1 others. 2024. Omnibench: Towards the future of universal omni-language models. arXiv preprint arXiv:2409.15272

  26. [26]

    Zihao Lin, Samyadeep Basu, Mohammad Beigi, Varun Manjunatha, Ryan A Rossi, Zichao Wang, Yufan Zhou, Sriram Balasubramanian, Arman Zarei, Keivan Rezaei, and 1 others. 2025. A survey on mechanistic interpretability for multi-modal foundation models. arXiv preprint arXiv:2502.17516

  27. [27]

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296--26306

  28. [28]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in neural information processing systems, 36:34892--34916

  29. [29]

    Hantao Lou, Changye Li, Jiaming Ji, and Yaodong Yang. 2025. Sae-v: Interpreting multimodal models for enhanced alignment. arXiv preprint arXiv:2502.17514

  30. [30]

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359--17372

  31. [31]

    Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, and 1 others. 2022. Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005

  32. [32]

    Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, and Albert Gatt. 2022. Valse: A task-independent benchmark for vision and language models centered on linguistic phenomena. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8253--8280

  33. [33]

    Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2463--2473

  34. [34]

    Pouya Pezeshkpour, Moin Aminnaseri, and Estevam Hruschka. 2025. Mixed signals: Decoding vlms’ reasoning and underlying bias in vision-language conflict. arXiv preprint arXiv:2504.08974

  35. [35]

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, and 1 others. 2025. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267

  36. [36]

    Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. 2025. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013

  37. [37]

    Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, and Tae-Hyun Oh. 2024. Avhbench: A cross-modal hallucination benchmark for audio-visual large language models. arXiv preprint arXiv:2410.18325

  38. [38]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, and 1 others. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805

  39. [39]

    Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. Bert rediscovers the classical nlp pipeline. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 4593--4601

  40. [40]

    Lindia Tjuatja, Valerie Chen, Tongshuang Wu, Ameet Talwalkar, and Graham Neubig. 2024. Do llms exhibit human-like response biases? A case study in survey design. Transactions of the Association for Computational Linguistics, 12:1011--1026

  41. [41]

    Xingrui Wang, Jiang Liu, Chao Huang, Xiaodong Yu, Ze Wang, Ximeng Sun, Jialian Wu, Alan Yuille, Emad Barsoum, and Zicheng Liu. 2025. Xmodbench: Benchmarking cross-modal capabilities and consistency in omni-language models. arXiv preprint arXiv:2510.15148

  42. [42]

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. 2025a. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215

  43. [43]

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, and 1 others. 2025b. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765

  44. [44]

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, and 1 others. 2024. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800

  45. [45]

    Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, and 1 others. 2025. Omnivinci: Enhancing architecture and data for omni-modal understanding llm. arXiv preprint arXiv:2510.15870

  46. [46]

    Zhuoran Zhang, Tengyue Wang, Xilin Gong, Yang Shi, Haotian Wang, Di Wang, and Lijie Hu. 2025. When modalities conflict: How unimodal reasoning uncertainty governs preference dynamics in mllms. arXiv preprint arXiv:2511.02243

  47. [47]

    Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. 2023. Large language models are not robust multiple choice selectors. arXiv preprint arXiv:2309.03882

  48. [48]

    Xinhan Zheng, Huyu Wu, Xueting Wang, and Haiyun Jiang. 2025a. Unveiling intrinsic text bias in multimodal large language models through attention key-space analysis. arXiv preprint arXiv:2510.26721

  49. [49]

    Xu Zheng, Chenfei Liao, Yuqian Fu, Kaiyu Lei, Yuanhuiyi Lyu, Lutao Jiang, Bin Ren, Jialei Chen, Jiawen Wang, Chengxin Li, and 1 others. 2025b. Mllms are deeply affected by modality bias. arXiv preprint arXiv:2505.18657