pith. machine review for the scientific record.

arxiv: 2604.16902 · v3 · submitted 2026-04-18 · 💻 cs.AI

Recognition: unknown

Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:58 UTC · model grok-4.3

classification 💻 cs.AI
keywords omni-modal LLMs · modality preference · visual preference · conflict-based benchmark · layer-wise probing · cross-modal hallucinations · modality selection rate

The pith

Omni-modal LLMs prefer visual input over text; the bias emerges in mid-to-late layers, and the same internal signals can be used to spot hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a conflict-based benchmark to measure how omni-modal large language models choose between modalities when inputs disagree. Unlike older vision-language models that default to text, the tested OLLMs favor visual information. Layer-by-layer analysis shows this preference is not present from the start but builds in the middle and later layers. The same internal signals then serve as a detector for cross-modal hallucinations on standard benchmarks, performing competitively with specialized methods without any task-specific fine-tuning.

Core claim

Native omni-modal LLMs exhibit a pronounced visual preference measured by modality selection rate on a new conflict benchmark, in contrast to the text-dominance of traditional VLMs. Layer-wise probing reveals that this preference emerges progressively through the mid-to-late layers rather than being fixed at the input stage. Internal activation patterns from these layers can be used directly to diagnose cross-modal hallucinations, yielding competitive results on three downstream multi-modal benchmarks without requiring task-specific training data.
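
To make the headline metric concrete, here is a minimal sketch of how a modality selection rate could be computed over conflict trials. The trial layout and field names ("answers", "model_answer") are illustrative assumptions, not the authors' released format.

```python
from collections import Counter

def modality_selection_rate(trials):
    """Per-modality selection rate over conflict trials.

    Each trial is assumed (illustratively) to carry one candidate answer per
    modality plus the model's final answer, e.g.
        {"answers": {"text": "red", "image": "blue", "audio": "green"},
         "model_answer": "blue"}
    The rate for a modality is the fraction of trials in which the model's
    answer matches the answer supported by that modality.
    """
    counts, n = Counter(), 0
    for trial in trials:
        n += 1
        model_answer = trial["model_answer"].strip().lower()
        for modality, answer in trial["answers"].items():
            if model_answer == answer.strip().lower():
                counts[modality] += 1
    return {m: c / n for m, c in counts.items()} if n else {}

# Toy example: two tri-modal conflicts, both resolved in favour of the image.
trials = [
    {"answers": {"text": "red", "image": "blue", "audio": "green"},
     "model_answer": "blue"},
    {"answers": {"text": "cat", "image": "dog", "audio": "bird"},
     "model_answer": "dog"},
]
print(modality_selection_rate(trials))  # {'image': 1.0}
```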

What carries the argument

A conflict-based benchmark paired with the modality selection rate metric, plus layer-wise probing of internal signals to track preference emergence and hallucination indicators.
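
The probing half can be pictured with an equally small sketch: fit one linear classifier per layer on pooled hidden states and check at which depth "which modality did the model follow" becomes decodable. The array shapes, the pooling choice, and the use of scikit-learn are assumptions for illustration, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_per_layer(hidden_states, labels, seed=0):
    """Held-out probe accuracy at every layer.

    hidden_states: (n_examples, n_layers, d_model) activations, e.g.
    mean-pooled over answer tokens (an assumed pooling choice).
    labels: which modality the model followed on each conflict example.
    """
    _, n_layers, _ = hidden_states.shape
    accuracies = []
    for layer in range(n_layers):
        X = hidden_states[:, layer, :]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.3, random_state=seed, stratify=labels)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        accuracies.append(probe.score(X_te, y_te))
    return accuracies

# Synthetic demo: a preference signal injected only from layer 20 onwards.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)           # 0 = followed text, 1 = image
H = rng.normal(size=(400, 32, 64))
H[:, 20:, 0] += labels[:, None] * 2.0           # signal appears mid-to-late
print(np.round(probe_per_layer(H, labels), 2))  # accuracy jumps after layer 20
```

In a synthetic demo like this, probe accuracy sits near chance until the layer where the signal is injected, which is the shape of evidence the progressive-emergence claim rests on.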

If this is right

  • Modality preference can be monitored during inference by inspecting mid-to-late layer activations.
  • Hallucination detection becomes possible without collecting task-specific labeled data.
  • Unified omni-modal training produces different default behaviors than pipeline vision-language models.
  • Preference is dynamic across network depth rather than a fixed property of the input embedding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Training objectives could be adjusted to counteract the visual bias if balanced modality use is desired.
  • The same probing approach might reveal similar layer-wise shifts in other multi-modal architectures.
  • Benchmark conflicts might be extended to audio-text or other modality pairs to test generality.

Load-bearing premise

The conflict-based benchmark accurately captures genuine modality preference without selection bias, and the layer-wise signals directly reflect hallucinations without overfitting to the specific models examined.

What would settle it

A test on new OLLMs or on balanced non-conflict inputs in which the visual preference disappears, or held-out data on which the same layer-wise signals show no correlation with hallucination labels.
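
One way that falsification test could be operationalised: score held-out examples with a frozen probe's interfering-modality probability and check whether it separates hallucination labels at all, e.g. by AUROC. The probe interface and the synthetic scores below are assumptions, not the paper's pipeline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def hallucination_signal_check(probe_probs, hallucination_labels):
    """AUROC of a layer-wise probe signal against hallucination labels.

    probe_probs: per-example probability, from a frozen probe, that the
    answer follows the interfering modality (assumed interface).
    hallucination_labels: 1 if the held-out example is judged a cross-modal
    hallucination, else 0. An AUROC near 0.5 is the negative result that
    would settle the question against the paper.
    """
    return roc_auc_score(hallucination_labels, probe_probs)

# Toy illustration with synthetic, weakly informative scores.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=200)
scores = 0.4 * labels + rng.normal(0.3, 0.15, size=200)
print(round(hallucination_signal_check(scores, labels), 3))
```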

Figures

Figures reproduced from arXiv: 2604.16902 by Boxi Cao, Hongyu Lin, Le Sun, Weixiang Zhou, Xianpei Han, Xinru Yan, Yaojie Lu.

Figure 1. Illustration of a tri-modal conflict input sample …
Figure 2. MSR (%) results of all evaluated OLLMs on …
Figure 3. Pairwise MSR (%) of all evaluated OLLMs under three bi-modal conflict settings. From top to bottom: text+image, image+audio, and text+audio. Qwen3-Omni refers to Qwen3-Omni-30B-A3B-Instruct.
Figure 4. Illustration of the layer-wise linear probe …
Figure 6. Illustration of the four-phase decomposition …
Figure 7. SVD projections of hidden states onto the top …
Figure 8. Density distributions of interfering modality prediction probabilities from layer-wise linear probes on …
Figure 9. Representative cases of the linear probe detecting hallucinations by predicting the interfering modality …
Original abstract

Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify modality preference of OLLMs using a newly-curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the ``text-dominance'' of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid-to-late layers. Building upon these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive performance across three downstream multi-modal benchmarks without task-specific data. Our work provides both a mechanistic understanding and a practical tool for building more trustworthy OLLMs. Our code and related resources are publicly available at: https://github.com/icip-cas/OmniPreference

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that native Omni-modal LLMs (OLLMs) exhibit a pronounced visual preference (unlike text-dominance in traditional VLMs), quantified via a new conflict-based benchmark and modality selection rate metric across ten models. Layer-wise probing shows this preference emerges progressively in mid-to-late layers rather than being static. The authors then use these internal signals to diagnose cross-modal hallucinations, reporting competitive performance on three downstream multimodal benchmarks without task-specific training data. Code is released publicly.

Significance. If the benchmark isolates genuine modality preference without curation artifacts, the work provides mechanistic insight into unified multimodal representations and a practical, data-efficient tool for hallucination detection. Public code release is a clear strength for reproducibility.

major comments (2)
  1. [§4] §4 (Benchmark Construction and Modality Selection Rate): The central claim of visual preference in most OLLMs rests on modality selection rates computed from the newly-curated conflict-based benchmark. However, the manuscript provides insufficient detail on conflict generation, balancing of difficulty/salience/contradiction strength across modalities, and controls for encoder or prompt biases. Without these, selection rates may reflect benchmark artifacts rather than internal model properties, undermining the paradigm-shift conclusion relative to VLMs.
  2. [§5.2] §5.2 (Layer-wise Probing and Hallucination Diagnosis): The claim that modality preference emerges in mid-to-late layers and can be leveraged for hallucination diagnosis inherits the same benchmark dependency. The probing methodology and transfer to three downstream benchmarks lack explicit ablations (e.g., random layer signals, alternative probes) or cross-model validation to rule out overfitting to the ten evaluated OLLMs.
minor comments (2)
  1. [Abstract] Abstract: The three downstream benchmarks are not named, which reduces immediate clarity for readers.
  2. [Figures/Tables] Figure captions and tables: Some results (e.g., selection rates per model) would benefit from explicit error bars or statistical significance tests to support the 'most OLLMs' generalization.
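
As one concrete form of the statistics the second minor comment asks for, a per-item bootstrap over selection outcomes would put an interval around each model's rate; the resampling unit, confidence level, and item counts in this sketch are editorial assumptions.

```python
import numpy as np

def bootstrap_msr_ci(selected_flags, n_boot=10_000, alpha=0.05, seed=0):
    """Point estimate and percentile CI for a modality selection rate.

    selected_flags: 1 if the model followed the visual modality on an item,
    else 0 (per-item resampling is an assumed choice).
    """
    flags = np.asarray(selected_flags, dtype=float)
    rng = np.random.default_rng(seed)
    boots = rng.choice(flags, size=(n_boot, flags.size), replace=True).mean(axis=1)
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return flags.mean(), (lo, hi)

# Hypothetical example: 82% visual selection over 500 conflict items.
flags = np.r_[np.ones(410), np.zeros(90)]
print(bootstrap_msr_ci(flags))  # ~0.82 with a 95% percentile interval
```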

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the constructive and detailed feedback on our work. The comments highlight important areas for clarification and strengthening, particularly around benchmark details and experimental validation. We address each major comment point-by-point below, outlining specific revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Benchmark Construction and Modality Selection Rate): The central claim of visual preference in most OLLMs rests on modality selection rates computed from the newly-curated conflict-based benchmark. However, the manuscript provides insufficient detail on conflict generation, balancing of difficulty/salience/contradiction strength across modalities, and controls for encoder or prompt biases. Without these, selection rates may reflect benchmark artifacts rather than internal model properties, undermining the paradigm-shift conclusion relative to VLMs.

    Authors: We agree that the original manuscript provided insufficient detail on the benchmark construction process, which is a fair critique that could lead to concerns about artifacts. In the revised version, we will substantially expand §4 to include a step-by-step description of conflict generation (including the use of paired contradictory statements across modalities), explicit balancing criteria for difficulty, salience, and contradiction strength (via both automated metrics and human verification on a subset), and additional controls such as neutral prompt variants, encoder-only baselines, and single-modality ablation tests. These new controls demonstrate that the visual preference persists consistently across models, supporting that the selection rates reflect internal model properties rather than curation artifacts. We believe this strengthens rather than undermines the paradigm-shift claim relative to traditional VLMs. revision: yes

  2. Referee: [§5.2] §5.2 (Layer-wise Probing and Hallucination Diagnosis): The claim that modality preference emerges in mid-to-late layers and can be leveraged for hallucination diagnosis inherits the same benchmark dependency. The probing methodology and transfer to three downstream benchmarks lack explicit ablations (e.g., random layer signals, alternative probes) or cross-model validation to rule out overfitting to the ten evaluated OLLMs.

    Authors: We acknowledge that additional ablations and validation would improve rigor, particularly to address potential overfitting concerns. In the revision, we will add explicit ablations in §5.2, including random layer signal controls and comparisons with alternative probes (e.g., linear classifiers versus MLP heads), confirming that the progressive emergence in mid-to-late layers is not an artifact of the probe design. We also performed and will report leave-one-model-out cross-validation across the ten OLLMs, where probes trained on nine models generalize to the held-out model with consistent layer-wise patterns. While the initial preference quantification relies on the conflict benchmark, the hallucination diagnosis applies the resulting internal signals to independent downstream benchmarks without task-specific training, and we will clarify this separation in the text to avoid any implication of direct dependency. revision: yes
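
The leave-one-model-out protocol described in the response above can be sketched in a few lines: for each held-out OLLM, a probe is fit on the remaining models' activations and scored on the held-out one. The data layout (a fixed layer's pooled activations per model) and the linear probe are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def leave_one_model_out(features_by_model, labels_by_model):
    """Held-out probe accuracy for each model.

    features_by_model / labels_by_model: dicts mapping a model name to
    (n_examples, d) activations at a fixed layer and to modality labels
    (an assumed layout).
    """
    results = {}
    for held_out in features_by_model:
        X_train = np.vstack([v for k, v in features_by_model.items() if k != held_out])
        y_train = np.concatenate([v for k, v in labels_by_model.items() if k != held_out])
        probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        results[held_out] = probe.score(features_by_model[held_out],
                                        labels_by_model[held_out])
    return results

# Toy demo: three hypothetical models sharing a weak common signal.
rng = np.random.default_rng(2)
features, labels = {}, {}
for name in ["model_a", "model_b", "model_c"]:
    y = rng.integers(0, 2, size=120)
    X = rng.normal(size=(120, 16))
    X[:, 0] += y * 1.5                      # shared, transferable direction
    features[name], labels[name] = X, y
print(leave_one_model_out(features, labels))
```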

Circularity Check

0 steps flagged

No circularity: empirical benchmark and probing results are self-contained

full rationale

The paper quantifies modality preference via a new conflict-based benchmark and modality selection rate, then uses layer-wise probing to show progressive emergence in mid-to-late layers, and applies the resulting internal signals to hallucination diagnosis on downstream benchmarks. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; all claims rest on direct measurement against external data rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the work is empirical.

pith-pipeline@v0.9.0 · 5500 in / 904 out tokens · 40529 ms · 2026-05-10T06:58:05.389852+00:00 · methodology


Reference graph

Works this paper leans on

49 extracted references · 28 canonical work pages · 12 internal anchors


  3. [3]

    Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, and 1 others. 2025. Ming-omni: A unified multimodal model for perception and generation. arXiv preprint arXiv:2506.09344

  4. [4]

    Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644

  5. [5]

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966

  6. [6]

    Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. 2024. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930

  7. [7]

    Yonatan Belinkov. 2022. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207--219

  8. [8]

    Meiqi Chen, Yixin Cao, Yan Zhang, and Chaochao Lu. 2024a. Quantifying and mitigating unimodal biases in multimodal large language models: A causal perspective. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 16449--16469

  9. [9]

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, and 1 others. 2024b. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185--24198

  10. [10]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, and 1 others. 2025. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261

  11. [11]

    Ailin Deng, Tri Cao, Zhirui Chen, and Bryan Hooi. 2025. Words or vision: Do vision-language models have blind faith in text? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3867--3876

  12. [12]

    David Dukić and Jan Šnajder. 2024. Looking right is sometimes right: Investigating the capabilities of decoder-only llms for sequence labeling. In Findings of the Association for Computational Linguistics: ACL 2024, pages 14168--14181

  13. [13]

    Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Yuhang Dai, Meng Zhao, Yi-Fan Zhang, Shaoqi Dong, Yangze Li, Xiong Wang, and 1 others. 2024. Vita: Towards open-source interactive omni multimodal llm. arXiv preprint arXiv:2408.05211

  14. [14]

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531

  15. [15]

    Tianze Hua, Tian Yun, and Ellie Pavlick. 2025. How do vision-language models process conflicting information across modalities? arXiv preprint arXiv:2507.01790

  16. [16]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

  17. [17]

    Shixin Jiang, Jiafeng Liang, Jiyuan Wang, Xuan Dong, Heng Chang, Weijiang Yu, Jinhua Du, Ming Liu, and Bing Qin. 2025. From specific-mllms to omni-mllms: a survey on mllms aligned with multi-modalities. In Findings of the Association for Computational Linguistics: ACL 2025, pages 8617--8652

  18. [18]

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. 2019. Similarity of neural network representations revisited. In International conference on machine learning, pages 3519--3529. PMLR

  19. [19]

    Chun-Yi Kuan and Hung-yi Lee. 2025. Can large audio-language models truly hear? tackling hallucinations with multi-task assessment and stepwise audio reasoning. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1--5. IEEE

  20. [20]

    JuneHyoung Kwon, MiHyeon Kim, Eunju Lee, Juhwan Choi, and YoungBin Kim. 2025. See-saw modality balance: See gradient, and sew impaired vision-language balance to mitigate dominant modality bias. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume...

  21. [21]

    Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, and Lidong Bing. 2024. Mitigating object hallucinations in large vision-language models through visual contrastive decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13872--13882

  22. [22]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023a. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730--19742. PMLR

  23. [23]

    Yadong Li, Jun Liu, Tao Zhang, Song Chen, Tianpeng Li, Zehuan Li, Lijun Liu, Lingfeng Ming, Guosheng Dong, Da Pan, and 1 others. 2025. Baichuan-omni-1.5 technical report. arXiv preprint arXiv:2501.15368

  24. [24]

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023b. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 292--305

  25. [25]

    Yizhi Li, Yinghao Ma, Ge Zhang, Ruibin Yuan, Kang Zhu, Hangyu Guo, Yiming Liang, Jiaheng Liu, Zekun Wang, Jian Yang, and 1 others. 2024. Omnibench: Towards the future of universal omni-language models. arXiv preprint arXiv:2409.15272

  26. [26]

    Zihao Lin, Samyadeep Basu, Mohammad Beigi, Varun Manjunatha, Ryan A Rossi, Zichao Wang, Yufan Zhou, Sriram Balasubramanian, Arman Zarei, Keivan Rezaei, and 1 others. 2025. A survey on mechanistic interpretability for multi-modal foundation models. arXiv preprint arXiv:2502.17516

  27. [27]

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296--26306

  28. [28]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. Advances in neural information processing systems, 36:34892--34916

  29. [29]

    Hantao Lou, Changye Li, Jiaming Ji, and Yaodong Yang. 2025. Sae-v: Interpreting multimodal models for enhanced alignment. arXiv preprint arXiv:2502.17514

  30. [30]

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359--17372

  31. [31]

    Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, and 1 others. 2022. Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005

  32. [32]

    Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, and Albert Gatt. 2022. Valse: A task-independent benchmark for vision and language models centered on linguistic phenomena. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8253--8280

  33. [33]

    Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2463--2473

  34. [34]

    Pouya Pezeshkpour, Moin Aminnaseri, and Estevam Hruschka. 2025. Mixed signals: Decoding vlms’ reasoning and underlying bias in vision-language conflict. arXiv preprint arXiv:2504.08974

  35. [35]

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, and 1 others. 2025. Openai gpt-5 system card. arXiv preprint arXiv:2601.03267

  36. [36]

    Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz-Ziv. 2025. Layer by layer: Uncovering hidden representations in language models. arXiv preprint arXiv:2502.02013

  37. [37]

    Kim Sung-Bin, Oh Hyun-Bin, JungMok Lee, Arda Senocak, Joon Son Chung, and Tae-Hyun Oh. 2024. Avhbench: A cross-modal hallucination benchmark for audio-visual large language models. arXiv preprint arXiv:2410.18325

  38. [38]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, and 1 others. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805

  39. [39]

    Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. Bert rediscovers the classical nlp pipeline. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 4593--4601

  40. [40]

    Lindia Tjuatja, Valerie Chen, Tongshuang Wu, Ameet Talwalkar, and Graham Neubig. 2024. Do llms exhibit human-like response biases? A case study in survey design. Transactions of the Association for Computational Linguistics, 12:1011--1026

  41. [41]

    Xingrui Wang, Jiang Liu, Chao Huang, Xiaodong Yu, Ze Wang, Ximeng Sun, Jialian Wu, Alan Yuille, Emad Barsoum, and Zicheng Liu. 2025. Xmodbench: Benchmarking cross-modal capabilities and consistency in omni-language models. arXiv preprint arXiv:2510.15148

  42. [42]

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. 2025a. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215

  43. [43]

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, and 1 others. 2025b. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765

  44. [44]

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, and 1 others. 2024. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800

  45. [45]

    Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, and 1 others. 2025. Omnivinci: Enhancing architecture and data for omni-modal understanding llm. arXiv preprint arXiv:2510.15870

  46. [46]

    Zhuoran Zhang, Tengyue Wang, Xilin Gong, Yang Shi, Haotian Wang, Di Wang, and Lijie Hu. 2025. When modalities conflict: How unimodal reasoning uncertainty governs preference dynamics in mllms. arXiv preprint arXiv:2511.02243

  47. [47]

    Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. 2023. Large language models are not robust multiple choice selectors. arXiv preprint arXiv:2309.03882

  48. [48]

    Xinhan Zheng, Huyu Wu, Xueting Wang, and Haiyun Jiang. 2025a. Unveiling intrinsic text bias in multimodal large language models through attention key-space analysis. arXiv preprint arXiv:2510.26721

  49. [49]

    Xu Zheng, Chenfei Liao, Yuqian Fu, Kaiyu Lei, Yuanhuiyi Lyu, Lutao Jiang, Bin Ren, Jialei Chen, Jiawen Wang, Chengxin Li, and 1 others. 2025b. Mllms are deeply affected by modality bias. arXiv preprint arXiv:2505.18657