Demographic and Linguistic Bias Evaluation in Omnimodal Language Models
Pith reviewed 2026-05-10 16:01 UTC · model grok-4.3
The pith
Omnimodal language models display smaller demographic and linguistic disparities in image and video understanding than in audio tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Four omnimodal models were assessed on demographic attribute estimation, identity verification, activity recognition, multilingual speech transcription, and language identification. Accuracy differences were measured across age, gender, skin tone, language, and country of origin. Image and video understanding tasks showed better performance with smaller demographic disparities, whereas audio understanding tasks had significantly lower performance and substantial bias, including large accuracy differences across age groups, genders, and languages, and frequent prediction collapse toward narrow categories.
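The disparity measurement described above reduces to per-group accuracy plus a gap statistic. A minimal sketch of that computation; the age-group labels, the record format, and the max-minus-min gap metric are illustrative assumptions, not the paper's exact protocol:

```python
from collections import defaultdict

def group_accuracies(records):
    """Compute per-group accuracy from (group, correct) records."""
    totals = defaultdict(lambda: [0, 0])  # group -> [n_correct, n_total]
    for group, correct in records:
        totals[group][0] += int(correct)
        totals[group][1] += 1
    return {g: c / n for g, (c, n) in totals.items()}

def disparity_gap(accs):
    """Max minus min per-group accuracy: one simple disparity measure."""
    return max(accs.values()) - min(accs.values())

# Illustrative records: (age_group, prediction_was_correct)
records = [
    ("18-30", True), ("18-30", True), ("18-30", False), ("18-30", True),
    ("60+", True), ("60+", False), ("60+", False), ("60+", False),
]
accs = group_accuracies(records)
print(accs)                 # {'18-30': 0.75, '60+': 0.25}
print(disparity_gap(accs))  # 0.5
```

Other gap metrics (e.g., worst-group accuracy, variance across groups) drop in at the same place; the abstract does not specify which one the paper uses.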
What carries the argument
Cross-modality comparison of accuracy gaps on shared tasks for image, video, and audio inputs across four models.
Load-bearing premise
The selected tasks, models, and demographic attributes accurately reflect real-world biases without major distortion from training data, model architectures, or evaluation design.
What would settle it
A new omnimodal model in which audio tasks match or exceed image and video performance, with equally small accuracy gaps across age, gender, and language groups, would undermine the reported pattern.
read the original abstract
This paper provides a comprehensive evaluation of demographic and linguistic biases in omnimodal language models that process text, images, audio, and video within a single framework. Although these models are being widely deployed, their performance across different demographic groups and modalities is not well studied. Four omnimodal models are evaluated on tasks that include demographic attribute estimation, identity verification, activity recognition, multilingual speech transcription, and language identification. Accuracy differences are measured across age, gender, skin tone, language, and country of origin. The results show that image and video understanding tasks generally exhibit better performance with smaller demographic disparities. In contrast, audio understanding tasks exhibit significantly lower performance and substantial bias, including large accuracy differences across age groups, genders, and languages, and frequent prediction collapse toward narrow categories. These findings highlight the importance of evaluating fairness across all supported modalities as omnimodal language models are increasingly used in real-world applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates demographic and linguistic biases in four omnimodal language models across text, image, video, and audio modalities. It reports that image and video tasks (demographic estimation, identity verification, activity recognition) generally achieve higher accuracy with smaller disparities across age, gender, skin tone, language, and country of origin, whereas audio tasks (multilingual speech transcription, language identification) show significantly lower performance, larger accuracy gaps, and frequent prediction collapse toward narrow categories.
Significance. If the empirical findings are robust to controls for task difficulty and data characteristics, the work provides a timely modality-specific fairness assessment for emerging omnimodal models, underscoring the need to prioritize audio bias mitigation in real-world deployments.
major comments (2)
- The central cross-modality claim (image/video tasks exhibit better performance and smaller disparities than audio) rests on the assumption that the chosen tasks are comparable; however, activity recognition and speech transcription differ substantially in input variability, accent/noise factors, and likely training-data balance, with no reported normalization or difficulty metrics to isolate modality effects from these confounds.
- No details are provided on sample sizes per demographic group, statistical significance tests for accuracy differences, or error analysis (e.g., confusion matrices showing collapse), which are load-bearing for verifying the reported large gaps and prediction collapse in audio tasks.
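The requested error analysis can be made concrete with a confusion matrix plus a simple collapse score: the share of predictions landing in the single most-predicted class. A minimal sketch under assumed inputs; the language-ID label set and outputs are hypothetical:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows index the true label, columns the predicted label."""
    index = {label: i for i, label in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[index[t]][index[p]] += 1
    return m

def collapse_score(y_pred):
    """Fraction of predictions in the modal class; values near 1.0 signal collapse."""
    counts = Counter(y_pred)
    return max(counts.values()) / len(y_pred)

# Hypothetical language-ID outputs collapsing toward 'en'
y_true = ["en", "sw", "yo", "sw", "en", "yo"]
y_pred = ["en", "en", "en", "en", "en", "yo"]
print(confusion_matrix(y_true, y_pred, ["en", "sw", "yo"]))
print(collapse_score(y_pred))  # 5 of 6 predictions are 'en'
```

A heavy first column in the matrix and a collapse score near 1.0 would be exactly the kind of evidence the report asks the authors to tabulate.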
minor comments (1)
- The abstract omits the names of the four evaluated models, exact task definitions, and any quantitative metrics (e.g., accuracy deltas or disparity measures), reducing immediate verifiability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: The central cross-modality claim (image/video tasks exhibit better performance and smaller disparities than audio) rests on the assumption that the chosen tasks are comparable; however, activity recognition and speech transcription differ substantially in input variability, accent/noise factors, and likely training-data balance, with no reported normalization or difficulty metrics to isolate modality effects from these confounds.
Authors: We acknowledge that the selected tasks inherently differ in complexity, input variability, and likely training data characteristics, which limits the strength of direct cross-modality comparisons. Our evaluation intentionally uses standard, representative benchmarks for each modality to reflect practical deployment scenarios rather than artificially matched tasks. In the revised manuscript we will add an expanded limitations and discussion section that explicitly notes these confounds, cites relevant literature on task difficulty, and qualifies the central claim as observational rather than strictly controlled. We do not have pre-computed difficulty metrics or normalization factors for all tasks, so we cannot retroactively isolate modality effects without new experiments; however, we will report any available proxies (e.g., average input length or noise levels where documented in the datasets) to improve transparency. revision: partial
-
Referee: No details are provided on sample sizes per demographic group, statistical significance tests for accuracy differences, or error analysis (e.g., confusion matrices showing collapse), which are load-bearing for verifying the reported large gaps and prediction collapse in audio tasks.
Authors: We agree that these details are essential for reproducibility and for assessing the reliability of the reported disparities. The original submission omitted them to keep the main text concise. In the revision we will add: (i) a supplementary table listing the number of samples per demographic subgroup (age, gender, skin tone, language, country of origin) for every task and model; (ii) statistical significance tests (e.g., chi-squared or two-proportion z-tests with p-values and confidence intervals) for the key accuracy differences; and (iii) confusion matrices or error breakdowns for the audio tasks that illustrate the observed prediction collapse. These additions will be placed in the main text or an appendix as appropriate. revision: yes
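The two-proportion z-test the authors commit to could be sketched as follows; the sample counts are invented for illustration, and the pooled-variance formulation shown here is one standard choice among several:

```python
import math

def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Two-proportion z-test for an accuracy gap between two groups."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Invented counts: group A correct on 420/500 samples, group B on 330/500
z, p = two_proportion_z(420, 500, 330, 500)
print(round(z, 2), p < 0.001)  # → 6.57 True
```

With per-subgroup sample counts in the promised supplementary table, this test (or a chi-squared equivalent) directly quantifies whether each reported accuracy gap exceeds sampling noise.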
Circularity Check
No circularity; pure empirical measurement study
full rationale
The paper reports observed accuracy differences and bias metrics across modalities on fixed tasks and models. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content. By construction, all claims reduce directly to tabulated measurements rather than to any internal definition or prior self-result. This is a standard evaluation paper whose central findings are externally falsifiable via replication on the same benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Accuracy differences across demographic groups indicate the presence of bias
Reference graph
Works this paper leans on
- [1] Abouelenin, et al.: Phi-4-Mini technical report: Compact yet powerful multimodal language models via mixture of LoRAs. arXiv preprint arXiv:2503.01743 (2025)
- [2] AlSaad, et al.: Multimodal large language models in health care: Applications, challenges, and future outlook. J Med Internet Res 26, e59505 (2024)
- [3] Ardila, et al.: Common Voice: A massively-multilingual speech corpus. In: Proceedings of the Twelfth Language Resources and Evaluation Conference (2020)
- [4] Baevski, et al.: wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems (2020)
- [5] Biometric Update: How multimodal biometrics help with age verification and compliance (2024), https://www.biometricupdate.com/202410/how-multimodal-biometrics-help-with-age-verification-and-compliance
- [6] Chen, et al.: Quantifying and mitigating unimodal biases in multimodal large language models: A causal perspective. In: Findings of the ACL: EMNLP 2024 (2024)
- [7] Cheng, et al.: Social debiasing for fair multi-modal LLMs. Proceedings of the IEEE/CVF International Conference on Computer Vision (2025)
- [8] Comanici, et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
- [9] Feng, et al.: Quantifying bias in automatic speech recognition. arXiv preprint arXiv:2103.15122 (2021)
- [10] Feng, et al.: Towards inclusive automatic speech recognition. Computer Speech & Language 84, 101567 (2024)
- [11] Gemma Team: Gemma 3 technical report. arXiv preprint arXiv:2503.19786 (2025)
- [12] Groh, et al.: Towards transparency in dermatology image datasets with skin tone annotations. Proceedings of the ACM on Human-Computer Interaction (2022)
- [13] Hazirbas, et al.: Towards measuring fairness in AI: The Casual Conversations dataset. IEEE Trans. Biometrics, Behavior, and Identity Science 4(3), 324–332 (2021)
- [14] Huang, et al.: VisBias: Measuring explicit and implicit social biases in vision language models. Empirical Methods in Natural Language Processing (2025)
- [15] Hurst, et al.: GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024)
- [16] Imam, et al.: Automatic speech recognition for African low-resource languages: A systematic literature review. Proceedings of the 6th AfricaNLP Workshop (2025)
- [17] Amazon Artificial General Intelligence: Amazon Nova 2: Multimodal reasoning and generation models. Amazon Technical Reports (2025), https://www.amazon.science/publications/amazon-nova-2-multimodal-reasoning-and-generation-models
- [18] Jiang, et al.: From specific-MLLMs to omni-MLLMs: A survey on MLLMs aligned with multi-modalities. In: Findings of ACL. pp. 8617–8652 (2025)
- [19] Kulkarni, et al.: The balancing act: Unmasking and alleviating ASR biases in Portuguese. Proceedings of the 4th LT-EDI Workshop (2024)
- [20] Kulkarni, et al.: Unveiling biases while embracing sustainability: Assessing the dual challenges of automatic speech recognition systems. Interspeech 2024 (2024)
- [21] Nakatumba-Nabende, et al.: A systematic literature review on bias evaluation and mitigation in automatic speech recognition models for low-resource African languages. ACM Computing Surveys (2025)
- [22] Narayan, et al.: FaceXBench: Evaluating multimodal LLMs on face understanding. In: Proceedings of the 33rd ACM International Conference on Multimedia (2025)
- [23] Perera, et al.: Investigating social biases in multimodal LLMs. In: Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition (FG). pp. 1–10 (2025)
- [24] Porgali, et al.: The Casual Conversations v2 dataset. In: Proc. IEEE/CVF CVPR Workshops. pp. 10–17 (2023)
- [25] Pratap, et al.: Scaling speech technology to 1,000+ languages. Journal of Machine Learning Research 25 (2023)
- [26] Radford, et al.: Robust speech recognition via large-scale weak supervision. Proceedings of the 40th International Conference on Machine Learning (2023)
- [27] Robinson, et al.: Face recognition: Too bias, or not too bias? In: Proc. IEEE/CVF CVPR Workshops. pp. 0–1 (2020)
- [28] Río, et al.: Accents in speech recognition through the lens of a World Englishes evaluation set. Research in Language 21(3), 225–244 (2023)
- [29] Serditova, et al.: Automatic speech recognition biases in Newcastle English: An error analysis. Interspeech 2025 (2025)
- [30] Shahreza, et al.: FaceLLM: A multimodal large language model for face understanding. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops (2025)
- [31] Shim, et al.: Dialetto, ma quanto dialetto? Transcribing and evaluating dialects on a continuum. In: Proc. NAACL 2025 (2025)
- [32] Song, et al.: The good, the bad, and the greedy: Evaluation of LLMs should not ignore non-determinism. In: Proc. NAACL-HLT. pp. 4195–4206 (2025)
- [33] Sung-Bin, et al.: AVHBench: A cross-modal hallucination benchmark for audio-visual large language models. ICLR (2025)
- [34] Torgbi, et al.: Adapting Whisper for regional dialects: Enhancing public services for vulnerable populations in the United Kingdom. Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects (2025)
- [35] Wang, et al.: A comprehensive review of multimodal large language models: Performance and challenges across different tasks. arXiv preprint arXiv:2408.01319 (2024)
- [36] Xu, et al.: Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215 (2025)
- [37] Zhang, et al.: LLaVA-Video: Video instruction tuning with synthetic data. Transactions on Machine Learning Research (2025)
discussion (0)