All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation
Pith reviewed 2026-05-07 17:33 UTC · model grok-4.3
The pith
Audio-language models keep 60-72 percent of their benchmark scores even with no audio input.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large audio-language models retain 60-72% of their full audio scores even without any audio input. Among items that require audio, only 3.0-4.2% need the complete audio clip; the majority can be resolved using localized fragments. These results indicate that benchmarks often fail to isolate genuine auditory perception.
What carries the argument
Two-axis diagnostic framework that separates text prior (how much a question can be answered from text and knowledge alone) from audio reliance (how much the acoustic signal is actually required).
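As a rough sketch, both axes can be read off a model's accuracies under the full-audio and no-audio conditions. The chance-normalized definitions below are illustrative assumptions, not necessarily the paper's exact R_TP formula; only the plain retention ratio matches the headline 60-72% figure directly.

```python
def retention(acc_none: float, acc_full: float) -> float:
    """Share of full-audio accuracy kept with no audio (paper reports 60-72%)."""
    return acc_none / acc_full

def text_prior_rate(acc_none: float, acc_full: float, chance: float) -> float:
    """Illustrative chance-normalized text prior: fraction of above-chance
    performance achievable from the question text alone (assumed definition)."""
    return (acc_none - chance) / (acc_full - chance)

def audio_reliance(acc_none: float, acc_full: float, chance: float) -> float:
    """Complement of the text prior: above-chance performance that needs audio."""
    return 1.0 - text_prior_rate(acc_none, acc_full, chance)

# Example: a 4-way MCQ benchmark (chance = 0.25) where a model scores
# 70% with audio and 49% without: retention(0.49, 0.70) -> 0.7
```

A high retention or text-prior rate flags a benchmark item set that a model can largely solve without listening at all.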
If this is right
- Benchmark scores likely overestimate how much models rely on actual audio processing.
- Evaluation protocols should routinely include a no-audio baseline to verify genuine audio dependence.
- Most audio-dependent items can be answered from short fragments, so full-clip requirements may be unnecessary in many cases.
- Benchmark design needs to reduce the influence of text priors to better isolate auditory capabilities.
- Guidelines for improving evaluation reliability follow directly from the measured gaps between text-only and full-audio performance.
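The no-audio baseline called for above amounts to scoring the same items twice, once with the clip and once without. A minimal sketch, assuming a hypothetical `model.answer(question, audio)` interface (real LALM APIs differ per model):

```python
def no_audio_gap(model, items):
    """Accuracy with and without the audio input.

    model: any object exposing answer(question, audio) -> predicted choice
           (hypothetical interface, for illustration only)
    items: list of (question, audio_clip, gold_answer) triples
    Returns (acc_full, acc_none).
    """
    n = len(items)
    acc_full = sum(model.answer(q, audio=a) == gold for q, a, gold in items) / n
    acc_none = sum(model.answer(q, audio=None) == gold for q, _, gold in items) / n
    return acc_full, acc_none
```

A benchmark that genuinely tests hearing should drive `acc_none` toward chance; `acc_none` close to `acc_full` indicates the items are answerable from text priors.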
Where Pith is reading between the lines
- Similar text-prior inflation could affect vision-language or other multimodal benchmarks, suggesting a general diagnostic approach across modalities.
- Training objectives might be adjusted to penalize over-reliance on text patterns when audio is present.
- Developers could apply the same two-axis test during model iteration to detect when improvements come from language shortcuts rather than audio features.
Load-bearing premise
Removing the audio input cleanly measures text priors without introducing new biases or artifacts in how questions are presented.
What would settle it
Re-running the benchmarks after masking or removing all text cues and observing whether scores collapse to near chance levels.
Original abstract
Large Audio-Language Models show consistent performance gains across speech and audio benchmarks, yet high scores may not reflect true auditory perception. If a model can answer questions without processing the acoustic signal, the benchmark fails as a measure of auditory understanding. We present a diagnostic framework using two axes: text prior, which measures answerability from text and general knowledge alone, and audio reliance, which assesses actual dependency on the acoustic signal. Evaluating eight LALMs across three benchmarks, we find that models retain 60-72% of their full audio scores even without any audio input. Moreover, among items that require audio, only 3.0-4.2% need the complete audio clip; the majority can be resolved using localized fragments. These findings challenge the assumption that benchmark performance equals robust audio understanding, and we conclude with practical guidelines for improving evaluation reliability and benchmark design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a two-axis diagnostic framework for evaluating text priors and audio reliance in Large Audio-Language Models. By testing eight LALMs on three benchmarks, the authors find that these models retain 60-72% of their full audio performance even without audio input, suggesting heavy reliance on text priors. Additionally, they report that among audio-requiring items, only 3.0-4.2% necessitate the complete audio clip, with the majority being solvable using localized audio fragments. The work challenges the validity of current benchmarks as measures of auditory understanding and offers practical guidelines for improved evaluation.
Significance. If the experimental conditions are free of confounds, this paper makes a significant contribution by empirically demonstrating that text-based information can account for a large portion of LALM performance on audio benchmarks. The direct, parameter-free ablations on multiple models and benchmarks provide reproducible evidence that current evaluation practices may overestimate audio understanding capabilities. The finding that localized fragments suffice for most items further implies that benchmarks may not be testing holistic audio comprehension, and the proposed guidelines could help in designing more robust tests.
major comments (1)
- [Experimental Setup] The headline results (60-72% retention without audio; 3.0-4.2% needing full clip) are load-bearing for the paper's critique of benchmarks. However, the manuscript does not provide explicit details on how the no-audio condition is implemented, including prompt modifications, audio input handling (e.g., zeroed signals or absent input), and controls for question phrasing. As noted in the stress-test, if these alter model behavior, the percentages may not purely reflect text priors. This requires clarification to validate the isolation of the diagnostic axes.
minor comments (2)
- [Abstract] The abstract reports quantitative results but omits the names of the three benchmarks and eight models, which would aid assessment of representativeness.
- [Results] The results section should report data splits, statistical significance tests, or confidence intervals for the key percentages to allow readers to evaluate robustness.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to strengthen the manuscript. We address the major comment on experimental setup below and will revise the paper to improve clarity and reproducibility.
Point-by-point responses
-
Referee: [Experimental Setup] The headline results (60-72% retention without audio; 3.0-4.2% needing full clip) are load-bearing for the paper's critique of benchmarks. However, the manuscript does not provide explicit details on how the no-audio condition is implemented, including prompt modifications, audio input handling (e.g., zeroed signals or absent input), and controls for question phrasing. As noted in the stress-test, if these alter model behavior, the percentages may not purely reflect text priors. This requires clarification to validate the isolation of the diagnostic axes.
Authors: We agree that explicit implementation details are essential for validating the isolation of text priors. In the no-audio condition, we used the exact same prompts and question phrasing as the full-audio setting for all models and benchmarks, with no modifications or additional text. Audio input was handled by completely omitting the audio modality: for models accepting separate audio features, we provided no audio encoder output; for others requiring a fixed audio tensor, we supplied a zero-valued tensor of identical shape and duration. This approach was chosen to minimize any unintended behavioral shifts. We performed internal consistency checks across models confirming that response distributions aligned with text-only inference expectations. We will add a dedicated subsection 'Implementation of the No-Audio Condition' to the Methods section, specifying the exact procedure per model (including any architecture-specific handling), the zero-tensor details, and the verification steps used to confirm no prompt alterations occurred. This revision will directly address the concern and allow readers to assess whether the reported retention rates purely reflect text priors.
Revision: yes
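The zero-tensor handling the rebuttal describes could look like the following. The mono float32 waveform layout and 16 kHz sample rate are illustrative assumptions; each LALM's audio front-end differs, and for models that accept an absent audio modality the cleaner ablation is to omit the input entirely.

```python
import numpy as np

def silence_like(waveform: np.ndarray) -> np.ndarray:
    """Zero-valued audio with the same shape, dtype, and therefore duration
    as the original clip, for models that require a fixed-size audio tensor.
    Assumes a raw waveform array; encoder-feature inputs would need the
    analogous zero tensor in feature space."""
    return np.zeros_like(waveform)

# Example: replace a 3-second, 16 kHz mono clip with silence of equal length.
clip = np.random.randn(3 * 16000).astype(np.float32)
silent = silence_like(clip)
```

Feeding `silent` in place of `clip` keeps the prompt and tensor shapes identical across conditions, so any score change should be attributable to the missing acoustic content.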
Circularity Check
No significant circularity in empirical ablation study
full rationale
The paper defines its diagnostic framework operationally through direct performance measurements on existing benchmarks and models (full-audio vs. no-audio conditions). Claims such as 60-72% score retention and 3.0-4.2% full-clip necessity are reported as observed empirical outcomes rather than derived via equations, fitted parameters, or self-referential constructions. No load-bearing self-citations, ansatzes, or uniqueness theorems reduce the results to inputs by definition. The analysis is grounded in external benchmarks rather than in the paper's own constructions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Existing audio-language benchmarks are designed to measure genuine auditory perception rather than text-based inference.
Reference graph
Works this paper leans on
-
[1]
Introduction The rapid development of Large Audio-Language Models (LALMs) [1–15], which extend Large Language Models (LLMs) [16, 17] with auditory perception and knowledge [18, 19], has led to consistent performance gains on speech and audio benchmarks [20–32]. These improvements are often interpreted as evidence of strong auditory understanding [33, ...
-
[2]
Methodology We propose two diagnostic axes for auditing audio-language benchmarks. Text prior (§2.1) quantifies how much performance is achievable without audio. Audio reliance (§2.2) measures how much of the audio signal models actually use. 2.1. Text Prior A desirable audio-language benchmark should require access to the auditory signal to be solved cor...
-
[3]
Benchmarks We evaluate on three public audio-language benchmarks
Experimental Setups 3.1. Benchmarks We evaluate on three public audio-language benchmarks. MMAU [20] is a 10,000-item MCQ dataset covering sound, music, and speech; we use the 1,000-item test-mini split. Table 2: Evaluated LALMs and their text backbones. LALM / Text Backbone: Audio-Flamingo-3 [4] / Qwen2.5-7B-Instruct [41]; DeSTA-2.5 [6] / Llama-3.1-8B-Instruct [16...
-
[4]
Results 4.1. Results on Text Prior Table 3a reports the accuracies under the Full, None, and Text Backbone (TB) settings for all models and benchmarks, together with the corresponding text-prior rate R_TP. TB accuracy substantially exceeds chance for most models across benchmarks. Averaged across models, TB surpasses chance by 12.4%, 5.4%, and 3.6% on ...
-
[5]
Benchmark designers should measure text prior to ensure tasks cannot be solved using text-only cues and thus genuinely assess auditory understanding
Recommended Practices We hope this study provides practical guidance for LALM research. Benchmark designers should measure text prior to ensure tasks cannot be solved using text-only cues and thus genuinely assess auditory understanding. Model developers should also compare performance with and without audio to verify that improvements arise from audi...
-
[6]
Across three benchmarks and eight LALMs, models retain 60–72% of their full accuracy without audio, revealing strong text priors
Conclusion We analyze how current audio-language benchmarks depend on audio by decomposing performance into text prior and audio reliance. Across three benchmarks and eight LALMs, models retain 60–72% of their full accuracy without audio, revealing strong text priors. Under partial audio, only 3.0–4.2% of items require cross-segment information, while mos...
-
[7]
Acknowledgments We acknowledge the computational and storage support provided by the National Center for High-performance Computing (NCHC) of the National Applied Research Laboratories (NARLabs) in Taiwan. This work was supported by the Ministry of Education (MOE) of Taiwan under the project Taiwan Centers of Excellence in Artificial Intelligence,...
-
[8]
In addition, large language models were utilized as judges for the automatic evaluation
Generative AI Use Disclosure Generative AI tools were used in this paper solely for language polishing and writing refinement. In addition, large language models were utilized as judges for the automatic evaluation
-
[9]
Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin, C. Zhou, and J. Zhou, “Qwen2-audio technical report,” arXiv preprint arXiv:2407.10759, 2024. [Online]. Available: https://arxiv.org/abs/2407.10759
-
[10]
J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, “Qwen2.5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025. [Online]. Available: https://arxiv.org/abs/2503.20215
-
[11]
J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu, Y. Lv, Y. Wang, D. Guo, H. Wang, L. Ma, P. Zhang, X. Zhang, H. Hao, Z. Guo, B. Yang, B. Zhang, Z. Ma, X. Wei, S. Bai, K. Chen, X. Liu, P. Wang, M. Yang, D. Liu, X. Ren, B. Zheng, R. Men, F. Zhou, B. Yu, J. Yang, L. Yu, J. Zhou, and J. Lin, “Qwen3-omni technical report,” arXi...
-
[12]
Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,
S. Ghosh, A. Goel, J. Kim, S. Kumar, Z. Kong, S. gil Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle, and B. Catanzaro, “Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems,
-
[13]
[Online]. Available: https://openreview.net/forum?id=FjByDpDVIO
-
[15]
K.-H. Lu, Z. Chen, S.-W. Fu, C.-H. H. Yang, S.-F. Huang, C.-K. Yang, C.-E. Yu, C.-W. Chen, W.-C. Chen, C.-y. Huang, Y.-C. Lin, Y.-X. Lin, C.-A. Fu, C.-Y. Kuan, W. Ren, X. Chen, W.-P. Huang, E.-P. Hu, T.-Q. Lin, Y.-K. Wu, K.-P. Huang, H.-Y. Huang, H.-C. Chou, K.-W. Chang, C.-H. Chiang, B. Ginsburg, Y.-C. F. Wang, and H.-y. Lee, “DeSTA2.5-Audio: Towar...
-
[17]
[Online]. Available: https://arxiv.org/abs/2507.13264
-
[18]
Building a taiwanese mandarin spoken language model: A first attempt,
C.-K. Yang, Y.-K. Fu, C.-A. Li, Y.-C. Lin, Y.-X. Lin, W.-C. Chen, H. L. Chung, C.-Y. Kuan, W.-P. Huang, K.-H. Lu et al., “Building a taiwanese mandarin spoken language model: A first attempt,” arXiv preprint arXiv:2411.07111, 2024
-
[19]
Speech-copilot: Leveraging large language models for speech processing via task decomposition, modularization, and program generation,
C.-Y. Kuan et al., “Speech-copilot: Leveraging large language models for speech processing via task decomposition, modularization, and program generation,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 1060–1067
2024
-
[20]
DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment,
K.-H. Lu, Z. Chen, S.-W. Fu, H. Huang, B. Ginsburg, Y.-C. F. Wang, and H.-y. Lee, “DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment,” in Interspeech 2024, 2024, pp. 4159–4163
2024
-
[21]
Developing instruction-following speech language model without speech instruction-tuning data,
K.-H. Lu, Z. Chen, S.-W. Fu, C.-H. H. Yang, J. Balam, B. Ginsburg, Y.-C. F. Wang, and H.-Y. Lee, “Developing instruction-following speech language model without speech instruction-tuning data,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5
2025
-
[22]
A preliminary exploration with gpt-4o voice mode,
Y.-X. Lin, C.-K. Yang, W.-C. Chen, C.-A. Li, C.-y. Huang, X. Chen, and H.-y. Lee, “A preliminary exploration with gpt-4o voice mode,” arXiv preprint arXiv:2502.09940, 2025
-
[23]
Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities,
Z. Kong, A. Goel, R. Badlani, W. Ping, R. Valle, and B. Catanzaro, “Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities,” in Proceedings of the 41st International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 235. PMLR, 2024, pp. 25125–25148, also available as arXiv:2402.01831. ...
-
[24]
SALMONN: Towards generic hearing abilities for large language models,
C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “SALMONN: Towards generic hearing abilities for large language models,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=14rn7HpKVk
2024
-
[25]
BLSP-emo: Towards empathetic large speech-language models,
C. Wang, M. Liao, Z. Huang, J. Wu, C. Zong, and J. Zhang, “BLSP-emo: Towards empathetic large speech-language models,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 19186–19199
2024
-
[26]
A. Grattafiori et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024
-
[27]
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025
-
[28]
C.-K. Yang, N. Ho, Y.-J. Lee, and H.-y. Lee, “Audiolens: A closer look at auditory attribute perception of large audio-language models,” arXiv preprint arXiv:2506.05140, 2025
-
[29]
Sake: Towards editing auditory attribute knowledge of large audio-language models,
C.-K. Yang, Y.-T. Piao, T.-W. Hsu, S.-W. Fu, Z. Chen, K.-H. Lu, S.-F. Huang, C.-H. H. Yang, Y.-C. F. Wang, Y.-N. Chen et al., “Sake: Towards editing auditory attribute knowledge of large audio-language models,” arXiv preprint arXiv:2510.16917, 2025
-
[30]
MMAU: A massive multi-task audio understanding and reasoning benchmark,
S. Sakshi, U. Tyagi, S. Kumar, A. Seth, R. Selvakumar, O. Nieto, R. Duraiswami, S. Ghosh, and D. Manocha, “MMAU: A massive multi-task audio understanding and reasoning benchmark,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=TeVAZXr3yv
2025
-
[31]
MMAR: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,
Z. Ma, Y. Ma, Y. Zhu, C. Yang, Y.-W. Chao, R. Xu, W. Chen, Y. Chen, Z. Chen, J. Cong et al., “MMAR: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. [Online]. Available: https://openreview.net/forum?id=f...
2025
-
[32]
S. Kumar, Š. Sedláček, V. Lokegaonkar, F. López, W. Yu, N. Anand, H. Ryu, L. Chen, M. Plička, M. Hlaváček et al., “Mmau-pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence,” arXiv preprint arXiv:2508.13992, 2025
-
[33]
AudioBench: A universal benchmark for audio large language models,
B. Wang, X. Zou, G. Lin, S. Sun, Z. Liu, W. Zhang, Z. Liu, A. Aw, and N. F. Chen, “AudioBench: A universal benchmark for audio large language models,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), L. Chiruzzo, A. Ritter, and...
-
[34]
AIR-bench: Benchmarking large audio-language models via generative comprehension,
Q. Yang, J. Xu, W. Liu, Y. Chu, Z. Jiang, X. Zhou, Y. Leng, Y. Lv, Z. Zhao, C. Zhou, and J. Zhou, “AIR-bench: Benchmarking large audio-language models via generative comprehension,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V. Srikumar, Eds. Bangkok, Tha...
2024
-
[35]
Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech,
C.-y. Huang, K.-H. Lu, S.-H. Wang, C.-Y. Hsiao, C.-Y. Kuan, H. Wu, S. Arora, K.-W. Chang, J. Shi, Y. Peng et al., “Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12136–12140
2024
-
[36]
Dynamic-superb phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,
C.-y. Huang, W.-C. Chen, S.-w. Yang, A. T. Liu, C.-A. Li, Y.-X. Lin, W.-C. Tseng, A. Diwan, Y.-J. Shih, J. Shi et al., “Dynamic-superb phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: h...
2025
-
[37]
SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information,
C.-K. Yang, N. Ho, Y.-T. Piao, and H.-y. Lee, “SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information,” in Interspeech 2025, 2025, pp. 1788–1792
2025
-
[38]
Listen and speak fairly: a study on semantic gender bias in speech integrated large language models,
Y.-C. Lin, T.-Q. Lin, C.-K. Yang, K.-H. Lu, W.-C. Chen, C.-Y. Kuan, and H.-Y. Lee, “Listen and speak fairly: a study on semantic gender bias in speech integrated large language models,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 439–446
2024
-
[39]
Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models,
K.-H. Lu, C.-Y. Kuan, and H.-y. Lee, “Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models,” in Interspeech 2025, 2025, pp. 2078–2082
2025
-
[40]
Mugen: Evaluating and improving multi-audio understanding of large audio-language models,
C.-K. Yang, Y.-S. Tsai, Y.-K. Guo, P.-L. Tsai, Y.-T. Piao, H.-W. Chen, T.-L. Hsiao, Y.-M. Hsu, K.-H. Lu, and H.-y. Lee, “Mugen: Evaluating and improving multi-audio understanding of large audio-language models,” arXiv preprint arXiv:2603.09714, 2026
-
[41]
Speechr: A benchmark for speech reasoning in large audio-language models,
W. Yang, Y. Li, Y. Wei, M. Fang, and L. Chen, “Speechr: A benchmark for speech reasoning in large audio-language models,” arXiv preprint arXiv:2508.02018, 2025
-
[42]
VoxEval: Benchmarking the knowledge understanding capabilities of end-to-end spoken language models,
W. Cui, X. Jiao, Z. Meng, and I. King, “VoxEval: Benchmarking the knowledge understanding capabilities of end-to-end spoken language models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association for Computatio...
2025
-
[43]
On the landscape of spoken language models: A comprehensive survey,
S. Arora, K.-W. Chang, C.-M. Chien, Y. Peng, H. Wu, Y. Adi, E. Dupoux, H.-y. Lee, K. Livescu, and S. Watanabe, “On the landscape of spoken language models: A comprehensive survey,” Transactions on Machine Learning Research, 2025. [Online]. Available: https://openreview.net/forum?id=BvxaP3sVbA
2025
-
[44]
Towards holistic evaluation of large audio-language models: A comprehensive survey,
C.-K. Yang, N. S. Ho, and H.-y. Lee, “Towards holistic evaluation of large audio-language models: A comprehensive survey,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 10 1...
2025
-
[45]
Hypothesis only baselines in natural language inference,
A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme, “Hypothesis only baselines in natural language inference,” in Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, M. Nissim, J. Berant, and A. Lenci, Eds. New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 180–191. [Online]. Av...
2018
-
[46]
Annotation artifacts in natural language inference data,
S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith, “Annotation artifacts in natural language inference data,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), M. Walker, H. Ji, and A. Stent, Eds. New Orlea...
2018
-
[47]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering,
Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the v in vqa matter: Elevating the role of image understanding in visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
2017
-
[48]
Are we on the right way for evaluating large vision-language models?
L. Chen, J. Li, X. Dong, P. Zhang, Y. Zang, Z. Chen, H. Duan, J. Wang, Y. Qiao, D. Lin, and F. Zhao, “Are we on the right way for evaluating large vision-language models?” in Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, Eds., vol. 37. Curran Associates, Inc., 2024, p...
2024
-
[49]
C.-A. Li, T.-H. Lin, and H.-y. Lee, “When silence matters: The impact of irrelevant audio on text reasoning in large audio-language models,” arXiv preprint arXiv:2510.00626, 2025
-
[50]
Measuring audio’s impact on correctness: Audio-contribution-aware post-training of large audio language models,
H. He, X. Du, R. Sun, Z. Dai, Y. Xiao, M. Yang, J. Zhou, X. Li, Z. Liu, Z. Liang, C. Wu, Q. He, T. Lee, X. Chen, W.-L. Zheng, W. Wang, M. D. Plumbley, J. Liu, and Q. Kong, “Measuring audio’s impact on correctness: Audio-contribution-aware post-training of large audio language models,” in The Fourteenth International Conference on Learning Representations,...
2026
-
[51]
Qwen2.5 technical report,
A. Y. Qwen, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al., “Qwen2.5 technical report,” arXiv preprint, 2024
2024
-
[52]
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach, J. Bao, A. Benhaim, M. Cai, V. Chaudhary, C. Chen et al., “Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras,” arXiv preprint arXiv:2503.01743, 2025
-
[53]
J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023
-
[54]
Un ministral, des ministraux,
Mistral AI, “Un ministral, des ministraux,” 2024. [Online]. Available: https://mistral.ai/news/ministraux
2024
-
[55]
Claude 4.5 haiku (version 20251001),
Anthropic, “Claude 4.5 haiku (version 20251001),” https://www.anthropic.com/news/claude-4-5-haiku, 2025, model version: claude-haiku-4-5-20251001. Accessed 2026-03-05
2025