Investigating Modality Contribution in Audio LLMs for Music
Pith reviewed 2026-05-21 22:41 UTC · model grok-4.3
The pith
Audio LLMs for music draw more from text than sound, yet still identify key audio events.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By adapting MM-SHAP to decompose modality contributions in a performance-agnostic way, the evaluation of two Audio LLMs on MuChoMusic shows that higher accuracy correlates with greater text reliance; however, low overall audio contribution scores coexist with successful localization of key sound events, indicating that audio input is not entirely ignored.
What carries the argument
Adapted MM-SHAP, a Shapley-value-based score that decomposes the relative contribution of audio and text modalities to each model prediction without depending on final accuracy.
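As a rough illustration of the score described above, the sketch below computes exact Shapley values for a toy game and then forms the text and audio shares following the ratio A-SHAP = ΦA / (ΦT + ΦA). It assumes, as in the original MM-SHAP formulation, that per-feature magnitudes are summed within each modality. The value function `toy_value`, the feature names, and the weights are hypothetical stand-ins for a model's answer score under masking; none of them come from the paper.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal gain of each player over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        present = set()
        for p in order:
            before = value(frozenset(present))
            present.add(p)
            totals[p] += value(frozenset(present)) - before
    return {p: t / len(orderings) for p, t in totals.items()}


def modality_shares(phi, text_features, audio_features):
    """Return (T-SHAP, A-SHAP) from per-feature Shapley values, using magnitudes."""
    phi_t = sum(abs(phi[f]) for f in text_features)
    phi_a = sum(abs(phi[f]) for f in audio_features)
    total = phi_t + phi_a
    if total == 0:
        raise ValueError("all contributions are zero; shares are undefined")
    return phi_t / total, phi_a / total


# Hypothetical additive "model": each unmasked feature shifts the answer score.
WEIGHTS = {"t1": 0.4, "t2": 0.2, "a1": 0.3, "a2": -0.1}


def toy_value(coalition):
    return sum(WEIGHTS[f] for f in coalition)


phi = shapley_values(list(WEIGHTS), toy_value)
t_shap, a_shap = modality_shares(phi, ["t1", "t2"], ["a1", "a2"])
print(round(t_shap, 3), round(a_shap, 3))  # prints 0.6 0.4
```

The negative weight on `a2` shows why magnitudes are used: a segment that pushes the score down still counts as audio being used. Exact enumeration is only feasible for a handful of features; real inputs would need a sampling approximation.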
If this is right
- The higher-accuracy model on music questions relies more on text than on audio.
- Low aggregate audio contribution scores do not prevent the model from localizing specific sound events.
- Audio input supplies usable information in these tasks even when it is not the dominant modality.
- The MM-SHAP adaptation offers a reusable tool for measuring modality balance in other Audio LLM applications.
Where Pith is reading between the lines
- If localization succeeds at low contribution levels, training objectives that reward explicit audio event detection could raise overall audio reliance without hurting accuracy.
- Benchmarks for Audio LLMs could add separate localization sub-tasks to distinguish true audio use from text-based guessing.
- The same contribution analysis could be run on non-music audio tasks to test whether low audio scores are a general pattern or specific to music questions.
Load-bearing premise
The adaptation of MM-SHAP to the MuChoMusic questions produces a faithful decomposition of modality contributions that introduces no new biases from the modification process itself.
What would settle it
A controlled run in which the audio is masked or replaced with silence: if the model's predictions on the items where it previously localized sound events stay unchanged, that would falsify the claim that audio is still used despite low overall contribution scores.
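One way to sketch such a control is to compare the probability of the model's answer with the original audio and with silence of the same length. The callable `answer_prob(audio, question)` is a hypothetical wrapper around a model, and the two toy "models" below exist only to make the sketch runnable.

```python
def replace_with_silence(audio):
    """Return a zero-valued waveform of the same length."""
    return [0.0] * len(audio)


def silence_ablation_drop(answer_prob, audio, question):
    """Drop in answer probability when the audio is silenced.

    `answer_prob(audio, question)` is a hypothetical callable wrapping the model.
    A drop near zero means the prediction did not depend on the audio.
    """
    silent = replace_with_silence(audio)
    return answer_prob(audio, question) - answer_prob(silent, question)


# Toy stand-ins for a model (assumptions for illustration only).
def listens(audio, question):
    energy = sum(x * x for x in audio) / max(len(audio), 1)
    return min(1.0, 0.25 + energy)


def text_only(audio, question):
    return 0.5


clip = [0.5, -0.5, 0.5, -0.5]
print(silence_ablation_drop(listens, clip, "Which instrument plays?"))    # prints 0.25
print(silence_ablation_drop(text_only, clip, "Which instrument plays?"))  # prints 0.0
```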
Original abstract
Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it is unclear if they are truly listening to the audio or just using textual reasoning, as recent benchmarks suggest. This paper investigates this issue by quantifying the contribution of each modality to a model's output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that quantifies the relative contribution of each modality to a model's prediction. We evaluate two models on the MuChoMusic benchmark and find that the model with higher accuracy relies more on text to answer questions, but further inspection shows that even if the overall audio contribution is low, models can successfully localize key sound events, suggesting that audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs and we hope it will serve as a foundational step for future research in explainable AI and audio.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper adapts the MM-SHAP framework (a Shapley-value-based, performance-agnostic method) to quantify audio versus text modality contributions in two Audio LLMs on the MuChoMusic benchmark. It reports that the higher-accuracy model relies more on text, yet claims that even with low overall audio contribution the models can localize key sound events and therefore do not entirely ignore audio. The work positions itself as the first application of MM-SHAP to Audio LLMs.
Significance. If the adaptation of MM-SHAP is shown to be faithful and the localization analysis is made quantitative and systematic, the results would offer a concrete, falsifiable decomposition of modality use in music-oriented Audio LLMs and could serve as a useful baseline for future explainability studies. The explicit first-application framing is a modest but clear contribution.
major comments (2)
- Abstract and §4 (further inspection): the central claim that 'models can successfully localize key sound events' despite low global audio contribution rests on an unspecified additional step. For the claim to be load-bearing, this inspection must (a) be performed systematically across the MuChoMusic set, (b) employ a metric that remains performance-agnostic, and (c) demonstrate that localized events receive measurably higher marginal Shapley contribution than non-localized segments under the same protocol. If the inspection is qualitative or uses a separate attention/gradient method, the link between the reported MM-SHAP scores and retained local utility is not established.
- §3 (adaptation of MM-SHAP): the manuscript must specify exactly how the original MM-SHAP procedure was modified for audio inputs (e.g., masking strategy for audio segments, handling of variable-length audio, choice of baseline, and any post-hoc aggregation). Without these details it is impossible to verify whether the adaptation introduces new biases that undermine the performance-agnostic property asserted in the abstract.
minor comments (1)
- The abstract states concrete findings on two models and one benchmark but provides neither error bars nor statistical significance tests on the modality scores; these should be added to the results section.
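A minimal sketch of one way to attach error bars to an average modality score is a percentile bootstrap over per-question values. The choice of bootstrap, the function name `bootstrap_ci`, and the sample scores are assumptions for illustration; the report does not prescribe a method.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("need at least one value")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Hypothetical per-question A-SHAP scores, not taken from the paper.
scores = [0.21, 0.25, 0.19, 0.30, 0.22, 0.18, 0.27, 0.24]
low, high = bootstrap_ci(scores)
print(f"mean={mean(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```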
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving the rigor and reproducibility of our analysis. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core findings.
- Referee: Abstract and §4 (further inspection): the central claim that 'models can successfully localize key sound events' despite low global audio contribution rests on an unspecified additional step. For the claim to be load-bearing, this inspection must (a) be performed systematically across the MuChoMusic set, (b) employ a metric that remains performance-agnostic, and (c) demonstrate that localized events receive measurably higher marginal Shapley contribution than non-localized segments under the same protocol. If the inspection is qualitative or uses a separate attention/gradient method, the link between the reported MM-SHAP scores and retained local utility is not established.
Authors: We agree that the localization claim requires more systematic and quantitative support to be fully load-bearing. The current §4 presents illustrative case studies showing elevated audio Shapley contributions around annotated key events. In the revision we will expand this into a systematic evaluation over the full MuChoMusic test set. Using the identical MM-SHAP protocol, we will partition audio into event-containing segments (based on benchmark metadata) and non-event segments, compute marginal audio contributions for each, and report mean values with statistical significance tests demonstrating higher contributions for localized events. This keeps the metric strictly performance-agnostic and directly links the global MM-SHAP scores to local utility. The abstract and §4 will be updated to describe the new quantitative protocol and results. revision: yes
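The comparison proposed in this response could be sketched as follows. The response does not name a significance test, so a one-sided permutation test is assumed here; `permutation_test` and the sample contribution values are hypothetical.

```python
import random
from statistics import mean


def permutation_test(event_vals, other_vals, n_permutations=2000, seed=0):
    """One-sided permutation test for mean(event_vals) > mean(other_vals)."""
    rng = random.Random(seed)
    observed = mean(event_vals) - mean(other_vals)
    pooled = list(event_vals) + list(other_vals)
    k = len(event_vals)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign segments to the two groups.
        rng.shuffle(pooled)
        diff = mean(pooled[:k]) - mean(pooled[k:])
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_permutations + 1)


# Hypothetical per-segment audio contributions, not taken from the paper.
event_segments = [0.30, 0.28, 0.35, 0.32]
non_event_segments = [0.10, 0.12, 0.08, 0.11]
diff, p_value = permutation_test(event_segments, non_event_segments)
print(f"mean difference={diff:.3f}, p={p_value:.4f}")
```

A permutation test makes no distributional assumption about the Shapley values, which seems a reasonable default when segment counts per track are small.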
- Referee: §3 (adaptation of MM-SHAP): the manuscript must specify exactly how the original MM-SHAP procedure was modified for audio inputs (e.g., masking strategy for audio segments, handling of variable-length audio, choice of baseline, and any post-hoc aggregation). Without these details it is impossible to verify whether the adaptation introduces new biases that undermine the performance-agnostic property asserted in the abstract.
Authors: We acknowledge that the original manuscript describes the adaptation at a high level and omits the precise implementation choices needed for verification. In the revised §3 we will add an explicit subsection that details: (i) masking strategy—audio segments are replaced by silence (zero-valued waveform) while the text prompt remains unchanged; (ii) variable-length handling—longer clips are divided into fixed 1-second windows with zero-padding for shorter inputs to ensure uniform feature dimensionality; (iii) baseline selection—a zero-audio (silence) input paired with the original text serves as the reference for Shapley value estimation; (iv) post-hoc aggregation—segment-level audio Shapley values are summed to obtain the total audio modality contribution, with analogous summation for text. These specifications preserve the performance-agnostic character of MM-SHAP because all scores derive solely from changes in model output probability. revision: yes
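Items (i), (ii), and (iv) of this response can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, waveforms are plain lists of floats, and the example uses a toy sample rate of 4 so the windows are easy to read.

```python
import math


def split_into_windows(wave, sample_rate):
    """Split a waveform into 1-second windows, zero-padding the last one."""
    n_windows = max(1, math.ceil(len(wave) / sample_rate))
    padded = list(wave) + [0.0] * (n_windows * sample_rate - len(wave))
    return [padded[i * sample_rate:(i + 1) * sample_rate] for i in range(n_windows)]


def mask_windows(windows, keep):
    """Rebuild the waveform, replacing every window not in `keep` with silence."""
    out = []
    for i, window in enumerate(windows):
        out.extend(window if i in keep else [0.0] * len(window))
    return out


def total_audio_contribution(segment_values):
    """Sum segment-level Shapley values into one audio score, as item (iv) states."""
    return sum(segment_values)


wave = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]          # 1.5 "seconds" at a toy rate of 4
windows = split_into_windows(wave, sample_rate=4)
print(windows)                                  # [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 0.0, 0.0]]
print(mask_windows(windows, keep={1}))          # [0.0, 0.0, 0.0, 0.0, 5.0, 6.0, 0.0, 0.0]
```

Each subset `keep` corresponds to one coalition of audio segments in the Shapley computation; the text prompt is left untouched, as item (i) describes.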
Circularity Check
No significant circularity; external framework applied to new benchmark
The paper adapts the pre-existing MM-SHAP Shapley-value framework to compute modality contributions on MuChoMusic questions for two Audio LLMs. Reported accuracy differences and the observation that models can localize key events despite low global audio scores are obtained by applying this adapted method plus separate inspection to the benchmark data. No equation or result is defined in terms of itself, no fitted parameter is relabeled as a prediction, and no self-citation chain supplies the central claim. The derivation remains independent of the outputs it produces.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: MM-SHAP provides a performance-agnostic decomposition of modality contributions
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that quantifies the relative contribution of each modality to a model's prediction... A-SHAP = Φ_A / (Φ_T + Φ_A)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  We mask text tokens... and audio waveform segments... compute MM-SHAP
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook
  A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION Audio Large Language Models (Audio LLMs) aim to expand the capabilities of LLMs by incorporating audio information into their reasoning [1]. While the number of proposed models continues to grow [2–6], difficulty lies in assessing free-form text outputs, due to the unstructured nature of the predictions [7]. Additionally, evaluation benchmar...
-
[2]
Adapt MM-SHAP [13] to inspect how Audio LLMs are using each modality in different tasks
-
[3]
Investigate two well-known Audio LLMs (Qwen-Audio [2] and MU-LLaMA [3]) in multiple-choice Q&A using the open-source MuChoMusic benchmark dataset [10]; and
-
[4]
Examine how the Audio LLMs use the two modalities. We show that the usage of text is higher for multiple-choice questions, aligning with results from Vision LLMs. We also demonstrate that good performance on MuChoMusic does not imply balanced modality contributions and vice versa. Our aim is to gain insight into how much these models employ each modality ...
-
[5]
Investigating Modality Contribution in Audio LLMs for Music
RELATED WORK Within explainability techniques in machine learning, there is a family of “post-hoc” methods whose aim is to analyze how each input feature contributes to a given model output via input perturbations. These include approaches like LIME (Local Interpretable Model-agnostic Explanations) [16], that explain a prediction by approximating...
arXiv 2025
-
[6]
METHOD 3.1. Shapley Values and Feature Contribution Shapley values were first proposed in the context of game theory to estimate how much each player contributes to the overall outcome of a cooperative game [15]. They were adapted for explaining machine learning methods by [24] in the SHAP (SHapley Additive exPlanations) framework, where features act as...
-
[7]
You’re a reliable assistant, follow these instructions
EXPERIMENTS Data. Our experiments focus on the MuChoMusic benchmark [10], as we have access to both the audio and the questions’ answers. MuChoMusic is made of 1,187 multiple-choice questions validated by human annotators and associated with 644 music tracks sourced from the SongDescriberDataset (SDD) and MusicCaps. We limit our experiments to MusicCaps tra...
-
[8]
The sound effect that can be heard in the piece is a bell sound effect
RESULTS AND DISCUSSION
Table 1: Accuracy and average A-SHAP for the two experiments.
Model      Accuracy (MC-PI)  Accuracy (MC-NPI)  A-SHAP (MC-PI)  A-SHAP (MC-NPI)
MU-LLaMA   0.30              0.32               0.50±0.02       0.47±0.02
QwenAudio  0.44              0.47               0.23±0.02       0.21±0.02
Table 1 shows that the more accurate model, Qwen-Audio, relies less on audio, whereas MU-LLaMA uses both modalities in a balanced manner. Indicating th...
-
[9]
CONCLUSION In this work, we adapted MM-SHAP to measure modality contributions in Audio LLMs, aiming to measure and understand how these models leverage audio and text for perception tasks. Through a systematic evaluation of Qwen-Audio and MU-LLaMA, we discovered that the higher-performing model, Qwen-Audio, relied significantly more on its text modali...
-
[10]
Pengi: An audio language model for audio tasks,
S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems, NeurIPS 2023, New Orleans, LA, USA, 2023
-
[11]
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Y. Chu et al., “Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models,” CoRR, vol. abs/2311.07919, 2023
-
[12]
S. Liu, A. S. Hussain, C. Sun, and Y. Shan, “Music understanding LLaMA: Advancing text-to-music generation with question answering and captioning,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024
-
[13]
Z. Deng et al., “MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response,” in Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2024
-
[14]
Listen, Think, and Understand,
Y. Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. R. Glass, “Listen, Think, and Understand,” in The Twelfth International Conference on Learning Representations, ICLR, Vienna, Austria, 2024
-
[15]
SALMONN: Towards Generic Hearing Abilities for Large Language Models,
C. Tang et al., “SALMONN: Towards Generic Hearing Abilities for Large Language Models,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 2024
-
[16]
Holistic Evaluation of Language Models
P. Liang et al., “Holistic evaluation of language models,” CoRR, vol. abs/2211.09110, 2022
-
[17]
MusicLM: Generating Music From Text
A. Agostinelli et al., “MusicLM: Generating music from text,” CoRR, vol. abs/2301.11325, 2023
-
[18]
CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models,
S. Ghosh et al., “CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 2024
-
[19]
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models,
B. Weck, I. Manco, E. Benetos, E. Quinton, G. Fazekas, and D. Bogdanov, “MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models,” in Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR), 2024
-
[20]
MMAU: A massive multi-task audio understanding and reasoning benchmark,
S. Sakshi et al., “MMAU: A massive multi-task audio understanding and reasoning benchmark,” in The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, 2025
-
[21]
Perceptual score: What data modalities does your model perceive?,
I. Gat, I. Schwartz, and A. G. Schwing, “Perceptual score: What data modalities does your model perceive?,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, virtual, 2021
-
[22]
L. Parcalabescu and A. Frank, “MM-SHAP: A performance-agnostic metric for measuring multimodal contributions in vision and language models & tasks,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), (Toronto, Canada), July 2023
-
[23]
On measuring faithfulness or self-consistency of natural language explanations,
L. Parcalabescu and A. Frank, “On measuring faithfulness or self-consistency of natural language explanations,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), (Bangkok, Thailand), Aug. 2024
-
[24]
L. S. Shapley, 17. A Value for n-Person Games, pp. 307–318. Princeton University Press, Dec. 1953
-
[25]
M. T. Ribeiro, S. Singh, and C. Guestrin, ““Why Should I Trust You?”: Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 2016
-
[26]
audiolime: Listenable explanations using source separation,
V. Haunschmid, E. Manilow, and G. Widmer, “audiolime: Listenable explanations using source separation,” CoRR, vol. abs/2008.00582, 2020
-
[27]
Local interpretable model-agnostic explanations for music content analysis,
S. Mishra, B. L. Sturm, and S. Dixon, “Local interpretable model-agnostic explanations for music content analysis,” in Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, October 23-27, 2017 (S. J. Cunningham, Z. Duan, X. Hu, and D. Turnbull, eds.), pp. 537–543, 2017
-
[28]
Musiclime: Explainable multimodal music understanding,
T. Sotirou, V. Lyberatos, O. M. Mastromichalakis, and G. Stamou, “Musiclime: Explainable multimodal music understanding,” 2024
-
[29]
DIME: fine-grained interpretations of multimodal models via disentangled local explanations,
Y. Lyu, P. P. Liang, Z. Deng, R. Salakhutdinov, and L. Morency, “DIME: fine-grained interpretations of multimodal models via disentangled local explanations,” in AIES ’22: AAAI/ACM Conference on AI, Ethics, and Society, Oxford, United Kingdom, 2022
-
[30]
Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!,
J. Hessel and L. Lee, “Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, 2020
-
[31]
Molnar,Interpretable Machine Learning
C. Molnar, Interpretable Machine Learning. 3 ed., 2025
-
[32]
Benchmarking time-localized explanations for audio classification models,
C. Bolaños, L. Pepino, M. Meza, and L. Ferrer, “Benchmarking time-localized explanations for audio classification models,” CoRR, vol. abs/2506.04391, 2025
-
[33]
A unified approach to interpreting model predictions,
S. M. Lundberg and S. Lee, “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 2017
-
[34]
Explaining prediction models and individual predictions with feature contributions,
E. Strumbelj and I. Kononenko, “Explaining prediction models and individual predictions with feature contributions,” Knowl. Inf. Syst., vol. 41, no. 3, 2014
-
[35]
L. Parcalabescu and A. Frank, “Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?,” in The Thirteenth International Conference on Learning Representations, 2025
-
[36]
Evaluation of algorithms using games: The case of music tag- ging,
E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie, “Evaluation of algorithms using games: The case of music tagging,” in Proceedings of the 10th International Society for Music Information Retrieval Conference, ISMIR 2009, Kobe International Conference Center, Kobe, Japan, 2009
-
[37]
Are you really listening? Boosting Perceptual Awareness in Music-QA Benchmarks,
Y. Zang, S. O’Brien, T. Berg-Kirkpatrick, J. J. McAuley, and Z. Novack, “Are you really listening? Boosting Perceptual Awareness in Music-QA Benchmarks,” CoRR, vol. abs/2504.00369, 2025
-
[38]
Reliable local explanations for machine listening,
S. Mishra, E. Benetos, B. L. Sturm, and S. Dixon, “Reliable local explanations for machine listening,” in 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, United Kingdom, July 19-24, 2020, pp. 1–8, IEEE, 2020