
arxiv: 2509.20641 · v2 · pith:WZEEDQYEnew · submitted 2025-09-25 · 💻 cs.LG · cs.SD

Investigating Modality Contribution in Audio LLMs for Music

Pith reviewed 2026-05-21 22:41 UTC · model grok-4.3

classification 💻 cs.LG cs.SD
keywords: Audio LLMs · modality contribution · MM-SHAP · music understanding · multimodal models · explainable AI · Shapley values

The pith

The more accurate of two Audio LLMs for music draws more from text than sound, yet the models still identify key audio events.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts the MM-SHAP framework, which uses Shapley values to score how much each input modality shapes a model's answer, and applies it to Audio LLMs answering music questions on the MuChoMusic benchmark. Results show that the model with higher overall accuracy depends more on the text part of the prompt. At the same time, the scores reveal that the models still succeed at locating specific sound events in the audio even when the total audio contribution remains low. A sympathetic reader would care because this distinction separates cases where audio is truly ignored from cases where it supplies targeted information. The work marks the first use of this modality analysis for Audio LLMs and supplies a concrete method for checking whether such models actually listen.

Core claim

By adapting MM-SHAP to decompose modality contributions in a performance-agnostic way, the evaluation of two Audio LLMs on MuChoMusic shows that the more accurate of the two models relies more on text; however, low overall audio contribution scores coexist with successful localization of key sound events, indicating that audio input is not entirely ignored.

What carries the argument

Adapted MM-SHAP, a Shapley-value-based score that decomposes the relative contribution of audio and text modalities to each model prediction without depending on final accuracy.
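As a rough illustration of how such a score can be computed, the sketch below evaluates exact Shapley values for a tiny toy input and then takes each modality's share of the summed absolute values, which is the ratio MM-SHAP defines. The value function `toy_value`, the player names, and the helper names are hypothetical and chosen only for illustration; real inputs have far too many tokens for exact enumeration, so this is not the paper's implementation.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition (tiny inputs only)."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def modality_shares(phi, audio_players):
    """Audio and text shares of the summed absolute Shapley values."""
    audio = sum(abs(v) for p, v in phi.items() if p in audio_players)
    text = sum(abs(v) for p, v in phi.items() if p not in audio_players)
    total = audio + text
    return audio / total, text / total


def toy_value(coalition):
    """Hypothetical answer probability given which inputs are left unmasked."""
    score = 0.1
    if "t1" in coalition:
        score += 0.3
    if "t2" in coalition:
        score += 0.3
    if "a1" in coalition:
        score += 0.2
    return score


players = ["a1", "a2", "t1", "t2"]  # two audio windows, two text tokens
phi = shapley_values(players, toy_value)
a_shap, t_shap = modality_shares(phi, {"a1", "a2"})
```

In this toy setup the audio share comes out at about 0.25: the score reflects how much the output moves when inputs are masked, not whether the answer is correct.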

If this is right

  • The higher-accuracy model on music questions relies more on text than on audio.
  • Low aggregate audio contribution scores do not prevent the model from localizing specific sound events.
  • Audio input supplies usable information in these tasks even when it is not the dominant modality.
  • The MM-SHAP adaptation offers a reusable tool for measuring modality balance in other Audio LLM applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If localization succeeds at low contribution levels, training objectives that reward explicit audio event detection could raise overall audio reliance without hurting accuracy.
  • Benchmarks for Audio LLMs could add separate localization sub-tasks to distinguish true audio use from text-based guessing.
  • The same contribution analysis could be run on non-music audio tasks to test whether low audio scores are a general pattern or specific to music questions.

Load-bearing premise

The adaptation of MM-SHAP to the MuChoMusic questions produces a faithful decomposition of modality contributions that introduces no new biases from the modification process itself.

What would settle it

A controlled run in which the audio is masked or replaced with silence and the model's answers about the sound events it previously identified remain unchanged would falsify the claim that audio is still used despite low overall contribution scores. Conversely, a clear drop in those answers under silence would support the claim.
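A check of this kind could be sketched as follows. The function reports how often the answer changes when the audio is silenced; answers that never change would suggest the audio is not being used for those questions. `answer_fn` and `toy_model` are hypothetical stand-ins for a real Audio LLM call, and the two examples are invented for illustration.

```python
def answer_change_rate(answer_fn, examples):
    """Fraction of examples whose answer changes when audio becomes silence."""
    changed = 0
    for audio, question in examples:
        silent = [0.0] * len(audio)  # zero-valued waveform of the same length
        if answer_fn(audio, question) != answer_fn(silent, question):
            changed += 1
    return changed / len(examples)


def toy_model(audio, question):
    """Hypothetical model: reports a doorbell only if the clip has a loud sample."""
    return "(B) Doorbell" if max(abs(x) for x in audio) > 0.5 else "(A) None"


examples = [
    ([0.0, 0.9, 0.1], "Which sound effect is heard?"),
    ([0.1, 0.0, 0.2], "Which sound effect is heard?"),
]
rate = answer_change_rate(toy_model, examples)
```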

Figures

Figures reproduced from arXiv: 2509.20641 by Giovana Morais, Magdalena Fuentes.

Figure 1: First, we obtain a set of tokens T corresponding to the model’s answer without masking any modality (i.e., “(B) Doorbell” in ...

Original abstract

Audio Large Language Models (Audio LLMs) enable human-like conversation about music, yet it is unclear if they are truly listening to the audio or just using textual reasoning, as recent benchmarks suggest. This paper investigates this issue by quantifying the contribution of each modality to a model's output. We adapt the MM-SHAP framework, a performance-agnostic score based on Shapley values that quantifies the relative contribution of each modality to a model's prediction. We evaluate two models on the MuChoMusic benchmark and find that the model with higher accuracy relies more on text to answer questions, but further inspection shows that even if the overall audio contribution is low, models can successfully localize key sound events, suggesting that audio is not entirely ignored. Our study is the first application of MM-SHAP to Audio LLMs and we hope it will serve as a foundational step for future research in explainable AI and audio.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper adapts the MM-SHAP framework (a Shapley-value-based, performance-agnostic method) to quantify audio versus text modality contributions in two Audio LLMs on the MuChoMusic benchmark. It reports that the higher-accuracy model relies more on text, yet claims that even with low overall audio contribution the models can localize key sound events and therefore do not entirely ignore audio. The work positions itself as the first application of MM-SHAP to Audio LLMs.

Significance. If the adaptation of MM-SHAP is shown to be faithful and the localization analysis is made quantitative and systematic, the results would offer a concrete, falsifiable decomposition of modality use in music-oriented Audio LLMs and could serve as a useful baseline for future explainability studies. The explicit first-application framing is a modest but clear contribution.

major comments (2)
  1. Abstract and §4 (further inspection): the central claim that 'models can successfully localize key sound events' despite low global audio contribution rests on an unspecified additional step. For the claim to be load-bearing, this inspection must (a) be performed systematically across the MuChoMusic set, (b) employ a metric still performance-agnostic, and (c) demonstrate that localized events receive measurably higher marginal Shapley contribution than non-localized segments under the same protocol. If the inspection is qualitative or uses a separate attention/gradient method, the link between the reported MM-SHAP scores and retained local utility is not established.
  2. §3 (adaptation of MM-SHAP): the manuscript must specify exactly how the original MM-SHAP procedure was modified for audio inputs (e.g., masking strategy for audio segments, handling of variable-length audio, choice of baseline, and any post-hoc aggregation). Without these details it is impossible to verify whether the adaptation introduces new biases that undermine the performance-agnostic property asserted in the abstract.
minor comments (1)
  1. The abstract states concrete findings on two models and one benchmark but provides neither error bars nor statistical significance tests on the modality scores; these should be added to the results section.
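One common way to attach error bars to a mean modality score is a percentile bootstrap over per-question scores. The sketch below is a generic illustration under that assumption: the helper `bootstrap_ci` and the sample scores are hypothetical, and nothing here is taken from the paper's own procedure or data.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), lo, hi


# Made-up per-question audio contribution scores, for illustration only.
scores = [0.21, 0.25, 0.18, 0.30, 0.22, 0.27, 0.19, 0.24, 0.26, 0.20]
point, lo, hi = bootstrap_ci(scores)
```

Reporting `point` together with `(lo, hi)` for each model and prompt setting would make differences between modality scores easier to judge.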

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important areas for improving the rigor and reproducibility of our analysis. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core findings.

  1. Referee: Abstract and §4 (further inspection): the central claim that 'models can successfully localize key sound events' despite low global audio contribution rests on an unspecified additional step. For the claim to be load-bearing, this inspection must (a) be performed systematically across the MuChoMusic set, (b) employ a metric still performance-agnostic, and (c) demonstrate that localized events receive measurably higher marginal Shapley contribution than non-localized segments under the same protocol. If the inspection is qualitative or uses a separate attention/gradient method, the link between the reported MM-SHAP scores and retained local utility is not established.

    Authors: We agree that the localization claim requires more systematic and quantitative support to be fully load-bearing. The current §4 presents illustrative case studies showing elevated audio Shapley contributions around annotated key events. In the revision we will expand this into a systematic evaluation over the full MuChoMusic test set. Using the identical MM-SHAP protocol, we will partition audio into event-containing segments (based on benchmark metadata) and non-event segments, compute marginal audio contributions for each, and report mean values with statistical significance tests demonstrating higher contributions for localized events. This keeps the metric strictly performance-agnostic and directly links the global MM-SHAP scores to local utility. The abstract and §4 will be updated to describe the new quantitative protocol and results. revision: yes

  2. Referee: §3 (adaptation of MM-SHAP): the manuscript must specify exactly how the original MM-SHAP procedure was modified for audio inputs (e.g., masking strategy for audio segments, handling of variable-length audio, choice of baseline, and any post-hoc aggregation). Without these details it is impossible to verify whether the adaptation introduces new biases that undermine the performance-agnostic property asserted in the abstract.

    Authors: We acknowledge that the original manuscript describes the adaptation at a high level and omits the precise implementation choices needed for verification. In the revised §3 we will add an explicit subsection that details: (i) masking strategy—audio segments are replaced by silence (zero-valued waveform) while the text prompt remains unchanged; (ii) variable-length handling—longer clips are divided into fixed 1-second windows with zero-padding for shorter inputs to ensure uniform feature dimensionality; (iii) baseline selection—a zero-audio (silence) input paired with the original text serves as the reference for Shapley value estimation; (iv) post-hoc aggregation—segment-level audio Shapley values are summed to obtain the total audio modality contribution, with analogous summation for text. These specifications preserve the performance-agnostic character of MM-SHAP because all scores derive solely from changes in model output probability. revision: yes
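The windowing and silence-masking choices described in the second response can be sketched as below. This is a minimal illustration of fixed-length windows with zero-padding and zero-valued masking, not the authors' code; the function names, the tiny sample rate, and the toy waveform are all hypothetical.

```python
def pad_to_windows(waveform, sample_rate, window_seconds=1.0):
    """Zero-pad to a whole number of windows and return (padded, windows)."""
    size = int(sample_rate * window_seconds)
    remainder = len(waveform) % size
    padded = list(waveform) + [0.0] * ((size - remainder) % size)
    windows = [(start, start + size) for start in range(0, len(padded), size)]
    return padded, windows


def mask_windows(waveform, windows, keep):
    """Copy of the waveform with every window not listed in `keep` silenced."""
    masked = list(waveform)
    for index, (start, end) in enumerate(windows):
        if index not in keep:
            masked[start:end] = [0.0] * (end - start)
    return masked


# Toy clip: 10 samples at a hypothetical 4 Hz sample rate, so 1-second windows
# hold 4 samples each and the last window is zero-padded.
clip = [1.0] * 10
padded, windows = pad_to_windows(clip, sample_rate=4)
only_second = mask_windows(padded, windows, keep={1})
```

Each masked variant such as `only_second` would be passed to the model together with the unchanged text prompt, and the change in output probability would feed the Shapley estimate for the corresponding window.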

Circularity Check

0 steps flagged

No significant circularity; external framework applied to new benchmark


The paper adapts the pre-existing MM-SHAP Shapley-value framework to compute modality contributions on MuChoMusic questions for two Audio LLMs. Reported accuracy differences and the observation that models can localize key events despite low global audio scores are obtained by applying this adapted method plus separate inspection to the benchmark data. No equation or result is defined in terms of itself, no fitted parameter is relabeled as a prediction, and no self-citation chain supplies the central claim. The derivation remains independent of the outputs it produces.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central measurements rest on the assumption that MM-SHAP can be directly transferred to audio-text inputs without domain-specific adjustments that would require new validation.

axioms (1)
  • domain assumption: MM-SHAP provides a performance-agnostic decomposition of modality contributions
    Invoked when the authors state they adapt the framework to quantify relative contribution without depending on accuracy.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook

    cs.SD · 2026-05 · unverdicted · novelty 5.0

    A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    While the number of proposed models continues to grow [2–6] difficulty lies in assessing free-form text outputs, due to the unstructured nature of the predictions [7]

    INTRODUCTION Audio Large Language Models (Audio LLMs) aim to expand the capabilities of LLMs by incorporating audio information into their reasoning [1]. While the number of proposed models continues to grow [2–6] difficulty lies in assessing free-form text outputs, due to the unstructured nature of the predictions [7]. Additionally, evaluation benchmar...

  2. [2]

    Adapt MM-SHAP [13] to inspect how Audio LLMs are using each modality in different tasks

  3. [3]

    Investigate two well-known Audio LLMs (Qwen-Audio [2] and MU-LLaMA [3]) in multiple-choice Q&A using the open-source MuChoMusic benchmark dataset [10]; and

  4. [4]

    We show that the usage of text is higher for multiple-choice questions, aligning with results from Vision LLMs

    Examine how the Audio LLMs use the two modalities. We show that the usage of text is higher for multiple-choice questions, aligning with results from Vision LLMs. We also demonstrate that good performance on MuChoMusic does not imply balanced modality contributions and vice versa. Our aim is to gain insight into how much these models employ each modality ...

  5. [5]

    Investigating Modality Contribution in Audio LLMs for Music

    RELATED WORK Within explainability techniques in machine learning, there is a family of “post-hoc” methods whose aim is to analyze how each input feature contributes to a given model output via input perturbations. These include approaches like LIME (Local Interpretable Model-agnostic Explanations) [16], that explain a prediction by approximating...

  6. [6]

    audio indicators

    METHOD 3.1. Shapley Values and Feature Contribution. Shapley values were first proposed in the context of game theory to estimate how much each player contributes to the overall outcome of a cooperative game [15]. They were adapted for explaining machine learning methods by [24] in the SHAP (SHapley Additive exPlanations) framework, where features act as...

  7. [7]

    You’re a reliable assistant, follow these instructions

    EXPERIMENTS Data: Our experiments focus on the MuChoMusic benchmark [10], as we have access to both the audio and the questions’ answers. MuChoMusic is made of 1,187 multiple-choice questions validated by human annotators and associated with 644 music tracks sourced from the SongDescriberDataset (SDD) and MusicCaps. We limit our experiments to MusicCaps tra...

  8. [8]

    The sound effect that can be heard in the piece is a bell sound effect

    RESULTS AND DISCUSSION

    Table 1: Accuracy and average A-SHAP for the two experiments.

                 Accuracy          A-SHAP
                 MC-PI   MC-NPI    MC-PI       MC-NPI
    MU-LLaMA     0.30    0.32      0.50±0.02   0.47±0.02
    QwenAudio    0.44    0.47      0.23±0.02   0.21±0.02

    Table 1 shows that the more accurate model, Qwen-Audio, relies less on audio, whereas MU-LLaMA uses both modalities in a balanced manner. Indicating th...

  9. [9]

    Through a systematic evaluation of Qwen-Audio and MU-LLaMA, we discovered that the higher-performing model, Qwen-Audio, relied significantly more on its text modality

    CONCLUSION In this work, we adapted MM-SHAP to measure modality contributions in Audio LLMs, aiming to measure and understand how these models leverage audio and text for perception tasks. Through a systematic evaluation of Qwen-Audio and MU-LLaMA, we discovered that the higher-performing model, Qwen-Audio, relied significantly more on its text modali...

  10. [10]

    Pengi: An audio language model for audio tasks,

    S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems, NeurIPS 2023, New Orleans, LA, USA, 2023

  11. [11]

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Y. Chu et al., “Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models,” CoRR, vol. abs/2311.07919, 2023

  12. [12]

    Music understanding LLaMA: Advancing text-to-music generation with question answering and captioning,

    S. Liu, A. S. Hussain, C. Sun, and Y. Shan, “Music understanding LLaMA: Advancing text-to-music generation with question answering and captioning,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

  13. [13]

    MusiLingo: Bridging Music and Text with Pretrained Language Models for Music Captioning and Query Response,

    Z. Deng et al., “MusiLingo: Bridging Music and Text with Pretrained Language Models for Music Captioning and Query Response,” in Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2024

  14. [14]

    Listen, Think, and Understand,

    Y. Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. R. Glass, “Listen, Think, and Understand,” in The Twelfth International Conference on Learning Representations, ICLR, Vienna, Austria, 2024

  15. [15]

    SALMONN: Towards Generic Hearing Abilities for Large Language Models,

    C. Tang et al., “SALMONN: Towards Generic Hearing Abilities for Large Language Models,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 2024

  16. [16]

    Holistic Evaluation of Language Models

    P. Liang et al., “Holistic evaluation of language models,” CoRR, vol. abs/2211.09110, 2022

  17. [17]

    MusicLM: Generating Music From Text

    A. Agostinelli et al., “MusicLM: Generating music from text,” CoRR, vol. abs/2301.11325, 2023

  18. [18]

    CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models,

    S. Ghosh et al., “CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, 2024

  19. [19]

    MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models,

    B. Weck, I. Manco, E. Benetos, E. Quinton, G. Fazekas, and D. Bogdanov, “MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models,” in Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR), 2024

  20. [20]

    MMAU: A massive multi-task audio understanding and reasoning benchmark,

    S. Sakshi et al., “MMAU: A massive multi-task audio understanding and reasoning benchmark,” in The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, 2025

  21. [21]

    Perceptual score: What data modalities does your model perceive?,

    I. Gat, I. Schwartz, and A. G. Schwing, “Perceptual score: What data modalities does your model perceive?,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, virtual, 2021

  22. [22]

    MM-SHAP: A performance-agnostic metric for measuring multimodal contributions in vision and language models & tasks,

    L. Parcalabescu and A. Frank, “MM-SHAP: A performance-agnostic metric for measuring multimodal contributions in vision and language models & tasks,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), (Toronto, Canada), July 2023

  23. [23]

    On measuring faithfulness or self-consistency of natural language explanations,

    L. Parcalabescu and A. Frank, “On measuring faithfulness or self-consistency of natural language explanations,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), (Bangkok, Thailand), Aug. 2024

  24. [24]

    L. S. Shapley, 17. A Value for n-Person Games, pp. 307–318. Princeton University Press, Dec. 1953

  25. [25]

    “Why Should I Trust You?”: Explaining the predictions of any classifier,

    M. T. Ribeiro, S. Singh, and C. Guestrin, ““Why Should I Trust You?”: Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 2016

  26. [26]

    audiolime: Listenable explanations using source separation,

    V. Haunschmid, E. Manilow, and G. Widmer, “audiolime: Listenable explanations using source separation,” CoRR, vol. abs/2008.00582, 2020

  27. [27]

    Local interpretable model-agnostic explanations for music content analysis,

    S. Mishra, B. L. Sturm, and S. Dixon, “Local interpretable model-agnostic explanations for music content analysis,” in Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, October 23-27, 2017 (S. J. Cunningham, Z. Duan, X. Hu, and D. Turnbull, eds.), pp. 537–543, 2017

  28. [28]

    Musiclime: Explainable multimodal music understanding,

    T. Sotirou, V. Lyberatos, O. M. Mastromichalakis, and G. Stamou, “Musiclime: Explainable multimodal music understanding,” 2024

  29. [29]

    DIME: fine-grained interpretations of multimodal models via disentangled local explanations,

    Y. Lyu, P. P. Liang, Z. Deng, R. Salakhutdinov, and L. Morency, “DIME: fine-grained interpretations of multimodal models via disentangled local explanations,” in AIES ’22: AAAI/ACM Conference on AI, Ethics, and Society, Oxford, United Kingdom, 2022

  30. [30]

    Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!,

    J. Hessel and L. Lee, “Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, 2020

  31. [31]

    Interpretable Machine Learning

    C. Molnar, Interpretable Machine Learning. 3 ed., 2025

  32. [32]

    Benchmarking time-localized explanations for audio classification models,

    C. Bolaños, L. Pepino, M. Meza, and L. Ferrer, “Benchmarking time-localized explanations for audio classification models,” CoRR, vol. abs/2506.04391, 2025

  33. [33]

    A unified approach to interpreting model predictions,

    S. M. Lundberg and S. Lee, “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 2017

  34. [34]

    Explaining prediction models and individual predictions with feature contributions,

    E. Strumbelj and I. Kononenko, “Explaining prediction models and individual predictions with feature contributions,” Knowl. Inf. Syst., vol. 41, no. 3, 2014

  35. [35]

    Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?,

    L. Parcalabescu and A. Frank, “Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?,” in The Thirteenth International Conference on Learning Representations, 2025

  36. [36]

    Evaluation of algorithms using games: The case of music tagging,

    E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie, “Evaluation of algorithms using games: The case of music tagging,” in Proceedings of the 10th International Society for Music Information Retrieval Conference, ISMIR 2009, Kobe International Conference Center, Kobe, Japan, 2009

  37. [37]

    Are you really listening? Boosting Perceptual Awareness in Music-QA Benchmarks,

    Y. Zang, S. O’Brien, T. Berg-Kirkpatrick, J. J. McAuley, and Z. Novack, “Are you really listening? Boosting Perceptual Awareness in Music-QA Benchmarks,” CoRR, vol. abs/2504.00369, 2025

  38. [38]

    Reliable local explanations for machine listening,

    S. Mishra, E. Benetos, B. L. Sturm, and S. Dixon, “Reliable local explanations for machine listening,” in 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, United Kingdom, July 19-24, 2020, pp. 1–8, IEEE, 2020