arxiv: 2604.15849 · v1 · submitted 2026-04-17 · 💻 cs.SD

TinyMU: A Compact Audio-Language Model for Music Understanding

Xiquan Li , Aurian Quelennec , Slim Essid

Pith reviewed 2026-05-10 08:12 UTC · model grok-4.3

classification 💻 cs.SD

keywords music understanding · audio-language model · compact model · MuChoMusic · MusicSkills-3.5M · question answering · self-supervised audio encoder · music reasoning

The pith

A 229-million-parameter music model reaches 82 percent of the state-of-the-art large audio-language model's performance on the MuChoMusic benchmark while being 35 times smaller.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

TinyMU is a compact Music-Language Model built to answer music-related questions with far fewer resources than current large audio-language models. The authors create a 3.5 million sample dataset of music-grounded questions in multiple formats to give detailed supervision on musical concepts, then connect a strong self-supervised audio encoder to a language model through a simple linear projector. This produces strong results on basic music understanding as well as complex reasoning tasks. On the MuChoMusic benchmark the model captures 82 percent of the leading large model's score despite using only 229 million parameters, showing that scale is not the only route to capable music reasoning systems.
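
As a quick sanity check on the headline numbers, the arithmetic can be written out. The sketch below uses only the 229M and 35x figures from the abstract; it assumes that "82 percent of performance" means a ratio of benchmark accuracies, and the two accuracy values are hypothetical placeholders, since the abstract reports no raw scores.

```python
# Back-of-the-envelope check of the size and relative-performance claims.
TINYMU_PARAMS = 229e6   # from the abstract
SIZE_RATIO = 35         # "35x smaller", from the abstract

# Implied size of the compared large model.
implied_lalm_params = TINYMU_PARAMS * SIZE_RATIO
print(f"Implied LALM size: {implied_lalm_params / 1e9:.1f}B parameters")


def relative_performance(small_score: float, large_score: float) -> float:
    """Fraction of the large model's score reached by the small model."""
    return small_score / large_score


# HYPOTHETICAL accuracies, chosen only to illustrate the ratio.
hypothetical_large = 0.50
hypothetical_small = 0.41
ratio = relative_performance(hypothetical_small, hypothetical_large)
print(f"Relative performance: {ratio:.0%}")
```

The implied comparison model is therefore on the order of 8 billion parameters, which is consistent with the abstract's description of LALMs as having "billions of parameters".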

Core claim

TinyMU demonstrates that a lightweight 229M-parameter Music-Language Model can achieve performance comparable to much larger Large Audio-Language Models in both basic music understanding and complex reasoning by training on the MusicSkills-3.5M dataset and aligning the MATPAC++ audio encoder embeddings to language via a lightweight linear projector.

What carries the argument

MATPAC++ self-supervised audio encoder paired with a lightweight linear projector to align audio embeddings with the language model.
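
To make the encoder-plus-projector idea concrete, here is a minimal, dependency-free sketch of a linear projector that maps audio frame embeddings into a language model's embedding space. It is an illustration under stated assumptions, not the authors' implementation: the dimensions, the random initialization, and the choice to place audio tokens ahead of the text tokens are all hypothetical.

```python
import random


def make_linear_projector(audio_dim: int, lm_dim: int, seed: int = 0):
    """Return (weights, bias) for a single affine map audio_dim -> lm_dim."""
    rng = random.Random(seed)
    scale = 1.0 / audio_dim ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(audio_dim)]
               for _ in range(lm_dim)]
    bias = [0.0] * lm_dim
    return weights, bias


def project(frames, weights, bias):
    """Map each audio frame embedding into the language model's embedding space."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b
         for row, b in zip(weights, bias)]
        for frame in frames
    ]


# HYPOTHETICAL sizes; the real encoder and language-model widths are not given here.
AUDIO_DIM, LM_DIM, NUM_FRAMES = 8, 16, 5
rng = random.Random(1)
audio_frames = [[rng.gauss(0, 1) for _ in range(AUDIO_DIM)]
                for _ in range(NUM_FRAMES)]

W, b = make_linear_projector(AUDIO_DIM, LM_DIM)
audio_tokens = project(audio_frames, W, b)

# ASSUMPTION: projected frames are placed ahead of the text-token embeddings
# (placeholder zeros here) to form the language model's input sequence.
text_embeddings = [[0.0] * LM_DIM for _ in range(3)]
lm_input = audio_tokens + text_embeddings
print(len(lm_input), len(lm_input[0]))  # sequence length, embedding width
```

The point of the design is that the only new trainable component between the two pretrained parts is a single weight matrix and bias, which keeps the added parameter count small.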

If this is right

Small-scale MLMs can deliver near state-of-the-art music reasoning under constrained compute budgets.
The MusicSkills-3.5M dataset format enables fine-grained supervision that transfers to multiple question types.
Edge-device deployment of music understanding and reasoning becomes practical.
Training costs and inference latency for music agents drop substantially without proportional loss in capability.
Performance gains come more from data curation across formats than from parameter count alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same data-plus-light-projector recipe could be tested on speech or environmental sound understanding with domain-specific question sets.
Further reduction below 229M parameters could be explored by swapping the encoder or adding parameter-efficient adapters.
On-device music query handling in virtual assistants becomes feasible without constant cloud access.
The approach suggests a path for parameter-efficient models in other audio-language tasks where labeled data can be synthesized.

Load-bearing premise

The MusicSkills-3.5M dataset supplies fine-grained, unbiased supervision across diverse musical concepts and the MuChoMusic benchmark evaluates small and large models on equal footing without hidden data or prompting differences.

What would settle it

If a large audio-language model retrained from scratch on the identical MusicSkills-3.5M data shows no meaningful gain over TinyMU, or if TinyMU's accuracy collapses on a new music question set with different distributions, the claim that the small architecture plus curated data suffices would be undermined.

Original abstract

Music understanding and reasoning are central challenges in the Music Information Research field, with applications ranging from retrieval and recommendation to music agents and virtual assistants. Recent Large Audio-Language Models (LALMs) have shown remarkable progress in answering music-related questions by following user instructions. However, their massive scale, often billions of parameters, results in expensive training, slow inference, and limited deployability on edge devices. In this work, we present TinyMU, a lightweight (229M) Music-Language Model (MLM) that achieves performance comparable to much larger LALMs while remaining efficient and compact. To train TinyMU, we introduce MusicSkills-3.5M, a carefully curated, music-grounded question-answering dataset with 3.5M samples. Spanning multiple-choice, binary, and open-ended formats, this dataset provides fine-grained supervision across diverse musical concepts. For its architecture, TinyMU leverages MATPAC++, the SOTA self-supervised audio encoder for fine-grained feature extraction. Paired with a lightweight linear projector, it efficiently aligns audio embeddings with the language model. Through extensive evaluation, we show that TinyMU performs strongly in both basic music understanding and complex reasoning. Notably, on the MuChoMusic benchmark, it achieves 82% of SOTA LALM's performance despite being 35x smaller, highlighting the potential of small MLMs under constrained computational budgets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TinyMU gets 82% of large LALM performance at 35x smaller size using a new 3.5M music QA dataset, but the abstract leaves the comparison details unverified.

The main point is that this paper builds a 229M-parameter music-language model called TinyMU that reportedly hits 82% of the performance of much larger state-of-the-art LALMs on the MuChoMusic benchmark after training on their new MusicSkills-3.5M dataset. They use MATPAC++ as the audio encoder with a linear projector to keep the whole thing compact. This is a direct move into the small-model regime for music understanding, which matters because big models are too heavy for edge devices in recommendation, education, or agent applications.

The dataset covers multiple-choice, binary, and open-ended questions across musical concepts, which is a reasonable way to supply the supervision needed for reasoning tasks. The work does a solid job framing the deployment problem and showing that scale is not the only path to usable music QA results. The architecture choice is straightforward and reuses a strong self-supervised encoder, which avoids reinventing the audio front end.

That said, the central claim rests on thin visible support. The abstract states the 82% figure and calls the evaluation extensive, but it gives no training procedure, no list of exact baselines, no statistical tests, and no check for overlap between the 3.5M training items and the MuChoMusic test set. The stress-test concern holds: if prompting, decoding, or few-shot settings differed, or if any test leakage occurred, the size advantage cannot be isolated cleanly. These are standard controls in this area, and their absence in the summary makes the result hard to assess at face value.

The paper is aimed at MIR researchers and anyone working on efficient multimodal models. A reader who needs a new music QA dataset or ideas for lightweight audio-language setups would find the dataset and architecture worth examining, once the methods are fully documented.

I would send it to peer review. The topic is practical and the dataset contribution could stand on its own if the evaluation gaps are closed.

Referee Report

2 major / 0 minor

Summary. The paper introduces TinyMU, a compact 229M-parameter music-language model that leverages the MATPAC++ self-supervised audio encoder paired with a lightweight linear projector. It is trained on the newly introduced MusicSkills-3.5M dataset of 3.5 million music-grounded QA samples spanning multiple-choice, binary, and open-ended formats, and claims strong results on basic music understanding and complex reasoning tasks. The central empirical claim is that TinyMU reaches 82% of the performance of state-of-the-art large audio-language models on the MuChoMusic benchmark while being 35 times smaller.

Significance. If the performance comparison holds under matched evaluation conditions, the work would demonstrate that substantially smaller models can capture a large fraction of the capabilities of much larger LALMs in the specialized domain of music understanding. This would have practical value for edge deployment in music retrieval, recommendation, and agent applications, and would contribute evidence toward efficient multimodal modeling under constrained compute budgets.

major comments (2)

[Abstract] Abstract: The headline claim that TinyMU achieves 82% of SOTA LALM performance on MuChoMusic supplies no details on training procedure, exact baselines, statistical significance, data curation criteria for MusicSkills-3.5M, or potential selection effects, leaving the central size-efficiency claim without visible supporting evidence.
[Evaluation] Evaluation: The 82% performance claim requires explicit confirmation that TinyMU and the referenced SOTA LALM were evaluated on MuChoMusic under identical prompting templates, decoding strategies, and few-shot examples, together with a contamination check showing no overlap between MusicSkills-3.5M training items and MuChoMusic test items; without these, the size reduction advantage cannot be isolated from evaluation differences.
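
One way the requested statistical support could look is a paired bootstrap over benchmark items, which puts a confidence interval on the relative-performance ratio when both models are scored on the same questions. The sketch below is an illustration only and not the authors' procedure; the per-item correctness lists are synthetic, and the function names are hypothetical.

```python
import random


def accuracy(correct):
    """Fraction of items answered correctly (entries are 0 or 1)."""
    return sum(correct) / len(correct)


def bootstrap_ratio_ci(small_correct, large_correct,
                       n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(small) / accuracy(large).

    Items are resampled in pairs, so both models are always compared
    on the same resampled set of questions.
    """
    assert len(small_correct) == len(large_correct)
    rng = random.Random(seed)
    n = len(small_correct)
    ratios = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        large_acc = sum(large_correct[i] for i in idx) / n
        if large_acc == 0:
            continue  # ratio is undefined for this resample
        small_acc = sum(small_correct[i] for i in idx) / n
        ratios.append(small_acc / large_acc)
    if not ratios:
        raise ValueError("large model scored zero on every resample")
    ratios.sort()
    low = ratios[int((alpha / 2) * len(ratios))]
    high = ratios[min(len(ratios) - 1, int((1 - alpha / 2) * len(ratios)))]
    return low, high


# SYNTHETIC per-item results standing in for two models scored on the same items.
rng = random.Random(42)
n_items = 500
large = [1 if rng.random() < 0.50 else 0 for _ in range(n_items)]
small = [1 if rng.random() < 0.41 else 0 for _ in range(n_items)]

point = accuracy(small) / accuracy(large)
low, high = bootstrap_ratio_ci(small, large)
print(f"relative performance: {point:.2f} (95% CI {low:.2f} to {high:.2f})")
```

A wide interval here would indicate that a single headline ratio such as 82% should be read with some caution, independent of any prompting or decoding differences.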

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications on the evaluation protocols and proposing revisions to strengthen the presentation of our central claims.

Referee: [Abstract] Abstract: The headline claim that TinyMU achieves 82% of SOTA LALM performance on MuChoMusic supplies no details on training procedure, exact baselines, statistical significance, data curation criteria for MusicSkills-3.5M, or potential selection effects, leaving the central size-efficiency claim without visible supporting evidence.

Authors: The abstract is designed as a concise summary, with full details on the training procedure (Section 3), MusicSkills-3.5M curation criteria and scale (Section 4), exact baselines with statistical significance testing (Section 5), and discussion of potential selection effects provided in the body of the paper. We agree that the abstract could better signpost these elements and will revise it to briefly reference the dataset construction process, the zero-shot evaluation setup, and the 35x size reduction relative to the compared LALMs.
revision: yes

Referee: [Evaluation] Evaluation: The 82% performance claim requires explicit confirmation that TinyMU and the referenced SOTA LALM were evaluated on MuChoMusic under identical prompting templates, decoding strategies, and few-shot examples, together with a contamination check showing no overlap between MusicSkills-3.5M training items and MuChoMusic test items; without these, the size reduction advantage cannot be isolated from evaluation differences.

Authors: All models were evaluated under strictly matched conditions: identical zero-shot prompting templates, greedy decoding, and the same MuChoMusic test set. We performed an explicit contamination analysis (exact string matching plus embedding similarity threshold) between the MusicSkills-3.5M training split and MuChoMusic test items, confirming no overlap. We will add a new subsection under Evaluation detailing these protocols, the contamination results, and the precise baseline model versions to make the comparison fully reproducible.
revision: yes
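
The contamination analysis described in the simulated rebuttal (exact string matching plus a similarity threshold) can be sketched as follows. This is a hedged illustration, not the authors' code: a bag-of-words cosine similarity stands in for the embedding similarity, the 0.9 threshold is arbitrary, the example questions are invented, and the check covers question text only, not audio overlap.

```python
import math
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def find_contamination(train_items, test_items, threshold=0.9):
    """Return (test_index, train_index, reason) for each suspected overlap."""
    train_norm = [normalize(t) for t in train_items]
    train_exact = {}
    for i, t in enumerate(train_norm):
        train_exact.setdefault(t, i)
    # STAND-IN for learned embeddings: simple token-count vectors.
    train_vecs = [Counter(t.split()) for t in train_norm]

    hits = []
    for j, item in enumerate(test_items):
        norm = normalize(item)
        if norm in train_exact:
            hits.append((j, train_exact[norm], "exact"))
            continue
        vec = Counter(norm.split())
        for i, tv in enumerate(train_vecs):
            if cosine(vec, tv) >= threshold:
                hits.append((j, i, "near-duplicate"))
                break
    return hits


# INVENTED example questions, for illustration only.
train_questions = [
    "Which instrument plays the melody?",
    "Is the tempo of this track fast or slow?",
]
test_questions = [
    "which instrument plays the melody",
    "Is the tempo of this track slow or fast?",
    "What genre best describes this clip?",
]
hits = find_contamination(train_questions, test_questions)
print(hits)
```

The near-duplicate pass is a linear scan per test item, which is fine for a sketch but would need an index or approximate nearest-neighbor search at the scale of 3.5M training items.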

Circularity Check

0 steps flagged

No significant circularity in performance claims or derivation

The paper's core result is an empirical measurement: TinyMU is trained on the newly introduced MusicSkills-3.5M dataset using the MATPAC++ encoder and then evaluated on the independent MuChoMusic benchmark, yielding the 82% relative performance figure. This chain does not reduce any claimed prediction or uniqueness result to a fitted parameter, self-definition, or self-citation load-bearing premise; the benchmark score is generated by standard training-plus-evaluation rather than by construction from the inputs. Minor component citations (e.g., for the audio encoder) are not load-bearing for the size-performance claim.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, invented entities, or detailed axioms; the work implicitly relies on standard multimodal alignment techniques and prior audio encoder results without stating assumptions or new postulates.

axioms (2)

domain assumption: MATPAC++ provides fine-grained self-supervised audio features suitable for music understanding
Invoked when pairing the encoder with the language model; no independent verification supplied in abstract.
domain assumption: A linear projector suffices to align audio embeddings with language model inputs
Standard assumption in multimodal models but treated as given without further justification.

pith-pipeline@v0.9.0 · 5550 in / 1423 out tokens · 39921 ms · 2026-05-10T08:12:50.239305+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 13 canonical work pages · 5 internal anchors

[1]

INTRODUCTION. Recent years have witnessed rapid progress in Large Audio Language Models (LALMs) [1–5], which integrate robust audio encoders with Large Language Models (LLMs). Trained on large-scale audio–text datasets, LALMs demonstrate impressive understanding and reasoning abilities, moving beyond traditional tasks such as tagging or classification ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

What is happening in the music?

THE MUSICSKILLS-3.5M DATASET. As a foundation for TinyMU, we construct MusicSkills-3.5M, a diverse music question-answering dataset designed to enhance both the understanding and reasoning capabilities of efficient Music Language Models. Following existing works [3, 4, 9], we assemble a collection of samples consisting of (Music, Question, Answer) pairs. ...
[3]

Encoder. TinyMU leverages the audio encoder MATPAC++ [14] to encode input music clips

TINYMU. Overview. As illustrated in Figure 2, TinyMU is composed of three core components: an audio encoder f_θ to extract fine-grained music features, a projector h_γ to align different modalities, and a small language model g_ν to generate text conditioned on music and user instructions. Encoder. TinyMU leverages the audio encoder MATPAC++ [14] to encode inp...
[4]

Evaluation Setup. We conducted a holistic evaluation to assess TinyMU’s music understanding and reasoning capabilities

EXPERIMENTS. 4.1. Evaluation Setup. We conducted a holistic evaluation to assess TinyMU’s music understanding and reasoning capabilities. The evaluation involves three distinct levels of tasks: Basic Music Information Retrieval (MIR) tasks. We use GTZAN
[5]

GTZAN is a genre classification dataset consisting of 1,000 clips across 10 genres

and Medley-Solos-DB [26] to evaluate model’s fundamental music understanding. GTZAN is a genre classification dataset consisting of 1,000 clips across 10 genres. Medley-Solos-DB is an instrument recognition benchmark with 12,236 solo clips, each corresponding to one of eight instrument categories. Since neither dataset overlaps with TinyMU’s trainin...
[6]

To train TinyMU, we introduce MusicSkills-3.5M, a comprehensive music-grounded dataset spanning diverse formats and musical concepts

CONCLUSION This paper proposes TinyMU, a compact Music-Language Model with strong understanding and reasoning abilities. To train TinyMU, we introduce MusicSkills-3.5M, a comprehensive music-grounded dataset spanning diverse formats and musical concepts. To boost perception, we leverage MATPAC++, the SOTA self-supervised audio encoder as TinyMU’s feature ...
[7]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv preprint arXiv:2311.07919, 2023

work page internal anchor Pith review arXiv 2023
[8]

Qwen2-Audio Technical Report

Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin et al., “Qwen2-Audio technical report,” arXiv preprint arXiv:2407.10759, 2024

work page internal anchor Pith review arXiv 2024
[9]

Listen, think, and understand,

Y. Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. Glass, “Listen, think, and understand,” Proc. ICLR, 2024

2024
[10]

SALMONN: Towards generic hearing abilities for large language models,

C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “SALMONN: Towards generic hearing abilities for large language models,” Proc. ICLR, 2024

2024
[11]

Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,

S. Ghosh, Z. Kong, S. Kumar, S. Sakshi, J. Kim, W. Ping, R. Valle, D. Manocha, and B. Catanzaro, “Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,” Proc. ICML, 2025

2025
[12]

Musilingo: Bridging music and text with pretrained language models for music captioning and query response,

Z. Deng, Y. Ma, Y. Liu, R. Guo, G. Zhang, W. Chen, W. Huang, and E. Benetos, “Musilingo: Bridging music and text with pretrained language models for music captioning and query response,” Proc. NAACL, 2024

2024
[13]

Music understanding LLaMA: Advancing text-to-music generation with question answering and captioning,

S. Liu, A. S. Hussain, C. Sun, and Y. Shan, “Music understanding LLaMA: Advancing text-to-music generation with question answering and captioning,” in Proc. ICASSP, 2024

2024
[14]

OpenMU: Your swiss army knife for music understanding,

M. Zhao, Z. Zhong, Z. Mao, S. Yang, W.-H. Liao, S. Takahashi, H. Wakaki, and Y. Mitsufuji, “OpenMU: Your swiss army knife for music understanding,” arXiv preprint arXiv:2410.15573, 2024

work page arXiv 2024
[15]

LLark: A multimodal instruction-following language model for music,

J. Gardner, S. Durand, D. Stoller, and R. M. Bittner, “LLark: A multimodal instruction-following language model for music,” Proc. ICML, 2024

2024
[16]

Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering,

G. Li, J. Liu, H. Dinkel, Y. Niu, J. Zhang, and J. Luan, “Reinforcement learning outperforms supervised fine-tuning: A case study on audio question answering,” arXiv preprint arXiv:2503.11197, 2025

work page arXiv 2025
[17]

MuChoMusic: Evaluating music understanding in multimodal audio-language models,

B. Weck, I. Manco, E. Benetos, E. Quinton, G. Fazekas, and D. Bogdanov, “MuChoMusic: Evaluating music understanding in multimodal audio-language models,” Proc. ISMIR, 2024

2024
[18]

MiDashengLM: Efficient audio understanding with general audio captions,

H. Dinkel, G. Li, J. Liu, J. Luan, Y. Niu, X. Sun, T. Wang, Q. Xiao, J. Zhang, and J. Zhou, “MiDashengLM: Efficient audio understanding with general audio captions,” arXiv preprint arXiv:2508.03983, 2025

work page arXiv 2025
[19]

Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,

A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle, and B. Catanzaro, “Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,” in arXiv, 2025

2025
[20]

MATPAC++: Enhanced masked latent prediction for self-supervised audio representation learning,

A. Quelennec, P. Chouteau, G. Peeters, and S. Essid, “MATPAC++: Enhanced masked latent prediction for self-supervised audio representation learning,” arXiv preprint arXiv:2508.12709, 2025

work page arXiv 2025
[21]

MusicLM: Generating music from text,

A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi et al., “MusicLM: Generating music from text,” Proc. ICML, 2023

2023
[22]

Evaluation of algorithms using games: The case of music tagging

E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie, “Evaluation of algorithms using games: The case of music tagging,” in ISMIR, 2009

2009
[23]

FMA: A dataset for music analysis,

M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “FMA: A dataset for music analysis,” Proc. ISMIR, 2016

2016
[24]

AudioSet: An ontology and human-labeled dataset for audio events,

J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “AudioSet: An ontology and human-labeled dataset for audio events,” in Proc. ICASSP, 2017

2017
[25]

LP-MusicCaps: LLM-based pseudo music captioning,

S. Doh, K. Choi, J. Lee, and J. Nam, “LP-MusicCaps: LLM-based pseudo music captioning,” arXiv preprint arXiv:2307.16372, 2023

work page arXiv 2023
[26]

Coarse-to-fine text-to-music latent diffusion,

L. A. Lanzendörfer, T. Lu, N. Perraudin, D. Herremans, and R. Wattenhofer, “Coarse-to-fine text-to-music latent diffusion,” in Proc. ICASSP, 2025

2025
[27]

Audiosetcaps: An enriched audio-caption dataset using automated generation pipeline with large audio and language models,

J. Bai, H. Liu, M. Wang, D. Shi, W. Wang, M. D. Plumbley, W.-S. Gan, and J. Chen, “Audiosetcaps: An enriched audio-caption dataset using automated generation pipeline with large audio and language models,” IEEE Transactions on Audio, Speech and Language Processing, 2025

2025
[28]

MERT: Acoustic music understanding model with large-scale self-supervised training,

Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Xiao, C. Lin, A. Ragni, E. Benetos et al., “MERT: Acoustic music understanding model with large-scale self-supervised training,” Proc. ICLR, 2024

2024
[29]

BEATs: Audio pre-training with acoustic tokenizers,

S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, and F. Wei, “BEATs: Audio pre-training with acoustic tokenizers,” arXiv preprint arXiv:2212.09058, 2022

work page arXiv 2022
[30]

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

L. B. Allal, A. Lozhkov, E. Bakouch, G. M. Blázquez, G. Penedo, L. Tunstall, A. Marafioti, H. Kydlíček, A. P. Lajarín, V. Srivastav et al., “SmolLM2: When smol goes big -- data-centric training of a small language model,” arXiv preprint arXiv:2502.02737, 2025

work page internal anchor Pith review arXiv 2025
[31]

Automatic musical genre classification of audio signals,

G. Tzanetakis, G. Essl, and P. Cook, “Automatic musical genre classification of audio signals,” in Proc. ISMIR, 2001

2001
[32]

Medley-solos-DB: A cross-collection dataset for musical instrument recognition,

V. Lostanlen, C.-E. Cella, R. Bittner, and S. Essid, “Medley-solos-DB: A cross-collection dataset for musical instrument recognition,” https://doi.org/10.5281/zenodo.1344103, 2019

work page doi:10.5281/zenodo.1344103 2019
[33]

CLAP learning audio concepts from natural language supervision,

B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “CLAP learning audio concepts from natural language supervision,” in Proc. ICASSP, 2023

2023
[34]

METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,

S. Banerjee and A. Lavie, “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments,” in Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005

2005
[35]

BERTScore: Evaluating Text Generation with BERT

T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating text generation with BERT,” arXiv preprint arXiv:1904.09675, 2019

work page internal anchor Pith review arXiv 2019
[36]

Mellow: a small audio language model for reasoning,

S. Deshmukh, S. Dixit, R. Singh, and B. Raj, “Mellow: a small audio language model for reasoning,” arXiv preprint arXiv:2503.08540, 2025

work page arXiv 2025
[37]

HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection,

K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection,” in Proc. ICASSP, 2022

2022
[38]

LoRA: Low-rank adaptation of large language models

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “LoRA: Low-rank adaptation of large language models,” Proc. ICLR, 2022

2022