Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores
Pith reviewed 2026-05-17 05:42 UTC · model grok-4.3
The pith
MSU-Bench shows that large models have gaps in comprehending complete musical scores but improve with fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evaluations of more than fifteen state-of-the-art models reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness; fine-tuning substantially improves results across modalities while preserving general knowledge.
What carries the argument
MSU-Bench, a human-curated collection of 1,800 generative QA pairs organized into four levels of increasing difficulty and presented in textual (ABC notation) and visual (PDF) modalities.
If this is right
- Models display clear performance differences between processing scores as text and as images.
- Accuracy does not vary monotonically with difficulty level; performance across the four levels is unstable.
- Models often fail to sustain correct reasoning across basic and advanced questions on the same score.
- Fine-tuning enhances musical score comprehension in both input types while general capabilities remain intact.
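The third point above, sustaining correctness across levels on the same score, can be made concrete with a small metric. The following is a minimal sketch, not the paper's own scoring code: the function name `multilevel_consistency`, the record layout, and the sample score IDs are all assumptions made for illustration.

```python
from collections import defaultdict


def multilevel_consistency(records, levels=(1, 2, 3, 4)):
    """Fraction of scores on which the model is correct at every level.

    `records` is an iterable of (score_id, level, is_correct) tuples.
    A level counts as correct for a score only if all of its questions
    are answered correctly; a missing level counts as incorrect.
    """
    per_score = defaultdict(dict)
    for score_id, level, is_correct in records:
        previous = per_score[score_id].get(level, True)
        per_score[score_id][level] = previous and bool(is_correct)
    if not per_score:
        return 0.0
    consistent = sum(
        1
        for by_level in per_score.values()
        if all(by_level.get(level, False) for level in levels)
    )
    return consistent / len(per_score)


# Hypothetical results (illustrative only, not data from the paper).
records = [
    ("bwv846", 1, True), ("bwv846", 2, True),
    ("bwv846", 3, True), ("bwv846", 4, True),
    ("op23", 1, True), ("op23", 2, False),
    ("op23", 3, True), ("op23", 4, True),
]
print(multilevel_consistency(records))  # 0.5
```

A model can score well on per-level accuracy and still score poorly here, which is the distinction the bullet draws.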
Where Pith is reading between the lines
- Future models might be designed to explicitly combine information from multiple musical dimensions like harmony and form.
- Applying similar benchmarks to other creative domains could highlight similar modality-specific weaknesses in AI systems.
- The results point to the value of domain-specific fine-tuning for tasks requiring integrated perceptual and structural reasoning.
Load-bearing premise
That the human-curated generative QA pairs at each difficulty level provide a faithful and unbiased measure of genuine musical score comprehension rather than testing superficial pattern matching.
What would settle it
If fine-tuned models show no improvement over zero-shot performance when tested on musical scores from entirely new composers, this would indicate that the gains do not reflect true comprehension.
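The test described above amounts to a composer-held-out split followed by a comparison of zero-shot and fine-tuned accuracy on the held-out portion. A minimal sketch, assuming each QA item carries a `composer` field (a hypothetical schema, not confirmed by the source); the outcomes below are invented for illustration.

```python
def split_by_composer(items, held_out_composers):
    """Partition QA items so that no held-out composer appears in training."""
    held_out = set(held_out_composers)
    train = [item for item in items if item["composer"] not in held_out]
    test = [item for item in items if item["composer"] in held_out]
    return train, test


def accuracy(outcomes):
    """Mean of boolean outcomes; 0.0 for an empty input."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical items: composer names appear in the abstract, IDs are made up.
items = [
    {"id": "q1", "composer": "Bach"},
    {"id": "q2", "composer": "Beethoven"},
    {"id": "q3", "composer": "Chopin"},
    {"id": "q4", "composer": "Debussy"},
    {"id": "q5", "composer": "Debussy"},
]
train, test = split_by_composer(items, held_out_composers=["Debussy"])

# Hypothetical per-question outcomes on the held-out set.
zero_shot = {"q4": False, "q5": True}
fine_tuned = {"q4": True, "q5": True}
gain = accuracy(fine_tuned[i["id"]] for i in test) - accuracy(
    zero_shot[i["id"]] for i in test
)
print(len(train), len(test), gain)  # 3 2 0.5
```

A gain near zero on such a split would be the outcome the paragraph above treats as evidence against genuine comprehension.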
Original abstract
Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision-Language Models to interpret full musical notation remains insufficiently examined. We introduce Musical Score Understanding Benchmark (MSU-Bench), a human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative question-answer pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness. Fine-tuning substantially improves results across modalities while preserving general knowledge, positioning MSU-Bench as a robust foundation for future research in multimodal reasoning. The benchmark and code are available at https://github.com/Congren-Dai/MSU-Bench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Musical Score Understanding Benchmark (MSU-Bench), a collection of 1,800 human-curated generative QA pairs drawn from complete musical scores by composers such as Bach, Beethoven, Chopin, and Debussy. The pairs are organized into four increasing difficulty levels (onset information through texture and form) and support evaluation in textual (ABC notation) and visual (PDF) modalities. Evaluations of more than fifteen state-of-the-art LLMs and VLMs in zero-shot and fine-tuned regimes report pronounced modality gaps, unstable level-wise performance, and difficulties maintaining multilevel correctness, while fine-tuning yields substantial gains across modalities without apparent loss of general knowledge.
Significance. If the QA pairs genuinely probe integrated reasoning over pitch, rhythm, harmony, and form rather than superficial cues, the benchmark would fill a notable gap in multimodal evaluation for music. The reported modality differences and fine-tuning benefits could usefully guide development of models capable of score-level understanding, and the public release of the benchmark and code supports reproducibility and follow-on work.
major comments (2)
- [Abstract / Benchmark Construction] Abstract and benchmark description: No details are supplied on inter-annotator agreement, expert validation of the generative QA pairs, or controls (e.g., adversarial minimal-knowledge baselines or checks that answers cannot be derived from local ABC token patterns or PDF visual heuristics). Because the central claims of modality gaps and multilevel correctness failures rest on the assumption that the 1,800 pairs measure genuine integrated comprehension, this omission is load-bearing.
- [Evaluation Results] Evaluation results: The reported patterns of 'pronounced modality gaps' and 'unstable level-wise performance' are presented without statistical significance tests, confidence intervals, or error bars. This makes it difficult to determine whether the observed differences are robust or could arise from question distribution or model selection.
minor comments (1)
- [Experimental Setup] A table listing all evaluated models with exact versions, parameter counts, and input formats would improve reproducibility.
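The first major comment asks for inter-annotator agreement. One common statistic for two annotators is Cohen's kappa, sketched below in plain Python. What the annotators would actually label (here, the difficulty level assigned to each question) is an assumption for illustration; the source does not say how agreement was or will be measured.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical difficulty levels assigned to six questions by two annotators.
annotator_a = [1, 1, 2, 3, 4, 4]
annotator_b = [1, 2, 2, 3, 4, 4]
print(round(cohens_kappa(annotator_a, annotator_b), 3))  # 0.778
```

Kappa corrects raw agreement (5 of 6 here) for the agreement expected by chance, which matters when some levels are much more frequent than others.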
Simulated Author's Rebuttal
We thank the referee for their valuable comments on our manuscript. We provide point-by-point responses to the major comments below and have incorporated revisions to address the concerns raised.
- Referee: [Abstract / Benchmark Construction] Abstract and benchmark description: No details are supplied on inter-annotator agreement, expert validation of the generative QA pairs, or controls (e.g., adversarial minimal-knowledge baselines or checks that answers cannot be derived from local ABC token patterns or PDF visual heuristics). Because the central claims of modality gaps and multilevel correctness failures rest on the assumption that the 1,800 pairs measure genuine integrated comprehension, this omission is load-bearing.
  Authors: We agree that additional details on benchmark construction are required to substantiate claims of genuine integrated comprehension. In the revised manuscript we expand the relevant section to describe the human curation process, report inter-annotator agreement on a reviewed subset, and document controls confirming that questions cannot be solved from local token or visual heuristics alone. These additions directly support the reported modality gaps and multilevel consistency issues.
  Revision: yes
- Referee: [Evaluation Results] Evaluation results: The reported patterns of 'pronounced modality gaps' and 'unstable level-wise performance' are presented without statistical significance tests, confidence intervals, or error bars. This makes it difficult to determine whether the observed differences are robust or could arise from question distribution or model selection.
  Authors: We acknowledge that statistical support strengthens the evaluation claims. The revised manuscript now includes appropriate significance tests for modality and level-wise comparisons together with 95% confidence intervals and error bars on all performance figures. These additions confirm the robustness of the observed gaps and instabilities.
  Revision: yes
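The exchange above concerns confidence intervals for the modality gap. One way to obtain them is a paired percentile bootstrap over questions, sketched here with the standard library only. This is an illustrative technique chosen for the sketch, not the test the authors report using; the function name, parameters, and sample outcomes are assumptions.

```python
import random


def bootstrap_gap_ci(text_correct, image_correct,
                     n_resamples=2000, alpha=0.05, seed=0):
    """Paired percentile-bootstrap CI for accuracy(text) - accuracy(image).

    Both inputs are per-question booleans for the same questions, in the
    same order, so each resample keeps the pairing intact.
    """
    if len(text_correct) != len(image_correct) or not text_correct:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(text_correct)
    diffs = [int(t) - int(v) for t, v in zip(text_correct, image_correct)]
    gaps = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        gaps.append(sum(sample) / n)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


# Hypothetical outcomes for ten questions (not data from the paper).
text_correct = [True] * 8 + [False] * 2
image_correct = [True] * 5 + [False] * 5
gap, lower, upper = bootstrap_gap_ci(text_correct, image_correct)
print(gap, lower, upper)
```

If the resulting interval excludes zero, the gap is unlikely to be an artefact of which questions happened to be sampled; with only ten questions, as in this toy input, the interval will be wide.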
Circularity Check
Empirical benchmark evaluation with no derivation chain or self-referential predictions
The paper constructs MSU-Bench (1,800 human-curated generative QA pairs across four difficulty levels) and reports zero-shot and fine-tuned performance of >15 models on textual (ABC) and visual (PDF) modalities. No equations, fitted parameters, or predictions appear in the provided text. Claims about modality gaps and multilevel correctness rest directly on observed accuracy numbers rather than reducing to quantities defined by the authors' own inputs. No self-citations are invoked to justify uniqueness theorems or ansatzes. The skeptic concern about superficial pattern matching is a validity question, not a circularity reduction. This is a standard self-contained empirical benchmark paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human-curated generative QA pairs at four difficulty levels accurately capture distinct aspects of musical score understanding
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: MSU-Bench contains 1,800 generative question-answer pairs ... organised into four levels of increasing difficulty, ranging from onset information to texture and form.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Paper passage: Evaluations ... reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.