arxiv: 2511.20697 · v4 · submitted 2025-11-24 · 💻 cs.SD · cs.AI

Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores

Pith reviewed 2026-05-17 05:42 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords musical score understanding, large language models, vision-language models, benchmark, multimodal reasoning, fine-tuning, ABC notation, music analysis

The pith

MSU-Bench shows that large models have gaps in comprehending complete musical scores but improve with fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MSU-Bench, a benchmark for testing how well large language models and vision-language models understand complete musical scores. It contains 1,800 human-curated generative question-answer pairs drawn from works by Bach, Beethoven, Chopin, Debussy, and others, offered as text (ABC notation) and as images (PDF) across four difficulty levels. Evaluating more than fifteen leading models reveals pronounced gaps between text and visual inputs, unstable performance from one level to the next, and difficulty staying correct across all levels. Fine-tuning on the benchmark substantially raises scores in both modalities while preserving general knowledge. The authors position MSU-Bench as a foundation for measuring and improving how models reason about musical structure.

Core claim

Evaluations of more than fifteen state-of-the-art models reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness; fine-tuning substantially improves results across modalities while preserving general knowledge.

What carries the argument

MSU-Bench, the human-curated collection of 1,800 generative QA pairs organized by four levels of musical understanding difficulty in textual and visual modalities.

If this is right

  • Models display clear performance differences between processing scores as text and as images.
  • Performance is unstable across difficulty levels rather than changing predictably as questions get harder.
  • Models often fail to sustain correct reasoning across basic and advanced questions on the same score.
  • Fine-tuning enhances musical score comprehension in both input types while general capabilities remain intact.
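
The third point, sustaining correct answers across levels on the same score, can be made concrete with a cumulative per-score metric. What follows is a minimal sketch and not the paper's definition: it assumes each result is a hypothetical `(score_id, level, is_correct)` tuple, and that a score counts as successful at level k only when every one of its questions at levels 1 through k is answered correctly. The function name and the cumulative rule are illustrative assumptions.

```python
from collections import defaultdict


def level_wise_success(results, num_levels=4):
    """Fraction of scores that are fully correct up to each level.

    results: iterable of (score_id, level, is_correct) tuples (assumed schema).
    Returns a dict mapping level k to the share of scores whose questions
    at levels 1..k are all correct.
    """
    per_score = defaultdict(lambda: defaultdict(list))
    for score_id, level, is_correct in results:
        per_score[score_id][level].append(bool(is_correct))

    rates = {}
    for k in range(1, num_levels + 1):
        passed = 0
        for levels in per_score.values():
            # A level with no recorded questions counts as a failure here;
            # that choice is an assumption of this sketch.
            ok = all(all(levels.get(l, [False])) for l in range(1, k + 1))
            passed += ok
        rates[k] = passed / len(per_score) if per_score else 0.0
    return rates


if __name__ == "__main__":
    demo = [
        ("score_a", 1, True), ("score_a", 2, True),
        ("score_a", 3, False), ("score_a", 4, True),
        ("score_b", 1, True), ("score_b", 2, False),
        ("score_b", 3, True), ("score_b", 4, True),
    ]
    print(level_wise_success(demo))
```

Under this rule a single early mistake caps a score's success at every later level, which is one way a model could look reasonable on per-question accuracy yet poor on multilevel correctness.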

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future models might be designed to explicitly combine information from multiple musical dimensions like harmony and form.
  • Applying similar benchmarks to other creative domains could highlight similar modality-specific weaknesses in AI systems.
  • The results point to the value of domain-specific fine-tuning for tasks requiring integrated perceptual and structural reasoning.

Load-bearing premise

That the human-curated generative QA pairs at each difficulty level provide a faithful and unbiased measure of genuine musical score comprehension rather than testing superficial pattern matching.

What would settle it

If fine-tuned models show no improvement over zero-shot performance when tested on musical scores from entirely new composers, this would indicate that the gains do not reflect true comprehension.
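
A test of this kind needs a split in which no composer appears on both sides. The sketch below shows one way to build it, assuming a hypothetical item schema in which each QA item is a dict with a `composer` key; the function name, the 20% default, and the schema are illustrative and are not taken from the MSU-Bench release.

```python
import random


def split_by_composer(items, held_out_fraction=0.2, seed=0):
    """Split QA items so held-out composers never appear in training.

    items: list of dicts with a 'composer' key (assumed schema).
    Returns (train_items, test_items, held_out_composers).
    """
    composers = sorted({item["composer"] for item in items})
    rng = random.Random(seed)
    rng.shuffle(composers)
    # Hold out at least one composer, even for small collections.
    n_held = max(1, round(len(composers) * held_out_fraction))
    held = set(composers[:n_held])
    train = [it for it in items if it["composer"] not in held]
    test = [it for it in items if it["composer"] in held]
    return train, test, held


if __name__ == "__main__":
    names = ["Bach", "Beethoven", "Chopin", "Debussy", "Grieg"]
    items = [{"composer": c, "question_id": i} for c in names for i in range(3)]
    train, test, held = split_by_composer(items)
    print(sorted(held), len(train), len(test))
```

Fine-tuning on `train` and comparing zero-shot and fine-tuned accuracy on `test` would then measure transfer to unseen composers rather than memorisation of familiar ones.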

Figures

Figures reproduced from arXiv: 2511.20697 by Bo Zhang, Congren Dai, Enyang Liu, Ge Jin, Haosen Zhang, Hongran An, Huichi Zhou, Kinhei Lee, Krinos Li, Maosong Sun, Peiyuan Jing, Shijie Liang, Xiaobing Li, Yue Yang, Zhenxuan Zhang.

Figure 1. (a) Hallucination. When queried about specific score features in bars, VLMs often fabricate responses that are not grounded in the actual score. (b) Ideal scenario. Models should accurately localise and analyse bars, thereby supporting reliable higher-level musicological reasoning. These levels range from basic recognition of notational elements to advanced harmonic and structural analysis. By explicitly…

Figure 2. Illustration of multi-level understanding in MSU-Bench using Mussorgsky’s…

Figure 3. MSU-Bench data curation and evaluation framework. We collect 150 musical scores from MuseScore…

Figure 4. Distribution of 4-Level Questions. … in MSU-Bench is provided in Section A. For visual QA, the PDF of each score is employed, whereas for textual QA, the corresponding MusicXML file is converted into ABC notation. A comprehensive set of general questions is then developed and categorised into three levels of difficulty (Levels 1–3), designed to evaluate a broad range of musical concepts encompassing fundam…

Figure 5. Performance of baseline and LoRA-adapted models on MSU-Bench (testing set). Qwen2.5-VL-3B…

Figure 6. Level-wise Success Rate for textual QA. … with depth in both settings. In textual QA (Figure 6), models perform moderately at Level 1 (25–35%), with Gemini 2.5 Pro slightly ahead of ChatGPT-5 and Grok 4, but drop below 10% by Level 2 and nearly vanish by Level 3. Visual QA (…

Figure 7. Frequency distribution of composers represented in MSU-Bench. The histogram illustrates the number of…

Figure 8. Distribution of musical periods and genres in MSU-Bench. (a) shows the historical periods of the selected…

Figure 9. Level-wise Success Rate. We evaluate model…

Figure 10. The inference time for models exceeding 40% overall accuracy.

Figure 11. Level-wise Success Rate for Models Adapted Using LoRA.

Figure 12. Overall accuracy comparison under the Question-by-Question and Single-Run (Batch) evaluation for…
Original abstract

Understanding complete musical scores entails integrated reasoning over pitch, rhythm, harmony, and large-scale structure, yet the ability of Large Language Models and Vision-Language Models to interpret full musical notation remains insufficiently examined. We introduce Musical Score Understanding Benchmark (MSU-Bench), a human-curated benchmark for score-level musical understanding across textual (ABC notation) and visual (PDF) modalities. MSU-Bench contains 1,800 generative question-answer pairs from works by Bach, Beethoven, Chopin, Debussy, and others, organised into four levels of increasing difficulty, ranging from onset information to texture and form. Evaluations of more than fifteen state-of-the-art models, in both zero-shot and fine-tuned settings, reveal pronounced modality gaps, unstable level-wise performance, and challenges in maintaining multilevel correctness. Fine-tuning substantially improves results across modalities while preserving general knowledge, positioning MSU-Bench as a robust foundation for future research in multimodal reasoning. The benchmark and code are available at https://github.com/Congren-Dai/MSU-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Musical Score Understanding Benchmark (MSU-Bench), a collection of 1,800 human-curated generative QA pairs drawn from complete musical scores by composers such as Bach, Beethoven, Chopin, and Debussy. The pairs are organized into four increasing difficulty levels (onset information through texture and form) and support evaluation in textual (ABC notation) and visual (PDF) modalities. Evaluations of more than fifteen state-of-the-art LLMs and VLMs in zero-shot and fine-tuned regimes report pronounced modality gaps, unstable level-wise performance, and difficulties maintaining multilevel correctness, while fine-tuning yields substantial gains across modalities without apparent loss of general knowledge.

Significance. If the QA pairs genuinely probe integrated reasoning over pitch, rhythm, harmony, and form rather than superficial cues, the benchmark would fill a notable gap in multimodal evaluation for music. The reported modality differences and fine-tuning benefits could usefully guide development of models capable of score-level understanding, and the public release of the benchmark and code supports reproducibility and follow-on work.

major comments (2)
  1. [Abstract / Benchmark Construction] Abstract and benchmark description: No details are supplied on inter-annotator agreement, expert validation of the generative QA pairs, or controls (e.g., adversarial minimal-knowledge baselines or checks that answers cannot be derived from local ABC token patterns or PDF visual heuristics). Because the central claims of modality gaps and multilevel correctness failures rest on the assumption that the 1,800 pairs measure genuine integrated comprehension, this omission is load-bearing.
  2. [Evaluation Results] Evaluation results: The reported patterns of 'pronounced modality gaps' and 'unstable level-wise performance' are presented without statistical significance tests, confidence intervals, or error bars. This makes it difficult to determine whether the observed differences are robust or could arise from question distribution or model selection.
minor comments (1)
  1. [Experimental Setup] A table listing all evaluated models with exact versions, parameter counts, and input formats would improve reproducibility.
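
The inter-annotator agreement requested in major comment 1 is commonly summarised with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a plain-Python illustration for two annotators who each assign one categorical label per QA pair; the label values and the function name are hypothetical, and nothing here reflects how the authors actually validated their data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    labels_a, labels_b: equal-length sequences of categorical labels.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    annotator_1 = ["valid", "valid", "flawed", "flawed"]
    annotator_2 = ["valid", "flawed", "flawed", "flawed"]
    print(cohens_kappa(annotator_1, annotator_2))
```

Reporting kappa separately for each difficulty level would show whether annotators agree less on the harder, more interpretive questions.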

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their valuable comments on our manuscript. We provide point-by-point responses to the major comments below and have incorporated revisions to address the concerns raised.

  1. Referee: [Abstract / Benchmark Construction] Abstract and benchmark description: No details are supplied on inter-annotator agreement, expert validation of the generative QA pairs, or controls (e.g., adversarial minimal-knowledge baselines or checks that answers cannot be derived from local ABC token patterns or PDF visual heuristics). Because the central claims of modality gaps and multilevel correctness failures rest on the assumption that the 1,800 pairs measure genuine integrated comprehension, this omission is load-bearing.

    Authors: We agree that additional details on benchmark construction are required to substantiate claims of genuine integrated comprehension. In the revised manuscript we expand the relevant section to describe the human curation process, report inter-annotator agreement on a reviewed subset, and document controls confirming that questions cannot be solved from local token or visual heuristics alone. These additions directly support the reported modality gaps and multilevel consistency issues. revision: yes

  2. Referee: [Evaluation Results] Evaluation results: The reported patterns of 'pronounced modality gaps' and 'unstable level-wise performance' are presented without statistical significance tests, confidence intervals, or error bars. This makes it difficult to determine whether the observed differences are robust or could arise from question distribution or model selection.

    Authors: We acknowledge that statistical support strengthens the evaluation claims. The revised manuscript now includes appropriate significance tests for modality and level-wise comparisons together with 95% confidence intervals and error bars on all performance figures. These additions confirm the robustness of the observed gaps and instabilities. revision: yes
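
One way to attach a confidence interval to a modality gap is a paired bootstrap over questions. The sketch below is an illustration under stated assumptions, not the authors' procedure: it assumes the same questions were answered in both modalities, takes hypothetical per-question 0/1 correctness lists, and reports a percentile interval for the difference in accuracy (text minus visual).

```python
import random


def bootstrap_gap_ci(text_correct, visual_correct,
                     n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(text) - accuracy(visual).

    text_correct, visual_correct: paired per-question outcomes (0/1 or bool).
    Returns (point_estimate, lower, upper).
    """
    if len(text_correct) != len(visual_correct) or not text_correct:
        raise ValueError("need two non-empty outcome lists of equal length")
    n = len(text_correct)
    # Per-question difference keeps the pairing between modalities.
    diffs = [int(t) - int(v) for t, v in zip(text_correct, visual_correct)]
    point = sum(diffs) / n

    rng = random.Random(seed)
    samples = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(resample) / n)
    samples.sort()
    lower = samples[int((alpha / 2) * n_resamples)]
    upper = samples[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper


if __name__ == "__main__":
    text = [1] * 80 + [0] * 20
    visual = [1] * 50 + [0] * 50
    print(bootstrap_gap_ci(text, visual))
```

An interval that excludes zero would support calling the gap robust for that model. Resampling whole scores instead of individual questions would be the more conservative choice if questions on the same score are correlated.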

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivation chain or self-referential predictions

The paper constructs MSU-Bench (1,800 human-curated generative QA pairs across four difficulty levels) and reports zero-shot and fine-tuned performance of more than fifteen models on textual (ABC) and visual (PDF) modalities. No equations, fitted parameters, or predictions appear in the provided text. Claims about modality gaps and multilevel correctness rest directly on observed accuracy numbers rather than reducing to quantities defined by the authors' own inputs. No self-citations are invoked to justify uniqueness theorems or ansatzes. The skeptic's concern about superficial pattern matching is a validity question, not a circularity problem. This is a standard, self-contained empirical benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's contribution rests on the creation and use of a new evaluation benchmark rather than on mathematical derivations or new theoretical entities.

axioms (1)
  • domain assumption: Human-curated generative QA pairs at four difficulty levels accurately capture distinct aspects of musical score understanding.
    This premise is required for the benchmark to serve as a valid test of model comprehension.

pith-pipeline@v0.9.0 · 5532 in / 1298 out tokens · 32489 ms · 2026-05-17T05:42:28.431310+00:00 · methodology


