arXiv: 2505.12863 · v1 · submitted 2025-05-19 · cs.SD · cs.AI · cs.CV · eess.AS

Unified Cross-modal Translation of Score Images, Symbolic Music, and Performance Audio

Pith reviewed 2026-05-22 14:57 UTC · model grok-4.3

classification: cs.SD · cs.AI · cs.CV · eess.AS
keywords: cross-modal music translation, optical music recognition, score to audio generation, unified tokenization, multitask transformer, music information retrieval, YouTube music dataset, audio-symbolic alignment

The pith

A single Transformer trained jointly on music translations outperforms separate models for each task and generates audio from score images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that one general-purpose encoder-decoder Transformer can handle translations among score images, symbolic scores, MIDI, and audio when all modalities are turned into token sequences and the model is trained on every task at once. A new dataset of more than 1,300 hours of paired audio and score images collected from YouTube supplies the scale needed for this joint training. The approach improves optical music recognition symbol error rate from 24.58 percent to 13.67 percent and produces the first working examples of generating audio directly from a score image. A sympathetic reader would care because the method collapses several separate music-information-retrieval pipelines into one model while opening new cross-modal generation capabilities.

Core claim

The central claim is that a unified multitask model, built on a shared tokenization of score images, symbolic scores, MIDI, and audio into discrete sequences, outperforms single-task baselines across translation directions. The model is trained on a newly collected dataset of over 1,300 hours of aligned audio-score pairs from YouTube videos. This joint training yields a symbol error rate of 13.67 percent on optical music recognition and achieves the first successful score-image-conditioned audio generation.

What carries the argument

The argument rests on the unified tokenization framework, which discretizes score images, audio waveforms, MIDI events, and MusicXML into token sequences so that one encoder-decoder Transformer can treat every cross-modal translation as a sequence-to-sequence task.
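One way to picture such a shared token space is to give each modality its own block of IDs inside a single vocabulary and to mark the source and target modality with tag tokens. The sketch below is illustrative only: the codebook sizes, the special tokens, and the names `build_vocab`, `to_shared`, and `make_example` are assumptions made for this example, not the tokenizers or vocabulary layout used in the paper.

```python
from dataclasses import dataclass

# Hypothetical per-modality codebook sizes (not taken from the paper).
MODALITY_SIZES = {"image": 1024, "audio": 1024, "midi": 512, "musicxml": 512}
# Hypothetical special tokens: padding, sequence markers, one tag per modality.
SPECIAL_TOKENS = ["<pad>", "<bos>", "<eos>"] + [f"<{m}>" for m in MODALITY_SIZES]


@dataclass
class UnifiedVocab:
    special: dict  # special token string -> shared ID
    offsets: dict  # modality name -> first shared ID of its block
    size: int      # total number of shared IDs


def build_vocab(sizes=MODALITY_SIZES, special_tokens=SPECIAL_TOKENS):
    """Lay out special tokens first, then one contiguous ID block per modality."""
    special = {tok: i for i, tok in enumerate(special_tokens)}
    offsets, cursor = {}, len(special_tokens)
    for name, n in sizes.items():
        offsets[name] = cursor
        cursor += n
    return UnifiedVocab(special, offsets, cursor)


def to_shared(vocab, modality, local_ids, sizes=MODALITY_SIZES):
    """Map modality-local token IDs into the shared ID space."""
    n = sizes[modality]
    if any(not 0 <= t < n for t in local_ids):
        raise ValueError(f"token out of range for modality {modality!r}")
    return [vocab.offsets[modality] + t for t in local_ids]


def make_example(vocab, src_mod, src_ids, tgt_mod, tgt_ids):
    """Build one (source, target) pair for a sequence-to-sequence model."""
    eos = vocab.special["<eos>"]
    source = [vocab.special[f"<{src_mod}>"]] + to_shared(vocab, src_mod, src_ids) + [eos]
    target = (
        [vocab.special["<bos>"], vocab.special[f"<{tgt_mod}>"]]
        + to_shared(vocab, tgt_mod, tgt_ids)
        + [eos]
    )
    return source, target


vocab = build_vocab()
src, tgt = make_example(vocab, "image", [1, 2], "musicxml", [3])
```

Under this layout every translation direction is just a different pairing of tagged sequences, which is what lets a single set of parameters serve all tasks.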

If this is right

  • Optical music recognition reaches a symbol error rate of 13.67 percent rather than the prior 24.58 percent baseline.
  • Other translation directions, such as audio-to-MIDI, are reported to improve substantially as well, although the abstract does not quote their metrics.
  • Score images become a viable conditioning input for direct audio generation.
  • A single set of model parameters serves all translation directions instead of requiring separate models.
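To put the headline numbers in context: going from 24.58 percent to 13.67 percent is an absolute drop of 10.91 points, or roughly a 44 percent relative reduction. The sketch below shows one common way a symbol error rate is computed, as edit distance divided by reference length. That definition and the toy token names are assumptions for illustration; the paper's exact metric definition is not given on this page.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def symbol_error_rate(ref, hyp):
    """Edit distance normalized by reference length, in percent."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - improved) / baseline


# Toy symbol sequences (hypothetical token names).
reference = ["clef-G2", "key-D", "note-A4", "note-B4"]
hypothesis = ["clef-G2", "note-A4", "note-C5"]
print(symbol_error_rate(reference, hypothesis))    # 50.0 (one deletion, one substitution)
print(round(relative_reduction(24.58, 13.67), 1))  # 44.4
```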

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tokenization strategy could be tested on other paired multimodal datasets to see whether unified training generalizes beyond music.
  • The 1,300-hour YouTube collection might support pre-training for downstream tasks such as music style transfer or alignment without additional labels.
  • Adding explicit timing tokens or hierarchical structures could further reduce synchronization errors in audio generation.
  • Zero-shot translation between modalities never seen together during training becomes a natural next experiment.

Load-bearing premise

The YouTube video pairs supply sufficiently accurate alignments and labels across audio, images, and symbols so that joint training does not suffer from noise or synchronization errors.

What would settle it

Retraining the model on a controlled dataset that introduces measurable synchronization offsets or label noise and observing that the reported error reductions disappear would falsify the claim that the unified approach is robust.
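A controlled experiment of this kind needs a way to corrupt alignments by a known amount. The sketch below shifts each audio span by a bounded random offset while keeping its duration, so that the injected noise level can be swept and compared against the resulting error rates. The tuple layout `(image_id, start_s, end_s)` and the function names are hypothetical; the paper's dataset format is not described on this page.

```python
import random


def inject_sync_noise(pairs, max_offset_s, seed=0):
    """Shift each audio span by a uniform random offset in [-max, +max] seconds.

    pairs: list of (image_id, start_s, end_s) tuples (hypothetical layout).
    Returns a list of (image_id, new_start_s, new_end_s, applied_offset_s).
    Spans are clamped so they never start before 0; duration is preserved.
    """
    rng = random.Random(seed)
    noisy = []
    for image_id, start, end in pairs:
        offset = rng.uniform(-max_offset_s, max_offset_s)
        new_start = max(0.0, start + offset)
        new_end = new_start + (end - start)
        noisy.append((image_id, new_start, new_end, new_start - start))
    return noisy


def mean_abs_offset(noisy):
    """Average magnitude of the offsets that were actually applied."""
    return sum(abs(item[3]) for item in noisy) / len(noisy)


pairs = [("page_001", 0.0, 10.0), ("page_002", 10.0, 25.0)]
for level in (0.0, 0.5, 2.0):
    corrupted = inject_sync_noise(pairs, max_offset_s=level)
    # In a real study, each noise level would be followed by retraining and evaluation.
    print(level, round(mean_abs_offset(corrupted), 3))
```

Recording the applied offset alongside each pair makes it possible to report the measured noise level rather than only the nominal one.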

Figures

Figures reproduced from arXiv: 2505.12863 by Chris Donahue, Dasaem Jeong, Dongmin Kim, Hyungjoon Soh, Irmak Bukey, Jongmin Jung, Seola Cho, Sihun Lee.

Figure 1: Conventional cross-modal conversion tasks in music information
Figure 2: Overview of our proposed unified multimodal translation framework. We employ a single Transformer encoder-decoder model for each direction
Figure 3: The four modalities of music representation used in this paper.
Figure 4: An example from one of the videos collected for the YouTube Score
Figure 7: One example from audio-to-image translation.
Figure 8: Piano roll visualizations of MIDI representations from (a): Ground
Figure 9: Token distribution histograms used for Earth Mover’s Distance
Figure 10: Example of a poor-quality scan that necessitates our video-level
Figure 11: Example patches showing the 16×16 pixel resolution of individual tokens
Figure 12: Comparison between input sheet music (top) and model reconstruction (bottom), demonstrating reconstruction artifacts in staff lines when a model
Original abstract

Music exists in various modalities, such as score images, symbolic scores, MIDI, and audio. Translations between each modality are established as core tasks of music information retrieval, such as automatic music transcription (audio-to-MIDI) and optical music recognition (score image to symbolic score). However, most past work on multimodal translation trains specialized models on individual translation tasks. In this paper, we propose a unified approach, where we train a general-purpose model on many translation tasks simultaneously. Two key factors make this unified approach viable: a new large-scale dataset and the tokenization of each modality. Firstly, we propose a new dataset that consists of more than 1,300 hours of paired audio-score image data collected from YouTube videos, which is an order of magnitude larger than any existing music modal translation datasets. Secondly, our unified tokenization framework discretizes score images, audio, MIDI, and MusicXML into a sequence of tokens, enabling a single encoder-decoder Transformer to tackle multiple cross-modal translation as one coherent sequence-to-sequence task. Experimental results confirm that our unified multitask model improves upon single-task baselines in several key areas, notably reducing the symbol error rate for optical music recognition from 24.58% to a state-of-the-art 13.67%, while similarly substantial improvements are observed across the other translation tasks. Notably, our approach achieves the first successful score-image-conditioned audio generation, marking a significant breakthrough in cross-modal music generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a unified encoder-decoder Transformer model for cross-modal music translations among score images, symbolic scores (MusicXML), MIDI, and performance audio. It introduces a new 1,300-hour paired dataset collected from YouTube videos and a shared tokenization scheme that casts all tasks as sequence-to-sequence problems. The multitask model is shown to outperform single-task baselines, with the optical music recognition symbol error rate dropping from 24.58% to 13.67%, comparable gains on other translation directions, and the first reported success on score-image-conditioned audio generation.

Significance. If the collected YouTube pairs supply sufficiently clean and temporally aligned supervision, the work would be significant for multimodal music information retrieval. The combination of an order-of-magnitude larger dataset with a single tokenization framework enables joint training that yields concrete metric improvements and unlocks a previously unreported task (image-to-audio). The explicit comparison against independently implemented single-task baselines on held-out data and the provision of a large public-scale resource are clear strengths that could shape future unified modeling efforts in the field.

major comments (2)
  1. [Section 3] Section 3 (Dataset Construction): the description of pairing YouTube videos with score images does not include any quantitative assessment of synchronization accuracy (e.g., measured temporal offsets, manual verification statistics, or filtering criteria for alignment quality). Because every reported result—including the OMR SER reduction and the novel image-to-audio generation—relies on the same set of pairs for multitask training, unquantified label noise or offsets would directly undermine the claim that unification itself drives the gains rather than simply larger data volume.
  2. [Section 5] Section 5 (Experiments): while single-task baselines are compared on held-out test sets, the manuscript provides limited ablation detail on whether the observed improvements persist when single-task models are trained on the identical 1,300-hour corpus, and no error analysis or exact train/validation/test split statistics are reported. This leaves open the possibility that post-hoc selection or data-scale effects, rather than the unified architecture, explain the state-of-the-art numbers.
minor comments (1)
  1. [Abstract] The abstract states that 'similarly substantial improvements are observed across the other translation tasks' without naming the concrete metrics or numerical values; adding these figures would allow readers to gauge the breadth of the gains immediately.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Section 3] Section 3 (Dataset Construction): the description of pairing YouTube videos with score images does not include any quantitative assessment of synchronization accuracy (e.g., measured temporal offsets, manual verification statistics, or filtering criteria for alignment quality). Because every reported result—including the OMR SER reduction and the novel image-to-audio generation—relies on the same set of pairs for multitask training, unquantified label noise or offsets would directly undermine the claim that unification itself drives the gains rather than simply larger data volume.

    Authors: We agree that a quantitative assessment of synchronization accuracy is important for validating the dataset quality and isolating the benefits of unification. In the revised manuscript we will expand Section 3 with details on the alignment procedure, including the filtering criteria applied to YouTube pairs, any manual verification statistics collected, and measured temporal offset statistics for a sampled subset of the data. These additions will clarify that the reported gains are supported by reasonably aligned supervision. revision: yes

  2. Referee: [Section 5] Section 5 (Experiments): while single-task baselines are compared on held-out test sets, the manuscript provides limited ablation detail on whether the observed improvements persist when single-task models are trained on the identical 1,300-hour corpus, and no error analysis or exact train/validation/test split statistics are reported. This leaves open the possibility that post-hoc selection or data-scale effects, rather than the unified architecture, explain the state-of-the-art numbers.

    Authors: We acknowledge that the current manuscript does not include an explicit ablation of single-task models trained on the full 1,300-hour corpus nor detailed split statistics or error analysis. To address this directly, the revised version will add (i) an ablation study retraining the single-task baselines on the identical full corpus, (ii) exact train/validation/test split sizes and construction details, and (iii) a concise error analysis for the primary tasks (OMR and image-to-audio) to help attribute performance differences to the unified modeling approach rather than data scale alone. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains measured on held-out test data against independent baselines

full rationale

The paper presents an empirical multitask Transformer trained on a newly collected 1,300-hour YouTube-sourced dataset of paired audio and score images, with a proposed tokenization scheme that converts each modality into sequences. All reported metrics (e.g., OMR SER reduction from 24.58% to 13.67%, gains on other translation tasks, and first score-to-audio generation) are measured on external test splits against single-task baselines that the authors implement independently. No derivation step reduces by the paper's own equations or self-citations to a quantity defined solely by its inputs; the sequence-to-sequence objective and evaluation protocol supply independent grounding outside any fitted parameter or prior author result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard Transformer seq2seq assumptions and the quality of the newly collected paired data; no new physical entities or ad-hoc mathematical constants are introduced.

axioms (1)
  • domain assumption: A single encoder-decoder Transformer can learn effective shared representations when all modalities are discretized into compatible token sequences.
    Invoked when framing all translation tasks as one coherent sequence-to-sequence problem.

pith-pipeline@v0.9.0 · 5827 in / 1512 out tokens · 46133 ms · 2026-05-22T14:57:48.597785+00:00 · methodology

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages

  1. [2]

    Machine learning techniques in automatic music transcription: A systematic survey,

    F. Jamshidi, G. Pike, A. Das, and R. Chapman, “Machine learning techniques in automatic music transcription: A systematic survey,” 06 2024. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 11

  2. [3]

    Enabling factorized piano music modeling and generation with the MAESTRO dataset,

    C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C. A. Huang, S. Dieleman, E. Elsen, J. H. Engel, and D. Eck, “Enabling factorized piano music modeling and generation with the MAESTRO dataset,” 2019

  3. [4]

    Understanding optical music recognition,

    J. Calvo-Zaragoza, J. Hajič jr., and A. Pacha, “Understanding optical music recognition,” ACM Computing Surveys, vol. 53, 05 2020

  4. [5]

    Optical Music Recognition: Recent Advances, Current Challenges, and Future Directions,

    J. Calvo-Zaragoza, J. Martinez-Sevilla, C. Peñarrubia, and A. Ríos Vila, Optical Music Recognition: Recent Advances, Current Challenges, and Future Directions, pp. 94–104. 08 2023

  5. [6]

    Towards complete polyphonic music transcription: Integrating multi-pitch detection and rhythm quantization,

    E. Nakamura, E. Benetos, K. Yoshii, and S. Dixon, “Towards complete polyphonic music transcription: Integrating multi-pitch detection and rhythm quantization,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 101–105, IEEE, 2018

  6. [7]

    Graph neural network for music score data and modeling expressive piano performance,

    D. Jeong, T. Kwon, Y. Kim, and J. Nam, “Graph neural network for music score data and modeling expressive piano performance,” in International conference on machine learning, pp. 3060–3070, PMLR, 2019

  7. [8]

    Performance midi-to-score conversion by neural beat tracking,

    L. Liu, Q. Kong, V. Morfi, and E. Benetos, “Performance midi-to-score conversion by neural beat tracking,” in International Society for Music Information Retrieval Conference, 2022

  8. [9]

    End-to-end piano performance-midi to score conversion with transformers,

    T. Beyer and A. Dai, “End-to-end piano performance-midi to score conversion with transformers,” in International Society for Music Information Retrieval Conference, 2024

  9. [10]

    Deep performer: Score-to-audio music performance synthesis,

    H.-W. Dong, C. Zhou, T. Berg-Kirkpatrick, and J. McAuley, “Deep performer: Score-to-audio music performance synthesis,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 951–955, IEEE, 2022

  10. [11]

    Towards an integrated approach for expressive piano performance synthesis from music scores,

    J. Tang, E. Cooper, X. Wang, J. Yamagishi, and G. Fazekas, “Towards an integrated approach for expressive piano performance synthesis from music scores,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5, 2025

  11. [12]

    Tmt: Tri-modal translation between speech, image, and text by processing different modalities as different languages,

    M. Kim, J.-w. Jung, H. Rha, S. Maiti, S. Arora, X. Chang, S. Watanabe, and Y. M. Ro, “Tmt: Tri-modal translation between speech, image, and text by processing different modalities as different languages,” arXiv preprint arXiv:2402.16021, 2024

  12. [13]

    Automatic music transcription: An overview,

    E. Benetos, S. Dixon, Z. Duan, and S. Ewert, “Automatic music transcription: An overview,” IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 20–30, 2019

  13. [14]

    Onsets and frames: Dual-objective piano transcription,

    C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. H. Engel, S. Oore, and D. Eck, “Onsets and frames: Dual-objective piano transcription,” in Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018 (E. Gómez, X. Hu, E. Humphrey, and E. Benetos, eds.), pp....

  14. [15]

    High-resolution piano transcription with pedals by regressing onset and offset times,

    Q. Kong, B. Li, X. Song, Y. Wan, and Y. Wang, “High-resolution piano transcription with pedals by regressing onset and offset times,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3707–3717, 2021

  15. [16]

    Sequence-to-sequence piano transcription with transformers,

    C. Hawthorne, I. Simon, R. Swavely, E. Manilow, and J. H. Engel, “Sequence-to-sequence piano transcription with transformers,” in Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021, Online, November 7-12, 2021 (J. H. Lee, A. Lerch, Z. Duan, J. Nam, P. Rao, P. van Kranenburg, and A. Srinivasamurthy, eds.)...

  16. [17]

    MT3: multi-task multitrack music transcription,

    J. Gardner, I. Simon, E. Manilow, C. Hawthorne, and J. H. Engel, “MT3: multi-task multitrack music transcription,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, OpenReview.net, 2022

  17. [18]

    Yourmt3+: Multi-instrument music transcription with enhanced transformer architectures and cross-dataset stem augmentation,

    S. Chang, E. Benetos, H. Kirchhoff, and S. Dixon, “Yourmt3+: Multi-instrument music transcription with enhanced transformer architectures and cross-dataset stem augmentation,” in 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6, IEEE, 2024

  18. [19]

    Sheet music transformer: End-to-end optical music recognition beyond monophonic transcription,

    A. Ríos-Vila, J. Calvo-Zaragoza, and T. Paquet, “Sheet music transformer: End-to-end optical music recognition beyond monophonic transcription,” in International Conference on Document Analysis and Recognition, pp. 20–37, Springer, 2024

  19. [20]

    Practical end-to-end optical music recognition for pianoform music,

    J. Mayer, M. Straka, J. Hajič, and P. Pecina, “Practical end-to-end optical music recognition for pianoform music,” in Document Analysis and Recognition - ICDAR 2024 (E. H. Barney Smith, M. Liwicki, and L. Peng, eds.), (Cham), pp. 55–73, Springer Nature Switzerland, 2024

  20. [22]

    Sheet music transformer++: End-to-end full-page optical music recognition for pianoform sheet music,

    A. Ríos-Vila, J. Calvo-Zaragoza, D. Rizo, and T. Paquet, “Sheet music transformer++: End-to-end full-page optical music recognition for pianoform sheet music,” arXiv preprint arXiv:2405.12105, 2024

  21. [23]

    UniAudio: Towards universal audio generation with large language models,

    D. Yang, J. Tian, X. Tan, R. Huang, S. Liu, H. Guo, X. Chang, J. Shi, S. Zhao, J. Bian, Z. Zhao, X. Wu, and H. M. Meng, “UniAudio: Towards universal audio generation with large language models,” in Proceedings of the 41st International Conference on Machine Learning (R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenk...

  22. [24]

    Fugatto 1: Foundational generative audio transformer opus 1,

    R. Valle, R. Badlani, Z. Kong, S. gil Lee, A. Goel, S. Kim, J. F. Santos, S. Dai, S. Gururani, A. Aljafari, A. H. Liu, K. J. Shih, R. Prenger, W. Ping, C.-H. H. Yang, and B. Catanzaro, “Fugatto 1: Foundational generative audio transformer opus 1,” in The Thirteenth International Conference on Learning Representations, 2025

  23. [25]

    Autoregressive image generation using residual quantization,

    D. Lee, C. Kim, S. Kim, M. Cho, and W. Han, “Autoregressive image generation using residual quantization,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp. 11513–11522, IEEE, 2022

  24. [26]

    High-fidelity audio compression with improved RVQGAN,

    R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved RVQGAN,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023 (A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, an...

  25. [27]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, pp. 6000–6010, 2017

  26. [28]

    Schubert winterreise dataset: A multimodal scenario for music analysis,

    C. Weiß, F. Zalkow, V. Arifi-Müller, M. Müller, H. V. Koops, A. Volk, and H. G. Grohganz, “Schubert winterreise dataset: A multimodal scenario for music analysis,” Journal on Computing and Cultural Heritage, vol. 14, May 2021

  27. [29]

    Wagner ring dataset: A complex opera scenario for music processing and computational musicology,

    C. Weiß, V. Arifi-Müller, M. Krause, F. Zalkow, S. Klauk, R. Kleinertz, and M. Müller, “Wagner ring dataset: A complex opera scenario for music processing and computational musicology,” Transactions of the International Society for Music Information Retrieval, vol. 6, no. 1, 2023

  28. [30]

    Learning features of music from scratch,

    J. Thickstun, Z. Harchaoui, and S. Kakade, “Learning features of music from scratch,” in International Conference on Learning Representations, 2017

  29. [31]

    MIDI passage retrieval using cell phone pictures of sheet music,

    D. Yang, T. Tanprasert, T. Jenrungrot, M. Shan, and T. Tsai, “MIDI passage retrieval using cell phone pictures of sheet music,” in Proceedings of the 20th International Society for Music Information Retrieval Conference, ISMIR 2019, Delft, The Netherlands, November 4-8, 2019 (A. Flexer, G. Peeters, J. Urbano, and A. Volk, eds.), pp. 916–923, 2019

  30. [32]

    Assistive alignment of in-the-wild sheet music and performances,

    M. Feffer, C. Donahue, and Z. Lipton, “Assistive alignment of in-the-wild sheet music and performances,” in Late-Breaking/Demo of 23rd International Society for Music Information Retrieval Conference (ISMIR), 2022

  31. [33]

    Yolov8: A novel object detection algorithm with enhanced performance and robustness,

    R. Varghese and S. M., “Yolov8: A novel object detection algorithm with enhanced performance and robustness,” in 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), pp. 1–6, 2024

  32. [34]

    Claude [large language model],

    Anthropic, “Claude [large language model],” 2023

  33. [35]

    End-to-end optical music recognition for pianoform sheet music,

    A. Ríos-Vila, D. Rizo, J. M. Iñesta, and J. Calvo-Zaragoza, “End-to-end optical music recognition for pianoform sheet music,” International Journal on Document Analysis and Recognition (IJDAR), vol. 26, no. 3, pp. 347–362, 2023

  34. [36]

    The openscore lieder corpus,

    M. Gotham and P. Jonas, “The openscore lieder corpus,” in Music Encoding Conference Proceedings, Universidad de Alicante, 2021

  35. [37]

    Unaligned supervision for automatic music transcription in the wild,

    B. Maman and A. H. Bermano, “Unaligned supervision for automatic music transcription in the wild,” in Proceedings of the 39th International Conference on Machine Learning (K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, eds.), vol. 162 of Proceedings of Machine Learning Research, pp. 14918–14934, PMLR, 17–23 Jul 2022

  36. [38]

    Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity,

    E. Manilow, G. Wichern, P. Seetharaman, and J. Le Roux, “Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), IEEE, 2019

  37. [39]

    Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching

    C. Raffel, Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. PhD thesis, Columbia University, 2016

  38. [40]

    Bpsd: A coherent multi-version dataset for analyzing the first movements of beethoven’s piano sonatas,

    J. Zeitler, C. Weiß, V. Arifi-Müller, and M. Müller, “Bpsd: A coherent multi-version dataset for analyzing the first movements of beethoven’s piano sonatas,” Transactions of the International Society for Music Information Retrieval, vol. 7, no. 1, 2024.

  39. [41]

    The OpenScore Lieder Corpus,

    M. R. H. Gotham and P. Jonas, “The OpenScore Lieder Corpus,” in Music Encoding Conference Proceedings 2021 (S. Münnich and D. Rizo, eds.), pp. 131–136, Humanities Commons, 2022

  40. [42]

    KernScores — kern.humdrum.org

    “KernScores — kern.humdrum.org.” https://kern.humdrum.org/. [Ac- cessed 09-05-2025]

  41. [43]

    Decoupled weight decay regularization,

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019

  42. [44]

    Adapting frechet audio distance for generative music evaluation,

    A. Gui, H. Gamper, S. Braun, and D. Emmanouilidou, “Adapting frechet audio distance for generative music evaluation,” in Proc. IEEE ICASSP, 2024

  43. [45]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,

    Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023

  44. [46]

    The earth mover’s distance as a metric for image retrieval,

    Y. Rubner, C. Tomasi, and L. J. Guibas, “The earth mover’s distance as a metric for image retrieval,” International Journal of Computer Vision, vol. 40, pp. 99–121, Nov 2000

  45. [47]

    mir_eval: A transparent implementation of common MIR metrics,

    C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, D. P. Ellis, and C. C. Raffel, “mir_eval: A transparent implementation of common MIR metrics,” in ISMIR, vol. 10, p. 2014, 2014

  46. [48]

    Reconvat: A semi-supervised automatic music transcription framework for low-resource real-world data,

    K. W. Cheuk, D. Herremans, and L. Su, “Reconvat: A semi-supervised automatic music transcription framework for low-resource real-world data,” in Proceedings of the 29th ACM International Conference on Multimedia, pp. 3918–3926, 2021

  47. [49]

    Multi-instrument music synthesis with spectrogram diffusion,

    C. Hawthorne, I. Simon, A. Roberts, N. Zeghidour, J. Gardner, E. Manilow, and J. Engel, “Multi-instrument music synthesis with spectrogram diffusion,” in Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR 2022), 2022

  48. [50]

    Muscat: a multimodal music collection for automatic transcription of real recordings and image scores,

    A. Galan-Cuenca, J. J. Valero-Mas, J. C. Martinez-Sevilla, A. Hidalgo-Centeno, A. Pertusa, and J. Calvo-Zaragoza, “Muscat: a multimodal music collection for automatic transcription of real recordings and image scores,” in Proceedings of the 32nd ACM International Conference on Multimedia, pp. 583–591, 2024

  49. [51]

    Microsoft coco: Common objects in context,

    T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision – ECCV 2014, pp. 740–755, Springer International Publishing, 2014

  50. [52]

    Taming transformers for high-resolution image synthesis,

    P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12873–12883, June 2021

  51. [53]

    Ocr-vqgan: Taming text-within-image generation,

    J. A. Rodriguez, D. Vazquez, I. Laradji, M. Pedersoli, and P. Rodriguez, “Ocr-vqgan: Taming text-within-image generation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3689–3698, 2023.

  52. [54]

    For the top 100 such pairs by frequency, we employed Claude-3.5-Sonnet to regenerate standardized titles using a carefully designed prompt

    Title Unification: For composers with numerous pieces in our dataset, we identified potential duplicate entries by examining composer-instrumentation pairs. For the top 100 such pairs by frequency, we employed Claude-3.5-Sonnet to regenerate standardized titles using a carefully designed prompt. The model was instructed to group pieces with matching catal...

  53. [55]

    surname first) and completeness (full name vs

    Composer Name Normalization: We observed multiple representation formats for composer names across video titles, including variations in name order (given name first vs. surname first) and completeness (full name vs. surname only). To address this, we implemented a systematic normalization process based on surname matching. By identifying all variants of ...

  54. [56]

    The process begins by extracting frames at three-frame intervals and comparing each frame with its predecessor to detect visual changes

    Frame Analysis and Page List Generation: The first step constructs a comprehensive page list through sequential frame analysis. The process begins by extracting frames at three-frame intervals and comparing each frame with its predecessor to detect visual changes. When differences are detected, the algorithm records the current frame index and classifies...
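
The sampling-and-comparison step described here can be sketched as follows. This is a minimal illustration, not the authors' implementation: frames are assumed to be flat sequences of grayscale pixel values, and the names `mean_abs_diff` and `detect_page_changes` as well as the `threshold` value are hypothetical. The source does not state which difference measure or threshold was used. Each sampled frame is compared with the previously sampled one.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def detect_page_changes(
    frames: Sequence[Sequence[float]],
    step: int = 3,            # three-frame sampling interval, as in the text
    threshold: float = 10.0,  # assumed value, not taken from the source
) -> List[int]:
    """Return indices of sampled frames that differ from the previous sample."""
    change_indices: List[int] = []
    previous = None
    for index in range(0, len(frames), step):
        current = frames[index]
        if previous is not None and mean_abs_diff(current, previous) > threshold:
            change_indices.append(index)
        previous = current
    return change_indices
```

With seven dark frames followed by six bright ones, the sampled indices are 0, 3, 6, 9, 12, and the first sample that differs from its predecessor is index 9.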

  55. [57]

    For each static page segment identified in the previous step, the algorithm extracts both the stable frame image and its temporally aligned audio segment

    Audio-Image Pair Extraction: The second step processes the filtered page list to extract corresponding audio-image pairs. For each static page segment identified in the previous step, the algorithm extracts both the stable frame image and its temporally aligned audio segment
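
One way to picture the temporal alignment is to convert the frame range of a static page segment into audio sample offsets. The sketch below assumes a constant video frame rate and a fixed audio sample rate. The function names and the choice of the middle frame as the "stable" frame are illustrative assumptions, not details given in the source.

```python
from typing import Tuple


def audio_sample_span(
    start_frame: int, end_frame: int, fps: float, sample_rate: int
) -> Tuple[int, int]:
    """Map a video frame range to the matching range of audio samples."""
    if end_frame <= start_frame:
        raise ValueError("end_frame must be greater than start_frame")
    start_sample = round(start_frame / fps * sample_rate)
    end_sample = round(end_frame / fps * sample_rate)
    return start_sample, end_sample


def stable_frame_index(start_frame: int, end_frame: int) -> int:
    """Pick the middle frame of the segment (an assumed heuristic)."""
    return (start_frame + end_frame) // 2
```

For a segment spanning frames 30 to 90 of a 30 fps video with 16 kHz audio, this gives samples 16000 to 48000 and frame 60 as the page image.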

  56. [58]

    These silent segments are typically found in video introductions, outros, and transition screens between movements (distinct sections of a longer musical work)

    Silent Segment Filtering: The final step addresses segments containing silent audio, which commonly occur in specific parts of score videos. These silent segments are typically found in video introductions, outros, and transition screens between movements (distinct sections of a longer musical work). For example, a video of a multi-movement sonata mig...
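
The source does not say how silence is detected. A common approach, shown here purely as an assumption, is to threshold the root-mean-square (RMS) energy of a segment. The names `rms` and `is_silent` and the threshold value are hypothetical.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square amplitude of an audio segment (0.0 if empty)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silent(samples: Sequence[float], threshold: float = 1e-3) -> bool:
    """Treat a segment as silent when its RMS falls below an assumed threshold."""
    return rms(samples) < threshold
```

Segments for which `is_silent` returns `True` would be dropped before building audio-image pairs.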

  57. [59]

    Videos with video_intensity_v < 200 are excluded

    Pixel Intensity Metrics: We implement comprehensive intensity-based filtering using both video-level and system-level metrics: a) Video-Level Filtering: For each video v, we compute the mean of median pixel values across all systems to identify videos with poor scan quality, which typically appear grayish due to low contrast: video_intensity_v = \frac{1}{N} \sum_{i=...
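
The video-level metric, as described in prose, is the mean over systems of each system's median pixel value, with videos below 200 excluded. A minimal sketch, assuming each system is given as a flat sequence of 8-bit grayscale values and using hypothetical function names:

```python
from statistics import mean, median
from typing import Sequence

INTENSITY_THRESHOLD = 200  # videos with a lower score are excluded (from the text)


def video_intensity(systems: Sequence[Sequence[int]]) -> float:
    """Mean of the per-system median pixel values for one video."""
    return mean(median(pixels) for pixels in systems)


def keep_video(systems: Sequence[Sequence[int]]) -> bool:
    """Keep a video only if its intensity score reaches the threshold."""
    return video_intensity(systems) >= INTENSITY_THRESHOLD
```

A clean scan is mostly white background, so its per-system medians sit near 255, whereas a grayish low-contrast scan pulls the medians down and falls under the threshold.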

  58. [60]

    Systems with |height_anomaly_score_i| > 1.2 are excluded

    Dimensional Analysis: We implement three types of dimensional constraints: a) Basic Height Constraints: Systems must satisfy: 70 ≤ height ≤ 390 pixels (18), height < width (19). b) Height Anomaly Detection: To identify cases of over-detection (multiple systems detected as one) or partial detection (incomplete system capture), we compute for each system i...
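
The basic height constraints (18) and (19) translate directly into a predicate. The function name is hypothetical, and the height anomaly score is omitted because its definition is truncated in the text above.

```python
def passes_basic_height_constraints(
    height: int, width: int, min_height: int = 70, max_height: int = 390
) -> bool:
    """Check 70 <= height <= 390 pixels and height < width for one system."""
    return min_height <= height <= max_height and height < width
```

For example, a 120 × 1200 pixel system passes, while a 60-pixel-high crop or a region taller than it is wide is rejected.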

  59. [61]

    Temporal Constraints: Each segment duration d must satisfy: 3 ≤ d ≤ 20 seconds (22). This filtering serves as an additional safeguard to remove remaining noise segments that persist after silent segment filtering. While most video introductions, outros, and chapter titles are caught by silence detection, this duration-based constraint helps eliminate any r...
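
Constraint (22) amounts to a simple duration filter. The sketch below uses hypothetical names and assumes durations are given in seconds.

```python
from typing import List, Sequence


def within_duration_bounds(
    duration_s: float, min_s: float = 3.0, max_s: float = 20.0
) -> bool:
    """Check 3 <= d <= 20 seconds for one segment."""
    return min_s <= duration_s <= max_s


def filter_segments(durations_s: Sequence[float]) -> List[float]:
    """Keep only the segment durations that satisfy the bounds."""
    return [d for d in durations_s if within_duration_bounds(d)]
```

Both bounds are inclusive, so segments of exactly 3 or 20 seconds are retained.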

  60. [62]

    Our adaptation processes single-channel grayscale sheet music images, significantly reducing model complexity

    Core Architecture Modifications: The original RQ-VAE architecture was designed for RGB images with three channels. Our adaptation processes single-channel grayscale sheet music images, significantly reducing model complexity. We employ four unshared codebooks, each containing 1024 codes, with a model dimension of 256.

  61. [63]

    This ensures each token captures features at a sub-staff-line scale, crucial for precise musical notation representation

    Compression Strategy: While the original model achieved 32× compression using a channel multiplication sequence of [1, 1, 2, 2, 4, 4], we implement a modified sequence of [1, 1, 2, 2, 4] for 16× compression. This ensures each token captures features at a sub-staff-line scale, crucial for precise musical notation representation
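
The relation between the channel multiplier sequence and the compression factor can be checked with a little arithmetic. The sketch assumes the VQGAN-style convention in which each step between successive resolution levels halves the spatial size, so a sequence of length n gives a factor of 2^(n-1). This matches the two figures quoted above. The function names and the floor division for sizes that are not multiples of the factor are illustrative assumptions.

```python
from typing import Sequence, Tuple


def compression_factor(ch_mult: Sequence[int]) -> int:
    """Spatial downsampling factor implied by a channel multiplier sequence."""
    return 2 ** (len(ch_mult) - 1)


def token_grid(height: int, width: int, factor: int) -> Tuple[int, int]:
    """Size of the latent token grid for an image of the given size."""
    return height // factor, width // factor
```

Under 16× compression a 128 × 1024 pixel system maps to an 8 × 64 grid of latent positions, whereas 32× compression would leave only 4 × 32.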

  62. [64]

    The attention blocks in the original model help maintain global coherence across larger image regions

    Architectural Refinements: We removed attention blocks from both the encoder’s final downsampling layer and the decoder’s initial upsampling layer. The attention blocks in the original model help maintain global coherence across larger image regions. However, in our case, we aim to tokenize sheet music at a very local level, focusing on fine details like ...

  63. [65]

    Resolution-Adaptive Training: Sheet music presents unique challenges in image dimensions, varying from compact piano scores (around 70 pixels in height) to extensive multi-staff compositions. To effectively tokenize these diverse formats, we needed to address a key limitation in the original RQ-VAE model, which was trained on fixed resolution images and...