Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training

Daniel Dratschuk; Paul Swoboda

arxiv: 2605.10835 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-12 04:57 UTC · model grok-4.3


keywords optical music recognition · synthetic data generation · zero-shot transfer · kern encoding normalization · grammar-based decoding · music score transcription · compact neural models · historical document analysis


The pith

Transcoda uses synthetic data generation, kern normalization, and grammar decoding to train a compact model that outperforms larger baselines on both synthetic and real music scores in zero-shot settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Optical music recognition converts sheet music images into structured text but struggles due to scarce annotated real scans. The paper establishes that an advanced synthetic data pipeline, combined with forcing the kern music encoding into a single normal form and enforcing output grammar, lets a 59-million-parameter model train in six hours on one GPU. This setup yields lower error rates than billion-parameter systems on a new synthetic benchmark and on historical Polish scans. A sympathetic reader would care because it removes the need for massive real datasets and shows that careful data handling can beat scale in this domain. If correct, OMR becomes more accessible for digitizing archives without extensive labeling efforts.

Core claim

The central claim is that an end-to-end zero-shot OMR system built on advanced synthetic data generation, normalization of the kern encoding to a unique normal form, and grammar-based decoding produces a compact 59M-parameter model trainable in six hours on a single GPU that achieves 18.46 percent OMR-NED on synthetic scores (versus 43.91 percent for the next best baseline) and 63.97 percent OMR-NED on historical Polish scans (down from 80.16 percent for the prior best).
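As a quick sanity check on the headline numbers, the sketch below derives the absolute and relative OMR-NED reductions from the four figures quoted in the core claim. The helper name `reductions` is hypothetical, and the derived percentages are plain arithmetic on the reported values, not results stated by the paper.

```python
def reductions(baseline: float, ours: float) -> tuple[float, float]:
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = baseline - ours
    relative = 100.0 * absolute / baseline
    return absolute, relative


# OMR-NED values quoted in the core claim (an error rate, so lower is better).
synthetic = reductions(baseline=43.91, ours=18.46)  # next-best baseline vs. Transcoda
polish = reductions(baseline=80.16, ours=63.97)     # prior best vs. Transcoda

print(f"synthetic: {synthetic[0]:.2f} points, {synthetic[1]:.1f}% relative")
print(f"polish:    {polish[0]:.2f} points, {polish[1]:.1f}% relative")
```

The relative view makes the asymmetry visible: the gain on synthetic scores is proportionally much larger than the gain on the historical scans, which is where the zero-shot transfer question sits.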

What carries the argument

The central mechanism is the data-centric synthetic training pipeline that renders diverse scores, paired with kern normalization to eliminate one-to-many encodings and grammar-based decoding to enforce syntactic validity.

If this is right

A model more than an order of magnitude smaller (59M parameters versus billion-parameter baselines) can surpass much larger models when training data and output constraints are carefully engineered.
Training an effective OMR system can take only hours on a single GPU rather than large-scale distributed resources.
Zero-shot performance on historical documents improves substantially without any fine-tuning on real annotated examples.
Removing encoding ambiguity during training reduces decoder uncertainty and improves overall transcription reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same combination of synthetic rendering and canonical output forms could apply to other visual-to-text tasks that suffer from one-to-many mappings, such as handwritten text or chemical structure recognition.
Further gains may come from iteratively refining the synthetic generator based on failure modes observed on real scans.
This data-first approach suggests that for many recognition problems, investing in data realism and consistency can substitute for increases in model size.

Load-bearing premise

The synthetic images match the distribution of real scans closely enough for zero-shot transfer, and the kern normalization step preserves all musically relevant information without introducing new systematic errors on actual data.

What would settle it

A test set of diverse real-world music scans from multiple historical periods and styles on which Transcoda shows higher error rates than the strongest baselines would falsify the zero-shot transfer claim.

Figures

Figures reproduced from arXiv: 2605.10835 by Daniel Dratschuk, Paul Swoboda.

**Figure 2.** Transcoda architecture. A ConvNeXt-V2 encoder feeds projected visual features with 2D…

**Figure 3.** The **kern format encodes sheet music as a text grid. Rows represent simultaneous time steps and columns represent parallel voices. Note tokens define pitch and duration, while dot tokens (.) act as placeholders to keep the voices aligned across different rhythms. We chose **kern over the ubiquitous MusicXML format, since the latter relies on heavy XML boilerplate, which expands sequence lengths and drast…
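The grid layout described in the Figure 3 caption can be illustrated with a small parser. This is a minimal sketch, assuming a fixed number of tab-separated spines with no spine splits or merges; the two-voice excerpt and the helper names `parse_grid` and `spine_events` are made up for illustration and do not come from the paper.

```python
# Hypothetical two-voice excerpt: columns are voices, rows are time steps.
KERN = (
    "**kern\t**kern\n"
    "4C\t8e\n"
    ".\t8g\n"      # "." keeps the left voice aligned while the right voice moves
    "4G\t4cc\n"
    "*-\t*-\n"
)


def parse_grid(text: str) -> list[list[str]]:
    """Split Humdrum text into rows of tab-separated tokens, skipping comments."""
    return [
        line.split("\t")
        for line in text.splitlines()
        if line and not line.startswith("!")
    ]


def spine_events(grid: list[list[str]], col: int) -> list[str]:
    """Return the note tokens of one column, dropping placeholders,
    interpretation records (starting with "*") and barlines (starting with "=")."""
    return [
        row[col]
        for row in grid
        if row[col] != "." and not row[col].startswith(("*", "="))
    ]


grid = parse_grid(KERN)
print(spine_events(grid, 0))
print(spine_events(grid, 1))
```

Reading one column top to bottom recovers a single voice, while reading one row left to right recovers everything that sounds at that time step.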

**Figure 4.** A single visual chord can have multiple valid **kern strings. We sort pitches into one normalized sequence to reduce uncertainty.

Stage 2: Filtering. We discard files with broken UTF-8, missing spine terminators, missing clefs, severe conversion artifacts, impossible accidental runs, corrupted octave spellings, or invalid measure mathematics. Stage 3: Target normalization. The **kern format allows for di…
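The pitch-sorting idea in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' normalizer: it assumes chord notes are space-separated inside one token, orders them from lowest to highest pitch (the paper's actual ordering rule is not given in the excerpt), and uses the standard **kern convention that lowercase letters start at middle C and uppercase letters descend below it. The names `pitch_key` and `normalize_chord` are hypothetical.

```python
import re

# Diatonic step of each pitch letter within an octave.
_STEPS = {"c": 0, "d": 1, "e": 2, "f": 3, "g": 4, "a": 5, "b": 6}
# Pitch letters (repeated to shift octave) plus optional accidentals.
_NOTE = re.compile(r"([a-g]+|[A-G]+)(#+|-+|n)?")


def pitch_key(note: str) -> tuple[int, int]:
    """Sort key for one **kern note: (diatonic height, accidental offset)."""
    match = _NOTE.search(note)
    if match is None:
        raise ValueError(f"no pitch found in {note!r}")
    letters, accidental = match.group(1), match.group(2) or ""
    step = _STEPS[letters[0].lower()]
    # "c" is octave 4, "cc" octave 5; "C" is octave 3, "CC" octave 2.
    octave = 3 + len(letters) if letters[0].islower() else 4 - len(letters)
    alter = accidental.count("#") - accidental.count("-")
    return octave * 7 + step, alter


def normalize_chord(token: str) -> str:
    """Rewrite a chord token so its notes appear in ascending pitch order."""
    return " ".join(sorted(token.split(" "), key=pitch_key))


print(normalize_chord("4g 4c 4e"))
print(normalize_chord("4cc 4G 4e"))
```

Because every permutation of the same notes maps to one string, the decoder sees a single target per visual chord instead of several equally valid ones.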

**Figure 5.** Multi-stage data augmentation. To bridge the sim-to-real gap without altering the **kern target, we apply sequential transformations from stage 4 of our data pipeline.

3.4 Constrained decoding. Some downstream tools require formally valid **kern, for example rendering, which might break down when presented with small syntactical errors. For this case, Transcoda includes an optional constrained decoder. A GB…
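The general idea behind constrained decoding can be shown with a toy example: at each step, tokens the grammar does not allow are masked out before the best-scoring token is picked. The sketch below is an assumption-laden illustration, not Transcoda's decoder. The six-token vocabulary, the single-spine rules in `allowed_next`, and the hand-written scores are all hypothetical.

```python
NEG_INF = float("-inf")


def allowed_next(prefix: list[str]) -> set[str]:
    """Toy single-spine rules: open with **kern, close with *-, then stop."""
    if not prefix:
        return {"**kern"}
    if prefix[-1] == "*-":
        return {"<eos>"}
    return {"4c", "4e", "=", "*-"}


def constrained_greedy(step_scores: list[dict[str, float]]) -> list[str]:
    """Greedy decoding where disallowed tokens are masked to -inf."""
    output: list[str] = []
    for scores in step_scores:
        allowed = allowed_next(output)
        masked = {
            tok: (score if tok in allowed else NEG_INF)
            for tok, score in scores.items()
        }
        best = max(masked, key=masked.get)
        if best == "<eos>":
            break
        output.append(best)
    return output


# Hand-written scores standing in for model logits at four decoding steps.
scores = [
    {"**kern": 0.1, "4c": 0.9, "4e": 0.0, "=": 0.0, "*-": 0.0, "<eos>": 0.0},
    {"**kern": 0.8, "4c": 0.5, "4e": 0.1, "=": 0.0, "*-": 0.0, "<eos>": 0.0},
    {"**kern": 0.0, "4c": 0.2, "4e": 0.1, "=": 0.0, "*-": 0.9, "<eos>": 0.0},
    {"**kern": 0.0, "4c": 0.9, "4e": 0.0, "=": 0.0, "*-": 0.0, "<eos>": 0.1},
]
decoded = constrained_greedy(scores)
print(decoded)
```

In the first and last steps the unconstrained argmax would be `4c`, which would yield a spine with no header and trailing material after the terminator; the mask forces a well-formed sequence instead. A production system would derive the allowed set from a full grammar rather than three hand-written rules.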

**Figure 6.** Qualitative comparison on a physical scan of Bach’s Duetto No. 1 in E minor (BWV 802).

**Figure 7.** Examples from the historical Polish scan benchmark.

Original abstract

Optical Music Recognition (OMR), the task of transcribing sheet music into a structured textual representation, is currently bottlenecked by a lack of large-scale, annotated datasets of real scans. This forces models to rely on either few-shot transfer or synthetic training pipelines that remain overly simplistic. A secondary challenge is encoding non-uniqueness: in the popular Humdrum **kern format for transcribing music, multiple different text encodings can render into the same visual sheet music. This one-to-many mapping creates a harder learning task and introduces high uncertainty during decoding. We propose Transcoda, an OMR system built on (i) an advanced synthetic data generation pipeline, (ii) a normalization of the **kern encoding to enforce a unique normal form and (iii) grammar-based decoding to ensure the syntactic correctness of the output. This approach allows us to train a compact 59M-parameter model in just 6 hours on a single GPU that outperforms billion-parameter baselines. Transcoda achieves the best score among state of the art baselines on a newly curated benchmark of synthetically rendered scores at 18.46% OMR-NED (compared to 43.91% for the next-best system, Legato) and reduces the error rate on historical Polish scans to 63.97% OMR-NED (down from 80.16% for SMT++).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Transcoda shows a compact model trained on advanced synthetic data plus kern normalization and grammar decoding can beat larger baselines on OMR benchmarks, but the zero-shot historical gains rest on an unverified assumption that the synthetic distribution matches real scan degradations.


The core takeaway is that this paper delivers a working end-to-end OMR pipeline that trains fast and reports lower error rates than much bigger models on both synthetic and historical test sets. The authors normalize kern to a single form to remove encoding ambiguity, add grammar constraints at decode time, and rely on a more detailed synthetic generation process instead of real labeled scans. That combination lets a 59M-parameter model train in six hours on one GPU and reach 18.46% OMR-NED on their new synthetic benchmark versus 43.91% for Legato, and 63.97% on Polish historical scans versus 80.16% for SMT++.

Referee Report

3 major / 2 minor

Summary. The paper introduces Transcoda, an end-to-end OMR system trained exclusively on synthetically generated data. It combines an advanced synthetic data pipeline, normalization of **kern to a unique normal form to resolve encoding non-uniqueness, and grammar-based decoding for syntactic validity. A compact 59M-parameter model is trained in 6 hours on one GPU and is claimed to outperform larger baselines, achieving 18.46% OMR-NED on a new synthetic benchmark (vs. 43.91% for Legato) and 63.97% OMR-NED on historical Polish scans in zero-shot setting (vs. 80.16% for SMT++).

Significance. If the zero-shot transfer holds, the work would be significant for OMR by demonstrating that careful data-centric synthetic generation can overcome the annotated-data bottleneck and enable compact models to surpass billion-parameter systems. The normalization and grammar constraints directly address a known source of decoding uncertainty in **kern. The reported training efficiency and parameter count deserve credit; they are concrete strengths if the empirical claims are supported.

major comments (3)

[Abstract / Results] Abstract and experimental results: The headline zero-shot result on historical Polish scans (63.97% OMR-NED) rests on the unverified assumption that the synthetic training distribution matches real-scan degradations (ink density, texture, engraving variation). No quantitative domain-similarity metrics (FID, style statistics, or error-pattern analysis) are reported, which is load-bearing for interpreting the improvement over SMT++ as robust transfer rather than test-set artifact.
[Experimental evaluation] Evaluation protocol: No details are given on how OMR-NED is computed, how baselines (Legato, SMT++) were re-implemented or trained, whether synthetic test data were checked for leakage from the training pipeline, or whether error bars / multiple seeds were used. These omissions directly affect confidence in the reported 18.46% and 63.97% figures.
[Method (normalization step)] Kern normalization: The claim that collapsing **kern to a unique normal form preserves all musically relevant information is central to both training and zero-shot decoding, yet no analysis or counter-examples are provided showing that real ambiguous encodings are not systematically altered.

minor comments (2)

[Abstract] The abstract refers to 'a newly curated benchmark' without stating whether the data or generation code will be released.
[Introduction / Method] OMR-NED should be defined on first use in the main text even if standard in the community.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications and committing to specific revisions that strengthen the manuscript without altering its core claims.


Referee: [Abstract / Results] Abstract and experimental results: The headline zero-shot result on historical Polish scans (63.97% OMR-NED) rests on the unverified assumption that the synthetic training distribution matches real-scan degradations (ink density, texture, engraving variation). No quantitative domain-similarity metrics (FID, style statistics, or error-pattern analysis) are reported, which is load-bearing for interpreting the improvement over SMT++ as robust transfer rather than test-set artifact.

Authors: We agree that explicit domain-similarity analysis would increase confidence in the zero-shot transfer. The synthetic pipeline incorporates controlled variations in ink density, texture, and engraving to approximate historical degradations, and the 16-point absolute improvement over SMT++ on real Polish scans provides supporting evidence. In the revised version we will add a new subsection with qualitative side-by-side comparisons of synthetic and real degradations together with an error-pattern breakdown (e.g., common substitution types) to make the domain alignment explicit.
revision: yes

Referee: [Experimental evaluation] Evaluation protocol: No details are given on how OMR-NED is computed, how baselines (Legato, SMT++) were re-implemented or trained, whether synthetic test data were checked for leakage from the training pipeline, or whether error bars / multiple seeds were used. These omissions directly affect confidence in the reported 18.46% and 63.97% figures.

Authors: We will expand the experimental section to document the exact OMR-NED definition (token-level Levenshtein distance normalized by reference length, following established OMR practice), the re-implementation procedure for each baseline (official repositories with identical synthetic training data), the independent generation of the synthetic test split to preclude leakage, and results averaged over three random seeds with standard deviations.
revision: yes

Referee: [Method (normalization step)] Kern normalization: The claim that collapsing **kern to a unique normal form preserves all musically relevant information is central to both training and zero-shot decoding, yet no analysis or counter-examples are provided showing that real ambiguous encodings are not systematically altered.

Authors: We will augment the normalization section with concrete counter-examples of ambiguous **kern strings (different duration and articulation encodings that render identically) and demonstrate via round-trip rendering that the chosen normal form is information-preserving for musically valid content. This analysis will be supported by references to Humdrum semantics and will clarify that the mapping is applied only after confirming syntactic validity.
revision: yes
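The second response above describes OMR-NED as a token-level Levenshtein distance normalized by reference length. A minimal sketch of that description follows. It assumes whitespace tokenization and a percentage scale, both of which are illustrative choices. The function names are hypothetical, and since the definition comes from a simulated rebuttal, the metric actually used in the paper may differ.

```python
def levenshtein(ref: list[str], hyp: list[str]) -> int:
    """Edit distance over token sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def ned_percent(reference: str, hypothesis: str) -> float:
    """Token-level edit distance divided by reference length, in percent."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return 100.0 * levenshtein(ref, hyp) / len(ref)


print(ned_percent("4c 4e 4g 4cc", "4c 4f 4g 4cc"))
```

Under this normalization a hypothesis much longer than the reference can score above 100 percent, which is worth keeping in mind when reading error rates in the 60 to 80 percent range.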

Circularity Check

0 steps flagged

No circularity: empirical benchmarks on held-out data


The paper presents an end-to-end empirical pipeline (synthetic data generation, kern normalization to unique form, grammar-based decoding) trained on synthetic images and evaluated via direct OMR-NED metrics on separate synthetic and historical Polish scan benchmarks. No equations, derivations, or parameter fits are described that reduce reported error rates to self-referential inputs or prior self-citations; the performance numbers (18.46% vs. 43.91%, 63.97% vs. 80.16%) are straightforward held-out comparisons without any reduction by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two domain assumptions: that synthetic data can be made distributionally close enough to real scans for zero-shot success, and that kern normalization to a unique form is information-preserving and musically valid. No free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption: Synthetic data generation can produce training examples whose visual statistics match real sheet music scans closely enough for zero-shot transfer.
Invoked to justify training exclusively on synthetic data while claiming performance on real historical scans.

domain assumption: The kern encoding admits a unique normal form that resolves the one-to-many mapping without loss of musically relevant information.
Central to the normalization step that is presented as solving the non-uniqueness problem.



Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 1 internal anchor

[1] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.
[2] Jorge Calvo-Zaragoza, Alejandro H. Toselli, and Enrique Vidal. Handwritten music recognition for mensural notation with convolutional recurrent neural networks. Pattern Recognit. Lett., 128:115–121, 2019. doi: 10.1016/j.patrec.2019.08.021. URL https://doi.org/10.1016/j.patrec.2019.08.021
[3] Denis Coquenet, Clément Chatelain, and Thierry Paquet. DAN: a segmentation-free document attention network for handwritten document recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7):8227–8243, 2023.
[4] Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen. XGrammar: Flexible and efficient structured generation engine for large language models. CoRR, abs/2411.15100,
[6] Sachinda Edirisooriya, Hao-Wen Dong, Julian McAuley, and Taylor Berg-Kirkpatrick. An empirical evaluation of end-to-end polyphonic optical music recognition. arXiv preprint arXiv:2108.01769, 2021.
[7] Mark R. H. Gotham, Maureen Redbond, Bruno Bower, and Peter Jonas. The “OpenScore String Quartet” Corpus. In Proceedings of the 10th International Conference on Digital Libraries for Musicology, pages 49–57, Milan, Italy, November 2023. ACM. ISBN 9798400708336. doi: 10.1145/3625135.3625155. URL https://dl.acm.org/doi/10.1145/3625135.3625155
[8] Mark Robert Haigh Gotham and Peter Jonas. The OpenScore Lieder Corpus. In Stefan Münnich and David Rizo, editors, Music Encoding Conference Proceedings 2021, pages 131–136. Humanities Commons, 2022. doi: 10.17613/1my2-dm23
[9] Alexander Groleau, Kok Wei Chee, Stefan Larson, Samay Maini, and Jonathan Boarman. Augraphy: A data augmentation library for document images. In Proceedings of the 17th International Conference on Document Analysis and Recognition (ICDAR), 2023. URL https://arxiv.org/pdf/2208.14558.pdf
[10] David Huron. Humdrum and kern: selective feature encoding. In Eleanor Selfridge-Field, editor, Beyond MIDI: The Handbook of Musical Codes, pages 375–401. MIT Press, Cambridge, MA, USA, 1997. ISBN 0262193949.
[11] Jan Hajič Jr., Jiří Novotný, Pavel Pecina, and Jaroslav Pokorný. Further steps towards a standard testbed for optical music recognition. In Proceedings of the 17th International Society for Music Information Retrieval Conference, pages 157–163, 2016. doi: 10.5281/zenodo.1418161. URL https://doi.org/10.5281/zenodo.1418161
[12] Yixuan Li, Huaping Liu, Qiang Jin, Miaomiao Cai, and Peng Li. Tromr: transformer-based polyphonic optical music recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023, pages 1–5. IEEE, 2023. doi: 10.1109/ICASSP49357.2023.10096055. URL https://doi.org/10.1109/ICASSP49357....
[13] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 11966–11976. IEEE, 2022. doi: 10.1109/CVPR52688.2022.01167. URL https://doi.org/10.1109/CVPR52688.2022.01167
[14] Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, and Julian McAuley. PDMX: A large-scale public domain musicxml dataset for symbolic music processing. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2025. doi: 10.1109/ICASSP49660.2025.10890217
[15] Juan C. Martinez-Sevilla, Joan Cerveto-Serrano, Noelia N. Luna-Barahona, Greg Chapman, Craig Sapp, David Rizo, and Jorge Calvo-Zaragoza. Sheet music benchmark: Standardized optical music recognition evaluation. In Proceedings of the 26th International Society for Music Information Retrieval Conference, ISMIR 2025, Daejeon, South Korea, September 21-25, 202...
[16] Meta Llama. Llama 3.2 11B Vision. https://huggingface.co/meta-llama/Llama-3.2-11B-Vision, 2024. Hugging Face model card, accessed 2026-05-06.
[17] MuseTrainer Contributors. MuseTrainer Library. https://github.com/musetrainer/library, 2024.
[18] Alexander Pacha and Horst Eidenberger. Towards a universal music symbol classifier. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 2, pages 35–36. IEEE, 2017.
[19] PRAIG. polish-scores, 2025. URL https://huggingface.co/datasets/PRAIG/polish-scores. Hugging Face dataset, accessed 2026-03-14.
[20] The Augraphy Project. Augraphy: an augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes. URL https://github.com/sparkfish/augraphy
[21] Project Petrucci LLC. International music score library project (IMSLP), 2026. URL https://imslp.org/. Accessed: 2026-05-06.
[22] Laurent Pugin and Verovio Contributors. verovio: A library and toolkit for engraving mei music notation into svg, 2026. URL https://pypi.org/project/verovio/
[23] Laurent Pugin, Rodolfo Zitellini, and Perry Roland. Verovio: A library for engraving mei music notation into svg. In Proceedings of the 15th International Society for Music Information Retrieval Conference, pages 107–112, 2014. URL http://www.terasoft.com.tw/conf/ismir2014/proceedings/T020_221_Paper.pdf. URL listed by the Verovio reference book; unavailabl...
[24] Ana Rebelo, Ichiro Fujinaga, Filipe Paszkiewicz, André R. S. Marçal, Carlos Guedes, and Jaime S. Cardoso. Optical music recognition: state-of-the-art and open issues. Int. J. Multim. Inf. Retr., 1(3):173–190, 2012. doi: 10.1007/s13735-012-0004-6. URL https://doi.org/10.1007/s13735-012-0004-6
[25] Antonio Ríos-Vila, Jorge Calvo-Zaragoza, and Thierry Paquet. Sheet music transformer: End-to-end optical music recognition beyond monophonic transcription. In Document Analysis and Recognition - ICDAR 2024 - 18th International Conference, Athens, Greece, August 30 - September 4, 2024, Proceedings, Part VI, volume 14809 of Lecture Notes in Computer Science, ...
[26] Antonio Ríos-Vila, Jorge Calvo-Zaragoza, David Rizo, and Thierry Paquet. End-to-end full-page optical music recognition for pianoform sheet music. Int. J. Comput. Vis., 134(2):49, 2026. doi: 10.1007/s11263-025-02654-6. URL https://doi.org/10.1007/s11263-025-02654-6
[27] S. S. Singh and Sergey Karayev. Full page handwriting recognition via image to sequence extraction. In Document Analysis and Recognition - ICDAR 2021 - 16th International Conference, Lausanne, Switzerland, September 5-10, 2021, Proceedings, Part III, volume 12823 of Lecture Notes in Computer Science, pages 55–69. Springer, 2021.
[28] Pau Torras, Arnau Baró, Lei Kang, and Alicia Fornés. On the integration of language models into sequence to sequence architectures for handwritten music recognition. In Jin Ha Lee, Alexander Lerch, Zhiyao Duan, Juhan Nam, Preeti Rao, Peter van Kranenburg, and Ajay Srinivasamurthy, editors, Proceedings of the 22nd International Society for Music Information...
[29] Chris Walshaw. The ABC Music Standard 2.1, 2011. URL https://michaeleskin.com/abctools/abc_standard_v2.1.pdf
[30] Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. Convnext v2: Co-designing and scaling convnets with masked autoencoders, 2023. URL https://arxiv.org/abs/2301.00808
[31] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling, 2022. URL https://arxiv.org/abs/2111.09886
[32] Weihan Xu, Julian McAuley, Taylor Berg-Kirkpatrick, Shlomo Dubnov, and Hao-Wen Dong. Generating symbolic music from natural language prompts using an llm-enhanced dataset. arXiv preprint arXiv:2410.02084, 2024.
[33] Guang Yang, Muru Zhang, Lin Qiu, Yanming Wan, and Noah A Smith. Toward a more complete omr solution. arXiv preprint arXiv:2409.00316, 2024.
[34] Guang Yang, Victoria Ebert, Nazif Tamer, Luiza Pozzobon, and Noah A. Smith. LEGATO: large-scale end-to-end generalizable approach to typeset OMR. CoRR, abs/2506.19065, 2025. doi: 10.48550/ARXIV.2506.19065. URL https://doi.org/10.48550/arXiv.2506.19065
[35] Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989. doi: 10.1137/0218082. URL https://doi.org/10.1137/0218082

A Synthetic Rendering Settings
Training images are rendered with Verovio 6.0.1. For each example, we sample the rendering option...

[1] [1]

G. Bradski. The OpenCV Library.Dr . Dobb’s Journal of Software Tools, 2000

work page 2000

[2] [2]

Toselli, and Enrique Vidal

Jorge Calvo-Zaragoza, Alejandro H. Toselli, and Enrique Vidal. Handwritten music recognition for mensural notation with convolutional recurrent neural networks.Pattern Recognit. Lett., 128:115–121, 2019. doi: 10.1016/J.PATREC.2019.08.021. URLhttps://doi.org/10.1016/j.patrec.2019.08.021

work page doi:10.1016/j.patrec.2019.08.021 2019

[3] [3]

DAN: a segmentation-free document attention network for handwritten document recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7):8227–8243, 2023

Denis Coquenet, Clément Chatelain, and Thierry Paquet. DAN: a segmentation-free document attention network for handwritten document recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7):8227–8243, 2023. DBLP-verified title and venue

work page 2023

[4] [4]

Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen

Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen. XGrammar: Flexible and efficient structured generation engine for large language models.CoRR, abs/2411.15100,

work page arXiv

[5] [6]

An empirical evaluation of end-to-end polyphonic optical music recognition.arXiv preprint arXiv:2108.01769, 2021

Sachinda Edirisooriya, Hao-Wen Dong, Julian McAuley, and Taylor Berg-Kirkpatrick. An empirical evaluation of end-to-end polyphonic optical music recognition.arXiv preprint arXiv:2108.01769, 2021

work page arXiv 2021

[6] [7]

OpenScore String Quartet

Mark R. H. Gotham, Maureen Redbond, Bruno Bower, and Peter Jonas. The “OpenScore String Quartet” Corpus. InProceedings of the 10th International Conference on Digital Libraries for Musicology, pages 49–57, Milan Italy, November 2023. ACM. ISBN 9798400708336. doi: 10.1145/3625135.3625155. URL https://dl.acm.org/doi/10.1145/3625135.3625155

work page doi:10.1145/3625135.3625155 2023

[7] [8]

The OpenScore Lieder Corpus

Mark Robert Haigh Gotham and Peter Jonas. The OpenScore Lieder Corpus. In Stefan Münnich and David Rizo, editors,Music Encoding Conference Proceedings 2021, pages 131–136. Humanities Commons, 2022. doi: 10.17613/1my2-dm23

work page doi:10.17613/1my2-dm23 2021

[8] [9]

Augraphy: A data augmentation library for document images

Alexander Groleau, Kok Wei Chee, Stefan Larson, Samay Maini, and Jonathan Boarman. Augraphy: A data augmentation library for document images. InProceedings of the 17th International Conference on Document Analysis and Recognition (ICDAR), 2023. URL https://arxiv.org/pdf/2208.14558.pdf

work page arXiv 2023

[9] [10]

Humdrum and kern: selective feature encoding

David Huron. Humdrum and kern: selective feature encoding. In Eleanor Selfridge-Field, editor,Beyond MIDI: The Handbook of Musical Codes, pages 375–401. MIT Press, Cambridge, MA, USA, 1997. ISBN 0262193949

work page 1997

[10] [11]

Further steps towards a standard testbed for optical music recognition

Jan Hajiˇc Jr., Jiˇrí Novotný, Pavel Pecina, and Jaroslav Pokorný. Further steps towards a standard testbed for optical music recognition. InProceedings of the 17th International Society for Music Information Retrieval Conference, pages 157–163, 2016. doi: 10.5281/ZENODO.1418161. URL https://doi.org/ 10.5281/zenodo.1418161

work page doi:10.5281/zenodo.1418161 2016

[11] [12]

In: ICASSP 2023 - 2023 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), pp

Yixuan Li, Huaping Liu, Qiang Jin, Miaomiao Cai, and Peng Li. Tromr:transformer-based polyphonic optical music recognition. InIEEE International Conference on Acoustics, Speech and Signal Process- ing ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023, pages 1–5. IEEE, 2023. doi: 10.1109/ ICASSP49357.2023.10096055. URLhttps://doi.org/10.1109/ICASSP49357....

work page doi:10.1109/icassp49357.2023.10096055 2023

[12] [13]

In: CVPR

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 11966–11976. IEEE, 2022. doi: 10.1109/CVPR52688. 2022.01167. URLhttps://doi.org/10.1109/CVPR52688.2022.01167

work page doi:10.1109/cvpr52688 2022

[13] [14]

Masked image pretraining on language assisted representation

Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, and Julian McAuley. PDMX: A large-scale public domain musicxml dataset for symbolic music processing. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2025. doi: 10.1109/ICASSP49660.2025.10890217

work page arXiv 2025

[14] [15]

Sheet music benchmark: Standardized optical music recognition evaluation

Juan C. Martinez-Sevilla, Joan Cerveto-Serrano, Noelia N. Luna-Barahona, Greg Chapman, Craig Sapp, David Rizo, and Jorge Calvo-Zaragoza. Sheet music benchmark: Standardized optical music recognition evaluation. In Proceedings of the 26th International Society for Music Information Retrieval Conference, ISMIR 2025, Daejeon, South Korea, September 21-25, 202...

work page doi:10.5281/zenodo 2025

[15] [16]

Llama 3.2 11B Vision

Meta Llama. Llama 3.2 11B Vision. https://huggingface.co/meta-llama/Llama-3.2-11B-Vision, 2024. Hugging Face model card, accessed 2026-05-06

work page 2024

[16] [17]

MuseTrainer Library

MuseTrainer Contributors. MuseTrainer Library. https://github.com/musetrainer/library, 2024.

work page 2024

[17] [18]

Towards a universal music symbol classifier

Alexander Pacha and Horst Eidenberger. Towards a universal music symbol classifier. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 2, pages 35–36. IEEE, 2017

work page 2017

[18] [19]

polish-scores, 2025

PRAIG. polish-scores, 2025. URL https://huggingface.co/datasets/PRAIG/polish-scores. Hugging Face dataset, accessed 2026-03-14

work page 2025

[19] [20]

Augraphy: an augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes

The Augraphy Project. Augraphy: an augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes. URL https://github.com/sparkfish/augraphy

work page

[20] [21]

International music score library project (IMSLP), 2026

Project Petrucci LLC. International music score library project (IMSLP), 2026. URL https://imslp.org/. Accessed: 2026-05-06

work page 2026

[21] [22]

verovio: A library and toolkit for engraving mei music notation into svg, 2026

Laurent Pugin and Verovio Contributors. verovio: A library and toolkit for engraving mei music notation into svg, 2026. URL https://pypi.org/project/verovio/

work page 2026

[22] [23]

Verovio: A library for engraving mei music notation into svg

Laurent Pugin, Rodolfo Zitellini, and Perry Roland. Verovio: A library for engraving mei music notation into svg. In Proceedings of the 15th International Society for Music Information Retrieval Conference, pages 107–112, 2014. URL http://www.terasoft.com.tw/conf/ismir2014/proceedings/T020_221_Paper.pdf. URL listed by the Verovio reference book; unavailabl...

work page 2014

[23] [24]

Ana Rebelo, Ichiro Fujinaga, Filipe Paszkiewicz, André R. S. Marçal, Carlos Guedes, and Jaime S. Cardoso. Optical music recognition: state-of-the-art and open issues. Int. J. Multim. Inf. Retr., 1(3):173–190, 2012. doi: 10.1007/S13735-012-0004-6. URL https://doi.org/10.1007/s13735-012-0004-6

work page doi:10.1007/s13735-012-0004-6 2012

[24] [25]

Sheet music transformer: End-to-end optical music recognition beyond monophonic transcription

Antonio Ríos-Vila, Jorge Calvo-Zaragoza, and Thierry Paquet. Sheet music transformer: End-to-end optical music recognition beyond monophonic transcription. In Document Analysis and Recognition - ICDAR 2024 - 18th International Conference, Athens, Greece, August 30 - September 4, 2024, Proceedings, Part VI, volume 14809 of Lecture Notes in Computer Science, ...

work page doi:10.1007/978-3-031-70552-6_2 2024

[25] [26]

End-to-end full-page optical music recognition for pianoform sheet music

Antonio Ríos-Vila, Jorge Calvo-Zaragoza, David Rizo, and Thierry Paquet. End-to-end full-page optical music recognition for pianoform sheet music. Int. J. Comput. Vis., 134(2):49, 2026. doi: 10.1007/S11263-025-02654-6. URL https://doi.org/10.1007/s11263-025-02654-6

work page doi:10.1007/s11263-025-02654-6 2026

[26] [27]

S. S. Singh and Sergey Karayev. Full page handwriting recognition via image to sequence extraction. In Document Analysis and Recognition - ICDAR 2021 - 16th International Conference, Lausanne, Switzerland, September 5-10, 2021, Proceedings, Part III, volume 12823 of Lecture Notes in Computer Science, pages 55–69. Springer, 2021. DBLP-verified title and venue

work page 2021

[27] [28]

On the integration of language models into sequence to sequence architectures for handwritten music recognition

Pau Torras, Arnau Baró, Lei Kang, and Alicia Fornés. On the integration of language models into sequence to sequence architectures for handwritten music recognition. In Jin Ha Lee, Alexander Lerch, Zhiyao Duan, Juhan Nam, Preeti Rao, Peter van Kranenburg, and Ajay Srinivasamurthy, editors, Proceedings of the 22nd International Society for Music Information...

work page 2021

[28] [29]

The ABC Music Standard 2.1, 2011

Chris Walshaw. The ABC Music Standard 2.1, 2011. URL https://michaeleskin.com/abctools/abc_standard_v2.1.pdf

work page 2011

[29] [30]

Convnext v2: Co-designing and scaling convnets with masked autoencoders

Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. Convnext v2: Co-designing and scaling convnets with masked autoencoders, 2023. URL https://arxiv.org/abs/2301.00808

work page arXiv 2023

[30] [31]

Simmim: A simple framework for masked image modeling

Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling, 2022. URL https://arxiv.org/abs/2111.09886

work page arXiv 2022

[31] [32]

Generating symbolic music from natural language prompts using an llm-enhanced dataset

Weihan Xu, Julian McAuley, Taylor Berg-Kirkpatrick, Shlomo Dubnov, and Hao-Wen Dong. Generating symbolic music from natural language prompts using an llm-enhanced dataset. arXiv preprint arXiv:2410.02084, 2024

work page arXiv 2024

[32] [33]

Toward a more complete omr solution

Guang Yang, Muru Zhang, Lin Qiu, Yanming Wan, and Noah A Smith. Toward a more complete omr solution. arXiv preprint arXiv:2409.00316, 2024

work page arXiv 2024

[33] [34]

Guang Yang, Victoria Ebert, Nazif Tamer, Luiza Pozzobon, and Noah A. Smith. LEGATO: large-scale end-to-end generalizable approach to typeset OMR. CoRR, abs/2506.19065, 2025. doi: 10.48550/ARXIV.2506.19065. URL https://doi.org/10.48550/arXiv.2506.19065

work page internal anchor Pith review doi:10.48550/arxiv 2025

[34] [35]

Simple fast algorithms for the editing distance between trees and related problems

Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989. doi: 10.1137/0218082. URL https://doi.org/10.1137/0218082.

A Synthetic Rendering Settings

Training images are rendered with Verovio 6.0.1. For each example, we sample the rendering option...

work page doi:10.1137/0218082 1989