Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training
Pith reviewed 2026-05-12 04:57 UTC · model grok-4.3
The pith
Transcoda uses synthetic data generation, kern normalization, and grammar decoding to train a compact model that outperforms larger baselines on both synthetic and real music scores in zero-shot settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an end-to-end zero-shot OMR system can be built from three ingredients: advanced synthetic data generation, normalization of the kern encoding to a unique normal form, and grammar-based decoding. The resulting compact 59M-parameter model trains in six hours on a single GPU and achieves 18.46% OMR-NED on synthetic scores (versus 43.91% for the next-best baseline) and 63.97% OMR-NED on historical Polish scans (down from 80.16% for the prior best).
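To make the size of these gaps concrete, the short Python sketch below recomputes the absolute and relative error reductions from the four OMR-NED figures quoted above. Only the four numbers come from the paper; the helper name `reduction` is hypothetical, and lower OMR-NED is taken to be better.

```python
def reduction(baseline: float, ours: float) -> tuple[float, float]:
    """Return (absolute reduction in points, relative reduction in percent)."""
    absolute = baseline - ours
    relative = 100.0 * absolute / baseline
    return absolute, relative

# OMR-NED figures (percent) quoted in the core claim.
synthetic = reduction(baseline=43.91, ours=18.46)
polish = reduction(baseline=80.16, ours=63.97)

print(f"synthetic: -{synthetic[0]:.2f} points ({synthetic[1]:.1f}% relative)")
print(f"polish:    -{polish[0]:.2f} points ({polish[1]:.1f}% relative)")
```

On these numbers the relative reduction is roughly 58% on the synthetic benchmark and roughly 20% on the Polish scans, so the zero-shot gain on real data is the smaller of the two.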
What carries the argument
The central mechanism is the data-centric synthetic training pipeline that renders diverse scores, paired with kern normalization to eliminate one-to-many encodings and grammar-based decoding to enforce syntactic validity.
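The idea of a unique normal form can be illustrated with a toy canonicalizer. The sketch below assumes, purely for illustration, that one source of non-uniqueness is the order of characters inside a single note token; it is not the paper's normalization procedure, and the character classes, the chosen order, and the name `normalize_token` are all hypothetical simplifications of **kern.

```python
# Simplified character classes for one note token (an assumption, not full **kern).
DURATION = set("0123456789")
DOTS = {"."}
PITCH = set("abcdefgABCDEFGr")   # pitch letters, plus "r" for a rest
ACCIDENTAL = set("#-n")          # sharp, flat, natural

# Hypothetical canonical order: duration, dots, pitch, accidentals, then the rest.
CLASSES = [DURATION, DOTS, PITCH, ACCIDENTAL]

def normalize_token(token: str) -> str:
    """Reorder the characters of one simplified note token into a fixed order."""
    buckets = [[] for _ in CLASSES]
    rest = []
    for ch in token:
        for bucket, chars in zip(buckets, CLASSES):
            if ch in chars:
                bucket.append(ch)   # keep relative order inside a class (e.g. "cc")
                break
        else:
            rest.append(ch)         # anything unclassified, e.g. articulation marks
    ordered = [ch for bucket in buckets for ch in bucket]
    return "".join(ordered + sorted(rest))

# Two spellings of the same toy token map to one normal form.
print(normalize_token("c#4"), normalize_token("4c#"))
```

Because every spelling of a token is sent to the same string, a model trained on normalized targets sees exactly one correct output per image, which is the property the paragraph above attributes to the normalization step. Applying the function twice gives the same result as applying it once.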
If this is right
- A model more than an order of magnitude smaller (59M versus billion-scale parameters) can surpass much larger models when training data and output constraints are carefully engineered.
- Training an effective OMR system can take only hours on a single GPU rather than large-scale distributed resources.
- Zero-shot performance on historical documents improves substantially without any fine-tuning on real annotated examples.
- Removing encoding ambiguity during training reduces decoder uncertainty and improves overall transcription reliability.
Where Pith is reading between the lines
- The same combination of synthetic rendering and canonical output forms could apply to other visual-to-text tasks that suffer from one-to-many mappings, such as handwritten text or chemical structure recognition.
- Further gains may come from iteratively refining the synthetic generator based on failure modes observed on real scans.
- This data-first approach suggests that for many recognition problems, investing in data realism and consistency can substitute for increases in model size.
Load-bearing premise
The synthetic images match the distribution of real scans closely enough for zero-shot transfer, and the kern normalization step preserves all musically relevant information without introducing new systematic errors on actual data.
What would settle it
A test set of diverse real-world music scans from multiple historical periods and styles on which Transcoda shows higher error rates than the strongest baselines would falsify the zero-shot transfer claim.
Original abstract
Optical Music Recognition (OMR), the task of transcribing sheet music into a structured textual representation, is currently bottlenecked by a lack of large-scale, annotated datasets of real scans. This forces models to rely on either few-shot transfer or synthetic training pipelines that remain overly simplistic. A secondary challenge is encoding non-uniqueness: in the popular Humdrum **kern format for transcribing music, multiple different text encodings can render into the same visual sheet music. This one-to-many mapping creates a harder learning task and introduces high uncertainty during decoding. We propose Transcoda, an OMR system built on (i) an advanced synthetic data generation pipeline, (ii) a normalization of the **kern encoding to enforce a unique normal form, and (iii) grammar-based decoding to ensure the syntactic correctness of the output. This approach allows us to train a compact 59M-parameter model in just 6 hours on a single GPU that outperforms billion-parameter baselines. Transcoda achieves the best score among state-of-the-art baselines on a newly curated benchmark of synthetically rendered scores at 18.46% OMR-NED (compared to 43.91% for the next-best system, Legato) and reduces the error rate on historical Polish scans to 63.97% OMR-NED (down from 80.16% for SMT++).
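Grammar-based decoding, as described in the abstract, restricts each decoding step to tokens that keep the output syntactically valid. The Python sketch below shows the general masking idea on a toy two-state grammar; the grammar, the token set, the scores, and the function name `constrained_greedy` are all assumptions made for illustration and are far simpler than real **kern or whatever engine the authors use.

```python
import math

# Toy grammar (an assumption): a line is one or more fields separated by tabs
# and ended by a newline, so field tokens and separators must alternate.
ALLOWED = {
    "field": {"4c", "4d", "8e", "."},   # a field token must come next
    "sep": {"\t", "\n"},                # then a separator
}
NEXT_STATE = {"field": "sep", "sep": "field"}

def constrained_greedy(step_scores: list[dict[str, float]]) -> list[str]:
    """Pick, at each step, the highest-scoring token the grammar allows."""
    state, output = "field", []
    for scores in step_scores:
        allowed = ALLOWED[state]
        # Mask: tokens outside the allowed set get -inf and can never win.
        masked = {tok: (s if tok in allowed else -math.inf) for tok, s in scores.items()}
        best = max(masked, key=masked.get)
        if masked[best] == -math.inf:
            raise ValueError(f"no grammatical token available in state {state!r}")
        output.append(best)
        state = NEXT_STATE[state]
    return output

scores = [
    {"4c": 0.2, "\t": 0.9, "\n": 0.1},   # raw scores prefer an illegal leading separator
    {"4d": 0.8, "\t": 0.6, "\n": 0.3},   # raw scores prefer an illegal second field
    {"8e": 0.7, "\n": 0.1},
    {"4c": 0.5, "\n": 0.4},
]
decoded = constrained_greedy(scores)
print(decoded)
```

In this toy run, unconstrained greedy decoding would start with a tab and then emit two fields in a row, whereas the masked version yields a well-formed line. A real system would apply the same mask to model logits over a full grammar.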
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Transcoda, an end-to-end OMR system trained exclusively on synthetically generated data. It combines an advanced synthetic data pipeline, normalization of **kern to a unique normal form to resolve encoding non-uniqueness, and grammar-based decoding for syntactic validity. A compact 59M-parameter model is trained in 6 hours on one GPU and is claimed to outperform larger baselines, achieving 18.46% OMR-NED on a new synthetic benchmark (vs. 43.91% for Legato) and 63.97% OMR-NED on historical Polish scans in a zero-shot setting (vs. 80.16% for SMT++).
Significance. If the zero-shot transfer holds, the work would be significant for OMR by demonstrating that careful data-centric synthetic generation can overcome the annotated-data bottleneck and enable compact models to surpass billion-parameter systems. The normalization and grammar constraints directly address a known source of decoding uncertainty in **kern. The reported training efficiency and parameter count deserve credit; they are concrete strengths if the empirical claims are supported.
major comments (3)
- [Abstract / Results] Abstract and experimental results: The headline zero-shot result on historical Polish scans (63.97% OMR-NED) rests on the unverified assumption that the synthetic training distribution matches real-scan degradations (ink density, texture, engraving variation). No quantitative domain-similarity metrics (FID, style statistics, or error-pattern analysis) are reported, which is load-bearing for interpreting the improvement over SMT++ as robust transfer rather than a test-set artifact.
- [Experimental evaluation] Evaluation protocol: No details are given on how OMR-NED is computed, how baselines (Legato, SMT++) were re-implemented or trained, whether synthetic test data were checked for leakage from the training pipeline, or whether error bars / multiple seeds were used. These omissions directly affect confidence in the reported 18.46% and 63.97% figures.
- [Method (normalization step)] Kern normalization: The claim that collapsing **kern to a unique normal form preserves all musically relevant information is central to both training and zero-shot decoding, yet no analysis or counter-examples are provided showing that real ambiguous encodings are not systematically altered.
minor comments (2)
- [Abstract] The abstract refers to 'a newly curated benchmark' without stating whether the data or generation code will be released.
- [Introduction / Method] OMR-NED should be defined on first use in the main text even if standard in the community.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications and committing to specific revisions that strengthen the manuscript without altering its core claims.
-
Referee: [Abstract / Results] Abstract and experimental results: The headline zero-shot result on historical Polish scans (63.97% OMR-NED) rests on the unverified assumption that the synthetic training distribution matches real-scan degradations (ink density, texture, engraving variation). No quantitative domain-similarity metrics (FID, style statistics, or error-pattern analysis) are reported, which is load-bearing for interpreting the improvement over SMT++ as robust transfer rather than a test-set artifact.
Authors: We agree that explicit domain-similarity analysis would increase confidence in the zero-shot transfer. The synthetic pipeline incorporates controlled variations in ink density, texture, and engraving to approximate historical degradations, and the 16-point absolute improvement over SMT++ on real Polish scans provides supporting evidence. In the revised version we will add a new subsection with qualitative side-by-side comparisons of synthetic and real degradations together with an error-pattern breakdown (e.g., common substitution types) to make the domain alignment explicit.
revision: yes
-
Referee: [Experimental evaluation] Evaluation protocol: No details are given on how OMR-NED is computed, how baselines (Legato, SMT++) were re-implemented or trained, whether synthetic test data were checked for leakage from the training pipeline, or whether error bars / multiple seeds were used. These omissions directly affect confidence in the reported 18.46% and 63.97% figures.
Authors: We will expand the experimental section to document the exact OMR-NED definition (token-level Levenshtein distance normalized by reference length, following established OMR practice), the re-implementation procedure for each baseline (official repositories with identical synthetic training data), the independent generation of the synthetic test split to preclude leakage, and results averaged over three random seeds with standard deviations.
revision: yes
-
Referee: [Method (normalization step)] Kern normalization: The claim that collapsing **kern to a unique normal form preserves all musically relevant information is central to both training and zero-shot decoding, yet no analysis or counter-examples are provided showing that real ambiguous encodings are not systematically altered.
Authors: We will augment the normalization section with concrete counter-examples of ambiguous **kern strings (different duration and articulation encodings that render identically) and demonstrate via round-trip rendering that the chosen normal form is information-preserving for musically valid content. This analysis will be supported by references to Humdrum semantics and will clarify that the mapping is applied only after confirming syntactic validity.
revision: yes
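The OMR-NED definition given in the second response (token-level Levenshtein distance normalized by reference length) can be sketched in a few lines of Python. This follows only that one-sentence description; the whitespace tokenization, the empty-reference convention, and the function names are assumptions, and the metric as implemented by the authors or the benchmark may differ.

```python
def levenshtein(ref: list[str], hyp: list[str]) -> int:
    """Minimum number of token insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference token
                            curr[j - 1] + 1,      # insert a hypothesis token
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def ned_percent(ref: list[str], hyp: list[str]) -> float:
    """Edit distance divided by reference length, as a percentage."""
    if not ref:
        # Assumed convention for an empty reference.
        return 0.0 if not hyp else 100.0
    return 100.0 * levenshtein(ref, hyp) / len(ref)

# Assumption: tokens are obtained by splitting on whitespace.
reference = "4c 4d 4e 4f".split()
hypothesis = "4c 4e 4f".split()
print(ned_percent(reference, hypothesis))
```

Here one of four reference tokens is missing, giving 25.0. Because the denominator is the reference length, a hypothesis much longer than the reference can score above 100%.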
Circularity Check
No circularity: empirical benchmarks on held-out data
The paper presents an end-to-end empirical pipeline (synthetic data generation, kern normalization to a unique normal form, grammar-based decoding) trained on synthetic images and evaluated via direct OMR-NED metrics on separate synthetic and historical Polish scan benchmarks. No equations, derivations, or parameter fits are described that reduce reported error rates to self-referential inputs or prior self-citations; the performance numbers (18.46% vs. 43.91%, 63.97% vs. 80.16%) are straightforward held-out comparisons without any reduction by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Synthetic data generation can produce training examples whose visual statistics match real sheet music scans closely enough for zero-shot transfer.
- domain assumption: The kern encoding admits a unique normal form that resolves the one-to-many mapping without loss of musically relevant information.
Reference graph
Works this paper leans on
-
[1]
G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000
-
[2]
Jorge Calvo-Zaragoza, Alejandro H. Toselli, and Enrique Vidal. Handwritten music recognition for mensural notation with convolutional recurrent neural networks. Pattern Recognit. Lett., 128:115–121, 2019. doi: 10.1016/J.PATREC.2019.08.021. URL https://doi.org/10.1016/j.patrec.2019.08.021
-
[3]
Denis Coquenet, Clément Chatelain, and Thierry Paquet. DAN: a segmentation-free document attention network for handwritten document recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(7):8227–8243, 2023. DBLP-verified title and venue
-
[4]
Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen. XGrammar: Flexible and efficient structured generation engine for large language models. CoRR, abs/2411.15100,
-
[6]
Sachinda Edirisooriya, Hao-Wen Dong, Julian McAuley, and Taylor Berg-Kirkpatrick. An empirical evaluation of end-to-end polyphonic optical music recognition. arXiv preprint arXiv:2108.01769, 2021
-
[7]
Mark R. H. Gotham, Maureen Redbond, Bruno Bower, and Peter Jonas. The “OpenScore String Quartet” Corpus. In Proceedings of the 10th International Conference on Digital Libraries for Musicology, pages 49–57, Milan, Italy, November 2023. ACM. ISBN 9798400708336. doi: 10.1145/3625135.3625155. URL https://dl.acm.org/doi/10.1145/3625135.3625155
-
[8]
Mark Robert Haigh Gotham and Peter Jonas. The OpenScore Lieder Corpus. In Stefan Münnich and David Rizo, editors, Music Encoding Conference Proceedings 2021, pages 131–136. Humanities Commons, 2022. doi: 10.17613/1my2-dm23
-
[9]
Alexander Groleau, Kok Wei Chee, Stefan Larson, Samay Maini, and Jonathan Boarman. Augraphy: A data augmentation library for document images. In Proceedings of the 17th International Conference on Document Analysis and Recognition (ICDAR), 2023. URL https://arxiv.org/pdf/2208.14558.pdf
-
[10]
David Huron. Humdrum and kern: selective feature encoding. In Eleanor Selfridge-Field, editor, Beyond MIDI: The Handbook of Musical Codes, pages 375–401. MIT Press, Cambridge, MA, USA, 1997. ISBN 0262193949
-
[11]
Jan Hajič Jr., Jiří Novotný, Pavel Pecina, and Jaroslav Pokorný. Further steps towards a standard testbed for optical music recognition. In Proceedings of the 17th International Society for Music Information Retrieval Conference, pages 157–163, 2016. doi: 10.5281/ZENODO.1418161. URL https://doi.org/10.5281/zenodo.1418161
-
[12]
Yixuan Li, Huaping Liu, Qiang Jin, Miaomiao Cai, and Peng Li. TrOMR: Transformer-based polyphonic optical music recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023, pages 1–5. IEEE, 2023. doi: 10.1109/ICASSP49357.2023.10096055. URL https://doi.org/10.1109/ICASSP49357....
-
[13]
Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 11966–11976. IEEE, 2022. doi: 10.1109/CVPR52688.2022.01167. URL https://doi.org/10.1109/CVPR52688.2022.01167
-
[14]
Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, and Julian McAuley. PDMX: A large-scale public domain MusicXML dataset for symbolic music processing. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2025. doi: 10.1109/ICASSP49660.2025.10890217
-
[15]
Juan C. Martinez-Sevilla, Joan Cerveto-Serrano, Noelia N. Luna-Barahona, Greg Chapman, Craig Sapp, David Rizo, and Jorge Calvo-Zaragoza. Sheet music benchmark: Standardized optical music recognition evaluation. In Proceedings of the 26th International Society for Music Information Retrieval Conference, ISMIR 2025, Daejeon, South Korea, September 21-25, 202...
-
[16]
Meta Llama. Llama 3.2 11B Vision. https://huggingface.co/meta-llama/Llama-3.2-11B-Vision, 2024. Hugging Face model card, accessed 2026-05-06
-
[17]
MuseTrainer Contributors. MuseTrainer Library. https://github.com/musetrainer/library, 2024.
-
[18]
Alexander Pacha and Horst Eidenberger. Towards a universal music symbol classifier. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 2, pages 35–36. IEEE, 2017
-
[19]
PRAIG. polish-scores, 2025. URL https://huggingface.co/datasets/PRAIG/polish-scores. Hugging Face dataset, accessed 2026-03-14
-
[20]
The Augraphy Project. Augraphy: an augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes. URL https://github.com/sparkfish/augraphy
-
[21]
Project Petrucci LLC. International music score library project (IMSLP), 2026. URL https://imslp.org/. Accessed: 2026-05-06
-
[22]
Laurent Pugin and Verovio Contributors. verovio: A library and toolkit for engraving MEI music notation into SVG, 2026. URL https://pypi.org/project/verovio/
-
[23]
Laurent Pugin, Rodolfo Zitellini, and Perry Roland. Verovio: A library for engraving MEI music notation into SVG. In Proceedings of the 15th International Society for Music Information Retrieval Conference, pages 107–112, 2014. URL http://www.terasoft.com.tw/conf/ismir2014/proceedings/T020_221_Paper.pdf. URL listed by the Verovio reference book; unavailabl...
-
[24]
Ana Rebelo, Ichiro Fujinaga, Filipe Paszkiewicz, André R. S. Marçal, Carlos Guedes, and Jaime S. Cardoso. Optical music recognition: state-of-the-art and open issues. Int. J. Multim. Inf. Retr., 1(3):173–190, 2012. doi: 10.1007/S13735-012-0004-6. URL https://doi.org/10.1007/s13735-012-0004-6
-
[25]
Antonio Ríos-Vila, Jorge Calvo-Zaragoza, and Thierry Paquet. Sheet music transformer: End-to-end optical music recognition beyond monophonic transcription. In Document Analysis and Recognition - ICDAR 2024 - 18th International Conference, Athens, Greece, August 30 - September 4, 2024, Proceedings, Part VI, volume 14809 of Lecture Notes in Computer Science, ...
-
[26]
Antonio Ríos-Vila, Jorge Calvo-Zaragoza, David Rizo, and Thierry Paquet. End-to-end full-page optical music recognition for pianoform sheet music. Int. J. Comput. Vis., 134(2):49, 2026. doi: 10.1007/S11263-025-02654-6. URL https://doi.org/10.1007/s11263-025-02654-6
-
[27]
S. S. Singh and Sergey Karayev. Full page handwriting recognition via image to sequence extraction. In Document Analysis and Recognition - ICDAR 2021 - 16th International Conference, Lausanne, Switzerland, September 5-10, 2021, Proceedings, Part III, volume 12823 of Lecture Notes in Computer Science, pages 55–69. Springer, 2021. DBLP-verified title and venue
-
[28]
Pau Torras, Arnau Baró, Lei Kang, and Alicia Fornés. On the integration of language models into sequence to sequence architectures for handwritten music recognition. In Jin Ha Lee, Alexander Lerch, Zhiyao Duan, Juhan Nam, Preeti Rao, Peter van Kranenburg, and Ajay Srinivasamurthy, editors, Proceedings of the 22nd International Society for Music Information...
-
[29]
Chris Walshaw. The ABC Music Standard 2.1, 2011. URL https://michaeleskin.com/abctools/abc_standard_v2.1.pdf
-
[30]
Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. ConvNeXt V2: Co-designing and scaling convnets with masked autoencoders, 2023. URL https://arxiv.org/abs/2301.00808
-
[31]
Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. SimMIM: A simple framework for masked image modeling, 2022. URL https://arxiv.org/abs/2111.09886
-
[32]
Weihan Xu, Julian McAuley, Taylor Berg-Kirkpatrick, Shlomo Dubnov, and Hao-Wen Dong. Generating symbolic music from natural language prompts using an LLM-enhanced dataset. arXiv preprint arXiv:2410.02084, 2024
-
[33]
Guang Yang, Muru Zhang, Lin Qiu, Yanming Wan, and Noah A. Smith. Toward a more complete OMR solution. arXiv preprint arXiv:2409.00316, 2024
-
[34]
Guang Yang, Victoria Ebert, Nazif Tamer, Luiza Pozzobon, and Noah A. Smith. LEGATO: large-scale end-to-end generalizable approach to typeset OMR. CoRR, abs/2506.19065, 2025. doi: 10.48550/ARXIV.2506.19065. URL https://doi.org/10.48550/arXiv.2506.19065
-
[35]
Kaizhong Zhang and Dennis Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262, 1989. doi: 10.1137/0218082. URL https://doi.org/10.1137/0218082.
A Synthetic Rendering Settings
Training images are rendered with Verovio 6.0.1. For each example, we sample the rendering option...