A High-Accuracy Optical Music Recognition Method Based on Bottleneck Residual Convolutions

Junwen Ma, Huhu Xue, Xingyuan Zhao, and Weicheng Fu

arXiv: 2604.16446 · v1 · submitted 2026-04-07 · cs.CV · cs.LG · cs.SD · eess.AS

Pith reviewed 2026-05-10 18:46 UTC · model grok-4.3

keywords: optical music recognition · residual convolutional networks · bidirectional GRU · connectionist temporal classification · end-to-end sequence recognition · music score transcription

The pith

Combining residual bottleneck convolutions with BiGRU enables end-to-end optical music recognition with sub-1% symbol error rates on standard benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops an end-to-end system for turning images of music scores into symbolic notation. It uses a CNN built from residual bottleneck blocks and multi-scale dilated convolutions to pull out both local symbol shapes and broader staff structures. These features go into a bidirectional GRU that learns the order of notes and symbols. Training relies on CTC loss, which removes the need for manual alignment of symbols to image positions. The approach reaches very low error rates on two public datasets of printed music, suggesting it could speed up digitizing large archives of scores.

Core claim

The proposed framework extracts features using a ResNet-v2-style network with residual bottleneck blocks and multi-scale dilated convolutions, then models temporal dependencies with BiGRU, trained end-to-end via CTC loss. This yields sequence error rates of 7.52% and 8.11% on Camera-PrIMuS and PrIMuS, respectively, along with symbol error rates below 0.5% and note accuracies above 99%.
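
A minimal sketch of the decoding side of CTC, assuming best-path (greedy) decoding and a blank label at index 0. The paper's actual decoder and label vocabulary are not described in the material above, so the function name `ctc_greedy_decode` and the frame scores below are hypothetical. It shows why no per-symbol alignment is needed: the network emits one label distribution per frame, and repeated labels and blanks are collapsed afterwards.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is the CTC blank label


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring label for every frame.
    best_path = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    # Merge consecutive duplicates (one symbol usually spans several frames).
    collapsed = [label for label, _ in groupby(best_path)]
    # Remove blanks; a blank between two equal labels keeps them distinct.
    return [label for label in collapsed if label != blank]


# Hypothetical per-frame scores over 3 classes: blank, symbol 1, symbol 2.
frames = [
    [0.10, 0.80, 0.10],  # symbol 1
    [0.10, 0.70, 0.20],  # symbol 1 again (same symbol, next frame)
    [0.90, 0.05, 0.05],  # blank separates two occurrences of symbol 1
    [0.20, 0.70, 0.10],  # symbol 1 (a new occurrence)
    [0.10, 0.20, 0.70],  # symbol 2
]
print(ctc_greedy_decode(frames))  # [1, 1, 2]
```

Greedy decoding is the simplest choice; beam search over the same frame scores is a common alternative when a language model or vocabulary constraint is available.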

What carries the argument

Residual bottleneck convolution blocks combined with multi-scale dilated convolutions in a CNN front-end feeding a BiGRU sequence model, trained with CTC loss.

If this is right

The model can transcribe music scores without requiring explicit alignment annotations during training.
It maintains high computational efficiency, with an average training time of 1.74 seconds per epoch.
Fine-grained error analysis shows effectiveness across pitch, type, and note recognition.
The same architecture works on both Camera-PrIMuS and PrIMuS datasets with comparable performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the feature encoding generalizes, this could extend to handwritten scores or more complex polyphonic music without major redesign.
The efficiency suggests it could be applied to real-time or large-scale batch processing of historical music collections.
Performance on noisy or degraded scans remains untested and could be a next step.

Load-bearing premise

The residual bottleneck convolutions plus BiGRU will encode musical symbol features well enough to generalize to unseen score images without overfitting to the training distributions.

What would settle it

Running the model on a new dataset of music scores with different fonts, layouts, or image qualities and observing whether the symbol error rate remains below 1%.

Figures

Figures reproduced from arXiv: 2604.16446 by Junwen Ma, Huhu Xue, Xingyuan Zhao, and Weicheng Fu.

**Figure 1.** Architecture of the proposed end-to-end Optical Music Recognition model based on bottleneck residual convolution and BiGRU. The model consists ...

**Figure 2.** Comparison between the original music score image and its augmented ...

Original abstract

Optical Music Recognition (OMR) aims to convert printed or handwritten music score images into editable symbolic representations. This paper presents an end-to-end OMR framework that combines residual bottleneck convolutions with bidirectional gated recurrent unit (BiGRU)-based sequence modeling. A convolutional neural network with ResNet-v2-style residual bottleneck blocks and multi-scale dilated convolutions is used to extract features that encode both fine-grained symbol details and global staff-line structures. The extracted feature sequences are then fed into a BiGRU network to model temporal dependencies among musical symbols. The model is trained using the Connectionist Temporal Classification loss, enabling end-to-end prediction without explicit alignment annotations. Experimental results on the Camera-PrIMuS and PrIMuS datasets demonstrate the effectiveness of the proposed framework. On Camera-PrIMuS, the proposed method achieves a sequence error rate (SeER) of 7.52% and a symbol error rate (SyER) of 0.45%, with pitch, type, and note accuracies of 99.33%, 99.60%, and 99.28%, respectively. The average training time is 1.74 s per epoch, demonstrating high computational efficiency while maintaining strong recognition performance. On PrIMuS, the method achieves a SeER of 8.11% and a SyER of 0.49%, with pitch, type, and note accuracies of 99.27%, 99.58%, and 99.21%, respectively. A fine-grained error analysis further confirms the effectiveness of the proposed model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Standard architecture for OMR delivers strong numbers on public datasets but without the controls needed to confirm its advantages.

The main takeaway is that this paper applies a familiar deep learning architecture to optical music recognition without introducing novel components. It stacks ResNet-v2 style bottleneck residual blocks with multi-scale dilated convolutions for extracting features from music score images, then uses a BiGRU to handle the sequence of symbols and CTC loss for end-to-end training. The reported results on Camera-PrIMuS show a sequence error rate of 7.52% and symbol error rate of 0.45%, along with high accuracies above 99% for pitch, type, and notes. Similar performance appears on the PrIMuS dataset.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes an end-to-end Optical Music Recognition (OMR) framework that employs a CNN built from ResNet-v2-style residual bottleneck blocks augmented with multi-scale dilated convolutions to extract features encoding both local symbol details and global staff structures, followed by a BiGRU for temporal sequence modeling. The model is trained with CTC loss to avoid explicit alignment annotations. It reports concrete performance on the Camera-PrIMuS dataset (SeER 7.52%, SyER 0.45%, pitch/type/note accuracies 99.33%/99.60%/99.28%) and PrIMuS dataset (SeER 8.11%, SyER 0.49%, accuracies 99.27%/99.58%/99.21%), plus average training time of 1.74 s/epoch and a fine-grained error analysis.

Significance. If substantiated by comparisons, the reported error rates and computational efficiency could indicate a practical advance in OMR by showing that bottleneck residuals plus dilated convolutions and BiGRU can deliver high symbol-level accuracy without alignment supervision. The end-to-end CTC training and fine-grained error breakdown are positive elements that align with standard practices in sequence recognition. However, the lack of controls currently prevents determining whether these numbers reflect architectural superiority or other factors.

major comments (3)

[Experimental Results] Experimental Results section: The headline metrics (SeER 7.52% and SyER 0.45% on Camera-PrIMuS; pitch/type/note accuracies >99%) are presented without any quantitative baseline comparisons to prior OMR systems (e.g., CRNN, other ResNet variants, or attention models) on identical dataset splits. This omission directly undermines the central claim that the bottleneck residual convolutions and BiGRU supply superior feature encoding.
[Experimental Results] Experimental Results section: No ablation experiments are reported that isolate the contribution of the multi-scale dilated convolutions, the residual bottleneck blocks, or the BiGRU component. Without these controls, the attribution of the low error rates and high accuracies to the proposed architecture remains unverified.
[Model and Training] Model and Training description: Hyperparameter settings, learning-rate schedule, data splits, and any statistical significance tests for the reported metrics are absent. These details are load-bearing for reproducing and assessing the claimed performance and efficiency (1.74 s/epoch).

minor comments (2)

[Abstract] The abstract introduces SeER and SyER without a brief definition or reference to their standard formulation in OMR literature.
[Figures and Tables] Figure captions and table headers could more explicitly link visual results to the quantitative error rates and accuracy breakdowns.
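
On the first minor comment, here is a sketch of how SeER and SyER are commonly computed in work on the PrIMuS datasets. It assumes SyER is the total edit distance between predicted and reference symbol sequences divided by the total reference length, and SeER is the fraction of sequences with at least one error. The paper's exact formulation is not given in the material above, so the aggregation (corpus-level rather than per-sequence averaging), the function names, and the toy symbol labels are all assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def error_rates(references, hypotheses):
    """Return (SeER, SyER) as fractions in [0, 1] under the assumed definitions."""
    pairs = list(zip(references, hypotheses))
    total_edits = sum(edit_distance(r, h) for r, h in pairs)
    total_symbols = sum(len(r) for r, _ in pairs)
    wrong_sequences = sum(r != h for r, h in pairs)
    return wrong_sequences / len(pairs), total_edits / total_symbols


# Hypothetical toy labels, not taken from the datasets.
refs = [["clef-G2", "note-C4_quarter", "barline"], ["clef-G2", "note-E4_half"]]
hyps = [["clef-G2", "note-C4_quarter", "barline"], ["clef-G2", "note-F4_half"]]
print(error_rates(refs, hyps))  # (0.5, 0.2)
```

The toy case also shows why the two metrics diverge so sharply in the reported results: a single wrong symbol marks the whole sequence as wrong for SeER but counts as only one edit among all symbols for SyER.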

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We agree that strengthening the experimental section with baselines, ablations, and reproducibility details will better support our claims, and we will revise the manuscript accordingly.

Referee: [Experimental Results] Experimental Results section: The headline metrics (SeER 7.52% and SyER 0.45% on Camera-PrIMuS; pitch/type/note accuracies >99%) are presented without any quantitative baseline comparisons to prior OMR systems (e.g., CRNN, other ResNet variants, or attention models) on identical dataset splits. This omission directly undermines the central claim that the bottleneck residual convolutions and BiGRU supply superior feature encoding.

Authors: We agree that direct quantitative comparisons on identical splits are essential to substantiate the advantages of our architecture. In the revised manuscript, we will add comparisons to prior OMR systems including CRNN, other ResNet variants, and attention-based models using the same Camera-PrIMuS and PrIMuS dataset splits, allowing clear evaluation of the contributions from the residual bottleneck blocks and BiGRU. (Revision: yes)

Referee: [Experimental Results] Experimental Results section: No ablation experiments are reported that isolate the contribution of the multi-scale dilated convolutions, the residual bottleneck blocks, or the BiGRU component. Without these controls, the attribution of the low error rates and high accuracies to the proposed architecture remains unverified.

Authors: We acknowledge that ablation studies are required to isolate component contributions. We will include ablation experiments in the revision that systematically evaluate variants with and without the multi-scale dilated convolutions, residual bottleneck blocks, and BiGRU, reporting the resulting changes in SeER, SyER, and accuracies on both datasets. (Revision: yes)

Referee: [Model and Training] Model and Training description: Hyperparameter settings, learning-rate schedule, data splits, and any statistical significance tests for the reported metrics are absent. These details are load-bearing for reproducing and assessing the claimed performance and efficiency (1.74 s/epoch).

Authors: We will expand the Model and Training section in the revision to provide full hyperparameter settings, the learning-rate schedule, exact data splits, and statistical significance analysis (including means and standard deviations from multiple runs with different random seeds) to ensure full reproducibility and proper assessment of the reported metrics and efficiency. (Revision: yes)

Circularity Check

0 steps flagged

No circularity: purely empirical model evaluation on external datasets

The paper describes a CNN-BiGRU architecture (ResNet-v2 bottleneck blocks + multi-scale dilated convolutions + BiGRU + CTC loss) and reports direct performance metrics (SeER, SyER, pitch/type/note accuracies) on the public Camera-PrIMuS and PrIMuS datasets. No derivation chain exists; there are no equations, no fitted parameters renamed as predictions, no self-citations invoked as uniqueness theorems, and no ansatz or renaming of known results. The reported numbers are standard test-set outputs of a trained model and do not reduce to the inputs by construction. This is a normal non-circular empirical ML paper.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard deep learning assumptions about feature extraction and sequence modeling; no new entities are introduced and the performance numbers depend on empirical tuning rather than derivation.

free parameters (2)

Convolution kernel sizes, dilation rates, and channel dimensions
Chosen to capture multi-scale staff and symbol features but specific values not provided in abstract.
BiGRU hidden size, number of layers, and learning rate schedule
Hyperparameters for sequence modeling tuned for the task but not detailed.
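
To make the first free parameter concrete, the sketch below shows how kernel size and dilation rate set the span of input a convolution sees. It uses the standard relation k_eff = d(k - 1) + 1 for one dilated layer and 1 + sum of d_i(k_i - 1) for a stack of stride-1 layers. The kernel size 3 and dilation rates 1, 2, 4 are hypothetical, since the paper's values are not given above, and whether its multi-scale branches run in parallel or in sequence is likewise an assumption left open here.

```python
def effective_kernel(kernel, dilation):
    """Input span covered by one dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel - 1) + 1


def receptive_field(layers):
    """Receptive field of stacked stride-1 conv layers given (kernel, dilation)."""
    return 1 + sum(d * (k - 1) for k, d in layers)


def dilated_conv1d(signal, weights, dilation):
    """Valid (no padding) 1-D dilated cross-correlation in pure Python."""
    span = dilation * (len(weights) - 1)
    return [
        sum(w * signal[i + t * dilation] for t, w in enumerate(weights))
        for i in range(len(signal) - span)
    ]


# Hypothetical configuration: 3-tap kernels at dilation rates 1, 2 and 4.
rates = [1, 2, 4]
print([effective_kernel(3, d) for d in rates])      # [3, 5, 9]
print(receptive_field([(3, d) for d in rates]))     # 15
print(dilated_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 0, -1], 2))  # [-4, -4, -4]
```

With parallel branches, the per-branch effective kernels (3, 5, 9 here) are the relevant scales; with a sequential stack, the cumulative receptive field (15 here) is. Either way, wider context is gained without adding weights, which is the usual motivation for dilation when narrow symbols and page-wide staff lines must be captured together.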

axioms (2)

domain assumption: CTC loss enables end-to-end training of sequence models without explicit alignment annotations
Standard assumption in handwriting and speech recognition tasks applied here to music symbols.
standard math: Residual bottleneck blocks improve gradient flow and feature extraction in deep CNNs for structured images
Based on established ResNet-v2 literature.
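
The CTC axiom above can be illustrated with a small forward-algorithm sketch. It computes the probability of a target symbol sequence by summing over every frame-level alignment that collapses to it, which is what removes the need for alignment labels. This is the textbook recursion in plain probability space, with blank index 0 assumed. It is not the paper's implementation; a practical version would work in log space for numerical stability, and the probabilities below are hypothetical.

```python
import math


def ctc_probability(probs, target, blank=0):
    """P(target | probs), summed over all CTC alignments.

    probs[t][c] is the probability of class c at frame t (needs >= 1 frame).
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank].
    ext = [blank]
    for symbol in target:
        ext += [symbol, blank]
    size = len(ext)

    # At the first frame only the leading blank or first symbol is reachable.
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for frame in probs[1:]:
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same position
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]             # skip the blank in between
            new[s] = total * frame[ext[s]]
        alpha = new

    # Valid alignments end on the last symbol or the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


# Two frames, two classes (blank, symbol 1), uniform hypothetical probabilities.
uniform = [[0.5, 0.5], [0.5, 0.5]]
p = ctc_probability(uniform, [1])
print(p)             # 0.75: alignments (1,1), (1,blank), (blank,1)
print(-math.log(p))  # the corresponding CTC loss for this single example
```

Training minimizes the negative log of this quantity, so gradient flows to every alignment consistent with the label sequence rather than to one hand-annotated segmentation.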


