A High-Accuracy Optical Music Recognition Method Based on Bottleneck Residual Convolutions
Pith reviewed 2026-05-10 18:46 UTC · model grok-4.3
The pith
Combining residual bottleneck convolutions with BiGRU enables end-to-end optical music recognition with sub-1% symbol error rates on standard benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The proposed framework extracts features using a ResNet-v2-style network with residual bottleneck blocks and multi-scale dilated convolutions, then models temporal dependencies with BiGRU, trained end-to-end via CTC loss. This yields sequence error rates of 7.52% and 8.11% on Camera-PrIMuS and PrIMuS, respectively, along with symbol error rates below 0.5% and note accuracies above 99%.
What carries the argument
Residual bottleneck convolution blocks combined with multi-scale dilated convolutions in a CNN front-end feeding a BiGRU sequence model, trained with CTC loss.
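As a rough illustration of why these two ingredients are attractive, the sketch below counts convolution weights in a bottleneck block (1x1 reduce, 3x3, 1x1 expand) versus two full-width 3x3 convolutions, and computes the span covered by a dilated kernel. The channel counts and dilation rates are hypothetical; the paper's actual values are not given in this summary.

```python
def conv_params(c_in, c_out, k):
    """Weights in a k x k convolution (bias and normalization ignored)."""
    return c_in * c_out * k * k


def bottleneck_params(channels, width):
    """1x1 reduce to `width`, 3x3 at `width`, 1x1 expand back to `channels`."""
    return (conv_params(channels, width, 1)
            + conv_params(width, width, 3)
            + conv_params(width, channels, 1))


def plain_block_params(channels):
    """Two stacked 3x3 convolutions at full width, for comparison."""
    return 2 * conv_params(channels, channels, 3)


def effective_kernel(k, dilation):
    """Span (in pixels) covered by a k-tap kernel with the given dilation."""
    return dilation * (k - 1) + 1


# Hypothetical sizes chosen for illustration, not taken from the paper.
print(bottleneck_params(256, 64))                    # 69632
print(plain_block_params(256))                       # 1179648
print([effective_kernel(3, d) for d in (1, 2, 4)])   # [3, 5, 9]
```

With these assumed sizes the bottleneck uses roughly 17 times fewer weights, and three parallel 3-tap kernels cover spans of 3, 5, and 9 pixels, which is one way a front-end can see both small symbols and wider staff context.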
If this is right
- The model can transcribe music scores without requiring explicit alignment annotations during training.
- It trains quickly, with a reported average training time of 1.74 seconds per epoch.
- Fine-grained error analysis shows effectiveness across pitch, type, and note recognition.
- The same architecture works on both Camera-PrIMuS and PrIMuS datasets with comparable performance.
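The alignment-free property in the first point comes from CTC: the network emits one distribution per image column, and decoding collapses repeated labels and drops a special blank. A minimal sketch of greedy (best-path) CTC decoding follows; the blank index of 0 and the four-symbol vocabulary are assumptions for illustration, and the paper may use a different decoder.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    decoded = []
    prev = None
    for idx in best:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


# Illustrative vocabulary (an assumption, not the paper's label set).
vocab = ["<blank>", "clef-G2", "note-C4_quarter", "barline"]

# One-hot frame scores that follow the path 1 1 0 2 2 0 2 3.
path = [1, 1, 0, 2, 2, 0, 2, 3]
frames = [[1.0 if j == i else 0.0 for j in range(len(vocab))] for i in path]

print([vocab[i] for i in ctc_greedy_decode(frames)])
# ['clef-G2', 'note-C4_quarter', 'note-C4_quarter', 'barline']
```

The blank between the two runs of label 2 is what lets the decoder keep two consecutive identical notes instead of merging them.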
Where Pith is reading between the lines
- If the feature encoding generalizes, this could extend to handwritten scores or more complex polyphonic music without major redesign.
- The efficiency suggests it could be applied to real-time or large-scale batch processing of historical music collections.
- Performance on noisy or degraded scans remains untested and could be a next step.
Load-bearing premise
The residual bottleneck convolutions plus BiGRU will encode musical symbol features well enough to generalize to unseen score images without overfitting to the training distributions.
What would settle it
Running the model on a new dataset of music scores with different fonts, layouts, or image qualities and observing whether the symbol error rate remains below 1%.
Original abstract
Optical Music Recognition (OMR) aims to convert printed or handwritten music score images into editable symbolic representations. This paper presents an end-to-end OMR framework that combines residual bottleneck convolutions with bidirectional gated recurrent unit (BiGRU)-based sequence modeling. A convolutional neural network with ResNet-v2-style residual bottleneck blocks and multi-scale dilated convolutions is used to extract features that encode both fine-grained symbol details and global staff-line structures. The extracted feature sequences are then fed into a BiGRU network to model temporal dependencies among musical symbols. The model is trained using the Connectionist Temporal Classification loss, enabling end-to-end prediction without explicit alignment annotations. Experimental results on the Camera-PrIMuS and PrIMuS datasets demonstrate the effectiveness of the proposed framework. On Camera-PrIMuS, the proposed method achieves a sequence error rate (SeER) of 7.52% and a symbol error rate (SyER) of 0.45%, with pitch, type, and note accuracies of 99.33%, 99.60%, and 99.28%, respectively. The average training time is 1.74 s per epoch, demonstrating high computational efficiency while maintaining strong recognition performance. On PrIMuS, the method achieves a SeER of 8.11% and a SyER of 0.49%, with pitch, type, and note accuracies of 99.27%, 99.58%, and 99.21%, respectively. A fine-grained error analysis further confirms the effectiveness of the proposed model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes an end-to-end Optical Music Recognition (OMR) framework that employs a CNN built from ResNet-v2-style residual bottleneck blocks augmented with multi-scale dilated convolutions to extract features encoding both local symbol details and global staff structures, followed by a BiGRU for temporal sequence modeling. The model is trained with CTC loss to avoid explicit alignment annotations. It reports concrete performance on the Camera-PrIMuS dataset (SeER 7.52%, SyER 0.45%, pitch/type/note accuracies 99.33%/99.60%/99.28%) and the PrIMuS dataset (SeER 8.11%, SyER 0.49%, accuracies 99.27%/99.58%/99.21%), plus an average training time of 1.74 s/epoch and a fine-grained error analysis.
Significance. If substantiated by comparisons, the reported error rates and computational efficiency could indicate a practical advance in OMR by showing that bottleneck residuals plus dilated convolutions and BiGRU can deliver high symbol-level accuracy without alignment supervision. The end-to-end CTC training and fine-grained error breakdown are positive elements that align with standard practices in sequence recognition. However, the lack of controls currently prevents determining whether these numbers reflect architectural superiority or other factors.
major comments (3)
- [Experimental Results] Experimental Results section: The headline metrics (SeER 7.52% and SyER 0.45% on Camera-PrIMuS; pitch/type/note accuracies >99%) are presented without any quantitative baseline comparisons to prior OMR systems (e.g., CRNN, other ResNet variants, or attention models) on identical dataset splits. This omission directly undermines the central claim that the bottleneck residual convolutions and BiGRU supply superior feature encoding.
- [Experimental Results] Experimental Results section: No ablation experiments are reported that isolate the contribution of the multi-scale dilated convolutions, the residual bottleneck blocks, or the BiGRU component. Without these controls, the attribution of the low error rates and high accuracies to the proposed architecture remains unverified.
- [Model and Training] Model and Training description: Hyperparameter settings, learning-rate schedule, data splits, and any statistical significance tests for the reported metrics are absent. These details are load-bearing for reproducing and assessing the claimed performance and efficiency (1.74 s/epoch).
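The ablation requested in the second comment amounts to training every on/off combination of the three components. A minimal sketch of enumerating that grid is shown below; the component names are labels invented here, and what replaces a removed component (for example, a per-column classifier in place of the BiGRU) is left open as an assumption.

```python
from itertools import product

# Hypothetical component labels; the paper does not name its config flags.
COMPONENTS = ("dilated_convs", "bottleneck_blocks", "bigru")


def ablation_grid(components=COMPONENTS):
    """Return one config dict per on/off combination of the components."""
    return [dict(zip(components, flags))
            for flags in product((True, False), repeat=len(components))]


for cfg in ablation_grid():
    removed = [name for name, enabled in cfg.items() if not enabled]
    label = "full model" if not removed else "without " + ", ".join(removed)
    print(label)
```

Three components give eight variants; a smaller study might keep only the full model and the three single-component removals.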
minor comments (2)
- [Abstract] The abstract introduces SeER and SyER without a brief definition or reference to their standard formulation in OMR literature.
- [Figures and Tables] Figure captions and table headers could more explicitly link visual results to the quantitative error rates and accuracy breakdowns.
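For readers missing the definitions noted in the first minor comment, the sketch below follows the formulation commonly used in the OMR literature, which is an assumption about what the paper intends: SyER is the total edit distance between predictions and references divided by the total reference length, and SeER is the fraction of sequences with at least one error. The token sequences are toy data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def symbol_error_rate(refs, hyps):
    """Total edits divided by total reference symbols."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)


def sequence_error_rate(refs, hyps):
    """Fraction of sequences that contain at least one error."""
    wrong = sum(1 for r, h in zip(refs, hyps) if list(r) != list(h))
    return wrong / len(refs)


# Toy data: one substitution in the second of three sequences.
refs = [["a", "b", "c", "d"], ["a", "b"], ["c", "d", "e", "f"]]
hyps = [["a", "b", "c", "d"], ["a", "x"], ["c", "d", "e", "f"]]

print(symbol_error_rate(refs, hyps))    # 0.1
print(sequence_error_rate(refs, hyps))  # 0.3333333333333333
```

The toy case also shows why SeER is much larger than SyER in the reported results: a single wrong symbol marks the whole sequence as an error.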
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We agree that strengthening the experimental section with baselines, ablations, and reproducibility details will better support our claims, and we will revise the manuscript accordingly.
- Referee: [Experimental Results] Experimental Results section: The headline metrics (SeER 7.52% and SyER 0.45% on Camera-PrIMuS; pitch/type/note accuracies >99%) are presented without any quantitative baseline comparisons to prior OMR systems (e.g., CRNN, other ResNet variants, or attention models) on identical dataset splits. This omission directly undermines the central claim that the bottleneck residual convolutions and BiGRU supply superior feature encoding.
  Authors: We agree that direct quantitative comparisons on identical splits are essential to substantiate the advantages of our architecture. In the revised manuscript, we will add comparisons to prior OMR systems including CRNN, other ResNet variants, and attention-based models using the same Camera-PrIMuS and PrIMuS dataset splits, allowing clear evaluation of the contributions from the residual bottleneck blocks and BiGRU.
  Revision: yes
- Referee: [Experimental Results] Experimental Results section: No ablation experiments are reported that isolate the contribution of the multi-scale dilated convolutions, the residual bottleneck blocks, or the BiGRU component. Without these controls, the attribution of the low error rates and high accuracies to the proposed architecture remains unverified.
  Authors: We acknowledge that ablation studies are required to isolate component contributions. We will include ablation experiments in the revision that systematically evaluate variants with and without the multi-scale dilated convolutions, residual bottleneck blocks, and BiGRU, reporting the resulting changes in SeER, SyER, and accuracies on both datasets.
  Revision: yes
- Referee: [Model and Training] Model and Training description: Hyperparameter settings, learning-rate schedule, data splits, and any statistical significance tests for the reported metrics are absent. These details are load-bearing for reproducing and assessing the claimed performance and efficiency (1.74 s/epoch).
  Authors: We will expand the Model and Training section in the revision to provide full hyperparameter settings, the learning-rate schedule, exact data splits, and statistical significance analysis (including means and standard deviations from multiple runs with different random seeds) to ensure full reproducibility and proper assessment of the reported metrics and efficiency.
  Revision: yes
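The multi-seed reporting promised in the last response could be summarized along the following lines. This is only a sketch: the function name is hypothetical, and the values are placeholders invented for illustration, not results from the paper.

```python
from statistics import mean, stdev


def summarize_runs(values):
    """Return the mean and sample standard deviation of per-seed metrics."""
    return mean(values), stdev(values)


# Placeholder SyER values in percent, one per random seed.
# These numbers are made up for illustration and are NOT from the paper.
placeholder_syer = [0.40, 0.50, 0.60]

avg, spread = summarize_runs(placeholder_syer)
print(f"SyER: {avg:.2f} +/- {spread:.2f} % over {len(placeholder_syer)} seeds")
```

Reporting the spread alongside the mean makes it possible to judge whether a gap between two models exceeds seed-to-seed variation.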
Circularity Check
No circularity: purely empirical model evaluation on external datasets
The paper describes a CNN-BiGRU architecture (ResNet-v2 bottleneck blocks + multi-scale dilated convolutions + BiGRU + CTC loss) and reports direct performance metrics (SeER, SyER, pitch/type/note accuracies) on the public Camera-PrIMuS and PrIMuS datasets. No derivation chain exists; there are no equations, no fitted parameters renamed as predictions, no self-citations invoked as uniqueness theorems, and no ansatz or renaming of known results. The reported numbers are standard test-set outputs of a trained model and do not reduce to the inputs by construction. This is a normal non-circular empirical ML paper.
Axiom & Free-Parameter Ledger
free parameters (2)
- Convolution kernel sizes, dilation rates, and channel dimensions
- BiGRU hidden size, number of layers, and learning rate schedule
axioms (2)
- domain assumption: CTC loss enables end-to-end training of sequence models without explicit alignment annotations
- standard math: Residual bottleneck blocks improve gradient flow and feature extraction in deep CNNs for structured images