Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations
Pith reviewed 2026-05-08 09:28 UTC · model grok-4.3
The pith
Transformer models can quantize rhythms in MIDI performances into readable musical notation when given beat annotations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that an adapted T5 transformer, trained on beat-synchronized score and performance MIDI together with data augmentations such as transposition, note deletion, and time jitter, performs effective rhythm quantization. On the ASAP dataset it reaches an onset F1-score of 97.3% and a note value accuracy of 83.3%, generalizes across time signatures, including those absent from the training data, and yields readable musical scores. Fine-tuning on instrument-specific collections further improves capture of characteristic rhythmic and melodic patterns.
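The three augmentations named here can be illustrated with a minimal sketch. The `AlignedNote` record, the `augment` function, and all parameter names are hypothetical, and the choice to delete a note from both performance and score at once is an assumption; the paper's actual augmentation code is not reproduced in this summary.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AlignedNote:
    pitch: int          # MIDI pitch shared by performance and score
    perf_onset: float   # performed onset in seconds
    score_onset: float  # notated onset in beats


def augment(notes, rng, semitones=0, drop_prob=0.0, jitter_std=0.0):
    """Apply transposition, note deletion, and performance-side time jitter."""
    out = []
    for n in notes:
        if rng.random() < drop_prob:
            continue  # delete the note from both performance and score
        pitch = n.pitch + semitones
        if not 0 <= pitch <= 127:
            continue  # skip notes transposed outside the MIDI pitch range
        # Jitter only the performed time; the score position stays exact.
        jittered = max(0.0, n.perf_onset + rng.gauss(0.0, jitter_std))
        out.append(replace(n, pitch=pitch, perf_onset=jittered))
    return out


notes = [AlignedNote(60, 0.02, 0.0), AlignedNote(64, 0.51, 1.0)]
augmented = augment(notes, random.Random(0), semitones=2, jitter_std=0.01)
```

Passing an explicit `random.Random` instance keeps each augmentation run reproducible.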
What carries the argument
An adapted T5 transformer that processes beat-aligned performance and score MIDI sequences, supported by a custom MIDI tokenizer and a beat-based pre-quantization step that unifies timing information.
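This summary does not spell out the pre-quantization procedure. As a rough sketch of the general idea, assuming piecewise-linear interpolation between annotated beat times, performed onsets in seconds can be mapped to fractional beat positions. The function name `seconds_to_beats` and the interpolation scheme are illustrative assumptions, not the authors' method.

```python
from bisect import bisect_right


def seconds_to_beats(onsets, beat_times):
    """Map onset times (seconds) to fractional beat positions.

    Assumes beat_times is strictly increasing with at least two entries.
    Onsets outside the annotated range are extrapolated from the nearest
    beat interval.
    """
    if len(beat_times) < 2:
        raise ValueError("need at least two beat annotations")
    positions = []
    for t in onsets:
        # Index of the beat interval that contains t, clamped to valid intervals.
        i = bisect_right(beat_times, t) - 1
        i = max(0, min(i, len(beat_times) - 2))
        start, end = beat_times[i], beat_times[i + 1]
        positions.append(i + (t - start) / (end - start))
    return positions


beats = [0.0, 0.5, 1.1, 1.6]  # hypothetical beat annotations in seconds
print(seconds_to_beats([0.25, 0.8], beats))
```

Because each position is measured relative to the local beat interval, tempo changes between beats are absorbed before any note value is assigned.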
If this is right
- The model produces musical scores that musicians can read directly from performance MIDI input.
- Accuracy holds across different time signatures even when some are absent from training data.
- Fine-tuning on instrument-specific collections improves modeling of typical rhythmic patterns for that instrument.
- The framework supports objective evaluation through score-level metrics that compare quantized output against reference notation.
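The exact metric definitions are not given in this summary. One plausible reading of an onset F1-score, sketched below, counts a predicted note as correct when its pitch and quantized onset exactly match a reference note. The function `onset_f1` and the exact-match criterion are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction


def onset_f1(predicted, reference):
    """F1 over (pitch, onset) pairs, matched exactly as multisets."""
    pred, ref = Counter(predicted), Counter(reference)
    true_pos = sum((pred & ref).values())  # multiset intersection
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred.values())
    recall = true_pos / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Onsets are exact fractions of a beat, so equality tests are safe.
reference = [(60, Fraction(0)), (62, Fraction(1, 2)), (64, Fraction(1))]
predicted = [(60, Fraction(0)), (62, Fraction(1, 2)), (64, Fraction(3, 4))]
score = onset_f1(predicted, reference)
```

Using `Fraction` rather than floats avoids spurious mismatches when comparing quantized positions such as triplets.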
Where Pith is reading between the lines
- Pairing this quantizer with a separate beat-tracking stage could support fully automatic pipelines that need no manual annotations.
- The same alignment and tokenization approach might transfer to other symbolic music tasks that require correcting timing deviations.
- Training with paired score information suggests a route for guiding future models toward notation that respects musical conventions rather than raw timing.
Load-bearing premise
Accurate beat annotations are supplied in advance to align performance timing with score data.
What would settle it
Measuring substantially lower accuracy when the same model receives automatically estimated beats instead of ground-truth annotations would show that the claimed performance depends on perfect prior beat information.
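Short of running a real beat tracker, imperfect beats can be approximated by degrading the ground-truth annotations. The sketch below is one hypothetical way to do that; the function `perturb_beats`, its defaults, and the decision to always keep the first and last beat are assumptions, not part of the paper.

```python
import random


def perturb_beats(beat_times, rng, jitter_std=0.05, miss_rate=0.05):
    """Return a degraded copy of beat_times (seconds).

    Each beat is dropped with probability miss_rate; survivors receive
    zero-mean Gaussian jitter with standard deviation jitter_std seconds.
    The first and last beats are always kept so the span stays covered.
    """
    last = len(beat_times) - 1
    kept = [
        t + rng.gauss(0.0, jitter_std)
        for i, t in enumerate(beat_times)
        if i in (0, last) or rng.random() >= miss_rate
    ]
    # Jitter can swap neighbouring beats; sorting restores monotonic order.
    return sorted(kept)


clean = [i * 0.5 for i in range(20)]  # hypothetical 120 BPM beat grid
noisy = perturb_beats(clean, random.Random(2), jitter_std=0.1, miss_rate=0.2)
```

Sweeping `jitter_std` and `miss_rate` and re-running the evaluation would show how quickly accuracy falls as beat quality degrades.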
Original abstract
Rhythm transcription is a key subtask of notation-level Automatic Music Transcription (AMT). While deep learning models have been extensively used for detecting the metrical grid in audio and MIDI performances, beat-based rhythm quantization remains largely unexplored. In this work, we introduce a novel deep learning approach for quantizing MIDI performances using a priori beat information. Our method leverages the transformer architecture to effectively process synchronized score and performance data for training a quantization model. Key components of our approach include dataset preparation, a beat-based pre-quantization method to align performance and score times within a unified framework, and a MIDI tokenizer tailored for this task. We adapt a transformer model based on the T5 architecture to meet the specific requirements of rhythm quantization. The model is evaluated using a set of score-level metrics designed for objective assessment of quantization performance. Through systematic evaluation, we optimize both data representation and model architecture. Additionally, we apply performance and score augmentations, such as transposition, note deletion, and performance-side time jitter, to enhance the model's robustness. Finally, a qualitative analysis compares our model's quantization performance against state-of-the-art probabilistic and deep-learning models on various example pieces. Our model achieves an onset F1-score of 97.3% and a note value accuracy of 83.3% on the ASAP dataset. It generalizes well across time signatures, including those not seen during training, and produces readable score output. Fine-tuning on instrument-specific datasets further improves performance by capturing characteristic rhythmic and melodic patterns. This work contributes a robust and flexible framework for beat-based MIDI quantization using transformer models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a transformer-based method for rhythm quantization of performance MIDI using a priori beat annotations. It covers dataset preparation from ASAP, a beat-based pre-quantization step to align performance onsets to score positions, a custom MIDI tokenizer, and adaptation of the T5 architecture. The approach includes augmentations (transposition, note deletion, time jitter) and fine-tuning on instrument-specific data. Systematic evaluation yields 97.3% onset F1-score and 83.3% note value accuracy on ASAP, with claims of generalization to unseen time signatures, readable score output, and qualitative superiority over probabilistic and deep-learning baselines.
Significance. If the results hold under realistic beat conditions, this would advance automatic music transcription by filling a gap in beat-based rhythm quantization with a flexible transformer framework. The synchronized data processing, targeted augmentations, and fine-tuning demonstrate practical strengths, and the generalization across time signatures plus readable outputs could improve downstream score generation in AMT pipelines.
major comments (2)
- [Evaluation] Evaluation section: The headline metrics (onset F1-score of 97.3% and note value accuracy of 83.3%) are obtained only after the beat-based pre-quantization alignment that relies on error-free ground-truth beat annotations from ASAP. No ablation or stress test with perturbed beats (timing jitter, missed beats) is reported, which is load-bearing for assessing the transformer's independent contribution and for validating the generalization claim to unseen time signatures.
- [Evaluation] Evaluation section: Quantitative details on baselines (specific implementations, error analysis, data splits, and controls for dataset leakage) are not supplied alongside the reported metrics, weakening the comparison to state-of-the-art probabilistic and deep-learning models.
minor comments (2)
- [Abstract] Abstract and Section 3 (dataset preparation): The MIDI tokenizer description would benefit from an explicit example of token sequences for synchronized score-performance pairs to aid reproducibility.
- [Conclusion] The paper could add a short limitations paragraph noting the dependence on accurate beat annotations, as this assumption is highlighted in the method but not foregrounded for readers.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We agree that expanding the evaluation with robustness tests and baseline details will strengthen the manuscript and will revise accordingly.
- Referee: [Evaluation] Evaluation section: The headline metrics (onset F1-score of 97.3% and note value accuracy of 83.3%) are obtained only after the beat-based pre-quantization alignment that relies on error-free ground-truth beat annotations from ASAP. No ablation or stress test with perturbed beats (timing jitter, missed beats) is reported, which is load-bearing for assessing the transformer's independent contribution and for validating the generalization claim to unseen time signatures.
  Authors: We acknowledge that the reported metrics use ground-truth ASAP beat annotations, as the method is explicitly designed for a priori beat information (see title and Section 3). The beat-based pre-quantization is an integral pipeline step that provides a coarse alignment, after which the transformer refines note values. To directly address the concern, we will add an ablation study in the revised Evaluation section introducing controlled perturbations to the input beats (Gaussian timing jitter of 50-200 ms and random missed beats at 5-20% rates) and report the resulting onset F1 and note-value accuracy. This will quantify the transformer's independent contribution beyond perfect beats and extend the generalization analysis to unseen time signatures under realistic conditions.
  Revision: yes
- Referee: [Evaluation] Evaluation section: Quantitative details on baselines (specific implementations, error analysis, data splits, and controls for dataset leakage) are not supplied alongside the reported metrics, weakening the comparison to state-of-the-art probabilistic and deep-learning models.
  Authors: We agree that additional quantitative details are needed for transparent comparison. In the revised manuscript we will expand the Evaluation section to include: (i) exact implementations and hyperparameters of the probabilistic (e.g., HMM-based) and deep-learning baselines, (ii) per-category error analysis (onset, offset, duration errors), (iii) explicit confirmation of identical train/test splits across all models with no leakage (ASAP pieces were partitioned by piece ID), and (iv) a reproducibility note with code release. These additions will allow readers to assess the comparisons rigorously while preserving the qualitative examples already present.
  Revision: yes
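Partitioning by piece ID, as the rebuttal describes, can be sketched as follows. This is a generic leakage-free split and not the authors' code; the hashing scheme, the function names, and the piece IDs are all hypothetical.

```python
import hashlib


def assign_split(piece_id, test_fraction=0.1):
    """Deterministically assign a piece to 'train' or 'test' by hashing its ID."""
    digest = hashlib.sha256(piece_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_performances(performances, test_fraction=0.1):
    """Group (piece_id, performance) pairs so no piece spans both splits."""
    splits = {"train": [], "test": []}
    for piece_id, performance in performances:
        splits[assign_split(piece_id, test_fraction)].append((piece_id, performance))
    return splits


# Several performances of the same piece always land in the same split.
perfs = [(f"piece_{i % 7}", f"perf_{i}") for i in range(50)]
splits = split_performances(perfs, test_fraction=0.3)
```

Because the assignment depends only on the piece ID, every model evaluated with this function sees identical splits, which is what the comparison to baselines requires.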
Circularity Check
No circularity in derivation or evaluation chain
The paper describes an empirical deep-learning pipeline: a T5-based transformer is trained on synchronized performance/score MIDI data from the external ASAP dataset after a beat-based alignment preprocessing step. Reported metrics (onset F1 97.3%, note-value accuracy 83.3%) are obtained by direct evaluation on held-out test data, not by algebraic reduction to fitted parameters or by self-citation. No equations, uniqueness theorems, or ansatzes are presented that loop back to the inputs; the method is a standard supervised sequence-to-sequence model whose performance claims rest on external data rather than internal construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- model hyperparameters and weights
axioms (1)
- domain assumption: Provided beat annotations are accurate and sufficient for alignment