
arXiv: 2604.22290 · v1 · submitted 2026-04-24 · cs.SD · cs.MM · eess.AS

Transformer-Based Rhythm Quantization of Performance MIDI Using Beat Annotations

Pith reviewed 2026-05-08 09:28 UTC · model grok-4.3

classification: cs.SD · cs.MM · eess.AS
keywords: MIDI quantization · rhythm transcription · transformer model · beat annotations · automatic music transcription · musical notation · performance to score · score generation

The pith

Transformer models can quantize rhythms in MIDI performances into readable musical notation when given beat annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a deep learning system to convert raw MIDI performances into properly timed musical scores. It adapts a transformer to work with paired performance and score data that has first been aligned using known beat positions. The method includes dataset preparation steps, a beat-based alignment technique, and a dedicated MIDI tokenizer. A sympathetic reader would care because turning live or improvised playing into accurate sheet music remains a core challenge in automatic transcription, and better quantization directly improves the readability of the final output.

Core claim

The authors establish that an adapted T5 transformer, trained on beat-synchronized score and performance MIDI together with data augmentations such as transposition, note deletion, and time jitter, performs effective rhythm quantization. This produces high accuracy on onset detection and note value assignment while generalizing across time signatures, including those absent from the training data, and yields output that forms readable musical scores. Fine-tuning on instrument-specific collections further improves capture of characteristic rhythmic and melodic patterns.
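
As a rough illustration of the three augmentations named above, the following Python sketch applies them to a note-aligned performance/score pair. The `Note` class, the `augment_pair` function, its parameters, and the assumption that both note lists are index-aligned are illustrative choices, not details taken from the paper. Time jitter is applied to the performance side only, as the abstract describes.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Note:
    pitch: int       # MIDI pitch number, 0-127
    onset: float     # seconds (performance) or beats (score)
    duration: float  # same unit as onset


def augment_pair(perf, score, rng, semitones=0, drop_prob=0.0, sigma=0.0):
    """Augment an index-aligned (performance, score) pair of note lists.

    Hypothetical helper: the pairing scheme and parameters are assumptions.
    """
    if len(perf) != len(score):
        raise ValueError("perf and score must be index-aligned")
    if any(not 0 <= n.pitch + semitones <= 127 for n in list(perf) + list(score)):
        raise ValueError("transposition leaves the MIDI pitch range")

    # Note deletion: drop the same notes on both sides to keep the alignment.
    keep = [rng.random() >= drop_prob for _ in perf]
    perf = [n for n, k in zip(perf, keep) if k]
    score = [n for n, k in zip(score, keep) if k]

    # Transposition on both sides; Gaussian onset jitter on the performance only.
    perf = [replace(n, pitch=n.pitch + semitones,
                    onset=max(0.0, n.onset + rng.gauss(0.0, sigma)))
            for n in perf]
    score = [replace(n, pitch=n.pitch + semitones) for n in score]
    return perf, score
```

Deleting the same indices on both sides keeps every remaining performance note paired with its score note, which a supervised sequence-to-sequence setup of this kind would presumably require.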

What carries the argument

An adapted T5 transformer that processes beat-aligned performance and score MIDI sequences, supported by a custom MIDI tokenizer and a beat-based pre-quantization step that unifies timing information.
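
One plausible reading of the beat-based pre-quantization step is a change of time base: performance times in seconds are re-expressed in beat units by interpolating between the annotated beats. The sketch below shows that idea only. The function name `seconds_to_beats` and the choice of piecewise-linear interpolation are assumptions, and the paper's actual procedure may differ.

```python
from bisect import bisect_right


def seconds_to_beats(time_sec, beat_times):
    """Map a time in seconds to a fractional beat position.

    beat_times holds the annotated beat times in seconds; beat k sits at
    position k. Linear interpolation is an assumption of this sketch.
    """
    if len(beat_times) < 2 or any(b <= a for a, b in zip(beat_times, beat_times[1:])):
        raise ValueError("need at least two strictly increasing beat times")

    # Find the beat interval containing time_sec. Times before the first or
    # after the last beat are extrapolated from the nearest interval.
    i = bisect_right(beat_times, time_sec) - 1
    i = min(max(i, 0), len(beat_times) - 2)
    start, end = beat_times[i], beat_times[i + 1]
    return i + (time_sec - start) / (end - start)
```

For example, with beats annotated at 0.0, 0.5, 1.1, and 1.6 seconds, an onset at 0.8 s lies halfway between the second and third beat and maps to a beat position of about 1.5, regardless of the local tempo change.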

If this is right

  • The model produces musical scores that musicians can read directly from performance MIDI input.
  • Accuracy holds across different time signatures even when some are absent from training data.
  • Fine-tuning on instrument-specific collections improves modeling of typical rhythmic patterns for that instrument.
  • The framework supports objective evaluation through score-level metrics that compare quantized output against reference notation.
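
To make the score-level metrics in the last point concrete, here is a minimal sketch of an onset F1-score and a note value accuracy computed on quantized notes. The matching rule (exact equality of pitch and quantized onset) and both function names are assumptions for illustration; the paper's exact metric definitions are not given on this page.

```python
from collections import Counter
from fractions import Fraction


def _onset_keys(notes):
    # Each note is a (pitch, onset, note_value) tuple in beat units.
    return Counter((pitch, onset) for pitch, onset, _ in notes)


def onset_f1(pred, ref):
    """F1 over notes whose pitch and quantized onset both match exactly."""
    pred_keys, ref_keys = _onset_keys(pred), _onset_keys(ref)
    tp = sum((pred_keys & ref_keys).values())
    if tp == 0:
        return 0.0
    precision = tp / sum(pred_keys.values())
    recall = tp / sum(ref_keys.values())
    return 2 * precision * recall / (precision + recall)


def note_value_accuracy(pred, ref):
    """Among onset-matched notes, the share whose note value also matches."""
    matched = sum((_onset_keys(pred) & _onset_keys(ref)).values())
    if matched == 0:
        return 0.0
    correct = sum((Counter(pred) & Counter(ref)).values())
    return correct / matched


# Hypothetical three-note example; positions are exact fractions of a beat.
ref = [(60, Fraction(0), Fraction(1)),
       (64, Fraction(1), Fraction(1, 2)),
       (67, Fraction(3, 2), Fraction(1, 2))]
pred = [(60, Fraction(0), Fraction(1)),
        (64, Fraction(1), Fraction(1)),     # right onset, wrong note value
        (67, Fraction(2), Fraction(1, 2))]  # wrong onset
```

Using `Fraction` rather than floats keeps positions such as triplet subdivisions exactly comparable.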

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Pairing this quantizer with a separate beat-tracking stage could support fully automatic pipelines that need no manual annotations.
  • The same alignment and tokenization approach might transfer to other symbolic music tasks that require correcting timing deviations.
  • Training with paired score information suggests a route for guiding future models toward notation that respects musical conventions rather than raw timing.

Load-bearing premise

Accurate beat annotations are supplied in advance to align performance timing with score data.

What would settle it

Measuring substantially lower accuracy when the same model receives automatically estimated beats instead of ground-truth annotations would show that the claimed performance depends on perfect prior beat information.
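
Short of running a real beat tracker, a first approximation of such a test is to degrade the ground-truth beats synthetically before they reach the quantizer. The sketch below is one way to do that. The function `perturb_beats` and its parameters are hypothetical, and synthetic noise is only a stand-in for the error patterns of an actual beat estimator.

```python
import random


def perturb_beats(beat_times, jitter_sigma, miss_prob, rng):
    """Return a degraded copy of a beat annotation (times in seconds).

    Each beat is dropped with probability miss_prob; surviving beats get
    Gaussian timing noise with standard deviation jitter_sigma.
    """
    kept = [t for t in beat_times if rng.random() >= miss_prob]
    jittered = [t + rng.gauss(0.0, jitter_sigma) for t in kept]
    # Large jitter can swap neighbouring beats, so restore time order.
    return sorted(jittered)


# Illustrative settings: 50 ms jitter and a 10% miss rate.
clean = [0.0, 0.5, 1.0, 1.5, 2.0]
noisy = perturb_beats(clean, jitter_sigma=0.05, miss_prob=0.10, rng=random.Random(0))
```

Evaluating the same trained model on clean and perturbed beats, with everything else fixed, would show how quickly the reported metrics fall off as beat quality drops.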

Figures

Figures reproduced from arXiv: 2604.22290 by Maximilian Wachter, Michael Heizmann, Sebastian Murgul.

Figure 1: Overview of the data preprocessing steps, from raw MIDI and …
Figure 2: Comparison of Onset F1-score and Note Value Accuracy in percent for …

Original abstract

Rhythm transcription is a key subtask of notation-level Automatic Music Transcription (AMT). While deep learning models have been extensively used for detecting the metrical grid in audio and MIDI performances, beat-based rhythm quantization remains largely unexplored. In this work, we introduce a novel deep learning approach for quantizing MIDI performances using a priori beat information. Our method leverages the transformer architecture to effectively process synchronized score and performance data for training a quantization model. Key components of our approach include dataset preparation, a beat-based pre-quantization method to align performance and score times within a unified framework, and a MIDI tokenizer tailored for this task. We adapt a transformer model based on the T5 architecture to meet the specific requirements of rhythm quantization. The model is evaluated using a set of score-level metrics designed for objective assessment of quantization performance. Through systematic evaluation, we optimize both data representation and model architecture. Additionally, we apply performance and score augmentations, such as transposition, note deletion, and performance-side time jitter, to enhance the model's robustness. Finally, a qualitative analysis compares our model's quantization performance against state-of-the-art probabilistic and deep-learning models on various example pieces. Our model achieves an onset F1-score of 97.3% and a note value accuracy of 83.3% on the ASAP dataset. It generalizes well across time signatures, including those not seen during training, and produces readable score output. Fine-tuning on instrument-specific datasets further improves performance by capturing characteristic rhythmic and melodic patterns. This work contributes a robust and flexible framework for beat-based MIDI quantization using transformer models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a transformer-based method for rhythm quantization of performance MIDI using a priori beat annotations. It covers dataset preparation from ASAP, a beat-based pre-quantization step to align performance onsets to score positions, a custom MIDI tokenizer, and adaptation of the T5 architecture. The approach includes augmentations (transposition, note deletion, time jitter) and fine-tuning on instrument-specific data. Systematic evaluation yields 97.3% onset F1-score and 83.3% note value accuracy on ASAP, with claims of generalization to unseen time signatures, readable score output, and qualitative superiority over probabilistic and deep-learning baselines.

Significance. If the results hold under realistic beat conditions, this would advance automatic music transcription by filling a gap in beat-based rhythm quantization with a flexible transformer framework. The synchronized data processing, targeted augmentations, and fine-tuning demonstrate practical strengths, and the generalization across time signatures plus readable outputs could improve downstream score generation in AMT pipelines.

major comments (2)
  1. [Evaluation] Evaluation section: The headline metrics (onset F1-score of 97.3% and note value accuracy of 83.3%) are obtained only after the beat-based pre-quantization alignment, which relies on error-free ground-truth beat annotations from ASAP. No ablation or stress test with perturbed beats (timing jitter, missed beats) is reported. Such a test is needed to assess the transformer's independent contribution and to validate the generalization claim for unseen time signatures.
  2. [Evaluation] Evaluation section: Quantitative details on baselines (specific implementations, error analysis, data splits, and controls for dataset leakage) are not supplied alongside the reported metrics, weakening the comparison to state-of-the-art probabilistic and deep-learning models.
minor comments (2)
  1. [Abstract] Abstract and Section 3 (dataset preparation): The MIDI tokenizer description would benefit from an explicit example of token sequences for synchronized score-performance pairs to aid reproducibility.
  2. [Conclusion] The paper could add a short limitations paragraph noting the dependence on accurate beat annotations, as this assumption is highlighted in the method but not foregrounded for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We agree that expanding the evaluation with robustness tests and baseline details will strengthen the manuscript and will revise accordingly.

  1. Referee: [Evaluation] Evaluation section: The headline metrics (onset F1-score of 97.3% and note value accuracy of 83.3%) are obtained only after the beat-based pre-quantization alignment, which relies on error-free ground-truth beat annotations from ASAP. No ablation or stress test with perturbed beats (timing jitter, missed beats) is reported. Such a test is needed to assess the transformer's independent contribution and to validate the generalization claim for unseen time signatures.

    Authors: We acknowledge that the reported metrics use ground-truth ASAP beat annotations, as the method is explicitly designed for a priori beat information (see title and Section 3). The beat-based pre-quantization is an integral pipeline step that provides a coarse alignment, after which the transformer refines note values. To directly address the concern, we will add an ablation study in the revised Evaluation section introducing controlled perturbations to the input beats (Gaussian timing jitter of 50-200ms and random missed beats at 5-20% rates) and report the resulting onset F1 and note-value accuracy. This will quantify the transformer's independent contribution beyond perfect beats and extend the generalization analysis to unseen time signatures under realistic conditions. revision: yes

  2. Referee: [Evaluation] Evaluation section: Quantitative details on baselines (specific implementations, error analysis, data splits, and controls for dataset leakage) are not supplied alongside the reported metrics, weakening the comparison to state-of-the-art probabilistic and deep-learning models.

    Authors: We agree that additional quantitative details are needed for transparent comparison. In the revised manuscript we will expand the Evaluation section to include: (i) exact implementations and hyperparameters of the probabilistic (e.g., HMM-based) and deep-learning baselines, (ii) per-category error analysis (onset, offset, duration errors), (iii) explicit confirmation of identical train/test splits across all models with no leakage (ASAP pieces were partitioned by piece ID), and (iv) a reproducibility note with code release. These additions will allow readers to assess the comparisons rigorously while preserving the qualitative examples already present. revision: yes
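
The piece-level partitioning mentioned in point (iii) of the second response can be sketched as follows. When a dataset holds several performances of the same piece, splitting by performance would let the same score appear in both training and test data, so the split is keyed on the piece ID instead. The function names, the hash-based assignment, and the 10% test fraction are assumptions for illustration, not the authors' procedure.

```python
import hashlib


def split_of(piece_id, test_fraction=0.1):
    """Deterministically assign a piece to 'train' or 'test' by hashing its ID."""
    digest = hashlib.sha256(piece_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


def partition(performances, test_fraction=0.1):
    """Split (piece_id, performance_path) pairs so no piece spans both sets."""
    splits = {"train": [], "test": []}
    for piece_id, path in performances:
        splits[split_of(piece_id, test_fraction)].append((piece_id, path))
    return splits


# Hypothetical piece IDs and file names.
performances = [
    ("bach_fugue_1", "player_a.mid"),
    ("bach_fugue_1", "player_b.mid"),
    ("chopin_etude_2", "player_c.mid"),
]
splits = partition(performances)
```

Because the assignment depends only on the piece ID, every baseline trained with the same function sees identical splits, which is the property the referee asks the authors to confirm.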

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain

The paper describes an empirical deep-learning pipeline: a T5-based transformer is trained on synchronized performance/score MIDI data from the external ASAP dataset after a beat-based alignment preprocessing step. Reported metrics (onset F1 97.3%, note-value accuracy 83.3%) are obtained by direct evaluation on held-out test data, not by algebraic reduction to fitted parameters or by self-citation. No equations, uniqueness theorems, or ansatzes are presented that loop back to the inputs; the method is a standard supervised sequence-to-sequence model whose performance claims rest on external data rather than internal construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that beat annotations provide reliable alignment and that the transformer can learn rhythmic mappings from the prepared paired data; model weights and training hyperparameters are fitted parameters.

free parameters (1)
  • model hyperparameters and weights
    Standard deep learning training fits millions of parameters to the dataset; exact values not reported in abstract.
axioms (1)
  • domain assumption: Provided beat annotations are accurate and sufficient for alignment
    The entire pipeline depends on a priori beat information to create synchronized training pairs.

