A Distribution Matching Approach to Neural Piano Transcription with Optimal Transport
Pith reviewed 2026-05-19 22:42 UTC · model grok-4.3
The pith
Automatic piano transcription improves when framed as optimal transport between note distributions rather than frame-by-frame classification.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that formalizing automatic piano transcription as an optimal transport problem, where the model minimizes the cost of transporting a predicted distribution of note events to the ground-truth distribution over time and frequency, leads to perceptually relevant optimization that accommodates temporal misalignment. This is implemented in a convolutional recurrent neural network with a harmonics-aware attention mechanism, and experiments on the MAESTRO dataset confirm improved onset detection.
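To make the misalignment argument concrete, the sketch below compares a frame-wise binary cross-entropy with a one-dimensional transport cost on a single onset. It is an illustration only: it uses the closed-form 1-D Wasserstein-1 distance between equal-mass sequences, not the paper's loss, and the helper names (`bce`, `wasserstein_1d`, `onset_at`) and the 20-frame toy signal are assumptions made for this example.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean frame-wise binary cross-entropy between two activation sequences."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(pred)


def wasserstein_1d(pred, target):
    """1-D transport cost (in frames) between equal-mass sequences.

    For unit frame spacing this is the sum of absolute differences
    between the two cumulative sums.
    """
    cost, cum = 0.0, 0.0
    for p, t in zip(pred, target):
        cum += p - t
        cost += abs(cum)
    return cost


def onset_at(frame, n_frames=20):
    """Hypothetical toy signal: a single unit-mass onset at `frame`."""
    return [1.0 if i == frame else 0.0 for i in range(n_frames)]


target = onset_at(10)
near = onset_at(11)  # prediction one frame late
far = onset_at(15)   # prediction five frames late

print("BCE:", bce(near, target), bce(far, target))
print("OT :", wasserstein_1d(near, target), wasserstein_1d(far, target))
```

Under these assumptions the cross-entropy is the same whether the predicted onset is one frame or five frames late, while the transport cost grows from 1 to 5 frames, which is the behaviour the core claim relies on.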
What carries the argument
Optimal transport loss for matching predicted and ground-truth note event distributions across time and frequency, supported by a harmonics-aware attention mechanism in a CRNN.
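The paper excerpt quoted in the reference graph on this page describes the CRNN front end as three 2D convolutions with 7×7 kernels, strides (1,2), (1,2), (2,1), and channel counts 64, 128, 256. The sketch below only does the shape arithmetic for that stack. The padding of 3, the (time, frequency) axis order, and the 512×352 input size are assumptions for illustration; the excerpt does not state them.

```python
def conv_out(size, kernel=7, stride=1, padding=3):
    """Output length of one convolution axis (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def stack_shapes(n_frames, n_bins):
    """Return (channels, frames, bins) after each of the three layers."""
    strides = [(1, 2), (1, 2), (2, 1)]  # assumed order: (time, frequency)
    channels = [64, 128, 256]
    shapes = []
    t, f = n_frames, n_bins
    for c, (stride_t, stride_f) in zip(channels, strides):
        t = conv_out(t, stride=stride_t)
        f = conv_out(f, stride=stride_f)
        shapes.append((c, t, f))
    return shapes


# Hypothetical input: 512 frames by 352 frequency bins, 1 channel.
for shape in stack_shapes(512, 352):
    print(shape)
```

With these assumed values the time axis is halved once (512 to 256) and the frequency axis is halved twice (352 to 88), matching the stated downsampling factors of 2 and 4.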
If this is right
- State-of-the-art performance in onset detection is achieved on the MAESTRO dataset.
- The optimal transport loss can be applied to existing models, not only to the proposed CRNN.
- Optimization focuses on perceptually relevant aspects by handling temporal misalignments.
- Overall transcription quality improves due to the distribution matching approach.
Where Pith is reading between the lines
- This approach could be tested on other polyphonic music transcription tasks beyond piano.
- Real-time applications might benefit if the transport computation can be made efficient.
- Combining the OT loss with different network architectures could reveal its general usefulness.
Load-bearing premise
The argument rests on the premise that matching note distributions with optimal transport yields transcriptions of higher perceptual quality than standard frame-level classification.
What would settle it
Train the same CRNN with a standard binary cross-entropy loss in place of the optimal transport loss. If onset detection F1 on the MAESTRO test set is equal or better without the OT loss, the claim fails.
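A comparison of this kind needs a fixed onset metric. The sketch below scores note onsets with a tolerance window. It is a simplified stand-in, not the paper's evaluation code: it uses greedy one-to-one matching, whereas a library such as mir_eval may match notes differently, and the 50 ms tolerance and the toy note lists are assumptions for this example.

```python
def onset_f1(ref, est, tolerance=0.05):
    """Precision, recall and F1 for note onsets.

    ref, est: lists of (onset_seconds, midi_pitch).
    An estimated note is a hit if an unmatched reference note has the
    same pitch and an onset within `tolerance` seconds (greedy matching).
    """
    matched = set()
    true_positives = 0
    for onset, pitch in est:
        for j, (ref_onset, ref_pitch) in enumerate(ref):
            if j in matched or pitch != ref_pitch:
                continue
            if abs(onset - ref_onset) <= tolerance:
                matched.add(j)
                true_positives += 1
                break
    precision = true_positives / len(est) if est else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical reference and estimated notes.
ref_notes = [(0.50, 60), (1.00, 64), (1.50, 67)]
est_notes = [(0.52, 60), (1.10, 64), (1.49, 67), (2.00, 72)]

precision, recall, f1 = onset_f1(ref_notes, est_notes)
print(precision, recall, f1)
```

In the toy data, two of the four estimated notes fall inside the window, so precision is 2/4 and recall is 2/3. Running the same function on outputs from the BCE-trained and OT-trained models would give the head-to-head comparison described above.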
Original abstract
This paper describes a novel paradigm that formalizes automatic piano transcription (APT) as an optimal transport (OT) problem, not as a frame-level multi-label binary classification problem. Our method learns to minimize the cost of transporting a predicted distribution of note events to the ground-truth distribution over time and frequency. The OT loss can thus accommodate temporal misalignment, leading to perceptually relevant optimization. We also propose a convolutional recurrent neural network (CRNN) with a harmonics-aware attention mechanism to capture the spectro-temporal dependencies inherent in music. Our experiments using the MAESTRO dataset showed that our method attained a state-of-the-art performance in onset detection. We confirmed the versatility of the OT loss in application to existing models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes reframing automatic piano transcription (APT) as an optimal transport (OT) problem that minimizes the cost of transporting a predicted distribution of note events to the ground-truth distribution over time and frequency, rather than using frame-level multi-label binary classification. It introduces a CRNN architecture incorporating a harmonics-aware attention mechanism and reports state-of-the-art onset detection performance on the MAESTRO dataset while claiming that the OT loss is versatile and can be applied to existing models.
Significance. If the central claims hold, the work could meaningfully advance music information retrieval by shifting supervision from rigid frame-level classification to distribution matching that tolerates musically plausible temporal misalignments. The harmonics-aware attention component may also offer a useful inductive bias for modeling spectro-temporal structure in polyphonic piano signals. The approach is conceptually clean and could generalize beyond piano transcription.
major comments (3)
- [Abstract] Abstract: The claim that the method 'attained a state-of-the-art performance in onset detection' on MAESTRO is unsupported by any quantitative metrics, baseline comparisons, ablation results, or implementation details. Without these, it is impossible to determine whether gains are attributable to the OT loss, the harmonics-aware CRNN, or other training choices.
- [Abstract] Abstract / Experiments: The assertion that the OT loss 'can thus accommodate temporal misalignment, leading to perceptually relevant optimization' is load-bearing for the central contribution, yet the reported evaluation uses only frame-level onset F1 and supplies no perceptual metrics, offset/velocity accuracy, or comparisons against other misalignment-robust objectives (e.g., CTC-style or soft-alignment losses).
- [Abstract] Abstract: The versatility claim ('We confirmed the versatility of the OT loss in application to existing models') is stated without identifying the models, reporting the resulting metrics, or showing ablations that isolate the OT loss from architectural changes.
minor comments (1)
- [Abstract] The abstract would benefit from a concise statement of the precise OT formulation (cost function, marginals, and solver) and the exact definition of the note-event distribution.
Simulated Author's Rebuttal
We thank the referee for their valuable comments on our manuscript. We have carefully considered each point and made revisions to strengthen the presentation of our results and claims. Below, we provide point-by-point responses to the major comments.
-
Referee: [Abstract] Abstract: The claim that the method 'attained a state-of-the-art performance in onset detection' on MAESTRO is unsupported by any quantitative metrics, baseline comparisons, ablation results, or implementation details. Without these, it is impossible to determine whether gains are attributable to the OT loss, the harmonics-aware CRNN, or other training choices.
Authors: We agree that the abstract would benefit from more specific support for this claim. The full manuscript includes quantitative results in Section 4, with comparisons to baselines such as the standard CRNN and other SOTA methods, along with ablations. Implementation details are in the appendix. In the revised version, we have updated the abstract to include the achieved onset F1 score and a brief mention of the key comparisons, while directing readers to the experiments section for full details. This helps attribute the gains appropriately. revision: yes
-
Referee: [Abstract] Abstract / Experiments: The assertion that the OT loss 'can thus accommodate temporal misalignment, leading to perceptually relevant optimization' is load-bearing for the central contribution, yet the reported evaluation uses only frame-level onset F1 and supplies no perceptual metrics, offset/velocity accuracy, or comparisons against other misalignment-robust objectives (e.g., CTC-style or soft-alignment losses).
Authors: The OT loss is indeed central, as it allows for distribution matching that tolerates small temporal shifts common in musical performances. While we report the standard frame-level onset F1 as per the MAESTRO benchmark conventions, we recognize the value of additional metrics. In the revision, we have added a paragraph in the experiments section discussing the perceptual relevance and included comparisons to alternative losses like CTC where applicable. We note that offset and velocity are secondary in this work focused on onset detection, but we can expand if required. revision: partial
-
Referee: [Abstract] Abstract: The versatility claim ('We confirmed the versatility of the OT loss in application to existing models') is stated without identifying the models, reporting the resulting metrics, or showing ablations that isolate the OT loss from architectural changes.
Authors: We appreciate this observation. The full paper details the application to existing models in Section 4.4, including specific models like the Onsets and Frames baseline and a harmonics-aware variant, with reported metrics showing improvements when using the OT loss. We have added an ablation study isolating the loss function. The revised abstract now briefly references these findings. revision: yes
Circularity Check
No significant circularity; derivation introduces independent OT loss
The paper's central claim formalizes APT as an optimal transport problem between predicted and ground-truth note distributions, with a new CRNN architecture incorporating harmonics-aware attention. No equations or steps in the provided abstract reduce the OT loss or claimed perceptual improvements to self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations from the same authors. The approach is presented as a novel paradigm distinct from frame-level classification, with experiments on MAESTRO and versatility claims that do not collapse to prior inputs by construction. This is a standard case of an independent methodological contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Optimal transport distance between predicted and ground-truth note distributions provides a perceptually relevant optimization target for piano transcription.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
unclear: Relation between the paper passage and the cited Recognition theorem.
We formalize APT as distribution matching task with optimal transport (OT). ... $C'(u_i, v_j) = \min(|t_i - t_j|, \tau_0)$ if $f_i = f_j$, $\tau_1$ otherwise. ... $\mathcal{L}_{\mathrm{OT}}(M, \mu) = d'_C(M, \mu) + \lambda \mathcal{L}_{\mathrm{mass}}$
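The ground cost in the quoted passage can be sketched as a small cost-matrix builder. This is an illustration under stated assumptions: `f` is read here as a pitch (or frequency-bin) index, times are given in frames, and the function name `cost_matrix`, the toy events, and the values of `tau0` and `tau1` are hypothetical. The passage gives neither the thresholds nor the solver, so only the cost itself is shown.

```python
def cost_matrix(pred_events, true_events, tau0, tau1):
    """Build C'[i][j] between predicted and ground-truth events.

    Each event is (time, pitch). The cost is the time gap truncated at
    tau0 when the pitches agree, and the constant tau1 otherwise.
    """
    return [
        [min(abs(t_i - t_j), tau0) if f_i == f_j else tau1
         for (t_j, f_j) in true_events]
        for (t_i, f_i) in pred_events
    ]


# Hypothetical events: (frame index, MIDI pitch).
pred = [(10, 60), (30, 64)]
true = [(12, 60), (50, 64), (12, 62)]

C = cost_matrix(pred, true, tau0=5, tau1=10)
for row in C:
    print(row)
```

In this toy case a same-pitch event two frames away costs 2, a same-pitch event twenty frames away is capped at `tau0`, and any pitch mismatch costs `tau1` regardless of timing.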
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION. In automatic piano transcription (APT) that aims to estimate a piano-roll representation (MIDI data) from a music recording [1], deep learning models play a central role. The standard paradigm is to convert a music spectrogram over time frames and frequency bins into a piano-roll matrix over time frames and semitone-level pitches through fr...
-
[2]
RELATED WORK. Early APT models [7] perform frame-wise classification with CNNs to judge the presence of notes within each frame. Onsets&Frames model [2] predicted note onsets and frame-wise activations separately and then combining them afterwards. A regression-based approach was proposed for predicting high-resolution onset and offset times [3]. More ...
-
[3]
PROPOSED METHOD. This section introduces the problem formulation of APT with OT and the proposed model architecture. 3.1. Problem formulation. Let $X \in \mathbb{R}^{T \times F}$ be the time-frequency representation of a target music recording, where $T$ is the number of frames and $F$ is the number of frequency bins. The ground-truth is a set of $N$ notes $G = \{(s_i, e_i, p_i)\}_{i=1}^{N}$, where $s_i$, $e_i$, p...
-
[4]
spectrogram first passes through a stack of three 2D convolutional layers. Each layer uses a 7×7 kernel and strides of (1,2), (1,2), and (2,1), progressively downsampling the temporal and frequency dimensions by factors of 2 and 4. The channel count increases from 1 to 64, 128, and finally 256. Harmonics-aware attention block: It consists of nine stacke...
-
[5]
EVALUATION. This section reports comparative evaluation conducted for validating the effectiveness of the OT loss in APT. 4.1. Experimental Conditions. The MAESTRO [26] dataset was used for evaluation. It contains over 200 hours of piano recordings with aligned MIDI data. We used the official train/validation/test split. Each recording was resampled a...
-
[6]
CONCLUSION. This paper introduced a novel paradigm that formalizes APT as an OT problem, moving beyond the temporal rigidity limitations of traditional frame-level binary classification problem. We proposed a new CRNN model to capture the spectro-temporal dependencies of music and trained it in a perceptually reasonable manner to accommodate tempora...
-
[7]
Emmanouil Benetos, Simon Dixon et al., “Automatic music transcription: An overview,” IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 20–30, 2018
-
[8]
Curtis Hawthorne, Erich Elsen et al., “Onsets and frames: Dual-objective piano transcription,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2018, pp. 50–57
-
[9]
Qiuqiang Kong, Bochen Li et al., “High-resolution piano transcription with pedals by regressing onset and offset times,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 29, pp. 3707–3717, 2021
-
[10]
Weixing Wei, Peilin Li et al., “HPPNet: Modeling the harmonic structure and pitch invariance in piano transcription,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2022, pp. 709–716
-
[11]
Keisuke Toyama, Taketo Akama et al., “Automatic piano transcription with hierarchical frequency-time transformer,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2023, pp. 215–222
-
[12]
Cédric Villani et al., Optimal transport: old and new, vol. 338, Springer, 2008
-
[13]
Siddharth Sigtia, Emmanouil Benetos et al., “An end-to-end neural network for polyphonic piano music transcription,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 24, no. 5, pp. 927–939, 2016
-
[14]
Ashish Vaswani, Noam Shazeer et al., “Attention is all you need,” in Annual Conference on Neural Information Processing Systems (NIPS), 2017, pp. 5998–6008
-
[15]
Longshen Ou, Ziyi Guo et al., “Exploring transformer’s potential on automatic piano transcription,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 776–780
-
[16]
Curtis Hawthorne, Ian Simon et al., “Sequence-to-sequence piano transcription with transformers,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2021, pp. 246–253
-
[17]
Josh Gardner, Ian Simon et al., “MT3: Multi-task multitrack music transcription,” in Proceedings of International Conference on Learning Representations (ICLR), 2022
-
[18]
Weixing Wei, Jiahao Zhao et al., “Streaming piano transcription based on consistent onset and offset decoding with sustain pedal detection,” in Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR), 2024, pp. 906–913
-
[19]
Dichucheng Li, Yongyi Zang et al., “Piano transcription by hierarchical language modeling with pretrained roll-based encoders,” in 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2025, pp. 1–5
-
[20]
Ruimin Wu, Xianke Wang et al., “Piano transcription with harmonic attention,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1256–1260
-
[21]
Qi Wang, Mingkuan Liu et al., “Harmonic-aware frequency and time attention for automatic piano transcription,” IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 32, pp. 3492–3506, 2024
-
[22]
Martin Arjovsky, Soumith Chintala et al., “Wasserstein generative adversarial networks,” in International Conference on Machine Learning (ICML), PMLR, 2017, pp. 214–223
-
[23]
Marco Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” Advances in Neural Information Processing Systems (NIPS), vol. 26, 2013
-
[24]
Xingchao Liu, Chengyue Gong et al., “Flow straight and fast: Learning to generate and transfer data with rectified flow,” arXiv preprint arXiv:2209.03003, 2022
-
[25]
David Alvarez-Melis and Tommi Jaakkola, “Gromov-Wasserstein alignment of word embedding spaces,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018, pp. 1881–1890
-
[26]
Jingjing Xu, Hao Zhou et al., “Vocabulary learning via optimal transport for neural machine translation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021, pp. 7361–7373
-
[27]
F Wiering, R Typke et al., “Transportation distances in music notation retrieval,” Computing in Musicology, vol. 13, pp. 113–128, 2004
-
[28]
Rémi Flamary, Cédric Févotte et al., “Optimal spectral transportation with application to music transcription,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 703–711
-
[29]
Laetitia Chapel, Rémi Flamary et al., “Unbalanced optimal transport through non-negative penalized linear regression,” Advances in Neural Information Processing Systems (NeurIPS), vol. 34, pp. 23270–23282, 2021
-
[30]
Thibault Séjourné, Gabriel Peyré et al., “Unbalanced optimal transport, from theory to numerics,” Handbook of Numerical Analysis, vol. 24, pp. 407–471, 2023
-
[31]
Judith C Brown, “Calculation of a constant Q spectral transform,” The Journal of the Acoustical Society of America (JASA), vol. 89, no. 1, pp. 425–434, 1991
-
[32]
Curtis Hawthorne, Andriy Stasyuk et al., “Enabling factorized piano music modeling and generation with the MAESTRO dataset,” in Proceedings of International Conference on Learning Representations (ICLR), 2019, pp. 1–6
-
[33]
Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in Proceedings of International Conference on Learning Representations (ICLR), 2015, pp. 1–9
-
[34]
Colin Raffel, Brian McFee et al., “mir_eval: A transparent implementation of common MIR metrics,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2014, pp. 1–6
-
[35]
Yujia Yan and Zhiyao Duan, “Scoring time intervals using non-hierarchical transformer for automatic piano transcription,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2024, pp. 973–980