pith. sign in

arxiv: 2605.17405 · v1 · pith:FM3JGBCPnew · submitted 2026-05-17 · 💻 cs.SD · cs.MM

A Distribution Matching Approach to Neural Piano Transcription with Optimal Transport

Pith reviewed 2026-05-19 22:42 UTC · model grok-4.3

classification 💻 cs.SD cs.MM
keywords automatic piano transcriptionoptimal transportdistribution matchingCRNNonset detectionMAESTROmusic information retrievalneural networks
0
0 comments X

The pith

Automatic piano transcription improves when framed as optimal transport between note distributions rather than frame-by-frame classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that automatic piano transcription works better when the model learns to match the overall distribution of notes in time and frequency using optimal transport instead of deciding note presence independently for every audio frame. This change allows the training to ignore small timing differences that do not affect the musical result. A sympathetic reader would care because current methods often produce transcriptions with awkward timing errors that sound wrong even when the right notes are detected. The authors support this with a neural network that uses attention focused on harmonic relationships in the piano sound.

Core claim

The central claim is that formalizing automatic piano transcription as an optimal transport problem, where the model minimizes the cost of transporting a predicted distribution of note events to the ground-truth distribution over time and frequency, leads to perceptually relevant optimization that accommodates temporal misalignment. This is implemented in a convolutional recurrent neural network with a harmonics-aware attention mechanism, and experiments on the MAESTRO dataset confirm improved onset detection.

What carries the argument

Optimal transport loss for matching predicted and ground-truth note event distributions across time and frequency, supported by a harmonics-aware attention mechanism in a CRNN.

If this is right

  • State-of-the-art performance in onset detection is achieved on the MAESTRO dataset.
  • The optimal transport loss applies successfully to existing models for versatility.
  • Optimization focuses on perceptually relevant aspects by handling temporal misalignments.
  • Overall transcription quality improves due to the distribution matching approach.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could be tested on other polyphonic music transcription tasks beyond piano.
  • Real-time applications might benefit if the transport computation can be made efficient.
  • Combining the OT loss with different network architectures could reveal its general usefulness.

Load-bearing premise

That using optimal transport to match note distributions will yield transcriptions that are superior in perceptual quality compared to standard classification methods.

What would settle it

Running the same CRNN model with a standard binary cross-entropy loss instead of the optimal transport loss and observing no improvement or even worse onset detection F1 scores on the MAESTRO test set.

read the original abstract

This paper describes a novel paradigm that formalizes automatic piano transcription (APT) as an optimal transport (OT) problem, not as a frame-level multi-label binary classification problem. Our method learns to minimize the cost of transporting a predicted distribution of note events to the ground-truth distribution over time and frequency. The OT loss can thus accommodate temporal misalignment, leading to perceptually relevant optimization. We also propose a convolutional recurrent neural network (CRNN) with a harmonics-aware attention mechanism to capture the spectro-temporal dependencies inherent in music.Our experiments using the MAESTRO dataset showed that our method attained a state-of-the-art performance in onset detection. We confirmed the versatility of the OT loss in application to existing models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes reframing automatic piano transcription (APT) as an optimal transport (OT) problem that minimizes the cost of transporting a predicted distribution of note events to the ground-truth distribution over time and frequency, rather than using frame-level multi-label binary classification. It introduces a CRNN architecture incorporating a harmonics-aware attention mechanism and reports state-of-the-art onset detection performance on the MAESTRO dataset while claiming that the OT loss is versatile and can be applied to existing models.

Significance. If the central claims hold, the work could meaningfully advance music information retrieval by shifting supervision from rigid frame-level classification to distribution matching that tolerates musically plausible temporal misalignments. The harmonics-aware attention component may also offer a useful inductive bias for modeling spectro-temporal structure in polyphonic piano signals. The approach is conceptually clean and could generalize beyond piano transcription.

major comments (3)
  1. [Abstract] Abstract: The claim that the method 'attained a state-of-the-art performance in onset detection' on MAESTRO is unsupported by any quantitative metrics, baseline comparisons, ablation results, or implementation details. Without these, it is impossible to determine whether gains are attributable to the OT loss, the harmonics-aware CRNN, or other training choices.
  2. [Abstract] Abstract / Experiments: The assertion that the OT loss 'can thus accommodate temporal misalignment, leading to perceptually relevant optimization' is load-bearing for the central contribution, yet the reported evaluation uses only frame-level onset F1 and supplies no perceptual metrics, offset/velocity accuracy, or comparisons against other misalignment-robust objectives (e.g., CTC-style or soft-alignment losses).
  3. [Abstract] Abstract: The versatility claim ('We confirmed the versatility of the OT loss in application to existing models') is stated without identifying the models, reporting the resulting metrics, or showing ablations that isolate the OT loss from architectural changes.
minor comments (1)
  1. [Abstract] The abstract would benefit from a concise statement of the precise OT formulation (cost function, marginals, and solver) and the exact definition of the note-event distribution.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their valuable comments on our manuscript. We have carefully considered each point and made revisions to strengthen the presentation of our results and claims. Below, we provide point-by-point responses to the major comments.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that the method 'attained a state-of-the-art performance in onset detection' on MAESTRO is unsupported by any quantitative metrics, baseline comparisons, ablation results, or implementation details. Without these, it is impossible to determine whether gains are attributable to the OT loss, the harmonics-aware CRNN, or other training choices.

    Authors: We agree that the abstract would benefit from more specific support for this claim. The full manuscript includes quantitative results in Section 4, with comparisons to baselines such as the standard CRNN and other SOTA methods, along with ablations. Implementation details are in the appendix. In the revised version, we have updated the abstract to include the achieved onset F1 score and a brief mention of the key comparisons, while directing readers to the experiments section for full details. This helps attribute the gains appropriately. revision: yes

  2. Referee: [Abstract] Abstract / Experiments: The assertion that the OT loss 'can thus accommodate temporal misalignment, leading to perceptually relevant optimization' is load-bearing for the central contribution, yet the reported evaluation uses only frame-level onset F1 and supplies no perceptual metrics, offset/velocity accuracy, or comparisons against other misalignment-robust objectives (e.g., CTC-style or soft-alignment losses).

    Authors: The OT loss is indeed central, as it allows for distribution matching that tolerates small temporal shifts common in musical performances. While we report the standard frame-level onset F1 as per the MAESTRO benchmark conventions, we recognize the value of additional metrics. In the revision, we have added a paragraph in the experiments section discussing the perceptual relevance and included comparisons to alternative losses like CTC where applicable. We note that offset and velocity are secondary in this work focused on onset detection, but we can expand if required. revision: partial

  3. Referee: [Abstract] Abstract: The versatility claim ('We confirmed the versatility of the OT loss in application to existing models') is stated without identifying the models, reporting the resulting metrics, or showing ablations that isolate the OT loss from architectural changes.

    Authors: We appreciate this observation. The full paper details the application to existing models in Section 4.4, including specific models like the Onsets and Frames baseline and a harmonics-aware variant, with reported metrics showing improvements when using the OT loss. We have added an ablation study isolating the loss function. The revised abstract now briefly references these findings. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation introduces independent OT loss

full rationale

The paper's central claim formalizes APT as an optimal transport problem between predicted and ground-truth note distributions, with a new CRNN architecture incorporating harmonics-aware attention. No equations or steps in the provided abstract reduce the OT loss or claimed perceptual improvements to self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations from the same authors. The approach is presented as a novel paradigm distinct from frame-level classification, with experiments on MAESTRO and versatility claims that do not collapse to prior inputs by construction. This is a standard case of an independent methodological contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the approach relies on standard optimal transport assumptions for distribution matching in audio and on the suitability of the proposed network for music signals; no explicit free parameters or new invented entities are described.

axioms (1)
  • domain assumption Optimal transport distance between predicted and ground-truth note distributions provides a perceptually relevant optimization target for piano transcription.
    Invoked when the abstract states that the OT loss accommodates temporal misalignment leading to perceptually relevant optimization.

pith-pipeline@v0.9.0 · 5652 in / 1315 out tokens · 41508 ms · 2026-05-19T22:42:40.037720+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

  1. [1]

    INTRODUCTION In automatic piano transcription (APT) that aims to estimate a piano-roll representation (MIDI data) from a music record- ing [1], deep learning models play a central role. The standard paradigm is to convert a music spectrogram over time frames and frequency bins into a piano-roll matrix over time frames and semitone-level pitches through fr...

  2. [2]

    On- sets&Frames model [2] predicted note onsets and frame-wise activations separately and then combining them afterwards

    RELA TED WORK Early APT models [7] perform frame-wise classification with CNNs to judge the presence of notes within each frame. On- sets&Frames model [2] predicted note onsets and frame-wise activations separately and then combining them afterwards. A regression-based approach was proposed for predicting high- resolution onset and offset times [3]. More ...

  3. [3]

    PROPOSED METHOD This section introduces the problem formulation of APT with OT and the proposed model architecture. 3.1. Problem formulation LetX∈R T×F be the time-frequency representation of a targe music recording, whereTis the number of frames and Fis the number of frequency bins. The ground-truth is a set ofNnotesG= (s i, ei, pi)N i=1, wheres i, ei, p...

  4. [4]

    Each layer uses a7×7kernel and strides of (1,2), (1,2), and (2,1), progressively downsampling the tem- poral and frequency dimensions by factors of 2 and 4

    spectrogram first passes through a stack of three 2D con- volutional layers. Each layer uses a7×7kernel and strides of (1,2), (1,2), and (2,1), progressively downsampling the tem- poral and frequency dimensions by factors of 2 and 4. The channel count increases from 1 to 64, 128, and finally 256. Harmonics-aware attention block: It consists of nine stacke...

  5. [5]

    EV ALUA TION This section reports comparative evaluation conducted for val- idating the effectiveness of the OT loss in APT. 4.1. Experimental Conditions The MAESTRO [26] dataset was used for evaluation. It con- tains over 200 hours of piano recordings with aligned MIDI data. We used the official train/validation/test split. Each recording was resampled a...

  6. [6]

    We proposed a new CRNN model to capture the spectro- temporal dependencies of music and trained it in a perceptu- ally reasonable manner to accommodate temporal misalign- ment

    CONCLUSION This paper introduced a novel paradigm that formalizes APT as an OT problem, moving beyond the temporal rigidity lim- itations of traditional frame-level binary classification prob- lem. We proposed a new CRNN model to capture the spectro- temporal dependencies of music and trained it in a perceptu- ally reasonable manner to accommodate tempora...

  7. [7]

    Automatic music transcription: An overview,

    Emmanouil Benetos, Simon Dixon et al., “Automatic music transcription: An overview,”IEEE Signal Processing Maga- zine, vol. 36, no. 1, pp. 20–30, 2018

  8. [8]

    Onsets and frames: Dual-objective piano transcription,

    Curtis Hawthorne, Erich Elsen et al., “Onsets and frames: Dual-objective piano transcription,” inProceedings of the In- ternational Society for Music Information Retrieval Confer- ence (ISMIR), 2018, pp. 50–57

  9. [9]

    High-resolution piano tran- scription with pedals by regressing onset and offset times,

    Qiuqiang Kong, Bochen Li et al., “High-resolution piano tran- scription with pedals by regressing onset and offset times,” IEEE ACM Transactions on Audio Speech and Language Pro- cessing (TASLP), vol. 29, pp. 3707–3717, 2021

  10. [10]

    HPPNet: Modeling the har- monic structure and pitch invariance in piano transcription,

    Weixing Wei, Peilin Li et al., “HPPNet: Modeling the har- monic structure and pitch invariance in piano transcription,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2022, pp. 709–716

  11. [11]

    Automatic piano tran- scription with hierarchical frequency-time transformer,

    Keisuke Toyama, Taketo Akama et al., “Automatic piano tran- scription with hierarchical frequency-time transformer,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2023, pp. 215–222

  12. [12]

    338, Springer, 2008

    C ´edric Villani et al.,Optimal transport: old and new, vol. 338, Springer, 2008

  13. [13]

    An end-to- end neural network for polyphonic piano music transcription,

    Siddharth Sigtia, Emmanouil Benetos et al., “An end-to- end neural network for polyphonic piano music transcription,” IEEE ACM Transactions on Audio Speech and Language Pro- cessing (TASLP), vol. 24, no. 5, pp. 927–939, 2016

  14. [14]

    Attention is all you need,

    Ashish Vaswani, Noam Shazeer et al., “Attention is all you need,” inAnnual Conference on Neural Information Process- ing Systems (NIPS), 2017, pp. 5998–6008

  15. [15]

    Exploring transformer’s po- tential on automatic piano transcription,

    Longshen Ou, Ziyi Guo et al., “Exploring transformer’s po- tential on automatic piano transcription,” inIEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 776–780

  16. [16]

    Sequence-to-sequence piano transcription with transformers,

    Curtis Hawthorne, Ian Simon et al., “Sequence-to-sequence piano transcription with transformers,” inProceedings of the International Society for Music Information Retrieval Confer- ence (ISMIR), 2021, pp. 246–253

  17. [17]

    MT3: Multi-task multitrack music transcription,

    Josh Gardner, Ian Simon et al., “MT3: Multi-task multitrack music transcription,” inProceedings of International Confer- ence on Learning Representations (ICLR), 2022

  18. [18]

    Streaming piano transcrip- tion based on consistent onset and offset decoding with sus- tain pedal detection,

    Weixing Wei, Jiahao Zhao et al., “Streaming piano transcrip- tion based on consistent onset and offset decoding with sus- tain pedal detection,” inProceedings of the 25th International Society for Music Information Retrieval Conference, (ISMIR), 2024, pp. 906–913

  19. [19]

    Piano transcription by hierarchical language modeling with pretrained roll-based en- coders,

    Dichucheng Li, Yongyi Zang et al., “Piano transcription by hierarchical language modeling with pretrained roll-based en- coders,” in2025 IEEE International Conference on Acoustics, Speech and Signal Processing, (ICASSP). 2025, pp. 1–5, IEEE

  20. [20]

    Piano transcription with har- monic attention,

    Ruimin Wu, Xianke Wang et al., “Piano transcription with har- monic attention,” inIEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), 2024, pp. 1256– 1260

  21. [21]

    Harmonic-aware frequency and time attention for automatic piano transcription,

    Qi Wang, Mingkuan Liu et al., “Harmonic-aware frequency and time attention for automatic piano transcription,”IEEE ACM Transactions on Audio Speech and Language Processing (TASLP), vol. 32, pp. 3492–3506, 2024

  22. [22]

    Wasserstein gen- erative adversarial networks,

    Martin Arjovsky, Soumith Chintala et al., “Wasserstein gen- erative adversarial networks,” inInternational Conference on Machine Learning (ICML). PMLR, 2017, pp. 214–223

  23. [23]

    Sinkhorn distances: Lightspeed computation of optimal transport,

    Marco Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,”Advances in neural information process- ing systems (NIPS), vol. 26, 2013

  24. [24]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong et al., “Flow straight and fast: Learning to generate and transfer data with rectified flow,” arXiv preprint arXiv:2209.03003, 2022

  25. [25]

    Gromov- wasserstein alignment of word embedding spaces,

    David Alvarez-Melis and Tommi Jaakkola, “Gromov- wasserstein alignment of word embedding spaces,” inProceed- ings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018, pp. 1881–1890

  26. [26]

    V ocabulary learning via opti- mal transport for neural machine translation,

    Jingjing Xu, Hao Zhou et al., “V ocabulary learning via opti- mal transport for neural machine translation,” inProceedings of the 59th Annual Meeting of the Association for Computa- tional Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021, pp. 7361–7373

  27. [27]

    Transportation distances in music notation retrieval,

    F Wiering, R Typke et al., “Transportation distances in music notation retrieval,”Computing in Musicology, vol. 13, pp. 113– 128, 2004

  28. [28]

    Optimal spectral trans- portation with application to music transcription,

    R ´emi Flamary, C´edric F´evotte et al., “Optimal spectral trans- portation with application to music transcription,” inAdvances in Neural Information Processing Systems (NIPS), 2016, pp. 703–711

  29. [29]

    Unbalanced opti- mal transport through non-negative penalized linear regres- sion,

    Laetitia Chapel, R ´emi Flamary et al., “Unbalanced opti- mal transport through non-negative penalized linear regres- sion,”Advances in Neural Information Processing Systems (NeurIPS), vol. 34, pp. 23270–23282, 2021

  30. [30]

    Unbalanced optimal transport, from theory to numerics,

    Thibault S ´ejourn´e, Gabriel Peyr ´e et al., “Unbalanced optimal transport, from theory to numerics,”Handbook of Numerical Analysis, vol. 24, pp. 407–471, 2023

  31. [31]

    Calculation of a constant q spectral trans- form,

    Judith C Brown, “Calculation of a constant q spectral trans- form,”The Journal of the Acoustical Society of America, (JASA), vol. 89, no. 1, pp. 425–434, 1991

  32. [32]

    Enabling factorized piano music modeling and generation with the MAESTRO dataset,

    Curtis Hawthorne, Andriy Stasyuk et al., “Enabling factorized piano music modeling and generation with the MAESTRO dataset,” inProceedings of International Conference on Learn- ing Representations (ICLR), 2019, pp. 1–6

  33. [33]

    Adam: A method for stochastic optimization,

    Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” inProceedings of International Con- ference on Learning Representations (ICLR), 2015, pp. 1–9

  34. [34]

    Mir eval: A transparent im- plementation of common mir metrics,

    Colin Raffel, Brian McFee et al., “Mir eval: A transparent im- plementation of common mir metrics,” inProceedings of the International Society for Music Information Retrieval Confer- ence (ISMIR), 2014, pp. 1–6

  35. [35]

    Scoring time intervals using non- hierarchical transformer for automatic piano transcription,

    Yujia Yan and Zhiyao Duan, “Scoring time intervals using non- hierarchical transformer for automatic piano transcription,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2024, pp. 973–980