Adversarial Learning for Improved Onsets and Frames Music Transcription
Pith reviewed 2026-05-25 19:30 UTC · model grok-4.3
The pith
Adversarial training on time-frequency maps improves both frame-level and note-level accuracy over the Onsets and Frames baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce an adversarial training scheme that operates directly on the time-frequency representations and makes the output distribution closer to the ground-truth. Through adversarial learning, we achieve a consistent improvement in both frame-level and note-level metrics over Onsets and Frames, a state-of-the-art music transcription model. Our results show that adversarial learning can significantly reduce the error rate while increasing the confidence of the model estimations.
What carries the argument
An adversarial discriminator trained directly on the model's time-frequency output maps to enforce inter-label dependencies that element-wise losses cannot capture.
If this is right
- The combined loss produces lower error rates on standard transcription benchmarks than the baseline element-wise loss alone.
- Model predictions exhibit higher confidence scores because the discriminator penalizes unrealistic label configurations.
- The same adversarial scheme can be attached to any existing multi-label transcription architecture without changing its core network.
- Post-processing steps that currently correct independent-label errors may become less necessary once dependencies are enforced during training.
Where Pith is reading between the lines
- Similar adversarial terms could be added to other structured audio labeling tasks where element-wise losses currently ignore temporal or harmonic dependencies.
- The approach suggests that distribution-matching objectives may be more effective than independent per-bin losses for any MIR problem that outputs piano-roll style representations.
- If the discriminator learns to detect common transcription artifacts, the method might serve as an implicit regularizer that reduces the need for hand-crafted post-filters.
Load-bearing premise
Training a discriminator on the time-frequency output maps will capture and enforce inter-label dependencies without introducing new artifacts or mode collapse that degrade transcription quality.
What would settle it
Running the adversarial model on the same test sets used for Onsets and Frames and observing no improvement or a drop in both frame-level and note-level F1 scores would falsify the central claim.
read the original abstract
Automatic music transcription is considered to be one of the hardest problems in music information retrieval, yet recent deep learning approaches have achieved substantial improvements on transcription performance. These approaches commonly employ supervised learning models that predict various time-frequency representations, by minimizing element-wise losses such as the cross entropy function. However, applying the loss in this manner assumes conditional independence of each label given the input, and thus cannot accurately express inter-label dependencies. To address this issue, we introduce an adversarial training scheme that operates directly on the time-frequency representations and makes the output distribution closer to the ground-truth. Through adversarial learning, we achieve a consistent improvement in both frame-level and note-level metrics over Onsets and Frames, a state-of-the-art music transcription model. Our results show that adversarial learning can significantly reduce the error rate while increasing the confidence of the model estimations. Our approach is generic and applicable to any transcription model based on multi-label predictions, which are very common in music signal analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an adversarial training scheme that operates directly on time-frequency output maps of a music transcription model (specifically Onsets and Frames) to enforce that generated outputs match the distribution of ground-truth maps. This is motivated by the limitation of element-wise losses (e.g., cross-entropy) assuming conditional independence of labels. The central claim is that this yields consistent improvements in both frame-level and note-level metrics, reduces error rates, and increases model confidence; the method is presented as generic for any multi-label transcription model.
Significance. If the reported gains hold and are shown to arise from the discriminator capturing musically relevant inter-label dependencies (rather than incidental regularization or low-level statistics), the approach would provide a practical, architecture-agnostic way to address a known limitation of supervised multi-label models in MIR. The generic framing is a strength, but the lack of supporting numerical evidence, ablations, or mechanism analysis in the manuscript limits assessment of whether this is a substantive advance.
major comments (2)
- [Abstract] Abstract: the claim of 'consistent improvement in both frame-level and note-level metrics' and 'significantly reduce the error rate' is asserted without any numerical results, tables, ablation studies, training curves, or statistical significance tests. This is load-bearing for the central empirical claim and prevents verification of the result.
- [Method] Method and motivation sections: the paper states that the adversarial objective addresses inter-label dependencies that element-wise loss cannot capture, but provides no analysis (e.g., discriminator feature visualization, controlled ablations removing the adversarial term, or comparison of learned statistics) to show that the discriminator models harmonic/rhythmic structure rather than low-level spectral patterns or artifacts. This directly bears on whether the mechanism matches the stated motivation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate the corresponding revisions to the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of 'consistent improvement in both frame-level and note-level metrics' and 'significantly reduce the error rate' is asserted without any numerical results, tables, ablation studies, training curves, or statistical significance tests. This is load-bearing for the central empirical claim and prevents verification of the result.
Authors: We agree that the abstract would be strengthened by including specific numerical results. In the revised version we have updated the abstract to report the observed frame-level and note-level accuracy gains together with the error-rate reductions on the evaluation sets. revision: yes
-
Referee: [Method] Method and motivation sections: the paper states that the adversarial objective addresses inter-label dependencies that element-wise loss cannot capture, but provides no analysis (e.g., discriminator feature visualization, controlled ablations removing the adversarial term, or comparison of learned statistics) to show that the discriminator models harmonic/rhythmic structure rather than low-level spectral patterns or artifacts. This directly bears on whether the mechanism matches the stated motivation.
Authors: The referee is correct that direct evidence linking performance gains to the modeling of inter-label dependencies would better substantiate the motivation. We have added controlled ablations that isolate the adversarial term and visualizations of the discriminator features in the revised manuscript to address this point. revision: yes
Circularity Check
No circularity: empirical adversarial training compared to external baseline
full rationale
The paper describes an empirical method that augments a supervised transcription model with an adversarial discriminator operating on time-frequency output maps. The central claim is a consistent metric improvement over the external Onsets and Frames baseline (Hawthorne et al.). No equations, derivations, or first-principles results are presented that reduce to quantities defined by the authors themselves. There are no self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations of uniqueness theorems, or ansatzes smuggled via prior work. The work is framed as an experimental comparison against an independent external model on standard datasets, satisfying the condition for a self-contained result against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Adversarial Learning for Improved Onsets and Frames Music Transcription
INTRODUCTION Automatic music transcription (AMT) concerns automated methods for converting acoustic music signals into some form of musical notation [4]. AMT is a multifaceted prob- lem and comprises a number of subtasks, including multi- pitch estimation (MPE), note tracking, instrument recogni- tion, rhythm analysis, score typesetting, etc. MPE predicts...
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[2]
BACKGROUND 2.1 Automatic Transcription of Polyphonic Music Automatic transcription models for polyphonic music can be classified into frame- or note-level approaches. Frame- level transcription is synonymous with multi-pitch estima- tion (MPE) and operates on tiny temporal slices of au- dio, or frames, to predict all pitch values present in each frame. Not...
-
[3]
METHOD We describe a general method for improving an NN-based transcription model G that performs prediction of a two- dimensional target Y from an input audio representation X. Say the original model G is trained by minimiz- ing the lossLtask(G(X), Y) between the predicted target ˆY = G(X) and the ground-truth Y. The main idea of our method is to adapt p...
-
[4]
We also aim to evaluate the choices of the GAN loss and the mixup strengthα
EXPERIMENTAL SETUP To verify the effectiveness of our approach, we compare Onsets and Frames [17], a state-of-the-art piano transcrip- tion model, with variants of the same model that are trained with the adversarial loss. We also aim to evaluate the choices of the GAN loss and the mixup strengthα. 4.1 Model Architecture We use the extended Onsets and Fra...
-
[5]
RESULTS 5.1 Comparison with the Baseline Metrics Table 2 and 3 summarize the transcription performance, clearly showing a consistent improvement in the condi- tional GAN models over the Onsets and Frames baseline. Table 2 shows that both non-saturating GAN and least- squares GAN achieve the highest frame and note F1 scores when the mixup strength α = 0.3 ...
-
[6]
To achieve this, a discriminator network is trained competitively with the transcription model, i.e
CONCLUSIONS We have presented an adversarial training method that can consistently outperform the baseline Onsets and Frames model, using the standard frame-level and note-level tran- scription metrics and visualizations that show how the im- proved model predicts more confident output. To achieve this, a discriminator network is trained competitively with...
-
[7]
Unsuper- vised analysis of polyphonic music by sparse coding
Samer A Abdallah and Mark D Plumbley. Unsuper- vised analysis of polyphonic music by sparse coding. IEEE Transactions on Neural Networks , 17(1):179– 196, 2006
work page 2006
-
[8]
Emmanouil Benetos and Simon Dixon. Multiple- instrument polyphonic music transcription using a tem- porally constrained shift-invariant model. The Journal of the Acoustical Society of America , 133(3):1727– 1741, 2013
work page 2013
-
[9]
Automatic music transcription: An overview
Emmanouil Benetos, Simon Dixon, Zhiyao Duan, and Sebastian Ewert. Automatic music transcription: An overview. IEEE Signal Processing Magazine , 36(1):20–30, 2019
work page 2019
-
[10]
Auto- matic music transcription: challenges and future di- rections
Emmanouil Benetos, Simon Dixon, Dimitrios Gian- noulis, Holger Kirchhoff, and Anssi Klapuri. Auto- matic music transcription: challenges and future di- rections. Journal of Intelligent Information Systems , 41(3):407–434, 2013
work page 2013
-
[11]
Deep salience representations for f0 estimation in polyphonic music
Rachel M Bittner, Brian McFee, Justin Salamon, Peter Li, and Juan Pablo Bello. Deep salience representations for f0 estimation in polyphonic music. In Proceedings of the International Society for Music Information Re- trieval (ISMIR) Conference, pages 63–70, 2017
work page 2017
-
[12]
Polyphonic pi- ano note transcription with recurrent neural networks
Sebastian Böck and Markus Schedl. Polyphonic pi- ano note transcription with recurrent neural networks. In Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP) , pages 121–124, 2012
work page 2012
-
[13]
Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to poly- phonic music generation and transcription. In Pro- ceedings of the International Conference on Machine Learning (ICML), 2012
work page 2012
-
[14]
Generative adversarial networks: An overview
Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53–65, 2018
work page 2018
-
[15]
Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi- Hsuan Yang. Musegan: Multi-track sequential gener- ative adversarial networks for symbolic music genera- tion and accompaniment. In Thirty-Second AAAI Con- ference on Artificial Intelligence, 2018
work page 2018
-
[16]
Generating im- ages with perceptual similarity metrics based on deep networks
Alexey Dosovitskiy and Thomas Brox. Generating im- ages with perceptual similarity metrics based on deep networks. In Advances in Neural Information Process- ing Systems, pages 658–666, 2016
work page 2016
-
[17]
GANSynth: Adversarial Neural Audio Synthesis
Jesse Engel, Kumar Krishna Agrawal, Shuo Chen, Ishaan Gulrajani, Chris Donahue, and Adam Roberts. GANSynth: Adversarial neural audio synthesis. arXiv preprint arXiv:1902.08710, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1902
-
[18]
Piano transcrip- tion in the studio using an extensible alternating direc- tions framework
Sebastian Ewert and Mark Sandler. Piano transcrip- tion in the studio using an extensible alternating direc- tions framework. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(11):1983–1997, 2016
work page 1983
-
[19]
Algorithms for non- negative matrix factorization with the β-divergence
Cédric Févotte and Jérôme Idier. Algorithms for non- negative matrix factorization with the β-divergence. Neural computation, 23(9):2421–2456, 2011
work page 2011
-
[20]
Benoit Fuentes, Roland Badeau, and Gaël Richard. Harmonic adaptive latent component analysis of au- dio and application to music transcription.IEEE Trans- actions on Audio, Speech, and Language Processing , 21(9):1854–1866, 2013
work page 2013
-
[21]
NIPS 2016 Tutorial: Generative Adversarial Networks
Ian Goodfellow. NIPS 2016 tutorial: Generative ad- versarial networks. arXiv preprint arXiv:1701.00160 , 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[22]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversar- ial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014
work page 2014
-
[23]
Onsets and frames: Dual- objective piano transcription
Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, and Douglas Eck. Onsets and frames: Dual- objective piano transcription. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, pages 50–57, 2018
work page 2018
-
[24]
En- abling factorized piano music modeling and genera- tion with the MAESTRO dataset
Curtis Hawthorne, Andrew Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Diele- man, Erich Elsen, Jesse Engel, and Douglas Eck. En- abling factorized piano music modeling and genera- tion with the MAESTRO dataset. InProceedings of the International Conference on Learning Representations (ICLR), 2019
work page 2019
-
[25]
Gans trained by a two time-scale update rule converge to a local nash equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017
work page 2017
-
[26]
A fast learning algorithm for deep belief nets
Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006
work page 2006
-
[27]
Image-to-image translation with con- ditional adversarial networks
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with con- ditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1125–1134, 2017
work page 2017
-
[28]
A Style-Based Generator Architecture for Generative Adversarial Networks
Tero Karras, Samuli Laine, and Timo Aila. A style- based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[29]
On the potential of simple framewise approaches to piano transcription
Rainer Kelz, Matthias Dorfer, Filip Korzeniowski, Se- bastian Böck, Andreas Arzt, and Gerhard Widmer. On the potential of simple framewise approaches to piano transcription. In Proceedings of the International So- ciety for Music Information Retrieval (ISMIR) Confer- ence, pages 475–481, 2016
work page 2016
-
[30]
Neural music synthesis for flexi- ble timbre control
Jong Wook Kim, Rachel Bittner, Aparna Kumar, and Juan Pablo Bello. Neural music synthesis for flexi- ble timbre control. In Proceedings of the IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019
work page 2019
-
[31]
CREPE: A convolutional represen- tation for pitch estimation
Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A convolutional represen- tation for pitch estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 161–165, 2018
work page 2018
-
[32]
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the In- ternational Conference on Learning Representations, (ICLR), 2015
work page 2015
-
[33]
The neural autore- gressive distribution estimator
Hugo Larochelle and Iain Murray. The neural autore- gressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial In- telligence and Statistics, pages 29–37, 2011
work page 2011
-
[34]
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015
work page 2015
-
[35]
Algorithms for non-negative matrix factorization
Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. InAdvances in Neu- ral Information Processing Systems , pages 556–562, 2001
work page 2001
-
[36]
Fully convolutional networks for semantic segmenta- tion
Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmenta- tion. In Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition (CVPR) , pages 3431–3440, 2015
work page 2015
-
[37]
Least squares generative adversarial networks
Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision , pages 2794–2802, 2017
work page 2017
-
[38]
pYIN: A funda- mental frequency estimator using probabilistic thresh- old distributions
Matthias Mauch and Simon Dixon. pYIN: A funda- mental frequency estimator using probabilistic thresh- old distributions. In Proceedings of the IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 659–663. IEEE, 2014
work page 2014
-
[39]
Conditional Generative Adversarial Nets
Mehdi Mirza and Simon Osindero. Conditional gener- ative adversarial nets. arXiv preprint arXiv:1411.1784, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[40]
Juhan Nam, Jiquan Ngiam, Honglak Lee, and Mal- colm Slaney. A classification-based polyphonic piano transcription approach using learned feature represen- tations. In Proceedings of the 12th International Soci- ety for Music Information Retrieval (ISMIR) Confer- ence, pages 175–180, 2011
work page 2011
-
[41]
An end-to-end machine learning system for harmonic analysis of music
Yizhao Ni, Matt McVicar, Raul Santos-Rodriguez, and Tijl De Bie. An end-to-end machine learning system for harmonic analysis of music. IEEE Transactions on Audio, Speech, and Language Processing, 20(6):1771– 1783, 2012
work page 2012
-
[42]
A dis- criminative model for polyphonic piano transcription
Graham E Poliner and Daniel PW Ellis. A dis- criminative model for polyphonic piano transcription. EURASIP Journal on Advances in Signal Processing , 2007(1):048317, 2006
work page 2007
-
[43]
mir_eval: A transparent implemen- tation of common MIR metrics
Colin Raffel, Brian McFee, Eric J Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, Daniel PW Ellis, and C Colin Raffel. mir_eval: A transparent implemen- tation of common MIR metrics. In Proceedings of the International Society for Music Information Retrieval (ISMIR) Conference, 2014
work page 2014
-
[44]
Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions
Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages 4779–4783. IEEE, 2018
work page 2018
-
[45]
An end-to-end neural network for polyphonic piano music transcription
Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon. An end-to-end neural network for polyphonic piano music transcription. IEEE/ACM Transactions on Audio, Speech, and Language Processing , 24(5):927– 939, 2016
work page 2016
-
[46]
Non-negative matrix factorization for polyphonic music transcrip- tion
Paris Smaragdis and Judith C Brown. Non-negative matrix factorization for polyphonic music transcrip- tion. In 2003 IEEE Workshop on Applications of Sig- nal Processing to Audio and Acoustics, pages 177–180, 2003
work page 2003
-
[47]
Dropout: a simple way to prevent neural networks from over- fitting
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from over- fitting. The Journal of Machine Learning Research , 15(1):1929–1958, 2014
work page 1929
-
[48]
Condi- tional image generation with PixelCNN decoders
Aaron Van den Oord, Nal Kalchbrenner, Lasse Es- peholt, Oriol Vinyals, Alex Graves, et al. Condi- tional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems , pages 4790–4798, 2016
work page 2016
-
[49]
Pixel recurrent neural networks
Aäron Van Den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In Pro- ceedings of the International Conference on Machine Learning (ICML), pages 1747–1756, 2016
work page 2016
-
[50]
Adaptive harmonic spectral decomposition for mul- tiple pitch estimation
Emmanuel Vincent, Nancy Bertin, and Roland Badeau. Adaptive harmonic spectral decomposition for mul- tiple pitch estimation. IEEE Transactions on Audio, Speech, and Language Processing , 18(3):528–537, 2010
work page 2010
-
[51]
Midinet: A convolutional generative adversarial net- work for symbolic-domain music generation
Li-Chia Yang, Szu-Yu Chou, and Yi-Hsuan Yang. Midinet: A convolutional generative adversarial net- work for symbolic-domain music generation. In Pro- ceedings of the International Society for Music Infor- mation Retrieval (ISMIR) Conference, pages 324–331, 2017
work page 2017
-
[52]
Recurrent Neural Network Regularization
Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[53]
Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In Proceedings of the International Con- ference on Learning Representations (ICLR), 2018
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.