
arxiv: 1906.09165 · v1 · pith:QSRV33IMnew · submitted 2019-06-21 · 💻 cs.SD · eess.AS

Deep Polyphonic ADSR Piano Note Transcription

Pith reviewed 2026-05-25 18:15 UTC · model grok-4.3

classification 💻 cs.SD eess.AS
keywords piano transcription · polyphonic music · ADSR envelope · Hidden Markov Model · deep learning · note segmentation · MAPS dataset · late fusion

The pith

A compact network with late fusion to an ADSR-derived HMM produces state-of-the-art piano note segments on the MAPS dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper combines a small neural network that processes piano audio frames with a handcrafted Hidden Markov Model whose transitions encode attack, decay, sustain and release stages. Network outputs are fused across time using this model to form note hypotheses, which are then accepted or rejected by a simple threshold. The resulting system reaches state-of-the-art accuracy on the MAPS dataset and shows a large gain over earlier methods specifically when the task requires correct onsets and offsets together. The architecture remains compact and trains directly with gradient descent because the temporal structure is supplied by the fixed HMM rather than learned from data.

Core claim

Late fusion of per-frame network predictions with an HMM whose transition probabilities are chosen from an ADSR envelope model, followed by a final binary decision, yields accurate polyphonic note segmentations that outperform prior approaches by a large margin on the MAPS dataset when complete note regions from onset to offset are evaluated.

What carries the argument

Late-fusion stage that combines network outputs over time with a handcrafted HMM whose transition probabilities are set from an ADSR envelope model.
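
As a rough illustration of what such a late-fusion stage might look like for a single pitch, the sketch below runs Viterbi decoding over per-frame scores with a fixed left-to-right transition matrix. The state set, the transition values in `TRANS`, and the helper `one_hot_like` are assumptions made for this example; they are not taken from the paper.

```python
import math

# Hypothetical state set for one pitch; the paper's exact states may differ.
STATES = ["silence", "attack", "decay", "sustain", "release"]

# Illustrative left-to-right transition probabilities (each row sums to 1).
# These numbers are placeholders, not the values used in the paper.
TRANS = [
    [0.9, 0.1, 0.0, 0.0, 0.0],  # silence -> silence or attack
    [0.0, 0.5, 0.5, 0.0, 0.0],  # attack  -> attack or decay
    [0.0, 0.0, 0.5, 0.5, 0.0],  # decay   -> decay or sustain
    [0.0, 0.0, 0.0, 0.9, 0.1],  # sustain -> sustain or release
    [0.5, 0.0, 0.0, 0.0, 0.5],  # release -> silence or release
]


def _log(p):
    """Logarithm that maps zero probability to minus infinity."""
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi(obs, trans=TRANS, start=0):
    """Return the most likely state path; obs[t][s] scores state s at frame t."""
    n = len(trans)
    # Force the path to begin in the start state (silence by default).
    delta = [_log(obs[0][s]) if s == start else float("-inf") for s in range(n)]
    backptrs = []
    for frame in obs[1:]:
        new_delta, ptr = [], []
        for j in range(n):
            scores = [delta[i] + _log(trans[i][j]) for i in range(n)]
            best = max(range(n), key=scores.__getitem__)
            new_delta.append(scores[best] + _log(frame[j]))
            ptr.append(best)
        delta = new_delta
        backptrs.append(ptr)
    # Backtrack from the best final state.
    state = max(range(n), key=delta.__getitem__)
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return path[::-1]


def one_hot_like(state, hit=0.8, miss=0.05):
    """Toy stand-in for network output: high score for one state, low elsewhere."""
    return [hit if s == state else miss for s in range(len(STATES))]


truth = [0, 0, 1, 2, 3, 3, 3, 4, 0, 0]
observations = [one_hot_like(s) for s in truth]
decoded = viterbi(observations)
print([STATES[s] for s in decoded])
```

Because the zero entries in `TRANS` forbid skipping a phase, a decoded note always passes through attack, decay, sustain and release in order, which is the kind of temporal prior the page describes.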

If this is right

  • Note offsets are predicted more reliably because the HMM explicitly encodes release behaviour.
  • Small networks become competitive when supplied with an explicit temporal prior instead of learning dynamics from data.
  • Polyphonic overlaps are handled by the fusion stage rather than by the network alone.
  • A final threshold can reject weak hypotheses without additional learned components.
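
The last point can be made concrete with a small sketch. The page does not give the decision rule or its threshold value, so the criterion below (mean activation over the segment) and the value 0.5 are assumptions chosen only to illustrate the idea; `reject_weak_segments` is a hypothetical helper name.

```python
def reject_weak_segments(segments, activations, threshold=0.5):
    """Keep (onset, offset) frame segments whose mean activation reaches threshold.

    The mean-activation criterion is an assumption for illustration;
    offsets are exclusive frame indices.
    """
    kept = []
    for onset, offset in segments:
        frames = activations[onset:offset]
        if frames and sum(frames) / len(frames) >= threshold:
            kept.append((onset, offset))
    return kept


# Toy per-frame activations for one pitch and two candidate segments.
activations = [0.1, 0.9, 0.8, 0.7, 0.1, 0.3, 0.2, 0.1]
candidates = [(1, 4), (5, 7)]
accepted = reject_weak_segments(candidates, activations)
print(accepted)
```

In this toy input the first candidate has a mean activation of about 0.8 and is kept, while the second averages 0.25 and is dropped.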

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same ADSR prior could be tested on other percussive or sustained instruments that share similar amplitude envelopes.
  • Replacing the handcrafted transitions with parameters learned from data would reveal how much the fixed ADSR model contributes.
  • The low parameter count suggests the method could run in real time on modest hardware once the HMM is integrated into an online decoder.

Load-bearing premise

The fixed transition probabilities taken from the ADSR model supply a temporal prior that is accurate enough for real piano performances without any learned sequence dynamics.

What would settle it

An experiment on the MAPS test set in which another transcription system records a higher F1 score for complete note regions would falsify the claimed superiority.
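
To make the criterion concrete, here is a simplified sketch of a note-level F1 score in which a note counts as correct only if pitch, onset and offset all match. The tolerance values (50 ms onset, offset within the larger of 50 ms or 20% of the reference duration) are common choices assumed here rather than values confirmed by this page, and the greedy matching is a simplification of the optimal matching a full evaluation toolkit would use.

```python
def note_f1(reference, estimated, onset_tol=0.05, offset_ratio=0.2,
            offset_min_tol=0.05):
    """Precision, recall and F1 for notes given as (onset_s, offset_s, midi_pitch).

    Greedy one-to-one matching; tolerances are assumptions for illustration.
    """
    unmatched = list(estimated)
    hits = 0
    for r_on, r_off, r_pitch in reference:
        off_tol = max(offset_min_tol, offset_ratio * (r_off - r_on))
        for cand in unmatched:
            e_on, e_off, e_pitch = cand
            if (e_pitch == r_pitch
                    and abs(e_on - r_on) <= onset_tol
                    and abs(e_off - r_off) <= off_tol):
                unmatched.remove(cand)  # each estimate may match only once
                hits += 1
                break
    precision = hits / len(estimated) if estimated else 0.0
    recall = hits / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


reference = [(0.00, 1.00, 60), (0.50, 1.50, 64), (2.00, 2.50, 67)]
estimated = [(0.02, 1.10, 60), (0.50, 1.00, 64), (2.01, 2.52, 67), (3.00, 3.20, 72)]
precision, recall, f1 = note_f1(reference, estimated)
print(precision, recall, f1)
```

The second estimated note has the right pitch and onset but ends half a second early, so it counts as an error under this stricter metric even though an onset-only score would accept it.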

Original abstract

We investigate a late-fusion approach to piano transcription, combined with a strong temporal prior in the form of a handcrafted Hidden Markov Model (HMM). The network architecture under consideration is compact in terms of its number of parameters and easy to train with gradient descent. The network outputs are fused over time in the final stage to obtain note segmentations, with an HMM whose transition probabilities are chosen based on a model of attack, decay, sustain, release (ADSR) envelopes, commonly used for sound synthesis. The note segments are then subject to a final binary decision rule to reject too weak note segment hypotheses. We obtain state-of-the-art results on the MAPS dataset, and are able to outperform other approaches by a large margin, when predicting complete note regions from onsets to offsets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a late-fusion piano transcription pipeline in which a compact neural network produces frame-level activations that are then decoded into note segments via a handcrafted HMM whose transition matrix is derived from a generic ADSR envelope model; a final binary threshold rejects weak hypotheses. It claims state-of-the-art performance on the MAPS dataset for full note-region transcription (onsets to offsets), outperforming prior methods by a large margin.

Significance. If the empirical claims hold after proper validation, the work would illustrate that a small, easily trained network plus a domain-derived temporal prior can achieve strong segmentation without learned sequence modeling, offering a lightweight alternative to end-to-end recurrent or attention-based transcription systems.

major comments (2)
  1. [Experiments] Experiments section: the central SOTA and 'large margin' claims are stated without any reported metrics, baselines, dataset splits, error bars, or ablation tables; this absence makes it impossible to verify the contribution of the HMM or to assess whether the network outputs plus simple thresholding already account for most of the reported quality.
  2. [Method] Method (HMM construction): transition probabilities are handcrafted from a fixed ADSR model rather than estimated from MAPS training data or compared against uniform, learned, or data-driven alternatives; because the performance gain is explicitly attributed to this temporal prior, the lack of an ablation directly undermines the load-bearing assumption that the chosen probabilities generalize to real performances with pedaling and expressive timing.
minor comments (2)
  1. [Method] Notation for the final binary decision rule is introduced without an explicit equation or threshold value, making the complete pipeline hard to reproduce from the text alone.
  2. [Abstract] The abstract asserts quantitative superiority but supplies none of the supporting numbers; moving at least the headline F1 or precision-recall figures into the abstract would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the experimental claims require fuller quantitative support and will revise the manuscript to include the requested metrics, baselines, splits, and ablations.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the central SOTA and 'large margin' claims are stated without any reported metrics, baselines, dataset splits, error bars, or ablation tables; this absence makes it impossible to verify the contribution of the HMM or to assess whether the network outputs plus simple thresholding already account for most of the reported quality.

    Authors: We acknowledge the omission of explicit quantitative results in the experiments section of the current manuscript. In the revision we will add a results table reporting note-level and frame-level F1 scores on the standard MAPS train/test splits, direct numerical comparisons against published baselines, standard deviations across runs where applicable, and an ablation isolating the HMM decoder versus the network outputs with simple thresholding. This will make the SOTA claims and the HMM contribution verifiable. revision: yes

  2. Referee: [Method] Method (HMM construction): transition probabilities are handcrafted from a fixed ADSR model rather than estimated from MAPS training data or compared against uniform, learned, or data-driven alternatives; because the performance gain is explicitly attributed to this temporal prior, the lack of an ablation directly undermines the load-bearing assumption that the chosen probabilities generalize to real performances with pedaling and expressive timing.

    Authors: The ADSR-derived transitions are deliberately handcrafted to encode a domain-specific acoustic prior rather than learned from the same data used to train the network. We agree an ablation is needed to quantify its benefit. The revised manuscript will include comparisons of the fixed ADSR matrix against both uniform transition probabilities and transition probabilities estimated directly from MAPS training data, together with a brief analysis of performance on pedal-active excerpts to address generalization to expressive timing. revision: yes
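
The two alternative priors named in this response could be built along the following lines. This is a minimal sketch, assuming frame-level state labels are available for training recordings; the function names and the additive-smoothing choice are hypothetical and not drawn from the paper or the rebuttal.

```python
def estimate_transitions(sequences, n_states, smoothing=1.0):
    """Estimate a row-stochastic transition matrix by counting state bigrams.

    sequences: lists of integer state labels, one list per recording and pitch.
    smoothing: additive count per cell; with 0.0, a state that never occurs
    as a source would cause a division by zero.
    """
    counts = [[smoothing] * n_states for _ in range(n_states)]
    for seq in sequences:
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1.0
    return [[c / sum(row) for c in row] for row in counts]


def uniform_transitions(n_states):
    """Baseline prior in which every transition is equally likely."""
    return [[1.0 / n_states] * n_states for _ in range(n_states)]


# Toy labels with two states (0 = off, 1 = on), no smoothing for readability.
labels = [[0, 0, 1, 1, 0], [0, 1, 1, 1, 0]]
learned = estimate_transitions(labels, n_states=2, smoothing=0.0)
uniform = uniform_transitions(2)
print(learned, uniform)
```

Swapping either matrix in for the handcrafted one while keeping the decoder fixed would isolate how much of the result is owed to the ADSR-derived values.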

Circularity Check

0 steps flagged

No significant circularity; empirical results rest on external benchmarks

Full rationale

The paper's central claim is an empirical SOTA result on the MAPS dataset obtained by late fusion of network outputs with a handcrafted HMM whose transition probabilities are taken from the standard ADSR envelope model used in sound synthesis. No equations, parameters, or predictions are shown to reduce to fitted quantities or self-definitions by construction; the HMM prior is fixed and external rather than estimated from the evaluation data or the network itself. The performance comparison is externally falsifiable on a public benchmark and does not rely on self-citation chains or uniqueness theorems imported from the authors' prior work. This is the normal case of a self-contained empirical pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central performance claim rests on the domain assumption that ADSR envelopes supply useful transition probabilities for piano notes and that late fusion plus a final binary decision suffices to produce accurate segments.

axioms (1)
  • domain assumption: the ADSR envelope model supplies appropriate transition probabilities for a piano note HMM.
    Explicitly stated as the basis for choosing HMM transitions in the abstract.



Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 3 internal anchors

  1. [1]

    For each note sounding in the recording, we would like to obtain a tuple (s, e, n, v), denoting start, end, MIDI note number and optionally volume

    INTRODUCTION. Polyphonic transcription is the task of extracting a symbolic score from an audio recording, regardless of how many instruments or notes are playing concurrently. For each note sounding in the recording, we would like to obtain a tuple (s, e, n, v), denoting start, end, MIDI note number and optionally volume. We tackle a somewhat easier sub...

  2. [2]

    RELATION TO PREVIOUS WORK. It could be shown that modelling different note phases in time with different neural network outputs can be advantageous [2, 4, 5, 8]. The piano transcription approach in [4] uses two separate, bi-directional long-short term recurrent neural networks (BLSTMs) to train a pitched onset detector together with a framewise pitch de...

  3. [3]

    Gaussian Dropout

    MODELS. When predicting multiple targets simultaneously with neural networks, one can consider two ends of a spectrum. One could either branch out immediately after the input layer, and thus have a separate network for each target, or one could branch out immediately before the output layers and have a shared network for all targets. We opt to use a mode...

  4. [4]

    Configuration II

    EXPERIMENTS. We use the MAPS dataset [24] to train and select models. The dataset contains 210 recordings of classical piano music, rendered using 7 samplebank-based synthesizers. Additionally, there are two sets of recordings of a reproducing Disklavier piano: 30 recordings from a microphone in close proximity to the piano, and 30 recordings from a micr...

  5. [5]

    CONCLUSION AND FUTURE WORK. We have shown that simple, small convolutional neural networks with multiple outputs for different temporal phases of a note, together with sequential probabilistic models can achieve state-of-the-art results on a widely used piano transcription dataset. Some potential improvements for the future include: a global model for ...

  6. [6]

    BASIS, Basisprogramm

    ACKNOWLEDGMENTS. This work is supported by the European Research Council via ERC Grant Agreement 670035, project CON ESPRESSIONE and the Austrian Promotion Agency (FFG) under the “BASIS, Basisprogramm” umbrella program. The Tesla K40 used for this research was donated by the NVIDIA Corporation

  7. [7]

    Polyphonic Piano Note Transcription with Recurrent Neural Networks,

    Sebastian Böck and Markus Schedl, “Polyphonic Piano Note Transcription with Recurrent Neural Networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, Kyoto, Japan, March 25-30, 2012, pp. 121–124

  8. [8]

    Polyphonic Pitch Tracking with Deep Layered Learning

    Anders Elowsson, “Polyphonic Pitch Tracking with Deep Layered Learning,” CoRR, vol. abs/1804.02918, 2018

  9. [9]

    Polyphonic Pitch Detection with Convolutional Recurrent Neural Networks,

    Carl Thomé and Sven Ahlbäck, “Polyphonic Pitch Detection with Convolutional Recurrent Neural Networks,” in MIREX 2017 abstracts, 2017

  10. [10]

    Onsets and Frames: Dual-Objective Piano Transcription,

    Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, and Douglas Eck, “Onsets and Frames: Dual-Objective Piano Transcription,” in Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018. Published in IEEE Internati...

  11. [11]

    A Parallel Fusion Approach to Piano Music Transcription based on Convolutional Neural Network,

    Fu’ze Cong, Shuchang Liu, Li Guo, and Geraint A. Wiggins, “A Parallel Fusion Approach to Piano Music Transcription based on Convolutional Neural Network,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, Calgary, AL, Canada, April 15-20, 2018

  12. [12]

    An End-to-End Neural Network for Polyphonic Piano Music Transcription,

    Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon, “An End-to-End Neural Network for Polyphonic Piano Music Transcription,” IEEE/ACM Trans. Audio, Speech & Language Processing, vol. 24, no. 5, pp. 927–939, 2016

  13. [13]

    On the Potential of Simple Framewise Approaches to Piano Transcription,

    Rainer Kelz, Matthias Dorfer, Filip Korzeniowski, Sebastian Böck, Andreas Arzt, and Gerhard Widmer, “On the Potential of Simple Framewise Approaches to Piano Transcription,” in Proceedings of the 17th International Society for Music Information Retrieval Conference, ISMIR 2016, New York City, United States, August 7-11, 2016, pp. 475–481

  14. [14]

    A Dozen Tricks with Multitask Learning,

    Rich Caruana, “A Dozen Tricks with Multitask Learning,” in Neural Networks: Tricks of the Trade - Second Edition, pp. 163–189. 2012

  15. [15]

    Automatic Transcription of Piano Music based on HMM Tracking of jointly-estimated Pitches,

    Valentin Emiya, Roland Badeau, and Bertrand David, “Automatic Transcription of Piano Music based on HMM Tracking of jointly-estimated Pitches,” in 2008 16th European Signal Processing Conference, EUSIPCO 2008, Lausanne, Switzerland, August 25-29, 2008, pp. 1–5

  16. [16]

    Multipitch Estimation of Piano Music by Exemplar-Based Sparse Representation,

    Cheng-Te Lee, Yi-Hsuan Yang, and Homer H. Chen, “Multipitch Estimation of Piano Music by Exemplar-Based Sparse Representation,” IEEE Trans. Multimedia, vol. 14, no. 3-1, pp. 608–618, 2012

  17. [17]

    Music Transcription with ISA and HMM,

    Emmanuel Vincent and Xavier Rodet, “Music Transcription with ISA and HMM,” in Independent Component Analysis and Blind Signal Separation, Fifth International Conference, ICA Granada, Spain, September 22-24, Proceedings, 2004, pp. 1197–1204

  18. [18]

    Improving Note Segmentation in Automatic Piano Music Transcription Systems with a Two-State Pitch-Wise HMM Method,

    Dorian Cazau, Yuancheng Wang, Olivier Adam, Qiao Wang, and Grégory Nuel, “Improving Note Segmentation in Automatic Piano Music Transcription Systems with a Two-State Pitch-Wise HMM Method,” in Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017, Suzhou, China, October 23-27, 2017, pp. 523–530

  19. [19]

    Factorial Scaled Hidden Markov Model for Polyphonic Audio Representation and Source Separation,

    Alexey Ozerov, Cédric Févotte, and Maurice Charbit, “Factorial Scaled Hidden Markov Model for Polyphonic Audio Representation and Source Separation,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA ’09, New Paltz, NY, USA, October 18-21, 2009, pp. 121–124

  20. [20]

    Explicit Duration Hidden Markov Models for Multiple-Instrument Polyphonic Music Transcription,

    Emmanouil Benetos and Tillman Weyde, “Explicit Duration Hidden Markov Models for Multiple-Instrument Polyphonic Music Transcription,” in Proceedings of the 14th International Society for Music Information Retrieval Conference, ISMIR 2013, Curitiba, Brazil, November 4-8, 2013, pp. 269–274

  21. [21]

    An Efficient Temporally-Constrained Probabilistic Model for Multiple-Instrument Music Transcription,

    Emmanouil Benetos and Tillman Weyde, “An Efficient Temporally-Constrained Probabilistic Model for Multiple-Instrument Music Transcription,” in Proceedings of the 16th International Society for Music Information Retrieval Conference, ISMIR 2015, Málaga, Spain, October 26-30, 2015, pp. 701–707

  22. [22]

    Polyphonic music transcription using note event modeling,

    M. P. Ryynanen and A. Klapuri, “Polyphonic music transcription using note event modeling,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct 2005, pp. 319–322

  23. [23]

    A discriminative model for polyphonic piano transcription,

    Graham E. Poliner and Daniel P. W. Ellis, “A discriminative model for polyphonic piano transcription,” EURASIP J. Adv. Sig. Proc., vol. 2007

  24. [24]

    Bilevel Sparse Models for Polyphonic Music Transcription,

    Tal Ben Yakar, Roee Litman, Pablo Sprechmann, Alexander M. Bronstein, and Guillermo Sapiro, “Bilevel Sparse Models for Polyphonic Music Transcription,” in Proceedings of the 14th International Society for Music Information Retrieval Conference, ISMIR, Curitiba, Brazil, November 4-8, 2013, pp. 65–70

  25. [25]

    Non-negative Hidden Markov Modeling of Audio with Application to Source Separation,

    Gautham J. Mysore, Paris Smaragdis, and Bhiksha Raj, “Non-negative Hidden Markov Modeling of Audio with Application to Source Separation,” in Latent Variable Analysis and Signal Separation - 9th International Conference, LVA/ICA 2010, St. Malo, France, September 27-30, 2010. Proceedings, 2010, pp. 140–148

  26. [26]

    Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

    Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs),” CoRR, vol. abs/1511.07289, 2015

  27. [27]

    Dropout: A simple Way to Prevent Neural Networks from Overfitting,

    Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: A simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014

  28. [28]

    madmom: a new Python Audio and Music Signal Processing Library,

    Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer, “madmom: a new Python Audio and Music Signal Processing Library,” in Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 10 2016, pp. 1174–1178

  29. [29]

    Experimenting with Musically Motivated Convolutional Neural Networks,

    Jordi Pons, Thomas Lidy, and Xavier Serra, “Experimenting with Musically Motivated Convolutional Neural Networks,” in 14th International Workshop on Content-Based Multimedia Indexing (CBMI). IEEE, 2016, pp. 1–6

  30. [30]

    Multipitch Estimation of Piano Sounds Using a New Probabilistic Spectral Smoothness Principle,

    Valentin Emiya, Roland Badeau, and Bertrand David, “Multipitch Estimation of Piano Sounds Using a New Probabilistic Spectral Smoothness Principle,” IEEE Trans. Audio, Speech & Language Processing, vol. 18, no. 6, pp. 1643–1654, 2010

  31. [31]

    mir_eval: A Transparent Implementation of Common MIR Metrics,

    Colin Raffel, Brian McFee, Eric J. Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel P. W. Ellis, “mir_eval: A Transparent Implementation of Common MIR Metrics,” in Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR 2014, Taipei, Taiwan, October 27-31, 2014, pp. 367–372.