
arxiv: 2606.24912 · v1 · pith:W57XBJSHnew · submitted 2026-06-19 · 💻 cs.SD · cs.AI · eess.AS

Velocity Prediction in Automatic Guitar Transcription

Pith reviewed 2026-06-26 12:59 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · eess.AS
keywords automatic guitar transcription · velocity prediction · synthetic data · transfer learning · polyphonic transcription · note intensity · virtual instruments

The pith

Pretraining on synthetic guitar data lets a transcription model predict note velocity on real audio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to add velocity prediction to automatic guitar transcription despite the lack of intensity labels in real recordings. Virtual instruments generate synthetic audio with explicit velocity labels for pretraining. The resulting weights transfer to a second model trained on real guitar audio that has no velocity labels, preserving the velocity head while adapting to actual instrument sounds. This transferred model beats a non-pretrained baseline at velocity estimation when both are tested on synthetic data and yields modest gains in note detection on some real test sets. The combined system reaches transcription accuracy on par with existing guitar-specific models while outputting velocity values.

Core claim

A model is first trained on synthetic guitar data that includes velocity labels, and its weights are then transferred to a new network trained on real guitar audio that has no velocity labels. The transferred model predicts velocity more accurately than a baseline without pretraining when evaluated on synthetic test data. It also delivers small improvements to note transcription on some real test sets and maintains overall performance comparable to the state of the art.

What carries the argument

Weight transfer of a velocity prediction head pretrained on synthetic virtual-instrument data to a transcription network trained on real guitar audio.
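
A minimal sketch of what such a weight-transfer step could look like, assuming model parameters are stored as a flat name-to-values mapping, as is common in deep learning frameworks. The parameter names (`velocity_head.`, `onset_head.`) and the `transfer_weights` helper are hypothetical and chosen for illustration only; the paper's actual architecture, and exactly which layers it transfers, are not described on this page.

```python
from copy import deepcopy


def transfer_weights(pretrained, target, prefixes=("velocity_head.",)):
    """Copy parameters whose names start with one of `prefixes`.

    Both arguments map parameter names to lists of floats. Only parameters
    present in both models with the same length are copied; everything else
    in `target` keeps its own initialisation.
    """
    merged = deepcopy(target)
    copied = []
    for name, values in pretrained.items():
        if not name.startswith(tuple(prefixes)):
            continue
        if name in merged and len(merged[name]) == len(values):
            merged[name] = list(values)
            copied.append(name)
    return merged, copied


# Hypothetical parameters: the synthetic-data model has a trained velocity head.
synthetic_model = {
    "onset_head.weight": [0.9, -0.4],
    "velocity_head.weight": [0.3, 0.7],
    "velocity_head.bias": [0.1],
}
# Freshly initialised model that will be trained on real guitar audio.
real_model = {
    "onset_head.weight": [0.0, 0.0],
    "velocity_head.weight": [0.0, 0.0],
    "velocity_head.bias": [0.0],
}

real_model, copied = transfer_weights(synthetic_model, real_model)
print(sorted(copied))
```

Restricting the copy to a name prefix keeps the sketch agnostic about the rest of the network: the remaining parameters are left to be learned from the real recordings.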

If this is right

  • Velocity prediction becomes possible in guitar transcription without requiring velocity-labeled real recordings.
  • The transferred model outperforms a non-pretrained baseline at velocity estimation on synthetic evaluation data.
  • Note transcription accuracy receives a small boost on certain real test sets when pretrained velocity weights are used.
  • Overall transcription performance stays comparable to existing state-of-the-art guitar transcription systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pretrain-and-transfer pattern could be applied to other instruments that lack velocity annotations.
  • If the acoustic mismatch between virtual and real instruments is reduced, velocity prediction accuracy on real audio may increase.
  • Downstream tools such as performance feedback systems could directly use the velocity output for expressive analysis.

Load-bearing premise

Statistical patterns of velocity learned from virtual instrument sounds remain useful when the model processes real guitar recordings.

What would settle it

Both the transferred model and the baseline are evaluated on held-out synthetic guitar audio with known velocities; if the transferred model shows no lower velocity prediction error, the claim fails.
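
The comparison described above can be sketched as follows, assuming per-note velocity predictions have already been matched to reference notes. The velocity values are made-up toy numbers, not results from the paper, and the paired sign-flip permutation test is one reasonable choice of significance test rather than the one the authors used.

```python
import random
from statistics import mean


def abs_errors(predicted, reference):
    """Per-note absolute velocity error (MIDI units, 0-127)."""
    return [abs(p - r) for p, r in zip(predicted, reference)]


def paired_permutation_pvalue(errors_a, errors_b, n_resamples=10000, seed=0):
    """One-sided p-value for 'model A has lower mean error than model B'.

    Randomly flips the sign of each paired difference, which would leave
    the distribution unchanged if the two models were interchangeable.
    """
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(errors_a, errors_b)]
    observed = mean(diffs)
    hits = 0
    for _ in range(n_resamples):
        resampled = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if resampled >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


# Toy data (illustrative only): reference velocities and two models' outputs.
reference = [40, 64, 90, 100, 75, 55, 110, 30]
transferred = [44, 60, 85, 104, 80, 50, 105, 36]
baseline = [60, 50, 70, 80, 95, 75, 90, 55]

err_transferred = abs_errors(transferred, reference)
err_baseline = abs_errors(baseline, reference)
p_value = paired_permutation_pvalue(err_transferred, err_baseline)

print("MAE transferred:", mean(err_transferred))
print("MAE baseline:", mean(err_baseline))
print("p-value:", p_value)
```

Under this framing, the claim fails if the transferred model's mean error is not lower, or if the difference cannot be distinguished from chance.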

Figures

Figures reproduced from arXiv: 2606.24912 by Emmanouil Benetos, Jackson Loth, Simon Dixon, Xavier Riley.

Figure 1: Visualization of the training methodology.

Original abstract

Automatic Music Transcription (AMT) models have achieved a high level of success in polyphonic transcription of various instruments. Velocity, typically a measure of note intensity, is less commonly predicted in these models due to the absence of velocity labels in available datasets and lack of a proper definition for instruments other than piano. We present a methodology and model for velocity prediction in Automatic Guitar Transcription (AGT) which uses virtual instruments to generate synthetic training data with velocity labels. We first pretrain a model on this synthetic data. These weights are then transferred to a different model and trained on real guitar audio, allowing the model to retain the working velocity prediction while also achieving high performance and generalisability from the real training data. The velocity prediction is shown to outperform a baseline model which does not use the pretrained velocity weights, when evaluated on synthetic data. In addition, using the pretrained velocity weights offers a small improvement in note transcription, though the magnitude of this improvement is limited and not always significant depending on the testing data. Overall the model achieves results comparable to the state of the art in guitar transcription, while also successfully predicting velocity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes a method for velocity prediction in automatic guitar transcription. Synthetic data with velocity labels is generated from virtual instruments and used to pretrain a model, whose weights are then transferred to a model trained on real guitar audio that has no velocity labels. The authors claim that the resulting model outperforms a non-pretrained baseline on synthetic velocity prediction, yields a small (sometimes non-significant) gain in note transcription, and achieves results comparable to SOTA guitar transcription.

Significance. If the transfer of velocity prediction from synthetic to real guitar audio holds under direct testing, the work would address a notable gap in AMT for non-piano instruments by enabling velocity-aware models without requiring real velocity labels; the use of synthetic pretraining and transfer learning is a practical strength.

major comments (3)
  1. [Abstract] Abstract and evaluation description: outperformance on velocity is reported solely versus a non-pretrained baseline on synthetic test data, but no quantitative metrics, error bars, statistical tests, or definition of how velocity is computed or scored for guitar are provided.
  2. [Evaluation / Results] The central transfer claim (pretrained velocity weights remain functional after fine-tuning on real recordings) is supported only by indirect note-transcription metrics showing small and sometimes non-significant gains; no direct evaluation or proxy (loudness correlation, human ratings, or cross-instrument consistency) of velocity quality on real guitar audio is described.
  3. [Methodology / Discussion] The weakest assumption—that statistical properties of synthetic velocity labels transfer meaningfully to real guitar—is stated but not tested, leaving generalization from virtual-instrument data unverified.
minor comments (1)
  1. The manuscript would benefit from explicit statements of the velocity definition used for guitar, details on the precise model architectures, and any ablation studies isolating the contribution of the velocity pretraining.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our evaluation and assumptions. We address each major comment below, proposing revisions to strengthen the manuscript.

  1. Referee: [Abstract] Abstract and evaluation description: outperformance on velocity is reported solely versus a non-pretrained baseline on synthetic test data, but no quantitative metrics, error bars, statistical tests, or definition of how velocity is computed or scored for guitar are provided.

    Authors: We agree with this observation. The current abstract summarizes the results qualitatively. In the revised version, we will expand the abstract and evaluation description to include quantitative metrics (e.g., velocity prediction error), error bars from multiple runs, results of statistical tests, and a clear definition of how velocity is computed and scored for guitar, including the mapping from MIDI velocity values to synthesized audio intensity. revision: yes

  2. Referee: [Evaluation / Results] The central transfer claim (pretrained velocity weights remain functional after fine-tuning on real recordings) is supported only by indirect note-transcription metrics showing small and sometimes non-significant gains; no direct evaluation or proxy (loudness correlation, human ratings, or cross-instrument consistency) of velocity quality on real guitar audio is described.

    Authors: We acknowledge that our evaluation of the transferred velocity prediction relies on indirect evidence from note transcription improvements. Direct evaluation is challenging due to the absence of velocity annotations in real guitar datasets. We will revise the paper to explicitly state this limitation and discuss potential future proxies such as loudness correlations, while noting that the observed note transcription gains provide supporting evidence for retention of velocity information. revision: partial

  3. Referee: [Methodology / Discussion] The weakest assumption—that statistical properties of synthetic velocity labels transfer meaningfully to real guitar—is stated but not tested, leaving generalization from virtual-instrument data unverified.

    Authors: The assumption regarding the transfer of velocity statistics from synthetic to real data is indeed central and not directly tested in the current experiments. We will enhance the discussion section to better articulate the rationale behind this assumption with references to analogous transfer learning results in audio domains, and we will clearly label it as a limitation requiring further validation. revision: yes

standing simulated objections not resolved
  • Direct quantitative evaluation of velocity prediction quality on real guitar audio, as no ground-truth velocity labels exist for real recordings.
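
The loudness-correlation proxy mentioned in the report could be sketched as below. This is an illustration under stated assumptions, not the authors' evaluation: the note segments are synthetic sine bursts, the predicted velocities are hypothetical, and `rms_db`, `pearson`, and `sine_burst` are helper names introduced here. Loudness also depends on pitch, playing technique, and recording gain, so a high correlation would be supporting evidence only, not a substitute for ground-truth labels.

```python
import math


def rms_db(samples, floor=1e-10):
    """RMS level of one note segment in dB relative to full scale."""
    power = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(power, floor))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def sine_burst(amplitude, n=256, cycles=8):
    """Toy stand-in for the audio of a single note."""
    return [amplitude * math.sin(2 * math.pi * cycles * i / n) for i in range(n)]


# Toy note segments at increasing amplitude (illustrative only).
segments = [sine_burst(a) for a in (0.05, 0.1, 0.2, 0.4, 0.8)]
predicted_velocity = [30, 48, 61, 85, 110]  # hypothetical model outputs

loudness = [rms_db(seg) for seg in segments]
r = pearson(predicted_velocity, loudness)
print("correlation between predicted velocity and loudness (dB):", round(r, 3))
```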

Circularity Check

0 steps flagged

No circularity: standard transfer learning with independent empirical evaluation


The paper describes pretraining a velocity prediction model on synthetic guitar data generated by virtual instruments, then transferring those weights to a separate model fine-tuned on real guitar recordings without velocity labels. Performance is assessed via direct comparison to a non-pretrained baseline on held-out synthetic data and via note transcription metrics on real data. No equations, fitted parameters, or self-citations are presented that reduce the claimed improvements to inputs by construction; the central claims rest on standard supervised pretraining plus transfer, which remain falsifiable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that virtual-instrument velocity labels are sufficiently realistic to transfer to real guitar audio; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Velocity labels generated by virtual guitar instruments capture transferable acoustic properties of real guitar playing intensity.
    Invoked to justify pretraining on synthetic data before real-data fine-tuning.

pith-pipeline@v0.9.1-grok · 5728 in / 1236 out tokens · 24459 ms · 2026-06-26T12:59:03.712620+00:00 · methodology

