Velocity Prediction in Automatic Guitar Transcription
Pith reviewed 2026-06-26 12:59 UTC · model grok-4.3
The pith
Pretraining on synthetic guitar data lets a transcription model predict note velocity on real audio.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A model is first trained on synthetic guitar data that includes velocity labels, and its weights are then transferred to a new network trained on real guitar audio that lacks velocity labels. The transferred model predicts velocity more accurately than a baseline without pretraining when evaluated on synthetic test data, delivers small improvements to note transcription on some real test sets, and maintains overall performance comparable to the state of the art.
What carries the argument
Weight transfer of a velocity prediction head pretrained on synthetic virtual-instrument data to a transcription network trained on real guitar audio.
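The transfer step can be pictured with a minimal sketch. It assumes parameters are stored under dotted names, as in most deep learning frameworks, and uses plain dictionaries in place of real model state. The parameter names (`velocity_head.*`, `encoder.*`, `onset_head.*`) and the helper `transfer_weights` are hypothetical and do not come from the paper, which may transfer more than the velocity head.

```python
def transfer_weights(pretrained, target, prefixes=("velocity_head.",)):
    """Return a copy of `target` in which every parameter whose name starts
    with one of `prefixes` is replaced by the value from `pretrained`.

    Both arguments map parameter names to flat lists of floats
    (a stand-in for a framework state dict; names are hypothetical).
    """
    merged = {name: list(values) for name, values in target.items()}
    copied = []
    for name, values in pretrained.items():
        if name.startswith(tuple(prefixes)) and name in merged:
            if len(values) != len(merged[name]):
                raise ValueError(f"shape mismatch for {name}")
            merged[name] = list(values)
            copied.append(name)
    return merged, copied


# Model pretrained on synthetic audio with velocity labels (toy values).
pretrained = {
    "encoder.weight": [0.1, 0.2],
    "velocity_head.weight": [0.7, -0.3],
    "velocity_head.bias": [0.05],
}
# Freshly initialised model that will be trained on real guitar audio.
target = {
    "encoder.weight": [0.0, 0.0],
    "onset_head.weight": [0.0, 0.0],
    "velocity_head.weight": [0.0, 0.0],
    "velocity_head.bias": [0.0],
}

merged, copied = transfer_weights(pretrained, target)
print(sorted(copied))
```

Only the velocity parameters are overwritten here; the encoder and onset head keep their own initialisation. Whether the transferred weights are then frozen or fine-tuned is a separate design choice that this sketch leaves open.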
If this is right
- Velocity prediction becomes possible in guitar transcription without requiring velocity-labeled real recordings.
- The transferred model outperforms a non-pretrained baseline at velocity estimation on synthetic evaluation data.
- Note transcription accuracy receives a small boost on certain real test sets when pretrained velocity weights are used.
- Overall transcription performance stays comparable to existing state-of-the-art guitar transcription systems.
Where Pith is reading between the lines
- The same pretrain-and-transfer pattern could be applied to other instruments that lack velocity annotations.
- If the acoustic mismatch between virtual and real instruments is reduced, velocity prediction accuracy on real audio may increase.
- Downstream tools such as performance feedback systems could directly use the velocity output for expressive analysis.
Load-bearing premise
Statistical patterns of velocity learned from virtual instrument sounds remain useful when the model processes real guitar recordings.
What would settle it
Both the transferred model and the baseline are evaluated on held-out synthetic guitar audio with known velocities; if the transferred model shows no lower velocity prediction error, the claim fails.
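As a rough illustration of that test, the sketch below compares mean absolute velocity error for two models on the same synthetic notes. It assumes estimated notes have already been matched to reference notes (for example by onset and pitch), and all velocity values are invented for illustration. The error metric is an assumption; the abstract does not state how velocity is scored.

```python
def mean_abs_velocity_error(reference, estimated):
    """Mean absolute error, in MIDI velocity units, over paired notes."""
    if len(reference) != len(estimated) or not reference:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(r - e) for r, e in zip(reference, estimated)) / len(reference)


# Ground-truth velocities of matched notes in a synthetic test clip (made up).
reference = [64, 80, 100, 45, 110]
# Hypothetical predictions from the two models being compared.
transferred = [60, 85, 96, 50, 104]
baseline = [70, 70, 70, 70, 70]

err_transferred = mean_abs_velocity_error(reference, transferred)
err_baseline = mean_abs_velocity_error(reference, baseline)

# The claim survives this check only if the transferred model's error is lower.
claim_survives = err_transferred < err_baseline
print(err_transferred, err_baseline, claim_survives)
```

A single comparison like this says nothing about variance across runs, which is why the referee report asks for error bars and a significance test.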
Original abstract
Automatic Music Transcription (AMT) models have achieved a high level of success in polyphonic transcription of various instruments. Velocity, typically a measure of note intensity, is less commonly predicted in these models due to the absence of velocity labels in available datasets and lack of a proper definition for instruments other than piano. We present a methodology and model for velocity prediction in Automatic Guitar Transcription (AGT) which uses virtual instruments to generate synthetic training data with velocity labels. We first pretrain a model on this synthetic data. These weights are then transferred to a different model and trained on real guitar audio, allowing the model to retain the working velocity prediction while also achieving high performance and generalisability from the real training data. The velocity prediction is shown to outperform a baseline model which does not use the pretrained velocity weights, when evaluated on synthetic data. In addition, using the pretrained velocity weights offers a small improvement in note transcription, though the magnitude of this improvement is limited and not always significant depending on the testing data. Overall the model achieves results comparable to the state of the art in guitar transcription, while also successfully predicting velocity.
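To make the idea of synthetic training data with velocity labels concrete, here is a toy sketch. It replaces the virtual instrument with a decaying sine tone and assumes a linear mapping from MIDI velocity to amplitude. Both are simplifying assumptions for illustration, not the paper's rendering pipeline, and the names `render_note` and `make_example` are hypothetical.

```python
import math
import random

SAMPLE_RATE = 16000  # Hz, chosen arbitrarily for the sketch


def render_note(pitch, velocity, duration=0.25):
    """Render one note as a decaying sine; amplitude scales with velocity.

    A real virtual instrument would also change timbre with velocity;
    the linear gain used here is an assumed simplification.
    """
    freq = 440.0 * 2 ** ((pitch - 69) / 12)  # MIDI pitch to Hz
    gain = velocity / 127.0
    n = int(SAMPLE_RATE * duration)
    return [
        gain * math.exp(-6.0 * t / n) * math.sin(2 * math.pi * freq * t / SAMPLE_RATE)
        for t in range(n)
    ]


def make_example(rng):
    """Draw a random note and return (audio, label) with a known velocity."""
    pitch = rng.randint(40, 88)      # roughly a guitar's range (assumed)
    velocity = rng.randint(1, 127)   # the label real recordings lack
    audio = render_note(pitch, velocity)
    label = {"pitch": pitch, "velocity": velocity, "onset": 0.0}
    return audio, label


rng = random.Random(0)
dataset = [make_example(rng) for _ in range(4)]
print([label for _, label in dataset])
```

Because the velocity is chosen before rendering, every example comes with an exact label, which is the property that real guitar datasets are missing.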
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a method for velocity prediction in automatic guitar transcription. Synthetic data with velocity labels is generated from virtual instruments, a model is pretrained on it, and the weights are transferred to a model trained on real guitar audio without velocity labels. The authors claim that the resulting model outperforms a non-pretrained baseline on synthetic velocity prediction, yields a small (sometimes non-significant) gain in note transcription, and achieves results comparable to state-of-the-art guitar transcription.
Significance. If the transfer of velocity prediction from synthetic to real guitar audio holds under direct testing, the work would address a notable gap in AMT for non-piano instruments by enabling velocity-aware models without requiring real velocity labels; the use of synthetic pretraining and transfer learning is a practical strength.
major comments (3)
- [Abstract] Abstract and evaluation description: outperformance on velocity is reported solely versus a non-pretrained baseline on synthetic test data, but no quantitative metrics, error bars, statistical tests, or definition of how velocity is computed or scored for guitar are provided.
- [Evaluation / Results] The central transfer claim (pretrained velocity weights remain functional after fine-tuning on real recordings) is supported only by indirect note-transcription metrics showing small and sometimes non-significant gains; no direct evaluation or proxy (loudness correlation, human ratings, or cross-instrument consistency) of velocity quality on real guitar audio is described.
- [Methodology / Discussion] The weakest assumption, that statistical properties of synthetic velocity labels transfer meaningfully to real guitar, is stated but not tested, leaving generalization from virtual-instrument data unverified.
minor comments (1)
- The manuscript would benefit from explicit statements of the velocity definition used for guitar, details on the precise model architectures, and any ablation studies isolating the contribution of the velocity pretraining.
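The first major comment asks for error bars and statistical tests. A minimal sketch of what that could look like follows, assuming velocity errors from several training runs are available for both models. The numbers are invented, and the choice of an exact paired sign-flip permutation test is an assumption for illustration; the paper does not say which test, if any, was used.

```python
from itertools import product
from statistics import mean, stdev


def paired_sign_flip_p_value(a, b):
    """Two-sided exact sign-flip permutation test on paired differences.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)) / n)
        if flipped >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Hypothetical per-run velocity errors (one value per random seed).
baseline_errors = [22.1, 21.4, 23.0, 22.6, 21.9, 22.3]
pretrained_errors = [5.1, 4.8, 5.6, 4.9, 5.3, 5.0]

# Error bars: mean and sample standard deviation across runs.
print(f"baseline   {mean(baseline_errors):.2f} +/- {stdev(baseline_errors):.2f}")
print(f"pretrained {mean(pretrained_errors):.2f} +/- {stdev(pretrained_errors):.2f}")

p_value = paired_sign_flip_p_value(baseline_errors, pretrained_errors)
print(f"p = {p_value:.5f}")
```

With six runs the smallest achievable two-sided p-value is 2/64, so more runs would be needed to support a stronger significance claim.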
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of our evaluation and assumptions. We address each major comment below, proposing revisions to strengthen the manuscript.
- Referee: [Abstract] Abstract and evaluation description: outperformance on velocity is reported solely versus a non-pretrained baseline on synthetic test data, but no quantitative metrics, error bars, statistical tests, or definition of how velocity is computed or scored for guitar are provided.
  Authors: We agree with this observation. The current abstract summarizes the results qualitatively. In the revised version, we will expand the abstract and evaluation description to include quantitative metrics (e.g., velocity prediction error), error bars from multiple runs, results of statistical tests, and a clear definition of how velocity is computed and scored for guitar, including the mapping from MIDI velocity values to synthesized audio intensity.
  Revision: yes
- Referee: [Evaluation / Results] The central transfer claim (pretrained velocity weights remain functional after fine-tuning on real recordings) is supported only by indirect note-transcription metrics showing small and sometimes non-significant gains; no direct evaluation or proxy (loudness correlation, human ratings, or cross-instrument consistency) of velocity quality on real guitar audio is described.
  Authors: We acknowledge that our evaluation of the transferred velocity prediction relies on indirect evidence from note transcription improvements. Direct evaluation is challenging due to the absence of velocity annotations in real guitar datasets. We will revise the paper to explicitly state this limitation and discuss potential future proxies such as loudness correlations, while noting that the observed note transcription gains provide supporting evidence for retention of velocity information.
  Revision: partial
- Referee: [Methodology / Discussion] The weakest assumption, that statistical properties of synthetic velocity labels transfer meaningfully to real guitar, is stated but not tested, leaving generalization from virtual-instrument data unverified.
  Authors: The assumption regarding the transfer of velocity statistics from synthetic to real data is indeed central and not directly tested in the current experiments. We will enhance the discussion section to better articulate the rationale behind this assumption with references to analogous transfer learning results in audio domains, and we will clearly label it as a limitation requiring further validation.
  Revision: yes
Not resolved by the rebuttal:
- Direct quantitative evaluation of velocity prediction quality on real guitar audio, as no ground-truth velocity labels exist for real recordings.
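Since ground-truth velocities are unavailable for real recordings, one of the proxies the referee names, loudness correlation, could be sketched as follows. The sketch assumes each transcribed note has an audio segment and a predicted velocity, and it measures loudness as RMS level in decibels. The toy sine bursts and velocity values are placeholders, and the helper names are hypothetical.

```python
import math


def rms_db(samples):
    """RMS level of a segment in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(max(rms, 1e-10))  # floor avoids log(0) on silence


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def burst(amplitude, n=800, freq=220.0, sample_rate=16000):
    """Toy stand-in for the audio segment of one real note."""
    return [amplitude * math.sin(2 * math.pi * freq * t / sample_rate) for t in range(n)]


# Placeholder segments of increasing loudness and made-up model predictions.
segments = [burst(a) for a in (0.1, 0.2, 0.4, 0.8)]
predicted_velocity = [30, 55, 80, 110]

loudness = [rms_db(seg) for seg in segments]
r = pearson(predicted_velocity, loudness)
print(f"correlation between predicted velocity and RMS level: {r:.3f}")
```

A high correlation would be weak supporting evidence at best: recording gain, playing technique, and overlapping notes all affect loudness independently of how hard a string was struck.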
Circularity Check
No circularity: standard transfer learning with independent empirical evaluation
The paper describes pretraining a velocity prediction model on synthetic guitar data generated by virtual instruments, then transferring those weights to a separate model fine-tuned on real guitar recordings without velocity labels. Performance is assessed via direct comparison to a non-pretrained baseline on held-out synthetic data and via note transcription metrics on real data. No equations, fitted parameters, or self-citations are presented that reduce the claimed improvements to inputs by construction; the central claims rest on standard supervised pretraining plus transfer and remain falsifiable against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Velocity labels generated by virtual guitar instruments capture transferable acoustic properties of real guitar playing intensity.