pith. sign in

arxiv: 2605.24291 · v1 · pith:UIEDEZ6Jnew · submitted 2026-05-22 · 💻 cs.SD · cs.CL· cs.MM

Rubato: Transcribing Piano Music with Timestamps

Pith reviewed 2026-06-30 14:07 UTC · model grok-4.3

classification 💻 cs.SD cs.CLcs.MM
keywords piano transcriptionaudio to sheet musictimestamped notationpolyphonic musicsequence-to-sequence modelmusic information retrievalrubato performance
0
0 comments X

The pith

Rubato transcribes piano audio into timestamped sheet music more accurately than cascade methods by using a new polyphonic representation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Rubato, a prompt-conditioned encoder-decoder model that converts musical audio recordings into human-readable sheet music with timestamps. It relies on a new textual representation called InterMo for polyphonic music, designed to work with sequence-to-sequence training. This approach yields higher notational accuracy than existing cascade-based systems. Even when cascades receive ground-truth MIDI input instead of audio, Rubato still performs better, indicating that the main limitation in current methods is representational rather than acoustic. The multi-task training also allows it to compete on simpler tasks like beat detection.

Core claim

Rubato produces timestamped piano sheet music from audio with higher notational accuracy than the best existing approaches based on cascades. Even when the cascade is given ground-truth MIDI instead of audio, Rubato performs better, suggesting that the ceiling of existing approaches is primarily representational, not acoustic.

What carries the argument

InterMo, a new textual representation for polyphonic music compatible with sequence-to-sequence training, used in a prompt-conditioned encoder-decoder model trained on multiple related tasks.

Load-bearing premise

The InterMo representation enables better performance than cascades by being compatible with sequence-to-sequence training even when input is perfect MIDI.

What would settle it

A new cascade system that achieves equal or higher notational accuracy than Rubato when given ground-truth MIDI would show that the representational limit is not the main factor.

Figures

Figures reproduced from arXiv: 2605.24291 by Guang Yang, Nazif Can Tamer, Noah A. Smith, Victoria Ebert.

Figure 1
Figure 1. Figure 1: InterMo representation. Left, top to bottom: A two-bar piano excerpt, decomposed into metric intervals (gray/black) and pitched moments (orange = upper staff, teal = lower staff), serialized as a 1D text sequence. Right: English narration of InterMo, segmented into interval-piece pretokenization borders which follow a canonical order within score moments (§3.2). Uppercase = onset, lowercase = offset; # = s… view at source ↗
Figure 2
Figure 2. Figure 2: Top: Rubato model is trained to transcribe the same musical passage into different InterMo [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Downstream impact of the upstream offset convention. Only the upstream AMT changes [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: MIDI note-detection F1 on MAESTRO test (177 files). TkunPed is Transkun-V2 as [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Raw sustain-pedal velocity distribution over [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Score-transcription systems on ATEPP (All represented with InterMo). (a) OMR-NED on [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Isolated effect of tokenizer on n-gram shingling, holding the upstream transcription model [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
read the original abstract

We consider the conversion of musical recordings into human-readable sheet music annotated with timestamps. Such output lets a listener clearly visualize rubato (temporally expressive playing), a learner diagnose ensemble precision and timing choices against the written music, and a musicology scholar compare performance styles across recordings of the same work. We introduce (1) a prompt-conditioned encoder-decoder model, named Rubato, trained to output (2) a new textual representation for polyphonic music, named InterMo, which we designed for compatibility with sequence-to-sequence training. Our experiments demonstrate that Rubato produces timestamped piano sheet music from audio with higher notational accuracy than the best existing approaches, which are based on cascades. We find that even if the cascade is given ground-truth MIDI instead of audio, Rubato performs better, suggesting that the ceiling of existing approaches is primarily representational, not acoustic. Further, because Rubato is trained on several related tasks (with prompts), it competes with or outperforms the best single-task systems on related but simpler tasks like MIDI note grounding and beat/downbeat detection. A demo is available at https://nctamer.github.io/rubato-transcription .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Rubato, a prompt-conditioned encoder-decoder model trained to output a new textual representation InterMo for polyphonic piano music. It claims superior notational accuracy for timestamped sheet music from audio compared to cascade methods, and that it outperforms even cascades given ground-truth MIDI input, implying the performance ceiling of existing approaches is representational rather than acoustic. The model also competes on related tasks such as MIDI note grounding and beat/downbeat detection due to multi-task prompt training.

Significance. If the central empirical claims hold after addressing baseline details, the work would provide evidence that sequence-to-sequence models with custom representations can surpass cascaded pipelines in music transcription, with practical value for visualizing rubato and performance analysis. The multi-task prompt conditioning enabling competitive performance on simpler subtasks is a clear strength.

major comments (1)
  1. [Experiments (GT-MIDI comparison)] GT-MIDI cascade comparison (experiments section): the manuscript must specify the exact MIDI-to-timestamped-sheet pipeline used for the ground-truth MIDI baseline, including whether it is a trained model, off-the-shelf converter, or rule-based system, and whether it is subject to the same output format constraints as InterMo. Without this, the claim that Rubato's advantage demonstrates a representational (rather than acoustic) ceiling cannot be isolated from possible baseline weakness.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive comments. We address the major point on the GT-MIDI cascade comparison below and will revise the manuscript accordingly to improve experimental clarity.

read point-by-point responses
  1. Referee: [Experiments (GT-MIDI comparison)] GT-MIDI cascade comparison (experiments section): the manuscript must specify the exact MIDI-to-timestamped-sheet pipeline used for the ground-truth MIDI baseline, including whether it is a trained model, off-the-shelf converter, or rule-based system, and whether it is subject to the same output format constraints as InterMo. Without this, the claim that Rubato's advantage demonstrates a representational (rather than acoustic) ceiling cannot be isolated from possible baseline weakness.

    Authors: We agree that additional detail on the GT-MIDI cascade is required. The current manuscript does not fully specify the MIDI-to-timestamped-sheet conversion pipeline used for this baseline. In the revised version we will explicitly describe the pipeline (including its type and output-format constraints) so that the comparison isolates representational differences as intended. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external comparisons

full rationale

The paper introduces Rubato as a prompt-conditioned encoder-decoder trained on the new InterMo representation and validates its claims through direct empirical comparisons to cascade baselines, including a ground-truth MIDI input condition. These evaluations rely on held-out test performance metrics rather than any self-referential fitting where a parameter is tuned on data and then presented as a prediction of a closely related quantity. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force the central representational claim. The derivation chain consists of standard supervised sequence-to-sequence training followed by independent benchmarking, making the results self-contained against external methods.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract only; no details available on free parameters, axioms, or dependencies. The central claim rests on the effectiveness of the invented InterMo representation and multi-task prompt training.

invented entities (1)
  • InterMo no independent evidence
    purpose: New textual representation for polyphonic music designed for compatibility with sequence-to-sequence training
    Introduced in this work to support the encoder-decoder model

pith-pipeline@v0.9.1-grok · 5742 in / 1116 out tokens · 31245 ms · 2026-06-30T14:07:58.597192+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 36 canonical work pages · 5 internal anchors

  1. [1]

    Beat this! Accurate beat tracking without DBN postprocessing

    Francesco Foscarin, Jan Schlüter, and Gerhard Widmer. Beat this! Accurate beat tracking without DBN postprocessing. InProceedings of the 25th International Society for Music Information Retrieval Conference, 2024.doi:10.48550/arXiv.2407.21658

  2. [2]

    Scoring time intervals using non-hierarchical transformer for automatic piano transcription

    Yujia Yan and Zhiyao Duan. Scoring time intervals using non-hierarchical transformer for automatic piano transcription. InProc. International Society for Music Information Retrieval Conference (ISMIR), 2024.doi:10.5281/zenodo.14877493

  3. [3]

    Musically aware automatic piano transcription using synthetic pretraining

    Louis Bradshaw, Simon Colton, Alexander Spangher, and Stella Biderman. Musically aware automatic piano transcription using synthetic pretraining. Technical report, EleutherAI, 2024

  4. [4]

    High- resolution piano transcription with pedals by regressing onset and offset times.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3707–3717, 2021

    Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, and Yuxuan Wang. High- resolution piano transcription with pedals by regressing onset and offset times.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3707–3717, 2021. doi:10.1109/TASLP.2021.3121991

  5. [5]

    Gardner, I

    Josh Gardner, Ian Simon, Ethan Manilow, Curtis Hawthorne, and Jesse Engel. MT3: Multi- task multitrack music transcription. InInternational Conference on Learning Representations (ICLR), 2022.doi:10.48550/arXiv.2111.03017

  6. [6]

    Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

    Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse H. Engel, and Douglas Eck. Enabling factorized piano music modeling and generation with the MAESTRO dataset. InInternational Conference on Learning Representations (ICLR), 2019.doi:10.48550/arXiv.1810.12247

  7. [7]

    Performance MIDI- to-score conversion by neural beat tracking

    Lele Liu, Qiuqiang Kong, Veronica Morfi, and Emmanouil Benetos. Performance MIDI- to-score conversion by neural beat tracking. InProceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), Bengaluru, India, 2022. doi:10.5281/zenodo.7316682

  8. [8]

    End-to-end piano performance-MIDI to score conversion with transformers

    Tim Beyer and Angela Dai. End-to-end piano performance-MIDI to score conversion with transformers. InProceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 319–326, 2024.doi:10.48550/arXiv.2410.00210

  9. [9]

    Bridging piano transcription and rendering via disentangled score content and style

    Wei Zeng, Junchuan Zhao, and Ye Wang. Bridging piano transcription and rendering via disentangled score content and style. InInternational Conference on Learning Representations, 2026

  10. [10]

    Silvan David Peter, Carlos Eduardo Cancino-Chacón, Francesco Foscarin, Andrew Philip McLeod, Florian Henkel, Emmanouil Karystinaios, and Gerhard Widmer. Automatic note-level score-to-performance alignments in the ASAP dataset.Transactions of the International Society for Music Information Retrieval, 6(1):27–42, 2023.doi:10.5334/tismir.149

  11. [11]

    Robust Speech Recognition via Large-Scale Weak Supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR, 2023. doi:10.48550/arXiv.2212.04356

  12. [12]

    Word level timestamp generation for automatic speech recognition and translation

    Ke Hu, Krishna Puvvada, Elena Rastorgueva, Zhehuai Chen, He Huang, Shuoyang Ding, Kunal Dhawan, Hainan Xu, Jagadeesh Balam, and Boris Ginsburg. Word level timestamp generation for automatic speech recognition and translation. InProc. Interspeech 2025, pages 2565–2569, 2025.doi:10.21437/Interspeech.2025-869. 10

  13. [13]

    Order Matters: Sequence to sequence for sets

    Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to se- quence for sets. InInternational Conference on Learning Representations (ICLR), 2016. doi:10.48550/arXiv.1511.06391

  14. [14]

    Symphony generation with permutation invariant language model

    Jiafeng Liu, Yuanliang Dong, Zehua Cheng, Xinran Zhang, Xiaobing Li, Feng Yu, and Maosong Sun. Symphony generation with permutation invariant language model. InPro- ceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2022.doi:10.48550/arXiv.2205.05448

  15. [15]

    Jongmin Jung, Dongmin Kim, Sihun Lee, Seola Cho, Hyungjoon Soh, Irmak Bukey, Chris Donahue, and Dasaem Jeong. U-MusT: A unified framework for cross-modal translation of score images, symbolic music, and performance audio.IEEE Transactions on Audio, Speech and Language Processing, 34:1876–1891, 2026.doi:10.1109/TASLPRO.2025.3648794

  16. [16]

    LSTM networks can perform dynamic counting

    Mirac Suzgun, Yonatan Belinkov, Stuart Shieber, and Sebastian Gehrmann. LSTM networks can perform dynamic counting. In Jason Eisner, Matthias Gallé, Jeffrey Heinz, Ariadna Quattoni, and Guillaume Rabusseau, editors,Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 44–54, Florence, August 2019. Association for Compu...

  17. [17]

    Counting like transformers: Compiling temporal counting logic into softmax transformers

    Andy Yang and David Chiang. Counting like transformers: Compiling temporal counting logic into softmax transformers. InProceedings of the Conference on Language Modeling (COLM), 2024.doi:10.48550/arXiv.2404.04393

  18. [18]

    Between circuits and Chomsky: Pre-pretraining on formal languages imparts linguistic biases

    Michael Y Hu, Jackson Petty, Chuan Shi, William Merrill, and Tal Linzen. Between circuits and Chomsky: Pre-pretraining on formal languages imparts linguistic biases. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9691–9709, Vienna, Austria, 2025. doi:10.18653/v1/2025.acl-long.478

  19. [19]

    Compound word trans- former: Learning to compose full-song music over dynamic directed hypergraphs

    Wen-Yi Hsiao, Jen-Yu Liu, Yin-Cheng Yeh, and Yi-Hsuan Yang. Compound word trans- former: Learning to compose full-song music over dynamic directed hypergraphs. InPro- ceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 178–186, 2021. doi:10.1609/aaai.v35i1.16091

  20. [20]

    Mu- sicBERT: Symbolic music understanding with large-scale pre-training

    Mingliang Zeng, Xu Tan, Rui Wang, Zeqian Ju, Tao Qin, and Tie-Yan Liu. Mu- sicBERT: Symbolic music understanding with large-scale pre-training. InFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 791–800, 2021. doi:10.18653/v1/2021.findings-acl.70

  21. [21]

    Humdrum and Kern: Selective feature encoding

    David Huron. Humdrum and Kern: Selective feature encoding. In Eleanor Selfridge- Field, editor,Beyond MIDI: The Handbook of Musical Codes. MIT Press, 1997. doi:10.5555/275928.275976

  22. [22]

    The Music Encoding Initiative as a document-encoding framework

    Andrew Hankinson, Perry Roland, and Ichiro Fujinaga. The Music Encoding Initiative as a document-encoding framework. InProceedings of the 12th International Society for Music Information Retrieval Conference, 2011.doi:10.5281/zenodo.1417609

  23. [23]

    Verovio: A library for engraving MEI music notation into SVG

    Laurent Pugin, Rodolfo Zitellini, and Perry Roland. Verovio: A library for engraving MEI music notation into SVG. InProceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2014.doi:10.5281/zenodo.1417589

  24. [24]

    Puvvada, He Huang, Kunal Dhawan, Nithin Palaparthi, Ke Hu, Zhehuai Chen, Jagadeesh Balam, and Boris Ginsburg

    Krishna C. Puvvada, He Huang, Kunal Dhawan, Nithin Palaparthi, Ke Hu, Zhehuai Chen, Jagadeesh Balam, and Boris Ginsburg. Less is more: Accurate speech recognition & translation without web-scale data. InProc. Interspeech 2024, pages 1798–1802, 2024. doi:10.21437/Interspeech.2024-1058

  25. [25]

    PDMX: A large-scale public domain MusicXML dataset for symbolic music processing

    Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, and Julian McAuley. PDMX: A large-scale public domain MusicXML dataset for symbolic music processing. InICASSP 2025 – IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2025.doi:10.1109/ICASSP49660.2025.10890217. 11

  26. [26]

    ASAP: A dataset of aligned scores and performances for piano transcription

    Francesco Foscarin, Andrew McLeod, Philippe Rigaux, Florent Jacquemard, and Masahiko Sakai. ASAP: A dataset of aligned scores and performances for piano transcription. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2020.doi:10.5281/zenodo.4245490

  27. [27]

    This time with feeling: Learning expressive musical performance.Neural Computing and Applications, 32(4): 955–967, 2020.doi:10.1007/s00521-018-3758-9

    Sageev Oore, Ian Simon, Sander Dieleman, Douglas Eck, and Karen Simonyan. This time with feeling: Learning expressive musical performance.Neural Computing and Applications, 32(4): 955–967, 2020.doi:10.1007/s00521-018-3758-9

  28. [28]

    Byte pair encoding is suboptimal for language model pretraining

    Kaj Bostrom and Greg Durrett. Byte pair encoding is suboptimal for language model pretraining. InFindings of the Association for Computational Linguistics: EMNLP 2020, pages 4617–4624, 2020.doi:10.18653/v1/2020.findings-emnlp.414

  29. [29]

    Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

    Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates.arXiv preprint arXiv:1804.10959, 2018.doi:10.18653/v1/P18-1007

  30. [30]

    DawDreamer: Bridging the gap between digital audio workstations and Python interfaces

    David Braun et al. DawDreamer: Bridging the gap between digital audio workstations and Python interfaces. InProceedings of the International Society for Music Information Retrieval Conference (Late-Breaking Demo), 2021

  31. [31]

    VirtuosoNet: A hierarchical RNN-based system for modeling expressive piano performance

    Dasaem Jeong, Taegyun Kwon, Yoojin Kim, Kyogu Lee, and Juhan Nam. VirtuosoNet: A hierarchical RNN-based system for modeling expressive piano performance. InISMIR, pages 908–915, 2019.doi:10.5281/zenodo.3527962

  32. [32]

    Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

    Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, et al. Molmo2: Open weights and data for vision-language models with video understanding and grounding.arXiv preprint arXiv:2601.10611, 2026.doi:10.48550/arXiv.2601.10611

  33. [33]

    ATEPP: A dataset of automatically transcribed expressive piano perfor- mance

    Huan Zhang, Jingjing Tang, Syed Rifat Mahmud Rafee, George Fazekas, Simon Dixon, and Geraint A Wiggins. ATEPP: A dataset of automatically transcribed expressive piano perfor- mance. InProceedings of the 23rd International Society for Music Information Retrieval Con- ference (ISMIR), pages 446–453, Bengaluru, India, 2022.doi:10.5281/zenodo.7342764

  34. [34]

    Sheet music benchmark: Standardized optical music recogni- tion evaluation.arXiv preprint arXiv:2506.10488, 2025

    Juan C Martinez-Sevilla, Joan Cerveto-Serrano, Noelia Luna, Greg Chapman, Craig Sapp, David Rizo, and Jorge Calvo-Zaragoza. Sheet music benchmark: Standardized optical music recogni- tion evaluation.arXiv preprint arXiv:2506.10488, 2025. doi:10.48550/arXiv.2506.10488

  35. [35]

    End-to-end real-world polyphonic piano audio-to-score tran- scription with hierarchical decoding

    Wei Zeng, Xian He, and Ye Wang. End-to-end real-world polyphonic piano audio-to-score tran- scription with hierarchical decoding. InProceedings of the Thirty-Third International Joint Con- ference on Artificial Intelligence, pages 7788–7795, 2024. doi:10.24963/ijcai.2024/862

  36. [36]

    On the resemblance and containment of documents

    Andrei Z Broder. On the resemblance and containment of documents. InPro- ceedings of the Compression and Complexity of Sequences, pages 21–29. IEEE, 1997. doi:10.1109/SEQUEN.1997.666900

  37. [37]

    The TREC spoken document retrieval track: A success story

    John S Garofolo, Cedric GP Auzanne, and Ellen M V oorhees. The TREC spoken document retrieval track: A success story. InRIAO, volume 6, pages 1–20, 2000

  38. [38]

    Oguz Araz, Dmitry Bogdanov, and Yuki Mitsufuji

    Joan Serrà, R. Oguz Araz, Dmitry Bogdanov, and Yuki Mitsufuji. Supervised con- trastive learning from weakly-labeled audio segments for musical version matching. InProceedings of the International Conference on Machine Learning (ICML), 2025. doi:10.48550/arXiv.2502.16936

  39. [39]

    CoverHunter: Cover song identification with refined attention and alignments

    Feng Liu, Deyi Tuo, Yinan Xu, and Xintong Han. CoverHunter: Cover song identification with refined attention and alignments. InIEEE International Conference on Multimedia and Expo (ICME), 2023.doi:10.1109/ICME55011.2023.00189

  40. [40]

    Distinctive neurophysiological correlates of sound onset and offset perception in humans.The Journal of Physiology, 2025

    Fatima Ali, Gabriela Bury, Adèle Simon, and Jennifer F Linden. Distinctive neurophysiological correlates of sound onset and offset perception in humans.The Journal of Physiology, 2025

  41. [41]

    Example input audio. Produce only ABC notation

    Samuel A Mehr. Core systems of music perception.Trends in Cognitive Sciences, 2025. doi:10.1016/j.tics.2025.05.013. 12 A Gemini Evaluation We evaluate Gemini 3.1 Pro to examine how a frontier audio-language model, without task-specific training, approaches piano score transcription when prompted to produce conventional notation formats. A.1 Output Format ...