Rubato: Transcribing Piano Music with Timestamps

Guang Yang; Nazif Can Tamer; Noah A. Smith; Victoria Ebert

arxiv: 2605.24291 · v1 · pith:UIEDEZ6Jnew · submitted 2026-05-22 · 💻 cs.SD · cs.CL· cs.MM

Rubato: Transcribing Piano Music with Timestamps

Nazif Can Tamer , Victoria Ebert , Guang Yang , Noah A. Smith This is my paper

Pith reviewed 2026-06-30 14:07 UTC · model grok-4.3

classification 💻 cs.SD cs.CLcs.MM

keywords piano transcriptionaudio to sheet musictimestamped notationpolyphonic musicsequence-to-sequence modelmusic information retrievalrubato performance

0 comments

The pith

Rubato transcribes piano audio into timestamped sheet music more accurately than cascade methods by using a new polyphonic representation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Rubato, a prompt-conditioned encoder-decoder model that converts musical audio recordings into human-readable sheet music with timestamps. It relies on a new textual representation called InterMo for polyphonic music, designed to work with sequence-to-sequence training. This approach yields higher notational accuracy than existing cascade-based systems. Even when cascades receive ground-truth MIDI input instead of audio, Rubato still performs better, indicating that the main limitation in current methods is representational rather than acoustic. The multi-task training also allows it to compete on simpler tasks like beat detection.

Core claim

Rubato produces timestamped piano sheet music from audio with higher notational accuracy than the best existing approaches based on cascades. Even when the cascade is given ground-truth MIDI instead of audio, Rubato performs better, suggesting that the ceiling of existing approaches is primarily representational, not acoustic.

What carries the argument

InterMo, a new textual representation for polyphonic music compatible with sequence-to-sequence training, used in a prompt-conditioned encoder-decoder model trained on multiple related tasks.

Load-bearing premise

The InterMo representation enables better performance than cascades by being compatible with sequence-to-sequence training even when input is perfect MIDI.

What would settle it

A new cascade system that achieves equal or higher notational accuracy than Rubato when given ground-truth MIDI would show that the representational limit is not the main factor.

Figures

Figures reproduced from arXiv: 2605.24291 by Guang Yang, Nazif Can Tamer, Noah A. Smith, Victoria Ebert.

**Figure 1.** Figure 1: InterMo representation. Left, top to bottom: A two-bar piano excerpt, decomposed into metric intervals (gray/black) and pitched moments (orange = upper staff, teal = lower staff), serialized as a 1D text sequence. Right: English narration of InterMo, segmented into interval-piece pretokenization borders which follow a canonical order within score moments (§3.2). Uppercase = onset, lowercase = offset; # = s… view at source ↗

**Figure 2.** Figure 2: Top: Rubato model is trained to transcribe the same musical passage into different InterMo [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Downstream impact of the upstream offset convention. Only the upstream AMT changes [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

**Figure 4.** Figure 4: MIDI note-detection F1 on MAESTRO test (177 files). TkunPed is Transkun-V2 as [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

**Figure 5.** Figure 5: Raw sustain-pedal velocity distribution over [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Score-transcription systems on ATEPP (All represented with InterMo). (a) OMR-NED on [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Isolated effect of tokenizer on n-gram shingling, holding the upstream transcription model [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

read the original abstract

We consider the conversion of musical recordings into human-readable sheet music annotated with timestamps. Such output lets a listener clearly visualize rubato (temporally expressive playing), a learner diagnose ensemble precision and timing choices against the written music, and a musicology scholar compare performance styles across recordings of the same work. We introduce (1) a prompt-conditioned encoder-decoder model, named Rubato, trained to output (2) a new textual representation for polyphonic music, named InterMo, which we designed for compatibility with sequence-to-sequence training. Our experiments demonstrate that Rubato produces timestamped piano sheet music from audio with higher notational accuracy than the best existing approaches, which are based on cascades. We find that even if the cascade is given ground-truth MIDI instead of audio, Rubato performs better, suggesting that the ceiling of existing approaches is primarily representational, not acoustic. Further, because Rubato is trained on several related tasks (with prompts), it competes with or outperforms the best single-task systems on related but simpler tasks like MIDI note grounding and beat/downbeat detection. A demo is available at https://nctamer.github.io/rubato-transcription .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Rubato's InterMo representation and prompt-conditioned model claim to beat cascades on timestamped piano transcription even with GT MIDI input, but the baseline's MIDI-to-notation step is underspecified so the representational-ceiling conclusion is not yet isolated.

read the letter

The paper's main contribution is a prompt-conditioned encoder-decoder that outputs a new textual format called InterMo for polyphonic piano music with timestamps. This lets the system produce human-readable sheet music directly from audio while also handling related tasks like MIDI note grounding and beat detection through prompting.

InterMo is designed from the start to fit sequence-to-sequence training, which is the concrete novelty here. The multi-task training appears to give the model an edge on simpler subtasks without hurting the main transcription goal. The experiments compare against cascade baselines and report better notational accuracy, including the case where the cascade receives ground-truth MIDI.

The soft spot is exactly the one flagged in the stress test. The claim that existing approaches are limited by representation rather than acoustics depends on Rubato outperforming a GT-MIDI cascade. The abstract gives no description of how that cascade turns MIDI into timestamped notation, what training it received, or whether it faced the same output constraints as InterMo. If the MIDI-to-notation piece is an off-the-shelf or lightly tuned converter, the gap does not cleanly demonstrate InterMo superiority.

This paper is for MIR researchers who work on transcription pipelines and performance analysis tools. A reader who needs a working system for rubato visualization or ensemble feedback would find the direct output format practical. The work is coherent on its own terms and shows honest engagement with the cascade literature, so it deserves a serious referee to check the experimental details and baseline strength.

I would send it to peer review.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Rubato, a prompt-conditioned encoder-decoder model trained to output a new textual representation InterMo for polyphonic piano music. It claims superior notational accuracy for timestamped sheet music from audio compared to cascade methods, and that it outperforms even cascades given ground-truth MIDI input, implying the performance ceiling of existing approaches is representational rather than acoustic. The model also competes on related tasks such as MIDI note grounding and beat/downbeat detection due to multi-task prompt training.

Significance. If the central empirical claims hold after addressing baseline details, the work would provide evidence that sequence-to-sequence models with custom representations can surpass cascaded pipelines in music transcription, with practical value for visualizing rubato and performance analysis. The multi-task prompt conditioning enabling competitive performance on simpler subtasks is a clear strength.

major comments (1)

[Experiments (GT-MIDI comparison)] GT-MIDI cascade comparison (experiments section): the manuscript must specify the exact MIDI-to-timestamped-sheet pipeline used for the ground-truth MIDI baseline, including whether it is a trained model, off-the-shelf converter, or rule-based system, and whether it is subject to the same output format constraints as InterMo. Without this, the claim that Rubato's advantage demonstrates a representational (rather than acoustic) ceiling cannot be isolated from possible baseline weakness.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive comments. We address the major point on the GT-MIDI cascade comparison below and will revise the manuscript accordingly to improve experimental clarity.

read point-by-point responses

Referee: [Experiments (GT-MIDI comparison)] GT-MIDI cascade comparison (experiments section): the manuscript must specify the exact MIDI-to-timestamped-sheet pipeline used for the ground-truth MIDI baseline, including whether it is a trained model, off-the-shelf converter, or rule-based system, and whether it is subject to the same output format constraints as InterMo. Without this, the claim that Rubato's advantage demonstrates a representational (rather than acoustic) ceiling cannot be isolated from possible baseline weakness.

Authors: We agree that additional detail on the GT-MIDI cascade is required. The current manuscript does not fully specify the MIDI-to-timestamped-sheet conversion pipeline used for this baseline. In the revised version we will explicitly describe the pipeline (including its type and output-format constraints) so that the comparison isolates representational differences as intended. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external comparisons

full rationale

The paper introduces Rubato as a prompt-conditioned encoder-decoder trained on the new InterMo representation and validates its claims through direct empirical comparisons to cascade baselines, including a ground-truth MIDI input condition. These evaluations rely on held-out test performance metrics rather than any self-referential fitting where a parameter is tuned on data and then presented as a prediction of a closely related quantity. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force the central representational claim. The derivation chain consists of standard supervised sequence-to-sequence training followed by independent benchmarking, making the results self-contained against external methods.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract only; no details available on free parameters, axioms, or dependencies. The central claim rests on the effectiveness of the invented InterMo representation and multi-task prompt training.

invented entities (1)

InterMo no independent evidence
purpose: New textual representation for polyphonic music designed for compatibility with sequence-to-sequence training
Introduced in this work to support the encoder-decoder model

pith-pipeline@v0.9.1-grok · 5742 in / 1116 out tokens · 31245 ms · 2026-06-30T14:07:58.597192+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 36 canonical work pages · 5 internal anchors

[1]

Beat this! Accurate beat tracking without DBN postprocessing

Francesco Foscarin, Jan Schlüter, and Gerhard Widmer. Beat this! Accurate beat tracking without DBN postprocessing. InProceedings of the 25th International Society for Music Information Retrieval Conference, 2024.doi:10.48550/arXiv.2407.21658

work page doi:10.48550/arxiv.2407.21658 2024
[2]

Scoring time intervals using non-hierarchical transformer for automatic piano transcription

Yujia Yan and Zhiyao Duan. Scoring time intervals using non-hierarchical transformer for automatic piano transcription. InProc. International Society for Music Information Retrieval Conference (ISMIR), 2024.doi:10.5281/zenodo.14877493

work page doi:10.5281/zenodo.14877493 2024
[3]

Musically aware automatic piano transcription using synthetic pretraining

Louis Bradshaw, Simon Colton, Alexander Spangher, and Stella Biderman. Musically aware automatic piano transcription using synthetic pretraining. Technical report, EleutherAI, 2024

2024
[4]

High- resolution piano transcription with pedals by regressing onset and offset times.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3707–3717, 2021

Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, and Yuxuan Wang. High- resolution piano transcription with pedals by regressing onset and offset times.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3707–3717, 2021. doi:10.1109/TASLP.2021.3121991

work page doi:10.1109/taslp.2021.3121991 2021
[5]

Gardner, I

Josh Gardner, Ian Simon, Ethan Manilow, Curtis Hawthorne, and Jesse Engel. MT3: Multi- task multitrack music transcription. InInternational Conference on Learning Representations (ICLR), 2022.doi:10.48550/arXiv.2111.03017

work page doi:10.48550/arxiv.2111.03017 2022
[6]

Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse H. Engel, and Douglas Eck. Enabling factorized piano music modeling and generation with the MAESTRO dataset. InInternational Conference on Learning Representations (ICLR), 2019.doi:10.48550/arXiv.1810.12247

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1810.12247 2019
[7]

Performance MIDI- to-score conversion by neural beat tracking

Lele Liu, Qiuqiang Kong, Veronica Morfi, and Emmanouil Benetos. Performance MIDI- to-score conversion by neural beat tracking. InProceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), Bengaluru, India, 2022. doi:10.5281/zenodo.7316682

work page doi:10.5281/zenodo.7316682 2022
[8]

End-to-end piano performance-MIDI to score conversion with transformers

Tim Beyer and Angela Dai. End-to-end piano performance-MIDI to score conversion with transformers. InProceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 319–326, 2024.doi:10.48550/arXiv.2410.00210

work page doi:10.48550/arxiv.2410.00210 2024
[9]

Bridging piano transcription and rendering via disentangled score content and style

Wei Zeng, Junchuan Zhao, and Ye Wang. Bridging piano transcription and rendering via disentangled score content and style. InInternational Conference on Learning Representations, 2026

2026
[10]

Silvan David Peter, Carlos Eduardo Cancino-Chacón, Francesco Foscarin, Andrew Philip McLeod, Florian Henkel, Emmanouil Karystinaios, and Gerhard Widmer. Automatic note-level score-to-performance alignments in the ASAP dataset.Transactions of the International Society for Music Information Retrieval, 6(1):27–42, 2023.doi:10.5334/tismir.149

work page doi:10.5334/tismir.149 2023
[11]

Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR, 2023. doi:10.48550/arXiv.2212.04356

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.04356 2023
[12]

Word level timestamp generation for automatic speech recognition and translation

Ke Hu, Krishna Puvvada, Elena Rastorgueva, Zhehuai Chen, He Huang, Shuoyang Ding, Kunal Dhawan, Hainan Xu, Jagadeesh Balam, and Boris Ginsburg. Word level timestamp generation for automatic speech recognition and translation. InProc. Interspeech 2025, pages 2565–2569, 2025.doi:10.21437/Interspeech.2025-869. 10

work page doi:10.21437/interspeech.2025-869 2025
[13]

Order Matters: Sequence to sequence for sets

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to se- quence for sets. InInternational Conference on Learning Representations (ICLR), 2016. doi:10.48550/arXiv.1511.06391

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1511.06391 2016
[14]

Symphony generation with permutation invariant language model

Jiafeng Liu, Yuanliang Dong, Zehua Cheng, Xinran Zhang, Xiaobing Li, Feng Yu, and Maosong Sun. Symphony generation with permutation invariant language model. InPro- ceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2022.doi:10.48550/arXiv.2205.05448

work page doi:10.48550/arxiv.2205.05448 2022
[15]

Jongmin Jung, Dongmin Kim, Sihun Lee, Seola Cho, Hyungjoon Soh, Irmak Bukey, Chris Donahue, and Dasaem Jeong. U-MusT: A unified framework for cross-modal translation of score images, symbolic music, and performance audio.IEEE Transactions on Audio, Speech and Language Processing, 34:1876–1891, 2026.doi:10.1109/TASLPRO.2025.3648794

work page doi:10.1109/taslpro.2025.3648794 2026
[16]

LSTM networks can perform dynamic counting

Mirac Suzgun, Yonatan Belinkov, Stuart Shieber, and Sebastian Gehrmann. LSTM networks can perform dynamic counting. In Jason Eisner, Matthias Gallé, Jeffrey Heinz, Ariadna Quattoni, and Guillaume Rabusseau, editors,Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 44–54, Florence, August 2019. Association for Compu...

work page doi:10.18653/v1/w19-3905 2019
[17]

Counting like transformers: Compiling temporal counting logic into softmax transformers

Andy Yang and David Chiang. Counting like transformers: Compiling temporal counting logic into softmax transformers. InProceedings of the Conference on Language Modeling (COLM), 2024.doi:10.48550/arXiv.2404.04393

work page doi:10.48550/arxiv.2404.04393 2024
[18]

Between circuits and Chomsky: Pre-pretraining on formal languages imparts linguistic biases

Michael Y Hu, Jackson Petty, Chuan Shi, William Merrill, and Tal Linzen. Between circuits and Chomsky: Pre-pretraining on formal languages imparts linguistic biases. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9691–9709, Vienna, Austria, 2025. doi:10.18653/v1/2025.acl-long.478

work page doi:10.18653/v1/2025.acl-long.478 2025
[19]

Compound word trans- former: Learning to compose full-song music over dynamic directed hypergraphs

Wen-Yi Hsiao, Jen-Yu Liu, Yin-Cheng Yeh, and Yi-Hsuan Yang. Compound word trans- former: Learning to compose full-song music over dynamic directed hypergraphs. InPro- ceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 178–186, 2021. doi:10.1609/aaai.v35i1.16091

work page doi:10.1609/aaai.v35i1.16091 2021
[20]

Mu- sicBERT: Symbolic music understanding with large-scale pre-training

Mingliang Zeng, Xu Tan, Rui Wang, Zeqian Ju, Tao Qin, and Tie-Yan Liu. Mu- sicBERT: Symbolic music understanding with large-scale pre-training. InFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 791–800, 2021. doi:10.18653/v1/2021.findings-acl.70

work page doi:10.18653/v1/2021.findings-acl.70 2021
[21]

Humdrum and Kern: Selective feature encoding

David Huron. Humdrum and Kern: Selective feature encoding. In Eleanor Selfridge- Field, editor,Beyond MIDI: The Handbook of Musical Codes. MIT Press, 1997. doi:10.5555/275928.275976

work page doi:10.5555/275928.275976 1997
[22]

The Music Encoding Initiative as a document-encoding framework

Andrew Hankinson, Perry Roland, and Ichiro Fujinaga. The Music Encoding Initiative as a document-encoding framework. InProceedings of the 12th International Society for Music Information Retrieval Conference, 2011.doi:10.5281/zenodo.1417609

work page doi:10.5281/zenodo.1417609 2011
[23]

Verovio: A library for engraving MEI music notation into SVG

Laurent Pugin, Rodolfo Zitellini, and Perry Roland. Verovio: A library for engraving MEI music notation into SVG. InProceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2014.doi:10.5281/zenodo.1417589

work page doi:10.5281/zenodo.1417589 2014
[24]

Puvvada, He Huang, Kunal Dhawan, Nithin Palaparthi, Ke Hu, Zhehuai Chen, Jagadeesh Balam, and Boris Ginsburg

Krishna C. Puvvada, He Huang, Kunal Dhawan, Nithin Palaparthi, Ke Hu, Zhehuai Chen, Jagadeesh Balam, and Boris Ginsburg. Less is more: Accurate speech recognition & translation without web-scale data. InProc. Interspeech 2024, pages 1798–1802, 2024. doi:10.21437/Interspeech.2024-1058

work page doi:10.21437/interspeech.2024-1058 2024
[25]

PDMX: A large-scale public domain MusicXML dataset for symbolic music processing

Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, and Julian McAuley. PDMX: A large-scale public domain MusicXML dataset for symbolic music processing. InICASSP 2025 – IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2025.doi:10.1109/ICASSP49660.2025.10890217. 11

work page doi:10.1109/icassp49660.2025.10890217 2025
[26]

ASAP: A dataset of aligned scores and performances for piano transcription

Francesco Foscarin, Andrew McLeod, Philippe Rigaux, Florent Jacquemard, and Masahiko Sakai. ASAP: A dataset of aligned scores and performances for piano transcription. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2020.doi:10.5281/zenodo.4245490

work page doi:10.5281/zenodo.4245490 2020
[27]

This time with feeling: Learning expressive musical performance.Neural Computing and Applications, 32(4): 955–967, 2020.doi:10.1007/s00521-018-3758-9

Sageev Oore, Ian Simon, Sander Dieleman, Douglas Eck, and Karen Simonyan. This time with feeling: Learning expressive musical performance.Neural Computing and Applications, 32(4): 955–967, 2020.doi:10.1007/s00521-018-3758-9

work page doi:10.1007/s00521-018-3758-9 2020
[28]

Byte pair encoding is suboptimal for language model pretraining

Kaj Bostrom and Greg Durrett. Byte pair encoding is suboptimal for language model pretraining. InFindings of the Association for Computational Linguistics: EMNLP 2020, pages 4617–4624, 2020.doi:10.18653/v1/2020.findings-emnlp.414

work page doi:10.18653/v1/2020.findings-emnlp.414 2020
[29]

Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates.arXiv preprint arXiv:1804.10959, 2018.doi:10.18653/v1/P18-1007

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/p18-1007 2018
[30]

DawDreamer: Bridging the gap between digital audio workstations and Python interfaces

David Braun et al. DawDreamer: Bridging the gap between digital audio workstations and Python interfaces. InProceedings of the International Society for Music Information Retrieval Conference (Late-Breaking Demo), 2021

2021
[31]

VirtuosoNet: A hierarchical RNN-based system for modeling expressive piano performance

Dasaem Jeong, Taegyun Kwon, Yoojin Kim, Kyogu Lee, and Juhan Nam. VirtuosoNet: A hierarchical RNN-based system for modeling expressive piano performance. InISMIR, pages 908–915, 2019.doi:10.5281/zenodo.3527962

work page doi:10.5281/zenodo.3527962 2019
[32]

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, et al. Molmo2: Open weights and data for vision-language models with video understanding and grounding.arXiv preprint arXiv:2601.10611, 2026.doi:10.48550/arXiv.2601.10611

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.10611 2026
[33]

ATEPP: A dataset of automatically transcribed expressive piano perfor- mance

Huan Zhang, Jingjing Tang, Syed Rifat Mahmud Rafee, George Fazekas, Simon Dixon, and Geraint A Wiggins. ATEPP: A dataset of automatically transcribed expressive piano perfor- mance. InProceedings of the 23rd International Society for Music Information Retrieval Con- ference (ISMIR), pages 446–453, Bengaluru, India, 2022.doi:10.5281/zenodo.7342764

work page doi:10.5281/zenodo.7342764 2022
[34]

Sheet music benchmark: Standardized optical music recogni- tion evaluation.arXiv preprint arXiv:2506.10488, 2025

Juan C Martinez-Sevilla, Joan Cerveto-Serrano, Noelia Luna, Greg Chapman, Craig Sapp, David Rizo, and Jorge Calvo-Zaragoza. Sheet music benchmark: Standardized optical music recogni- tion evaluation.arXiv preprint arXiv:2506.10488, 2025. doi:10.48550/arXiv.2506.10488

work page doi:10.48550/arxiv.2506.10488 2025
[35]

End-to-end real-world polyphonic piano audio-to-score tran- scription with hierarchical decoding

Wei Zeng, Xian He, and Ye Wang. End-to-end real-world polyphonic piano audio-to-score tran- scription with hierarchical decoding. InProceedings of the Thirty-Third International Joint Con- ference on Artificial Intelligence, pages 7788–7795, 2024. doi:10.24963/ijcai.2024/862

work page doi:10.24963/ijcai.2024/862 2024
[36]

On the resemblance and containment of documents

Andrei Z Broder. On the resemblance and containment of documents. InPro- ceedings of the Compression and Complexity of Sequences, pages 21–29. IEEE, 1997. doi:10.1109/SEQUEN.1997.666900

work page doi:10.1109/sequen.1997.666900 1997
[37]

The TREC spoken document retrieval track: A success story

John S Garofolo, Cedric GP Auzanne, and Ellen M V oorhees. The TREC spoken document retrieval track: A success story. InRIAO, volume 6, pages 1–20, 2000

2000
[38]

Oguz Araz, Dmitry Bogdanov, and Yuki Mitsufuji

Joan Serrà, R. Oguz Araz, Dmitry Bogdanov, and Yuki Mitsufuji. Supervised con- trastive learning from weakly-labeled audio segments for musical version matching. InProceedings of the International Conference on Machine Learning (ICML), 2025. doi:10.48550/arXiv.2502.16936

work page doi:10.48550/arxiv.2502.16936 2025
[39]

CoverHunter: Cover song identification with refined attention and alignments

Feng Liu, Deyi Tuo, Yinan Xu, and Xintong Han. CoverHunter: Cover song identification with refined attention and alignments. InIEEE International Conference on Multimedia and Expo (ICME), 2023.doi:10.1109/ICME55011.2023.00189

work page doi:10.1109/icme55011.2023.00189 2023
[40]

Distinctive neurophysiological correlates of sound onset and offset perception in humans.The Journal of Physiology, 2025

Fatima Ali, Gabriela Bury, Adèle Simon, and Jennifer F Linden. Distinctive neurophysiological correlates of sound onset and offset perception in humans.The Journal of Physiology, 2025

2025
[41]

Example input audio. Produce only ABC notation

Samuel A Mehr. Core systems of music perception.Trends in Cognitive Sciences, 2025. doi:10.1016/j.tics.2025.05.013. 12 A Gemini Evaluation We evaluate Gemini 3.1 Pro to examine how a frontier audio-language model, without task-specific training, approaches piano score transcription when prompted to produce conventional notation formats. A.1 Output Format ...

work page doi:10.1016/j.tics.2025.05.013 2025

[1] [1]

Beat this! Accurate beat tracking without DBN postprocessing

Francesco Foscarin, Jan Schlüter, and Gerhard Widmer. Beat this! Accurate beat tracking without DBN postprocessing. InProceedings of the 25th International Society for Music Information Retrieval Conference, 2024.doi:10.48550/arXiv.2407.21658

work page doi:10.48550/arxiv.2407.21658 2024

[2] [2]

Scoring time intervals using non-hierarchical transformer for automatic piano transcription

Yujia Yan and Zhiyao Duan. Scoring time intervals using non-hierarchical transformer for automatic piano transcription. InProc. International Society for Music Information Retrieval Conference (ISMIR), 2024.doi:10.5281/zenodo.14877493

work page doi:10.5281/zenodo.14877493 2024

[3] [3]

Musically aware automatic piano transcription using synthetic pretraining

Louis Bradshaw, Simon Colton, Alexander Spangher, and Stella Biderman. Musically aware automatic piano transcription using synthetic pretraining. Technical report, EleutherAI, 2024

2024

[4] [4]

High- resolution piano transcription with pedals by regressing onset and offset times.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3707–3717, 2021

Qiuqiang Kong, Bochen Li, Xuchen Song, Yuan Wan, and Yuxuan Wang. High- resolution piano transcription with pedals by regressing onset and offset times.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3707–3717, 2021. doi:10.1109/TASLP.2021.3121991

work page doi:10.1109/taslp.2021.3121991 2021

[5] [5]

Gardner, I

Josh Gardner, Ian Simon, Ethan Manilow, Curtis Hawthorne, and Jesse Engel. MT3: Multi- task multitrack music transcription. InInternational Conference on Learning Representations (ICLR), 2022.doi:10.48550/arXiv.2111.03017

work page doi:10.48550/arxiv.2111.03017 2022

[6] [6]

Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse H. Engel, and Douglas Eck. Enabling factorized piano music modeling and generation with the MAESTRO dataset. InInternational Conference on Learning Representations (ICLR), 2019.doi:10.48550/arXiv.1810.12247

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1810.12247 2019

[7] [7]

Performance MIDI- to-score conversion by neural beat tracking

Lele Liu, Qiuqiang Kong, Veronica Morfi, and Emmanouil Benetos. Performance MIDI- to-score conversion by neural beat tracking. InProceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), Bengaluru, India, 2022. doi:10.5281/zenodo.7316682

work page doi:10.5281/zenodo.7316682 2022

[8] [8]

End-to-end piano performance-MIDI to score conversion with transformers

Tim Beyer and Angela Dai. End-to-end piano performance-MIDI to score conversion with transformers. InProceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 319–326, 2024.doi:10.48550/arXiv.2410.00210

work page doi:10.48550/arxiv.2410.00210 2024

[9] [9]

Bridging piano transcription and rendering via disentangled score content and style

Wei Zeng, Junchuan Zhao, and Ye Wang. Bridging piano transcription and rendering via disentangled score content and style. InInternational Conference on Learning Representations, 2026

2026

[10] [10]

Silvan David Peter, Carlos Eduardo Cancino-Chacón, Francesco Foscarin, Andrew Philip McLeod, Florian Henkel, Emmanouil Karystinaios, and Gerhard Widmer. Automatic note-level score-to-performance alignments in the ASAP dataset.Transactions of the International Society for Music Information Retrieval, 6(1):27–42, 2023.doi:10.5334/tismir.149

work page doi:10.5334/tismir.149 2023

[11] [11]

Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR, 2023. doi:10.48550/arXiv.2212.04356

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.04356 2023

[12] [12]

Word level timestamp generation for automatic speech recognition and translation

Ke Hu, Krishna Puvvada, Elena Rastorgueva, Zhehuai Chen, He Huang, Shuoyang Ding, Kunal Dhawan, Hainan Xu, Jagadeesh Balam, and Boris Ginsburg. Word level timestamp generation for automatic speech recognition and translation. InProc. Interspeech 2025, pages 2565–2569, 2025.doi:10.21437/Interspeech.2025-869. 10

work page doi:10.21437/interspeech.2025-869 2025

[13] [13]

Order Matters: Sequence to sequence for sets

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. Order matters: Sequence to se- quence for sets. InInternational Conference on Learning Representations (ICLR), 2016. doi:10.48550/arXiv.1511.06391

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1511.06391 2016

[14] [14]

Symphony generation with permutation invariant language model

Jiafeng Liu, Yuanliang Dong, Zehua Cheng, Xinran Zhang, Xiaobing Li, Feng Yu, and Maosong Sun. Symphony generation with permutation invariant language model. InPro- ceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2022.doi:10.48550/arXiv.2205.05448

work page doi:10.48550/arxiv.2205.05448 2022

[15] [15]

Jongmin Jung, Dongmin Kim, Sihun Lee, Seola Cho, Hyungjoon Soh, Irmak Bukey, Chris Donahue, and Dasaem Jeong. U-MusT: A unified framework for cross-modal translation of score images, symbolic music, and performance audio.IEEE Transactions on Audio, Speech and Language Processing, 34:1876–1891, 2026.doi:10.1109/TASLPRO.2025.3648794

work page doi:10.1109/taslpro.2025.3648794 2026

[16] [16]

LSTM networks can perform dynamic counting

Mirac Suzgun, Yonatan Belinkov, Stuart Shieber, and Sebastian Gehrmann. LSTM networks can perform dynamic counting. In Jason Eisner, Matthias Gallé, Jeffrey Heinz, Ariadna Quattoni, and Guillaume Rabusseau, editors,Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 44–54, Florence, August 2019. Association for Compu...

work page doi:10.18653/v1/w19-3905 2019

[17] [17]

Counting like transformers: Compiling temporal counting logic into softmax transformers

Andy Yang and David Chiang. Counting like transformers: Compiling temporal counting logic into softmax transformers. InProceedings of the Conference on Language Modeling (COLM), 2024.doi:10.48550/arXiv.2404.04393

work page doi:10.48550/arxiv.2404.04393 2024

[18] [18]

Between circuits and Chomsky: Pre-pretraining on formal languages imparts linguistic biases

Michael Y Hu, Jackson Petty, Chuan Shi, William Merrill, and Tal Linzen. Between circuits and Chomsky: Pre-pretraining on formal languages imparts linguistic biases. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9691–9709, Vienna, Austria, 2025. doi:10.18653/v1/2025.acl-long.478

work page doi:10.18653/v1/2025.acl-long.478 2025

[19] [19]

Compound word trans- former: Learning to compose full-song music over dynamic directed hypergraphs

Wen-Yi Hsiao, Jen-Yu Liu, Yin-Cheng Yeh, and Yi-Hsuan Yang. Compound word trans- former: Learning to compose full-song music over dynamic directed hypergraphs. InPro- ceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 178–186, 2021. doi:10.1609/aaai.v35i1.16091

work page doi:10.1609/aaai.v35i1.16091 2021

[20] [20]

Mu- sicBERT: Symbolic music understanding with large-scale pre-training

Mingliang Zeng, Xu Tan, Rui Wang, Zeqian Ju, Tao Qin, and Tie-Yan Liu. Mu- sicBERT: Symbolic music understanding with large-scale pre-training. InFindings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 791–800, 2021. doi:10.18653/v1/2021.findings-acl.70

work page doi:10.18653/v1/2021.findings-acl.70 2021

[21] [21]

Humdrum and Kern: Selective feature encoding

David Huron. Humdrum and Kern: Selective feature encoding. In Eleanor Selfridge- Field, editor,Beyond MIDI: The Handbook of Musical Codes. MIT Press, 1997. doi:10.5555/275928.275976

work page doi:10.5555/275928.275976 1997

[22] [22]

The Music Encoding Initiative as a document-encoding framework

Andrew Hankinson, Perry Roland, and Ichiro Fujinaga. The Music Encoding Initiative as a document-encoding framework. InProceedings of the 12th International Society for Music Information Retrieval Conference, 2011.doi:10.5281/zenodo.1417609

work page doi:10.5281/zenodo.1417609 2011

[23] [23]

Verovio: A library for engraving MEI music notation into SVG

Laurent Pugin, Rodolfo Zitellini, and Perry Roland. Verovio: A library for engraving MEI music notation into SVG. InProceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2014.doi:10.5281/zenodo.1417589

work page doi:10.5281/zenodo.1417589 2014

[24] [24]

Puvvada, He Huang, Kunal Dhawan, Nithin Palaparthi, Ke Hu, Zhehuai Chen, Jagadeesh Balam, and Boris Ginsburg

Krishna C. Puvvada, He Huang, Kunal Dhawan, Nithin Palaparthi, Ke Hu, Zhehuai Chen, Jagadeesh Balam, and Boris Ginsburg. Less is more: Accurate speech recognition & translation without web-scale data. InProc. Interspeech 2024, pages 1798–1802, 2024. doi:10.21437/Interspeech.2024-1058

work page doi:10.21437/interspeech.2024-1058 2024

[25] [25]

PDMX: A large-scale public domain MusicXML dataset for symbolic music processing

Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, and Julian McAuley. PDMX: A large-scale public domain MusicXML dataset for symbolic music processing. InICASSP 2025 – IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2025.doi:10.1109/ICASSP49660.2025.10890217. 11

work page doi:10.1109/icassp49660.2025.10890217 2025

[26] [26]

ASAP: A dataset of aligned scores and performances for piano transcription

Francesco Foscarin, Andrew McLeod, Philippe Rigaux, Florent Jacquemard, and Masahiko Sakai. ASAP: A dataset of aligned scores and performances for piano transcription. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2020.doi:10.5281/zenodo.4245490

work page doi:10.5281/zenodo.4245490 2020

[27] [27]

This time with feeling: Learning expressive musical performance.Neural Computing and Applications, 32(4): 955–967, 2020.doi:10.1007/s00521-018-3758-9

Sageev Oore, Ian Simon, Sander Dieleman, Douglas Eck, and Karen Simonyan. This time with feeling: Learning expressive musical performance.Neural Computing and Applications, 32(4): 955–967, 2020.doi:10.1007/s00521-018-3758-9

work page doi:10.1007/s00521-018-3758-9 2020

[28] [28]

Byte pair encoding is suboptimal for language model pretraining

Kaj Bostrom and Greg Durrett. Byte pair encoding is suboptimal for language model pretraining. InFindings of the Association for Computational Linguistics: EMNLP 2020, pages 4617–4624, 2020.doi:10.18653/v1/2020.findings-emnlp.414

work page doi:10.18653/v1/2020.findings-emnlp.414 2020

[29] [29]

Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates.arXiv preprint arXiv:1804.10959, 2018.doi:10.18653/v1/P18-1007

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/p18-1007 2018

[30] [30]

DawDreamer: Bridging the gap between digital audio workstations and Python interfaces

David Braun et al. DawDreamer: Bridging the gap between digital audio workstations and Python interfaces. InProceedings of the International Society for Music Information Retrieval Conference (Late-Breaking Demo), 2021

2021

[31] [31]

VirtuosoNet: A hierarchical RNN-based system for modeling expressive piano performance

Dasaem Jeong, Taegyun Kwon, Yoojin Kim, Kyogu Lee, and Juhan Nam. VirtuosoNet: A hierarchical RNN-based system for modeling expressive piano performance. InISMIR, pages 908–915, 2019.doi:10.5281/zenodo.3527962

work page doi:10.5281/zenodo.3527962 2019

[32] [32]

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, et al. Molmo2: Open weights and data for vision-language models with video understanding and grounding.arXiv preprint arXiv:2601.10611, 2026.doi:10.48550/arXiv.2601.10611

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.10611 2026

[33] [33]

ATEPP: A dataset of automatically transcribed expressive piano perfor- mance

Huan Zhang, Jingjing Tang, Syed Rifat Mahmud Rafee, George Fazekas, Simon Dixon, and Geraint A Wiggins. ATEPP: A dataset of automatically transcribed expressive piano perfor- mance. InProceedings of the 23rd International Society for Music Information Retrieval Con- ference (ISMIR), pages 446–453, Bengaluru, India, 2022.doi:10.5281/zenodo.7342764

work page doi:10.5281/zenodo.7342764 2022

[34] [34]

Sheet music benchmark: Standardized optical music recogni- tion evaluation.arXiv preprint arXiv:2506.10488, 2025

Juan C Martinez-Sevilla, Joan Cerveto-Serrano, Noelia Luna, Greg Chapman, Craig Sapp, David Rizo, and Jorge Calvo-Zaragoza. Sheet music benchmark: Standardized optical music recogni- tion evaluation.arXiv preprint arXiv:2506.10488, 2025. doi:10.48550/arXiv.2506.10488

work page doi:10.48550/arxiv.2506.10488 2025

[35] [35]

End-to-end real-world polyphonic piano audio-to-score tran- scription with hierarchical decoding

Wei Zeng, Xian He, and Ye Wang. End-to-end real-world polyphonic piano audio-to-score tran- scription with hierarchical decoding. InProceedings of the Thirty-Third International Joint Con- ference on Artificial Intelligence, pages 7788–7795, 2024. doi:10.24963/ijcai.2024/862

work page doi:10.24963/ijcai.2024/862 2024

[36] [36]

On the resemblance and containment of documents

Andrei Z Broder. On the resemblance and containment of documents. InPro- ceedings of the Compression and Complexity of Sequences, pages 21–29. IEEE, 1997. doi:10.1109/SEQUEN.1997.666900

work page doi:10.1109/sequen.1997.666900 1997

[37] [37]

The TREC spoken document retrieval track: A success story

John S Garofolo, Cedric GP Auzanne, and Ellen M V oorhees. The TREC spoken document retrieval track: A success story. InRIAO, volume 6, pages 1–20, 2000

2000

[38] [38]

Oguz Araz, Dmitry Bogdanov, and Yuki Mitsufuji

Joan Serrà, R. Oguz Araz, Dmitry Bogdanov, and Yuki Mitsufuji. Supervised con- trastive learning from weakly-labeled audio segments for musical version matching. InProceedings of the International Conference on Machine Learning (ICML), 2025. doi:10.48550/arXiv.2502.16936

work page doi:10.48550/arxiv.2502.16936 2025

[39] [39]

CoverHunter: Cover song identification with refined attention and alignments

Feng Liu, Deyi Tuo, Yinan Xu, and Xintong Han. CoverHunter: Cover song identification with refined attention and alignments. InIEEE International Conference on Multimedia and Expo (ICME), 2023.doi:10.1109/ICME55011.2023.00189

work page doi:10.1109/icme55011.2023.00189 2023

[40] [40]

Distinctive neurophysiological correlates of sound onset and offset perception in humans.The Journal of Physiology, 2025

Fatima Ali, Gabriela Bury, Adèle Simon, and Jennifer F Linden. Distinctive neurophysiological correlates of sound onset and offset perception in humans.The Journal of Physiology, 2025

2025

[41] [41]

Example input audio. Produce only ABC notation

Samuel A Mehr. Core systems of music perception.Trends in Cognitive Sciences, 2025. doi:10.1016/j.tics.2025.05.013. 12 A Gemini Evaluation We evaluate Gemini 3.1 Pro to examine how a frontier audio-language model, without task-specific training, approaches piano score transcription when prompted to produce conventional notation formats. A.1 Output Format ...

work page doi:10.1016/j.tics.2025.05.013 2025