Music Transcription with (Almost) No Supervision

Chao Wan; Daniel C. Lin; John Thickstun; Justin Lovelace; Kilian Q. Weinberger; Saebyeol Shin; Zhenzhen Liu

A cycle-consistent translation framework anchored by minimal paired data unlocks large gains from abundant unpaired audio and scores for music transcription.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-30 14:43 UTC pith:2BMRBNRC

Load-bearing objection: Cycle-consistent translation with minimal paired data gives practical gains on transcription from unpaired audio, but the many-to-one asymmetry leaves room for degenerate mappings.

arXiv 2605.24193 v1 · pith:2BMRBNRC · submitted 2026-05-22 · cs.SD cs.LG

Saebyeol Shin, Chao Wan, Zhenzhen Liu, Justin Lovelace, Daniel C. Lin, Kilian Q. Weinberger, John Thickstun

classification cs.SD cs.LG

keywords music transcription · unpaired data · cycle-consistent translation · semi-supervised learning · audio-to-score · limited supervision · symbolic music

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper starts from the observation that competitive music transcription models require large amounts of paired audio and score data, which is costly to collect. Using a cycle-consistent translation system with only a small amount of paired data as an anchor, the model can exploit large pools of unpaired audio recordings and symbolic scores. The reported result is surprisingly large accuracy gains, especially when paired supervision is scarce: unpaired audio contributes more than unpaired scores, and adding unlabeled audio from a new instrument improves transcription of that instrument with no paired data for it. A sympathetic reader would care because paired data is limited by cost, alignment difficulty, and copyright, while unpaired data exists in large quantities online.

Core claim

We adopt a cycle-consistent translation framework in which a small amount of paired data acts as a minimal anchor, unlocking the full potential of the unpaired pool. We find that: unpaired data yields surprisingly large gains, especially under limited supervision; unpaired audio contributes more than unpaired scores; incorporating unlabeled audio from a new instrument during training improves transcription for that instrument without any paired supervision. Together, these results suggest that scaling unpaired data offers a practical path toward high-quality transcription for instruments where labeled data remains scarce.

What carries the argument

cycle-consistent translation framework that anchors audio-to-score mappings with minimal paired data to exploit unpaired examples from both domains
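
As a rough sketch of what such a framework typically optimizes, a generic cycle-consistent objective with a paired anchor can be written as below. All symbols are assumptions introduced here for illustration and are not taken from the paper: $f$ maps audio to score, $g$ maps score to audio, $\mathcal{P}$ is the small paired set, $\mathcal{A}$ and $\mathcal{S}$ are the unpaired audio and score pools, $d_A$ and $d_S$ are reconstruction distances in each domain, and $\lambda_A$, $\lambda_S$ are weights. The paper's actual losses may differ, in particular because the Figure 2 caption describes a latent-space formulation.

```latex
\mathcal{L}(f, g) =
  \underbrace{\mathbb{E}_{(x, y) \sim \mathcal{P}}
    \big[ d_S(f(x), y) + d_A(g(y), x) \big]}_{\text{paired anchor}}
  + \lambda_A \, \underbrace{\mathbb{E}_{x \sim \mathcal{A}}
    \big[ d_A(g(f(x)), x) \big]}_{\text{audio cycle}}
  + \lambda_S \, \underbrace{\mathbb{E}_{y \sim \mathcal{S}}
    \big[ d_S(f(g(y)), y) \big]}_{\text{score cycle}}
```

Only the first term ties $f(x)$ to a ground-truth score; the two cycle terms constrain $f$ and $g$ to be mutually consistent, which is why the quality of the anchor matters.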

Load-bearing premise

The cycle-consistent translation can learn accurate audio-to-score mappings from unpaired data distributions without the cycle loss creating systematic errors or collapsed modes that produce incorrect transcriptions.

What would settle it

If adding unpaired data to the cycle-consistent model produces no improvement or a decrease in transcription accuracy on a held-out test set compared to training on paired data alone, the central claim would be falsified.
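
A minimal sketch of that comparison follows, assuming frame-level F1 on binary piano rolls as the metric (the Figure 5 caption mentions validation Frame F1; the paper's exact evaluation protocol is not given in this text). The function names and the toy arrays are hypothetical and only illustrate the check.

```python
import numpy as np

def frame_f1(pred_roll, ref_roll):
    """Frame-level F1 between two binary piano rolls of shape (frames, pitches)."""
    pred = np.asarray(pred_roll, dtype=bool)
    ref = np.asarray(ref_roll, dtype=bool)
    tp = np.logical_and(pred, ref).sum()
    fp = np.logical_and(pred, ~ref).sum()
    fn = np.logical_and(~pred, ref).sum()
    denom = 2 * tp + fp + fn
    # Assumed convention: if neither roll has any active frame, report 0.0.
    return float(2 * tp / denom) if denom > 0 else 0.0

def unpaired_data_helps(f1_paired_only, f1_with_unpaired):
    """The central claim predicts a strict improvement on held-out data."""
    return f1_with_unpaired > f1_paired_only

# Toy held-out reference and two hypothetical model outputs (made-up data).
reference = [[1, 0], [0, 1]]
paired_only_pred = [[1, 0], [1, 1]]    # one spurious active frame
with_unpaired_pred = [[1, 0], [0, 1]]  # matches the reference exactly

baseline = frame_f1(paired_only_pred, reference)
augmented = frame_f1(with_unpaired_pred, reference)
print(unpaired_data_helps(baseline, augmented))
```

In practice the same comparison would be repeated across paired-data budgets and random seeds, since a single run cannot distinguish a real gain from training noise.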

If this is right

Transcription accuracy improves substantially when unpaired data is added, with the largest relative gains occurring under limited paired supervision.
Unpaired audio contributes more to performance gains than unpaired symbolic scores.
Including unlabeled audio from a previously unseen instrument during training raises transcription quality for that instrument despite zero paired examples for it.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same minimal-anchor cycle-consistent approach could extend to other scarce-pairing domains such as speech recognition or image-to-text.
Very large-scale unpaired music collections might enable transcription models that cover many instruments and genres without further paired labeling.
This method could make high-quality transcription practical for rare instruments or historical recordings that lack any aligned paired data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Cycle-consistent translation with minimal paired data gives practical gains on transcription from unpaired audio, but the many-to-one asymmetry leaves room for degenerate mappings.

The main thing to know is that the paper takes cycle-consistent translation, anchors it with a small paired set, and reports that unpaired audio drives most of the improvement while also improving transcription of a new instrument that has no paired examples.

What is new is the concrete comparison showing unpaired audio helps more than unpaired scores, plus the cross-instrument result. The work does a clean job of stating the data-scarcity problem in MIR and showing that the unpaired pool can be put to use without needing full supervision.

The soft spot is the domain asymmetry the stress test flags. Audio-to-score is many-to-one, so a model can satisfy the cycle loss by mapping to average or modal patterns rather than recovering exact timing and notes. The abstract gives directional claims but no numbers, dataset sizes, or controls, so it is not possible to tell whether the experiments actually rule out collapse. If the full paper has ablations that test for this, the central claim is strengthened; if not, it rests on an unverified assumption.
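
The concern that cycle consistency alone does not pin down the correct mapping can be made concrete with a toy sketch. Everything here is an assumption for illustration and not the paper's model: both "audio" and "score" are represented as binary piano rolls, the hypothetical functions `transcribe` and `synthesize` are pure pitch shifts, and pitch wraps around so the shift is exactly invertible. The pitch-shift choice is motivated by the Figure 3 caption, which mentions pitch-shift failure modes in unpaired-only runs. Note that this shows a consistent relabeling, a simpler failure than collapse to modal patterns.

```python
import numpy as np

def transcribe(audio_roll, shift):
    """Hypothetical audio-to-score map: shifts every pitch by `shift` bins (wrapping)."""
    return np.roll(audio_roll, shift, axis=1)

def synthesize(score_roll, shift):
    """Hypothetical score-to-audio map: the exact inverse of `transcribe`."""
    return np.roll(score_roll, -shift, axis=1)

def cycle_loss(audio, score, shift):
    """Mean absolute reconstruction error summed over both cycle directions."""
    audio_cycle = np.abs(synthesize(transcribe(audio, shift), shift) - audio).mean()
    score_cycle = np.abs(transcribe(synthesize(score, shift), shift) - score).mean()
    return float(audio_cycle + score_cycle)

def paired_loss(audio, score, shift):
    """Mean absolute error of the transcription against a ground-truth score."""
    return float(np.abs(transcribe(audio, shift) - score).mean())

rng = np.random.default_rng(0)
# Unpaired pools: random sparse rolls with 64 frames and 12 pitch bins.
unpaired_audio = (rng.random((64, 12)) < 0.1).astype(float)
unpaired_score = (rng.random((64, 12)) < 0.1).astype(float)

# A tiny paired anchor: one sustained pitch whose correct transcription is itself.
paired_audio = np.zeros((4, 12))
paired_audio[:, 0] = 1.0
paired_score = paired_audio.copy()

for shift in (0, 3):  # shift=0 is the correct map, shift=3 is a wrong one
    print(
        f"shift={shift} "
        f"cycle={cycle_loss(unpaired_audio, unpaired_score, shift):.3f} "
        f"paired={paired_loss(paired_audio, paired_score, shift):.3f}"
    )
```

Both shifts reach zero cycle loss, so unpaired data cannot tell them apart in this toy setting; only the paired loss is nonzero for the wrong shift. Whether a small anchor is enough to rule out subtler degeneracies in a learned model is exactly what the review asks the experiments to show.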

This is for people working on music transcription or semi-supervised audio methods who need to stretch limited labels. A reader who wants a practical recipe for data-scarce instruments will get something usable from the approach. It is worth sending to peer review so the experimental details and the mapping quality can be checked directly.

Referee Report

2 major / 2 minor

Summary. The paper proposes a cycle-consistent translation framework for music transcription that anchors learning from large unpaired audio and score corpora using only a small paired dataset. It claims that unpaired data yields large gains (especially under limited supervision), that unpaired audio contributes more than unpaired scores, and that incorporating unlabeled audio from a new instrument improves transcription for that instrument with no paired supervision for it.

Significance. If the results hold under rigorous validation, the work would be significant for reducing dependence on scarce paired audio-score data in music transcription, enabling scaling via abundant unpaired resources. The reported differential value of audio versus scores and the new-instrument improvement obtained without paired supervision would be notable contributions if supported by appropriate controls and metrics.

major comments (2)

[Cycle-consistent translation framework] The central claim requires that cycle consistency on unpaired data, combined with a small paired anchor, produces accurate audio-to-score mappings rather than degenerate solutions. Given the domain asymmetry (audio-to-score is many-to-one while score-to-audio is one-to-many), cycle consistency could be satisfied by mappings that ignore timing details or collapse to modal patterns. The manuscript must include targeted analysis (e.g., in the method or results sections) showing that the learned transcription outputs recover specific note timings and avoid this degeneracy when paired data is minimal.
[Experiments and results] The experiments must demonstrate that the reported gains are not artifacts of the cycle loss or evaluation protocol. Specific controls are needed to isolate the contribution of unpaired audio versus unpaired scores and to confirm the new-instrument improvement occurs without any paired supervision for that instrument.

minor comments (2)

[Abstract] The abstract states directional findings without quantitative metrics, dataset sizes, or baseline comparisons; the full manuscript should ensure these are clearly reported in the results to allow verification of the claimed gains.
[Method] Clarify the exact architecture details, loss weighting between paired anchor and cycle terms, and any regularization used to prevent mode collapse in the translation models.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments highlighting the need for explicit validation against degeneracy and stronger experimental controls. We address each point below, clarifying existing results and committing to targeted additions in the revision.

Referee: [Cycle-consistent translation framework] The central claim requires that cycle consistency on unpaired data, combined with a small paired anchor, produces accurate audio-to-score mappings rather than degenerate solutions. Given the domain asymmetry (audio-to-score is many-to-one while score-to-audio is one-to-many), cycle consistency could be satisfied by mappings that ignore timing details or collapse to modal patterns. The manuscript must include targeted analysis (e.g., in the method or results sections) showing that the learned transcription outputs recover specific note timings and avoid this degeneracy when paired data is minimal.

Authors: We agree this is a substantive concern given the asymmetry. The manuscript reports that models trained with the cycle-consistent framework plus minimal paired data achieve note-level F1 scores competitive with fully supervised baselines on held-out test sets; such performance is incompatible with timing-ignoring or modal collapse. We will add a dedicated subsection with (i) quantitative timing-error histograms comparing outputs with/without the paired anchor and (ii) qualitative alignment visualizations on example excerpts, confirming recovery of specific onsets and offsets even at the lowest paired-data regimes. revision: yes
Referee: [Experiments and results] The experiments must demonstrate that the reported gains are not artifacts of the cycle loss or evaluation protocol. Specific controls are needed to isolate the contribution of unpaired audio versus unpaired scores and to confirm the new-instrument improvement occurs without any paired supervision for that instrument.

Authors: The current experiments already contain ablations that isolate unpaired audio versus unpaired scores (Table 3 and associated text), showing larger gains from audio. The new-instrument protocol explicitly trains with zero paired examples for the target instrument, using only its unpaired audio plus the anchor paired data from other instruments; evaluation uses a held-out paired test set for that instrument. We will expand the experimental section with an explicit zero-paired-supervision statement, an additional control removing the cycle loss entirely, and a table confirming the exact paired-sample count (zero) for the new instrument. revision: partial
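
The timing-error analysis promised in the first response could look roughly like the sketch below. This is an assumption about what such an analysis might compute, not the authors' code: notes are hypothetical `(onset_seconds, midi_pitch)` tuples, each reference note is matched greedily to the nearest estimated onset of the same pitch (not the one-to-one matching used by standard note-level metrics), and the bin edges are arbitrary.

```python
import numpy as np

def onset_errors(ref_notes, est_notes):
    """Signed onset error (estimate minus reference, in seconds) per reference note.

    Reference notes whose pitch never appears in the estimate are skipped.
    """
    errors = []
    for ref_onset, pitch in ref_notes:
        candidates = [onset for onset, p in est_notes if p == pitch]
        if candidates:
            nearest = min(candidates, key=lambda onset: abs(onset - ref_onset))
            errors.append(nearest - ref_onset)
    return np.array(errors)

# Made-up notes for illustration only.
ref = [(0.00, 60), (0.50, 64), (1.00, 67)]
est = [(0.02, 60), (0.47, 64), (1.30, 67), (2.00, 72)]

errs = onset_errors(ref, est)
# Three bins: early by more than 50 ms, within 50 ms, late by more than 50 ms.
counts, edges = np.histogram(errs, bins=[-0.5, -0.05, 0.05, 0.5])
print(counts)
```

A model that had collapsed to timing-agnostic outputs would be expected to put much of its mass in the outer bins, so comparing these histograms with and without the paired anchor would speak directly to the first major comment.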

Circularity Check

0 steps flagged

No significant circularity; claims are empirical outcomes

The paper adopts a cycle-consistent translation framework (standard in unpaired translation) and reports measured gains from adding unpaired audio/scores to a small paired anchor. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the provided text. The central results (unpaired audio contributes more, new-instrument gains without paired data) are presented as experimental findings, not quantities forced by construction or prior author theorems. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that cycle consistency supplies a usable training signal across audio and symbolic music domains. No free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: Cycle consistency provides a sufficient signal to learn cross-domain mappings from unpaired data.
The framework adopted in the paper relies on this to unlock unpaired data.

pith-pipeline@v0.9.1-grok · 5667 in / 1241 out tokens · 56671 ms · 2026-06-30T14:43:29.138695+00:00 · methodology

Original abstract

Competitive music transcription models require large amounts of paired audio-score data, which is scarce due to collection costs, alignment difficulty, and copyright restrictions. Meanwhile, vast quantities of unpaired audio recordings and symbolic scores are freely available but have gone unused. We adopt a cycle-consistent translation framework in which a small amount of paired data acts as a minimal anchor, unlocking the full potential of the unpaired pool. We find that: unpaired data yields surprisingly large gains, especially under limited supervision; unpaired audio contributes more than unpaired scores; incorporating unlabeled audio from a new instrument during training improves transcription for that instrument without any paired supervision. Together, these results suggest that scaling unpaired data offers a practical path toward high-quality transcription for instruments where labeled data remains scarce.

Figures

Figures reproduced from arXiv: 2605.24193 by Chao Wan, Daniel C. Lin, John Thickstun, Justin Lovelace, Kilian Q. Weinberger, Saebyeol Shin, Zhenzhen Liu.

**Figure 2.** Overview of the proposed semi-supervised latent-space transcription framework. Two ...

**Figure 3.** Pitch-shift failure modes from two independent unpaired-only training runs, where each ...

**Figure 4.** Paired + unpaired training is more robust ...

**Figure 5.** Validation Frame F1 during training under 1.6h of paired supervision, comparing the effect ...

