BERT-APC: A Reference-free Framework for Automatic Pitch Correction via Musical Context Inference

Injung Kim; Jinyoung Choi; Kihyun Na; Sungjae Kim

arXiv: 2511.20006 · v3 · submitted 2025-11-25 · 📡 eess.AS · cs.AI · cs.SD

Sungjae Kim, Kihyun Na, Jinyoung Choi, Injung Kim

Pith reviewed 2026-05-17 05:28 UTC · model grok-4.3

classification 📡 eess.AS · cs.AI · cs.SD

keywords automatic pitch correction · reference-free · music language model · singing voice · pitch estimation · BERT · audio processing · musical context

The pith

BERT-APC corrects singing pitch deviations without references by inferring intended notes from musical context with a language model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BERT-APC, a framework that aligns off-key notes in vocal recordings to intended pitches without any reference pitch or musical score. It first extracts the stable pitch from each note's steady region in the audio, then applies a repurposed music language model to predict the full intended pitch sequence using surrounding musical context. A final correction step adjusts errors while leaving intentional expressive bends intact. The system also trains with learnable augmentation that creates realistic detuning examples. Tests show better target note accuracy than recent transcription models on badly detuned inputs and higher listener ratings than Auto-Tune or Melodyne.
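The final correction step can be illustrated with a minimal sketch, assuming the correction is a constant per-note shift in the log-frequency domain that moves the estimated stationary pitch onto the predicted target note. The function names (`hz_to_midi`, `midi_to_hz`, `correct_note`) and the constant-shift rule are illustrative assumptions, not the paper's published algorithm.

```python
import math


def hz_to_midi(f_hz: float) -> float:
    """Convert a frequency in Hz to a fractional MIDI note number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(f_hz / 440.0)


def midi_to_hz(midi: float) -> float:
    """Convert a fractional MIDI note number back to Hz."""
    return 440.0 * 2.0 ** ((midi - 69.0) / 12.0)


def correct_note(f0_hz, stationary_midi, target_midi):
    """Shift every frame of one note by (target - stationary) semitones.

    Because the same shift is applied to all frames, vibrato and other
    within-note deviations keep their shape; only the note's centre moves.
    """
    shift = target_midi - stationary_midi
    return [midi_to_hz(hz_to_midi(f) + shift) for f in f0_hz]


# Hypothetical note sung flat around 430 Hz, with A4 (MIDI 69) as the target.
contour = [425.0, 430.0, 435.0]
corrected = correct_note(contour, hz_to_midi(430.0), 69.0)
```

Working in semitones rather than Hz keeps musical intervals inside the note unchanged, which is one plausible reading of "leaving intentional expressive bends intact".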

Core claim

BERT-APC is, to the authors' knowledge, the first automatic pitch correction model to use a music language model for reference-free correction with symbolic musical context. The pipeline combines a stationary pitch predictor that estimates the continuous pitch from stable note regions, a context-aware note pitch predictor built on the repurposed language model, and a note-level correction algorithm that fixes errors without removing expressive deviations. A learnable data augmentation strategy simulates realistic detuning patterns during training. On highly detuned samples the model exceeds the second-best transcription baseline, ROSVOT, by 10.49 percentage points in raw pitch accuracy, and in listening tests it receives a mean opinion score of 4.32 ± 0.15, above Auto-Tune (3.22 ± 0.18) and Melodyne (3.08 ± 0.18).

What carries the argument

The context-aware note pitch predictor, a repurposed music language model that infers the intended symbolic pitch sequence from detuned vocal input plus musical context.

If this is right

Pitch correction becomes practical for recordings that lack a clean reference track or score.
Expressive nuances remain because the correction step targets only errors and leaves intentional deviations untouched.
Robustness to real-world detuning increases through the learnable augmentation that generates varied detuning patterns.
Target note prediction accuracy rises substantially on challenging inputs compared with recent singing transcription models.
Listener-perceived quality improves over commercial tools while matching their ability to retain expressiveness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-time use in live performance would require latency measurements of the full pipeline including the language model.
Feeding the model additional accompaniment audio could strengthen context inference when vocals are sparse.
The same language-model approach might apply to related tasks such as automatic timing correction or dynamic adjustment.
Validation on non-Western scales or genres outside the training distribution would test how far the context inference generalizes.

Load-bearing premise

A repurposed music language model can reliably infer the intended symbolic pitch sequence from only the detuned vocal input and musical context without any reference pitch or score.

What would settle it

On a test set of highly detuned vocal performances with known ground-truth intended notes, if BERT-APC shows no improvement in raw pitch accuracy over simple pitch estimation or prior transcription models, the central claim would not hold.
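A minimal sketch of how such a check could be scored, assuming raw pitch accuracy is computed per note as the fraction of predictions within 50 cents (half a semitone) of the ground-truth pitch. The 50-cent tolerance follows the common melody-evaluation convention; whether the paper uses exactly this threshold and granularity is an assumption, and `raw_pitch_accuracy` is a hypothetical helper name.

```python
def raw_pitch_accuracy(predicted_midi, target_midi, tolerance=0.5):
    """Fraction of notes whose predicted pitch lies within `tolerance` semitones
    of the ground-truth pitch (0.5 semitone = 50 cents)."""
    if len(predicted_midi) != len(target_midi):
        raise ValueError("predicted and target sequences must have equal length")
    if not target_midi:
        raise ValueError("at least one note is required")
    hits = sum(
        1 for p, t in zip(predicted_midi, target_midi) if abs(p - t) <= tolerance
    )
    return hits / len(target_midi)


# Hypothetical four-note example: the third prediction is 60 cents off.
rpa = raw_pitch_accuracy([60.0, 62.4, 64.6, 65.0], [60, 62, 64, 65])
```

A difference in "percentage points" between two systems is then simply the difference of two such fractions multiplied by 100.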

Figures

Figures reproduced from arXiv: 2511.20006 by Injung Kim, Jinyoung Choi, Kihyun Na, Sungjae Kim.

**Figure 1.** Model architecture of BERT-APC. The system operates in three stages: note-level feature extraction, context-aware note pitch estimation, and note-level pitch correction. A concise step-by-step overview is provided in the blue box on the right.

**Figure 2.** Comparison of pitch estimation methods. Blue, orange, and green lines denote vocal, ground-truth, and estimated pitches, with purple ellipses marking large errors. The proposed method (a) successfully identifies perceptual pitch centers while avoiding distortions from transitional regions such as onsets and vibrato, unlike the baselines. In (a), the red bars visualize estimated stationarity weight $w_t$ (Eq. …

**Figure 3.** The architecture of the context-aware note pitch pre…

**Figure 4.** Histograms of note-wise pitch errors for the three subsets: (a) in-tune (10%), (b) moderately detuned (80%), and (c) …

**Figure 5.** Visualization of pitch correction results for a highly detuned sample. The green, blue, and orange lines represent …

**Figure 6.** The distribution of note-wise pitch errors for the moderately detuned dataset under three augmentation strategies. The …

Original abstract

Automatic Pitch Correction (APC) enhances vocal recordings by aligning pitch deviations with intended musical notes. However, existing APC systems either rely on reference pitches, which limits practical applicability, or employ simple pitch estimation algorithms that often fail to preserve expressiveness and naturalness. We propose BERT-APC, a reference-free APC framework that corrects pitch errors while maintaining the expressiveness and naturalness of vocal performances. In BERT-APC, a stationary pitch predictor first estimates the stationary pitch of each note from the detuned singing voice, where stationary pitch is the continuous pitch from the stable region of a note and approximates its perceived pitch. A context-aware note pitch predictor then infers the intended pitch sequence using a repurposed music language model that incorporates musical context. Finally, a note-level correction algorithm fixes pitch errors while preserving intentional deviations for emotional expression. We also introduce a learnable data augmentation strategy that improves robustness by simulating realistic detuning patterns. Compared to two recent singing voice transcription models, BERT-APC demonstrated superior target note pitch prediction, outperforming the second-best model, ROSVOT, by 10.49 percentage points on highly detuned samples in raw pitch accuracy. In the MOS test, BERT-APC achieved the highest quality rating of $4.32 \pm 0.15$, significantly higher than Auto-Tune ($3.22 \pm 0.18$) and Melodyne ($3.08 \pm 0.18$), while maintaining a comparable ability to preserve expressive nuances. To the best of our knowledge, this is the first APC model that leverages a music language model to achieve reference-free pitch correction with symbolic musical context. The corrected audio samples are available at https://joshua-1995.github.io/BERT-APC-Demo/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BERT-APC gives a practical reference-free pitch correction system using a repurposed music LM, with clear empirical gains, but the inference details and evaluation setup need closer inspection.

The main thing to know is that this paper builds a reference-free automatic pitch correction pipeline: stationary pitch estimation from the detuned input, followed by a music language model that predicts the intended symbolic note sequence from context, then a correction step that tries to keep expressive deviations. They add learnable augmentation to simulate realistic detuning during training. On the numbers they report, it beats ROSVOT by about 10 points in raw pitch accuracy on highly detuned samples and scores 4.32 MOS versus 3.22 and 3.08 for Auto-Tune and Melodyne while preserving nuance comparably. That combination of stationary estimation plus LM-based context inference looks like the actual novelty here, and the concrete listening-test results plus demo samples give it some grounding for an applied audio paper.

Referee Report

2 major / 2 minor

Summary. The paper proposes BERT-APC, a reference-free automatic pitch correction framework for singing voices. It first estimates stationary pitch per note from detuned input, then uses a repurposed music language model as a context-aware note pitch predictor to infer the intended symbolic pitch sequence, and finally applies a note-level correction algorithm that preserves expressive deviations. The authors report that BERT-APC outperforms recent singing voice transcription models (e.g., ROSVOT by 10.49 pp in raw pitch accuracy on highly detuned samples) and achieves higher MOS scores (4.32 ± 0.15) than Auto-Tune and Melodyne while maintaining expressiveness.

Significance. If the central empirical claims hold after detailed verification, the work would be significant as the first demonstration of a music LM for reference-free APC, potentially enabling more practical and natural vocal correction systems. The learnable data augmentation strategy for simulating detuning patterns is a constructive technical contribution that could be adopted more broadly.

major comments (2)

Methods section on the context-aware note pitch predictor: the manuscript provides no description of input tokenization for the stationary pitch estimates, the precise conditioning mechanism that supplies musical context to the repurposed music LM, or the training objective and target labels (e.g., whether ground-truth symbolic pitches derive from scores or clean recordings). This information is load-bearing for the claim that the model performs genuine reference-free inference rather than regressing to training-set correlations.
Experiments and evaluation: the reported 10.49 pp raw pitch accuracy gain on highly detuned samples and the MOS results (4.32 ± 0.15) are presented without data-split details, test-set size, number of listeners, or ablation studies that isolate the music LM component from the stationary predictor and correction algorithm. These omissions prevent verification that the superiority claims are robust rather than sensitive to post-hoc choices or evaluation biases.

minor comments (2)

Abstract: the phrase 'to the best of our knowledge' for novelty would be strengthened by a concise related-work paragraph that explicitly contrasts BERT-APC with prior uses of language models in music transcription or pitch tasks.
Figure and table captions: ensure all reported error bars and statistical comparisons are accompanied by the exact number of trials or listeners to improve reproducibility.
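To make the request about error bars concrete: a figure such as 4.32 ± 0.15 can be reproduced from raw listener ratings if the ± term is a 95% confidence half-width. That interpretation is an assumption, since the summary does not state what the interval denotes, and the normal-approximation formula below is one common choice rather than the authors' confirmed procedure.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of opinion scores.

    half_width = z * sample_std / sqrt(n), i.e. a normal-approximation
    confidence interval (z = 1.96 corresponds to roughly 95%).
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings to estimate spread")
    mean = statistics.fmean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(n)
    return mean, half_width


# Hypothetical ratings on a 1-5 scale from four listeners.
mean, half_width = mos_with_ci([4, 5, 4, 5])
```

Because the half-width shrinks with the square root of the number of ratings, the interval cannot be interpreted without knowing how many listeners and samples contributed, which is why the count matters for reproducibility.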

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. We address each major comment point by point below and will revise the manuscript accordingly to improve clarity and reproducibility.

Referee: Methods section on the context-aware note pitch predictor: the manuscript provides no description of input tokenization for the stationary pitch estimates, the precise conditioning mechanism that supplies musical context to the repurposed music LM, or the training objective and target labels (e.g., whether ground-truth symbolic pitches derive from scores or clean recordings). This information is load-bearing for the claim that the model performs genuine reference-free inference rather than regressing to training-set correlations.

Authors: We agree that these methodological details are essential for substantiating the reference-free claim and distinguishing context-aware inference from potential training-set correlations. In the revised manuscript we will expand the Methods section with a precise description of the input tokenization applied to the stationary pitch estimates, the exact conditioning mechanism that injects musical context into the repurposed music LM, and the training objective together with the provenance of the target labels (derived from clean recordings). These additions will make explicit that no reference pitch is supplied at inference time and that the LM component performs genuine musical-context inference.
revision: yes

Referee: Experiments and evaluation: the reported 10.49 pp raw pitch accuracy gain on highly detuned samples and the MOS results (4.32 ± 0.15) are presented without data-split details, test-set size, number of listeners, or ablation studies that isolate the music LM component from the stationary predictor and correction algorithm. These omissions prevent verification that the superiority claims are robust rather than sensitive to post-hoc choices or evaluation biases.

Authors: We acknowledge that the current presentation lacks sufficient experimental transparency. In the revised manuscript we will report the data-split protocol, the exact size of the test set, the number of listeners who participated in the MOS evaluation, and new ablation studies that isolate the contribution of the music LM from the stationary pitch predictor and the note-level correction algorithm. These additions will allow readers to assess the robustness of the reported gains.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ML framework with external metrics

The paper describes an engineering pipeline (stationary pitch predictor + repurposed music LM for context-aware note pitch prediction + note-level correction + learnable augmentation) and reports empirical results on held-out test sets using standard metrics (raw pitch accuracy, MOS). No derivation, equation, or 'prediction' is presented that reduces by construction to its own fitted inputs or to a self-citation chain. The central claims rest on comparative performance numbers against external baselines, which are falsifiable outside the paper's training procedure.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on the domain assumption that stationary pitch from stable note regions approximates perceived pitch and that a repurposed music LM can infer intended notes from context alone. No new physical entities are postulated. Free parameters include the weights of the fine-tuned music LM and the learnable augmentation parameters.

free parameters (2)

music LM fine-tuning hyperparameters
Weights and training schedule of the repurposed music language model are learned from data.
learnable data augmentation parameters
Parameters that simulate realistic detuning patterns are trained jointly.
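The form of the learnable augmentation is not specified in this summary. As a rough, non-learnable stand-in, the sketch below detunes a note sequence with random per-note offsets whose spread (`sigma_cents`) and note-to-note persistence (`drift`) are fixed by hand; in a learnable variant these would be the kind of quantities fitted from data. All names and default values are hypothetical.

```python
import random


def detune_notes(note_midi, sigma_cents=40.0, drift=0.5, seed=0):
    """Add a correlated random offset (in cents) to each note pitch.

    The offset follows a first-order autoregressive process, so a singer who
    is flat on one note tends to stay somewhat flat on the next one.
    """
    rng = random.Random(seed)
    offset_cents = 0.0
    detuned = []
    for pitch in note_midi:
        offset_cents = drift * offset_cents + rng.gauss(0.0, sigma_cents)
        detuned.append(pitch + offset_cents / 100.0)  # 100 cents per semitone
    return detuned


# Hypothetical four-note melody in MIDI note numbers.
melody = [60, 62, 64, 65]
noisy = detune_notes(melody)
```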

axioms (2)

domain assumption: Stationary pitch from the stable region of a note approximates its perceived pitch
Invoked in the description of the stationary pitch predictor stage.
domain assumption: A music language model can infer intended pitch sequences from detuned vocal context without reference
Central premise of the context-aware note pitch predictor.
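The first axiom can be made concrete with a small sketch, assuming the stationary pitch is a weighted average of the frame-level pitch in which stable frames receive larger weights. The paper's predictor is learned; the slope-based weighting below is a hand-crafted substitute used only for illustration, and `stationarity_weights`, `stationary_pitch`, and `sharpness` are hypothetical names.

```python
def stationarity_weights(pitch_midi, sharpness=4.0):
    """Heuristic weights: frames where the pitch changes quickly get low weight."""
    n = len(pitch_midi)
    weights = []
    for t in range(n):
        prev_pitch = pitch_midi[max(t - 1, 0)]
        next_pitch = pitch_midi[min(t + 1, n - 1)]
        slope = abs(next_pitch - prev_pitch) / 2.0  # semitones per frame
        weights.append(1.0 / (1.0 + sharpness * slope))
    return weights


def stationary_pitch(pitch_midi, weights):
    """Weighted mean of the frame-level pitch of one note."""
    if not pitch_midi:
        raise ValueError("note must contain at least one frame")
    return sum(w * p for w, p in zip(weights, pitch_midi)) / sum(weights)


# Hypothetical note with an onset glide that settles on MIDI 69.
note = [66.0, 67.5, 68.9, 69.0, 69.0, 69.0, 69.0]
weighted = stationary_pitch(note, stationarity_weights(note))
plain_mean = sum(note) / len(note)
```

In this example the onset glide drags the plain mean well below the settled pitch, while the weighted estimate stays closer to it, which is the behaviour the axiom relies on.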

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "a context-aware note pitch predictor then infers the intended pitch sequence using a repurposed music language model that incorporates musical context"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 2 internal anchors

[1] S. Wager, G. Tzanetakis, C.-i. Wang, and M. Kim, “Deep autotuner: A pitch correcting network for singing performances,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 246–250.
[2] J. Hai and M. Elhilali, “Diff-pitcher: Diffusion-based singing voice pitch correction,” in 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2023, pp. 1–5.
[3] J. Wang, P. Li, X. Zhang, N. Cheng, and J. Xiao, “Contuner: Singing voice beautifying with pitch and expressiveness condition,” in 2024 International Joint Conference on Neural Networks (IJCNN). IEEE, 2024, pp. 1–6.
[4] X. Zhuang, H. Yu, W. Zhao, T. Jiang, and P. Hu, “Karatuner: Towards end-to-end natural pitch correction for singing voice in karaoke,” in Interspeech 2022, 2022, pp. 4262–4266.
[5] Y.-J. Luo, M.-T. Chen, T.-S. Chi, and L. Su, “Singing voice correction using canonical time warping,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 156–160.
[6] J. Liu, C. Li, Y. Ren, Z. Zhu, and Z. Zhao, “Learning the beauty in songs: Neural singing voice beautifier,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 7970–798...
[7] Antares Audio Technologies, “Auto-tune pro 11,” 2024, version 11. Available at: https://www.antarestech.com/
[8] Celemony Software GmbH, “Melodyne 5,” 2020. Available at: https://www.celemony.com/en/melodyne
[9] J.-Y. Hsu and L. Su, “VOCANO: A note transcription framework for singing voice in polyphonic music,” in Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR), 2021, pp. 293–300. [Online]. Available: https://archives.ismir.net/ismir2021/paper/000035.pdf
[10] S. Yong, L. Su, and J. Nam, “A phoneme-informed neural network model for note-level singing transcription,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[11] J. Park, K. Choi, S. Oh, L. Kim, and J. Park, “Note-level singing melody transcription with transformers,” Intelligent Data Analysis, vol. 27, no. 6, pp. 1853–1871, 2023.
[12] R. Li, Y. Zhang, Y. Wang, Z. Hong, R. Huang, and Z. Zhao, “Robust singing voice transcription serves synthesis,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 9751–9766. [Online]. Available: https://aclanthol...
[13] L. Kim, S. Jeon, W. Heo, and J. Park, “Note-level singing melody transcription for time-aligned musical score generation,” IEEE Transactions on Audio, Speech and Language Processing, 2025.
[14] W. Guo, Y. Zhang, C. Pan, Z. Zhu, R. Li, Z. Chen, W. Xu, F. Wu, and Z. Zhao, “STARS: A unified framework for singing transcription, alignment, and refined style annotation,” in Findings of the Association for Computational Linguistics: ACL 2025. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 15 081–15 093. [Online]. Available: ...
[15] M. Zeng, X. Tan, R. Wang, Z. Ju, T. Qin, and T.-Y. Liu, “MusicBERT: Symbolic music understanding with large-scale pre-training,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds. Online: Association for Computational Linguistics, Aug. 2021, pp. 791–800. [Online]. Available: https:...
[16] Y.-H. Chou, I.-C. Chen, C.-J. Chang, J. Ching, and Y.-H. Yang, “Bert-like pre-training for symbolic piano music classification tasks,” Journal of Creative Music Systems, vol. 8, no. 1, pp. 1–19, 2024.
[17] X. Liang, Z. Zhao, W. Zeng, Y. He, F. He, Y. Wang, and C. Gao, “Pianobart: Symbolic piano music generation and understanding with large-scale pre-training,” in 2024 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2024, pp. 1–6.
[18] Y.-H. Chou, I.-C. Chen, J. Ching, C.-J. Chang, and Y.-H. Yang, “Midibert-piano: Large-scale pre-training for symbolic music classification tasks,” Journal of Creative Music Systems, vol. 8, no. 1, 2024.
[19] Z. Zhao, “Let network decide what to learn: Symbolic music understanding model based on large-scale adversarial pre-training,” in Proceedings of the 2025 International Conference on Multimedia Retrieval, 2025, pp. 2128–2132.
[20] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[21] B. C. Moore, An introduction to the psychology of hearing. Brill, 2012.
[22] S. Böck, F. Korzeniowski, J. Schlüter, F. Krebs, and G. Widmer, “Madmom: A new python audio and music signal processing library,” in Proceedings of the 24th ACM international conference on Multimedia, 2016, pp. 1174–1178.
[23] R. Shashidhar, D. Aishwarya, B. Shruthi, M. Shashank et al., “Enhancing singing performances: Novel method for automatic vocal pitch correction,” in 2023 4th International Conference on Smart Electronics and Communication (ICOSEC). IEEE, 2023, pp. 1095–1102.
[24] Y. Jadoul, B. Thompson, and B. de Boer, “Introducing parselmouth: A python interface to praat,” Journal of Phonetics, vol. 71, pp. 1–15, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0095447017301389
[25] National Information Society Agency (NIA), “Ai-hub guide vocal dataset,” https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=473, 2020, accessed: July 24, 2024.
[26] National Information Society Agency (NIA), “Ai-hub multi-singer singing voice dataset,” https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=465, 2020, accessed: July 24, 2024.
[27] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer,” arXiv preprint arXiv:2004.05150, 2020.
[28] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[29] P. Larrouy-Maestri, D. Magis, and D. Morsomme, “The evaluation of vocal pitch accuracy: The case of operatic singing voices,” Music Perception: An Interdisciplinary Journal, vol. 32, no. 1, pp. 1–10, 2014.
[30] M. Berkowska and S. Dalla Bella, “Uncovering phenotypes of poor-pitch singing: The sung performance battery (spb),” Frontiers in Psychology, vol. 4, p. 714, 2013.
[31] B. H. Repp, “Melodic intonation, psychoacoustics, and the violin,” 1997.
[32] R. M. V. Besouw, J. S. Brereton, and D. M. Howard, “Range of tuning for tones with and without vibrato,” Music Perception, vol. 26, no. 2, pp. 145–155, 2008.
[33] S. Hutchins, C. Roquet, and I. Peretz, “The vocal generosity effect: How bad can your singing be?” Music Perception, vol. 30, no. 2, pp. 147–159, 2012.
[34] A. Vurma and J. Ross, “Production and perception of musical intervals,” Music Perception, vol. 23, no. 4, pp. 331–344, 2006.
[35] S. D. Bella, “Defining poor-pitch singing: A problem of measurement and sensitivity,” Music Perception: An Interdisciplinary Journal, vol. 32, no. 3, pp. 272–282, 2015.
[36] J. Salamon, E. Gómez, D. P. Ellis, and G. Richard, “Melody extraction from polyphonic music signals: Approaches, applications, and challenges,” IEEE Signal Processing Magazine, vol. 31, no. 2, pp. 118–134, 2014.
[37] R. Likert, “A technique for the measurement of attitudes,” Archives of Psychology, 1932.

Sungjae Kim received the B.S. and M.S. degree in Computer Science and Electrical Engineering (CSEE) at Handong Global University. He is currently a Ph.D student in CSEE at Handong Global...

[1] [1]

Deep autotuner: A pitch correcting network for singing performances,

S. Wager, G. Tzanetakis, C.-i. Wang, and M. Kim, “Deep autotuner: A pitch correcting network for singing performances,” inICASSP 2020- 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 246–250

work page 2020

[2] [2]

Diff-pitcher: Diffusion-based singing voice pitch correction,

J. Hai and M. Elhilali, “Diff-pitcher: Diffusion-based singing voice pitch correction,” in2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2023, pp. 1–5

work page 2023

[3] [3]

Contuner: Singing voice beautifying with pitch and expressiveness condition,

J. Wang, P. Li, X. Zhang, N. Cheng, and J. Xiao, “Contuner: Singing voice beautifying with pitch and expressiveness condition,” in2024 International Joint Conference on Neural Networks (IJCNN). IEEE, 2024, pp. 1–6

work page 2024

[4] [4]

Karatuner: Towards end-to-end natural pitch correction for singing voice in karaoke,

X. Zhuang, H. Yu, W. Zhao, T. Jiang, and P. Hu, “Karatuner: Towards end-to-end natural pitch correction for singing voice in karaoke,” in Interspeech 2022, 2022, pp. 4262–4266

work page 2022

[5] [5]

Singing voice correction using canonical time warping,

Y .-J. Luo, M.-T. Chen, T.-S. Chi, and L. Su, “Singing voice correction using canonical time warping,” in2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 156–160

work page 2018

[6] [6]

Learning the beauty in songs: Neural singing voice beautifier,

J. Liu, C. Li, Y . Ren, Z. Zhu, and Z. Zhao, “Learning the beauty in songs: Neural singing voice beautifier,” inProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds. Dublin, Ireland: Association for Computational Linguistics, May 2022, pp. 7970–798...

work page 2022

[7] [7]

Auto-tune pro 11,

Antares Audio Technologies, “Auto-tune pro 11,” 2024, version 11. Available at: https://www.antarestech.com/

work page 2024

[8] [8]

Melodyne 5,

Celemony Software GmbH, “Melodyne 5,” 2020, available at: https: //www.celemony.com/en/melodyne

work page 2020

[9] [9]

VOCANO: A note transcription framework for singing voice in polyphonic music,

J.-Y . Hsu and L. Su, “VOCANO: A note transcription framework for singing voice in polyphonic music,” inProceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR), 2021, pp. 293–300. [Online]. Available: https://archives.ismir. net/ismir2021/paper/000035.pdf

work page 2021

[10] [10]

A phoneme-informed neural network model for note-level singing transcription,

S. Yong, L. Su, and J. Nam, “A phoneme-informed neural network model for note-level singing transcription,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

work page 2023

[11] [11]

Note-level singing melody transcription with transformers,

J. Park, K. Choi, S. Oh, L. Kim, and J. Park, “Note-level singing melody transcription with transformers,”Intelligent Data Analysis, vol. 27, no. 6, pp. 1853–1871, 2023

work page 2023

[12] [12]

Robust singing voice transcription serves synthesis,

R. Li, Y . Zhang, Y . Wang, Z. Hong, R. Huang, and Z. Zhao, “Robust singing voice transcription serves synthesis,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 9751–9766. [Online]. Available: https://aclanthol...

work page 2024

[13] [13]

Note-level singing melody tran- scription for time-aligned musical score generation,

L. Kim, S. Jeon, W. Heo, and J. Park, “Note-level singing melody tran- scription for time-aligned musical score generation,”IEEE Transactions on Audio, Speech and Language Processing, 2025


[14]

STARS: A unified framework for singing transcription, alignment, and refined style annotation,

W. Guo, Y. Zhang, C. Pan, Z. Zhu, R. Li, Z. Chen, W. Xu, F. Wu, and Z. Zhao, “STARS: A unified framework for singing transcription, alignment, and refined style annotation,” in Findings of the Association for Computational Linguistics: ACL 2025. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 15 081–15 093. [Online]. Available: ...


[15]

MusicBERT: Symbolic music understanding with large-scale pre-training,

M. Zeng, X. Tan, R. Wang, Z. Ju, T. Qin, and T.-Y. Liu, “MusicBERT: Symbolic music understanding with large-scale pre-training,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds. Online: Association for Computational Linguistics, Aug. 2021, pp. 791–800. [Online]. Available: https:...


[16]

Bert-like pre-training for symbolic piano music classification tasks,

Y.-H. Chou, I.-C. Chen, C.-J. Chang, J. Ching, and Y.-H. Yang, “Bert-like pre-training for symbolic piano music classification tasks,” Journal of Creative Music Systems, vol. 8, no. 1, pp. 1–19, 2024


[17]

Pianobart: Symbolic piano music generation and understanding with large-scale pre-training,

X. Liang, Z. Zhao, W. Zeng, Y. He, F. He, Y. Wang, and C. Gao, “Pianobart: Symbolic piano music generation and understanding with large-scale pre-training,” in 2024 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2024, pp. 1–6


[18]

Midibert-piano: Large-scale pre-training for symbolic music classification tasks,

Y.-H. Chou, I.-C. Chen, J. Ching, C.-J. Chang, and Y.-H. Yang, “Midibert-piano: Large-scale pre-training for symbolic music classification tasks,” Journal of Creative Music Systems, vol. 8, no. 1, 2024


[19]

Let network decide what to learn: Symbolic music understanding model based on large-scale adversarial pre-training,

Z. Zhao, “Let network decide what to learn: Symbolic music understanding model based on large-scale adversarial pre-training,” in Proceedings of the 2025 International Conference on Multimedia Retrieval, 2025, pp. 2128–2132


[20]

Focal loss for dense object detection,

T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988


[21]

B. C. Moore, An introduction to the psychology of hearing. Brill, 2012


[22]

Madmom: A new python audio and music signal processing library,

S. Böck, F. Korzeniowski, J. Schlüter, F. Krebs, and G. Widmer, “Madmom: A new python audio and music signal processing library,” in Proceedings of the 24th ACM international conference on Multimedia, 2016, pp. 1174–1178


[23]

Enhancing singing performances: Novel method for automatic vocal pitch correction,

R. Shashidhar, D. Aishwarya, B. Shruthi, M. Shashank et al., “Enhancing singing performances: Novel method for automatic vocal pitch correction,” in 2023 4th International Conference on Smart Electronics and Communication (ICOSEC). IEEE, 2023, pp. 1095–1102


[24]

Introducing parselmouth: A python interface to praat,

Y. Jadoul, B. Thompson, and B. de Boer, “Introducing parselmouth: A python interface to praat,” Journal of Phonetics, vol. 71, pp. 1–15, 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0095447017301389


[25]

Ai-hub guide vocal dataset,

National Information Society Agency (NIA), “Ai-hub guide vocal dataset,” https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=473, 2020, accessed: July 24, 2024


[26]

Ai-hub multi-singer singing voice dataset,

——, “Ai-hub multi-singer singing voice dataset,” https://www.aihub.or.kr/aihubdata/data/view.do?dataSetSn=465, 2020, accessed: July 24, 2024


[27]

Longformer: The Long-Document Transformer

I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer,” arXiv preprint arXiv:2004.05150, 2020


[28]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017


[29]

The evaluation of vocal pitch accuracy: The case of operatic singing voices,

P. Larrouy-Maestri, D. Magis, and D. Morsomme, “The evaluation of vocal pitch accuracy: The case of operatic singing voices,” Music Perception: An Interdisciplinary Journal, vol. 32, no. 1, pp. 1–10, 2014


[30]

Uncovering phenotypes of poor-pitch singing: The sung performance battery (spb),

M. Berkowska and S. Dalla Bella, “Uncovering phenotypes of poor-pitch singing: The sung performance battery (spb),” Frontiers in Psychology, vol. 4, p. 714, 2013


[31]

Melodic intonation, psychoacoustics, and the violin,

B. H. Repp, “Melodic intonation, psychoacoustics, and the violin,” 1997


[32]

Range of tuning for tones with and without vibrato,

R. M. V. Besouw, J. S. Brereton, and D. M. Howard, “Range of tuning for tones with and without vibrato,” Music Perception, vol. 26, no. 2, pp. 145–155, 2008


[33]

The vocal generosity effect: How bad can your singing be?

S. Hutchins, C. Roquet, and I. Peretz, “The vocal generosity effect: How bad can your singing be?” Music Perception, vol. 30, no. 2, pp. 147–159, 2012


[34]

Production and perception of musical intervals,

A. Vurma and J. Ross, “Production and perception of musical intervals,” Music Perception, vol. 23, no. 4, pp. 331–344, 2006


[35]

Defining poor-pitch singing: A problem of measurement and sensitivity,

S. D. Bella, “Defining poor-pitch singing: A problem of measurement and sensitivity,” Music Perception: An Interdisciplinary Journal, vol. 32, no. 3, pp. 272–282, 2015


[36]

Melody extraction from polyphonic music signals: Approaches, applications, and challenges,

J. Salamon, E. Gómez, D. P. Ellis, and G. Richard, “Melody extraction from polyphonic music signals: Approaches, applications, and challenges,” IEEE Signal Processing Magazine, vol. 31, no. 2, pp. 118–134, 2014


[37]

A technique for the measurement of attitudes

R. Likert, “A technique for the measurement of attitudes.” Archives of Psychology, 1932.

VI. BIOGRAPHY SECTION

Sungjae Kim received the B.S. and M.S. degrees in Computer Science and Electrical Engineering (CSEE) at Handong Global University. He is currently a Ph.D. student in CSEE at Handong Global...
