A Bi-directional Transformer for Musical Chord Recognition
Pith reviewed 2026-05-25 02:22 UTC · model grok-4.3
The pith
A bi-directional Transformer recognizes musical chords by focusing self-attention on relevant segments in a single training phase.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The bi-directional Transformer for chord recognition (BTC) applies a self-attention mechanism to focus on certain regions of chords within audio sequences. Its training consists of a single phase and yields competitive performance. Attention map analysis reveals that the model divides segments of chords by means of the adaptive receptive field of attention and captures long-term dependencies by making use of essential information regardless of distance.
What carries the argument
Self-attention mechanism in the bi-directional Transformer, which supplies an adaptive receptive field for segmenting chords and for accessing distant context.
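As a rough illustration of why self-attention is not tied to a fixed window, the sketch below computes unmasked scaled dot-product self-attention over a toy frame sequence. It is not BTC's layer: the function names are hypothetical, and queries, keys and values are assumed to be the raw frame vectors with no learned projections.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames):
    """Unmasked scaled dot-product self-attention.

    Assumption: queries, keys and values are the raw frame vectors
    (no learned projections), purely to keep the sketch short.
    Returns (outputs, weights); weights[i][j] is how much frame i
    attends to frame j.
    """
    d = len(frames[0])
    weights = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[c] for w, v in zip(row, frames)) for c in range(d)]
        for row in weights
    ]
    return outputs, weights

# Toy sequence: two frames of one pattern, one frame of another, then the first again.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(frames)
```

In this toy case frame 0 gives frame 3 the same weight as its immediate neighbour, frame 1, because the score depends on content rather than on temporal distance.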
If this is right
- Chord recognition can be performed with single-phase training and no auxiliary model.
- The model divides chord segments adaptively through attention rather than fixed receptive fields.
- Long-term dependencies are utilized irrespective of temporal distance in the audio sequence.
- Attention maps provide direct visualization of how context is selected during recognition.
Where Pith is reading between the lines
- The same attention structure could be tested on related sequence-labeling tasks such as key detection or downbeat tracking.
- If the adaptive field generalizes, it may reduce the need for hand-tuned window sizes in other audio classification problems.
- Performance on very long recordings would offer a direct test of the long-range dependency claim.
Load-bearing premise
The self-attention mechanism can be trained in one phase to focus on relevant chord regions and capture long-term dependencies without the limitations of CNNs or RNNs.
What would settle it
An attention-map visualization on held-out audio where the model produces no clear chord-segment boundaries or systematically ignores distant but harmonically relevant frames would falsify the central claim.
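One way such a check might be made numerical is sketched below: for each query frame, measure how much attention mass falls on frames that carry the same ground-truth chord label, near or far. The function name, the toy labels and the toy attention maps are all hypothetical, and this proxy is not a metric taken from the paper.

```python
def same_chord_attention_mass(weights, labels):
    """Average, over query frames, of the attention mass placed on
    frames that share the query frame's chord label.

    weights[i][j]: attention from frame i to frame j (rows sum to 1).
    labels[i]:     ground-truth chord label of frame i.
    """
    total = 0.0
    for i, row in enumerate(weights):
        total += sum(w for j, w in enumerate(row) if labels[j] == labels[i])
    return total / len(weights)

# Hypothetical held-out excerpt: two frames of C followed by two frames of G.
labels = ["C", "C", "G", "G"]

# An attention map that respects the chord boundary ...
focused = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]
# ... and one that ignores it entirely.
uniform = [[0.25] * 4 for _ in range(4)]

focused_score = same_chord_attention_mass(focused, labels)
uniform_score = same_chord_attention_mass(uniform, labels)
```

A score close to the uniform baseline on held-out audio would be the kind of result that counts against the segment-division claim.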
Original abstract
Chord recognition is an important task since chords are highly abstract and descriptive features of music. For effective chord recognition, it is essential to utilize relevant context in audio sequence. While various machine learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been employed for the task, most of them have limitations in capturing long-term dependency or require training of an additional model. In this work, we utilize a self-attention mechanism for chord recognition to focus on certain regions of chords. Training of the proposed bi-directional Transformer for chord recognition (BTC) consists of a single phase while showing competitive performance. Through an attention map analysis, we have visualized how attention was performed. It turns out that the model was able to divide segments of chords by utilizing adaptive receptive field of the attention mechanism. Furthermore, it was observed that the model was able to effectively capture long-term dependencies, making use of essential information regardless of distance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a bi-directional Transformer model (BTC) for musical chord recognition that employs self-attention to focus on relevant regions of audio sequences. It claims that single-phase training yields competitive performance, and attention map analysis shows the model divides chord segments via an adaptive receptive field while capturing long-term dependencies irrespective of distance.
Significance. If supported by quantitative evidence, the work would be significant for introducing Transformer architectures to chord recognition in music information retrieval, addressing limitations of CNNs and RNNs in long-range context without multi-phase training or auxiliary models. The attention visualizations constitute a positive contribution by providing qualitative insight into model behavior.
major comments (2)
- [Abstract] Abstract: the claim of 'competitive performance' is presented without any numerical results, error bars, dataset specifications, or baseline comparisons, which is load-bearing for the central empirical contribution.
- [Attention map analysis] Attention map analysis (as described in the abstract): the assertions that the model divides chord segments and captures long-term dependencies rest exclusively on qualitative observations of attention maps, without supporting quantitative metrics such as attention-distance distributions, ablation studies on sequence length, or comparisons to RNN receptive fields.
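The attention-distance statistics mentioned in the second comment could be computed along the following lines. This is a minimal sketch, assuming each attention map is available as a row-stochastic matrix; the function names and the two toy maps are hypothetical and nothing here comes from the manuscript.

```python
def mean_attention_distance(weights):
    """Expected temporal offset |i - j| under each row's attention
    distribution, averaged over all query frames."""
    per_frame = [
        sum(w * abs(i - j) for j, w in enumerate(row))
        for i, row in enumerate(weights)
    ]
    return sum(per_frame) / len(per_frame)

def distance_histogram(weights):
    """Share of total attention mass at each offset |i - j|."""
    n = len(weights)
    hist = [0.0] * n
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            hist[abs(i - j)] += w
    total = sum(hist)
    return [h / total for h in hist]

# Toy maps: purely local attention versus attention to the mirrored frame.
local = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
distant = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

local_mean = mean_attention_distance(local)
distant_mean = mean_attention_distance(distant)
distant_hist = distance_histogram(distant)
```

Reporting such a histogram per layer and per head, next to the fixed receptive field of a CNN baseline, would give the long-range claim a quantitative footing.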
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below, proposing revisions where they strengthen the manuscript without altering its core contributions.
-
Referee: [Abstract] Abstract: the claim of 'competitive performance' is presented without any numerical results, error bars, dataset specifications, or baseline comparisons, which is load-bearing for the central empirical contribution.
Authors: We agree that the abstract would be strengthened by including key numerical results. Although the full manuscript reports these details in the experimental section (including dataset names, accuracy metrics, and baseline comparisons), the abstract summarizes without them. In the revised version we will add concise performance figures, dataset specifications, and baseline references to the abstract to make the claim self-contained.
Revision: yes
-
Referee: [Attention map analysis] Attention map analysis (as described in the abstract): the assertions that the model divides chord segments and captures long-term dependencies rest exclusively on qualitative observations of attention maps, without supporting quantitative metrics such as attention-distance distributions, ablation studies on sequence length, or comparisons to RNN receptive fields.
Authors: The attention analysis is presented as qualitative observations from the visualized maps, which illustrate the adaptive receptive field and long-range attention patterns. We acknowledge that quantitative metrics (e.g., attention-distance statistics or length ablations) are absent. The manuscript does not claim quantitative superiority of the attention mechanism, only that the visualizations show the described behavior. We will revise the relevant sections to explicitly label the findings as qualitative and to avoid overstatement, but we do not plan to introduce new quantitative experiments at this stage.
Revision: partial
Circularity Check
No significant circularity
The paper describes an empirical application of a standard bi-directional Transformer architecture to the chord recognition task, followed by training on audio data and qualitative attention-map visualization. No mathematical derivations, equations, or parameter-fitting steps are presented that reduce any claimed result to an input by construction. No self-citations are invoked as load-bearing premises for uniqueness theorems or ansatzes. The reported outcomes (competitive performance and observed attention behavior) are external to any definitional equivalence and rest on experimental results.
Axiom & Free-Parameter Ledger
free parameters (1)
- model hyperparameters
axioms (1)
- domain assumption: Self-attention can be trained end-to-end to capture relevant long-range context in audio sequences
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
We utilize a self-attention mechanism for chord recognition to focus on certain regions of chords. Training of the proposed bi-directional Transformer for chord recognition (BTC) consists of a single phase while showing competitive performance. Through an attention map analysis, it turns out that the model was able to divide segments of chords by utilizing adaptive receptive field of the attention mechanism. Furthermore, it was observed that the model was able to effectively capture long-term dependencies, making use of essential information regardless of distance.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat (unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
The structure of BTC is shown in Figure 1. The model consists of bi-directional multi-head self-attentions, position-wise convolutional blocks, a positional encoding, layer normalization, dropout and fully-connected layers.
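The quoted passage does not spell out how the self-attention is made bi-directional. One plausible reading, sketched below purely as an assumption, is a pair of attention blocks with opposite directional masks whose outputs are later combined. The names `directional_mask` and `masked_softmax` and the labels "forward" and "backward" are hypothetical.

```python
import math

def directional_mask(n, direction):
    """mask[i][j] is True where query frame i may attend to key frame j.

    "forward":  frame i sees itself and earlier frames (j <= i).
    "backward": frame i sees itself and later frames  (j >= i).
    """
    if direction == "forward":
        return [[j <= i for j in range(n)] for i in range(n)]
    if direction == "backward":
        return [[j >= i for j in range(n)] for i in range(n)]
    raise ValueError("direction must be 'forward' or 'backward'")

def masked_softmax(scores, mask_row):
    """Softmax restricted to allowed positions; masked positions get weight 0."""
    allowed = [s for s, ok in zip(scores, mask_row) if ok]
    m = max(allowed)  # the diagonal is always allowed, so this is never empty
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
fwd = directional_mask(n, "forward")
bwd = directional_mask(n, "backward")

# With equal raw scores, the weights for frame 1 simply spread over the allowed frames.
scores = [0.0] * n
fwd_w = masked_softmax(scores, fwd[1])
bwd_w = masked_softmax(scores, bwd[1])
```

Under this reading each direction sees only one side of the context, and it is the combination of the two that gives every frame access to the whole sequence.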
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
A BI-DIRECTIONAL TRANSFORMER FOR MUSICAL CHORD RECOGNITION
INTRODUCTION The goal of chord recognition task is to output a sequence of time-synchronized chord labels when a raw audio recording of music is given as input. Chords are highly abstract and descriptive features of music that can be used for a variety of musical purposes, including automatic lead-sheet creation for musicians, cover song identificati...
work page 2019
-
[2]
capture long-term dependencies
-
[3]
RELATED WORK 2.1 Automatic Chord Recognition In the past, most automatic chord recognition systems were divided into three parts: feature extraction, pattern matching and chord sequence decoding. After applying transformation such as short-time Fourier transform or constant-q transform (CQT) to an input audio signal, features are extracted from the...
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[4]
BI-DIRECTIONAL TRANSFORMER FOR CHORD RECOGNITION 3.1 Bi-directional Transformer Making use of appropriate surrounding frames is essential for successful chord recognition [7, 8]. This context-dependent characteristic of the task is the motivation for applying the self-attention mechanism. With some modification to the original Transformer architecture, ...
-
[5]
EXPERIMENTS 4.1 Data and Preprocessing BTC and other baseline models were evaluated on the following datasets. A subset of 221 songs from Isophonics (http://isophonics.net/datasets): 171 songs by the Beatles, 12 songs by Carole King, 20 songs by Queen and 18 songs by Zweieck; Robbie Williams [12]: 65 songs by Robbie Williams; and a subset of 185 so...
work page 2048
-
[6]
CONCLUSION In this paper, we presented bi-directional Transformer for chord recognition (BTC). To the best of our knowledge, this paper was the first attempt to apply Transformer to chord recognition. The self-attention mechanism was appropriate for the task that attempts to capture long-term dependency by effectively exploring relevant sections. BTC h...
-
[7]
ACKNOWLEDGEMENTS This work was supported by Kakao and Kakao Brain corporations
-
[8]
L. J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint, arXiv:1607.06450, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[9]
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR), Conference Track Proc., San Diego, CA, USA, 2015
work page 2015
-
[10]
L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state markov chains. The annals of mathematical statistics, 37(6):1554–1563, 1966
work page 1966
-
[11]
J. P. Bello. Chord segmentation and recognition using em-trained hidden markov models. In Proc. of the 8th International Society for Music Information Retrieval Conference (ISMIR), pages 239–244, Vienna, Austria, 2007
work page 2007
-
[12]
N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Audio chord recognition with recurrent neural networks. In Proc. of the 14th International Society for Music Information Retrieval Conference (ISMIR), pages 335–340, Curitiba, Brazil, 2013
work page 2013
-
[13]
T. Cho. Improved Techniques for Automatic Chord Recognition from Music Audio Signals. PhD thesis, New York University, 2014
work page 2014
- [14]
- [15]
-
[16]
Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint, arXiv:1412.3555, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[17]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint, arXiv:1810.04805, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[18]
L. Euler. Tentamen novae theoriae musicae ex certissimis harmoniae principiis dilucide expositae. ex typographia Academiae scientiarum, 1739
-
[19]
B. Di Giorgi, M. Zanoni, A. Sarti, and S. Tubaro. Automatic chord recognition based on the probabilistic modeling of diatonic modal harmony. In Proc. of the 8th International Workshop on Multidimensional Systems, Erlangen, Germany, 2013
work page 2013
-
[20]
C.-Z. Anna Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, C. Hawthorne, A. M. Dai, M. D. Hoffman, and D. Eck. Music transformer: Generating music with long-term structure. arXiv preprint, arXiv:1809.04281, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[21]
E. J. Humphrey and J. P. Bello. Rethinking automatic chord recognition with convolutional neural networks. In 11th International Conference on Machine Learning and Applications (ICMLA), pages 357–362, Boca Raton, FL, USA, 2012
work page 2012
-
[22]
E. J. Humphrey and J. P. Bello. Four timely insights on automatic chord estimation. In Proc. of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 673–679, Málaga, Spain, 2015
work page 2015
-
[23]
E. J. Humphrey, T. Cho, and J. P. Bello. Learning a robust tonnetz-space transform for automatic chord recognition. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 453–456, Kyoto, Japan, 2012
work page 2012
-
[24]
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), Conference Track Proc., San Diego, CA, USA, 2015
work page 2015
-
[25]
F. Korzeniowski, D. R. W. Sears, and G. Widmer. A large-scale study of language models for chord prediction. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 91–95, Calgary, AB, Canada, 2018
work page 2018
-
[26]
F. Korzeniowski and G. Widmer. Feature learning for chord recognition: The deep chroma extractor. In Proc. of the 17th International Society for Music Information Retrieval Conference (ISMIR), pages 37–43, New York City, USA, 2016
work page 2016
-
[27]
F. Korzeniowski and G. Widmer. A fully convolutional deep auditory model for musical chord recognition. In 26th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, Vietri sul Mare, Salerno, Italy, 2016
work page 2016
-
[28]
F. Korzeniowski and G. Widmer. Improved chord recognition by combining duration and harmonic language models. In Proc. of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 10–17, Paris, France, 2018
work page 2018
-
[29]
J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the 18th International Conference on Machine Learning (ICML 2001), Williams College, pages 282–289, Williamstown, MA, USA, 2001
work page 2001
- [30]
-
[31]
K. Lee. Identifying cover songs from audio using harmonic representation. MIREX 2006, pages 36–38, 2006
work page 2006
-
[32]
B. McFee and J. P. Bello. Structured training for large-vocabulary chord recognition. In Proc. of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 188–194, Suzhou, China, 2017
work page 2017
-
[33]
J. Pauwels, F. Kaiser, and G. Peeters. Combining harmony-based and novelty-based approaches for structural segmentation. In Proc. of the 14th International Society for Music Information Retrieval Conference (ISMIR), pages 601–606, Curitiba, Brazil, 2013
work page 2013
-
[34]
C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis. Mir_eval: A transparent implementation of common mir metrics. In Proc. of the 15th International Society for Music Information Retrieval Conference (ISMIR), pages 367–372, Taipei, Taiwan, 2014
work page 2014
-
[35]
A. Sheh and D. P. W. Ellis. Chord segmentation and recognition using em-trained hidden markov models. In Proc. of the 4th International Society for Music Information Retrieval Conference (ISMIR), Baltimore, Maryland, USA, 2003
work page 2003
-
[36]
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations (ICLR), Conference Track Proc., San Diego, CA, USA, 2015
work page 2015
-
[37]
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014
work page 2014
-
[38]
Audio chord recognition with a hybrid recurrent neural network
S. Sigtia, N. Boulanger-Lewandowski, and S. Dixon. Audio chord recognition with a hybrid recurrent neural network. In Proc. of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 127–133, Málaga, Spain, 2015
work page 2015
-
[39]
Y. Ueda, Y. Uchiyama, T. Nishimoto, N. Ono, and S. Sagayama. Hmm-based approach for automatic chord detection using refined acoustic features. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 5518–5521, Dallas, Texas, USA, 2010
work page 2010
-
[40]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 6000–6010, Long Beach, CA, USA, 2017
work page 2017
- [41]
-
[42]
X. Zhou and A. Lerch. Chord detection using deep learning. In Proc. of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 52–58, Málaga, Spain, 2015
work page 2015