
arXiv: 1907.02698 · v1 · pith:DHBJS7S3new · submitted 2019-07-05 · 💻 cs.SD · cs.LG · eess.AS

A Bi-directional Transformer for Musical Chord Recognition

Pith reviewed 2026-05-25 02:22 UTC · model grok-4.3

keywords chord recognition · transformer · self-attention · music information retrieval · bi-directional model · attention maps · long-term dependencies · audio sequence

The pith

A bi-directional Transformer recognizes musical chords by focusing self-attention on relevant segments in a single training phase.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a bi-directional Transformer model for chord recognition from audio sequences. Prior CNN and RNN approaches either fail to capture long-term dependencies or require separate models for context. The proposed model uses self-attention to focus on chord regions and trains end-to-end in one phase while matching competitive performance levels. Attention map analysis shows the model divides chord segments with an adaptive receptive field and pulls in essential information from any distance in the sequence.

Core claim

The bi-directional Transformer for chord recognition (BTC) applies a self-attention mechanism to focus on certain regions of chords within audio sequences. Its training consists of a single phase and yields competitive performance. Attention map analysis reveals that the model divides segments of chords by means of the adaptive receptive field of attention and captures long-term dependencies by making use of essential information regardless of distance.

What carries the argument

Self-attention mechanism in the bi-directional Transformer, which supplies an adaptive receptive field for segmenting chords and for accessing distant context.
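
The mechanism named above can be illustrated with a small sketch. The pure-Python example below computes scaled dot-product self-attention twice, once with a forward mask and once with a backward mask, and concatenates the two outputs per frame. This is only one plausible reading of "bi-directional". The function names, the identity projections (queries, keys and values all equal the input) and the toy input are assumptions for illustration, not details taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(q, k, v, direction):
    """Scaled dot-product attention over a sequence of frames.

    direction="forward":  frame t may attend to frames 0..t.
    direction="backward": frame t may attend to frames t..T-1.
    Returns (outputs, weights), where weights[t][s] is the attention
    that frame t places on frame s.
    """
    T, d = len(q), len(q[0])
    outputs, weights = [], []
    for t in range(T):
        row = []
        for s in range(T):
            visible = s <= t if direction == "forward" else s >= t
            if visible:
                dot = sum(a * b for a, b in zip(q[t], k[s]))
                row.append(dot / math.sqrt(d))
            else:
                row.append(float("-inf"))  # masked position
        w = softmax(row)
        weights.append(w)
        outputs.append(
            [sum(w[s] * v[s][j] for s in range(T)) for j in range(len(v[0]))]
        )
    return outputs, weights


def bidirectional_attention(x):
    """Concatenate forward- and backward-masked attention outputs per frame.

    Assumption: learned projections are omitted, so q = k = v = x.
    """
    fwd, w_fwd = masked_attention(x, x, x, "forward")
    bwd, w_bwd = masked_attention(x, x, x, "backward")
    return [f + b for f, b in zip(fwd, bwd)], w_fwd, w_bwd


# Toy input: three frames with two features each (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w_fwd, w_bwd = bidirectional_attention(frames)
```

Because no window size appears anywhere in the computation, each frame's effective receptive field is set by the learned weights rather than by a fixed kernel width, which is the property the summary calls adaptive.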

If this is right

  • Chord recognition can be performed with single-phase training and no auxiliary model.
  • The model divides chord segments adaptively through attention rather than fixed receptive fields.
  • Long-term dependencies are utilized irrespective of temporal distance in the audio sequence.
  • Attention maps provide direct visualization of how context is selected during recognition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attention structure could be tested on related sequence-labeling tasks such as key detection or downbeat tracking.
  • If the adaptive field generalizes, it may reduce the need for hand-tuned window sizes in other audio classification problems.
  • Performance on very long recordings would offer a direct test of the long-range dependency claim.

Load-bearing premise

The self-attention mechanism can be trained in one phase to focus on relevant chord regions and capture long-term dependencies without the limitations of CNNs or RNNs.

What would settle it

An attention-map visualization on held-out audio where the model produces no clear chord-segment boundaries or systematically ignores distant but harmonically relevant frames would falsify the central claim.
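
One way to make "clear chord-segment boundaries" checkable is to measure how sharply the attention distribution changes between adjacent frames. The sketch below is an illustrative heuristic, not the paper's analysis: the function names, the total-variation measure and the 0.5 threshold are all assumptions.

```python
def row_change(weights):
    """Total-variation distance between attention rows of adjacent frames.

    weights[t][s] is the attention frame t places on frame s; each row is
    assumed to sum to 1. The result has one entry per adjacent frame pair.
    """
    return [
        0.5 * sum(abs(a - b) for a, b in zip(weights[t], weights[t + 1]))
        for t in range(len(weights) - 1)
    ]


def candidate_boundaries(weights, threshold=0.5):
    """Indices t where attention shifts sharply between frame t-1 and frame t.

    The threshold is a hypothetical value and would need tuning on real maps.
    """
    return [t + 1 for t, c in enumerate(row_change(weights)) if c > threshold]


# Hypothetical block-diagonal map: frames 0-1 attend to each other,
# frames 2-3 attend to each other, so a boundary is expected at frame 2.
toy_map = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]
boundaries = candidate_boundaries(toy_map)
```

On held-out audio, a map for which such a measure stays flat across annotated chord changes would be the kind of evidence described above.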

Original abstract

Chord recognition is an important task since chords are highly abstract and descriptive features of music. For effective chord recognition, it is essential to utilize relevant context in audio sequence. While various machine learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been employed for the task, most of them have limitations in capturing long-term dependency or require training of an additional model. In this work, we utilize a self-attention mechanism for chord recognition to focus on certain regions of chords. Training of the proposed bi-directional Transformer for chord recognition (BTC) consists of a single phase while showing competitive performance. Through an attention map analysis, we have visualized how attention was performed. It turns out that the model was able to divide segments of chords by utilizing adaptive receptive field of the attention mechanism. Furthermore, it was observed that the model was able to effectively capture long-term dependencies, making use of essential information regardless of distance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes a bi-directional Transformer model (BTC) for musical chord recognition that employs self-attention to focus on relevant regions of audio sequences. It claims that single-phase training yields competitive performance, and attention map analysis shows the model divides chord segments via an adaptive receptive field while capturing long-term dependencies irrespective of distance.

Significance. If supported by quantitative evidence, the work would be significant for introducing Transformer architectures to chord recognition in music information retrieval, addressing limitations of CNNs and RNNs in long-range context without multi-phase training or auxiliary models. The attention visualizations constitute a positive contribution by providing qualitative insight into model behavior.

major comments (2)
  1. [Abstract] The claim of 'competitive performance' is presented without numerical results, error bars, dataset specifications, or baseline comparisons, yet it is load-bearing for the central empirical contribution.
  2. [Attention map analysis] As described in the abstract, the assertions that the model divides chord segments and captures long-term dependencies rest exclusively on qualitative observations of attention maps, without supporting quantitative metrics such as attention-distance distributions, ablation studies on sequence length, or comparisons to RNN receptive fields.
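
The attention-distance statistics the referee asks for could be computed along the following lines. This is a minimal sketch under stated assumptions: the attention map is given as a row-normalized list of lists, and both function names are hypothetical rather than taken from the paper or any library.

```python
def mean_attention_distance(weights):
    """Expected distance |t - s| (in frames) under each frame's attention row.

    weights[t][s] is the attention frame t places on frame s.
    """
    return [
        sum(w * abs(t - s) for s, w in enumerate(row))
        for t, row in enumerate(weights)
    ]


def distance_histogram(weights):
    """Share of total attention mass at each absolute distance 0..T-1."""
    T = len(weights)
    hist = [0.0] * T
    for t, row in enumerate(weights):
        for s, w in enumerate(row):
            hist[abs(t - s)] += w
    total = sum(hist)
    return [h / total for h in hist]


# Hypothetical 3-frame map: the middle frame splits its attention
# between its two neighbours, the outer frames attend to themselves.
toy_map = [
    [1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0],
]
per_frame = mean_attention_distance(toy_map)
hist = distance_histogram(toy_map)
```

A histogram with substantial mass at large distances would give the long-range claim a number to report, and the same statistic could be compared across sequence lengths.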

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below, proposing revisions where they strengthen the manuscript without altering its core contributions.

Point-by-point responses
  1. Referee: [Abstract] The claim of 'competitive performance' is presented without numerical results, error bars, dataset specifications, or baseline comparisons, yet it is load-bearing for the central empirical contribution.

    Authors: We agree that the abstract would be strengthened by including key numerical results. Although the full manuscript reports these details in the experimental section (including dataset names, accuracy metrics, and baseline comparisons), the abstract summarizes without them. In the revised version we will add concise performance figures, dataset specifications, and baseline references to the abstract to make the claim self-contained.
    revision: yes

  2. Referee: [Attention map analysis] As described in the abstract, the assertions that the model divides chord segments and captures long-term dependencies rest exclusively on qualitative observations of attention maps, without supporting quantitative metrics such as attention-distance distributions, ablation studies on sequence length, or comparisons to RNN receptive fields.

    Authors: The attention analysis is presented as qualitative observations from the visualized maps, which illustrate the adaptive receptive field and long-range attention patterns. We acknowledge that quantitative metrics (e.g., attention-distance statistics or length ablations) are absent. The manuscript does not claim quantitative superiority of the attention mechanism, only that the visualizations show the described behavior. We will revise the relevant sections to explicitly label the findings as qualitative and to avoid overstatement, but we do not plan to introduce new quantitative experiments at this stage.
    revision: partial

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper describes an empirical application of a standard bi-directional Transformer architecture to the chord recognition task, followed by training on audio data and qualitative attention-map visualization. No mathematical derivations, equations, or parameter-fitting steps are presented that reduce any claimed result to an input by construction. No self-citations are invoked as load-bearing premises for uniqueness theorems or ansatzes. The reported outcomes (competitive performance and observed attention behavior) are external to any definitional equivalence and rest on experimental results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim rests on standard assumptions of deep learning (attention can be optimized via gradient descent on audio data) plus the domain assumption that chord labels in the training set are reliable and representative. Beyond the usual model weights, the only free parameters are the tuned hyperparameters listed below; no new entities are postulated.

free parameters (1)
  • model hyperparameters
    Number of layers, attention heads, and learning rate are chosen or tuned on data to achieve the reported performance.
axioms (1)
  • domain assumption: Self-attention can be trained end-to-end to capture relevant long-range context in audio sequences
    Invoked in the description of BTC training and attention analysis.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    We utilize a self-attention mechanism for chord recognition to focus on certain regions of chords. Training of the proposed bi-directional Transformer for chord recognition (BTC) consists of a single phase while showing competitive performance. Through an attention map analysis, it turns out that the model was able to divide segments of chords by utilizing adaptive receptive field of the attention mechanism. Furthermore, it was observed that the model was able to effectively capture long-term dependencies, making use of essential information regardless of distance.

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat unclear

    Relation between the paper passage and the cited Recognition theorem.

    The structure of BTC is shown in Figure 1. The model consists of bi-directional multi-head self-attentions, position-wise convolutional blocks, a positional encoding, layer normalization, dropout and fully-connected layers.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 5 internal anchors

  1. [1]

    A BI-DIRECTIONAL TRANSFORMER FOR MUSICAL CHORD RECOGNITION

    INTRODUCTION The goal of chord recognition task is to output a sequence of time-synchronized chord labels when a raw audio recording of music is given as input. Chords are highly abstract and descriptive features of music that can be used for a variety of musical purposes, including automatic lead-sheet creation for musicians, cover song identificati...

  2. [2]

    capture long-term dependencies

  3. [3]

    RELATED WORK 2.1 Automatic Chord Recognition In the past, most automatic chord recognition systems were divided into three parts: feature extraction, pattern matching and chord sequence decoding. After applying transformation such as short-time Fourier transform or constant-q transform (CQT) to an input audio signal, features are extracted from the...

  4. [4]

    This context-dependent characteristic of the task is the motivation for applying the self-attention mechanism

    BI-DIRECTIONAL TRANSFORMER FOR CHORD RECOGNITION 3.1 Bi-directional Transformer Making use of appropriate surrounding frames is essential for successful chord recognition [7, 8]. This context-dependent characteristic of the task is the motivation for applying the self-attention mechanism. With some modification to the original Transformer architecture, ...

  5. [5]

    No chord

    EXPERIMENTS 4.1 Data and Preprocessing BTC and other baseline models were evaluated on the following datasets. A subset of 221 songs from Isophonics (http://isophonics.net/datasets): 171 songs by the Beatles, 12 songs by Carole King, 20 songs by Queen and 18 songs by Zweieck; Robbie Williams [12]: 65 songs by Robbie Williams; and a subset of 185 so...

  6. [6]

    To the best of our knowledge, this paper was the first attempt to apply Transformer to chord recognition

    CONCLUSION In this paper, we presented bi-directional Transformer for chord recognition (BTC). To the best of our knowledge, this paper was the first attempt to apply Transformer to chord recognition. The self-attention mechanism was appropriate for the task that attempts to capture long-term dependency by effectively exploring relevant sections. BTC h...

  7. [7]

    ACKNOWLEDGEMENTS This work was supported by Kakao and Kakao Brain corporations

  8. [8]

    L. J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint, arXiv:1607.06450, 2016

  9. [9]

    D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR), Conference Track Proc., San Diego, CA, USA, 2015

  10. [10]

    L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966

  11. [11]

    J. P. Bello. Chord segmentation and recognition using em-trained hidden markov models. In Proc. of the 8th International Society for Music Information Retrieval Conference (ISMIR), pages 239–244, Vienna, Austria, 2007

  12. [12]

    N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Audio chord recognition with recurrent neural networks. In Proc. of the 14th International Society for Music Information Retrieval Conference (ISMIR), pages 335–340, Curitiba, Brazil, 2013

  13. [13]

    T. Cho. Improved Techniques for Automatic Chord Recognition from Music Audio Signals. PhD thesis, New York University, 2014

  14. [14]

    T. Cho and J. P. Bello. A feature smoothing method for chord recognition using recurrence plots. In Proc. of the 12th International Society for Music Information Retrieval Conference (ISMIR), pages 651–656, Miami, Florida, USA, 2011

  15. [15]

    T. Cho and J. P. Bello. On the relative importance of individual components of chord recognition systems. IEEE/ACM Trans. Audio, Speech & Language Processing, 27(2):477–492, 2014

  16. [16]

    Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

    J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint, arXiv:1412.3555, 2014

  17. [17]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint, arXiv:1810.04805, 2018

  18. [18]

    L. Euler. Tentamen novae theoriae musicae ex certissimis harmoniae principiis dilucide expositae. ex typographia Academiae scientiarum, 1739

  19. [19]

    B. Di Giorgi, M. Zanoni, A. Sarti, and S. Tubaro. Automatic chord recognition based on the probabilistic modeling of diatonic modal harmony. In Proc. of the 8th International Workshop on Multidimensional Systems, Erlangen, Germany, 2013

  20. [20]

    Music Transformer

    C.-Z. Anna Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, C. Hawthorne, A. M. Dai, M. D. Hoffman, and D. Eck. Music transformer: Generating music with long-term structure. arXiv preprint, arXiv:1809.04281, 2018

  21. [21]

    E. J. Humphrey and J. P. Bello. Rethinking automatic chord recognition with convolutional neural networks. In 11th International Conference on Machine Learning and Applications (ICMLA), pages 357–362, Boca Raton, FL, USA, 2012

  22. [22]

    E. J. Humphrey and J. P. Bello. Four timely insights on automatic chord estimation. In Proc. of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 673–679, Málaga, Spain, 2015

  23. [23]

    E. J. Humphrey, T. Cho, and J. P. Bello. Learning a robust tonnetz-space transform for automatic chord recognition. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 453–456, Kyoto, Japan, 2012

  24. [24]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), Conference Track Proc., San Diego, CA, USA, 2015

  25. [25]

    F. Korzeniowski, D. R. W. Sears, and G. Widmer. A large-scale study of language models for chord prediction. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 91–95, Calgary, AB, Canada, 2018

  26. [26]

    F. Korzeniowski and G. Widmer. Feature learning for chord recognition: The deep chroma extractor. In Proc. of the 17th International Society for Music Information Retrieval Conference (ISMIR), pages 37–43, New York City, USA, 2016

  27. [27]

    F. Korzeniowski and G. Widmer. A fully convolutional deep auditory model for musical chord recognition. In 26th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, Vietri sul Mare, Salerno, Italy, 2016

  28. [28]

    F. Korzeniowski and G. Widmer. Improved chord recognition by combining duration and harmonic language models. In Proc. of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 10–17, Paris, France, 2018

  29. [29]

    J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the 18th International Conference on Machine Learning (ICML 2001), Williams College, pages 282–289, Williamstown, MA, USA, 2001

  30. [30]

    Y. LeCun, Y. Bengio, and G. E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015

  31. [31]

    K. Lee. Identifying cover songs from audio using harmonic representation. MIREX 2006, pages 36–38, 2006

  32. [32]

    B. McFee and J. P. Bello. Structured training for large-vocabulary chord recognition. In Proc. of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 188–194, Suzhou, China, 2017

  33. [33]

    J. Pauwels, F. Kaiser, and G. Peeters. Combining harmony-based and novelty-based approaches for structural segmentation. In Proc. of the 14th International Society for Music Information Retrieval Conference (ISMIR), pages 601–606, Curitiba, Brazil, 2013

  34. [34]

    C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis. Mir_eval: A transparent implementation of common mir metrics. In Proc. of the 15th International Society for Music Information Retrieval Conference (ISMIR), pages 367–372, Taipei, Taiwan, 2014

  35. [35]

    A. Sheh and D. P. W. Ellis. Chord segmentation and recognition using em-trained hidden markov models. In Proc. of the 4th International Society for Music Information Retrieval Conference (ISMIR), Baltimore, Maryland, USA, 2003

  36. [36]

    K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations (ICLR), Conference Track Proc., San Diego, CA, USA, 2015

  37. [37]

    N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014

  38. [38]

    S. Sigtia, N. Boulanger-Lewandowski, and S. Dixon. Audio chord recognition with a hybrid recurrent neural network. In Proc. of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 127–133, Málaga, Spain, 2015

  39. [39]

    Y. Ueda, Y. Uchiyama, T. Nishimoto, N. Ono, and S. Sagayama. HMM-based approach for automatic chord detection using refined acoustic features. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 5518–5521, Dallas, Texas, USA, 2010

  40. [40]

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 6000–6010, Long Beach, CA, USA, 2017

  41. [41]

    Y. Wu and W. Li. Automatic audio chord recognition with midi-trained deep feature and BLSTM-CRF sequence decoding model. IEEE/ACM Trans. Audio, Speech & Language Processing, 27(2):355–366, 2019

  42. [42]

    X. Zhou and A. Lerch. Chord detection using deep learning. In Proc. of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 52–58, Málaga, Spain, 2015