A Bi-directional Transformer for Musical Chord Recognition

Dokyun Kim; Jonggwon Park; Jonghun Park; Kyoyun Choi; Sungwook Jeon

arXiv: 1907.02698 · v1 · pith:DHBJS7S3new · submitted 2019-07-05 · 💻 cs.SD · cs.LG · eess.AS

Pith reviewed 2026-05-25 02:22 UTC · model grok-4.3

classification 💻 cs.SD · cs.LG · eess.AS

keywords chord recognition · transformer · self-attention · music information retrieval · bi-directional model · attention maps · long-term dependencies · audio sequence

The pith

A bi-directional Transformer recognizes musical chords by focusing self-attention on relevant segments in a single training phase.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a bi-directional Transformer model for chord recognition from audio sequences. Prior CNN and RNN approaches either fail to capture long-term dependencies or require training an additional model. The proposed model uses self-attention to focus on chord regions and trains end-to-end in a single phase while achieving competitive performance. Attention map analysis shows the model divides chord segments with an adaptive receptive field and pulls in essential information from any distance in the sequence.

Core claim

The bi-directional Transformer for chord recognition (BTC) applies a self-attention mechanism to focus on certain regions of chords within audio sequences. Its training consists of a single phase and yields competitive performance. Attention map analysis reveals that the model divides segments of chords by means of the adaptive receptive field of attention and captures long-term dependencies by making use of essential information regardless of distance.

What carries the argument

Self-attention mechanism in the bi-directional Transformer, which supplies an adaptive receptive field for segmenting chords and for accessing distant context.
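
The mechanism named above can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a single head, identity projections in place of learned query, key and value matrices, and a hypothetical `direction` argument that contrasts unmasked attention with forward-only and backward-only masking. This summary does not specify how BTC combines directions, so the masking scheme here is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames, direction="both"):
    """Single-head scaled dot-product self-attention (illustrative only).

    frames: list of T feature vectors, one per audio frame.
    direction: "both" (no mask), "forward" (self and past frames only),
               or "backward" (self and future frames only). Hypothetical names.
    Returns (outputs, weights); weights[t][s] is how much frame t attends to s.
    """
    T, d = len(frames), len(frames[0])
    outputs, weights = [], []
    for t in range(T):
        scores = []
        for s in range(T):
            masked = (direction == "forward" and s > t) or (
                direction == "backward" and s < t
            )
            if masked:
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(frames[t], frames[s]))
                scores.append(dot / math.sqrt(d))
        w = softmax(scores)
        weights.append(w)
        # Each output is the attention-weighted average of all frame vectors.
        outputs.append([sum(w[s] * frames[s][k] for s in range(T)) for k in range(d)])
    return outputs, weights


# Toy input: two frames of one "chord" followed by two frames of another.
frames = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
_, attn = self_attention(frames, direction="both")
for row in attn:
    print([round(x, 3) for x in row])
```

In the toy input, each frame places almost all of its weight on frames that share its feature vector, which is the kind of block structure the attention-map discussion refers to. Real models learn the projections, so actual maps need not look this clean.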

If this is right

Chord recognition can be performed with single-phase training and no auxiliary model.
The model divides chord segments adaptively through attention rather than fixed receptive fields.
Long-term dependencies are utilized irrespective of temporal distance in the audio sequence.
Attention maps provide direct visualization of how context is selected during recognition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same attention structure could be tested on related sequence-labeling tasks such as key detection or downbeat tracking.
If the adaptive field generalizes, it may reduce the need for hand-tuned window sizes in other audio classification problems.
Performance on very long recordings would offer a direct test of the long-range dependency claim.

Load-bearing premise

The self-attention mechanism can be trained in one phase to focus on relevant chord regions and capture long-term dependencies without the limitations of CNNs or RNNs.

What would settle it

An attention-map visualization on held-out audio where the model produces no clear chord-segment boundaries or systematically ignores distant but harmonically relevant frames would falsify the central claim.

Original abstract

Chord recognition is an important task since chords are highly abstract and descriptive features of music. For effective chord recognition, it is essential to utilize relevant context in audio sequence. While various machine learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been employed for the task, most of them have limitations in capturing long-term dependency or require training of an additional model. In this work, we utilize a self-attention mechanism for chord recognition to focus on certain regions of chords. Training of the proposed bi-directional Transformer for chord recognition (BTC) consists of a single phase while showing competitive performance. Through an attention map analysis, we have visualized how attention was performed. It turns out that the model was able to divide segments of chords by utilizing adaptive receptive field of the attention mechanism. Furthermore, it was observed that the model was able to effectively capture long-term dependencies, making use of essential information regardless of distance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Straightforward 2019 application of bidirectional Transformer to chord recognition that adds attention visualization but rests on unshown quantitative claims.

The main takeaway is that this paper takes the self-attention mechanism and applies it to musical chord recognition in a single training phase, then inspects the attention maps to argue that the model picks up long-range context without the usual CNN or RNN crutches. That is the concrete contribution on offer. It is a timely move given when the Transformer paper appeared, and the visualization step is a reasonable way to show what the model is doing with chord segments. The abstract makes clear that the goal is to handle variable-length dependencies better than prior sequence models, and the attention analysis is presented as evidence that the receptive field adapts on its own. That part reads as honest empirical work rather than overclaim.

The soft spot is that the abstract gives no numbers, no dataset sizes, no baseline scores, and no error bars, so the claim of competitive performance cannot be checked from what is written here. The attention maps are qualitative only, which limits how much weight they can carry for the long-dependency argument. If the full paper supplies those missing comparisons and shows the gains are real rather than marginal, the work becomes more useful.

This is the kind of paper that matters to people already working on MIR sequence tasks who want to see how attention behaves on audio features. It is not going to shift core questions in either music theory or attention research. A serious editor should send it to review once the experimental details are confirmed, because the setup is clean enough to be worth referee time even if revisions are needed on the results section.

Referee Report

2 major / 0 minor

Summary. The paper proposes a bi-directional Transformer model (BTC) for musical chord recognition that employs self-attention to focus on relevant regions of audio sequences. It claims that single-phase training yields competitive performance, and attention map analysis shows the model divides chord segments via an adaptive receptive field while capturing long-term dependencies irrespective of distance.

Significance. If supported by quantitative evidence, the work would be significant for introducing Transformer architectures to chord recognition in music information retrieval, addressing limitations of CNNs and RNNs in long-range context without multi-phase training or auxiliary models. The attention visualizations constitute a positive contribution by providing qualitative insight into model behavior.

major comments (2)

[Abstract] The claim of 'competitive performance' is load-bearing for the central empirical contribution, yet it is presented without any numerical results, error bars, dataset specifications, or baseline comparisons.
[Attention map analysis] As described in the abstract, the assertions that the model divides chord segments and captures long-term dependencies rest exclusively on qualitative observations of attention maps, without supporting quantitative metrics such as attention-distance distributions, ablation studies on sequence length, or comparisons to RNN receptive fields.
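
One of the quantitative metrics the referee asks for, an attention-distance distribution, can be sketched as follows. This is an illustration of the idea, not an analysis from the paper: the function names are hypothetical, and the input is assumed to be a single square attention matrix whose rows each sum to 1.

```python
def mean_attention_distance(weights):
    """Average, over frames, of the expected |t - s| under each frame's attention.

    weights: T x T list of lists; weights[t][s] is the attention that
    frame t pays to frame s, with each row summing to 1.
    """
    per_frame = [
        sum(w * abs(t - s) for s, w in enumerate(row))
        for t, row in enumerate(weights)
    ]
    return sum(per_frame) / len(weights)


def distance_histogram(weights):
    """Share of total attention mass at each absolute distance 0..T-1."""
    T = len(weights)
    hist = [0.0] * T
    for t, row in enumerate(weights):
        for s, w in enumerate(row):
            hist[abs(t - s)] += w
    total = sum(hist)
    return [h / total for h in hist]


# A purely local map (identity) versus a map that spreads attention uniformly.
local = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uniform = [[1 / 3] * 3 for _ in range(3)]
print(mean_attention_distance(local), mean_attention_distance(uniform))
print(distance_histogram(uniform))
```

Reporting such statistics per layer and per head, alongside the same quantity for a fixed-window baseline, would turn the "regardless of distance" observation into something a reader can check.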

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below, proposing revisions where they strengthen the manuscript without altering its core contributions.

Referee: [Abstract] The claim of 'competitive performance' is load-bearing for the central empirical contribution, yet it is presented without any numerical results, error bars, dataset specifications, or baseline comparisons.

Authors: We agree that the abstract would be strengthened by including key numerical results. Although the full manuscript reports these details in the experimental section (including dataset names, accuracy metrics, and baseline comparisons), the abstract summarizes without them. In the revised version we will add concise performance figures, dataset specifications, and baseline references to the abstract to make the claim self-contained.
Revision: yes

Referee: [Attention map analysis] As described in the abstract, the assertions that the model divides chord segments and captures long-term dependencies rest exclusively on qualitative observations of attention maps, without supporting quantitative metrics such as attention-distance distributions, ablation studies on sequence length, or comparisons to RNN receptive fields.

Authors: The attention analysis is presented as qualitative observations from the visualized maps, which illustrate the adaptive receptive field and long-range attention patterns. We acknowledge that quantitative metrics (e.g., attention-distance statistics or length ablations) are absent. The manuscript does not claim quantitative superiority of the attention mechanism, only that the visualizations show the described behavior. We will revise the relevant sections to explicitly label the findings as qualitative and to avoid overstatement, but we do not plan to introduce new quantitative experiments at this stage.
Revision: partial
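
The "accuracy metrics" mentioned in the first response are not named in this summary. A common choice in chord recognition is a duration-weighted accuracy, often called weighted chord symbol recall, and the sketch below shows that idea under stated assumptions: the function name is hypothetical, labels are compared by exact string match, and no chord-vocabulary mapping (for example, reducing to major/minor) is applied.

```python
def weighted_chord_symbol_recall(segments):
    """Fraction of total duration on which the estimate matches the reference.

    segments: iterable of (duration_seconds, reference_label, estimated_label).
    Exact string equality is an assumption made for this sketch.
    """
    segments = list(segments)
    total = sum(duration for duration, _, _ in segments)
    if total <= 0:
        raise ValueError("total duration must be positive")
    correct = sum(duration for duration, ref, est in segments if ref == est)
    return correct / total


# Three correctly labeled seconds out of four.
example = [
    (2.0, "C:maj", "C:maj"),
    (1.0, "G:maj", "G:maj"),
    (1.0, "A:min", "C:maj"),
]
print(weighted_chord_symbol_recall(example))
```

A figure of this kind, reported per dataset next to the baselines, is what would make the "competitive performance" claim self-contained.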

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical application of a standard bi-directional Transformer architecture to the chord recognition task, followed by training on audio data and qualitative attention-map visualization. No mathematical derivations, equations, or parameter-fitting steps are presented that reduce any claimed result to an input by construction. No self-citations are invoked as load-bearing premises for uniqueness theorems or ansatzes. The reported outcomes (competitive performance and observed attention behavior) are external to any definitional equivalence and rest on experimental results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The claim rests on standard assumptions of deep learning (attention can be optimized via gradient descent on audio data) plus the domain assumption that chord labels in the training set are reliable and representative. No free parameters are explicitly introduced beyond typical model weights and the hyperparameters listed below; no new entities are postulated.

free parameters (1)

model hyperparameters
Number of layers, attention heads, and learning rate are chosen or tuned on data to achieve the reported performance.

axioms (1)

domain assumption: Self-attention can be trained end-to-end to capture relevant long-range context in audio sequences
Invoked in the description of BTC training and attention analysis.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We utilize a self-attention mechanism for chord recognition to focus on certain regions of chords. Training of the proposed bi-directional Transformer for chord recognition (BTC) consists of a single phase while showing competitive performance. Through an attention map analysis, it turns out that the model was able to divide segments of chords by utilizing adaptive receptive field of the attention mechanism. Furthermore, it was observed that the model was able to effectively capture long-term dependencies, making use of essential information regardless of distance.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

The structure of BTC is shown in Figure 1. The model consists of bi-directional multi-head self-attentions, position-wise convolutional blocks, a positional encoding, layer normalization, dropout and fully-connected layers.
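
Two of the components listed in the passage above can be sketched in a few lines. The passage does not say which positional encoding BTC uses, so the sinusoidal form from the original Transformer paper is assumed here. The layer normalization below omits the learned gain and bias. Function names are hypothetical.

```python
import math


def sinusoidal_positional_encoding(num_frames, dim):
    """Sinusoidal encoding: sine on even feature indices, cosine on odd ones.

    Assumed variant; the summary only says "a positional encoding".
    """
    encoding = []
    for pos in range(num_frames):
        row = []
        for k in range(dim):
            angle = pos / (10000 ** (2 * (k // 2) / dim))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        encoding.append(row)
    return encoding


def layer_norm(vector, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]


# Add position information to toy frame features, then normalize each frame.
features = [[0.5, 1.0, -0.5, 2.0], [0.1, 0.9, -0.4, 1.8]]
pe = sinusoidal_positional_encoding(len(features), len(features[0]))
encoded = [[f + p for f, p in zip(frame, pos)] for frame, pos in zip(features, pe)]
normalized = [layer_norm(frame) for frame in encoded]
print(normalized)
```

Because self-attention by itself is order-agnostic, some positional signal of this kind is what lets the model distinguish nearby frames from distant ones.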

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 5 internal anchors

[1] A BI-DIRECTIONAL TRANSFORMER FOR MUSICAL CHORD RECOGNITION. INTRODUCTION. The goal of chord recognition task is to output a sequence of time-synchronized chord labels when a raw audio recording of music is given as input. Chords are highly abstract and descriptive features of music that can be used for a variety of musical purposes, including automatic lead-sheet creation for musicians, cover song identificati...
[2] capture long-term dependencies
[3] RELATED WORK. 2.1 Automatic Chord Recognition. In the past, most automatic chord recognition systems were divided into three parts: feature extraction, pattern matching and chord sequence decoding. After applying transformation such as short-time Fourier transform or constant-q transform (CQT) to an input audio signal, features are extracted from the... (internal anchor)
[4] BI-DIRECTIONAL TRANSFORMER FOR CHORD RECOGNITION. 3.1 Bi-directional Transformer. Making use of appropriate surrounding frames is essential for successful chord recognition [7, 8]. This context-dependent characteristic of the task is the motivation for applying the self-attention mechanism. With some modification to the original Transformer architecture, ...
[5] No chord. EXPERIMENTS. 4.1 Data and Preprocessing. BTC and other baseline models were evaluated on the following datasets. A subset of 221 songs from Isophonics (http://isophonics.net/datasets): 171 songs by the Beatles, 12 songs by Carole King, 20 songs by Queen and 18 songs by Zweieck; Robbie Williams [12]: 65 songs by Robbie Williams; and a subset of 185 so...
[6] CONCLUSION. In this paper, we presented bi-directional Transformer for chord recognition (BTC). To the best of our knowledge, this paper was the first attempt to apply Transformer to chord recognition. The self-attention mechanism was appropriate for the task that attempts to capture long-term dependency by effectively exploring relevant sections. BTC h...
[7] ACKNOWLEDGEMENTS. This work was supported by Kakao and Kakao Brain corporations.
[8] L. J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint, arXiv:1607.06450, 2016. (internal anchor)
[9] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR), Conference Track Proc., San Diego, CA, USA, 2015.
[10] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state markov chains. The annals of mathematical statistics, 37(6):1554–1563, 1966.
[11] J. P. Bello. Chord segmentation and recognition using em-trained hidden markov models. In Proc. of the 8th International Society for Music Information Retrieval Conference (ISMIR), pages 239–244, Vienna, Austria, 2007.
[12] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Audio chord recognition with recurrent neural networks. In Proc. of the 14th International Society for Music Information Retrieval Conference (ISMIR), pages 335–340, Curitiba, Brazil, 2013.
[13] T. Cho. Improved Techniques for Automatic Chord Recognition from Music Audio Signals. PhD thesis, New York University, 2014.
[14] T. Cho and J. P. Bello. A feature smoothing method for chord recognition using recurrence plots. In Proc. of the 12th International Society for Music Information Retrieval Conference (ISMIR), pages 651–656, Miami, Florida, USA, 2011.
[15] T. Cho and J. P. Bello. On the relative importance of individual components of chord recognition systems. IEEE/ACM Trans. Audio, Speech & Language Processing, 27(2):477–492, 2014.
[16] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint, arXiv:1412.3555, 2014. (internal anchor)
[17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint, arXiv:1810.04805, 2018. (internal anchor)
[18] L. Euler. Tentamen novae theoriae musicae ex certissimis harmoniae principiis dilucide expositae. ex typographia Academiae scientiarum, 1739.
[19] B. Di Giorgi, M. Zanoni, A. Sarti, and S. Tubaro. Automatic chord recognition based on the probabilistic modeling of diatonic modal harmony. In Proc. of the 8th International Workshop on Multidimensional Systems, Erlangen, Germany, 2013.
[20] C.-Z. Anna Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, C. Hawthorne, A. M. Dai, M. D. Hoffman, and D. Eck. Music transformer: Generating music with long-term structure. arXiv preprint, arXiv:1809.04281, 2018. (internal anchor)
[21] E. J. Humphrey and J. P. Bello. Rethinking automatic chord recognition with convolutional neural networks. In 11th International Conference on Machine Learning and Applications (ICMLA), pages 357–362, Boca Raton, FL, USA, 2012.
[22] E. J. Humphrey and J. P. Bello. Four timely insights on automatic chord estimation. In Proc. of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 673–679, Málaga, Spain, 2015.
[23] E. J. Humphrey, T. Cho, and J. P. Bello. Learning a robust tonnetz-space transform for automatic chord recognition. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 453–456, Kyoto, Japan, 2012.
[24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), Conference Track Proc., San Diego, CA, USA, 2015.
[25] F. Korzeniowski, D. R. W. Sears, and G. Widmer. A large-scale study of language models for chord prediction. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 91–95, Calgary, AB, Canada, 2018.
[26] F. Korzeniowski and G. Widmer. Feature learning for chord recognition: The deep chroma extractor. In Proc. of the 17th International Society for Music Information Retrieval Conference (ISMIR), pages 37–43, New York City, USA, 2016.
[27] F. Korzeniowski and G. Widmer. A fully convolutional deep auditory model for musical chord recognition. In 26th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, Vietri sul Mare, Salerno, Italy, 2016.
[28] F. Korzeniowski and G. Widmer. Improved chord recognition by combining duration and harmonic language models. In Proc. of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 10–17, Paris, France, 2018.
[29] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the 18th International Conference on Machine Learning (ICML 2001), Williams College, pages 282–289, Williamstown, MA, USA, 2001.
[30] Y. LeCun, Y. Bengio, and G. E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[31] K. Lee. Identifying cover songs from audio using harmonic representation. MIREX 2006, pages 36–38, 2006.
[32] B. McFee and J. P. Bello. Structured training for large-vocabulary chord recognition. In Proc. of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 188–194, Suzhou, China, 2017.
[33] J. Pauwels, F. Kaiser, and G. Peeters. Combining harmony-based and novelty-based approaches for structural segmentation. In Proc. of the 14th International Society for Music Information Retrieval Conference (ISMIR), pages 601–606, Curitiba, Brazil, 2013.
[34] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis. Mir_eval: A transparent implementation of common mir metrics. In Proc. of the 15th International Society for Music Information Retrieval Conference (ISMIR), pages 367–372, Taipei, Taiwan, 2014.
[35] A. Sheh and D. P. W. Ellis. Chord segmentation and recognition using em-trained hidden markov models. In Proc. of the 4th International Society for Music Information Retrieval Conference (ISMIR), Baltimore, Maryland, USA, 2003.
[36] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations (ICLR), Conference Track Proc., San Diego, CA, USA, 2015.
[37] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
[38] S. Sigtia, N. Boulanger-Lewandowski, and S. Dixon. Audio chord recognition with a hybrid recurrent neural network. In Proc. of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 127–133, Málaga, Spain, 2015.
[39] Y. Ueda, Y. Uchiyama, T. Nishimoto, N. Ono, and S. Sagayama. Hmm-based approach for automatic chord detection using refined acoustic features. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 5518–5521, Dallas, Texas, USA, 2010.
[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 6000–6010, Long Beach, CA, USA, 2017.
[41] Y. Wu and W. Li. Automatic audio chord recognition with midi-trained deep feature and BLSTM-CRF sequence decoding model. IEEE/ACM Trans. Audio, Speech & Language Processing, 27(2):355–366, 2019.
[42] X. Zhou and A. Lerch. Chord detection using deep learning. In Proc. of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 52–58, Málaga, Spain, 2015.

[1] [1]

A BI- DIRECTIONAL TRANSFORMER FOR MUSICAL CHORD RECOG- NITION

INTRODUCTION The goal of chord recognition task is to output a se- quence of time-synchronized chord labels when a raw audio recording of music is given as input. Chords are highly abstract and descriptive features of music that can be used for a variety of musical purposes, including auto- matic lead-sheet creation for musicians, cover song iden- tiﬁcati...

work page 2019

[2] [2]

capture long-term dependencies

work page

[3] [3]

RELA TED WORK 2.1 Automatic Chord Recognition In the past, most automatic chord recognition systems were divided into three parts: feature extraction, pattern match- ing and chord sequence decoding. After applying transfor- mation such as short-time Fourier transform or constant-q transform (CQT) to an input audio signal, features are ex- tracted from the...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[4] [4]

This context- dependent characteristic of the task is the motivation for applying the self-attention mechanism

BI-DIRECTIONAL TRANSFORMER FOR CHORD RECOGNITION 3.1 Bi-directional Transformer Making use of appropriate surrounding frames is essen- tial for successful chord recognition [7, 8]. This context- dependent characteristic of the task is the motivation for applying the self-attention mechanism. With some modiﬁcation to the original Transformer architecture, ...

work page

[5] [5]

No chord

EXPERIMENTS 4.1 Data and Preprocessing BTC and other baseline models were evaluated on the following datasets. A subset of 221 songs from Isophon- ics 2 : 171 songs by the Beatles, 12 songs by Carole King, 20 songs by Queen and 18 songs by Zweieck; Robbie 2 http://isophonics.net/datasets Williams [12]: 65 songs by Robbie Williams; and a sub- set of 185 so...

work page 2048

[6] [6]

To the best of our knowledge, this paper was the ﬁrst attempt to apply Transformer to chord recognition

CONCLUSION In this paper, we presented bi-directional Transformer for chord recognition (BTC). To the best of our knowledge, this paper was the ﬁrst attempt to apply Transformer to chord recognition. The self-attention mechanism was ap- propriate for the task that attempts to capture long-term de- pendency by effectively exploring relevant sections. BTC h...

work page

[7] [7]

ACKNOWLEDGEMENTS This work was supported by Kakao and Kakao Brain cor- porations

work page

[8] [8]

L. J. Ba, R. Kiros, and G. E. Hinton. Layer normaliza- tion. arXiv preprint, arXiv:1607.06450, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[9] [9]

Bahdanau, K

D. Bahdanau, K. Cho, and Y . Bengio. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representa- tions (ICLR), Conference Track Proc., San Diego, CA, USA, 2015

work page 2015

[10] [10]

L. E. Baum and T. Petrie. Statistical inference for prob- abilistic functions of ﬁnite state markov chains. The annals of mathematical statistics , 37(6):1554–1563, 1966

work page 1966

[11] [11]

J. P. Bello. Chord segmentation and recognition using em-trained hidden markov models. In Proc. of the 8th International Society for Music Information Retrieval Conference (ISMIR), pages 239–244, Vienna, Austria, 2007

work page 2007

[12] [12]

Boulanger-Lewandowski, Y

N. Boulanger-Lewandowski, Y . Bengio, and P. Vin- cent. Audio chord recognition with recurrent neural networks. In Proc. of the 14th International Society for Music Information Retrieval Conference (ISMIR) , pages 335–340, Curitiba, Brazil, 2013

work page 2013

[13] [13]

T. Cho. Improved Techniques for Automatic Chord Recognition from Music Audio Signals . PhD thesis, New York University, 2014

work page 2014

[14] [14]

Cho and J

T. Cho and J. P. Bello. A feature smoothing method for chord recognition using recurrence plots. In Proc. of the 12th International Society for Music Information Retrieval Conference (ISMIR), pages 651–656, Miami, Florida, USA, 2011

work page 2011

[15] [15]

Cho and J

T. Cho and J. P. Bello. On the relative importance of individual components of chord recognition systems. IEEE/ACM Trans. Audio, Speech & Language Process- ing, 27(2):477–492, 2014

work page 2014

[16] [16]

Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling

J. Chung, Ç. Gülçehre, K. Cho, and Y . Bengio. Empir- ical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint, arXiv:1412.3555, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[17] [17]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transform- ers for language understanding. arXiv preprint , arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[18] [18]

L. Euler. Tentamen novae theoriae musicae ex certis- simis harmoniae principiis dilucide expositae . ex ty- pographia Academiae scientiarum, 1739

work page

[19] [19]

Di Giorgi, M

B. Di Giorgi, M. Zanoni, A. Sarti, and S. Tubaro. Au- tomatic chord recognition based on the probabilistic modeling of diatonic modal harmony. In Proc. of the 8th International Workshop on Multidimensional Sys- tems, Erlangen, Germany, 2013

work page 2013

[20] [20]

Music Transformer

C.-Z. Anna Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, C. Hawthorne, A. M. Dai, M. D. Hoffman, and D. Eck. Music transformer: Generating music with long-term structure. arXiv preprint, arXiv:1809.04281, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[21] [21]

E. J. Humphrey and J. P. Bello. Rethinking automatic chord recognition with convolutional neural networks. In 11th International Conference on Machine Learn- ing and Applications(ICMLA) , pages 357–362, Boca Raton, FL, USA, 2012

work page 2012

[22] [22]

E. J. Humphrey and J. P. Bello. Four timely insights on automatic chord estimation. In Proc. of the 16th Inter- national Society for Music Information Retrieval Con- ference (ISMIR), pages 673–679, Málaga, Spain, 2015

work page 2015

[23] [23]

E. J. Humphrey, T. Cho, and J. P. Bello. Learning a robust tonnetz-space transform for automatic chord recognition. In Proc. of the IEEE International Con- ference on Acoustics, Speech, and Signal Process- ing(ICASSP), pages 453–456, Kyoto, Japan, 2012

work page 2012

[24] [24]

D. P. Kingma and J. Ba. Adam: A method for stochas- tic optimization. In 3rd International Conference on Learning Representations (ICLR), Conference Track Proc., San Diego, CA, USA, 2015

work page 2015

[25] [25]

Korzeniowski, D

F. Korzeniowski, D. R. W. Sears, and G. Widmer. A large-scale study of language models for chord pre- diction. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing(ICASSP), pages 91–95, Calgary, AB, Canada, 2018

work page 2018

[26] [26]

Korzeniowski and G

F. Korzeniowski and G. Widmer. Feature learning for chord recognition: The deep chroma extractor. InProc. of the 17th International Society for Music Information Retrieval Conference (ISMIR), pages 37–43, New York City, USA, 2016

[27]

F. Korzeniowski and G. Widmer. A fully convolutional deep auditory model for musical chord recognition. In 26th IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6, Vietri sul Mare, Salerno, Italy, 2016

[28]

F. Korzeniowski and G. Widmer. Improved chord recognition by combining duration and harmonic language models. In Proc. of the 19th International Society for Music Information Retrieval Conference (ISMIR), pages 10–17, Paris, France, 2018

[29]

J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the 18th International Conference on Machine Learning (ICML 2001), Williams College, pages 282–289, Williamstown, MA, USA, 2001

[30]

Y. LeCun, Y. Bengio, and G. E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015

[31]

K. Lee. Identifying cover songs from audio using harmonic representation. MIREX 2006, pages 36–38, 2006

[32]

B. McFee and J. P. Bello. Structured training for large-vocabulary chord recognition. In Proc. of the 18th International Society for Music Information Retrieval Conference (ISMIR), pages 188–194, Suzhou, China, 2017

[33]

J. Pauwels, F. Kaiser, and G. Peeters. Combining harmony-based and novelty-based approaches for structural segmentation. In Proc. of the 14th International Society for Music Information Retrieval Conference (ISMIR), pages 601–606, Curitiba, Brazil, 2013

[34]

C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis. mir_eval: A transparent implementation of common MIR metrics. In Proc. of the 15th International Society for Music Information Retrieval Conference (ISMIR), pages 367–372, Taipei, Taiwan, 2014

[35]

A. Sheh and D. P. W. Ellis. Chord segmentation and recognition using EM-trained hidden Markov models. In Proc. of the 4th International Society for Music Information Retrieval Conference (ISMIR), Baltimore, Maryland, USA, 2003

[36]

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations (ICLR), Conference Track Proc., San Diego, CA, USA, 2015

[37]

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014

[38]

S. Sigtia, N. Boulanger-Lewandowski, and S. Dixon. Audio chord recognition with a hybrid recurrent neural network. In Proc. of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 127–133, Málaga, Spain, 2015

[39]

Y. Ueda, Y. Uchiyama, T. Nishimoto, N. Ono, and S. Sagayama. HMM-based approach for automatic chord detection using refined acoustic features. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 5518–5521, Dallas, Texas, USA, 2010

[40]

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 6000–6010, Long Beach, CA, USA, 2017

[41]

Y. Wu and W. Li. Automatic audio chord recognition with MIDI-trained deep feature and BLSTM-CRF sequence decoding model. IEEE/ACM Trans. Audio, Speech & Language Processing, 27(2):355–366, 2019

[42]

X. Zhou and A. Lerch. Chord detection using deep learning. In Proc. of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 52–58, Málaga, Spain, 2015
