Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment

Weiliang Luo

arXiv: 2410.18151 · v2 · pith:XPHMGSD3new · submitted 2024-10-23 · 💻 cs.SD · cs.LG · cs.MM · eess.AS

Pith reviewed 2026-05-23 19:13 UTC · model grok-4.3

keywords: D12-equivariant transformer · chord progression accompaniment · musical symmetry · POP909 dataset · group theory in music · transformer architecture · symbolic music · self-attention adaptation

The pith

A D12-equivariant transformer for chord accompaniment improves weighted loss and exact accuracy over a non-equivariant prototype while using fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Music102, a transformer that embeds D12 group actions of transposition and reflection so that melody and chord sequences remain equivariant under musical symmetries. It trains and evaluates this architecture on the POP909 dataset and reports gains over the earlier Music101 model in both loss and accuracy metrics. A sympathetic reader would care because the result suggests that hard-coding music-theoretic symmetries can make sequence models more accurate and more compact for symbolic accompaniment tasks. The work also shows how self-attention and layer normalization can be adapted to preserve these discrete symmetries.
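To make the two symmetries concrete, the sketch below applies a transposition (a cyclic shift) and a reflection (index negation modulo 12) to a 12-dimensional pitch-class vector. The indexing convention C = 0, the function names `transpose` and `reflect`, and the choice of reflection axis are assumptions made for illustration; the paper's own encoding may differ.

```python
import numpy as np

N_PITCH_CLASSES = 12

def transpose(x, k=1):
    """Transposition T^k: move the weight on pitch class p to (p + k) mod 12."""
    return np.roll(x, k, axis=-1)

def reflect(x):
    """Inversion I: move the weight on pitch class p to (-p) mod 12."""
    idx = (-np.arange(N_PITCH_CLASSES)) % N_PITCH_CLASSES
    return x[..., idx]

# C major triad as a pitch-class vector (assumed convention: C=0, E=4, G=7).
c_major = np.zeros(N_PITCH_CLASSES)
c_major[[0, 4, 7]] = 1.0

d_major = transpose(c_major, 2)  # pitch classes {2, 6, 9}
inverted = reflect(c_major)      # pitch classes {0, 5, 8}, an F minor triad

# The two generators satisfy the dihedral relations.
x = np.random.default_rng(0).random(N_PITCH_CLASSES)
assert np.allclose(transpose(x, 12), x)                                   # T^12 = e
assert np.allclose(reflect(reflect(x)), x)                                # I^2 = e
assert np.allclose(reflect(transpose(reflect(x))), transpose(x, -1))      # I T I = T^-1
```

Transposition maps major triads to major triads, while reflection exchanges major and minor, which is why the reflection half of the group is the stronger modelling assumption.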

Core claim

Music102 is a D12-equivariant transformer that maintains equivariance across melody and chord sequences by integrating transposition and reflection operations from group theory. Trained and tested on the POP909 dataset, it achieves lower weighted loss and higher exact accuracy than the non-equivariant Music101 prototype despite using fewer parameters. The architecture demonstrates that self-attention mechanisms and layer normalization can be made to respect the discrete musical domain while encoding prior symmetry knowledge.
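A small numerical illustration of why layer normalization needs adapting: the mean and variance over the 12 pitch classes are unchanged by any permutation of those classes, so the normalization step commutes with transposition and reflection, but a learned per-class gain and bias do not. The sketch below is not the construction used in Music102; the function `layer_norm`, the shared scalar parameters, and the array shapes are assumptions chosen only to show the effect.

```python
import numpy as np

def transpose(x, k=1):
    # Cyclic shift of the pitch-class axis.
    return np.roll(x, k, axis=-1)

def reflect(x):
    # Index negation modulo 12 on the pitch-class axis.
    return x[..., (-np.arange(12)) % 12]

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize over the last axis, then apply gain and bias (scalars or length-12 vectors)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

rng = np.random.default_rng(0)
x = rng.random((4, 12))  # 4 time steps of pitch-class features (assumed shape)

per_class_gain = rng.normal(size=12)  # standard per-feature parameters
per_class_bias = rng.normal(size=12)
shared_gain, shared_bias = 1.5, 0.1   # one scalar shared by all pitch classes

for g in (transpose, reflect):
    shared_ok = np.allclose(layer_norm(g(x), shared_gain, shared_bias),
                            g(layer_norm(x, shared_gain, shared_bias)))
    per_class_ok = np.allclose(layer_norm(g(x), per_class_gain, per_class_bias),
                               g(layer_norm(x, per_class_gain, per_class_bias)))
    print(g.__name__, "shared:", shared_ok, "per-class:", per_class_ok)
```

Tying the affine parameters across pitch classes is the simplest way to restore equivariance in this toy setting; a full architecture may instead normalize separately within each group of symmetry-related features.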

What carries the argument

The D12-equivariant transformer that enforces symmetry under transposition and reflection operations on both melody and chord sequences.
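Equivariance is a property that can be checked numerically: a layer f is equivariant when f(g·x) = g·f(x) for every group element g, and it is enough to test the two generators. The sketch below does this for a single linear layer on pitch-class vectors. A linear map commutes with both generators exactly when its (i, j) entry depends only on the circular distance between i and j, which leaves 7 free weights instead of 144. The helper names (`symmetric_circulant`, `is_equivariant`) are hypothetical, and this toy layer stands in for, rather than reproduces, the paper's attention blocks.

```python
import numpy as np

def transpose(x, k=1):
    return np.roll(x, k, axis=-1)

def reflect(x):
    return x[..., (-np.arange(12)) % 12]

def symmetric_circulant(w):
    """12x12 matrix whose (i, j) entry depends only on the circular distance between i and j."""
    i, j = np.indices((12, 12))
    d = np.abs(i - j)
    return w[np.minimum(d, 12 - d)]  # w holds 7 weights, one per distance 0..6

def is_equivariant(f, x, atol=1e-9):
    """Check f(g . x) == g . f(x) for the two generators of D12."""
    ok_t = np.allclose(f(transpose(x)), transpose(f(x)), atol=atol)
    ok_i = np.allclose(f(reflect(x)), reflect(f(x)), atol=atol)
    return ok_t and ok_i

rng = np.random.default_rng(0)
x = rng.random((4, 12))                          # 4 time steps of pitch-class features
W_eq = symmetric_circulant(rng.normal(size=7))   # 7 free parameters
W_dense = rng.normal(size=(12, 12))              # 144 free parameters

equivariant_layer = lambda v: v @ W_eq.T
dense_layer = lambda v: v @ W_dense.T

print(is_equivariant(equivariant_layer, x))  # constrained weights pass the check
print(is_equivariant(dense_layer, x))        # unconstrained weights generally fail it
```

The parameter counts here only illustrate how weight tying can shrink a layer; they say nothing about the actual sizes of Music101 or Music102.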

Load-bearing premise

Enforcing D12-equivariance will improve the model's capacity to capture chord-progression patterns rather than constrain it to only symmetric ones.

What would settle it

If Music102 and Music101 are trained on identical POP909 splits and Music102 shows higher weighted loss or lower exact accuracy, the claimed improvement is falsified.

Original abstract

We present Music102, an advanced model aimed at enhancing chord progression accompaniment through a $D_{12}$-equivariant transformer. Inspired by group theory and symbolic music structures, Music102 leverages musical symmetry--such as transposition and reflection operations--integrating these properties into the transformer architecture. By encoding prior music knowledge, the model maintains equivariance across both melody and chord sequences. The POP909 dataset was employed to train and evaluate Music102, revealing significant improvements over the non-equivariant Music101 prototype Music101 in both weighted loss and exact accuracy metrics, despite using fewer parameters. This work showcases the adaptability of self-attention mechanisms and layer normalization to the discrete musical domain, addressing challenges in computational music analysis. With its stable and flexible neural framework, Music102 sets the stage for further exploration in equivariant music generation and computational composition tools, bridging mathematical theory with practical music performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Music102 applies D12-equivariance to chord accompaniment, but the gains over Music101 are not isolated from other possible differences.

Music102 is a D12-equivariant transformer for chord progression accompaniment. It encodes transposition and reflection symmetries from the dihedral group into the self-attention and layer normalization so the model stays consistent under those operations on melody and chord sequences from the POP909 dataset. The central claim is that this version beats the non-equivariant Music101 prototype on weighted loss and exact accuracy while using fewer parameters. That is the concrete new piece: a direct adaptation of group-equivariant techniques to this specific music task.

The paper does a clear job laying out the musical motivation and showing how the symmetries can be baked into the transformer components without adding parameters. The framing connects group theory to symbolic music in a straightforward way.

The main limitation is the evaluation. The abstract supplies no implementation specifics on the equivariant layers, no ablation that holds everything else fixed, and no statistical tests or error analysis. Without those controls it is impossible to attribute the reported gains to the D12 structure rather than other differences in architecture, optimization, or data handling between the two models. The stress-test concern lands: the comparison does not isolate the effect of equivariance. The assumption that enforcing the full group action will improve rather than constrain performance on real chord patterns remains untested in the given material.

This paper is for researchers already working on symmetry-aware models or computational music generation. A reader in that niche might pick up the adaptation idea, but anyone wanting a solid empirical demonstration would need the full methods and results. If the manuscript supplies the missing implementation details, ablations, and code, it deserves peer review so the technical claims can be checked properly.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Music102, a D12-equivariant transformer for chord progression accompaniment that encodes musical symmetries (transpositions and reflections) via adapted self-attention and layer normalization. It claims that this model, trained on the POP909 dataset, achieves better weighted loss and exact accuracy than the non-equivariant Music101 prototype while using fewer parameters.

Significance. If the performance gains can be shown to arise specifically from the D12-equivariance under controlled conditions, the result would provide concrete evidence that group-equivariant architectures can improve efficiency and accuracy in symbolic music tasks by incorporating domain symmetries.

major comments (2)

[Abstract] The central claim of 'significant improvements' over Music101 in weighted loss and exact accuracy is unsupported by any numerical values, definitions of the metrics, ablation results, or statistical tests, so the data-to-claim link cannot be assessed.
[Abstract] No information is supplied on whether Music102 and Music101 share identical base architecture, layer counts, attention mechanisms, training procedure, hyperparameters, and data handling on POP909; without these controls the reported gains cannot be attributed to the D12-equivariant constraints rather than other differences.

minor comments (1)

[Abstract] The phrase 'non-equivariant Music101 prototype Music101' contains an erroneous repetition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback. We agree that the abstract as currently written does not sufficiently support its claims with quantitative details or explicit controls, and we will revise it to address both points. Our responses to the major comments are below.

Point-by-point responses

Referee: [Abstract] The central claim of 'significant improvements' over Music101 in weighted loss and exact accuracy is unsupported by any numerical values, definitions of the metrics, ablation results, or statistical tests, so the data-to-claim link cannot be assessed.

Authors: We acknowledge that the abstract does not contain the supporting numerical values, metric definitions, or ablation details. The experimental results section of the manuscript reports the concrete weighted loss and exact accuracy figures on POP909 together with the comparison to Music101. To make the abstract self-contained we will insert the key quantitative results, a brief definition of each metric, and a short statement that the gains are measured under matched training conditions. We will not add statistical tests unless space permits, as the improvements appear consistently across both metrics.

revision: yes

Referee: [Abstract] No information is supplied on whether Music102 and Music101 share identical base architecture, layer counts, attention mechanisms, training procedure, hyperparameters, and data handling on POP909; without these controls the reported gains cannot be attributed to the D12-equivariant constraints rather than other differences.

Authors: Music102 is constructed by taking the identical base transformer architecture, layer count, attention mechanism, training procedure, hyperparameters, and POP909 data splits used for Music101 and then replacing only the self-attention and layer-normalization modules with their D12-equivariant counterparts. All other components remain unchanged. We will add an explicit sentence to the revised abstract stating that the comparison is controlled in this manner and will confirm that the methods section contains the full specification of the shared settings.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical performance claims only

full rationale

The manuscript presents Music102 as a D12-equivariant transformer and reports empirical gains in weighted loss and exact accuracy over the Music101 baseline on POP909, with no equations, derivations, or first-principles predictions that reduce by construction to fitted parameters or self-citations. The central claim is an observed metric improvement under the stated architectural change; this is externally falsifiable by replication and does not invoke any of the enumerated circular patterns. Self-citation of prior Music101 work is present but not load-bearing for any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

axioms (1)

Domain assumption: D12 symmetries (transpositions and reflections) are the musically relevant group actions to enforce for chord accompaniment. Invoked to justify the equivariant architecture.
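Stated formally, and assuming the convention in which $D_{12}$ denotes the symmetry group of the regular 12-gon (24 elements) acting on pitch classes $p \in \mathbb{Z}_{12}$, the assumption amounts to requiring the following of the model $f$. The symbols $T$, $I$, $\rho_{\mathrm{in}}$, and $\rho_{\mathrm{out}}$ are notation introduced here for illustration and are not taken from the paper.

```latex
\[
\begin{aligned}
D_{12} &= \langle\, T, I \mid T^{12} = I^{2} = e,\ I T I = T^{-1} \,\rangle,
  \qquad |D_{12}| = 24, \\
T(p) &= p + 1 \pmod{12}, \qquad I(p) = -p \pmod{12}, \\
f\bigl(\rho_{\mathrm{in}}(g)\,x\bigr) &= \rho_{\mathrm{out}}(g)\,f(x)
  \qquad \text{for all } g \in D_{12},
\end{aligned}
\]
```

Here $\rho_{\mathrm{in}}$ and $\rho_{\mathrm{out}}$ are the representations by which the group acts on the melody input and the chord output, respectively.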

pith-pipeline@v0.9.0 · 5683 in / 1072 out tokens · 46151 ms · 2026-05-23T19:13:49.938642+00:00 · methodology

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 3 internal anchors

[1] INTRODUCTION In the burgeoning age of AI arts, generative AI, represented by the diffusion model, has been profoundly influencing the concept of digital paintings, while music production remains a frontier for machine intelligence to explore. For an AI composer, there’s a long way to achieve the holy grail of creating a complete classical symphony; howe...

[2] RELATED WORK The notion of the symmetry of pitch classes is fundamental in music theory [1], where the group theory exerts its power as in spatial objects [2]. As an essential prior knowledge of the music structure, computational music studies have embraced it in various tasks, such as transposition-invariant music metric [3], transposition-equivaria...

[3] Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment

BACKGROUND 3.1 Equal temperament The notion of a sound’s pitch is physically instantiated by the frequency of its vibration. Due to the physiological features of human ears and/or centuries of cultural construction, sounds of frequencies with a simple integer ratio make people feel harmonious. The relation between two pitches with the simplest non-trivia...

[4] METHOD 4.1 Embedding music into vectors A minimum value of u is used as a time step. The melody notes N = {(Pn, b, v)} are embedded as a series of vectors $m(k) \in [0, 1]^{12}$, where m(k) records the sounding notes during the timespan between (k − 1)u and ku. Inspired by [14], the embedding builds on the relative contribution of each note during the time...

[5] EXPERIMENTS The code conducting the experiments can be found in our Github repo. 5.1 Data Acquisition and processing The model is trained on the POP909 Dataset [17], which contains 909 pieces of Chinese pop music with the melody stored in MIDI and the chord progression annotation stored as a time series. We extract the melody from the MIDI file of each so...

[6] CONCLUSION To the best of our knowledge, this is the first transformer-based seq2seq model that considers the word-wise symmetry in the input and output word embeddings. As a result, the universal schemes in the transformer for natural language processing, including layer normalization and positional encoding, need to be adapted to this new domain....

[7] G. Mazzola, The topos of music: geometric logic of concepts, theory, and performance. Birkhäuser, 2012.

[8] A. Papadopoulos, “Mathematics and group theory in music,” arXiv preprint arXiv:1407.5757, 2014.

[9] G. Hadjeres and F. Nielsen, “Deep rank-based transposition-invariant distances on musical sequences,” arXiv preprint arXiv:1709.00740, 2017.

[10] A. Riou, S. Lattner, G. Hadjeres, and G. Peeters, “Pesto: Pitch estimation with self-supervised transposition-equivariant objective,” in International Society for Music Information Retrieval Conference (ISMIR 2023), 2023.

[11] F. Cwitkowitz and Z. Duan, “Toward Fully Self-Supervised Multi-Pitch Estimation,” arXiv preprint arXiv:2402.15569, 2024.

[12] V. Lostanlen, S. Sridhar, B. McFee, A. Farnsworth, and J. P. Bello, “Learning the helix topology of musical pitch,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 11–15.

[13] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck, “Music Transformer,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=rJe4ShAcF7

[14] B. Yu, P. Lu, R. Wang, W. Hu, X. Tan, W. Ye, S. Zhang, T. Qin, and T.-Y. Liu, “Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation,” in Advances in Neural Information Processing Systems, A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, Eds., 2022. [Online]. Available: https://openreview.net/forum?id=GFiqdZOm-Ei

[15] F. Fuchs, D. Worrall, V. Fischer, and M. Welling, “SE(3)-transformers: 3D roto-translation equivariant attention networks,” Advances in neural information processing systems, vol. 33, pp. 1970–1981, 2020.

[16] Y.-L. Liao and T. Smidt, “Equiformer: Equivariant graph attention transformer for 3d atomistic graphs,” arXiv preprint arXiv:2206.11990, 2022.

[17] Y.-L. Liao, B. Wood, A. Das, and T. Smidt, “Equiformerv2: Improved equivariant transformer for scaling to higher-degree representations,” arXiv preprint arXiv:2306.12059, 2023.

[18] M. Geiger and T. Smidt, “e3nn: Euclidean neural networks,” arXiv preprint arXiv:2207.09453, 2022.

[19] S. G. Laitz, “The complete musician: An integrated approach to tonal theory, analysis, and listening,” 2012.

[20] B.-S. Lin and T.-C. Yeh, “Automatic chord arrangement with key detection for monophonic music,” in 2017 International Conference on Soft Computing, Intelligent System and Information Technology (ICSIIT). IEEE, 2017, pp. 21–25.

[21] J.-P. Serre et al., Linear representations of finite groups. Springer, 1977, vol. 42.

[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online]. Available: https://...

[23] Z. Wang, K. Chen, J. Jiang, Y. Zhang, M. Xu, S. Dai, X. Gu, and G. G. Xia, “POP909: A Pop-song Dataset for Music Arrangement Generation,” in International Society for Music Information Retrieval Conference,

[24] [Online]. Available: https://api.semanticscholar.org/CorpusID:221140193
