Learning Complex Basis Functions for Invariant Representations of Audio
Pith reviewed 2026-05-24 22:14 UTC · model grok-4.3
The pith
A complex autoencoder learns basis functions that map audio to a transformation-invariant magnitude space and a variant phase space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mapping signals onto complex basis functions learned by the CAE results in a transformation-invariant magnitude space and a transformation-variant phase space. The phase space is useful to infer transformations between data pairs. When exploiting the invariance property of the magnitude space, state-of-the-art results are achieved in audio-to-score alignment and repeated section discovery for audio.
What carries the argument
The Complex Autoencoder (CAE) that learns complex basis functions to separate signals into an invariant magnitude space and a transformation-variant phase space.
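The magnitude/phase split can be illustrated with a fixed complex basis. The sketch below is not the CAE: it substitutes the discrete Fourier basis for the learned basis functions and a circular time shift for the orthogonal transformation, both of which are assumptions made purely for illustration. All function names are hypothetical.

```python
import cmath
import math


def dft_basis(n):
    """Fixed stand-in basis: row k is the k-th complex exponential over n samples."""
    return [[cmath.exp(-2j * math.pi * k * t / n) for t in range(n)] for k in range(n)]


def project(basis, x):
    """Complex projection of the signal onto every basis function."""
    return [sum(b_t * x_t for b_t, x_t in zip(b, x)) for b in basis]


def circular_shift(x, s):
    """Shift the signal by s samples with wrap-around (an orthogonal transformation)."""
    n = len(x)
    return [x[(t - s) % n] for t in range(n)]


n = 16
x = [math.sin(2 * math.pi * 3 * t / n) + 0.5 * math.cos(2 * math.pi * 5 * t / n)
     for t in range(n)]
basis = dft_basis(n)

z = project(basis, x)
z_shifted = project(basis, circular_shift(x, 4))

# "Magnitude space": unchanged by the shift, up to floating-point error.
magnitudes = [abs(c) for c in z]
magnitudes_shifted = [abs(c) for c in z_shifted]
max_magnitude_gap = max(abs(a - b) for a, b in zip(magnitudes, magnitudes_shifted))

# "Phase space": the bins that carry energy rotate when the input is shifted.
phase_rotation_bin3 = cmath.phase(z_shifted[3] / z[3])
```

In this toy setting `max_magnitude_gap` stays at floating-point noise level while `phase_rotation_bin3` is clearly non-zero, which is the behaviour the page attributes to the learned basis.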
If this is right
- The phase space supports direct inference of transformations between pairs of audio examples.
- Exploiting only the magnitude space produces state-of-the-art results on audio-to-score alignment.
- Exploiting only the magnitude space produces state-of-the-art results on repeated section discovery in audio.
- Learned complex features outperform hand-crafted spectrogram features on the evaluated MIR tasks.
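The first point above, inferring a transformation from the phase space, can be sketched for the simplest case. This example again assumes a Fourier basis in place of the learned one and a pure circular shift as the transformation; `infer_circular_shift` is a hypothetical helper, not part of the paper's code.

```python
import cmath
import math


def coefficient(x, k):
    """Projection of x onto the k-th complex exponential."""
    n = len(x)
    return sum(x_t * cmath.exp(-2j * math.pi * k * t / n) for t, x_t in enumerate(x))


def infer_circular_shift(x, y, k=1):
    """Estimate s with y[t] == x[(t - s) % n] from the phase difference at bin k.

    Assumes y is an exact circular shift of x and that x has energy at bin k.
    """
    n = len(x)
    ratio = coefficient(y, k) / coefficient(x, k)  # exp(-2j*pi*k*s/n) for a pure shift
    s = -cmath.phase(ratio) * n / (2 * math.pi * k)
    return round(s) % n


def circular_shift(x, s):
    n = len(x)
    return [x[(t - s) % n] for t in range(n)]


# A smooth bump; any signal with energy at bin 1 would do.
x = [math.exp(-0.5 * ((t - 5) / 2.0) ** 2) for t in range(32)]
estimated_small = infer_circular_shift(x, circular_shift(x, 7))
estimated_large = infer_circular_shift(x, circular_shift(x, 20))
```

The magnitudes of `x` and its shifted copy are identical, so only the phase difference carries the information needed to recover the shift.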
Where Pith is reading between the lines
- The same magnitude-phase separation might apply to other signal types that require invariance to shifts or rotations.
- Task-specific post-processing may not be needed when the learned magnitude space already matches the invariance requirements of multiple MIR problems.
- The approach could reduce reliance on manually designed filter banks in audio pipelines that face similar transformation issues.
Load-bearing premise
The learned complex basis functions produce magnitude representations that stay fixed under the specific orthogonal transformations that matter for the target audio tasks.
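One sufficient condition for this premise can be written out explicitly. The derivation below is a sketch under an assumption that this page does not state: the basis vector $w$ is an eigenvector of the real orthogonal transformation $\psi$. The symbols $w$, $\psi$, $\lambda$ and $x$ are introduced here for illustration.

```latex
\begin{aligned}
&\text{Assume } \psi^{\top}\psi = I \text{ and } \psi w = \lambda w, \text{ so that } |\lambda| = 1. \\
&\text{Then } \psi^{\top} w = \psi^{-1} w = \lambda^{-1} w = \bar{\lambda}\, w, \\
&\langle w, \psi x \rangle = w^{H} \psi x = (\psi^{\top} w)^{H} x = (\bar{\lambda}\, w)^{H} x = \lambda\, w^{H} x = \lambda\, \langle w, x \rangle, \\
&|\langle w, \psi x \rangle| = |\langle w, x \rangle|, \qquad
\arg \langle w, \psi x \rangle = \arg \langle w, x \rangle + \arg \lambda \pmod{2\pi}.
\end{aligned}
```

Under that assumption the magnitude is unchanged and the transformation appears only as a phase offset. Whether the learned bases satisfy it for the transformations of interest is exactly what the premise takes for granted.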
What would settle it
An experiment that applies a transposition or time shift to input audio and shows a measurable change in the magnitude output of the trained CAE would falsify the invariance claim.
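Such a check could be scripted along the following lines. This is a minimal sketch: `relative_magnitude_change` is a hypothetical helper that accepts any encoder callable, and the two encoders shown are stand-ins (a full Fourier projection and a deliberately truncated one), not the trained CAE. In a real test, `encode` would wrap the trained model and `transform` would be a transposition or time shift of held-out audio.

```python
import cmath
import math


def relative_magnitude_change(encode, x, transform):
    """Largest change in |encode(.)| caused by transform, relative to the peak magnitude."""
    before = [abs(c) for c in encode(x)]
    after = [abs(c) for c in encode(transform(x))]
    scale = max(before) or 1.0
    return max(abs(a - b) for a, b in zip(before, after)) / scale


def dft_encode(x):
    """Stand-in encoder whose magnitudes are invariant to circular shifts."""
    n = len(x)
    return [sum(x_t * cmath.exp(-2j * math.pi * k * t / n) for t, x_t in enumerate(x))
            for k in range(n)]


def half_frame_encode(x):
    """Stand-in encoder that only sees the first half of the frame, so it is not invariant."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n // 2))
            for k in range(n)]


def shift_by_twelve(x):
    """Circular shift by 12 samples."""
    return x[-12:] + x[:-12]


x = [math.exp(-0.5 * ((t - 6) / 1.5) ** 2) for t in range(32)]
invariant_score = relative_magnitude_change(dft_encode, x, shift_by_twelve)
variant_score = relative_magnitude_change(half_frame_encode, x, shift_by_twelve)
```

A score near zero is consistent with invariance; a score well above numerical noise, as produced by the truncated encoder here, is the kind of result that would count against the claim.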
Original abstract
Learning features from data has shown to be more successful than using hand-crafted features for many machine learning tasks. In music information retrieval (MIR), features learned from windowed spectrograms are highly variant to transformations like transposition or time-shift. Such variances are undesirable when they are irrelevant for the respective MIR task. We propose an architecture called Complex Autoencoder (CAE) which learns features invariant to orthogonal transformations. Mapping signals onto complex basis functions learned by the CAE results in a transformation-invariant "magnitude space" and a transformation-variant "phase space". The phase space is useful to infer transformations between data pairs. When exploiting the invariance-property of the magnitude space, we achieve state-of-the-art results in audio-to-score alignment and repeated section discovery for audio. A PyTorch implementation of the CAE, including the repeated section discovery method, is available online.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a Complex Autoencoder (CAE) architecture that learns complex-valued basis functions from windowed spectrograms. Mapping input signals onto these bases produces a magnitude representation claimed to be invariant to orthogonal transformations (e.g., transposition, time-shift) and a phase representation that encodes the transformations. The invariant magnitude space is then used to obtain state-of-the-art results on audio-to-score alignment and repeated section discovery in music information retrieval tasks. A PyTorch implementation is provided.
Significance. If the claimed invariance property holds beyond the training distribution and can be directly exploited without task-specific tuning, the approach would offer a data-driven alternative to hand-crafted features that explicitly encode transposition or shift invariance. This could strengthen robustness in MIR pipelines where such transformations are musically irrelevant. The open-source release of code and the repeated-section method is a positive contribution for reproducibility.
major comments (2)
- [§3] §3 (CAE architecture and training): No derivation or controlled experiment demonstrates that the learned complex bases satisfy the inner-product conditions required for magnitude invariance to transposition or time-shift on inputs outside the training distribution. The abstract asserts the property, but without an explicit equivariance argument or ablation that isolates the magnitude space from other pipeline components, the attribution of SOTA results to invariance remains unsupported.
- [§4] §4 (Experiments on audio-to-score alignment): The evaluation does not report an ablation that removes or replaces the magnitude-space invariance step while keeping all other choices fixed. Without this, it is impossible to quantify how much of the reported improvement is due to the claimed invariance versus other modeling decisions.
minor comments (2)
- [§3] Notation for the complex basis functions and the magnitude/phase decomposition should be introduced with explicit equations rather than prose descriptions.
- The abstract states that the phase space is 'useful to infer transformations,' but the manuscript does not quantify this utility with a separate experiment or metric.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.
- Referee: [§3] §3 (CAE architecture and training): No derivation or controlled experiment demonstrates that the learned complex bases satisfy the inner-product conditions required for magnitude invariance to transposition or time-shift on inputs outside the training distribution. The abstract asserts the property, but without an explicit equivariance argument or ablation that isolates the magnitude space from other pipeline components, the attribution of SOTA results to invariance remains unsupported.
  Authors: We acknowledge that the original manuscript does not include an explicit mathematical derivation of the invariance property for inputs outside the training distribution, nor a controlled OOD experiment isolating the magnitude space. The CAE is motivated by the fact that the squared magnitude of the complex inner product is invariant to unitary transformations when the learned bases are complete, but this is asserted rather than formally derived or tested on transformed data unseen during training. We will add a derivation section showing the inner-product invariance condition and an ablation experiment evaluating magnitude stability under transposition and time-shift on held-out audio excerpts. This will also help isolate the contribution of the magnitude representation from other pipeline elements.
  Revision: yes
- Referee: [§4] §4 (Experiments on audio-to-score alignment): The evaluation does not report an ablation that removes or replaces the magnitude-space invariance step while keeping all other choices fixed. Without this, it is impossible to quantify how much of the reported improvement is due to the claimed invariance versus other modeling decisions.
  Authors: We agree that the current experiments compare the full CAE-based pipeline against external baselines but lack an internal ablation that disables or replaces the magnitude-invariance component (e.g., using raw complex coefficients or a non-invariant representation) while holding all other modeling choices constant. Such an ablation would directly quantify the benefit attributable to invariance. We will include this controlled ablation in the revised experiments section for both the audio-to-score alignment and repeated-section tasks.
  Revision: yes
Circularity Check
No significant circularity; derivation self-contained
The abstract describes a CAE architecture that learns complex basis functions, with the resulting magnitude space asserted to be transformation-invariant as a property of the learned representation. No equations, self-citations, or derivations are quoted that reduce this invariance to a fitted input, self-definition, or prior author result by construction. The SOTA claims on alignment and section discovery are presented as empirical outcomes rather than tautological predictions. This matches the default case of a self-contained proposal without load-bearing circular steps.
Reference graph
Works this paper leans on
-
[1]
Learning Complex Basis Functions for Invariant Representations of Audio
INTRODUCTION Learning from audio data most commonly involves some prior processing of the raw sound signals. The most popular features are derived from a spectrogram, which consists of the magnitude values of the Fourier transform of a windowed signal of interest. In a Fourier transform, a signal is projected onto sine and cosine functions of differen...
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[2]
RELATED WORK Generally, mid-level representations in neural networks are highly variant to transformations in the input. The most common and well-known way to obtain shift-invariance in convolutional architectures is max-pooling [4]. However, full shift-invariance can only be achieved step-wise by applying max-pooling over several layers. A whole line o...
-
[3]
assumed to be useful for a particular learning task at hand
MODEL AND MATHEMATICAL BACKGROUND We aim at learning orthogonal transformations encoding certain invariances of a class of signals which are known or assumed to be useful for a particular learning task at hand. To this end, we leverage ...
Figure 2: Some examples of real (top) and imaginary (bottom) basis vectors learned from audio signals (time in seconds).
-
[4]
TRAINING For all the experiments described below, we choose 256 complex basis vectors and train the model for 500 epochs with a learning rate of 1e-3. We use a batch size of 1000, and we sample 100k transformations per epoch, generally picking random instances from the train set to be transformed. The training data is standardized, and 50% dropout is us...
-
[5]
for the audio experiments, and p = 2 for the MNIST experiment. In the alignment experiment, we also penalize the mean of norms of all basis vectors and the deviation of the individual basis vectors’ norms to the average norm over all basis vectors. In the MNIST experiment, the norm of all basis vectors is set to 0.4 after every batch. For information ab...
-
[6]
Discovery of Repeated Themes and Sections
EXPERIMENTS 5.1 Discovery of Repeated Themes and Sections In the MIREX task “Discovery of Repeated Themes and Sections”, the performance of different algorithms to identify repeated (and possibly transposed) patterns in symbolic music and audio is tested. The commonly used JKUPDD dataset [6] contains 26 motifs, themes, and repeated sections annotated in...
work page 1984
-
[7]
is adopted, which finds diagonals in a self-similarity matrix using a threshold. As we normalized the matrices to zero median, the threshold chosen in this experiment is close to zero (i.e., 0.01). 5.1.1 Results and Discussion Table 1 shows the results of the experiment. Using our method, we could slightly outperform the Gated Autoencoder approach propos...
work page 2017
-
[8]
CONCLUSION AND FUTURE WORK The empirical results in this work show that for music alignment, structure analysis, and invariant classification tasks, the features learned by the CAE have advantages over other features, like Chroma features, and features learned by a GAE. As opposed to Chroma features, the CAE features are transposition-invariant, and genera...
-
[9]
ACKNOWLEDGMENTS This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068. Monika Dörfler is supported by the Vienna Science and Technology Fund (WWTF) project SALSA (MA14-018)
work page 2020
-
[10]
Flexible and Robust Music Tracking
Andreas Arzt. Flexible and Robust Music Tracking. PhD thesis, Johannes Kepler University Linz, 2016
work page 2016
-
[11]
Audio-to-score alignment using transposition-invariant features
Andreas Arzt and Stefan Lattner. Audio-to-score alignment using transposition-invariant features. In Proceedings of the International Conference on Music Information Retrieval (ISMIR), Paris, France, 2018
work page 2018
-
[12]
Thierry Bertin-Mahieux and Daniel P. W. Ellis. Large-scale cover song recognition using the 2d fourier transform magnitude. In Fabien Gouyon, Perfecto Herrera, Luis Gustavo Martins, and Meinard Müller, editors, Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR 2012, Mosteiro S.Bento Da Vitória, Port...
work page 2012
-
[13]
A theoretical analysis of feature pooling in visual recognition
Y-Lan Boureau, Jean Ponce, and Yann LeCun. A theoretical analysis of feature pooling in visual recognition. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 111–118. Omnipress, 2010
work page 2010
-
[14]
Invariant scattering convolution networks
Joan Bruna and Stéphane Mallat. Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1872–1886, 2013
work page 2013
-
[15]
Discovery of repeated themes and sections
Tom Collins. Discovery of repeated themes and sections. http://www.music-ir.org/mirex/wiki/2017:Discovery_of_Repeated_Themes_%26_Sections, 2017
work page 2017
-
[16]
Tom Collins, Andreas Arzt, Sebastian Flossmann, and Gerhard Widmer. Siarct-cfp: Improving precision and the discovery of inexact musical patterns in point-set representations. In ISMIR, pages 549–554, 2013
work page 2013
-
[17]
Deformable convolutional networks
Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 764–773. IEEE Computer Society, 2017
work page 2017
-
[18]
S. Dieleman and B. Schrauwen. End-to-end learning for music audio. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6964–6968, May 2014
work page 2014
-
[19]
MATCH: A music alignment tool chest
Simon Dixon and Gerhard Widmer. MATCH: A music alignment tool chest. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 492–497, London, UK, 2005
work page 2005
-
[20]
Daniel P.W. Ellis and Graham E. Poliner. Identifying ‘cover songs’ with chroma features and dynamic programming beat tracking. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 4, pages 1429–1432, Honolulu, Hawaii, USA, 2007
work page 2007
-
[21]
Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle
Valentin Emiya, Roland Badeau, and Bertrand David. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1643–1654, 2010
work page 2010
-
[22]
The vienna 4x22 piano corpus, 1999
Werner Goebl. The vienna 4x22 piano corpus, 1999. http://dx.doi.org/10.21939/4X22
-
[23]
Automatic alignment of music performances with structural differences
Maarten Grachten, Martin Gasser, Andreas Arzt, and Gerhard Widmer. Automatic alignment of music performances with structural differences. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pages 607–612, Curitiba, Brazil, 2013
work page 2013
-
[24]
Dannenberg, and George Tzanetakis
Ning Hu, Roger B. Dannenberg, and George Tzanetakis. Polyphonic audio matching and alignment for music retrieval. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 2003
work page 2003
-
[25]
Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebe...
work page 2015
-
[26]
A comparative study of tonal acoustic features for a symbolic level music-to-score alignment
Cyril Joder, Slim Essid, and Gaël Richard. A comparative study of tonal acoustic features for a symbolic level music-to-score alignment. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas, USA, 2010
work page 2010
-
[27]
Dmitry Laptev, Nikolay Savinov, Joachim M. Buhmann, and Marc Pollefeys. TI-POOLING: transformation-invariant pooling for feature learning in convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 289–297. IEEE Computer Society, 2016
work page 2016
-
[28]
Courville, James Bergstra, and Yoshua Bengio
Hugo Larochelle, Dumitru Erhan, Aaron C. Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Zoubin Ghahramani, editor, Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, volume 227 of AC...
work page 2007
-
[29]
Learning transposition-invariant interval features from symbolic music and audio
Stefan Lattner, Maarten Grachten, and Gerhard Widmer. Learning transposition-invariant interval features from symbolic music and audio. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018
work page 2018
-
[30]
Quoc V. Le, Will Y. Zou, Serena Y. Yeung, and Andrew Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011, pages 3361–3368. IEEE Computer Society, 2011
work page 2011
-
[31]
Shangzhen Luan, Chen Chen, Baochang Zhang, Jungong Han, and Jianzhuang Liu. Gabor convolutional networks. IEEE Trans. Image Processing, 27(9):4357–4366, 2018
work page 2018
-
[32]
Matija Marolt. A mid-level representation for melody-based retrieval in audio collections. IEEE Transactions on Multimedia, 10(8):1617–1625, 2008
work page 2008
-
[33]
Transform invariant auto-encoder
Tadashi Matsuo, Hiroya Fukuhara, and Nobutaka Shimada. Transform invariant auto-encoder. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2017, Vancouver, BC, Canada, September 24-28, 2017, pages 2359–2364. IEEE, 2017
work page 2017
-
[34]
Learning invariant features by harnessing the aperture problem
Roland Memisevic and Georgios Exarchakis. Learning invariant features by harnessing the aperture problem. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, pages 100–108. JMLR.org, 2013
work page 2013
-
[35]
Fundamentals of Music Processing
Meinard Müller. Fundamentals of Music Processing. Springer Verlag, 2015
work page 2015
-
[36]
Identifying polyphonic patterns from audio recordings using music segmentation techniques
Oriol Nieto and Morwaread M Farbood. Identifying polyphonic patterns from audio recordings using music segmentation techniques. In Proc. of the 15th International Society for Music Information Retrieval Conference, pages 411–416, 2014
work page 2014
-
[37]
Bilinear models of natural images
Bruno A Olshausen, Charles Cadieu, Jack Culpepper, and David K Warland. Bilinear models of natural images. In Electronic Imaging 2007, pages 649206–649206. International Society for Optics and Photonics, 2007
work page 2007
-
[38]
Joint deep learning for pedestrian detection
Wanli Ouyang and Xiaogang Wang. Joint deep learning for pedestrian detection. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pages 2056–2063. IEEE Computer Society, 2013
work page 2013
-
[39]
Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas F. Ehmann, and Xavier Serra. End-to-end learning for music audio tagging at scale. In Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018, pages 637–644, 2018
work page 2018
-
[40]
FastDTW: Toward accurate dynamic time warping in linear time and space
Stan Salvador and Philip Chan. FastDTW: Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis, 11(5):561–580, 2007
work page 2007
-
[41]
Rotation, scaling and deformation invariant scattering for texture discrimination
Laurent Sifre and Stéphane Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 1233–1240. IEEE Computer Society, 2013
work page 2013
-
[42]
The intervalgram: an audio feature for large-scale melody recognition
Thomas C Walters, David A Ross, and Richard F Lyon. The intervalgram: an audio feature for large-scale melody recognition. In Proc. of the 9th International Symposium on Computer Music Modeling and Retrieval (CMMR). Citeseer, 2012
work page 2012
-
[43]
Cheng-i Wang, Jennifer Hsu, and Shlomo Dubnov. Music pattern discovery with variable markov oracle: A unified approach to symbolic and audio representations. In Meinard Müller and Frans Wiering, editors, Proceedings of the 16th International Society for Music Information Retrieval Conference, ISMIR 2015, Málaga, Spain, October 26-30, 2015, pages 17...
work page 2015
-
[44]
Gerhard Widmer. Discovering simple rules in complex data: A meta-learning algorithm and some surprising musical discoveries. Artificial Intelligence, 146(2):129–148, 2003
work page 2003
-
[45]
Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 7168–
work page 2017
-
[46]
IEEE Computer Society, 2017
work page 2017
-
[47]
Zhuoyao Zhong, Lianwen Jin, and Zecheng Xie. High performance offline handwritten chinese character recognition using googlenet and directional feature maps. In 13th International Conference on Document Analysis and Recognition, ICDAR 2015, Nancy, France, August 23-26, 2015, pages 846–850. IEEE Computer Society, 2015
work page 2015
-
[48]
Yanzhao Zhou, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Oriented response networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 4961–4970. IEEE Computer Society, 2017
work page 2017