pith. sign in

arxiv: 1907.05982 · v1 · pith:TEHJRWNNnew · submitted 2019-07-13 · 💻 cs.SD · cs.CV· cs.LG· eess.AS

Learning Complex Basis Functions for Invariant Representations of Audio

Pith reviewed 2026-05-24 22:14 UTC · model grok-4.3

classification 💻 cs.SD cs.CVcs.LGeess.AS
keywords complex autoencoderinvariant representationsaudio featuresmusic information retrievalmagnitude spacephase spaceaudio-to-score alignment
0
0 comments X

The pith

A complex autoencoder learns basis functions that map audio to a transformation-invariant magnitude space and a variant phase space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Complex Autoencoder to learn complex basis functions directly from data rather than relying on hand-crafted features. Signals projected onto these functions yield a magnitude component that stays unchanged under orthogonal transformations such as transposition and time shifts, while the phase component varies with those transformations. The invariant magnitude space can then be used directly for music information retrieval tasks. The method reports state-of-the-art performance on audio-to-score alignment and repeated section discovery. A reader would care because the separation supplies a data-driven route to features that ignore musically irrelevant shifts without extra post-processing steps.

Core claim

Mapping signals onto complex basis functions learned by the CAE results in a transformation-invariant magnitude space and a transformation-variant phase space. The phase space is useful to infer transformations between data pairs. When exploiting the invariance-property of the magnitude space, state-of-the-art results are achieved in audio-to-score alignment and repeated section discovery for audio.

What carries the argument

The Complex Autoencoder (CAE) that learns complex basis functions to separate signals into an invariant magnitude space and a transformation-variant phase space.

If this is right

  • The phase space supports direct inference of transformations between pairs of audio examples.
  • Exploiting only the magnitude space produces state-of-the-art results on audio-to-score alignment.
  • Exploiting only the magnitude space produces state-of-the-art results on repeated section discovery in audio.
  • Learned complex features outperform hand-crafted spectrogram features on the evaluated MIR tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same magnitude-phase separation might apply to other signal types that require invariance to shifts or rotations.
  • Task-specific post-processing may not be needed when the learned magnitude space already matches the invariance requirements of multiple MIR problems.
  • The approach could reduce reliance on manually designed filter banks in audio pipelines that face similar transformation issues.

Load-bearing premise

The learned complex basis functions produce magnitude representations that stay fixed under the specific orthogonal transformations that matter for the target audio tasks.

What would settle it

An experiment that applies transposition or time-shift to input audio and shows measurable change in the magnitude output of the trained CAE would falsify the invariance claim.

read the original abstract

Learning features from data has shown to be more successful than using hand-crafted features for many machine learning tasks. In music information retrieval (MIR), features learned from windowed spectrograms are highly variant to transformations like transposition or time-shift. Such variances are undesirable when they are irrelevant for the respective MIR task. We propose an architecture called Complex Autoencoder (CAE) which learns features invariant to orthogonal transformations. Mapping signals onto complex basis functions learned by the CAE results in a transformation-invariant "magnitude space" and a transformation-variant "phase space". The phase space is useful to infer transformations between data pairs. When exploiting the invariance-property of the magnitude space, we achieve state-of-the-art results in audio-to-score alignment and repeated section discovery for audio. A PyTorch implementation of the CAE, including the repeated section discovery method, is available online.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a Complex Autoencoder (CAE) architecture that learns complex-valued basis functions from windowed spectrograms. Mapping input signals onto these bases produces a magnitude representation claimed to be invariant to orthogonal transformations (e.g., transposition, time-shift) and a phase representation that encodes the transformations. The invariant magnitude space is then used to obtain state-of-the-art results on audio-to-score alignment and repeated section discovery in music information retrieval tasks. A PyTorch implementation is provided.

Significance. If the claimed invariance property holds beyond the training distribution and can be directly exploited without task-specific tuning, the approach would offer a data-driven alternative to hand-crafted features that explicitly encode transposition or shift invariance. This could strengthen robustness in MIR pipelines where such transformations are musically irrelevant. The open-source release of code and the repeated-section method is a positive contribution for reproducibility.

major comments (2)
  1. [§3] §3 (CAE architecture and training): No derivation or controlled experiment demonstrates that the learned complex bases satisfy the inner-product conditions required for magnitude invariance to transposition or time-shift on inputs outside the training distribution. The abstract asserts the property, but without an explicit equivariance argument or ablation that isolates the magnitude space from other pipeline components, the attribution of SOTA results to invariance remains unsupported.
  2. [§4] §4 (Experiments on audio-to-score alignment): The evaluation does not report an ablation that removes or replaces the magnitude-space invariance step while keeping all other choices fixed. Without this, it is impossible to quantify how much of the reported improvement is due to the claimed invariance versus other modeling decisions.
minor comments (2)
  1. [§3] Notation for the complex basis functions and the magnitude/phase decomposition should be introduced with explicit equations rather than prose descriptions.
  2. The abstract states that the phase space is 'useful to infer transformations,' but the manuscript does not quantify this utility with a separate experiment or metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (CAE architecture and training): No derivation or controlled experiment demonstrates that the learned complex bases satisfy the inner-product conditions required for magnitude invariance to transposition or time-shift on inputs outside the training distribution. The abstract asserts the property, but without an explicit equivariance argument or ablation that isolates the magnitude space from other pipeline components, the attribution of SOTA results to invariance remains unsupported.

    Authors: We acknowledge that the original manuscript does not include an explicit mathematical derivation of the invariance property for inputs outside the training distribution, nor a controlled OOD experiment isolating the magnitude space. The CAE is motivated by the fact that the squared magnitude of the complex inner product is invariant to unitary transformations when the learned bases are complete, but this is asserted rather than formally derived or tested on transformed data unseen during training. We will add a derivation section showing the inner-product invariance condition and an ablation experiment evaluating magnitude stability under transposition and time-shift on held-out audio excerpts. This will also help isolate the contribution of the magnitude representation from other pipeline elements. revision: yes

  2. Referee: [§4] §4 (Experiments on audio-to-score alignment): The evaluation does not report an ablation that removes or replaces the magnitude-space invariance step while keeping all other choices fixed. Without this, it is impossible to quantify how much of the reported improvement is due to the claimed invariance versus other modeling decisions.

    Authors: We agree that the current experiments compare the full CAE-based pipeline against external baselines but lack an internal ablation that disables or replaces the magnitude-invariance component (e.g., using raw complex coefficients or a non-invariant representation) while holding all other modeling choices constant. Such an ablation would directly quantify the benefit attributable to invariance. We will include this controlled ablation in the revised experiments section for both the audio-to-score alignment and repeated-section tasks. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The abstract describes a CAE architecture that learns complex basis functions, with the resulting magnitude space asserted to be transformation-invariant as a property of the learned representation. No equations, self-citations, or derivations are quoted that reduce this invariance to a fitted input, self-definition, or prior author result by construction. The SOTA claims on alignment and section discovery are presented as empirical outcomes rather than tautological predictions. This matches the default case of a self-contained proposal without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5681 in / 1029 out tokens · 17958 ms · 2026-05-24T22:14:02.621986+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 1 internal anchor

  1. [1]

    Learning Complex Basis Functions for Invariant Representations of Audio

    INTRODUCTION Learning from audio data most commonly involves some prior processing of the raw sound signals. The most popu- lar features are derived from a spectrogram, which consists of the magnitude values of the Fourier transform of a win- dowed signal of interest. In a Fourier transform, a signal is projected onto sine and cosine functions of differen...

  2. [2]

    temporal slowness

    RELATED WORK Generally, mid-level representations in neural networks are highly variant to transformations in the input. The most common and well-known way to obtain shift-invariance in convolutional architectures is max-pooling [4]. How- ever, full shift-invariance can only be achieved step-wise by applying max-pooling over several layers. A whole line o...

  3. [3]

    assumed to be useful for a particular learning task at hand

    MODEL AND MATHEMATICAL BACKGROUND We aim at learning orthogonal transformations encoding certain invariances of a class of signals which are known or Figure 2: Some examples of real (top) and imaginary (bottom) basis vectors learned from audio signals (time in seconds). assumed to be useful for a particular learning task at hand. To this end, we leverage ...

  4. [4]

    We use a batch size of 1000, and we sample 100k transformations per epoch, gener- ally picking random instances from the train set to be transformed

    TRAINING For all the experiments described below, we choose 256 complex basis vectors and train the model for 500 epochs with a learning rate of 1e-3. We use a batch size of 1000, and we sample 100k transformations per epoch, gener- ally picking random instances from the train set to be transformed. The training data is standardized, and 50% dropout is us...

  5. [5]

    for the audio experiments, and p = 2 for the MNIST experiment. In the alignment experiment, we also penalize the mean of norms of all basis vectors and the deviation of the individual basis vectors’ norms to the average norm over all basis vectors. In the MNIST experiment, the norm of all basis vectors is set to 0.4 after every batch. For in- formation ab...

  6. [6]

    Discovery of Repeated Themes and Sections

    EXPERIMENTS 5.1 Discovery of Repeated Themes and Sections In the MIREX task “Discovery of Repeated Themes and Sections”, 3 the performance of different algorithms to identify repeated (and possibly transposed) patterns in symbolic music and audio is tested. The commonly used JKUPDD dataset [6] contains 26 motifs, themes, and repeated sections annotated in...

  7. [7]

    Images (original)

    is adopted, which finds diagonals in a self-similarity matrix using a threshold. As we normalized the matrices to zero median, the threshold chosen in this experiment is close to zero (i.e., 0.01). 5.1.1 Results and Discussion Table 1 shows the results of the experiment. Using our method, we could slightly outperform the Gated Autoen- coder approach propos...

  8. [8]

    phase-difference

    CONCLUSION AND FUTURE WORK The empirical results in this work show that for music alignment, structure analysis, and invariant classification tasks, the features learned by the CAE have advantages over other features, like Chroma features, and features learned by a GAE. As opposed to Chroma features, the CAE features are transposition-invariant, and genera...

  9. [9]

    ACKNOWLEDGMENTS This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skodowsa-Curie grant agreement No. 765068. Monika D ¨orfler is supported by the Vienna Science and Technology Fund (WWTF) project SALSA (MA14-018)

  10. [10]

    Flexible and Robust Music Tracking

    Andreas Arzt. Flexible and Robust Music Tracking. PhD the- sis, Johannes Kepler University Linz, 2016

  11. [11]

    Audio-to-score alignment using transposition-invariant features

    Andreas Arzt and Stefan Lattner. Audio-to-score alignment using transposition-invariant features. In Proceedings of the International Conference on Music Information Retrieval (IS- MIR), Paris, France, 2018

  12. [12]

    Thierry Bertin-Mahieux and Daniel P. W. Ellis. Large-scale cover song recognition using the 2d fourier transform magni- tude. In Fabien Gouyon, Perfecto Herrera, Luis Gustavo Mar- tins, and Meinard M¨uller, editors, Proceedings of the 13th In- ternational Society for Music Information Retrieval Confer- ence, ISMIR 2012, Mosteiro S.Bento Da Vit ´oria, Port...

  13. [13]

    A theoretical analysis of feature pooling in visual recognition

    Y-Lan Boureau, Jean Ponce, and Yann LeCun. A theoretical analysis of feature pooling in visual recognition. In Johannes F¨urnkranz and Thorsten Joachims, editors,Proceedings of the 27th International Conference on Machine Learning (ICML- 10), June 21-24, 2010, Haifa, Israel , pages 111–118. Omni- press, 2010

  14. [14]

    Invariant scattering con- volution networks

    Joan Bruna and St ´ephane Mallat. Invariant scattering con- volution networks. IEEE Trans. Pattern Anal. Mach. Intell. , 35(8):1872–1886, 2013

  15. [15]

    Discovery of repeated themes and sections

    Tom Collins. Discovery of repeated themes and sections. http://www.music-ir.org/mirex/wiki/2017: Discovery_of_Repeated_Themes_%26_ Sections, 2017

  16. [16]

    Siarct-cfp: Improving precision and the discov- ery of inexact musical patterns in point-set representations

    Tom Collins, Andreas Arzt, Sebastian Flossmann, and Ger- hard Widmer. Siarct-cfp: Improving precision and the discov- ery of inexact musical patterns in point-set representations. In ISMIR, pages 549–554, 2013

  17. [17]

    Deformable convolutional net- works

    Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional net- works. In IEEE International Conference on Computer Vi- sion, ICCV 2017, Venice, Italy, October 22-29, 2017 , pages 764–773. IEEE Computer Society, 2017

  18. [18]

    Dieleman and B

    S. Dieleman and B. Schrauwen. End-to-end learning for mu- sic audio. In 2014 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP) , pages 6964– 6968, May 2014

  19. [19]

    MATCH: A music align- ment tool chest

    Simon Dixon and Gerhard Widmer. MATCH: A music align- ment tool chest. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR) , pages 492–497, London, UK, 2005

  20. [20]

    Ellis and Graham E

    Daniel P.W. Ellis and Graham E. Poliner. Identifying ‘cover songs’ with chroma features and dynamic programming beat tracking. In Proceedings of the IEEE International Confer- ence on Acoustics, Speech, and Signal Processing (ICASSP), volume 4, pages 1429–1432, Honolulu, Hawaii, USA, 2007

  21. [21]

    Mul- tipitch estimation of piano sounds using a new probabilistic spectral smoothness principle

    Valentin Emiya, Roland Badeau, and Bertrand David. Mul- tipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1643–1654, 2010

  22. [22]

    The vienna 4x22 piano corpus, 1999

    Werner Goebl. The vienna 4x22 piano corpus, 1999. http: //dx.doi.org/10.21939/4X22

  23. [23]

    Automatic alignment of music performances with structural differences

    Maarten Grachten, Martin Gasser, Andreas Arzt, and Gerhard Widmer. Automatic alignment of music performances with structural differences. In Proceedings of the International So- ciety for Music Information Retrieval Conference (ISMIR) , pages 607–612, Curitiba, Brazil, 2013

  24. [24]

    Dannenberg, and George Tzanetakis

    Ning Hu, Roger B. Dannenberg, and George Tzanetakis. Polyphonic audio matching and alignment for music retrieval. In Proceedings of the IEEE Workshop on Applications of Sig- nal Processing to Audio and Acoustics (WASPAA), New Paltz, NY , USA, 2003

  25. [25]

    Spatial transformer networks

    Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neu- ral Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebe...

  26. [26]

    A comparative study of tonal acoustic features for a symbolic level music- to-score alignment

    Cyril Joder, Slim Essid, and Ga ¨el Richard. A comparative study of tonal acoustic features for a symbolic level music- to-score alignment. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas, USA, 2010

  27. [27]

    Buhmann, and Marc Pollefeys

    Dmitry Laptev, Nikolay Savinov, Joachim M. Buhmann, and Marc Pollefeys. TI-POOLING: transformation-invariant pooling for feature learning in convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV , USA, June 27-30, 2016, pages 289–297. IEEE Computer Society, 2016

  28. [28]

    Courville, James Bergstra, and Yoshua Bengio

    Hugo Larochelle, Dumitru Erhan, Aaron C. Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of vari- ation. In Zoubin Ghahramani, editor, Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007 , volume 227 of AC...

  29. [29]

    Learning transposition-invariant interval features from sym- bolic music and audio

    Stefan Lattner, Maarten Grachten, and Gerhard Widmer. Learning transposition-invariant interval features from sym- bolic music and audio. In Proceedings of the 19th Interna- tional Society for Music Information Retrieval Conference, ISMIR 2018, Paris, France, September 23-27, 2018

  30. [30]

    Le, Will Y

    Quoc V . Le, Will Y . Zou, Serena Y . Yeung, and Andrew Y . Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20- 25 June 2011 , pages 3361–3368. IEEE Computer Society, 2011

  31. [31]

    Gabor convolutional networks

    Shangzhen Luan, Chen Chen, Baochang Zhang, Jungong Han, and Jianzhuang Liu. Gabor convolutional networks. IEEE Trans. Image Processing, 27(9):4357–4366, 2018

  32. [32]

    A mid-level representation for melody-based retrieval in audio collections.IEEE Transactions on Multime- dia, 10(8):1617–1625, 2008

    Matija Marolt. A mid-level representation for melody-based retrieval in audio collections.IEEE Transactions on Multime- dia, 10(8):1617–1625, 2008

  33. [33]

    Transform invariant auto-encoder

    Tadashi Matsuo, Hiroya Fukuhara, and Nobutaka Shimada. Transform invariant auto-encoder. In 2017 IEEE/RSJ Inter- national Conference on Intelligent Robots and Systems, IROS 2017, Vancouver, BC, Canada, September 24-28, 2017, pages 2359–2364. IEEE, 2017

  34. [34]

    Learning in- variant features by harnessing the aperture problem

    Roland Memisevic and Georgios Exarchakis. Learning in- variant features by harnessing the aperture problem. In Pro- ceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013 , volume 28 of JMLR Workshop and Conference Proceedings, pages 100–108. JMLR.org, 2013

  35. [35]

    Fundamentals of Music Processing

    Meinard M ¨uller. Fundamentals of Music Processing . Springer Verlag, 2015

  36. [36]

    Identifying poly- phonic patterns from audio recordings using music segmen- tation techniques

    Oriol Nieto and Morwaread M Farbood. Identifying poly- phonic patterns from audio recordings using music segmen- tation techniques. In Proc. of the 15th International Society for Music Information Retrieval Conference, pages 411–416, 2014

  37. [37]

    Bilinear models of natural images

    Bruno A Olshausen, Charles Cadieu, Jack Culpepper, and David K Warland. Bilinear models of natural images. InElec- tronic Imaging 2007 , pages 649206–649206. International Society for Optics and Photonics, 2007

  38. [38]

    Joint deep learning for pedestrian detection

    Wanli Ouyang and Xiaogang Wang. Joint deep learning for pedestrian detection. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pages 2056–2063. IEEE Computer Society, 2013

  39. [39]

    Schmidt, Andreas F

    Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas F. Ehmann, and Xavier Serra. End-to-end learning for music audio tagging at scale. In Proceedings of the 19th International Society for Music Information Retrieval Con- ference, ISMIR 2018, Paris, France, September 23-27, 2018, pages 637–644, 2018

  40. [40]

    FastDTW: Toward accurate dynamic time warping in linear time and space

    Stan Salvador and Philip Chan. FastDTW: Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis, 11(5):561–580, 2007

  41. [41]

    Rotation, scaling and deformation invariant scattering for texture discrimination

    Laurent Sifre and St ´ephane Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013 , pages 1233–1240. IEEE Computer Society, 2013

  42. [42]

    The intervalgram: an audio feature for large-scale melody recognition

    Thomas C Walters, David A Ross, and Richard F Lyon. The intervalgram: an audio feature for large-scale melody recognition. In Proc. of the 9th International Symposium on Computer Music Modeling and Retrieval (CMMR). Citeseer, 2012

  43. [43]

    Music pattern discovery with variable markov oracle: A unified ap- proach to symbolic and audio representations

    Cheng-i Wang, Jennifer Hsu, and Shlomo Dubnov. Music pattern discovery with variable markov oracle: A unified ap- proach to symbolic and audio representations. In Meinard M¨uller and Frans Wiering, editors, Proceedings of the 16th International Society for Music Information Retrieval Con- ference, ISMIR 2015, M ´alaga, Spain, October 26-30, 2015 , pages 17...

  44. [44]

    Discovering simple rules in complex data: A meta-learning algorithm and some surprising musical dis- coveries

    Gerhard Widmer. Discovering simple rules in complex data: A meta-learning algorithm and some surprising musical dis- coveries. Artificial Intelligence, 146(2):129–148, 2003

  45. [45]

    Worrall, Stephan J

    Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukham- betov, and Gabriel J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In 2017 IEEE Confer- ence on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017 , pages 7168–

  46. [46]

    IEEE Computer Society, 2017

  47. [47]

    High per- formance offline handwritten chinese character recognition using googlenet and directional feature maps

    Zhuoyao Zhong, Lianwen Jin, and Zecheng Xie. High per- formance offline handwritten chinese character recognition using googlenet and directional feature maps. In 13th In- ternational Conference on Document Analysis and Recogni- tion, ICDAR 2015, Nancy, France, August 23-26, 2015, pages 846–850. IEEE Computer Society, 2015

  48. [48]

    Ori- ented response networks

    Yanzhao Zhou, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Ori- ented response networks. In 2017 IEEE Conference on Com- puter Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017 , pages 4961–4970. IEEE Com- puter Society, 2017