Recognition: 2 theorem links · Lean Theorem
PHALAR: Phasors for Learned Musical Audio Representations
Pith reviewed 2026-05-12 01:07 UTC · model grok-4.3
The pith
PHALAR introduces pitch- and phase-equivariant biases via spectral pooling and complex-valued heads to set a new state of the art in musical stem retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PHALAR is a contrastive learning framework that sets a new state of the art in stem retrieval across MoisesDB, Slakh, and ChocoChorales. It employs a Learned Spectral Pooling layer and a complex-valued head to enforce pitch- and phase-equivariant inductive biases, delivering up to roughly 70% relative accuracy improvement with fewer than half the parameters of prior models and a 7x training speedup, while correlating more strongly with human coherence judgments than semantic baselines.
What carries the argument
The Learned Spectral Pooling layer together with a complex-valued head inside the contrastive objective, which inject pitch-equivariant and phase-equivariant inductive biases into the learned representations.
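The page does not specify these components mathematically. As an illustrative sketch only (the function name, the windowed weighted-average pooling rule, and all shapes below are assumptions, not the paper's implementation), a spectral pooling layer can be read as a learned weighted average over windows of complex STFT bins, which preserves phase rather than discarding it:

```python
import cmath

def spectral_pool(spectrum, weights, stride):
    """Hypothetical sketch of a learned spectral pooling layer:
    collapse each window of `stride` complex STFT bins into one bin
    using learned non-negative weights, keeping complex phase."""
    pooled = []
    for start in range(0, len(spectrum) - stride + 1, stride):
        window = spectrum[start:start + stride]
        w = weights[start:start + stride]
        norm = sum(w) or 1.0
        pooled.append(sum(wi * zi for wi, zi in zip(w, window)) / norm)
    return pooled

# toy check: pooling a constant spectrum returns the constant
assert spectral_pool([1 + 0j] * 8, [1.0] * 8, 4) == [1 + 0j, 1 + 0j]
```

Because this operation is complex-linear in the spectrum, rotating every input bin by a global factor e^{iθ} rotates every pooled bin by the same factor, which is the kind of phase-equivariance the review attributes to the design.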
If this is right
- Stem retrieval becomes more accurate and efficient across standard music separation benchmarks.
- The learned representations support zero-shot transfer to beat tracking and chord recognition tasks.
- Human perceptual coherence judgments align more closely with model similarity scores than with prior semantic embeddings.
- Training cost and model size decrease while performance rises, easing deployment in music production pipelines.
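Contrastive frameworks of this kind typically optimize an InfoNCE-style objective. A minimal stdlib sketch (a generic formulation over a precomputed similarity matrix; the loss, temperature, and batch layout are not PHALAR's actual choices):

```python
import math

def info_nce(sim_matrix, temperature=0.1):
    """Minimal InfoNCE loss: row i's positive is column i, all other
    columns are negatives. Generic sketch, not PHALAR's objective."""
    n = len(sim_matrix)
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim_matrix[i]]
        m = max(logits)  # log-sum-exp with max-shift for stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / n

# sharper positives give lower loss than an uninformative matrix
matched = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
uniform = [[0.0] * 4 for _ in range(4)]
assert info_nce(matched) < info_nce(uniform)
```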
Where Pith is reading between the lines
- Similar spectral pooling and complex-valued heads could be inserted into other contrastive audio models to improve temporal and harmonic sensitivity without increasing parameter count.
- The approach may generalize to speech separation or environmental sound retrieval where phase relationships carry critical information.
- If the equivariant biases prove robust, they could be combined with self-supervised objectives beyond contrastive learning to further reduce reliance on labeled stems.
Load-bearing premise
The pitch-equivariant and phase-equivariant biases introduced by the Learned Spectral Pooling layer and complex-valued head produce genuinely superior and generalizable stem retrieval performance rather than dataset-specific effects or artifacts of the evaluation protocol.
What would settle it
A controlled ablation on a held-out dataset in which removing the Learned Spectral Pooling layer and complex head causes retrieval accuracy to fall to the level of the prior semantic baseline.
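That test can be phrased as a single number: the fraction of the full model's gain over the baseline that survives the ablation. A hedged sketch (the function and all accuracy values are placeholders, not reported results):

```python
def gain_retained(acc_full, acc_ablated, acc_baseline):
    """Fraction of the full model's gain over the baseline that the
    ablated model keeps. Near 0.0: the removed component carried the
    gain; near 1.0: it did not. Hypothetical protocol, not from paper."""
    total = acc_full - acc_baseline
    if total == 0:
        return 0.0
    return (acc_ablated - acc_baseline) / total

# ablated model collapses to baseline: component was load-bearing
assert gain_retained(0.9, 0.5, 0.5) == 0.0
# ablated model matches full model: component contributed nothing
assert gain_retained(0.9, 0.9, 0.5) == 1.0
```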
Original abstract
Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to $\approx 70\%$ over the state-of-the-art while requiring $<50\%$ of the parameters and a 7$\times$ training speedup. By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes new retrieval state-of-the-art across MoisesDB, Slakh, and ChocoChorales, correlating significantly higher with human coherence judgment than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PHALAR, a contrastive framework for learning musical audio representations. It employs a Learned Spectral Pooling layer and a complex-valued head to enforce pitch-equivariant and phase-equivariant biases. The model is claimed to achieve up to approximately 70% relative accuracy improvement in stem retrieval over prior state-of-the-art while using less than 50% of the parameters and providing a 7x training speedup. New SOTA results are reported on MoisesDB, Slakh, and ChocoChorales, with higher correlation to human coherence judgments than semantic baselines. Zero-shot beat tracking and linear chord probing are used to demonstrate capture of broader musical structures.
Significance. If the performance claims hold under rigorous scrutiny, the work would represent a meaningful advance in efficient, musically structured audio representations by showing that targeted equivariance biases can yield both accuracy and efficiency gains in contrastive learning. The reported efficiency improvements and probing results on beat tracking and chords suggest broader utility beyond the primary retrieval task.
major comments (2)
- [Abstract and §4] Abstract and experimental results: the central claims of up to 70% relative accuracy gains and new SOTA across three datasets are presented without any reported details on baseline implementations, data splits, statistical significance testing, or ablation studies. This information is load-bearing for validating whether the gains arise from the proposed equivariance biases rather than evaluation artifacts or dataset-specific effects.
- [§3] §3 (Method): the Learned Spectral Pooling layer and complex-valued head are introduced as the source of the pitch- and phase-equivariant biases, but the manuscript provides insufficient mathematical specification (e.g., no explicit equations for the pooling operation or the complex head) to allow independent verification that these components indeed enforce the claimed equivariances rather than other effects.
minor comments (2)
- [Figures 3-5 and Table 2] Figure and table captions should explicitly state the evaluation metric (e.g., accuracy@K) and the exact human judgment protocol used for the coherence correlation analysis.
- [§5] The zero-shot beat tracking and chord probing sections would benefit from reporting the exact linear probe architectures and the number of runs for the reported correlations.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential significance of PHALAR. We address each major comment below and will submit a revised manuscript that incorporates the requested clarifications and expansions.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and experimental results: the central claims of up to 70% relative accuracy gains and new SOTA across three datasets are presented without any reported details on baseline implementations, data splits, statistical significance testing, or ablation studies. This information is load-bearing for validating whether the gains arise from the proposed equivariance biases rather than evaluation artifacts or dataset-specific effects.
Authors: We agree that greater experimental transparency is necessary to substantiate the claims. The current manuscript references the baselines and datasets in §4 but does not provide exhaustive implementation details or statistical tests. In the revision we will expand §4 with: explicit descriptions of all baseline models (including code-level hyperparameters and training protocols), the precise train/validation/test splits for MoisesDB, Slakh, and ChocoChorales, results of statistical significance tests (paired t-tests and bootstrap confidence intervals on retrieval accuracy), and dedicated ablation studies that isolate the Learned Spectral Pooling layer and complex-valued head. These additions will directly demonstrate that the reported gains derive from the equivariant biases. revision: yes
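The bootstrap confidence intervals promised here can be sketched in a few lines. This is a generic percentile bootstrap over per-query 0/1 retrieval outcomes, not the authors' code; the resample count, alpha, and seed are illustrative:

```python
import random

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for retrieval accuracy from a list of
    per-query 0/1 outcomes. Generic sketch of the promised test."""
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(
        sum(correct[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = accs[int(n_resamples * alpha / 2)]
    hi = accs[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi
```

Reporting such an interval per dataset, for both the full model and each ablation, would let readers see whether the claimed gains exceed resampling noise.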
- Referee: [§3] §3 (Method): the Learned Spectral Pooling layer and complex-valued head are introduced as the source of the pitch- and phase-equivariant biases, but the manuscript provides insufficient mathematical specification (e.g., no explicit equations for the pooling operation or the complex head) to allow independent verification that these components indeed enforce the claimed equivariances rather than other effects.
Authors: We acknowledge that the mathematical presentation in §3 requires greater explicitness. While the manuscript describes the components and their intended biases, it does not supply the full set of equations needed for independent verification. In the revised version we will insert the complete mathematical definitions: the precise formulation of the Learned Spectral Pooling operation (including the learned filter bank and spectral-domain pooling rule) and the complex-valued head (specifying the complex linear layers and phase-handling operations). We will also add a short derivation showing how these operations commute with pitch transposition and phase rotation, thereby confirming the equivariance properties. revision: yes
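The promised commutation argument has an elementary core. Assuming the head is complex-linear (an assumption for the sketch; the actual head may interleave nonlinearities, and the notation below is hypothetical), a global phase rotation of the input passes straight through it, and any similarity built on the modulus of the Hermitian inner product is invariant:

```latex
% Phase-equivariance of a complex-linear head h(z) = Wz:
h(e^{i\theta} z) = W\,(e^{i\theta} z) = e^{i\theta}\, W z = e^{i\theta}\, h(z).
% Hence a modulus-based similarity is invariant to global phase:
\bigl| \langle h(e^{i\theta} z_1),\, h(z_2) \rangle \bigr|
  = \bigl| e^{i\theta} \langle h(z_1),\, h(z_2) \rangle \bigr|
  = \bigl| \langle h(z_1),\, h(z_2) \rangle \bigr|.
```

The analogous pitch case would require showing that the pooling operation commutes with a shift along the log-frequency axis, which depends on the specific filter-bank parameterization the revision will spell out.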
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper introduces an empirical contrastive learning model (PHALAR) with architectural components for pitch- and phase-equivariance, validated through retrieval accuracy, human correlation, zero-shot beat tracking, and linear probing on external datasets (MoisesDB, Slakh, ChocoChorales). No mathematical derivation, uniqueness theorem, ansatz, or fitted-parameter prediction is presented that reduces to its own inputs by construction. All load-bearing claims are performance metrics obtained from training and evaluation protocols that are independent of the reported results. The framework is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (2)
- Learned Spectral Pooling layer: no independent evidence
- complex-valued head: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear. Paper passage: "By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases... maps temporal alignment to geometric rotation in the complex plane."
- IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction · tag: unclear
  The relation between the paper passage and the cited Recognition theorem is unclear. Paper passage: "We achieve this by shifting the representation space from real-valued magnitudes to complex-valued phasors... C=8... 8-tick periodicity is never mentioned."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.