SAME: A Semantically-Aligned Music Autoencoder
Pith reviewed 2026-05-20 07:54 UTC · model grok-4.3
The pith
SAME reaches 4096× temporal compression for music audio while preserving reconstruction quality and downstream generative performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SAME reaches a 4096× temporal compression ratio while maintaining reconstruction quality and downstream generative performance by combining a transformer-based backbone with semantic regularisation approaches, phase-aware reconstruction losses and improved discriminator designs. The architecture delivers substantial computational cost benefits through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants, a large SAME-L and a CPU-deployable SAME-S, are released in open-weights form.
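To make the headline ratio concrete, the arithmetic below converts a temporal compression factor into a latent frame rate and an overall reduction in the number of values. The 4096× factor comes from the claim above, and the latent width d=256 is quoted in the paper passage listed under the Lean theorems on this page. The 44.1 kHz sample rate, the stereo channel count as used here, and the 30-second clip length are illustrative assumptions, not values confirmed by this page.

```python
def latent_stats(sample_rate_hz, channels, temporal_ratio, latent_dim, clip_seconds):
    """Return latent frame rate and overall value reduction for one clip."""
    samples_per_channel = sample_rate_hz * clip_seconds
    latent_frames = samples_per_channel / temporal_ratio
    input_values = samples_per_channel * channels      # raw waveform values
    latent_values = latent_frames * latent_dim         # values in the latent sequence
    return {
        "frame_rate_hz": sample_rate_hz / temporal_ratio,
        "latent_frames": latent_frames,
        "overall_reduction": input_values / latent_values,
    }


# Assumed settings: 44.1 kHz stereo, 30 s clip; 4096 and 256 are from the page.
stats = latent_stats(sample_rate_hz=44100, channels=2,
                     temporal_ratio=4096, latent_dim=256, clip_seconds=30)
print(stats)
```

Under these assumptions the latent sequence runs at roughly 10.8 frames per second, and the overall reduction in stored values is 32×, because each latent frame still carries 256 channels. The sequence length seen by a downstream generative model is what shrinks by the full 4096×.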
What carries the argument
A transformer-based autoencoder backbone with semantic regularisation that aligns the latent space for both reconstruction and generative use.
If this is right
- High compression ratio yields substantial savings in memory and compute for both encoding and subsequent generative modeling.
- The SAME-S variant enables CPU deployment while keeping the same compression level.
- Open-weights release of both variants allows direct use in other audio generation pipelines.
- Phase-aware losses and improved discriminators help keep perceptual quality high despite the extreme reduction in temporal resolution.
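The phase-aware losses named in the list above are not specified on this page. As a rough illustration only, the sketch below combines a multi-resolution log-magnitude STFT term with a term that compares the time derivative of phase (a proxy for instantaneous frequency). The window sizes, the magnitude weighting of the phase term, and the equal weighting of the two terms are all assumptions; this is a generic construction, not the paper's formulation.

```python
import numpy as np


def stft(x, n_fft, hop):
    """Hann-windowed STFT; returns a complex array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def wrap(phase):
    """Wrap phase values into the interval (-pi, pi]."""
    return np.angle(np.exp(1j * phase))


def stft_loss(x, y, n_fft, hop, eps=1e-7):
    """Log-magnitude distance plus a magnitude-weighted phase-derivative distance."""
    X, Y = stft(x, n_fft, hop), stft(y, n_fft, hop)
    mag_x, mag_y = np.abs(X), np.abs(Y)
    log_mag = np.mean(np.abs(np.log(mag_x + eps) - np.log(mag_y + eps)))
    # Phase difference between consecutive frames, per frequency bin.
    dphi_x = wrap(np.diff(np.angle(X), axis=0))
    dphi_y = wrap(np.diff(np.angle(Y), axis=0))
    # Weight by reference magnitude so near-silent bins do not dominate (assumption).
    weights = mag_x[1:] / (mag_x[1:].sum() + eps)
    phase = np.sum(weights * np.abs(wrap(dphi_x - dphi_y)))
    return log_mag + phase


def multi_resolution_loss(x, y, resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Average the single-resolution loss over several (n_fft, hop) pairs."""
    return sum(stft_loss(x, y, n, h) for n, h in resolutions) / len(resolutions)


t = np.arange(8192) / 44100.0
reference = np.sin(2 * np.pi * 440.0 * t)
estimate = np.sin(2 * np.pi * 660.0 * t)
print(multi_resolution_loss(reference, reference), multi_resolution_loss(reference, estimate))
```

Comparing phase derivatives rather than raw phase keeps the term insensitive to a constant phase offset, which is one common motivation for this kind of loss.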
Where Pith is reading between the lines
- The same regularisation pattern could be tested on non-musical audio such as speech or environmental sound to check whether semantic alignment generalises.
- If the latents prove stable across different generative architectures, SAME could become a standard front-end for large-scale audio foundation models.
- The 4096× factor suggests that even higher ratios might be reachable by stacking additional semantic constraints.
Load-bearing premise
Semantic regularisation actually produces latents that remain useful for downstream generative models without introducing artifacts that degrade generation quality.
What would settle it
Train a standard generative model on the SAME latents and measure whether its output quality or diversity falls below that obtained from a comparable model using a lower-compression autoencoder.
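One way to score such a comparison is a Fréchet distance between embedding statistics of reference audio and generated audio, which is the quantity behind FAD-style metrics. The sketch below implements only the distance between two Gaussians fitted to embedding sets. The embedding model is not specified here, so the arrays are random stand-ins, and the function name is hypothetical.

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two embedding sets (rows = clips)."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Tr(sqrt(cov_a cov_b)) equals the sum of the square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Random stand-ins for embeddings of reference and generated clips (assumption).
rng = np.random.default_rng(0)
reference = rng.normal(size=(512, 8))
shifted = reference + 1.0  # same covariance, mean moved by 1 in each of 8 dimensions

print(frechet_distance(reference, reference))  # close to 0
print(frechet_distance(reference, shifted))    # close to 8
```

In the proposed test, the same distance would be computed once for a generator trained on SAME latents and once for a generator trained on a lower-compression autoencoder, each against the same reference set.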
Original abstract
Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically-Aligned Music autoEncoder), an autoencoder for stereo music and general audio that reaches a 4096× temporal compression ratio while maintaining reconstruction quality and downstream generative performance. We achieve this by combining a transformer-based backbone with a set of semantic regularisation approaches, phase-aware reconstruction losses and improved discriminator designs. The architecture delivers substantial computational cost benefits through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SAME, a transformer-based autoencoder for stereo music and general audio that achieves a 4096× temporal compression ratio. It combines semantic regularisation, phase-aware reconstruction losses, and improved discriminator designs to claim preservation of both reconstruction quality and utility for downstream generative models, with open-weight releases of a large variant (SAME-L) and a CPU-deployable small variant (SAME-S).
Significance. If the empirical claims are substantiated, the work would offer a practically useful high-compression latent representation for audio generation, delivering computational savings through both the extreme ratio and reliance on optimised transformer primitives. The open-weights release is a clear strength for reproducibility.
major comments (2)
- [Abstract] Abstract: the central claim that semantic regularisation 'maintains ... downstream generative performance' at 4096× compression lacks any reported generation metrics (FAD, CLAP, or listening-test scores), baselines, or ablations that isolate the regularisation terms from the transformer backbone and phase-aware losses; this directly bears on the weakest assumption that the regularisers do not over-constrain fine temporal/phase structure needed by downstream models.
- [§3] §3 (architecture and losses): without the exact weighting schedule or formulation of the semantic regularisation losses, it is impossible to assess whether they act as a strong classifier-style constraint that reduces latent expressiveness even when reconstruction metrics on the training distribution remain acceptable.
minor comments (1)
- The abstract would be strengthened by including at least one key quantitative result (e.g., a reconstruction or generation metric) to support the 'maintained quality' assertion.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript introducing SAME. We provide detailed responses to each major comment below and indicate the revisions we will make to address the concerns raised.
Point-by-point responses
- Referee: [Abstract] Abstract: the central claim that semantic regularisation 'maintains ... downstream generative performance' at 4096× compression lacks any reported generation metrics (FAD, CLAP, or listening-test scores), baselines, or ablations that isolate the regularisation terms from the transformer backbone and phase-aware losses; this directly bears on the weakest assumption that the regularisers do not over-constrain fine temporal/phase structure needed by downstream models.
  Authors: We appreciate the referee pointing out the need for stronger evidence on downstream performance. The manuscript reports reconstruction quality and demonstrates that the latents support generative modeling through their integration in downstream tasks, but we acknowledge the absence of specific quantitative generation metrics such as FAD or CLAP scores and explicit ablations. To address this, we will add these metrics along with ablations isolating the semantic regularisation in a new subsection of the experiments section. This will better substantiate that the regularisers preserve the fine structure needed by downstream models.
  Revision: yes
- Referee: [§3] §3 (architecture and losses): without the exact weighting schedule or formulation of the semantic regularisation losses, it is impossible to assess whether they act as a strong classifier-style constraint that reduces latent expressiveness even when reconstruction metrics on the training distribution remain acceptable.
  Authors: Section 3 details the transformer backbone, phase-aware losses, and semantic regularisation terms based on alignment with pre-trained embeddings. The weighting is described as following a scheduled ramp-up to balance terms. We agree that greater explicitness would help readers evaluate constraint strength versus expressiveness. We will revise §3 to include the precise mathematical formulations of each regularisation loss and the exact weighting schedule and coefficients used during training.
  Revision: partial
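The exchange above turns on two things the page does not specify: how an alignment-to-pre-trained-embeddings term is written, and how its weight is ramped up. The sketch below shows one plausible shape for both, purely to make the referee's question concrete. The cosine form of the loss, the linear ramp, and the values ramp_steps=10_000 and max_weight=0.1 are hypothetical and are not taken from the paper.

```python
import math


def ramp_weight(step, ramp_steps, max_weight):
    """Linear ramp from 0 to max_weight over ramp_steps, then constant."""
    if ramp_steps <= 0:
        return max_weight
    return max_weight * min(1.0, max(0.0, step / ramp_steps))


def cosine_alignment_loss(latent, target, eps=1e-12):
    """1 - cosine similarity between a projected latent and a frozen embedding."""
    dot = sum(a * b for a, b in zip(latent, target))
    norm = math.sqrt(sum(a * a for a in latent)) * math.sqrt(sum(b * b for b in target))
    return 1.0 - dot / (norm + eps)  # eps guards against zero-length vectors


def total_loss(recon_loss, latent, target, step, ramp_steps=10_000, max_weight=0.1):
    """Reconstruction loss plus the ramped alignment term (hypothetical weighting)."""
    weight = ramp_weight(step, ramp_steps, max_weight)
    return recon_loss + weight * cosine_alignment_loss(latent, target)


# Early in training the alignment term is switched off; later it reaches full weight.
print(total_loss(1.0, [1.0, 0.0], [0.0, 1.0], step=0))
print(total_loss(1.0, [1.0, 0.0], [0.0, 1.0], step=10_000))
```

With a schedule like this, the size of max_weight relative to the reconstruction terms is what decides whether the regulariser behaves as a mild prior or as the strong constraint the referee is worried about, which is why the exact coefficients matter.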
Circularity Check
No significant circularity; empirical architecture with experimental validation
The paper introduces SAME as an empirical neural audio codec using a transformer backbone, semantic regularisation, phase-aware losses, and discriminator improvements to achieve 4096× compression. No closed-form derivations, first-principles predictions, or fitted parameters are presented as outputs that reduce to the inputs by construction. Claims rest on reported reconstruction quality and downstream generative performance from experiments rather than self-referential equations or self-citation chains that bear the central load. The work is self-contained against external benchmarks via open-weights release and standard evaluation metrics.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: SAME consists of: 1. A query-based transformer resampling block (TRB)... 2. A bottleneck regularised for generative tractability... 3. Improved multi-resolution STFT... phase-derivative losses
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We target Dt=4096... d=256... soft-normalisation... Lkl... Ldiff... Lsem, Lcon
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.