SAME: A Semantically-Aligned Music Autoencoder

CJ Carr; Jordi Pons; Josiah Taylor; Julian D. Parker; Matthew Rice; Zachary Zukowski; Zach Evans

arxiv: 2605.18613 · v1 · pith:AZLQ6BNNnew · submitted 2026-05-18 · 💻 cs.SD · cs.AI

Pith reviewed 2026-05-20 07:54 UTC · model grok-4.3

keywords: audio autoencoder · music compression · semantic regularisation · transformer backbone · latent representations · generative audio models · phase-aware loss

The pith

SAME reaches 4096 times temporal compression for music audio while preserving reconstruction quality and generative performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAME, an autoencoder for stereo music and general audio that compresses the temporal dimension by a factor of 4096. It does so by using a transformer-based backbone together with semantic regularisation, phase-aware reconstruction losses, and improved discriminator designs. The resulting latent representations support both accurate reconstruction of the input and strong performance when used inside downstream generative models. A reader would care because such extreme compression lowers the computational cost of working with audio sequences at scale.

Core claim

SAME reaches a 4096× temporal compression ratio while maintaining reconstruction quality and downstream generative performance by combining a transformer-based backbone with semantic regularisation approaches, phase-aware reconstruction losses and improved discriminator designs. The architecture delivers substantial computational cost benefits through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants, a large SAME-L and a CPU-deployable SAME-S, are released in open-weights form.
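
To make the headline ratio concrete, the sketch below converts a temporal compression factor into a latent sequence length and frame rate. The 44.1 kHz sample rate and the three-minute clip are assumptions chosen for illustration (the summary above does not state the sample rate SAME operates at), and `latent_frames` is a hypothetical helper, not part of any released code.

```python
import math

def latent_frames(num_samples: int, compression: int = 4096) -> int:
    """Number of latent frames needed to cover num_samples audio samples."""
    return math.ceil(num_samples / compression)

SAMPLE_RATE = 44_100   # assumed sample rate in Hz, not stated in the summary
CLIP_SECONDS = 180     # assumed clip length in seconds

num_samples = SAMPLE_RATE * CLIP_SECONDS
frames = latent_frames(num_samples)
frames_per_second = SAMPLE_RATE / 4096

print(f"{num_samples} samples -> {frames} latent frames")
print(f"{frames_per_second:.2f} latent frames per second")
```

Under these assumptions a three-minute clip of 7,938,000 samples per channel becomes 1938 latent frames, or roughly 10.8 frames per second.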

What carries the argument

Transformer-based autoencoder backbone with semantic regularisation that aligns the latent space for both reconstruction and generative use.

If this is right

High compression ratio yields substantial savings in memory and compute for both encoding and subsequent generative modeling.
The SAME-S variant enables CPU deployment while keeping the same compression level.
Open-weights release of both variants allows direct use in other audio generation pipelines.
Phase-aware losses and improved discriminators help keep perceptual quality high despite the extreme reduction in temporal resolution.
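
The first point in this list can be illustrated with a back-of-the-envelope sketch. It assumes the downstream generative model uses full self-attention, whose cost grows with the square of the sequence length. That assumption, the 2048× baseline, and the clip length are all illustrative and not taken from the paper.

```python
def attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by one full self-attention layer."""
    return seq_len * seq_len

def relative_attention_cost(ratio_a: int, ratio_b: int, num_samples: int) -> float:
    """How many times more attention pairs ratio_a needs than ratio_b."""
    len_a = -(-num_samples // ratio_a)  # ceiling division
    len_b = -(-num_samples // ratio_b)
    return attention_pairs(len_a) / attention_pairs(len_b)

NUM_SAMPLES = 4096 * 1000   # assumed clip length, chosen to divide evenly
BASELINE_RATIO = 2048       # hypothetical lower-compression autoencoder

cost = relative_attention_cost(BASELINE_RATIO, 4096, NUM_SAMPLES)
print(f"A {BASELINE_RATIO}x codec needs {cost:.1f}x the attention pairs of a 4096x codec")
```

Halving the sequence length quarters the number of attention pairs, so in this sketch the 2048× baseline needs four times as many as the 4096× codec. Real savings depend on the model's width, depth and non-attention layers, none of which this sketch models.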

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regularisation pattern could be tested on non-musical audio such as speech or environmental sound to check whether semantic alignment generalises.
If the latents prove stable across different generative architectures, SAME could become a standard front-end for large-scale audio foundation models.
The 4096× factor suggests that even higher ratios might be reachable by stacking additional semantic constraints.

Load-bearing premise

Semantic regularisation actually produces latents that remain useful for downstream generative models without introducing artifacts that degrade generation quality.

What would settle it

Train a standard generative model on the SAME latents and measure whether its output quality or diversity falls below that obtained from a comparable model using a lower-compression autoencoder.
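
One common way to quantify such a comparison is a Fréchet distance between embedding statistics of reference and generated audio, the quantity behind the FAD metric named in the referee report further down. As a sketch, with $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denoting the mean and covariance of the reference and generated embeddings (symbols introduced here for illustration, not taken from the paper):

```latex
\mathrm{FD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate closer distributions. The choice of embedding model and evaluation set is left open here, since the summary does not specify either.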

Figures

Figures reproduced from arXiv: 2605.18613 by CJ Carr, Jordi Pons, Josiah Taylor, Julian D. Parker, Matthew Rice, Zachary Zukowski, Zach Evans.

**Figure 2.** Embedding interleaving in encoder-mode TRB (stride …

**Figure 3.** Attention masks for a 12-embedding interleaved sequence (4 segments of …

Original abstract

Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically-Aligned Music autoEncoder), an autoencoder for stereo music and general audio that reaches a 4096× temporal compression ratio while maintaining reconstruction quality and downstream generative performance. We achieve this by combining a transformer-based backbone with a set of semantic regularisation approaches, phase-aware reconstruction losses and improved discriminator designs. The architecture delivers substantial computational cost benefits through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SAME reaches 4096× compression on music via a transformer backbone plus semantic regularization and phase losses, with open weights released, but the abstract gives no metrics or ablations to back the generative performance claim.

Hey, quick take on the SAME paper. The headline item is a transformer-based autoencoder for stereo music that hits 4096 times temporal compression while claiming to keep both reconstruction quality and downstream generative utility, plus they ship open weights for a large version and a smaller CPU one. That compression ratio plus the efficiency of transformer primitives is the practical angle worth noting for anyone scaling latent audio models.

What the work actually does is combine a transformer backbone with semantic regularization, phase-aware losses, and discriminator changes in a music-specific setup. Earlier neural codecs exist, but this particular mix at such high ratio for music appears new in the framing they give. Releasing both variants openly is straightforward and lets others test the latents directly, which helps the computational cost story they emphasize. The open release and focus on real deployment costs are the parts that feel solid.

The softer spot is the missing evidence. The abstract asserts maintained generative performance but shows no numbers, no baselines, and no ablation isolating how the semantic terms affect generation metrics versus plain reconstruction. The stress-test concern lands here: those regularizers could push the latent space toward coarse categories and drop the fine temporal or phase details that music generation needs, even if training-set reconstruction holds. Without those results visible, the central claim stays unverified.

This is aimed at people building latent generative models for music who need cheaper encoders. A reader working on diffusion or similar would get the architecture details and weights to try, provided the quality numbers check out in the full text.

I would send it for peer review. The compression level and open release give it enough substance for referees to engage, even if the evaluation section needs strengthening.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SAME, a transformer-based autoencoder for stereo music and general audio that achieves a 4096× temporal compression ratio. It combines semantic regularisation, phase-aware reconstruction losses, and improved discriminator designs to claim preservation of both reconstruction quality and utility for downstream generative models, with open-weight releases of a large variant (SAME-L) and a CPU-deployable small variant (SAME-S).

Significance. If the empirical claims are substantiated, the work would offer a practically useful high-compression latent representation for audio generation, delivering computational savings through both the extreme ratio and reliance on optimised transformer primitives. The open-weights release is a clear strength for reproducibility.

major comments (2)

[Abstract] The central claim that semantic regularisation 'maintains ... downstream generative performance' at 4096× compression lacks any reported generation metrics (FAD, CLAP, or listening-test scores), baselines, or ablations that isolate the regularisation terms from the transformer backbone and phase-aware losses; this directly bears on the weakest assumption that the regularisers do not over-constrain fine temporal/phase structure needed by downstream models.
[§3] (architecture and losses) Without the exact weighting schedule or formulation of the semantic regularisation losses, it is impossible to assess whether they act as a strong classifier-style constraint that reduces latent expressiveness even when reconstruction metrics on the training distribution remain acceptable.

minor comments (1)

The abstract would be strengthened by including at least one key quantitative result (e.g., a reconstruction or generation metric) to support the 'maintained quality' assertion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript introducing SAME. We provide detailed responses to each major comment below and indicate the revisions we will make to address the concerns raised.

Referee: [Abstract] The central claim that semantic regularisation 'maintains ... downstream generative performance' at 4096× compression lacks any reported generation metrics (FAD, CLAP, or listening-test scores), baselines, or ablations that isolate the regularisation terms from the transformer backbone and phase-aware losses; this directly bears on the weakest assumption that the regularisers do not over-constrain fine temporal/phase structure needed by downstream models.

Authors: We appreciate the referee pointing out the need for stronger evidence on downstream performance. The manuscript reports reconstruction quality and demonstrates that the latents support generative modeling through their integration in downstream tasks, but we acknowledge the absence of specific quantitative generation metrics such as FAD or CLAP scores and explicit ablations. To address this, we will add these metrics along with ablations isolating the semantic regularisation in a new subsection of the experiments section. This will better substantiate that the regularisers preserve the fine structure needed by downstream models.

Revision: yes

Referee: [§3] (architecture and losses) Without the exact weighting schedule or formulation of the semantic regularisation losses, it is impossible to assess whether they act as a strong classifier-style constraint that reduces latent expressiveness even when reconstruction metrics on the training distribution remain acceptable.

Authors: Section 3 details the transformer backbone, phase-aware losses, and semantic regularisation terms based on alignment with pre-trained embeddings. The weighting is described as following a scheduled ramp-up to balance terms. We agree that greater explicitness would help readers evaluate constraint strength versus expressiveness. We will revise §3 to include the precise mathematical formulations of each regularisation loss and the exact weighting schedule and coefficients used during training.

Revision: partial
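
The "scheduled ramp-up" mentioned in the authors' response to the §3 comment is not specified anywhere on this page. Purely as an illustration of what such a schedule can look like, the sketch below ramps a loss weight linearly from zero. The linear shape, the 10,000-step ramp, the final weight of 0.5, and both function names are hypothetical and are not the paper's values.

```python
def ramp_weight(step: int, ramp_steps: int, max_weight: float) -> float:
    """Linearly ramp a loss weight from 0 to max_weight over ramp_steps steps."""
    if ramp_steps <= 0:
        return max_weight
    return max_weight * min(1.0, max(0, step) / ramp_steps)

def total_loss(recon: float, sem: float, step: int) -> float:
    """Combine a reconstruction term with a ramped semantic term."""
    # Hypothetical settings: 10,000 ramp steps, final semantic weight 0.5.
    return recon + ramp_weight(step, 10_000, 0.5) * sem

for step in (0, 5_000, 10_000, 20_000):
    print(step, ramp_weight(step, 10_000, 0.5))
```

A schedule like this lets reconstruction dominate early training and phases the regulariser in later. Whether it over-constrains the latents is exactly what the referee says cannot be judged without the actual coefficients.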

Circularity Check

0 steps flagged

No significant circularity; empirical architecture with experimental validation

The paper introduces SAME as an empirical neural audio codec using a transformer backbone, semantic regularisation, phase-aware losses, and discriminator improvements to achieve 4096× compression. No closed-form derivations, first-principles predictions, or fitted parameters are presented as outputs that reduce to the inputs by construction. Claims rest on reported reconstruction quality and downstream generative performance from experiments rather than self-referential equations or self-citation chains that bear the central load. The work is self-contained against external benchmarks via open-weights release and standard evaluation metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unstated assumption that semantic regularization preserves generative utility at extreme compression ratios; no free parameters or invented entities are described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

Paper passage: SAME consists of: 1. A query-based transformer resampling block (TRB)... 2. A bottleneck regularised for generative tractability... 3. Improved multi-resolution STFT... phase-derivative losses

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear

Paper passage: We target Dt=4096... d=256... soft-normalisation... Lkl... Ldiff... Lsem, Lcon

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” inProc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2022

work page 2022

[2] [2]

SoundStream: An end-to-end neural audio codec,

N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An end-to-end neural audio codec,”IEEE/ACM Trans. Audio, Speech, Language Process., vol. 30, pp. 495–507, 2022

work page 2022

[3] [3]

High fidelity neural audio compression,

A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,”Trans. Mach. Learning Res., 2023

work page 2023

[4] [4]

High-fidelity audio compression with improved RVQGAN,

R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved RVQGAN,” inAdvances in Neural Inform. Process. Syst., 2023

work page 2023

[5] [5]

Neural discrete representation learning,

A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Advances in Neural Inform. Process. Syst., 2017

work page 2017

[6] [6]

AudioLM: A language modeling approach to audio generation,

Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi, and N. Zeghidour, “AudioLM: A language modeling approach to audio generation,”IEEE/ACM Trans. Audio, Speech, Language Process., vol. 31, pp. 2523–2533, 2023

work page 2023

[7] [7]

Simple and controllable music generation,

J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” inAdvances in Neural Inform. Process. Syst., 2023. 3https://stability-ai.github.io/SAME 9

work page 2023

[8] [8]

Stable Audio Open,

Z. Evans, J. D. Parker, C. J. Carr, Z. Zukowski, J. Taylor, and J. Pons, “Stable Audio Open,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., 2025

work page 2025

[9] [9]

Back to ear: Perceptually driven high fidelity music reconstruction,

K. Wang, Z. Wu, D. Zhou, R. Lin, J. Dai, and T. Jiang, “Back to ear: Perceptually driven high fidelity music reconstruction,”arXiv preprint arXiv:2509.14912, 2025

work page arXiv 2025

[10] [10]

HILCodec: High-fidelity and lightweight neural audio codec,

S. Ahn, B. J. Woo, M. H. Han, C. Moon, and N. S. Kim, “HILCodec: High-fidelity and lightweight neural audio codec,”IEEE J. Sel. Topics Signal Process., vol. 18, no. 8, pp. 1517–1530, 2024

work page 2024

[11] [11]

Music2Latent: Consistency autoencoders for latent audio compression,

M. Pasini, S. Lattner, and G. Fazekas, “Music2Latent: Consistency autoencoders for latent audio compression,” inProc. Int. Soc. Music Inform. Retrieval Conf., 2024

work page 2024

[12] [12]

Music2Latent2: Audio compression with summary embeddings and autoregressive decoding,

——, “Music2Latent2: Audio compression with summary embeddings and autoregressive decoding,” inProc. IEEE Int. Conf. Acoustics, Speech and Signal Process., 2025

work page 2025

[13] [13]

CoDiCodec: Unifying continuous and discrete compressed representations of audio,

——, “CoDiCodec: Unifying continuous and discrete compressed representations of audio,” inProc. Int. Soc. Music Inform. Retrieval Conf., 2025

work page 2025

[14] [14]

Scaling transformers for low-bitrate high-quality speech coding,

J. D. Parker, A. Smirnov, J. Pons, C. J. Carr, Z. Zukowski, Z. Evans, and X. Liu, “Scaling transformers for low-bitrate high-quality speech coding,” inProc. Int. Conf. Learning Representations, 2025

work page 2025

[15] [15]

TS3-Codec: Transformer-based simple streaming single codec,

H. Wu, N. Kanda, S. E. Eskimez, and J. Li, “TS3-Codec: Transformer-based simple streaming single codec,” inProc. Interspeech, 2025

work page 2025

[16] [16]

ALMTokenizer: A low-bitrate and semantic-rich audio codec tokenizer for audio language modeling,

D. Yang, S. Liu, H. Guo, J. Zhao, Y. Wang, H. Wang, Z. Ju, X. Liu, X. Chen, X. Tan, X. Wu, and H. Meng, “ALMTokenizer: A low-bitrate and semantic-rich audio codec tokenizer for audio language modeling,” inProc. Int. Conf. Machine Learning, 2025

work page 2025

[17] [17]

SpeechTokenizer: Unified speech tokenizer for speech language models,

X. Zhang, D. Zhang, S. Li, Y. Zhou, and X. Qiu, “SpeechTokenizer: Unified speech tokenizer for speech language models,” inProc. Int. Conf. Learning Representations, 2024

work page 2024

[18] [18]

Moshi: a speech-text foundation model for real-time dialogue

A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,”arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

FunCodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec,

Z. Du, S. Zhang, K. Hu, and S. Zheng, “FunCodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec,” inProc. IEEE Int. Conf. Acoustics, Speech and Signal Process., 2024

work page 2024

[20] [20]

An image is worth 32 tokens for reconstruction and generation,

Q. Yu, M. Weber, X. Deng, X. Shen, D. Cremers, and L.-C. Chen, “An image is worth 32 tokens for reconstruction and generation,” inAdvances in Neural Inform. Process. Syst., 2024

work page 2024

[21] [21]

Perceiver: General perception with iterative attention,

A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira, “Perceiver: General perception with iterative attention,” inProc. Int. Conf. Machine Learning, 2021

work page 2021

[22] [22]

Differential Transformer,

T. Ye, L. Dong, Y. Xia, Y. Sun, Y. Zhu, G. Huang, and F. Wei, “Differential Transformer,” inProc. Int. Conf. Learning Representations, 2025

work page 2025

[23] [23]

RoFormer: Enhanced transformer with Rotary Position Embedding,

J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “RoFormer: Enhanced transformer with Rotary Position Embedding,”Neurocomputing, vol. 568, p. 127063, 2024

work page 2024

[24] [24]

Transformers without normalization,

J. Zhu, X. Chen, K. He, Y. LeCun, and Z. Liu, “Transformers without normalization,” inProc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, 2025

work page 2025

[25] [25]

LiteRT: On-device runtime for cross-platform machine learning inference,

Google, “LiteRT: On-device runtime for cross-platform machine learning inference,” https://ai.google. dev/edge/litert, 2024, accessed 2026

work page 2024

[26] [26]

Diffusion Transformers with Representation Autoencoders

B. Zheng, N. Ma, S. Tong, and S. Xie, “Diffusion transformers with representation autoencoders,” arXiv preprint arXiv:2510.11690, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27]

J. Heek, E. Hoogeboom, T. Mensink, and T. Salimans, “Unified latents (UL): How to train your latents,” arXiv preprint arXiv:2602.17270, 2026.

[28]

R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on Generative Adversarial Networks with multi-resolution spectrogram,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., 2020.

[29]

S. Schwär and M. Müller, “Multi-scale spectral loss revisited,” IEEE Signal Process. Lett., vol. 30, pp. 1712–1716, 2023.

[30]

A. Jolicoeur-Martineau, “The relativistic discriminator: A key element missing from standard GAN,” in Proc. Int. Conf. Learning Representations, 2019.

[31]

T. Q. Nguyen, “Near-perfect-reconstruction pseudo-QMF banks,” IEEE Trans. Signal Process., vol. 42, no. 1, pp. 65–76, 1994.

[32]

H. Wang, S. Suri, Y. Ren, H. Chen, and A. Shrivastava, “LARP: Tokenizing videos with a learned autoregressive generative prior,” in Proc. Int. Conf. Learning Representations, 2025.

[33]

Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, and M. Nickel, “Flow Matching for generative modeling,” in Proc. Int. Conf. Learning Representations, 2023.

[34]

A. Cohen, I. Daubechies, and J.-C. Feauveau, “Biorthogonal bases of compactly supported wavelets,” Comm. Pure Appl. Math., vol. 45, no. 5, pp. 485–560, 1992.

[35]

B. Zhang, F. Moiseev, J. Ainslie, P. Suganthan, M. Ma, S. Bhupatiraju, F. Lebron, O. Firat, A. Joulin, and Z. Dong, “Encoder-decoder Gemma: Improving the quality-efficiency trade-off via adaptation,” arXiv preprint arXiv:2504.06225, 2025.

[36]

K. Liang, L. Chen, B. Liu, and Q. Liu, “Cautious optimizers: Improving training with one line of code,” arXiv preprint arXiv:2411.16085, 2024.

[37]

Z. Evans, J. D. Parker, C. J. Carr, Z. Zukowski, J. Taylor, and J. Pons, “Long-form music generation with latent diffusion,” in Proc. Int. Soc. Music Inform. Retrieval Conf., 2024.

[38]

I. Manco, B. Weck, S. Doh, M. Won, Y. Zhang, D. Bogdanov, Y. Wu, K. Chen, P. Tovstogan, E. Benetos, E. Quinton, G. Fazekas, and J. Nam, “The Song Describer Dataset: a corpus of audio captions for music-and-language evaluation,” in Machine Learning for Audio Workshop, NeurIPS, 2023.

[39]

A. Gui, H. Gamper, S. Braun, and D. Emmanouilidou, “Adapting Fréchet Audio Distance for generative music evaluation,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Process., 2024.

[40]

D. Zhu and Z. Li, “MuQ-Eval: An open-source per-sample quality metric for AI music generation evaluation,” arXiv preprint arXiv:2603.22677, 2026.

[41]

J. Gong, Y. Song, W. Zhao, S. Wang, S. Xu, J. Guo, and X. Yang, “ACE-Step 1.5: Pushing the boundaries of open-source music generation,” arXiv preprint arXiv:2602.00744, 2026.