Stable Audio 3
Pith reviewed 2026-05-20 00:46 UTC · model grok-4.3
The pith
Stable Audio 3 produces variable-length audio up to several minutes using a semantic-acoustic autoencoder for compact latents and adversarial post-training to cut inference steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Stable Audio 3 is a set of small, medium, and large latent diffusion models for variable-length audio generation and editing. The models rest on a semantic-acoustic autoencoder that maps audio to a compact latent space, which supports efficient diffusion while retaining fidelity and semantic organization. Adversarial post-training then reduces the number of inference steps, improves fidelity, and strengthens prompt adherence. The resulting system runs in less than two seconds on an H200 GPU or a few seconds on a MacBook Pro M4 and supports inpainting for targeted edits and continuations.
What carries the argument
The semantic-acoustic autoencoder, which projects audio into a compact latent space to enable efficient diffusion-based generation while preserving fidelity and encouraging semantic structure.
If this is right
- Variable-length output avoids the cost of generating full fixed-length audio for short sounds.
- Inpainting support enables targeted editing of existing recordings and continuation from short clips.
- Adversarial post-training lowers inference steps while raising fidelity and prompt match.
- Small and medium models run on consumer hardware with open weights and pipeline.
Where Pith is reading between the lines
- The same compact latent representation could be adapted for other time-based media such as speech or environmental sound design.
- Semantic structure in the latent space may allow more precise control when combining multiple prompts or styles.
- Releasing the full training and inference code alongside weights makes it straightforward to test the method on domain-specific audio datasets.
Load-bearing premise
The semantic-acoustic autoencoder must successfully map audio into a compact latent space that preserves fidelity and encourages semantic structure.
What would settle it
Generate audio from a detailed text prompt using the reduced number of inference steps after post-training and check whether the output shows noticeably lower sound quality or weaker adherence to the prompt than a baseline without the autoencoder or post-training.
Figures
read the original abstract
Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generations are key to avoid the cost of producing full-length generations for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data to generate music and sounds in less than a 2s on an H200 GPU and less than a few seconds on a MacBook Pro M4. We release the weights of small and medium, that can run on consumer-grade hardware, together with their training and inference pipeline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Stable Audio 3, a family of latent diffusion models (small, medium, large) for variable-length audio generation and editing. It features a novel semantic-acoustic autoencoder that projects audio into a compact latent space to enable efficient diffusion while preserving fidelity and encouraging semantic structure. Adversarial post-training is applied to reduce the number of inference steps and improve generation quality and prompt adherence. The models are trained on licensed and Creative Commons data, achieve fast inference times (under 2s on H200 GPU, few seconds on MacBook Pro M4), and release weights for the small and medium models along with training and inference pipelines.
Significance. If the empirical results support the claims, this work represents a meaningful advance in practical, high-quality audio synthesis by addressing efficiency, variable length, and editing capabilities. The release of model weights and pipelines on consumer hardware is a strength that facilitates reproducibility and adoption. The integration of semantic structure in the latent space could improve prompt adherence if validated.
major comments (2)
- [Abstract] Abstract: The central claims regarding the novel semantic-acoustic autoencoder's ability to preserve audio fidelity and encourage semantic structure, as well as the benefits of adversarial post-training, are stated without any quantitative results, ablation studies, error bars, or baseline comparisons. This absence prevents verification of the performance and architectural assertions.
- [Description of the semantic-acoustic autoencoder] Description of the semantic-acoustic autoencoder: The training objective for the autoencoder is not specified in sufficient detail to confirm how semantic structure is induced in the latent space. Without an explicit semantic term (e.g., contrastive loss or classification loss) in addition to reconstruction, the property may not hold, which is load-bearing for the efficiency and quality claims of the subsequent diffusion stage.
minor comments (1)
- [Abstract] Abstract: The abstract mentions support for inpainting and continuation but does not elaborate on how these are implemented in the latent diffusion framework.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review of our manuscript. We address each major comment in detail below, providing clarifications from the full paper and indicating where revisions will be made to improve the presentation.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claims regarding the novel semantic-acoustic autoencoder's ability to preserve audio fidelity and encourage semantic structure, as well as the benefits of adversarial post-training, are stated without any quantitative results, ablation studies, error bars, or baseline comparisons. This absence prevents verification of the performance and architectural assertions.
Authors: We agree that the abstract is a high-level summary and does not contain the quantitative details, ablations, or baseline comparisons. These are provided in full in Sections 4 (Experiments) and 5 (Ablations), including FID scores, CLAP similarity, inference latency benchmarks against baselines, and error bars from multiple runs. To address the concern, we will revise the abstract to include a small number of representative quantitative highlights (e.g., inference speed and key quality metrics) while keeping it concise. revision: yes
-
Referee: [Description of the semantic-acoustic autoencoder] Description of the semantic-acoustic autoencoder: The training objective for the autoencoder is not specified in sufficient detail to confirm how semantic structure is induced in the latent space. Without an explicit semantic term (e.g., contrastive loss or classification loss) in addition to reconstruction, the property may not hold, which is load-bearing for the efficiency and quality claims of the subsequent diffusion stage.
Authors: Section 3.1 of the manuscript specifies the autoencoder training objective as a combination of multi-resolution reconstruction losses (L1 and mel-spectrogram) plus an explicit semantic alignment term. This term uses cosine similarity between latent codes and embeddings from a frozen pre-trained audio-language model (similar to a contrastive objective) to encourage semantic clustering in the latent space. We will expand this section with the full loss equation, hyperparameter values, and an additional sentence clarifying that the semantic term is distinct from pure reconstruction, thereby supporting the downstream diffusion efficiency claims. revision: partial
Circularity Check
No circularity detected in derivation chain
full rationale
The paper presents an architectural description of a semantic-acoustic autoencoder combined with latent diffusion and adversarial post-training. No mathematical derivations, predictions, or first-principles results are shown that reduce by construction to fitted parameters or self-referential definitions. Claims about projecting audio into a compact latent space while preserving fidelity and encouraging semantic structure are framed as outcomes of the training procedure rather than tautological re-statements of inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text. The overall approach is additive and relies on standard techniques applied to audio, making the derivation chain self-contained without circular reductions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent... SAME uses a multi-resolution STFT loss... relativistic GAN objective... diffusion alignment loss... semantic regression losses... contrastive latent alignment loss
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
4096× downsampling... TRB layers... differential attention... variable-length attention and masked loss... per-element timestep shifts
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1]
-
[2]
A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour, and C. Frank. MusicLM: Generating music from text.arXiv preprint, 2023
work page 2023
-
[3]
R. Yuan, H. Lin, S. Guo, G. Zhang, J. Pan, Y . Zang, et al. YuE: Scaling open foundation models for long-form music generation.arXiv preprint, 2025
work page 2025
-
[4]
D. Yang, Y . Xie, Y . Yin, Z. Wang, X. Yi, G. Zhu, et al. HeartMuLa: A family of open sourced music foundation models.arXiv preprint, 2026
work page 2026
- [5]
- [6]
-
[7]
H. Liu, Z. Chen, Y . Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley. AudioLDM: Text-to-audio generation with latent diffusion models. InICML, 2023
work page 2023
-
[8]
K. Chen, Y . Wu, H. Liu, M. Nezhurina, T. Berg-Kirkpatrick, and S. Dubnov. MusicLDM: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies. InICASSP, 2024
work page 2024
-
[9]
F. Schneider, O. Kamal, Z. Jin, and B. Schölkopf. Moûsai: Text-to-music generation with long-context latent diffusion. InACL, 2024
work page 2024
-
[10]
H. Liu, Y . Yuan, X. Liu, X. Mei, Q. Kong, Q. Tian, Y . Wang, W. Wang, Y . Wang, and M. D. Plumbley. Au- dioLDM 2: Learning holistic audio generation with self-supervised pretraining.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024
work page 2024
-
[11]
J. Gong, Y . Song, W. Zhao, S. Wang, S. Xu, J. Guo, and X. Yang. ACE-Step 1.5: Pushing the boundaries of open-source music generation.arXiv preprint, 2026
work page 2026
- [12]
-
[13]
A. Défossez, J. Copet, G. Synnaeve, and Y . Adi. High fidelity neural audio compression.Transactions on Machine Learning Research, 2023
work page 2023
- [14]
-
[15]
K. Wang, Z. Wu, D. Zhou, R. Lin, J. Dai, and T. Jiang. Back to ear: Perceptually driven high fidelity music reconstruction.arXiv preprint, 2025
work page 2025
-
[16]
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. InNeurIPS, 2020
work page 2020
-
[17]
Y . Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. InICLR, 2021
work page 2021
- [18]
-
[19]
X. Liu, C. Gong, and Q. Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint, 2022
work page 2022
- [20]
-
[21]
M. S. Albergo and E. Vanden-Eijnden. Building normalizing flows with stochastic interpolants. InICLR, 2023
work page 2023
-
[22]
T. Salimans and J. Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint, 2022
work page 2022
-
[23]
E. Luhman and T. Luhman. Knowledge Distillation in iterative generative models for improved sampling speed. arXiv preprint, 2021
work page 2021
-
[24]
T. Ye, L. Li, G. Huang, S. Xia, D. Li, and F. Wei. Differential Transformer. InICLR, 2025
work page 2025
-
[25]
W. Peebles and S. Xie. Scalable diffusion models with transformers. InICCV, 2023. 23 Stable Audio 3TECHNICALREPORT
work page 2023
-
[26]
M. S. Burtsev, Y . Kuratov, A. Peganov, and G. V . Sapunov. Memory Transformer.arXiv preprint, 2020
work page 2020
- [27]
- [28]
- [29]
-
[30]
N. Majumder, C.-Y . Hung, D. Ghosal, W.-N. Hsu, R. Mihalcea, and S. Poria. Tango 2: Aligning diffusion-based text-to-audio generations through direct preference optimization. InACM MM, 2024
work page 2024
- [31]
-
[32]
Z. Ning, H. Chen, Y . Jiang, C. Hao, G. Ma, S. Wang, J. Yao, and L. Xie. Diffrhythm: Blazingly fast and em- barrassingly simple end-to-end full-length song generation with latent diffusion.arXiv preprint, 2025. Weights available athttps://huggingface.co/ASLP-lab/DiffRhythm-full
work page 2025
- [33]
- [34]
- [35]
-
[36]
J. Chen, C. Ge, E. Xie, Y . Wu, L. Yao, X. Ren, Z. Wang, P. Luo, H. Lu, and Z. Li. Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation. InECCV, 2024
work page 2024
-
[37]
Z. Li, J. Zhang, Q. Lin, J. Xiong, Y . Long, X. Deng, Y . Zhang, X. Liu, M. Huang, Z. Xiao, et al. Hunyuan-dit: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding.arXiv preprint, 2024
work page 2024
- [38]
-
[39]
S. Tong, B. Zheng, Z. Wang, B. Tang, N. Ma, E. Brown, J. Yang, R. Fergus, Y . LeCun, and S. Xie. Scaling text-to-image diffusion transformers with representation autoencoders.arXiv preprint, 2026
work page 2026
-
[40]
J. D. Parker, Z. Evans, C. J. Carr, Z. Zukowski, J. Taylor, M. Rice, and J. Pons. SAME: A semantically-aligned music autoencoder. Technical report, 2025
work page 2025
-
[41]
J. Pons, Z. Zukowski, J. D. Parker, C. J. Carr, J. Taylor, and Z. Evans. Music and artificial intelligence: Artistic trends.arXiv preprint, 2025
work page 2025
-
[42]
H. F. García, P. Seetharaman, R. Kumar, and B. Pardo. VampNet: Music generation via masked acoustic token modeling. InISMIR, 2023
work page 2023
-
[43]
P. Li, B. Chen, Y . Yao, Y . Wang, A. Wang, and A. Wang. JEN-1: Text-guided universal music generation with omnidirectional diffusion models. InIEEE CAI, 2024
work page 2024
-
[44]
P. Seetharaman, O. Nieto, and J. Salamon. Generative audio extension and morphing. InICASSP, 2026
work page 2026
-
[45]
Y . Wang, Z. Ju, X. Tan, L. He, Z. Wu, J. Bian, and S. Zhao. AUDIT: Audio editing by following instructions with latent diffusion models. InNeurIPS, 2023
work page 2023
-
[46]
B. Han, J. Dai, W. Hao, X. He, D. Guo, J. Chen, Y . Wang, Y . Qian, and X. Song. InstructME: An instruction guided music edit framework with latent diffusion models. InIJCAI, 2024
work page 2024
-
[47]
J. D. Parker, J. Spijkervet, K. Kosta, F. Yesiler, B. Kuznetsov, J.-C. Wang, M. Avent, J. Chen, and D. Le. Stemgen: A music generation model that listens. InICASSP, 2024
work page 2024
-
[48]
M. Levy, B. Di Giorgi, F. Weers, A. Katharopoulos, and T. Nickson. Controllable music production with diffusion models and guidance gradients.arXiv preprint, 2023
work page 2023
- [49]
-
[50]
G. L. Lan, B. Shi, Z. Ni, S. Srinivasan, A. Kumar, B. Ellis, D. Kant, V . Nagaraja, E. Chang, W.-N. Hsu, et al. High fidelity text-guided music editing via single-stage flow matching.arXiv preprint, 2024
work page 2024
- [51]
- [52]
-
[53]
O. Tal, A. Ziv, I. Gat, F. Kreuk, and Y . Adi. Joint audio and symbolic conditioning for temporally controlled text-to-music generation.arXiv preprint, 2024
work page 2024
- [54]
-
[55]
S.-L. Wu, C. Donahue, S. Watanabe, and N. J. Bryan. Music ControlNet: Multiple time-varying controls for music generation.IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024
work page 2024
-
[56]
H. F. García, O. Nieto, J. Salamon, B. Pardo, and P. Seetharaman. Sketch2sound: Controllable audio generation via time-varying signals and sonic imitations. InICASSP, 2025
work page 2025
-
[57]
J. Wang. Audio palette: A diffusion transformer with multi-signal conditioning for controllable foley synthesis. arXiv preprint, 2025
work page 2025
-
[58]
S. Kim, G. Kim, S. Yagishita, D. Han, J. Im, and Y . Sung. Enhancing diffusion-based music generation perfor- mance with lora.Applied Sciences, 2025
work page 2025
-
[59]
Y . Song, P. Dhariwal, M. Chen, and I. Sutskever. Consistency models. InICML, 2023
work page 2023
-
[60]
Z. Xiao, K. Kreis, and A. Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In ICLR, 2022
work page 2022
- [61]
-
[62]
Y . Ren, X. Xia, Y . Lu, J. Zhang, J. Wu, P. Xie, X. Wang, and X. Xiao. Hyper-SD: Trajectory segmented consistency model for efficient image synthesis.arXiv preprint, 2024
work page 2024
-
[63]
F.-Y . Wang, Z. Huang, A. W. Bergman, D. Shen, P. Gao, M. Lingelbach, K. Sun, W. Bian, G. Song, Y . Liu, H. Li, and X. Wang. Phased consistency model.arXiv preprint, 2024
work page 2024
- [64]
-
[65]
J. Chen, S. Xue, Y . Zhao, J. Yu, S. Paul, J. Chen, H. Cai, E. Xie, and S. Han. Sana-sprint: One-step diffusion with continuous-time consistency distillation.arXiv preprint, 2025
work page 2025
- [66]
- [67]
-
[68]
Y . Xu, W. Nie, and A. Vahdat. One-step diffusion models withf-divergence distribution matching.arXiv preprint, 2025
work page 2025
-
[69]
T. Yin, M. Gharbi, T. Park, R. Zhang, E. Shechtman, F. Durand, and W. T. Freeman. Improved distribution matching distillation for fast image synthesis.arXiv preprint, 2024
work page 2024
-
[70]
M. Kang, R. Zhang, C. Barnes, S. Paris, S. Kwak, J. Park, E. Shechtman, J.-Y . Zhu, and T. Park. Distilling diffusion models into conditional gans.arXiv preprint, 2024
work page 2024
-
[71]
Y . Xu, Y . Zhao, Z. Xiao, and T. Hou. Ufogen: You forward once large scale text-to-image generation via diffusion gans. InCVPR, 2024
work page 2024
-
[72]
S. Lin, X. Xia, Y . Ren, C. Yang, X. Xiao, and L. Jiang. Diffusion adversarial post-training for one-step video generation.arXiv preprint, 2025
work page 2025
- [73]
-
[74]
H. Liu, R. Huang, Y . Liu, H. Cao, J. Wang, X. Cheng, S. Zheng, and Z. Zhao. AudioLCM: Text-to-audio generation with latent consistency models. InACM MM, 2024
work page 2024
-
[75]
G. Hadjeres, M. Ferras, K. Koutini, B. Weck, A. Bittar, T. Hummel, Z. Lahrichi, H. Missoum, J. Serrà, and Y . Mitsufuji. Woosh: A sound effects foundation model.arXiv preprint, 2026
work page 2026
-
[76]
J. Su, M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 2024
work page 2024
-
[77]
A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint, 2024. 25 Stable Audio 3TECHNICALREPORT
work page 2024
- [78]
-
[79]
N. Shazeer. Glu variants improve transformer.arXiv preprint, 2020
work page 2020
-
[80]
Flux.https://github.com/black-forest-labs/flux, 2024
Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.