Taming Audio VAEs via Target-KL Regularization

Prem Seetharaman; Rithesh Kumar

arxiv: 2605.17085 · v1 · pith:3MJKYDYLnew · submitted 2026-05-16 · 💻 cs.SD · cs.LG · eess.AS

Pith reviewed 2026-05-20 14:48 UTC · model grok-4.3

keywords audio VAE · target KL regularization · latent diffusion · rate distortion · text to audio · neural audio codec · compression trade-off

The pith

Target-KL regularization trains audio VAEs at precise bitrates to optimize latent diffusion generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces target-KL regularization to control the bitrate of audio variational autoencoders used in latent diffusion models. This method addresses the difficult balance between making latents too compressed for good reconstruction and too loose for easy prediction by the diffusion model. By training at specific rates, it becomes possible to draw rate-distortion curves and compare continuous VAEs directly against discrete neural audio codecs. The authors show that choosing the right compression level through this sweep improves text-to-sound generation performance. A sympathetic reader would care because better controlled latents could lead to higher quality and more efficient audio generation systems.

Core claim

We propose target-KL regularization as a way to train audio VAEs to target specific KL divergence values that correspond to desired bitrates. This framework allows us to study the compression trade-off in the context of latent diffusion for audio, construct rate-distortion curves, and identify optimal operating points for downstream tasks such as text-to-sound generation.

What carries the argument

Target-KL regularization, which adjusts the VAE training objective to achieve a predetermined KL term value that sets the effective bitrate of the latent representation.
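A minimal sketch of how such a penalty might look, based on the two relations quoted in the Lean theorems section of this page (KL_target = B·log 2 / S and L_target-KL = (KL − KL_target)²). It assumes a diagonal Gaussian posterior with a standard normal prior, that S is the latent frame rate in frames per second, and that KL is summed over channels and averaged over frames. The function names and data layout are hypothetical and this is not the authors' implementation.

```python
import math


def gaussian_kl_per_frame(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, 1)) in nats for one frame, summed over channels.

    Assumption: diagonal Gaussian posterior, standard normal prior.
    """
    return sum(0.5 * (m * m + math.exp(lv) - 1.0 - lv) for m, lv in zip(mu, logvar))


def kl_target_from_bitrate(bits_per_second, frames_per_second):
    """KL_target = B * log(2) / S, in nats per latent frame."""
    return bits_per_second * math.log(2.0) / frames_per_second


def target_kl_loss(mus, logvars, bits_per_second, frames_per_second):
    """Squared distance between the mean per-frame KL and its target.

    mus, logvars: lists of frames, each frame a list of channel values.
    """
    per_frame = [gaussian_kl_per_frame(m, lv) for m, lv in zip(mus, logvars)]
    mean_kl = sum(per_frame) / len(per_frame)
    kl_target = kl_target_from_bitrate(bits_per_second, frames_per_second)
    return (mean_kl - kl_target) ** 2
```

In a real training loop this term would presumably be added to the reconstruction loss with some weight; the page does not say how the two are balanced.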

If this is right

Audio VAEs can now be evaluated at matching bitrates with discrete codecs for fair comparison.
Rate-distortion curves can be built for continuous latent representations in audio.
Sweeping over compression rates reveals the best setting for text-to-sound diffusion models.
The latent structure remains usable for diffusion-based generation at controlled rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar regularization could be applied to VAEs in other domains like images to control their latent rates.
Future work might explore content-adaptive bitrate selection using this method.
This could reduce the need for post-training compression techniques in generative audio pipelines.

Load-bearing premise

The assumption that setting a target KL divergence during VAE training reliably produces the intended bitrate while keeping the latents structured enough for effective diffusion modeling.

What would settle it

Training an audio VAE with a specific target KL value and measuring whether the actual bitrate matches the target, or checking if varying the target produces corresponding changes in generation quality without collapse.
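The first half of that check could be sketched as follows, assuming the conversion bps = S / log 2 · KL quoted elsewhere on this page, with S taken to be the latent frame rate and KL measured in nats per frame on held-out audio. The function names, the report layout, and the 5% tolerance are hypothetical. Note that this only compares the KL-implied rate with its target; it does not measure a coded bitrate.

```python
import math


def nats_per_frame_to_bps(kl_nats_per_frame, frames_per_second):
    """bps = S / log(2) * KL."""
    return frames_per_second / math.log(2.0) * kl_nats_per_frame


def rate_report(measured_kls, target_bps, frames_per_second, tolerance=0.05):
    """Compare the KL-implied rate on held-out data with the target bitrate.

    measured_kls: per-example KL values in nats per latent frame.
    tolerance: hypothetical acceptable relative deviation (5% by default).
    """
    mean_kl = sum(measured_kls) / len(measured_kls)
    realized = nats_per_frame_to_bps(mean_kl, frames_per_second)
    relative_error = (realized - target_bps) / target_bps
    return {
        "target_bps": target_bps,
        "realized_bps": realized,
        "relative_error": relative_error,
        "within_tolerance": abs(relative_error) <= tolerance,
    }
```

Running this for each target in a sweep would show whether the realized rates follow the targets or drift away from them.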

Original abstract

Latent diffusion models have emerged as the dominant paradigm for many generation tasks including audio generation such as text-to-audio, text-to-music and text-to-speech. A key component of latent diffusion is an autoencoder (VAE) that compresses high-dimensional signals into a low frame rate continuous representation that is conducive for downstream prediction. Regularizing these VAEs is challenging, as there is a trade-off between over-regularized (poor output quality) and under-regularized (difficult to predict) latent representations. We propose a framework for studying this trade-off through compression and train Audio VAEs at specific bitrates via target-KL regularization. This allows direct comparison to well-studied discrete neural audio codec models, and the construction of rate-distortion curves for audio VAEs. We evaluate the impact of target-KL regularization on text-to-sound generation and find that sweeping compression rates is helpful in identifying the optimal generation setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Target-KL gives a knob for average KL in audio VAEs, but the fixed-bitrate and discrete-codec comparison claims look under-supported without quantization or rate measurements.

The paper's main move is to add a target-KL term so that audio VAEs can be trained toward chosen compression levels, then use those models to draw rate-distortion curves and check downstream text-to-sound quality. That framing is useful because it turns the usual regularization trade-off into something you can sweep and compare against existing codec numbers. The practical payoff they show is that different target values change generation behavior in a measurable way, which is the sort of knob people actually need when picking a VAE for a latent diffusion pipeline. The work is honest about the motivation and sticks to a clear experimental goal.

The soft spot is the central claim that target-KL produces controllable, specific bitrates that let you compare directly to discrete neural audio codecs. The KL term bounds average information under the posterior, but without any quantization or entropy coding step the realized rate on the continuous latents can drift, so the curves are not obviously on the same scale as the discrete literature. The abstract does not mention any post-hoc rate measurement or coding step that would close that gap, which leaves the comparability more asserted than shown. The evaluation section is also light on numbers; it says sweeping rates helps but does not report concrete quality or efficiency deltas.

This is the kind of paper that researchers building or tuning audio latent models will want to look at, especially if they are trying to set compression for a new diffusion system. It is coherent on its own terms and engages the right literature, so it deserves a serious referee even though the bitrate equivalence will need more evidence in revision.

Referee Report

2 major / 2 minor

Summary. The paper proposes target-KL regularization as a method to train audio VAEs at controllable specific bitrates. This framework studies the compression trade-off in latent representations for downstream latent diffusion models, enables direct rate-distortion comparisons to discrete neural audio codecs, and is evaluated on text-to-sound generation where sweeping rates identifies optimal settings.

Significance. If the central claim holds, the work would supply a practical control mechanism for the regularization trade-off in audio VAEs and a standardized way to produce rate-distortion curves, bridging continuous latent models with the discrete codec literature.

major comments (2)

[Method and Experiments] The central claim that target-KL regularization produces controllable, specific bitrates enabling direct comparison to discrete codecs (abstract) requires explicit demonstration that the realized information rate on the continuous latent sequence matches the target; the KL term alone bounds average information under the variational posterior but does not guarantee commensurable effective bitrate without quantization or entropy coding. This is load-bearing for the rate-distortion curve construction and must be addressed with concrete measurements in the experiments.
[Experiments] The evaluation of impact on text-to-sound generation via sweeping compression rates (abstract) needs quantitative ablations showing that downstream diffusion performance varies systematically with the target KL value, including metrics such as generation quality scores at each operating point; without these, the optimality claim cannot be verified.

minor comments (2)

[Method] Clarify the exact loss formulation and hyper-parameter schedule used to enforce the target KL value during training.
[Introduction] Add references to prior work on KL regularization in VAEs and rate-distortion analysis in audio codecs for context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have made revisions to the manuscript to strengthen the presentation of our results.

Referee: [Method and Experiments] The central claim that target-KL regularization produces controllable, specific bitrates enabling direct comparison to discrete codecs (abstract) requires explicit demonstration that the realized information rate on the continuous latent sequence matches the target; the KL term alone bounds average information under the variational posterior but does not guarantee commensurable effective bitrate without quantization or entropy coding. This is load-bearing for the rate-distortion curve construction and must be addressed with concrete measurements in the experiments.

Authors: We agree that explicit verification of the realized rate is necessary to support direct comparisons with discrete codecs and to make the rate-distortion curves rigorous. In the revised manuscript we have added concrete measurements: we report both the target KL (converted to bits per second) and an empirical effective rate obtained by quantizing the continuous latents with a uniform scalar quantizer followed by entropy coding of the resulting symbols. These realized rates are shown alongside the targets in a new table and figure in Section 4; the measured rates track the targets within a small margin, confirming controllability. This addition directly addresses the load-bearing concern for the rate-distortion analysis. revision: yes
Referee: [Experiments] The evaluation of impact on text-to-sound generation via sweeping compression rates (abstract) needs quantitative ablations showing that downstream diffusion performance varies systematically with the target KL value, including metrics such as generation quality scores at each operating point; without these, the optimality claim cannot be verified.

Authors: We appreciate the call for more granular quantitative evidence. The original manuscript reported aggregate results and qualitative examples for text-to-sound generation; we have now expanded the evaluation with a dedicated ablation table (Table 3) that lists objective metrics (FAD, CLAP score) and perceptual quality scores at each target KL operating point. The table shows systematic variation: quality improves with moderate increases in target KL and then saturates or declines at higher rates, thereby identifying the empirically optimal setting. These results are discussed in Section 5 and support the claim that rate sweeping is useful for downstream tasks. revision: yes
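The first simulated response above mentions estimating an effective rate by uniform scalar quantization followed by entropy coding. One rough way to approximate that measurement is sketched below, using the zeroth-order empirical entropy of the quantized symbols as a stand-in for an actual entropy coder. The step size, the pooling of all channels into one symbol distribution, and the function names are assumptions made for illustration; the page does not describe the procedure in this detail.

```python
import math
from collections import Counter


def uniform_quantize(values, step):
    """Map each value to the integer index of its nearest multiple of `step`."""
    return [round(v / step) for v in values]


def empirical_entropy_bits(symbols):
    """Zeroth-order entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def estimated_bitrate(latent_frames, step, frames_per_second):
    """Estimate bits per second for quantized latents.

    latent_frames: list of frames, each a list of channel values.
    Assumption: every channel shares one memoryless symbol model, so
    dependencies across time and channels are ignored.
    """
    flat = [v for frame in latent_frames for v in frame]
    bits_per_value = empirical_entropy_bits(uniform_quantize(flat, step))
    channels = len(latent_frames[0])
    return bits_per_value * channels * frames_per_second
```

Because the estimate depends on the chosen step size, a comparison against the KL-implied rate would also need to report the reconstruction quality obtained after quantization.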

Circularity Check

0 steps flagged

No significant circularity; target-KL regularization functions as an independent control

The paper introduces target-KL regularization as an explicit training mechanism to set the KL term to a chosen value, thereby controlling average information content in the VAE latent space. This is framed as a new framework for exploring compression-quality trade-offs and constructing rate-distortion curves, rather than deriving the target bitrate from the model's own outputs or prior self-citations. No equation or claim reduces the central result to a self-definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. The derivation remains self-contained because the regularization hyperparameter is chosen externally and the downstream diffusion experiments serve as an independent test of the resulting latents.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no concrete free parameters, axioms, or invented entities can be extracted from the text.

pith-pipeline@v0.9.0 · 5688 in / 1083 out tokens · 56793 ms · 2026-05-20T14:48:39.482168+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We propose a framework for studying this trade-off through compression and train Audio VAEs at specific bitrates via target-KL regularization... $\mathrm{bps} = \frac{S}{\log 2}\,\mathrm{KL}\left(q_\phi(z|x)\,\|\,p(z)\right)$

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

$\mathrm{KL}_{\text{target}} = \frac{B \log 2}{S}$; $\mathcal{L}_{\text{target-KL}} = \left(\mathrm{KL} - \mathrm{KL}_{\text{target}}\right)^2$
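As a worked illustration of the two quoted relations: the page does not define the symbols, so this assumes S is the latent frame rate in frames per second, B is the target bitrate in bits per second, log is the natural logarithm, and KL is measured in nats per latent frame. The operating point (B = 2000, S = 50) and the measured KL of 30 are invented for illustration and do not come from the paper.

```latex
\begin{align*}
\mathrm{KL}_{\text{target}} &= \frac{B \log 2}{S}
  = \frac{2000 \times 0.69315}{50} \approx 27.73 \ \text{nats per frame}, \\
\mathrm{bps} &= \frac{S}{\log 2}\,\mathrm{KL}_{\text{target}}
  = \frac{50}{0.69315} \times 27.73 \approx 2000, \\
\mathcal{L}_{\text{target-KL}} &= \left(\mathrm{KL} - \mathrm{KL}_{\text{target}}\right)^2
  = (30 - 27.73)^2 \approx 5.15 \quad \text{for a measured } \mathrm{KL} = 30.
\end{align*}
```

The second line simply inverts the first, which is why the two quoted formulas are consistent with each other.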

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 11 internal anchors

[1]

INTRODUCTION Hierarchical generative modeling [1, 2, 3, 4, 5] has become the standard approach for audio generation tasks including text-to-speech, text-to-music and text-to-sound synthesis. It involves an auto-encoder component that can compress high dimensional natural signals into low frame rate latent representations, followed by a powerful generat...

[2]

Target-KL regularization, a novel method for targeting a specific bitrate when training a continuous VAE, which enables modelers to make trade-offs between reconstruction quality and latent regularization

[3]

A unified study of the rate-distortion trade-off for both continuous and discrete audio compression models

[4]

A study on the impact of compression rate on diffusion-based text-to-audio generative models

[5]

Taming Audio VAEs via Target-KL Regularization

TARGET-KL FOR FIXED BITRATE VAE Autoencoders for compressing audio signals $x$ into latents $z$ are trained with the following objective: $\mathbb{E}_{x \sim D}\left[\mathbb{E}_{z \sim q_\phi(z|x)} \log p_\theta(x|z) - \lambda \cdot D_{\mathrm{KL}}\left(q_\phi(z|x)\,\|\,p_\psi(z)\right)\right]$. (1) Note that when $\lambda = 1$, this reduces to the original ELBO objective. In VQ-VAEs, $q_\phi(z|x)$ is deterministic and by assuming a simple uniform prior over $z$, we obtain a co...

[6]

Model architecture Our model is built on the same framework of neural audio codec models, except we replace the quantization bottleneck with a Gaussian regularization

EXPERIMENTS 3.1. Model architecture Our model is built on the same framework of neural audio codec models, except we replace the quantization bottleneck with a Gaussian regularization. We use the same fully convolutional encoder-decoder model architecture from DAC [14] and the same training recipe. We train on a dataset of speech, music, and sound effe...

[7]

RESULTS In Figure 1, we show the rate-distortion trends for a variety of discrete and continuous audio compression models. We find that target-KL regularization allows us to target specific bitrates for continuous VAEs and study how various architectures behave under different compression rates explicitly. We find that DAC-VAE seems to form 1https://...

[8]

This allows for direct comparison to discrete neural audio codecs and enables systematic study of the rate-distortion trade-off for continuous audio compression models

CONCLUSION AND FUTURE WORK In this work, we proposed target-KL regularization, a method for training continuous VAEs at fixed bitrates. This allows for direct comparison to discrete neural audio codecs and enables systematic study of the rate-distortion trade-off for continuous audio compression models. We evaluated our models on text-to-sound and text...

[9]

Neural discrete representation learning,

Aaron Van Den Oord, Oriol Vinyals, et al., “Neural discrete representation learning,” Advances in neural information processing systems, vol. 30, 2017

[10]

High-resolution image synthesis with latent diffusion models,

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695

[11]

Audiolm: a language modeling approach to audio generation,

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al., “Audiolm: a language modeling approach to audio generation,” IEEE/ACM transactions on audio, speech, and language processing, vol. 31, pp. 2523–2533, 2023

[12]

Stable audio open,

Zach Evans, Julian D Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons, “Stable audio open,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

[13]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023

[14]

Vampnet: Music generation via masked acoustic token modeling,

Hugo Flores Garcia, Prem Seetharaman, Rithesh Kumar, and Bryan Pardo, “Vampnet: Music generation via masked acoustic token modeling,” arXiv preprint arXiv:2307.04686, 2023

[15]

Maskgct: Zero-shot text-to-speech with masked generative codec transformer,

Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, and Zhizheng Wu, “Maskgct: Zero-shot text-to-speech with masked generative codec transformer,” arXiv preprint arXiv:2409.00750, 2024

[16]

Soundstorm: Efficient parallel audio generation,

Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, and Marco Tagliasacchi, “Soundstorm: Efficient parallel audio generation,” arXiv preprint arXiv:2305.09636, 2023

[17]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013

[18]

Denoising diffusion probabilistic models,

Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020

[19]

AudioLDM: Text-to-audio generation with latent diffusion models,

Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley, “AudioLDM: Text-to-audio generation with latent diffusion models,” in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023

[20]

Scaling rectified flow transformers for high-resolution image synthesis,

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al., “Scaling rectified flow transformers for high-resolution image synthesis,” in Forty-first international conference on machine learning, 2024

[21]

Soundstream: An end-to-end neural audio codec,

Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021

[22]

High-fidelity audio compression with improved rvqgan,

Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar, “High-fidelity audio compression with improved rvqgan,” Advances in Neural Information Processing Systems, vol. 36, pp. 27980–27993, 2023

[23]

Interpreting rate-distortion of variational autoencoder and using model uncertainty for anomaly detection,

Seonho Park, George Adosoglou, and Panos M Pardalos, “Interpreting rate-distortion of variational autoencoder and using model uncertainty for anomaly detection,” Annals of Mathematics and Artificial Intelligence, vol. 90, no. 7, pp. 735–752, 2022

[24]

Practical Lossless Compression with Latent Variables using Bits Back Coding

James Townsend, Tom Bird, and David Barber, “Practical lossless compression with latent variables using bits back coding,” arXiv preprint arXiv:1901.04866, 2019

[25]

Fixing a Broken ELBO

Alex Alemi, Ben Poole, Ian Fischer, Josh Dillon, Rif A Saurus, and Kevin Murphy, “An information-theoretic analysis of deep latent-variable models,” arXiv preprint arXiv:1711.00464, 2018

[26]

Improved variational inference with inverse autoregressive flow,

Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling, “Improved variational inference with inverse autoregressive flow,” Advances in neural information processing systems, vol. 29, 2016

[27]

An introduction to variational autoencoders,

Diederik P Kingma, Max Welling, et al., “An introduction to variational autoencoders,” Foundations and Trends® in Machine Learning, vol. 12, no. 4, pp. 307–392, 2019

[28]

Bigvgan: A universal neural vocoder with large-scale training,

Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon, “Bigvgan: A universal neural vocoder with large-scale training,” arXiv preprint arXiv:2206.04658, 2022

[29]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,” arXiv preprint arXiv:2410.00037, 2024

[30]

Audio set: An ontology and human-labeled dataset for audio events,

Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp. 776–780

[31]

Spectrostream: A versatile neural codec for general audio,

Yunpeng Li, Kehang Han, Brian McWilliams, Zalan Borsos, and Marco Tagliasacchi, “Spectrostream: A versatile neural codec for general audio,” arXiv preprint arXiv:2508.05207, 2025

[32]

High Fidelity Neural Audio Compression

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022

[33]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho, “Progressive distillation for fast sampling of diffusion models,” arXiv preprint arXiv:2202.00512, 2022

[34]

simple diffusion: End-to-end diffusion for high resolution images,

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans, “simple diffusion: End-to-end diffusion for high resolution images,” in International Conference on Machine Learning. PMLR, 2023, pp. 13213–13232

[35]

Simple-tts: End-to-end text-to-speech synthesis with latent diffusion,

Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q Weinberger, and Felix Wu, “Simple-tts: End-to-end text-to-speech synthesis with latent diffusion,” arXiv preprint, 2023

[36]

Scalable diffusion models with transformers,

William Peebles and Saining Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4195–4205

[37]

Ditto-tts: Efficient and scalable zero-shot text-to-speech with diffusion transformer,

Keon Lee, Dong Won Kim, Jaehyeon Kim, and Jaewoong Cho, “Ditto-tts: Efficient and scalable zero-shot text-to-speech with diffusion transformer,” arXiv e-prints, pp. arXiv–2406, 2024

[38]

Autoregressive diffusion transformer for text-to-speech synthesis,

Zhijun Liu, Shuai Wang, Sho Inoue, Qibing Bai, and Haizhou Li, “Autoregressive diffusion transformer for text-to-speech synthesis,” arXiv preprint arXiv:2406.05551, 2024

[39]

Byt5: Towards a token-free future with pre-trained byte-to-byte models,

Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel, “Byt5: Towards a token-free future with pre-trained byte-to-byte models,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 291–306, 2022

[40]

Exploring the limits of transfer learning with a unified text-to-text transformer,

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020

[41]

Phonemizer: Text to phones transcription for multiple languages in python,

Mathieu Bernard and Hadrien Titeux, “Phonemizer: Text to phones transcription for multiple languages in python,” Journal of Open Source Software, vol. 6, no. 68, pp. 3958, 2021

[42]

Emilia: A large-scale, extensive, multilingual, and diverse dataset for speech generation,

Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, et al., “Emilia: A large-scale, extensive, multilingual, and diverse dataset for speech generation,” arXiv preprint arXiv:2501.15907, 2025

[43]

Sila: Signal-to-language augmentation for enhanced control in text-to-audio generation,

Sonal Kumar, Prem Seetharaman, Justin Salamon, Dinesh Manocha, and Oriol Nieto, “Sila: Signal-to-language augmentation for enhanced control in text-to-audio generation,” arXiv preprint arXiv:2412.09789, 2024

[44]

Sketch2sound: Controllable audio generation via time-varying signals and sonic imitations,

Hugo Flores García, Oriol Nieto, Justin Salamon, Bryan Pardo, and Prem Seetharaman, “Sketch2sound: Controllable audio generation via time-varying signals and sonic imitations,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

[45]

FLAM: Frame-wise language-audio modeling,

Yusong Wu, Christos Tsirigotis, Ke Chen, Cheng-Zhi Anna Huang, Aaron Courville, Oriol Nieto, Prem Seetharaman, and Justin Salamon, “FLAM: Frame-wise language-audio modeling,” in Forty-second International Conference on Machine Learning, 2025

[46]

Scaling instruction-finetuned language models,

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al., “Scaling instruction-finetuned language models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024

[47]

Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al., “Seed-tts: A family of high-quality versatile speech generation models,” arXiv preprint arXiv:2406.02430, 2024

[48]

Finite Scalar Quantization: VQ-VAE Made Simple

Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen, “Finite scalar quantization: Vq-vae made simple,” arXiv preprint arXiv:2309.15505, 2023

[49]

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie, “Representation alignment for generation: Training diffusion transformers is easier than you think,” arXiv preprint arXiv:2410.06940, 2024


[1] [1]

INTRODUCTION Hierarchical generative modeling [1, 2, 3, 4, 5] has become the standard approach for audio generation tasks including text-to- speech, text-to-music and text-to-sound synthesis. It involves an auto-encoder component that can compress high dimensional nat- ural signals into low frame rate latent representations, followed by a powerful generat...

work page

[2] [2]

Target-KL regularization, a novel method for targeting a spe- cific bitrate when training a continuous V AE, which enables modelers to make trade-offs between reconstruction quality and latent regularization

work page

[3] [3]

A unified study of the rate-distortion trade-off for both con- tinuous and discrete audio compression models

work page

[4] [4]

A study on the impact of compression rate on diffusion-based text-to-audio generative models

work page

[5] [5]

Taming Audio VAEs via Target-KL Regularization

TARGET-KL FOR FIXED BITRA TE V AE Autoencoders for compressing audio signalsxinto latentszare trained with the following objective: Ex∼D h Ez∼qϕ(z|x) logp θ(x|z)−λ∗D KL qϕ(z|x)∥p ψ(z) i . (1) Note that whenλ= 1, this reduces to the original ELBO objec- tive. In VQ-V AEs,qϕ(z|x)is deterministic and by assuming a sim- ple uniform prior overz, we obtain a co...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

Model architecture Our model is built on the same framework of neural audio codec models, except we replace the quantization bottleneck with a gaus- sian regularization

EXPERIMENTS 3.1. Model architecture Our model is built on the same framework of neural audio codec models, except we replace the quantization bottleneck with a gaus- sian regularization. We use the same fully convolutional encoder- decoder model architecture from DAC [14] and the same training recipe. We train on a dataset of speech, music, and sound effe...

work page

[7] [7]

RESULTS In Figure 1, we show the rate-distortion trends for a variety of dis- crete and continuous audio compression models. We find that target- KL regularization allows us to target specific bitrates for continuous V AEs and study how various architectures behave under different compression rates explicitly. We find that DAC-V AE seems to form 1https://...

work page

[8] [8]

This allows for direct comparison to discrete neural audio codecs and enables systematic study of the rate-distortion trade-off for continuous audio compres- sion models

CONCLUSION AND FUTURE WORK In this work, we proposed target-KL regularization, a method for training continuous V AEs at fixed bitrates. This allows for direct comparison to discrete neural audio codecs and enables systematic study of the rate-distortion trade-off for continuous audio compres- sion models. We evaluated our models on text-to-sound and text...


[9] Aaron Van Den Oord, Oriol Vinyals, et al., “Neural discrete representation learning,” Advances in neural information processing systems, vol. 30, 2017.

[10] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695.

[11] Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al., “Audiolm: a language modeling approach to audio generation,” IEEE/ACM transactions on audio, speech, and language processing, vol. 31, pp. 2523–2533, 2023.

[12] Zach Evans, Julian D Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons, “Stable audio open,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.

[13] Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.

[14] Hugo Flores Garcia, Prem Seetharaman, Rithesh Kumar, and Bryan Pardo, “Vampnet: Music generation via masked acoustic token modeling,” arXiv preprint arXiv:2307.04686, 2023.

[15] Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, and Zhizheng Wu, “Maskgct: Zero-shot text-to-speech with masked generative codec transformer,” arXiv preprint arXiv:2409.00750, 2024.

[16] Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, and Marco Tagliasacchi, “Soundstorm: Efficient parallel audio generation,” arXiv preprint arXiv:2305.09636, 2023.

[17] Diederik P Kingma and Max Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.

[18] Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.

[19] Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley, “AudioLDM: Text-to-audio generation with latent diffusion models,” in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023.

[20] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al., “Scaling rectified flow transformers for high-resolution image synthesis,” in Forty-first international conference on machine learning, 2024.

[21] Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.

[22] Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar, “High-fidelity audio compression with improved rvqgan,” Advances in Neural Information Processing Systems, vol. 36, pp. 27980–27993, 2023.

[23] Seonho Park, George Adosoglou, and Panos M Pardalos, “Interpreting rate-distortion of variational autoencoder and using model uncertainty for anomaly detection,” Annals of Mathematics and Artificial Intelligence, vol. 90, no. 7, pp. 735–752, 2022.

[24] James Townsend, Tom Bird, and David Barber, “Practical lossless compression with latent variables using bits back coding,” arXiv preprint arXiv:1901.04866, 2019.

[25] Alex Alemi, Ben Poole, Ian Fischer, Josh Dillon, Rif A Saurus, and Kevin Murphy, “An information-theoretic analysis of deep latent-variable models,” arXiv preprint arXiv:1711.00464, 2018.

[26] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling, “Improved variational inference with inverse autoregressive flow,” Advances in neural information processing systems, vol. 29, 2016.

[27] Diederik P Kingma, Max Welling, et al., “An introduction to variational autoencoders,” Foundations and Trends® in Machine Learning, vol. 12, no. 4, pp. 307–392, 2019.

[28] Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, and Sungroh Yoon, “Bigvgan: A universal neural vocoder with large-scale training,” arXiv preprint arXiv:2206.04658, 2022.

[29] Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,” arXiv preprint arXiv:2410.00037, 2024.

[30] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp. 776–780.

[31] Yunpeng Li, Kehang Han, Brian McWilliams, Zalan Borsos, and Marco Tagliasacchi, “Spectrostream: A versatile neural codec for general audio,” arXiv preprint arXiv:2508.05207, 2025.

[32] Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.

[33] Tim Salimans and Jonathan Ho, “Progressive distillation for fast sampling of diffusion models,” arXiv preprint arXiv:2202.00512, 2022.

[34] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans, “simple diffusion: End-to-end diffusion for high resolution images,” in International Conference on Machine Learning. PMLR, 2023, pp. 13213–13232.

[35] Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q Weinberger, and Felix Wu, “Simple-tts: End-to-end text-to-speech synthesis with latent diffusion,” arXiv preprint, 2023.

[36] William Peebles and Saining Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4195–4205.

[37] Keon Lee, Dong Won Kim, Jaehyeon Kim, and Jaewoong Cho, “Ditto-tts: Efficient and scalable zero-shot text-to-speech with diffusion transformer,” arXiv e-prints, pp. arXiv–2406, 2024.

[38] Zhijun Liu, Shuai Wang, Sho Inoue, Qibing Bai, and Haizhou Li, “Autoregressive diffusion transformer for text-to-speech synthesis,” arXiv preprint arXiv:2406.05551, 2024.

[39] Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel, “Byt5: Towards a token-free future with pre-trained byte-to-byte models,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 291–306, 2022.

[40] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020.

[41] Mathieu Bernard and Hadrien Titeux, “Phonemizer: Text to phones transcription for multiple languages in python,” Journal of Open Source Software, vol. 6, no. 68, pp. 3958, 2021.

[42] Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, et al., “Emilia: A large-scale, extensive, multilingual, and diverse dataset for speech generation,” arXiv preprint arXiv:2501.15907, 2025.

[43] Sonal Kumar, Prem Seetharaman, Justin Salamon, Dinesh Manocha, and Oriol Nieto, “Sila: Signal-to-language augmentation for enhanced control in text-to-audio generation,” arXiv preprint arXiv:2412.09789, 2024.

[44] Hugo Flores García, Oriol Nieto, Justin Salamon, Bryan Pardo, and Prem Seetharaman, “Sketch2sound: Controllable audio generation via time-varying signals and sonic imitations,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.

[45] Yusong Wu, Christos Tsirigotis, Ke Chen, Cheng-Zhi Anna Huang, Aaron Courville, Oriol Nieto, Prem Seetharaman, and Justin Salamon, “FLAM: Frame-wise language-audio modeling,” in Forty-second International Conference on Machine Learning, 2025.

[46] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al., “Scaling instruction-finetuned language models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024.

[47] Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al., “Seed-tts: A family of high-quality versatile speech generation models,” arXiv preprint arXiv:2406.02430, 2024.

[48] Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen, “Finite scalar quantization: Vq-vae made simple,” arXiv preprint arXiv:2309.15505, 2023.

[49] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie, “Representation alignment for generation: Training diffusion transformers is easier than you think,” arXiv preprint arXiv:2410.06940, 2024.