Taming Audio VAEs via Target-KL Regularization

arxiv: 2605.17085 · v1 · pith:3MJKYDYL · submitted 2026-05-16 · cs.SD · cs.LG · eess.AS

Prem Seetharaman, Rithesh Kumar

classification: cs.SD, cs.LG, eess.AS

keywords: audio, generation, vaes, latent, regularization, target-kl, compression, diffusion


Abstract

Latent diffusion models have emerged as the dominant paradigm for many generation tasks, including audio generation tasks such as text-to-audio, text-to-music, and text-to-speech. A key component of latent diffusion is a variational autoencoder (VAE) that compresses high-dimensional signals into a continuous representation with a low frame rate that is conducive to downstream prediction. Regularizing these VAEs is challenging, as there is a trade-off between over-regularized latent representations (poor output quality) and under-regularized ones (difficult to predict). We propose a framework for studying this trade-off through the lens of compression, and train audio VAEs at specific bitrates via target-KL regularization. This allows direct comparison with well-studied discrete neural audio codec models and the construction of rate-distortion curves for audio VAEs. We evaluate the impact of target-KL regularization on text-to-sound generation and find that sweeping compression rates helps identify the optimal generation setting.
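The abstract does not spell out how a KL target maps to a bitrate, so the following is only a minimal sketch under stated assumptions: a diagonal Gaussian posterior, a standard normal prior, the per-frame KL read as a rate in nats, and an absolute-deviation penalty that pulls the KL toward its target. All function names, the penalty form, and the example numbers (2000 bits/s at 25 latent frames/s) are hypothetical illustrations, not the authors' implementation.

```python
import math

def gaussian_kl_nats(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, 1)) for one latent frame, summed over
    latent dimensions, in nats. Assumes a diagonal Gaussian posterior."""
    return sum(0.5 * (m * m + math.exp(lv) - 1.0 - lv)
               for m, lv in zip(mu, logvar))

def kl_to_bitrate_bps(kl_nats_per_frame, frame_rate_hz):
    """Read a per-frame KL as a rate: nats -> bits, then bits/frame -> bits/s."""
    return kl_nats_per_frame / math.log(2) * frame_rate_hz

def bitrate_to_target_kl(bitrate_bps, frame_rate_hz):
    """Inverse mapping: the per-frame KL target (nats) for a desired bitrate."""
    return bitrate_bps / frame_rate_hz * math.log(2)

def target_kl_penalty(kl_nats_per_frame, target_kl_nats):
    """Assumed penalty: absolute deviation of the measured KL from the target.
    It is zero only when the latent sits exactly at the target rate."""
    return abs(kl_nats_per_frame - target_kl_nats)

# Illustrative numbers only: 2000 bits/s at 25 latent frames per second
# corresponds to 80 bits per frame, i.e. 80 * ln(2) nats per frame.
target = bitrate_to_target_kl(2000.0, 25.0)
measured = gaussian_kl_nats(mu=[0.5, -1.0, 2.0], logvar=[-1.0, 0.0, -2.0])
print(f"target KL per frame: {target:.2f} nats")
print(f"measured bitrate:    {kl_to_bitrate_bps(measured, 25.0):.1f} bits/s")
print(f"penalty:             {target_kl_penalty(measured, target):.2f}")
```

Under these assumptions, sweeping the target over several bitrates and recording reconstruction error at each one would trace out a rate-distortion curve of the kind the abstract describes.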
