F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation

Dinghao Zhou; Di Wu; Pengyu Cheng; Shengfan Shen; Sixiang Lv; Xingchen Song

arxiv: 2606.06357 · v1 · pith:XAQ7EFX7new · submitted 2026-06-04 · 💻 cs.SD · cs.AI · eess.AS

classification 💻 cs.SD · cs.AI · eess.AS

keywords latents · audio · autoencoder · continuous · generation · understanding · bottleneck · encoder

Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates building a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.
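
The bottleneck idea in the abstract (channel normalization plus stochastic perturbation in place of a KL term) can be illustrated with a minimal sketch. The abstract does not give the normalization axis, the noise distribution, or the noise scale, so everything below is an assumption made for illustration: the function name `noise_regularized_bottleneck`, the `noise_std` parameter, per-frame normalization across the channel axis, and additive Gaussian noise are all hypothetical choices, not the authors' implementation.

```python
import numpy as np


def noise_regularized_bottleneck(z, noise_std=0.1, training=True, eps=1e-6, rng=None):
    """Illustrative bottleneck for latents shaped (batch, channels, frames).

    Assumed design (not taken from the paper):
      1. normalize each frame to zero mean and unit variance across channels;
      2. during training, add Gaussian noise with standard deviation noise_std.
    """
    # Step 1: channel normalization keeps the latent scale fixed,
    # regardless of the scale the encoder happens to produce.
    mean = z.mean(axis=1, keepdims=True)
    var = z.var(axis=1, keepdims=True)
    z_norm = (z - mean) / np.sqrt(var + eps)

    # Step 2: stochastic perturbation (assumed additive Gaussian) is applied
    # only in training; inference returns the clean normalized latents.
    if training:
        rng = np.random.default_rng() if rng is None else rng
        z_norm = z_norm + noise_std * rng.standard_normal(z_norm.shape)
    return z_norm


# Hypothetical usage: 2 clips, 8 latent channels, 4 frames.
latents = np.random.default_rng(0).normal(3.0, 5.0, size=(2, 8, 4))
clean = noise_regularized_bottleneck(latents, training=False)
noisy = noise_regularized_bottleneck(latents, training=True, rng=np.random.default_rng(1))
```

Because the normalization fixes the latent scale, a fixed noise level has a consistent relative magnitude, which is one plausible reading of "scale-controlled" in the abstract; the paper itself should be consulted for the actual formulation.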
