A single neural audio codec can operate at multiple token temporal resolutions by generating TTR-dependent convolutional kernels from shared parameters while adjusting kernel size and stride.
Neural Audio Codec with Adjustable Token Temporal Resolution Using Sampling-Frequency-Independent Convolutional Layers
abstract
Discrete tokens obtained from neural audio codecs (NACs) have been used as compact representations in audio generation and understanding models. In such token-based systems, token temporal resolution (TTR), defined as the time interval between adjacent token frames, is important because it controls the trade-off between representing rapid acoustic events and reducing token-sequence length. However, most NACs are trained at a single TTR and require separate training for each TTR. This paper proposes a mechanism that enables a single NAC to operate at multiple TTRs using sampling-frequency-independent convolutional layers. The mechanism regards TTR as the sampling period of the token sequence and generates TTR-dependent convolutional kernels from a shared parameter set, while adjusting the kernel size and stride for each TTR. We incorporate the mechanism into Descript Audio Codec, leaving the quantizer unchanged. Experiments on environmental sound reconstruction show that the proposed model outperforms a single-model baseline that switches TTR-specific layers for each TTR.
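The idea of generating TTR-dependent kernels from one shared parameter set can be illustrated with a small sketch. This is not the paper's implementation: the cosine-basis parameterization, the rule that the kernel spans two TTRs, and the names `layer_config` and `sample_kernel` are all assumptions made for illustration. The sketch only shows how a single continuous kernel, defined on a time axis measured in TTR units, could be sampled with a different kernel size and stride for each TTR.

```python
import math


def layer_config(ttr_s, input_period_s):
    """Return (stride, kernel_size) for a layer that emits one frame per TTR.

    Assumption for this sketch: the kernel spans two TTRs.
    """
    stride = int(round(ttr_s / input_period_s))
    if stride < 1:
        raise ValueError("TTR must be at least one input period")
    return stride, 2 * stride


def sample_kernel(coeffs, stride, kernel_size):
    """Sample a shared continuous kernel at the input sampling period.

    coeffs are the shared parameters: weights of a hypothetical cosine basis
    cos(pi * k * u), where u is time measured in TTR units.
    """
    taps = []
    for n in range(kernel_size):
        # Tap position in TTR units, centred on zero.
        u = (n + 0.5 - kernel_size / 2) / stride
        value = sum(c * math.cos(math.pi * k * u) for k, c in enumerate(coeffs))
        # Divide by stride so the summed gain does not grow with kernel size.
        taps.append(value / stride)
    return taps


if __name__ == "__main__":
    shared = [0.5, 0.3, -0.1]   # one parameter set for every TTR
    input_period_s = 1e-3       # toy input sequence sampled every 1 ms
    for ttr_ms in (4, 8, 16):
        stride, size = layer_config(ttr_ms * 1e-3, input_period_s)
        kernel = sample_kernel(shared, stride, size)
        print(ttr_ms, stride, size, round(sum(kernel), 6))
```

In this toy setting, a longer TTR yields a larger stride and a longer kernel, while the kernel shape on the TTR-normalized time axis is the same for every TTR because it comes from the same coefficients.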