Neural Audio Codec with Adjustable Token Temporal Resolution Using Sampling-Frequency-Independent Convolutional Layers

· 2026 · eess.AS · arXiv 2607.01865

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Discrete tokens obtained from neural audio codecs (NACs) have been used as compact representations in audio generation and understanding models. In such token-based systems, token temporal resolution (TTR), defined as the time interval between adjacent token frames, is important because it controls the trade-off between representing rapid acoustic events and reducing token-sequence length. However, most NACs are trained at a single TTR and require separate training for each TTR. This paper proposes a mechanism that enables a single NAC to operate at multiple TTRs using sampling-frequency-independent convolutional layers. The mechanism regards TTR as the sampling period of the token sequence and generates TTR-dependent convolutional kernels from a shared parameter set, while adjusting the kernel size and stride for each TTR. We incorporate the mechanism into Descript Audio Codec, leaving the quantizer unchanged. Experiments on environmental sound reconstruction show that the proposed model outperforms a single-model baseline that switches TTR-specific layers for each TTR.

representative citing papers

Neural Audio Codec with Adjustable Token Temporal Resolution Using Sampling-Frequency-Independent Convolutional Layers

eess.AS · 2026-07-02 · unverdicted · novelty 6.0

A single neural audio codec can operate at multiple token temporal resolutions by generating TTR-dependent convolutional kernels from shared parameters while adjusting kernel size and stride.

citing papers explorer

Showing 1 of 1 citing paper.

Neural Audio Codec with Adjustable Token Temporal Resolution Using Sampling-Frequency-Independent Convolutional Layers eess.AS · 2026-07-02 · unverdicted · none · ref 2 · internal anchor
A single neural audio codec can operate at multiple token temporal resolutions by generating TTR-dependent convolutional kernels from shared parameters while adjusting kernel size and stride.

Neural Audio Codec with Adjustable Token Temporal Resolution Using Sampling-Frequency-Independent Convolutional Layers

fields

years

verdicts

representative citing papers

citing papers explorer