arxiv: 2406.14294 · v4 · submitted 2024-06-20 · cs.SD · cs.AI · eess.AS

DASB - Discrete Audio and Speech Benchmark

Pooneh Mousavi, Jarod Duret, Darius Petermann, Artem Ploujnikov, Luca Della Libera, Anastasia Kuznetsova, Cem Subakan, Mirco Ravanelli

classification: cs.SD, cs.AI, eess.AS

keywords: audio, discrete, tokens, dasb, speech, across, benchmark, continuous

Abstract

Discrete audio tokens have recently gained considerable attention for their potential to bridge audio and language processing, enabling multimodal language models that can both generate and understand audio. However, preserving key information such as phonetic content, speaker identity, and paralinguistic cues remains a major challenge. Identifying the optimal tokenizer and configuration is further complicated by inconsistent evaluation settings across existing studies. To address this, we introduce the Discrete Audio and Speech Benchmark (DASB), a comprehensive framework for benchmarking discrete audio tokens across speech, general audio, and music domains on a range of discriminative and generative tasks. Our results show that discrete representations are less robust than continuous ones and require careful tuning of factors such as model architecture, data size, learning rate, and capacity. Semantic tokens generally outperform acoustic tokens, but a gap remains between discrete tokens and continuous features, highlighting the need for further research. DASB codes, evaluation setup, and leaderboards are publicly available at https://poonehmousavi.github.io/DASB-website/.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
cs.CL · 2025-09 · unverdicted · novelty 6.0

StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.