wav2tok 2.0 improves audio tokenization for query-by-example spoken term detection via staged training that first learns speaker-invariant representations then enforces pairwise token alignment.
wav2tok 2.0: Scalable Audio Tokenization Maintaining Explicit Pairwise Token Alignment for Efficient Audio Retrieval
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
Learning discrete speech representations that preserve similarity across variable-length utterances is central to query-by-example spoken term detection (QbE-STD). While wav2tok introduced CTC-based sequence alignment to enforce token consistency, its tightly coupled clustering and alignment training recipe limits scalability. We propose wav2tok 2.0, a scalable alignment-aware speech tokenizer built on the BEST-STD backbone. wav2tok 2.0 employs staged training, first learning discriminative, speaker-invariant representations via contrastive learning and vector quantization, and then enforcing pairwise token consistency using a CTC alignment loss and a novel DTW-aligned framewise prediction objective with adaptive weighting. Experiments show that wav2tok 2.0 consistently outperforms BEST-STD and general-purpose tokenizers on QbE-STD while remaining efficient and scalable.
fields
cs.SD 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
wav2tok 2.0: Scalable Audio Tokenization Maintaining Explicit Pairwise Token Alignment for Efficient Audio Retrieval
wav2tok 2.0 improves audio tokenization for query-by-example spoken term detection via staged training that first learns speaker-invariant representations then enforces pairwise token alignment.