wav2tok 2.0: Scalable Audio Tokenization Maintaining Explicit Pairwise Token Alignment for Efficient Audio Retrieval

Adhiraj Banerjee; Vipul Arora

arxiv: 2606.26824 · v1 · pith:QCZEL3XUnew · submitted 2026-06-25 · 💻 cs.SD · eess.AS

Pith reviewed 2026-06-26 02:44 UTC · model grok-4.3

keywords audio tokenization · spoken term detection · contrastive learning · CTC alignment · DTW alignment · vector quantization · query-by-example · speech representations

The pith

wav2tok 2.0 achieves scalable audio tokenization by staging contrastive learning before CTC and DTW alignment to preserve token consistency for query-by-example retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that decoupling contrastive representation learning from alignment training produces a tokenizer that scales better while keeping explicit pairwise token alignment across variable-length audio. It first builds discriminative speaker-invariant features through contrastive learning and vector quantization, then applies a CTC loss plus a DTW-aligned framewise prediction objective with adaptive weighting. This staged process is evaluated on query-by-example spoken term detection, where it outperforms earlier single-stage and general tokenizers. A reader would care because consistent discrete audio tokens enable efficient similarity-based retrieval without repeated full-sequence processing.

Core claim

wav2tok 2.0 is built on the BEST-STD backbone and uses staged training: first learning discriminative, speaker-invariant representations via contrastive learning and vector quantization, then enforcing pairwise token consistency with a CTC alignment loss and a novel DTW-aligned framewise prediction objective with adaptive weighting. Experiments demonstrate that this consistently outperforms BEST-STD and general-purpose tokenizers on QbE-STD while remaining efficient and scalable.

What carries the argument

Staged training that first performs contrastive learning and vector quantization, then applies CTC and DTW-based alignment losses.
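
The CTC half of this second stage can be illustrated with a small pure-Python sketch. It assumes, for illustration only, that the CTC term scores one utterance's framewise token posteriors against the repeat-collapsed token sequence of its paired utterance; this page does not spell out the exact formulation, and the function names, the blank index 0, and the toy posteriors are all hypothetical.

```python
import math


def ctc_collapse(frame_tokens, blank=0):
    """Merge consecutive repeated tokens, then drop blanks."""
    collapsed = []
    prev = None
    for tok in frame_tokens:
        if tok != prev and tok != blank:
            collapsed.append(tok)
        prev = tok
    return collapsed


def ctc_neg_log_likelihood(probs, target, blank=0):
    """CTC forward algorithm in the probability domain.

    probs[t][k] is the posterior of token k at frame t (assumed normalized).
    target is a collapsed token sequence without blanks.
    """
    # Interleave blanks: [blank, y1, blank, y2, ..., blank]
    ext = [blank]
    for tok in target:
        ext += [tok, blank]
    size = len(ext)

    # Initialization: a path may start with a blank or the first token.
    alpha = [0.0] * size
    alpha[0] = probs[0][blank]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different tokens.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid paths end on the last token or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0 else math.inf


# Toy pair: the positive's frame tokens give the target for the anchor.
positive_frame_tokens = [0, 3, 3, 0, 3, 5, 5]
target = ctc_collapse(positive_frame_tokens)
uniform_two_frames = [[0.5, 0.5], [0.5, 0.5]]
loss = ctc_neg_log_likelihood(uniform_two_frames, [1])
```

The probability-domain recursion is kept for readability; it underflows on long sequences, so a practical implementation would work in log space or use a framework-provided CTC loss.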

If this is right

The tokenizer produces more consistent tokens across variable-length utterances than prior methods.
It delivers higher accuracy on query-by-example spoken term detection tasks.
It scales to larger datasets without the training bottlenecks of tightly coupled clustering and alignment.
It keeps computational cost low enough for practical deployment in audio retrieval systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of stages could let practitioners swap in stronger contrastive backbones without retraining the alignment components from scratch.
Similar staging might help other sequence tokenization problems where discrimination and cross-sequence alignment both matter, such as video clip retrieval.

Load-bearing premise

Separating contrastive representation learning from the subsequent CTC and DTW alignment losses will improve scalability and token consistency without introducing inconsistencies or losing discriminative power.

What would settle it

A head-to-head comparison on a much larger audio corpus where the two-stage model either requires more total training compute or achieves lower retrieval accuracy than a version that trains clustering and alignment together from the start.

Figures

Figures reproduced from arXiv: 2606.26824 by Adhiraj Banerjee, Vipul Arora.

**Figure 1.** Stage II pairwise alignment framework combining CTC-based sequence alignment with a novel DTW-aligned framewise token prediction. … variable-length sequences, yielding a monotonic many-to-many correspondence between frames. Frame-level anchor–positive pairs are then formed by selecting, for each anchor frame t, the aligned positive frame with maximum cosine similarity. The encoder is trained using a SimCLR-s…
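
The alignment and positive-selection step described in the caption can be sketched as follows. This is a minimal pure-Python illustration, assuming a cosine-distance local cost and the standard three-step DTW recursion; the paper's actual cost, step pattern, and tie-breaking are not shown on this page, and `dtw_path` and `select_positives` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def dtw_path(anchor, positive):
    """Monotonic many-to-many alignment between two non-empty frame sequences."""
    n, m = len(anchor), len(positive)
    cost = [[1.0 - cosine(a, p) for p in positive] for a in anchor]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )
    # Backtrack from the end of both sequences to their start.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return path


def select_positives(anchor, positive):
    """For each anchor frame t, pick the DTW-aligned positive frame
    with maximum cosine similarity."""
    best = {}
    for t, j in dtw_path(anchor, positive):
        sim = cosine(anchor[t], positive[j])
        if t not in best or sim > best[t][1]:
            best[t] = (j, sim)
    return [best[t][0] for t in range(len(anchor))]


# Toy embeddings: a 2-frame anchor and a 3-frame positive.
anchor = [[1.0, 0.0], [0.0, 1.0]]
positive = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
pairs = select_positives(anchor, positive)
```

Here the first anchor frame is aligned to two positive frames and keeps the more similar one; the resulting index pairs are what a framewise contrastive or prediction objective would consume.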

Original abstract

Learning discrete speech representations that preserve similarity across variable-length utterances is central to query-by-example spoken term detection (QbE-STD). While wav2tok introduced CTC-based sequence alignment to enforce token consistency, its tightly coupled clustering and alignment training recipe limits scalability. We propose wav2tok 2.0, a scalable alignment-aware speech tokenizer built on the BEST-STD backbone. wav2tok 2.0 employs staged training, first learning discriminative, speaker-invariant representations via contrastive learning and vector quantization, and then enforcing pairwise token consistency using a CTC alignment loss and a novel DTW-aligned framewise prediction objective with adaptive weighting. Experiments show that wav2tok 2.0 consistently outperforms BEST-STD and general-purpose tokenizers on QbE-STD while remaining efficient and scalable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

wav2tok 2.0 splits training into contrastive then alignment stages to fix scalability, but the abstract gives no numbers or ablations so the consistency claim is untestable from what's shown.

The main point here is a staged training recipe that first runs contrastive learning plus vector quantization for discriminative representations, then adds CTC loss and a new DTW-aligned framewise prediction objective with adaptive weighting to enforce pairwise token consistency. This is positioned as a fix for the original wav2tok's tightly coupled clustering-and-alignment setup that limited scale.

What stands out as new is the explicit decoupling plus the DTW-based objective. The work sits on the BEST-STD backbone and cites the prior wav2tok paper directly, so the contribution is the modular training sequence rather than new core components. It does a reasonable job naming the scalability bottleneck and offering a practical separation that could let people train on bigger data without the old coupling constraints.

The soft spots are clear from the abstract alone. It claims consistent outperformance on QbE-STD over BEST-STD and general tokenizers but supplies zero metrics, no dataset sizes, no ablation tables, and no controls for whether the second stage preserves the first stage's speaker-invariant properties. The stress-test concern lands: without direct checks like pairwise token agreement rates or DTW distances on held-out pairs, there is no evidence that decoupling avoids reintroducing the inconsistencies the original method targeted. If those results exist in the full paper they need to be front and center; otherwise the scalability and alignment claims rest on unshown data.

This is aimed at people working on query-by-example spoken term detection and efficient audio retrieval systems. A reader who needs concrete training recipes for discrete representations in retrieval tasks could extract the staged procedure and try it, even if the current write-up leaves the performance claims provisional.

The paper shows clear engagement with the relevant components and literature. It deserves peer review because the staging idea targets a documented limitation and is specific enough to be checked against experiments, though any review will almost certainly ask for the missing quantitative details and ablations.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces wav2tok 2.0, a scalable alignment-aware speech tokenizer built on the BEST-STD backbone. It employs staged training: first learning discriminative speaker-invariant representations via contrastive learning and vector quantization, then enforcing pairwise token consistency using a CTC alignment loss and a novel DTW-aligned framewise prediction objective with adaptive weighting. The paper claims that this approach consistently outperforms BEST-STD and general-purpose tokenizers on QbE-STD while remaining efficient and scalable.

Significance. If the staged training successfully maintains explicit pairwise token alignment without loss of discriminative power from the contrastive stage, the method could address scalability limitations of tightly coupled recipes in prior work such as the original wav2tok, supporting more efficient audio retrieval.

major comments (2)

[Abstract] The claim that 'Experiments show that wav2tok 2.0 consistently outperforms BEST-STD and general-purpose tokenizers on QbE-STD' supplies no metrics, dataset details, ablation results, or statistical controls, so the central empirical claim cannot be evaluated from the given text.
[Abstract] No direct metric (e.g., pairwise token agreement rate or DTW distance on held-out pairs) is supplied to show that the second-stage CTC + DTW objectives preserve the first-stage representations' discriminative power or avoid reintroducing token inconsistencies.

minor comments (1)

The abstract refers to 'adaptive weighting' without indicating how the weights are computed or scheduled.
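
One concrete form of the token-level check requested above is Jaccard similarity over unigram and bigram token sets, which is the measure named in the paper's Results excerpt listed in the reference graph. The sketch below is a minimal illustration; whether repeated tokens are collapsed before forming the sets is not stated on this page, so the raw sequences are used as-is, and all function names are hypothetical.

```python
def ngram_set(tokens, n):
    """Set of contiguous n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def token_consistency(tokens_x, tokens_y):
    """Unigram and bigram Jaccard similarity between the token
    sequences of two utterances of the same term."""
    return {
        n: jaccard(ngram_set(tokens_x, n), ngram_set(tokens_y, n))
        for n in (1, 2)
    }


# Toy token sequences for two renditions of the same spoken term.
scores = token_consistency([4, 4, 7, 2], [4, 7, 7, 2])
```

In this toy pair the unigram sets match exactly while only half of the bigrams are shared, which shows why the bigram score is the more sensitive of the two to alignment differences.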

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these targeted comments on the abstract. Both points are well-taken and point to opportunities to make the abstract more self-contained. We will revise the abstract accordingly while preserving its brevity.

Referee: [Abstract] The claim that 'Experiments show that wav2tok 2.0 consistently outperforms BEST-STD and general-purpose tokenizers on QbE-STD' supplies no metrics, dataset details, ablation results, or statistical controls, so the central empirical claim cannot be evaluated from the given text.

Authors: We agree that the abstract would be strengthened by quantitative detail. In the revised version we will insert concise performance figures (e.g., mAP on the primary QbE-STD benchmark), name the evaluation corpora, and add a parenthetical reference to the ablation tables in Section 4. This change will allow readers to assess the central claim directly from the abstract without altering its length substantially.

Revision: yes

Referee: [Abstract] No direct metric (e.g., pairwise token agreement rate or DTW distance on held-out pairs) is supplied to show that the second-stage CTC + DTW objectives preserve the first-stage representations' discriminative power or avoid reintroducing token inconsistencies.

Authors: We acknowledge the absence of an explicit preservation metric in the abstract. The body of the paper reports that the staged training maintains downstream retrieval performance, but a direct token-level consistency measure is not highlighted in the abstract. We will add a short clause reporting the pairwise token agreement rate (or DTW distance) computed on held-out pairs after the second stage, thereby providing the requested evidence of representation stability.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical staged training with experimental validation

The paper proposes wav2tok 2.0 as a staged-training extension of the BEST-STD backbone, first performing contrastive + VQ learning then adding CTC + DTW objectives. All central claims rest on reported QbE-STD experimental outcomes rather than any derivation, equation, or fitted quantity that reduces to its own inputs by construction. The reference to the original wav2tok supplies background context only and does not serve as a load-bearing uniqueness theorem or ansatz. No self-definitional, fitted-input-called-prediction, or renaming patterns appear.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; insufficient detail to populate the ledger.

pith-pipeline@v0.9.1-grok · 5673 in / 1056 out tokens · 53554 ms · 2026-06-26T02:44:53.496739+00:00 · methodology

Reference graph

Works this paper leans on

44 extracted references · 9 canonical work pages · 5 internal anchors

[1]

Introduction Query-by-example spoken term detection (QbE-STD) aims to retrieve utterances from large audio archives that contain a given spoken query, operating directly on audio signals rather than text, and is central to applications such as audio indexing, podcast retrieval, and voice search [1, 2, 3]. Early approaches relied on ASR-based represent...
[2]

wav2tok 2.0: Scalable Audio Tokenization Maintaining Explicit Pairwise Token Alignment for Efficient Audio Retrieval

Method 2.1. Encoder and Tokenization Backbone wav2tok 2.0 adopts the BEST-STD [25] architecture as a backbone for scalable representation learning. The encoder $f_\theta$ maps an input utterance $x$ to a sequence of frame-level embeddings $Z = \{z_t\}_{t=1}^{T}$ using a spectrogram frontend followed by a bidirectional Mamba-based [31] state-space model. The final encoder ...

2026
[3]

Evaluation Framework We adopt the same indexing, retrieval, and evaluation framework as BEST-STD [25] to ensure a controlled comparison

Experiments 3.1. Evaluation Framework We adopt the same indexing, retrieval, and evaluation framework as BEST-STD [25] to ensure a controlled comparison. Each audio track in the speech archive is segmented into overlapping fixed-length segments of duration $l$ with hop size $h$. Each segment is tokenized into a discrete token sequence $q_{ij} = \{q_1, \ldots, q_T\}$,...
[4]

Results 4.1. Pairwise Token Consistency Analysis We evaluate pairwise token consistency using Jaccard similarity over unigram and bigram token sets, where bigrams partially capture local order information and are more sensitive to alignment-preserving tokenizations. As shown in Table 1, general-purpose tokenizers such as HuBERT [21], WavLM [22], Speec...
[5]

Conclusion We propose wav2tok 2.0, a scalable retrieval-oriented speech tokenizer that makes explicit pairwise alignment a first-class training signal. Built on BEST-STD [25], wav2tok 2.0 adds CTC-based sequence alignment with a novel DTW-aligned framewise prediction objective, yielding more stable (especially bigram-consistent) tokenizations and consist...
[6]

LLM is used only to aid or polish writing and does not impact the core methodology, scientific rigorousness, or originality of the research

Declaration of LLM Usage. LLM is used only to aid or polish writing and does not impact the core methodology, scientific rigorousness, or originality of the research
[7]

An audio indexing system for election video material

C. Alberti, M. Bacchiani, A. Bezman, C. Chelba, A. Drofa, H. Liao, P. Moreno, T. Power, A. Sahuguet, M. Shugrina et al., “An audio indexing system for election video material,” in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2009, pp. 4873–4876

2009
[8]

Podcastle: collaborative training of acoustic models on the basis of wisdom of crowds for podcast transcription

J. Ogata and M. Goto, “Podcastle: collaborative training of acoustic models on the basis of wisdom of crowds for podcast transcription,” in Interspeech, 2009, pp. 1491–1494

2009
[9]

An introduction to voice search

Y.-Y. Wang, D. Yu, Y.-C. Ju, and A. Acero, “An introduction to voice search,” IEEE Signal Processing Magazine, vol. 25, no. 3, pp. 28–38, 2008

2008
[10]

Vocabulary independent spoken term detection

J. Mamou, B. Ramabhadran, and O. Siohan, “Vocabulary independent spoken term detection,” in Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 2007, pp. 615–622

2007
[11]

Rapid and accurate spoken term detection

D. R. Miller, M. Kleber, C.-L. Kao, O. Kimball, T. Colthurst, S. A. Lowe, R. M. Schwartz, and H. Gish, “Rapid and accurate spoken term detection,” in Interspeech, vol. 7, 2007, pp. 314–317

2007
[12]

A comparison of phone and grapheme-based spoken term detection

D. Wang, J. Frankel, J. Tejedor, and S. King, “A comparison of phone and grapheme-based spoken term detection,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2008, pp. 4969–4972

2008
[13]

Lattice-based search for spoken utterance retrieval

M. Saraclar and R. Sproat, “Lattice-based search for spoken utterance retrieval,” in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, 2004, pp. 129–136

2004
[14]

Lattice indexing for spoken term detection

D. Can and M. Saraclar, “Lattice indexing for spoken term detection,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 8, pp. 2338–2347, 2011

2011
[15]

Neural network based end-to-end query by example spoken term detection

D. Ram, L. Miculicich, and H. Bourlard, “Neural network based end-to-end query by example spoken term detection,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1416–1427, 2020

2020
[16]

CNN based query by example spoken term detection

D. Ram, L. Miculicich, and H. Bourlard, “CNN based query by example spoken term detection,” in Interspeech, 2018, pp. 92–96

2018
[17]

Segmental DTW: A parallelizable alternative to dynamic time warping

T. Tsai, “Segmental DTW: A parallelizable alternative to dynamic time warping,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 106–110

2021
[18]

Unsupervised learning of audio segment representations using sequence-to-sequence recurrent neural networks

Y.-A. Chung, C.-C. Wu, C.-H. Shen, H.-Y. Lee, and L.-S. Lee, “Unsupervised learning of audio segment representations using sequence-to-sequence recurrent neural networks,” in Proc. Interspeech, 2016, pp. 765–769

2016
[19]

Multi-view Recurrent Neural Acoustic Word Embeddings

W. He, W. Wang, and K. Livescu, “Multi-view recurrent neural acoustic word embeddings,” arXiv preprint arXiv:1611.04496, 2016

2016
[20]

Deep convolutional acoustic word embeddings using word-pair side information

H. Kamper, W. Wang, and K. Livescu, “Deep convolutional acoustic word embeddings using word-pair side information,” in 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2016, pp. 4950–4954

2016
[21]

Phonetic-and-semantic embedding of spoken words with applications in spoken content retrieval

Y.-C. Chen, S.-F. Huang, C.-H. Shen, H.-Y. Lee, and L.-S. Lee, “Phonetic-and-semantic embedding of spoken words with applications in spoken content retrieval,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 941–948

2018
[22]

Acoustic span embeddings for multilingual query-by-example search

Y. Hu, S. Settle, and K. Livescu, “Acoustic span embeddings for multilingual query-by-example search,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 935–942

2021
[23]

Enc-dec RNN acoustic word embeddings learned via pairwise prediction

A. Banerjee and V. Arora, “Enc-dec RNN acoustic word embeddings learned via pairwise prediction,” in Proc. Interspeech 2023, 2023, pp. 1478–1482

2023
[24]

Attention-based audio embeddings for query-by-example

A. Singh, K. Demuynck, and V. Arora, “Attention-based audio embeddings for query-by-example,” arXiv preprint arXiv:2210.08624, 2022

2022
[25]

Simultaneously learning robust audio embeddings and balanced hash codes for query-by-example

A. Singh, K. Demuynck, and V. Arora, “Simultaneously learning robust audio embeddings and balanced hash codes for query-by-example,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023
[26]

FlowHash: Accelerating audio search with balanced hashing via normalizing flow

A. Singh, K. Demuynck, and V. Arora, “FlowHash: Accelerating audio search with balanced hashing via normalizing flow,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

2024
[27]

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units

W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM transactions on audio, speech, and language processing, vol. 29, pp. 3451–3460, 2021

2021
[28]

WavLM: Large-scale self-supervised pre-training for full stack speech processing

S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

2022
[29]

SpeechTokenizer: Unified speech tokenizer for speech large language models

X. Zhang, D. Zhang, S. Li, Y. Zhou, and X. Qiu, “SpeechTokenizer: Unified speech tokenizer for speech large language models,” arXiv preprint arXiv:2308.16692, 2023

2023
[30]

High Fidelity Neural Audio Compression

A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022

2022
[31]

BEST-STD: Bidirectional Mamba-enhanced speech tokenization for spoken term detection

A. Singh, K. Demuynck, and V. Arora, “BEST-STD: Bidirectional Mamba-enhanced speech tokenization for spoken term detection,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

2025
[32]

Language-agnostic speech tokenizer for spoken term detection with efficient retrieval

A. Singh, K. Demuynck, and V. Arora, “Language-agnostic speech tokenizer for spoken term detection with efficient retrieval,” in Proc. Interspeech 2025, 2025, pp. 2630–2634

2025
[33]

BEST-STD2.0: Balanced and efficient speech tokenizer for spoken term detection

A. Singh, K. Demuynck, and V. Arora, “BEST-STD2.0: Balanced and efficient speech tokenizer for spoken term detection,” arXiv preprint arXiv:2512.16395, 2025

2025
[34]

Sinkhorn distances: Lightspeed computation of optimal transport

M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” Advances in neural information processing systems, vol. 26, 2013

2013
[35]

wav2tok: Deep sequence tokenizer for audio retrieval

A. Banerjee and V. Arora, “wav2tok: Deep sequence tokenizer for audio retrieval,” in The Eleventh International Conference on Learning Representations, 2022

2022
[36]

Connectionist temporal classification

A. Graves, “Connectionist temporal classification,” in Supervised sequence labelling with recurrent neural networks. Springer, 2012, pp. 61–93

2012
[37]

Efficiently Modeling Long Sequences with Structured State Spaces

A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” arXiv preprint arXiv:2111.00396, 2021

2021
[38]

A simple framework for contrastive learning of visual representations

T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607

2020
[39]

Librispeech: an ASR corpus based on public domain audio books

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2015, pp. 5206–5210

2015
[40]

TIMIT acoustic-phonetic continuous speech corpus

J. S. Garofolo, L. F. Lamel, W. M. Fisher, D. S. Pallett, N. L. Dahlgren, V. Zue, and J. G. Fiscus, “TIMIT acoustic-phonetic continuous speech corpus,” 1993

1993
[41]

BUT/Phonexia bottleneck feature extractor

A. Silnova, P. Matejka, O. Glembek, O. Plchot, O. Novotný, F. Grezl, P. Schwarz, L. Burget, and J. Cernocký, “BUT/Phonexia bottleneck feature extractor,” in Odyssey, 2018, pp. 283–287

2018
[42]

Exploiting phone log-likelihood ratio features for the detection of the native language of non-native English speakers

A. Abad, E. Ribeiro, F. N. Kepler, R. F. Astudillo, and I. Trancoso, “Exploiting phone log-likelihood ratio features for the detection of the native language of non-native English speakers,” in INTERSPEECH, 2016, pp. 2413–2417

2016
[43]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv preprint arXiv:2311.07919, 2023

2023
[44]

WavLLM: Towards robust and adaptive speech large language model

S. Hu, L. Zhou, S. Liu, S. Chen, L. Meng, H. Hao, J. Pan, X. Liu, J. Li, S. Sivasankaran et al., “WavLLM: Towards robust and adaptive speech large language model,” arXiv preprint arXiv:2404.00656, 2024

2024

[1] [1]

Introduction Query-by-example spoken term detection (QbE-STD) aims to retrieve utterances from large audio archives that contain a given spoken query, operating directly on audio signals rather than text, and is central to applications such as audio index- ing, podcast retrieval, and voice search [1, 2, 3]. Early ap- proaches relied on ASR-based represent...

[2] [2]

wav2tok 2.0: Scalable Audio Tokenization Maintaining Explicit Pairwise Token Alignment for Efficient Audio Retrieval

Method 2.1. Encoder and Tokenization Backbone wav2tok 2.0 adopts the BEST-STD [25] architecture as a back- bone for scalable representation learning. The encoderf θ maps an input utterancexto a sequence of frame-level embeddings Z={z t}T t=1 using a spectrogram frontend followed by a bidi- rectional Mamba-based [31] state-space model. The final en- coder ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

Evaluation Framework We adopt the same indexing, retrieval, and evaluation frame- work as BEST-STD [25] to ensure a controlled comparison

Experiments 3.1. Evaluation Framework We adopt the same indexing, retrieval, and evaluation frame- work as BEST-STD [25] to ensure a controlled comparison. Each audio track in the speech archive is segmented into over- lapping fixed-length segments of durationlwith hop sizeh. Each segment is tokenized into a discrete token sequenceqij = {q1, . . . , qT },...

[4] [4]

Results 4.1. Pairwise Token Consistency Analysis We evaluate pairwise token consistency using Jaccard sim- ilarity over unigram and bigram token sets, where bigrams partially capture local order information and are more sen- sitive to alignment-preserving tokenizations. As shown in Table 1, general-purpose tokenizers such as HuBERT [21], WavLM [22], Speec...

[5] [5]

Conclusion We proposewav2tok 2.0, a scalable retrieval-oriented speech tokenizer that makes explicit pairwise alignment a first-class training signal. Built on BEST-STD [25], wav2tok 2.0 adds CTC-based sequence alignment with a novel DTW-aligned framewise prediciton objective, yielding more stable (espe- cially bigram-consistent) tokenizations and consist...

[6] [6]

LLM is used only to aid or polish writing and does not impact the core methodology, scientific rigorousness, or originality of the research

Declaration of LLM Usage. LLM is used only to aid or polish writing and does not impact the core methodology, scientific rigorousness, or originality of the research

[7] [7]

An audio indexing system for election video material,

C. Alberti, M. Bacchiani, A. Bezman, C. Chelba, A. Drofa, H. Liao, P. Moreno, T. Power, A. Sahuguet, M. Shugrinaet al., “An audio indexing system for election video material,” in2009 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2009, pp. 4873–4876

2009

[8] [8]

Podcastle: collaborative training of acous- tic models on the basis of wisdom of crowds for podcast transcrip- tion

J. Ogata and M. Goto, “Podcastle: collaborative training of acous- tic models on the basis of wisdom of crowds for podcast transcrip- tion.” inInterspeech, 2009, pp. 1491–1494

2009

[9] [9]

An introduction to voice search,

Y .-Y . Wang, D. Yu, Y .-C. Ju, and A. Acero, “An introduction to voice search,”IEEE Signal Processing Magazine, vol. 25, no. 3, pp. 28–38, 2008

2008

[10] [10]

V ocabulary inde- pendent spoken term detection,

J. Mamou, B. Ramabhadran, and O. Siohan, “V ocabulary inde- pendent spoken term detection,” inProceedings of the 30th an- nual international ACM SIGIR conference on Research and de- velopment in information retrieval, 2007, pp. 615–622

2007

[11] [11]

Rapid and accurate spoken term detection

D. R. Miller, M. Kleber, C.-L. Kao, O. Kimball, T. Colthurst, S. A. Lowe, R. M. Schwartz, and H. Gish, “Rapid and accurate spoken term detection.” inInterspeech, vol. 7, 2007, pp. 314–317

2007

[12] [12]

A comparison of phone and grapheme-based spoken term detection,

D. Wang, J. Frankel, J. Tejedor, and S. King, “A comparison of phone and grapheme-based spoken term detection,” in2008 IEEE International Conference on Acoustics, Speech and Signal Pro- cessing. IEEE, 2008, pp. 4969–4972

2008

[13] [13]

Lattice-based search for spoken ut- terance retrieval,

M. Saraclar and R. Sproat, “Lattice-based search for spoken ut- terance retrieval,” inProceedings of the Human Language Tech- nology Conference of the North American Chapter of the Associa- tion for Computational Linguistics: HLT-NAACL 2004, 2004, pp. 129–136

2004

[14] [14]

Lattice indexing for spoken term detec- tion,

D. Can and M. Saraclar, “Lattice indexing for spoken term detec- tion,”IEEE Transactions on Audio, Speech, and Language Pro- cessing, vol. 19, no. 8, pp. 2338–2347, 2011

2011

[15] [15]

Neural network based end-to-end query by example spoken term detection,

D. Ram, L. Miculicich, and H. Bourlard, “Neural network based end-to-end query by example spoken term detection,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 28, pp. 1416–1427, 2020

2020

[16] [16]

Cnn based query by example spoken term detection

D. Ram, L. Miculicich, and H. Bourlard, “Cnn based query by example spoken term detection.” inInterspeech, 2018, pp. 92–96

2018

[17] [17]

Segmental dtw: A parallelizable alternative to dynamic time warping,

T. Tsai, “Segmental dtw: A parallelizable alternative to dynamic time warping,” inICASSP 2021-2021 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 106–110

2021

[18] [18]

Unsupervised learning of audio segment representations using sequence-to-sequence recurrent neural networks,

Y .-A. Chung, C.-C. Wu, C.-H. Shen, H.-Y . Lee, and L.-S. Lee, “Unsupervised learning of audio segment representations using sequence-to-sequence recurrent neural networks,” inProc. Inter- speech, 2016, pp. 765–769

2016

[19] [19]

Multi-view Recurrent Neural Acoustic Word Embeddings

W. He, W. Wang, and K. Livescu, “Multi-view recurrent neu- ral acoustic word embeddings,”arXiv preprint arXiv:1611.04496, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[20] H. Kamper, W. Wang, and K. Livescu, “Deep convolutional acoustic word embeddings using word-pair side information,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4950–4954

[21] Y.-C. Chen, S.-F. Huang, C.-H. Shen, H.-Y. Lee, and L.-S. Lee, “Phonetic-and-semantic embedding of spoken words with applications in spoken content retrieval,” in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 941–948

[22] Y. Hu, S. Settle, and K. Livescu, “Acoustic span embeddings for multilingual query-by-example search,” in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 935–942

[23] A. Banerjee and V. Arora, “Enc-dec RNN acoustic word embeddings learned via pairwise prediction,” in Proc. Interspeech 2023, 2023, pp. 1478–1482

[24] A. Singh, K. Demuynck, and V. Arora, “Attention-based audio embeddings for query-by-example,” arXiv preprint arXiv:2210.08624, 2022

[25] A. Singh, K. Demuynck, and V. Arora, “Simultaneously learning robust audio embeddings and balanced hash codes for query-by-example,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

[26] A. Singh, K. Demuynck, and V. Arora, “FlowHash: Accelerating audio search with balanced hashing via normalizing flow,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

[27] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021

[28] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

[29] X. Zhang, D. Zhang, S. Li, Y. Zhou, and X. Qiu, “SpeechTokenizer: Unified speech tokenizer for speech large language models,” arXiv preprint arXiv:2308.16692, 2023

[30] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022

[31] A. Singh, K. Demuynck, and V. Arora, “BEST-STD: Bidirectional mamba-enhanced speech tokenization for spoken term detection,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

[32] A. Singh, K. Demuynck, and V. Arora, “Language-agnostic speech tokenizer for spoken term detection with efficient retrieval,” in Proc. Interspeech 2025, 2025, pp. 2630–2634

[33] A. Singh, K. Demuynck, and V. Arora, “BEST-STD2.0: Balanced and efficient speech tokenizer for spoken term detection,” arXiv preprint arXiv:2512.16395, 2025

[34] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” Advances in Neural Information Processing Systems, vol. 26, 2013

[35] A. Banerjee and V. Arora, “wav2tok: Deep sequence tokenizer for audio retrieval,” in The Eleventh International Conference on Learning Representations, 2022

[36] A. Graves, “Connectionist temporal classification,” in Supervised Sequence Labelling with Recurrent Neural Networks. Springer, 2012, pp. 61–93

[37] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” arXiv preprint arXiv:2111.00396, 2021

[38] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International Conference on Machine Learning. PMLR, 2020, pp. 1597–1607

[39] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210

[40] J. S. Garofolo, L. F. Lamel, W. M. Fisher, D. S. Pallett, N. L. Dahlgren, V. Zue, and J. G. Fiscus, “TIMIT acoustic-phonetic continuous speech corpus,” 1993

[41] A. Silnova, P. Matejka, O. Glembek, O. Plchot, O. Novotný, F. Grezl, P. Schwarz, L. Burget, and J. Cernocký, “BUT/Phonexia bottleneck feature extractor,” in Odyssey, 2018, pp. 283–287

[42] A. Abad, E. Ribeiro, F. N. Kepler, R. F. Astudillo, and I. Trancoso, “Exploiting phone log-likelihood ratio features for the detection of the native language of non-native English speakers,” in INTERSPEECH, 2016, pp. 2413–2417

[43] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv preprint arXiv:2311.07919, 2023

[44] S. Hu, L. Zhou, S. Liu, S. Chen, L. Meng, H. Hao, J. Pan, X. Liu, J. Li, S. Sivasankaran et al., “WavLLM: Towards robust and adaptive speech large language model,” arXiv preprint arXiv:2404.00656, 2024