Bridging the Age Gap: Towards Detecting Neural Audio Codec Synthesized Elderly Speech Deepfake

Chi-Chun Lee; Girish; Mohd Mujtaba Akhtar; Orchid Chetia Phukan

arxiv: 2606.21735 · v1 · pith:CIPUTDICnew · submitted 2026-06-19 · 📡 eess.AS · cs.SD

Pith reviewed 2026-06-26 12:49 UTC · model grok-4.3

keywords elderly speech deepfake · neural audio codec · codec fake detection · multimodal foundation models · Jensen-Shannon Divergence · BONSAI framework · age gap detection · cross-modal pretraining

The pith

Fusing LanguageBind and ImageBind via BONSAI detects synthesized elderly speech deepfakes at 1.66 percent average EER.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Elderly CodecFake Detection task because existing detectors trained on general codec fake datasets fail to generalize to elderly speech. It releases the Elderly-CodecFake dataset covering English and Chinese voices to support the new task. The authors hypothesize and test that multimodal foundation models benefit from cross-modal pretraining exposure to elderly content, then show that fusing LanguageBind and ImageBind with Jensen-Shannon Divergence in the BONSAI framework reaches 1.66 percent average equal error rate while beating both single models and prior baselines.

Core claim

The paper claims that multimodal foundation models such as LanguageBind and ImageBind detect neural audio codec synthesized elderly speech deepfakes more effectively because they were exposed to elderly content during cross-modal pretraining. It further claims that fusing them through the BONSAI framework with Jensen-Shannon Divergence achieves an average EER of 1.66 percent, outperforming individual foundation models and competitive state-of-the-art baselines and setting a new benchmark for the ECFD task.
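
Since the headline number is an equal error rate, a small sketch of how an EER can be computed from detector scores may help readers interpret it. This is a generic illustration, not the paper's evaluation code: the function name `equal_error_rate`, the toy scores, and the convention that a higher score means "more likely bona fide" are all assumptions made here.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    Assumed convention: higher score = more likely bona fide (real) speech.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: bona fide samples scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed samples scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    real = [0.9, 0.6, 0.4, 0.8]
    fake = [0.1, 0.5, 0.7, 0.2]
    print(f"EER = {100 * equal_error_rate(real, fake):.2f}%")
```

An EER of 1.66 percent would mean that, at the operating point where both error types are equal, roughly 1.66 percent of real utterances are rejected and 1.66 percent of codec fakes are accepted.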

What carries the argument

BONSAI framework, which fuses outputs from multimodal foundation models using Jensen-Shannon Divergence to produce a single detection score for elderly codec fakes.
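
For readers unfamiliar with the quantity involved, the following is a minimal sketch of the textbook Jensen-Shannon Divergence between two discrete distributions. It is not the BONSAI architecture: this page does not specify how BONSAI applies JSD inside the fusion, so the function names and the idea of comparing two softmax-normalised vectors are illustrative assumptions only.

```python
import math


def softmax(logits):
    """Turn a list of real-valued scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon Divergence: symmetric, bounded above by ln(2)."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


if __name__ == "__main__":
    # Hypothetical projections from two foundation models (values invented).
    view_a = softmax([2.0, 0.5, -1.0])
    view_b = softmax([1.5, 0.7, -0.5])
    print(js_divergence(view_a, view_b))
```

Unlike plain KL divergence, JSD is symmetric and stays finite even when the two distributions do not share support, which is one common reason it is chosen for comparing or aligning two model views.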

If this is right

Standard codec fake detectors require age-specific adaptation or retraining to handle elderly voices reliably.
Cross-modal pretraining that includes elderly audio-visual pairs improves downstream detection of demographic subgroups.
Jensen-Shannon Divergence fusion of multiple foundation models provides a practical route to lower error rates on this task.
The released ECF dataset in English and Chinese enables systematic study of age-related vulnerabilities in audio deepfake detection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion approach could be tested on other demographic gaps such as child or accented speech to check broader applicability.
If pretraining exposure drives the gain, periodically refreshing foundation models with newer elderly speech data might produce further error reductions.
BONSAI could be evaluated on live elderly voice authentication systems to measure real-world false acceptance rates.

Load-bearing premise

The superior performance of LanguageBind and ImageBind on elderly speech arises specifically from their exposure to elderly content during cross-modal pretraining.

What would settle it

An experiment that measures LanguageBind and ImageBind performance on elderly codec fakes after removing or masking elderly content from their pretraining data and finds no performance gap relative to other models would falsify the core hypothesis.

Figures

Figures reproduced from arXiv: 2606.21735 by Chi-Chun Lee, Girish, Mohd Mujtaba Akhtar, Orchid Chetia Phukan.

**Figure 1.** t-SNE plots of (a) real speech and (b) CF speech samples from the TIS corpus [28]. Young and elderly speech separate clearly in both the real and CF scenarios.

**Figure 2.** Proposed framework: BONSAI. JSD stands for Jensen-Shannon Divergence.

**Figure 3.** t-SNE plots for (a) Wav2vec2 and (b) LanguageBind.

| Pairs | Concatenation E1 | Concatenation E2 (Elderly) | Concatenation Avg | BONSAI E1 | BONSAI E2 (Elderly) | BONSAI Avg |
|---|---|---|---|---|---|---|
| W2v2 + WL | 9.36 | 9.01 | 9.18 | 7.54 | 6.97 | 7.26 |
| W2v2 + Wh | 7.17 | 6.63 | 6.90 | 6.24 | 5.64 | 5.94 |
| W2v2 + IB | 5.20 | 4.79 | 5.00 | 4.87 | 3.22 | 4.05 |
| W2v2 + LB | 4.47 | 4.22 | 4.35 | 4.00 | 3.76 | 3.88 |
| WL + Wh | 7.94 | 7.35 | 7.65 | 5.68 | 5.17 | 5.43 |
| WL + IB | 5.12 | 4.72 | 4.92 | 4.36 | 4.09 | 4.23 |
| WL + LB | 4.86 | 4.43 | 4.65 | 3.92 | 3.56 | 3.74 |
| Wh + IB | 4.51 | 4.02 | 4.27 | 3.2… | | |
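
As a quick sanity check on the pairwise numbers captured alongside Figure 3, the sketch below verifies two things for the seven complete rows: that each Avg value matches the mean of E1 and E2 up to two-decimal rounding, and that the BONSAI Avg is lower than the Concatenation Avg. The column interpretation (three Concatenation values followed by three BONSAI values per row) is an assumption inferred from the header order, the truncated Wh + IB row is left out, and the helper names are hypothetical.

```python
# Each entry: (E1, E2, Avg) under concatenation, then (E1, E2, Avg) under BONSAI.
# Values are copied from the complete rows listed with Figure 3.
ROWS = {
    "W2v2 + WL": ((9.36, 9.01, 9.18), (7.54, 6.97, 7.26)),
    "W2v2 + Wh": ((7.17, 6.63, 6.90), (6.24, 5.64, 5.94)),
    "W2v2 + IB": ((5.20, 4.79, 5.00), (4.87, 3.22, 4.05)),
    "W2v2 + LB": ((4.47, 4.22, 4.35), (4.00, 3.76, 3.88)),
    "WL + Wh": ((7.94, 7.35, 7.65), (5.68, 5.17, 5.43)),
    "WL + IB": ((5.12, 4.72, 4.92), (4.36, 4.09, 4.23)),
    "WL + LB": ((4.86, 4.43, 4.65), (3.92, 3.56, 3.74)),
}


def avg_is_mean(e1, e2, avg, tol=0.006):
    """True if avg equals the mean of e1 and e2 up to two-decimal rounding."""
    return abs((e1 + e2) / 2 - avg) <= tol


def check_rows(rows):
    """Return (all averages consistent, BONSAI average lower in every row)."""
    averages_ok = all(
        avg_is_mean(*concat) and avg_is_mean(*bonsai)
        for concat, bonsai in rows.values()
    )
    bonsai_lower = all(bonsai[2] < concat[2] for concat, bonsai in rows.values())
    return averages_ok, bonsai_lower


if __name__ == "__main__":
    print(check_rows(ROWS))
```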

Original abstract

In this study, we introduce the Elderly CodecFake Detection (ECFD) task and release the Elderly-CodecFake (ECF) dataset in English and Chinese. We show that state-of-the-art CF detectors trained on previous benchmark CF datasets generalize poorly to elderly speech, revealing a critical vulnerability. We further hypothesize and demonstrate that multimodal foundation models (FMs) such as LanguageBind (LB) and ImageBind (IB) are more effective for ECFD due to their exposure to elderly content during cross-modal pretraining. Motivated by prior evidence that fusion of FMs enhances downstream performance, we explore fusion of FMs for ECFD. To this end, we propose BONSAI, a novel framework that employs Jensen-Shannon Divergence as the fusion mechanism. BONSAI with the fusion of LB and IB achieves an average EER (%) of 1.66 and outperforms individual FMs as well as competitive SOTA baselines, establishing a new benchmark for the ECFD task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

New task and dataset for elderly codec deepfake detection, with a fusion method that beats baselines, but the key claim about pretraining exposure lacks controls.

The main things here are a new task called ECFD and the ECF dataset covering English and Chinese elderly speech synthesized by neural codecs. Existing detectors trained on younger or general data do not transfer well, which is a practical point worth noting for anyone building audio security tools.

What the work actually adds is the demonstration that prior CF detectors fail on this demographic slice, plus the BONSAI fusion approach that combines LanguageBind and ImageBind via Jensen-Shannon divergence to reach 1.66% average EER. Releasing the dataset in two languages is concrete and reusable. The fusion itself is a straightforward application of existing multimodal models rather than a new architecture.

The soft spot is the explanation offered for why the multimodal models succeed. The abstract ties the gains to exposure to elderly content during cross-modal pretraining, yet nothing in the reported experiments isolates that factor—no matched models without elderly data, no representation probes, and no ablation that removes age-related examples. The performance numbers could just as easily come from generic multimodal properties or the divergence mechanism. Dataset construction details and evaluation protocol are also thin in the abstract, so the generalization claim rests on whatever is in the full text.

This is for researchers focused on audio deepfake robustness and demographic gaps in detection systems. The dataset release gives it some lasting value even if the causal story needs tightening. It is coherent enough on its own terms to deserve referee time rather than a desk reject.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces the Elderly CodecFake Detection (ECFD) task and releases the Elderly-CodecFake (ECF) dataset in English and Chinese. It reports that state-of-the-art codec-fake detectors trained on prior benchmarks generalize poorly to elderly speech. The authors hypothesize that multimodal foundation models (LanguageBind, ImageBind) outperform prior detectors specifically because of elderly content exposure during cross-modal pretraining, and propose BONSAI, a fusion framework using Jensen-Shannon divergence, which achieves 1.66% average EER and sets a new benchmark.

Significance. If the empirical results prove robust and reproducible, the work identifies a genuine generalization gap in current deepfake detection for elderly speech and supplies a new public dataset, both of which are valuable contributions. The fusion mechanism itself may be of broader interest, but the causal attribution of gains to elderly pretraining exposure is not yet isolated.

major comments (2)

[Abstract] The central explanatory claim, that LB and IB outperform prior detectors 'due to their exposure to elderly content during cross-modal pretraining', is not supported by any isolating experiment. No ablation removes elderly examples from pretraining, no comparison is made to otherwise-matched multimodal models lacking such exposure, and no representation probing for age-related features is reported. Consequently the reported 1.66% EER cannot be attributed to the hypothesized factor rather than generic multimodal fusion or the JS-divergence mechanism.
[Abstract] No information is supplied on ECF dataset construction: elderly speech sources, synthesis pipelines, speaker demographics, train/test splits, or language-specific balancing. Without these details the generalization claims and the numerical benchmark cannot be verified or reproduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the presentation of our hypotheses and improve reproducibility. We respond to each major comment below and will incorporate revisions in the next manuscript version.

Referee: [Abstract] The central explanatory claim, that LB and IB outperform prior detectors 'due to their exposure to elderly content during cross-modal pretraining', is not supported by any isolating experiment. No ablation removes elderly examples from pretraining, no comparison is made to otherwise-matched multimodal models lacking such exposure, and no representation probing for age-related features is reported. Consequently the reported 1.66% EER cannot be attributed to the hypothesized factor rather than generic multimodal fusion or the JS-divergence mechanism.

Authors: We agree that the manuscript does not contain isolating experiments (e.g., ablating elderly data from pretraining or probing representations) that would causally link performance gains exclusively to elderly content exposure. The original text presents this as a hypothesis motivated by known pretraining characteristics of LB/IB and the observed generalization gap to elderly speech. In revision we will rephrase the abstract and relevant sections to describe the advantage as a plausible hypothesis supported by indirect evidence, while explicitly noting that gains could also arise from multimodal fusion or the JS-divergence mechanism. Performing the suggested controlled ablations is not feasible without access to the original pretraining corpora. revision: yes
Referee: [Abstract] No information is supplied on ECF dataset construction: elderly speech sources, synthesis pipelines, speaker demographics, train/test splits, or language-specific balancing. Without these details the generalization claims and the numerical benchmark cannot be verified or reproduced.

Authors: The full manuscript (Section 3) provides the requested details on elderly speech sources, codec synthesis pipelines, speaker demographics (age 60+, gender balance), speaker-disjoint train/test splits, and language balancing for English and Chinese. To improve accessibility and address the referee's concern directly, we will add a concise summary of dataset construction and key statistics to the abstract in the revised version. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements on released dataset with no derivations or self-referential reductions

The paper introduces an ECFD task and ECF dataset, reports generalization failure of prior detectors, hypothesizes superior performance of LB/IB due to pretraining exposure, and presents BONSAI fusion results (EER 1.66) as direct experimental outcomes. No equations, fitted parameters renamed as predictions, self-citations as load-bearing premises, or ansatzes appear in the provided text. The central performance claim is a measurement on held-out data rather than a quantity forced by construction from inputs. The explanatory hypothesis about elderly content exposure is not isolated experimentally, but this is a limitation of evidence strength, not a circular reduction of the reported result to its own definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Central claim rests on the assumption that the new ECF dataset is representative of real-world elderly codec-synthesized speech and that pretraining exposure in multimodal models transfers to this detection task; no free parameters, axioms, or invented entities are stated in the abstract.

Reference graph

Works this paper leans on

55 extracted references · 5 linked inside Pith

[1]

Current text-to-speech (TTS) and voice-conversion (VC) models can generate speech utterances in nearly human-level realism

Introduction In recent years, the boundary between genuine and speech deepfakes has become increasingly blurred. Current text-to-speech (TTS) and voice-conversion (VC) models can generate speech utterances in nearly human-level realism. While these capabilities support valuable applications in assistive communication, human–computer interaction, and e...

Pith/arXiv arXiv 2026
[2]

All data are obtained from publicly available corpora

Elderly Codecfake Dataset This section first describes the sources of real elderly speech data, followed by an overview of the NACs employed in this study. All data are obtained from publicly available corpora. Finally, we detail the pipeline used to generate ECF dataset. Real Elderly Speech Source: SeniorTalk (E1) [31]: It is a Mandarin conversational spe...
[3]

[21], one of the foundational CF detection work

and Lu et al. [21], one of the foundational CF detection work. The procedure converts real elderly speech into NAC-generated counterparts using multiple NACs that constitute the core of modern ALM systems. We start from publicly available elderly speech datasets described above, where each original recording is treated as a real reference sample. For cr...
[4]

We then detail the proposed framework, BONSAI

Methodology This section presents the FMs employed in our study, followed by the downstream modeling approaches. We then detail the proposed framework, BONSAI. 3.1. Foundation Models The FMs considered are SOTA in their respective benchmarks. Multimodal FMs: We select LanguageBind (LB) [40] and ImageBind (IB) [41] as multimodal FMs. IB maps diverse modal...
[5]

Experiments 4.1. Training Details We train the models by combining the training sets from SeniorTalk and TIS, while validation and testing are performed separately on the respective validation and test splits of each individual dataset. All models are trained for 20 epochs using a learning rate of 1e-3 and a batch size of 32. We employ the Adam optimize...
[6]

Our analysis demonstrated that existing SOTA CF detection models trained on prior benchmark datasets generalize poorly to elderly speech, revealing a critical robustness gap

Conclusion In this work, we introduced the ECFD task and the ECF dataset comprising English and Chinese speech. Our analysis demonstrated that existing SOTA CF detection models trained on prior benchmark datasets generalize poorly to elderly speech, revealing a critical robustness gap. Furthermore, we showed that multimodal FMs, such as LB and IB, pro...
[7]

Acknowledgement This work was supported by National Science and Technology Council (NSTC), Taiwan (grant#: 115-2634-F-002-012)
[8]

These tools did not contribute to the development of scientific concepts, data analysis, generation of results, or interpretation of findings

Generative AI Use Disclosure AI Assistants were utilized exclusively to enhance grammatical accuracy, clarity, and the overall readability of the manuscript. These tools did not contribute to the development of scientific concepts, data analysis, generation of results, or interpretation of findings. The authors assume full responsibility for the accuracy ...
[9]

Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,

Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, “Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in INTERSPEECH 2015 16th Annual Conference of the International Speech Communication Association. International Speech Communication Association, 2015, pp. 2037–2041

2015
[10]

The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,

T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, “The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” 2017

2017
[11]

Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,

X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V. Vestman, T. Kinnunen, K. A. Lee et al., “Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language, vol. 64, p. 101114, 2020

2019
[12]

Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,

X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch et al., “Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507–2522, 2023

2021
[13]

Asvspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,

X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen et al., “Asvspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,” arXiv preprint arXiv:2408.08739, 2024

arXiv 2024
[14]

Combining evidences from mel cepstral, cochlear filter cepstral and instantaneous frequency features for detection of natural vs. spoofed speech

T. B. Patel and H. A. Patil, “Combining evidences from mel cepstral, cochlear filter cepstral and instantaneous frequency features for detection of natural vs. spoofed speech.” in Interspeech, 2015, pp. 2062–2066

2015
[15]

Cochlear filter and instantaneous frequency based features for spoofed speech detection,

T. B. Patel and H. A. Patil, “Cochlear filter and instantaneous frequency based features for spoofed speech detection,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 618–631, 2016

2016
[16]

Toward robust audio spoofing detection: A detailed comparison of traditional and learned features,

B. Balamurali, K. E. Lin, S. Lui, J.-M. Chen, and D. Herremans, “Toward robust audio spoofing detection: A detailed comparison of traditional and learned features,” IEEE Access, vol. 7, pp. 84 229–84 241, 2019

2019
[17]

Spoofing speech detection using temporal convolutional neural network,

X. Tian, X. Xiao, E. S. Chng, and H. Li, “Spoofing speech detection using temporal convolutional neural network,” in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2016, pp. 1–6

2016
[18]

A light convolutional gru-rnn deep feature extractor for asv spoofing detection,

A. Gomez-Alanis, A. M. Peinado, J. A. Gonzalez, and A. M. Gomez, “A light convolutional gru-rnn deep feature extractor for asv spoofing detection,” in Proc. Interspeech, vol. 2019, 2019, pp. 1068–1072

2019
[19]

Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,

J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2022, pp. 6367–6371

2022
[20]

The vicomtech audio deepfake detection system based on wav2vec2 for the 2022 add challenge,

J. M. Martín-Doñas and A. Álvarez, “The vicomtech audio deepfake detection system based on wav2vec2 for the 2022 add challenge,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9241–9245

2022
[21]

Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,

H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in The Speaker and Language Recognition Workshop (Odyssey 2022). ISCA, 2022

2022
[22]

Improved DeepFake Detection Using Whisper Features,

P. Kawa, M. Plata, M. Czuba, P. Szymański, and P. Syga, “Improved DeepFake Detection Using Whisper Features,” in Interspeech 2023, 2023, pp. 4009–4013

2023
[23]

Audio deepfake detection with self-supervised wavlm and multi-fusion attentive classifier,

Y. Guo, H. Huang, X. Chen, H. Zhao, and Y. Wang, “Audio deepfake detection with self-supervised wavlm and multi-fusion attentive classifier,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 702–12 706

2024
[24]

Efficient audio deepfake detection using wavlm with early exiting,

A. Pimentel, Y. Zhu, H. R. Guimarães, and T. H. Falk, “Efficient audio deepfake detection using wavlm with early exiting,” in 2024 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2024, pp. 1–6

2024
[25]

Leveraging SSL Speech Features and Mamba for Enhanced DeepFake Detection,

H. M. Tran, D. Lolive, D. Guennec, A. Sini, A. Delhay, and P.-F. Marteau, “Leveraging SSL Speech Features and Mamba for Enhanced DeepFake Detection,” in Interspeech 2025, 2025, pp. 5323–5327

2025
[26]

Multi-level ssl feature gating for audio deepfake detection,

H. M. Tran, D. Lolive, A. Sini, A. Delhay, P.-F. Marteau, and D. Guennec, “Multi-level ssl feature gating for audio deepfake detection,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 11 766–11 775

2025
[27]

Audiolm: a language modeling approach to audio generation,

Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi et al., “Audiolm: a language modeling approach to audio generation,” IEEE/ACM transactions on audio, speech, and language processing, vol. 31, pp. 2523–2533, 2023

2023
[28]

Codecfake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems,

H. Wu, Y. Tseng, and H. yi Lee, “Codecfake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems,” in Interspeech 2024, 2024, pp. 1770–1774

2024
[29]

Codecfake: An initial dataset for detecting llm-based deepfake audio,

Y. Lu, Y. Xie, R. Fu, Z. Wen, J. Tao, Z. Wang, X. Qi, X. Liu, Y. Li, Y. Liu, X. Wang, and S. Shi, “Codecfake: An initial dataset for detecting llm-based deepfake audio,” in Interspeech 2024, 2024, pp. 1390–1394

2024
[30]

Codecfake+: A large-scale neural audio codec-based deepfake speech dataset,

X. Chen, J. Du, H. Wu, L. Zhang, I. Lin, I. Chiu, W. Ren, Y. Tseng, Y. Tsao, J.-S. R. Jang et al., “Codecfake+: A large-scale neural audio codec-based deepfake speech dataset,” arXiv preprint arXiv:2501.08238, 2025

Pith/arXiv arXiv 2025
[31]

Whiadd: Semantic-acoustic fusion for robust audio deepfake detection,

J. Cui, B. Yu, Q. Wang, F. Meng, and J. Lu, “Whiadd: Semantic-acoustic fusion for robust audio deepfake detection,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 11 610–11 618

2025
[32]

The codecfake dataset and countermeasures for the universally detection of deepfake audio,

Y. Xie, Y. Lu, R. Fu, Z. Wen, Z. Wang, J. Tao, X. Qi, X. Wang, Y. Liu, H. Cheng et al., “The codecfake dataset and countermeasures for the universally detection of deepfake audio,” IEEE Transactions on Audio, Speech and Language Processing, 2025

2025
[33]

How does our voice change as we age? a systematic review and meta-analysis of acoustic and perceptual voice data from healthy adults over 50 years of age,

S. Rojas, E. Kefalianos, and A. Vogel, “How does our voice change as we age? a systematic review and meta-analysis of acoustic and perceptual voice data from healthy adults over 50 years of age,” Journal of Speech, Language, and Hearing Research, vol. 63, no. 2, pp. 533–551, 2020

2020
[34]

The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly Emotion, Breathing & Masks,

B. W. Schuller et al., “The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly Emotion, Breathing & Masks,” in Interspeech 2020, 2020, pp. 2042–2046

2020
[35]

Is Everything Fine, Grandma? Acoustic and Linguistic Modeling for Robust Elderly Speech Emotion Recognition,

G. Soğancıoğlu, O. Verkholyak, H. Kaya, D. Fedotov, T. Cadée, A. A. Salah, and A. Karpov, “Is Everything Fine, Grandma? Acoustic and Linguistic Modeling for Robust Elderly Speech Emotion Recognition,” in Interspeech 2020, 2020, pp. 2097–2101

2020
[36]

Human voices communicating trustworthy intent: A demographically diverse speech audio dataset,

C. Maltezou-Papastylianou, R. Scherer, and S. Paulmann, “Human voices communicating trustworthy intent: A demographically diverse speech audio dataset,” Scientific Data, vol. 12, no. 1, p. 921, 2025

2025
[37]

Strong alone, stronger together: Synergizing modality-binding foundation models with optimal transport for non-verbal emotion recognition,

O. C. Phukan, M. M. Akhtar, S. R. Behera, S. Kalita, A. B. Buduru, R. Sharma, S. M. Prasanna et al., “Strong alone, stronger together: Synergizing modality-binding foundation models with optimal transport for non-verbal emotion recognition,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

2025
[38]

Heterogeneity over homogeneity: Investigating multilingual speech pre-trained models for detecting audio deepfake,

O. C. Phukan, G. S. Kashyap, A. B. Buduru, and R. Sharma, “Heterogeneity over homogeneity: Investigating multilingual speech pre-trained models for detecting audio deepfake,” arXiv preprint arXiv:2404.00809, 2024

arXiv 2024
[39]

Seniortalk: A chinese conversation dataset with rich annotations for super-aged seniors,

Y. Chen, H. Wang, S. Wang, J. Chen, J. He, J. Zhou, X. Yang, Y. Wang, Y. Lin, and Y. Qin, “Seniortalk: A chinese conversation dataset with rich annotations for super-aged seniors,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. [Online]. Available: https://openreview.net/forum?id=...

2025
[40]

High-fidelity audio compression with improved rvqgan,

R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved rvqgan,” Advances in Neural Information Processing Systems, vol. 36, 2024

2024
[41]

High fidelity neural audio compression,

A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022

Pith/arXiv arXiv 2022
[42]

Soundstream: An end-to-end neural audio codec,

N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021

2021
[43]

Speechtokenizer: Unified speech tokenizer for speech language models,

X. Zhang, D. Zhang, S. Li, Y. Zhou, and X. Qiu, “Speechtokenizer: Unified speech tokenizer for speech language models,” in The Twelfth International Conference on Learning Representations,
[44]

Available: https://openreview.net/forum?id=AF9Q8Vip84

[Online]. Available: https://openreview.net/forum?id=AF9Q8Vip84
[45]

Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec,

Z. Du, S. Zhang, K. Hu, and S. Zheng, “Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 591–595

2024
[46]

Audiodec: An open-source streaming high-fidelity neural audio codec,

Y.-C. Wu, I. D. Gebru, D. Marković, and A. Richard, “Audiodec: An open-source streaming high-fidelity neural audio codec,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023
[47]

Snac: Multi-scale neural audio codec,

H. Siuzdak, F. Grötschla, and L. A. Lanzendörfer, “Snac: Multi-scale neural audio codec,” in Audio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation, 2024

2024
[48]

Moshi: a speech-text foundation model for real-time dialogue,

A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,” arXiv preprint arXiv:2410.00037, 2024

Pith/arXiv arXiv 2024
[49]

Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment,

B. Zhu, B. Lin, M. Ning, Y. Yan, J. Cui, W. HongFa, Y. Pang, W. Jiang, J. Zhang, Z. Li, C. W. Zhang, Z. Li, W. Liu, and L. Yuan, “Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/f...

2024
[50]

Imagebind: One embedding space to bind them all,

R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, and I. Misra, “Imagebind: One embedding space to bind them all,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 15 180–15 190

2023
[51]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020

2020
[52]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022

2022
[53]

Robust speech recognition via large-scale weak supervision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” inInternational conference on machine learning. PMLR, 2023, pp. 28 492–28 518

2023
[54]

Multi-sensor data fusion based on a generalised belief divergence measure,

F. Xiao, “Multi-sensor data fusion based on a generalised belief divergence measure,”arXiv preprint arXiv:1806.01563, 2018

Pith/arXiv arXiv 2018
[55]

Multimodal generative learn- ing utilizing jensen-shannon-divergence,

T. Sutter, I. Daunhawer, and J. V ogt, “Multimodal generative learn- ing utilizing jensen-shannon-divergence,”Advances in neural in- formation processing systems, vol. 33, pp. 6100–6110, 2020

2020

[1] Introduction. In recent years, the boundary between genuine speech and speech deepfakes has become increasingly blurred. Current text-to-speech (TTS) and voice-conversion (VC) models can generate speech utterances with near human-level realism. While these capabilities support valuable applications in assistive communication, human–computer interaction, and e...

[2] Elderly Codecfake Dataset. This section first describes the sources of real elderly speech data, followed by an overview of the NACs employed in this study. All data are obtained from publicly available corpora. Finally, we detail the pipeline used to generate the ECF dataset. Real Elderly Speech Source: SeniorTalk (E1) [31]: It is a Mandarin conversational spe...

[3] and Lu et al. [21], one of the foundational CF detection works. The procedure converts real elderly speech into NAC-generated counterparts using multiple NACs that constitute the core of modern ALM systems. We start from the publicly available elderly speech datasets described above, where each original recording is treated as a real reference sample. For cr...

[4] Methodology. This section presents the FMs employed in our study, followed by the downstream modeling approaches. We then detail the proposed framework, BONSAI. 3.1. Foundation Models. The FMs considered are SOTA in their respective benchmarks. Multimodal FMs: We select LanguageBind (LB) [40] and ImageBind (IB) [41] as multimodal FMs. IB maps diverse modal...

[5] Experiments. 4.1. Training Details. We train the models by combining the training sets from SeniorTalk and TIS, while validation and testing are performed separately on the respective validation and test splits of each individual dataset. All models are trained for 20 epochs using a learning rate of 1e-3 and a batch size of 32. We employ the Adam optimize...

[6] Conclusion. In this work, we introduced the ECFD task and the ECF dataset comprising English and Chinese speech. Our analysis demonstrated that existing SOTA CF detection models trained on prior benchmark datasets generalize poorly to elderly speech, revealing a critical robustness gap. Furthermore, we showed that multimodal FMs, such as LB and IB, pro...

[7] Acknowledgement. This work was supported by the National Science and Technology Council (NSTC), Taiwan (grant no. 115-2634-F-002-012).

[8] Generative AI Use Disclosure. AI assistants were utilized exclusively to enhance grammatical accuracy, clarity, and the overall readability of the manuscript. These tools did not contribute to the development of scientific concepts, data analysis, generation of results, or interpretation of findings. The authors assume full responsibility for the accuracy ...

[9] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, “Asvspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in INTERSPEECH 2015 16th Annual Conference of the International Speech Communication Association. International Speech Communication Association, 2015, pp. 2037–2041.

[10] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, “The asvspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” 2017.

[11] X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V. Vestman, T. Kinnunen, K. A. Lee et al., “Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language, vol. 64, p. 101114, 2020.

[12] X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch et al., “Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507–2522, 2023.

[13] X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen et al., “Asvspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,” arXiv preprint arXiv:2408.08739, 2024.

[14] T. B. Patel and H. A. Patil, “Combining evidences from mel cepstral, cochlear filter cepstral and instantaneous frequency features for detection of natural vs. spoofed speech,” in Interspeech, 2015, pp. 2062–2066.

[15] T. B. Patel and H. A. Patil, “Cochlear filter and instantaneous frequency based features for spoofed speech detection,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 618–631, 2016.

[16] B. Balamurali, K. E. Lin, S. Lui, J.-M. Chen, and D. Herremans, “Toward robust audio spoofing detection: A detailed comparison of traditional and learned features,” IEEE Access, vol. 7, pp. 84229–84241, 2019.

[17] X. Tian, X. Xiao, E. S. Chng, and H. Li, “Spoofing speech detection using temporal convolutional neural network,” in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2016, pp. 1–6.

[18] A. Gomez-Alanis, A. M. Peinado, J. A. Gonzalez, and A. M. Gomez, “A light convolutional gru-rnn deep feature extractor for asv spoofing detection,” in Proc. Interspeech, vol. 2019, 2019, pp. 1068–1072.

[19] J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2022, pp. 6367–6371.

[20] J. M. Martín-Doñas and A. Álvarez, “The vicomtech audio deepfake detection system based on wav2vec2 for the 2022 add challenge,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9241–9245.

[21] H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in The Speaker and Language Recognition Workshop (Odyssey 2022). ISCA, 2022.

[22] P. Kawa, M. Plata, M. Czuba, P. Szymański, and P. Syga, “Improved DeepFake Detection Using Whisper Features,” in Interspeech 2023, 2023, pp. 4009–4013.

[23] Y. Guo, H. Huang, X. Chen, H. Zhao, and Y. Wang, “Audio deepfake detection with self-supervised wavlm and multi-fusion attentive classifier,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12702–12706.

[24] A. Pimentel, Y. Zhu, H. R. Guimarães, and T. H. Falk, “Efficient audio deepfake detection using wavlm with early exiting,” in 2024 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2024, pp. 1–6.

[25] H. M. Tran, D. Lolive, D. Guennec, A. Sini, A. Delhay, and P.-F. Marteau, “Leveraging SSL Speech Features and Mamba for Enhanced DeepFake Detection,” in Interspeech 2025, 2025, pp. 5323–5327.

[26] H. M. Tran, D. Lolive, A. Sini, A. Delhay, P.-F. Marteau, and D. Guennec, “Multi-level ssl feature gating for audio deepfake detection,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 11766–11775.

[27] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi et al., “Audiolm: a language modeling approach to audio generation,” IEEE/ACM transactions on audio, speech, and language processing, vol. 31, pp. 2523–2533, 2023.

[28] H. Wu, Y. Tseng, and H.-y. Lee, “Codecfake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems,” in Interspeech 2024, 2024, pp. 1770–1774.

[29] Y. Lu, Y. Xie, R. Fu, Z. Wen, J. Tao, Z. Wang, X. Qi, X. Liu, Y. Li, Y. Liu, X. Wang, and S. Shi, “Codecfake: An initial dataset for detecting llm-based deepfake audio,” in Interspeech 2024, 2024, pp. 1390–1394.

[30] X. Chen, J. Du, H. Wu, L. Zhang, I. Lin, I. Chiu, W. Ren, Y. Tseng, Y. Tsao, J.-S. R. Jang et al., “Codecfake+: A large-scale neural audio codec-based deepfake speech dataset,” arXiv preprint arXiv:2501.08238, 2025.

[31] J. Cui, B. Yu, Q. Wang, F. Meng, and J. Lu, “Whiadd: Semantic-acoustic fusion for robust audio deepfake detection,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 11610–11618.

[32] Y. Xie, Y. Lu, R. Fu, Z. Wen, Z. Wang, J. Tao, X. Qi, X. Wang, Y. Liu, H. Cheng et al., “The codecfake dataset and countermeasures for the universally detection of deepfake audio,” IEEE Transactions on Audio, Speech and Language Processing, 2025.

[33] S. Rojas, E. Kefalianos, and A. Vogel, “How does our voice change as we age? a systematic review and meta-analysis of acoustic and perceptual voice data from healthy adults over 50 years of age,” Journal of Speech, Language, and Hearing Research, vol. 63, no. 2, pp. 533–551, 2020.

[34] B. W. Schuller et al., “The INTERSPEECH 2020 Computational Paralinguistics Challenge: Elderly Emotion, Breathing & Masks,” in Interspeech 2020, 2020, pp. 2042–2046.

[35] G. Soğancıoğlu, O. Verkholyak, H. Kaya, D. Fedotov, T. Cadée, A. A. Salah, and A. Karpov, “Is Everything Fine, Grandma? Acoustic and Linguistic Modeling for Robust Elderly Speech Emotion Recognition,” in Interspeech 2020, 2020, pp. 2097–2101.

[36] C. Maltezou-Papastylianou, R. Scherer, and S. Paulmann, “Human voices communicating trustworthy intent: A demographically diverse speech audio dataset,” Scientific Data, vol. 12, no. 1, p. 921, 2025.

[37] O. C. Phukan, M. M. Akhtar, S. R. Behera, S. Kalita, A. B. Buduru, R. Sharma, S. M. Prasanna et al., “Strong alone, stronger together: Synergizing modality-binding foundation models with optimal transport for non-verbal emotion recognition,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.

[38] O. C. Phukan, G. S. Kashyap, A. B. Buduru, and R. Sharma, “Heterogeneity over homogeneity: Investigating multilingual speech pre-trained models for detecting audio deepfake,” arXiv preprint arXiv:2404.00809, 2024.

[39] Y. Chen, H. Wang, S. Wang, J. Chen, J. He, J. Zhou, X. Yang, Y. Wang, Y. Lin, and Y. Qin, “Seniortalk: A chinese conversation dataset with rich annotations for super-aged seniors,” in The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. [Online]. Available: https://openreview.net/forum?id=...

[40] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved rvqgan,” Advances in Neural Information Processing Systems, vol. 36, 2024.

[41] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.

[42] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.

[43] X. Zhang, D. Zhang, S. Li, Y. Zhou, and X. Qiu, “Speechtokenizer: Unified speech tokenizer for speech language models,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=AF9Q8Vip84

[45] Z. Du, S. Zhang, K. Hu, and S. Zheng, “Funcodec: A fundamental, reproducible and integrable open-source toolkit for neural speech codec,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 591–595.
