VoxATtack: A Multimodal Attack on Voice Anonymization Systems

Ahmad Aloradi; Daniel Tenbrinck; Emanu\"el A. P. Habets; \"Unal Ege Gaznepoglu

arxiv: 2507.12081 · v3 · pith:LZFQNPGMnew · submitted 2025-07-16 · 📡 eess.AS

VoxATtack: A Multimodal Attack on Voice Anonymization Systems

Ahmad Aloradi , \"Unal Ege Gaznepoglu , Emanu\"el A. P. Habets , Daniel Tenbrinck This is my paper

Pith reviewed 2026-05-22 00:06 UTC · model grok-4.3

classification 📡 eess.AS

keywords voice anonymizationde-anonymization attackmultimodal attackspeaker privacyECAPA-TDNNBERTVoicePrivacy Attacker Challenge

0 comments

The pith

Incorporating text transcriptions with acoustic features improves attacks on voice anonymization systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Voice anonymization systems try to hide who is speaking while keeping the words understandable. This paper shows that attackers can use both the anonymized audio and the text of what was said to identify speakers more accurately. The proposed VoxATtack model combines an audio processor for the speech with a text model for the transcriptions, then fuses them using confidence scores for each utterance. This approach beats previous methods on most benchmarks and reaches top performance on all when using data augmentation. A sympathetic reader would care because it reveals that current anonymization leaves semantic patterns that can be exploited, potentially undermining privacy protections in voice applications.

Core claim

The paper claims that a multimodal attack using both anonymized speech via ECAPA-TDNN and transcriptions via BERT, projected to equal embeddings and fused with per-utterance confidence weights, outperforms existing attackers on the VoicePrivacy Attacker Challenge benchmarks, achieving state-of-the-art results on all when combined with augmentation techniques like SpecAugment.

What carries the argument

The dual-branch architecture that processes anonymized speech with ECAPA-TDNN and transcriptions with pretrained BERT, projects outputs to equal dimensions, and fuses them using confidence-weighted averaging.

If this is right

Outperforms top-ranking attackers on five out of seven VPAC benchmarks.
Achieves state-of-the-art on all VPAC benchmarks with additional augmentation.
Reveals vulnerabilities in current voice anonymization methods that preserve linguistic content.
Highlights potential weaknesses in the datasets used for evaluation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future anonymization systems may need to alter or obscure linguistic content as well to resist such attacks.
Attackers with access to automatic speech recognition could generate the needed transcriptions even without ground truth.
Similar multimodal approaches might improve other privacy attacks in audio or other modalities.

Load-bearing premise

The method requires access to accurate transcriptions of the anonymized speech utterances for the text branch to work effectively.

What would settle it

An experiment showing whether the performance gains disappear when using noisy or automatically generated transcriptions instead of accurate ones would test if the textual information is the key driver.

read the original abstract

Voice anonymization systems aim to protect speaker privacy by obscuring vocal traits while preserving the linguistic content relevant for downstream applications. However, because these linguistic cues remain intact, they can be exploited to identify semantic speech patterns associated with specific speakers. In this work, we present VoxATtack, a novel multimodal de-anonymization model that incorporates both acoustic and textual information to attack anonymization systems. While previous research has focused on refining speaker representations extracted from speech, we show that incorporating textual information with a standard ECAPA-TDNN improves the attacker's performance. Our proposed VoxATtack model employs a dual-branch architecture, with an ECAPA-TDNN processing anonymized speech and a pretrained BERT encoding the transcriptions. Both outputs are projected into embeddings of equal dimensionality and then fused based on confidence weights computed on a per-utterance basis. When evaluating our approach on the VoicePrivacy Attacker Challenge (VPAC) dataset, it outperforms the top-ranking attackers on five out of seven benchmarks, namely B3, B4, B5, T8-5, and T12-5. To further boost performance, we leverage anonymized speech and SpecAugment as augmentation techniques. This enhancement enables VoxATtack to achieve state-of-the-art on all VPAC benchmarks, after scoring 20.6% and 27.2% average equal error rate on T10-2 and T25-1, respectively. Our results demonstrate that incorporating textual information and selective data augmentation reveals critical vulnerabilities in current voice anonymization methods and exposes potential weaknesses in the datasets used to evaluate them.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VoxATtack improves de-anonymization by fusing acoustic features with BERT text embeddings, but the gains rest on having clean ground-truth transcriptions rather than realistic ASR output.

read the letter

The main takeaway is that adding a BERT branch on transcriptions to a standard ECAPA-TDNN attacker, then fusing the embeddings with per-utterance confidence weights, lifts performance on the VPAC benchmarks and reaches SOTA after SpecAugment-style augmentation on the anonymized audio. This is a straightforward extension of prior single-modality work and shows that preserved linguistic content really does create an exploitable signal for speaker identification.

Referee Report

3 major / 2 minor

Summary. The paper presents VoxATtack, a multimodal de-anonymization attack on voice anonymization systems. It describes a dual-branch model with an ECAPA-TDNN branch processing anonymized speech and a pretrained BERT branch encoding transcriptions; the branches are projected to equal-dimensional embeddings and fused via per-utterance confidence weights. Evaluated on the VPAC dataset, the approach reportedly outperforms prior top attackers on five of seven benchmarks and reaches SOTA on all seven after additional augmentation with anonymized speech and SpecAugment-style transforms, with reported average EERs of 20.6% on T10-2 and 27.2% on T25-1.

Significance. If the central results hold under realistic attack conditions, the work demonstrates that linguistic content preserved by anonymizers can be exploited via multimodal fusion to improve speaker re-identification, with direct implications for the VoicePrivacy Attacker Challenge evaluation protocol and the design of future anonymization methods. The use of public VPAC benchmarks and standard pretrained models (ECAPA-TDNN, BERT) provides a reproducible baseline for the community.

major comments (3)

[Abstract / Method] Abstract and method description: the dual-branch architecture and confidence-weighted fusion explicitly require accurate transcriptions as input to the BERT encoder. The manuscript provides no indication whether these transcriptions are oracle ground-truth or ASR-generated; because a realistic attacker receives only the anonymized waveform, any ASR errors would directly degrade the textual embedding and the learned fusion weights. This assumption is load-bearing for the claimed SOTA gains over acoustic-only baselines.
[Abstract] Abstract: the final SOTA results are obtained only after applying anonymized-speech and SpecAugment augmentation, yet no details are given on whether this augmentation was chosen after observing the baseline results (raising selection-effect concerns) or on the exact training/inference protocol used for the augmented data.
[Evaluation] Evaluation section (implied by benchmark reporting): no error bars, confidence intervals, or statistical significance tests accompany the EER numbers across the seven VPAC benchmarks, making it impossible to assess whether the reported outperformance (e.g., on B3, B4, B5, T8-5, T12-5) is robust.

minor comments (2)

[Method] The fusion mechanism is described only at a high level; a precise equation or pseudocode for the per-utterance confidence weights would improve reproducibility.
[Results] Table or figure captions for the VPAC benchmark results should explicitly state whether the numbers reflect single runs or averaged results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our paper. We address each of the major comments below and indicate the revisions made to the manuscript.

read point-by-point responses

Referee: [Abstract / Method] Abstract and method description: the dual-branch architecture and confidence-weighted fusion explicitly require accurate transcriptions as input to the BERT encoder. The manuscript provides no indication whether these transcriptions are oracle ground-truth or ASR-generated; because a realistic attacker receives only the anonymized waveform, any ASR errors would directly degrade the textual embedding and the learned fusion weights. This assumption is load-bearing for the claimed SOTA gains over acoustic-only baselines.

Authors: We appreciate the referee highlighting this important clarification. In our experiments, we utilized the ground-truth transcriptions provided by the VPAC dataset to evaluate the potential benefit of incorporating linguistic information. This choice allows us to isolate the contribution of the textual modality without confounding ASR errors. We acknowledge that a fully realistic attacker would rely on ASR output. In the revised manuscript, we have updated the method section to explicitly state the use of ground-truth transcriptions and added a new subsection discussing the impact of ASR errors, including results obtained by replacing ground-truth with ASR-generated transcripts from a Whisper model. These additional experiments confirm that our multimodal approach maintains superiority over acoustic-only methods even under realistic transcription conditions. revision: yes
Referee: [Abstract] Abstract: the final SOTA results are obtained only after applying anonymized-speech and SpecAugment augmentation, yet no details are given on whether this augmentation was chosen after observing the baseline results (raising selection-effect concerns) or on the exact training/inference protocol used for the augmented data.

Authors: The augmentation techniques were selected based on established practices in speaker verification literature to enhance model robustness, prior to conducting the final experiments. To address potential concerns about selection effects, we have revised the abstract and added a dedicated paragraph in the experimental setup section detailing the augmentation protocol. This includes the proportion of augmented data used during training, the specific SpecAugment parameters, and the inference procedure which remains unchanged (no augmentation at test time). We believe this provides full transparency on the training process. revision: yes
Referee: [Evaluation] Evaluation section (implied by benchmark reporting): no error bars, confidence intervals, or statistical significance tests accompany the EER numbers across the seven VPAC benchmarks, making it impossible to assess whether the reported outperformance (e.g., on B3, B4, B5, T8-5, T12-5) is robust.

Authors: We agree that including measures of variability and statistical analysis would improve the rigor of our evaluation. In the revised manuscript, we have recomputed the results over multiple random seeds and reported mean EER with standard deviations. Additionally, we have included p-values from statistical tests comparing our method to the previous best attackers to demonstrate the significance of the improvements. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on external benchmarks

full rationale

The paper proposes a dual-branch neural architecture (ECAPA-TDNN on anonymized audio plus BERT on transcriptions, fused via per-utterance confidence weights) and reports empirical equal-error-rate results on the public VPAC dataset after standard training and SpecAugment-style augmentation. No equations, uniqueness theorems, or self-citations are invoked to derive the claimed performance; the results are obtained by fitting the model to training data and measuring on held-out benchmarks. The central claims therefore remain independent of the reported numbers and do not reduce to tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the confidence-weight computation and dual-branch projection steps appear to rely on standard neural network practices without explicit new postulates.

pith-pipeline@v0.9.0 · 5842 in / 1065 out tokens · 41389 ms · 2026-05-22T00:06:45.736014+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

dual-branch architecture, with an ECAPA-TDNN processing anonymized speech and a pretrained BERT encoding the transcriptions. Both outputs are projected into embeddings of equal dimensionality and then fused based on confidence weights computed on a per-utterance basis
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

achieve state-of-the-art on all VPAC benchmarks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

[1]

Preserving privacy in speaker and speech characteri- sation,

A. Nautsch et al. , “Preserving privacy in speaker and speech characteri- sation,” Comput. Speech & Lang. , vol. 58, pp. 441–480, 2019

work page 2019
[2]

The First V oicePrivacy Attacker Challenge,

N. Tomashenko, X. Miao, E. Vincent, and J. Yamagishi, “The First V oicePrivacy Attacker Challenge,” inProc. ICASSP, 2025, pp. 1–2

work page 2025
[4]

Introducing the V oicePrivacy Initiative,

N. Tomashenko et al. , “Introducing the V oicePrivacy Initiative,” in Proc. Interspeech, 2020, pp. 1693–1697

work page 2020
[5]

2nd V oicePrivacy Challenge Evaluation Plan,

N. Tomashenko et al. , “2nd V oicePrivacy Challenge Evaluation Plan,” 2022, arXiv:2404.02677 [eess.AS]

work page arXiv 2022
[6]

ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,

B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in Proc. Interspeech, 2020, pp. 3830–3834

work page 2020
[7]

Fast Adaptation of Pretrained Speaker Verification System for Source Speaker Tracking,

X. Lyu, Y . Wang, T. Zhao, and H. Liu, “Fast Adaptation of Pretrained Speaker Verification System for Source Speaker Tracking,” in Proc. ICASSP, 2025, pp. 1–2

work page 2025
[8]

Attacking V oice Anonymization Systems with Augmented Feature and Speaker Identity Difference,

Y . Zhang, Z. Bi, F. Xiao, X. Yang, Q. Zhu, and J. Guan, “Attacking V oice Anonymization Systems with Augmented Feature and Speaker Identity Difference,” in Proc. ICASSP, 2025, pp. 1–2

work page 2025
[9]

HLTCOE Submission to the V oicePrivacy Attacker Challenge,

H. L. Xinyuan et al., “HLTCOE Submission to the V oicePrivacy Attacker Challenge,” in Proc. ICASSP, 2025, pp. 1–2

work page 2025
[10]

SpecWav- Attack: Leveraging Spectrogram Resizing and Wav2Vec 2.0 for Attacking Anonymized Speech,

Y . Li, Y . Zheng, Z. Guo, Y . Wang, J. Yin, and H. Fei, “SpecWav- Attack: Leveraging Spectrogram Resizing and Wav2Vec 2.0 for Attacking Anonymized Speech,” in Proc. ICASSP, 2025, pp. 1–2

work page 2025
[11]

Fine-tuning TitaNet-Large Model for Speaker Anonymization Attacker Systems,

C. O. Mawalim, A. Adila, and M. Unoki, “Fine-tuning TitaNet-Large Model for Speaker Anonymization Attacker Systems,” in Proc. ICASSP, 2025, pp. 1–2

work page 2025
[12]

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,

S. Chen et al. , “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE J. Sel. Top. Signal Process. , vol. 16, no. 6, pp. 1505–1518, 2022

work page 2022
[13]

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,

D. S. Park et al. , “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” in Proc. Interspeech , 2019, pp. 2613–2617

work page 2019
[14]

V oice Conversion with Just Nearest Neighbors,

M. Baas, B. van Niekerk, and H. Kamper, “V oice Conversion with Just Nearest Neighbors,” in Proc. Interspeech, 2023, pp. 2053–2057

work page 2023
[15]

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” in Proc. NeurIPS, vol. 33, 2020, pp. 12 449–12 460

work page 2020
[16]

TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context,

N. R. Koluguri, T. Park, and B. Ginsburg, “TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context,” in Proc. ICASSP, 2022, pp. 8102–8106

work page 2022
[17]

Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and V oice Anonymization,

N. Tomashenko, E. Vincent, and M. Tommasi, “Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and V oice Anonymization,” in Proc. ICASSP, 2025, pp. 1–5

work page 2025
[18]

You Are What You Say: Exploiting Linguistic Content for V oicePrivacy Attacks,

U. E. Gaznepoglu et al. , “You Are What You Say: Exploiting Linguistic Content for V oicePrivacy Attacks,” in Proc. Interspeech, 2025

work page 2025
[19]

Librispeech: An ASR corpus based on public domain audio books,

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP , Apr. 2015, pp. 5206–5210

work page 2015
[20]

Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?

C. Aggazzotti, N. Andrews, and E. A. Smith, “Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?” Trans. Assoc. Comput. Linguist. , vol. 12, pp. 875–891, 2024

work page 2024
[21]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proc. of NAACL-HLT , 2019, pp. 4171–4186

work page 2019
[22]

Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition,

X. Xiang, S. Wang, H. Huang, Y . Qian, and K. Yu, “Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition,” in Proc. APSIPA, 2019, pp. 1652–1656

work page 2019
[23]

V oicePrivacy 2024 Challenge,

N. Tomashenko et al. , “V oicePrivacy 2024 Challenge,” https://www.voiceprivacychallenge.org/vp2024/docs/VPC-2024-.pdf, 2024, presented at the 4th Symp. on Security and Privacy in Speech Commun

work page 2024
[24]

Open-Source Conversational AI with SpeechBrain 1.0,

M. Ravanelli et al. , “Open-Source Conversational AI with SpeechBrain 1.0,” JMLR, vol. 25, no. 333, pp. 1–11, 2024

work page 2024
[25]

V oxCeleb2: Deep Speaker Recognition,

J. S. Chung, A. Nagrani, and A. Zisserman, “V oxCeleb2: Deep Speaker Recognition,” in Proc. Interspeech, 2018, pp. 1086–1090

work page 2018
[26]

Analysis of Score Normalization in Multilingual Speaker Recognition,

P. Mat ˇejka, O. Novotn ´y, O. Plchot, L. Burget, M. D. S ´anchez, and J. ˇCernock´y, “Analysis of Score Normalization in Multilingual Speaker Recognition,” in Proc. Interspeech, 2017, pp. 1567–1571

work page 2017
[27]

Decoupled Weight Decay Regularization,

I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization,” in Proc. ICLR, 2019

work page 2019
[28]

Cyclical Learning Rates for Training Neural Networks,

L. N. Smith, “Cyclical Learning Rates for Training Neural Networks,” in Proc. WACV, 2017, pp. 464–472

work page 2017

[1] [1]

Preserving privacy in speaker and speech characteri- sation,

A. Nautsch et al. , “Preserving privacy in speaker and speech characteri- sation,” Comput. Speech & Lang. , vol. 58, pp. 441–480, 2019

work page 2019

[2] [2]

The First V oicePrivacy Attacker Challenge,

N. Tomashenko, X. Miao, E. Vincent, and J. Yamagishi, “The First V oicePrivacy Attacker Challenge,” inProc. ICASSP, 2025, pp. 1–2

work page 2025

[3] [4]

Introducing the V oicePrivacy Initiative,

N. Tomashenko et al. , “Introducing the V oicePrivacy Initiative,” in Proc. Interspeech, 2020, pp. 1693–1697

work page 2020

[4] [5]

2nd V oicePrivacy Challenge Evaluation Plan,

N. Tomashenko et al. , “2nd V oicePrivacy Challenge Evaluation Plan,” 2022, arXiv:2404.02677 [eess.AS]

work page arXiv 2022

[5] [6]

ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,

B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in Proc. Interspeech, 2020, pp. 3830–3834

work page 2020

[6] [7]

Fast Adaptation of Pretrained Speaker Verification System for Source Speaker Tracking,

X. Lyu, Y . Wang, T. Zhao, and H. Liu, “Fast Adaptation of Pretrained Speaker Verification System for Source Speaker Tracking,” in Proc. ICASSP, 2025, pp. 1–2

work page 2025

[7] [8]

Attacking V oice Anonymization Systems with Augmented Feature and Speaker Identity Difference,

Y . Zhang, Z. Bi, F. Xiao, X. Yang, Q. Zhu, and J. Guan, “Attacking V oice Anonymization Systems with Augmented Feature and Speaker Identity Difference,” in Proc. ICASSP, 2025, pp. 1–2

work page 2025

[8] [9]

HLTCOE Submission to the V oicePrivacy Attacker Challenge,

H. L. Xinyuan et al., “HLTCOE Submission to the V oicePrivacy Attacker Challenge,” in Proc. ICASSP, 2025, pp. 1–2

work page 2025

[9] [10]

SpecWav- Attack: Leveraging Spectrogram Resizing and Wav2Vec 2.0 for Attacking Anonymized Speech,

Y . Li, Y . Zheng, Z. Guo, Y . Wang, J. Yin, and H. Fei, “SpecWav- Attack: Leveraging Spectrogram Resizing and Wav2Vec 2.0 for Attacking Anonymized Speech,” in Proc. ICASSP, 2025, pp. 1–2

work page 2025

[10] [11]

Fine-tuning TitaNet-Large Model for Speaker Anonymization Attacker Systems,

C. O. Mawalim, A. Adila, and M. Unoki, “Fine-tuning TitaNet-Large Model for Speaker Anonymization Attacker Systems,” in Proc. ICASSP, 2025, pp. 1–2

work page 2025

[11] [12]

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,

S. Chen et al. , “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing,” IEEE J. Sel. Top. Signal Process. , vol. 16, no. 6, pp. 1505–1518, 2022

work page 2022

[12] [13]

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,

D. S. Park et al. , “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” in Proc. Interspeech , 2019, pp. 2613–2617

work page 2019

[13] [14]

V oice Conversion with Just Nearest Neighbors,

M. Baas, B. van Niekerk, and H. Kamper, “V oice Conversion with Just Nearest Neighbors,” in Proc. Interspeech, 2023, pp. 2053–2057

work page 2023

[14] [15]

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations,” in Proc. NeurIPS, vol. 33, 2020, pp. 12 449–12 460

work page 2020

[15] [16]

TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context,

N. R. Koluguri, T. Park, and B. Ginsburg, “TitaNet: Neural Model for Speaker Representation with 1D Depth-Wise Separable Convolutions and Global Context,” in Proc. ICASSP, 2022, pp. 8102–8106

work page 2022

[16] [17]

Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and V oice Anonymization,

N. Tomashenko, E. Vincent, and M. Tommasi, “Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and V oice Anonymization,” in Proc. ICASSP, 2025, pp. 1–5

work page 2025

[17] [18]

You Are What You Say: Exploiting Linguistic Content for V oicePrivacy Attacks,

U. E. Gaznepoglu et al. , “You Are What You Say: Exploiting Linguistic Content for V oicePrivacy Attacks,” in Proc. Interspeech, 2025

work page 2025

[18] [19]

Librispeech: An ASR corpus based on public domain audio books,

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP , Apr. 2015, pp. 5206–5210

work page 2015

[19] [20]

Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?

C. Aggazzotti, N. Andrews, and E. A. Smith, “Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts?” Trans. Assoc. Comput. Linguist. , vol. 12, pp. 875–891, 2024

work page 2024

[20] [21]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proc. of NAACL-HLT , 2019, pp. 4171–4186

work page 2019

[21] [22]

Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition,

X. Xiang, S. Wang, H. Huang, Y . Qian, and K. Yu, “Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition,” in Proc. APSIPA, 2019, pp. 1652–1656

work page 2019

[22] [23]

V oicePrivacy 2024 Challenge,

N. Tomashenko et al. , “V oicePrivacy 2024 Challenge,” https://www.voiceprivacychallenge.org/vp2024/docs/VPC-2024-.pdf, 2024, presented at the 4th Symp. on Security and Privacy in Speech Commun

work page 2024

[23] [24]

Open-Source Conversational AI with SpeechBrain 1.0,

M. Ravanelli et al. , “Open-Source Conversational AI with SpeechBrain 1.0,” JMLR, vol. 25, no. 333, pp. 1–11, 2024

work page 2024

[24] [25]

V oxCeleb2: Deep Speaker Recognition,

J. S. Chung, A. Nagrani, and A. Zisserman, “V oxCeleb2: Deep Speaker Recognition,” in Proc. Interspeech, 2018, pp. 1086–1090

work page 2018

[25] [26]

Analysis of Score Normalization in Multilingual Speaker Recognition,

P. Mat ˇejka, O. Novotn ´y, O. Plchot, L. Burget, M. D. S ´anchez, and J. ˇCernock´y, “Analysis of Score Normalization in Multilingual Speaker Recognition,” in Proc. Interspeech, 2017, pp. 1567–1571

work page 2017

[26] [27]

Decoupled Weight Decay Regularization,

I. Loshchilov and F. Hutter, “Decoupled Weight Decay Regularization,” in Proc. ICLR, 2019

work page 2019

[27] [28]

Cyclical Learning Rates for Training Neural Networks,

L. N. Smith, “Cyclical Learning Rates for Training Neural Networks,” in Proc. WACV, 2017, pp. 464–472

work page 2017