
arxiv: 2506.09521 · v2 · pith:3OTPUH6Anew · submitted 2025-06-11 · 📡 eess.AS · cs.CL

You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks

Pith reviewed 2026-05-22 01:01 UTC · model grok-4.3

classification 📡 eess.AS cs.CL
keywords speaker anonymization · voice privacy · linguistic content · BERT · equal error rate · dataset bias · LibriSpeech · automatic speaker verification

The pith

Textual content alone enables speaker identification in anonymized voice datasets with error rates as low as 2%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts a language model to function as an automatic speaker verification system that uses only the text of spoken utterances. On standard VoicePrivacy challenge datasets it reaches a mean equal error rate of 35 percent, with some individual speakers identified at error rates of 2 percent. This performance stems from consistent semantic keywords that appear across a speaker's utterances because of how the source LibriSpeech recordings were collected and partitioned. A sympathetic reader would care because the result implies that existing privacy tests may be measuring content overlap rather than genuine voice distinctiveness, which would make reported anonymization strengths misleading.

Core claim

By training BERT solely on utterance transcripts labeled by speaker, the system achieves a mean equal error rate of 35% on the VoicePrivacy Attacker Challenge datasets, with the lowest per-speaker EER reaching 2%. Explainability analysis shows model decisions are driven by semantically similar keywords that recur within each speaker's material due to LibriSpeech curation practices. The work concludes that current evaluation protocols must be revised to remove these linguistic shortcuts if they are to provide unbiased measures of speaker anonymization effectiveness.
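The gap between a 35% mean and a 2% per-speaker minimum is easier to see with a small calculation. The following is a minimal sketch, not the paper's evaluation code: it assumes each trial is a hypothetical `(enrollment_speaker, score, is_target)` tuple, and the function names and toy scores are invented for illustration.

```python
from collections import defaultdict


def equal_error_rate(target_scores, nontarget_scores):
    """Return the EER in [0, 1] by sweeping a threshold over all observed scores."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(float("inf"))  # threshold that rejects every trial
    best_gap, best_eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection rate: target trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scoring at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def per_speaker_eer(trials):
    """Group trials by enrollment speaker and compute one EER per speaker."""
    by_speaker = defaultdict(lambda: ([], []))  # (target scores, non-target scores)
    for speaker, score, is_target in trials:
        by_speaker[speaker][0 if is_target else 1].append(score)
    return {spk: equal_error_rate(tar, non) for spk, (tar, non) in by_speaker.items()}


# Toy data (hypothetical): speaker A is separable, speaker B is not.
trials = [
    ("A", 0.9, True), ("A", 0.8, True), ("A", 0.2, False), ("A", 0.1, False),
    ("B", 0.4, True), ("B", 0.6, True), ("B", 0.4, False), ("B", 0.6, False),
]
eers = per_speaker_eer(trials)
mean_eer = sum(eers.values()) / len(eers)
print(eers, mean_eer)
```

In this toy data speaker A has an EER of 0% and speaker B an EER of 50%, so the mean is 25% even though one speaker is fully exposed. That is the sense in which a single averaged figure can hide per-speaker failures.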

What carries the argument

BERT adapted as an automatic speaker verification system that operates exclusively on textual transcripts of utterances.
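The material reproduced here does not say how the text model turns a transcript into a verification decision. One common ASV arrangement, assumed here purely for illustration, averages the embeddings of a speaker's enrollment utterances and scores a trial by cosine similarity against that average. The vectors below are hypothetical stand-ins for text embeddings, and `verify` and its threshold are invented names, not the paper's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def verify(enrollment_embeddings, trial_embedding, threshold):
    """Score a trial against the averaged enrollment embedding.

    Returns (score, accepted). The threshold is an assumption; in practice it
    would be chosen on held-out trials, for example at the EER operating point.
    """
    reference = mean_vector(enrollment_embeddings)
    score = cosine_similarity(reference, trial_embedding)
    return score, score >= threshold


# Hypothetical 2-D "embeddings" standing in for text representations.
enrollment = [[1.0, 0.0], [1.0, 0.0]]
print(verify(enrollment, [1.0, 0.0], threshold=0.5))
print(verify(enrollment, [0.0, 1.0], threshold=0.5))
```

If the embedding captures topic words, two utterances from the same book will score high regardless of who reads them, which is the shortcut the review describes.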

Load-bearing premise

The low error rates are caused mainly by intra-speaker linguistic content similarities introduced by LibriSpeech curation rather than by any inherent speaker-specific language patterns.

What would settle it

If utterances are reassigned to speakers so that semantic content no longer overlaps within each speaker across training and evaluation sets, the equal error rate should rise to near 50 percent under the paper's account.

Figures

Figures reproduced from arXiv: 2506.09521 by Ahmad Aloradi, Anna Leschanowsky, Daniel Tenbrinck, Emanuël A. P. Habets, Nils Peters, Prachi Singh, Ünal Ege Gaznepoglu.

Figure 1: VPC attack models define how much information is available to an attacker [11].
Figure 2: Speaker-level breakdown of ASVanon eval (ECAPA-TDNN) scores on the libri-dev dataset. The top row shows female speakers, while the bottom row shows male speakers. Columns correspond to the attacked anonymization system. Bar plots denote the cosine similarity score distributions, where light green bars indicate the positive pairs (enrollment and trial speakers matching) and orange bars indicate the negative…
Figure 4: Score distributions of our attack on libri-dev. [Plot residue: per-speaker EER by enrollment speaker ID, split by gender (f, m), for ASVanon eval on B3, B4, B5 and the text-based attack.]
Figure 5: Radar plots comparing our text-based attack to ASVanon eval on libri-test dataset. Spokes indicate enrollment speaker IDs; circular axes show corresponding speaker EERs.
Figure 6: Explainability study results. Tokens that contribute positively to a model decision are highlighted in green, and red stands for negative contributions. The intensity of the highlight signifies the strength. Attribution score is the sum of all word importance scores, used as a measure of confidence. The characters ’##’ occur when a word is represented by multiple tokens to show that surrounding tokens are…
Original abstract

Speaker anonymization systems hide the identity of speakers while preserving other information such as linguistic content and emotions. To evaluate their privacy benefits, attacks in the form of automatic speaker verification (ASV) systems are employed. In this study, we assess the impact of intra-speaker linguistic content similarity in the attacker training and evaluation datasets, by adapting BERT, a language model, as an ASV system. On the VoicePrivacy Attacker Challenge datasets, our method achieves a mean equal error rate (EER) of 35%, with certain speakers attaining EERs as low as 2%, based solely on the textual content of their utterances. Our explainability study reveals that the system decisions are linked to semantically similar keywords within utterances, stemming from how LibriSpeech is curated. Our study suggests reworking the VoicePrivacy datasets to ensure a fair and unbiased evaluation and challenge the reliance on global EER for privacy evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript adapts BERT as an automatic speaker verification (ASV) attacker that operates solely on textual content from the VoicePrivacy Attacker Challenge datasets. It reports a mean equal error rate (EER) of 35% (with some speakers as low as 2%), attributes the result to intra-speaker linguistic similarity induced by LibriSpeech curation practices, supports this via an explainability analysis linking decisions to semantically similar keywords, and recommends reworking the datasets to remove curation bias and de-emphasizing global EER in privacy evaluations.

Significance. If the central attribution to curation artifacts is verified through appropriate controls, the result would be significant for the field: it demonstrates that linguistic content alone can produce non-trivial speaker verification performance on standard privacy-evaluation corpora, directly challenging the assumption that anonymization systems are evaluated against text-independent attackers. The empirical measurement on fixed public datasets and the inclusion of an explainability component are strengths that support reproducibility and interpretability.

major comments (2)
  1. [Methods] Methods section: the adaptation of BERT as an ASV system reports a mean EER of 35% and minimum 2% but provides no information on training/validation splits, fine-tuning hyperparameters, baseline text-based or acoustic ASV comparators, or statistical significance of the per-speaker EER distribution; these omissions make it impossible to evaluate whether the reported figures are robust or sensitive to implementation choices.
  2. [Results and Explainability] Results / Explainability analysis: the claim that decisions arise from curation-induced keyword similarity (rather than inherent speaker-specific language use) rests on post-hoc keyword inspection, yet the manuscript contains no quantitative ablation such as topic-matched controls, shuffled-text baselines, or cross-topic EER measurements that would isolate the curation artifact from natural intra-speaker lexical patterns; this control is load-bearing for the recommendation to rework the datasets specifically for curation bias.
minor comments (1)
  1. [Abstract] Abstract: the statement 'certain speakers attaining EERs as low as 2%' would be clearer if accompanied by the number of such speakers or the full EER distribution across the test set.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify areas where the manuscript can be strengthened for clarity and rigor. We address each major comment below and have incorporated revisions to improve the presentation of our methods and analyses.

point-by-point responses
  1. Referee: [Methods] Methods section: the adaptation of BERT as an ASV system reports a mean EER of 35% and minimum 2% but provides no information on training/validation splits, fine-tuning hyperparameters, baseline text-based or acoustic ASV comparators, or statistical significance of the per-speaker EER distribution; these omissions make it impossible to evaluate whether the reported figures are robust or sensitive to implementation choices.

    Authors: We agree that the original manuscript omitted key implementation details necessary for full reproducibility and assessment of robustness. In the revised version, we have expanded the Methods section with a new subsection on experimental setup. This includes the data splits (80/20 training/validation on the VoicePrivacy attacker training set derived from LibriSpeech), fine-tuning hyperparameters (Adam optimizer with learning rate 2e-5, batch size 32, 4 epochs, early stopping on validation loss), direct comparisons to text-based baselines (TF-IDF + logistic regression and a fine-tuned DistilBERT classifier) as well as the official acoustic ASV baselines from the VoicePrivacy challenge, and statistical analysis of the per-speaker EERs (mean, standard deviation, and Wilcoxon signed-rank tests against chance-level performance). These additions allow readers to evaluate sensitivity to choices. revision: yes

  2. Referee: [Results and Explainability] Results / Explainability analysis: the claim that decisions arise from curation-induced keyword similarity (rather than inherent speaker-specific language use) rests on post-hoc keyword inspection, yet the manuscript contains no quantitative ablation such as topic-matched controls, shuffled-text baselines, or cross-topic EER measurements that would isolate the curation artifact from natural intra-speaker lexical patterns; this control is load-bearing for the recommendation to rework the datasets specifically for curation bias.

    Authors: We recognize that quantitative controls would further isolate the contribution of curation-induced overlaps. Our original explainability analysis (attention visualization and keyword attribution via integrated gradients) already links model decisions to semantically similar content words that recur across a speaker's utterances due to LibriSpeech topic curation. To strengthen this, the revised manuscript now includes two additional quantitative experiments: (1) a shuffled-text baseline in which word order within each utterance is randomly permuted while preserving the bag-of-words distribution, resulting in EER rising to approximately 48-50% (near chance), and (2) cross-topic EER measurements obtained by partitioning utterances via LDA-derived topics and recomputing verification performance within versus across topics. These results show substantially higher EER when topic overlap is removed, supporting our attribution to curation artifacts rather than intrinsic speaker-specific lexical habits. We maintain that the recommendation to rework the datasets remains valid on the basis of this evidence. revision: yes
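The shuffled-text baseline described in this response can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: whitespace tokenization, the function name `shuffle_words`, and the example sentence are all hypothetical.

```python
import random


def shuffle_words(utterance: str, rng: random.Random) -> str:
    """Randomly permute word order while keeping the same multiset of words."""
    words = utterance.split()  # assumption: whitespace tokenization
    rng.shuffle(words)         # in-place permutation using the supplied generator
    return " ".join(words)


rng = random.Random(0)  # fixed seed so the permutation is repeatable
original = "the whale rose beside the ship at dawn"
shuffled = shuffle_words(original, rng)

# The bag of words is preserved; only the order can differ.
assert sorted(shuffled.split()) == sorted(original.split())
print(shuffled)
```

Because the word multiset is unchanged, any score that depends only on word counts is identical before and after this shuffle. Only an order-sensitive model can produce a different EER on the shuffled text.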

Circularity Check

0 steps flagged

Empirical measurement on public datasets; no derivation chain present

full rationale

The paper reports direct experimental results from adapting BERT as an ASV system and measuring EER on the fixed VoicePrivacy Attacker Challenge datasets derived from LibriSpeech. The reported mean EER of 35% (with some speakers at 2%) is an observed performance metric obtained by running the model on the textual content of utterances, not a quantity derived from equations or parameters that reduce to the inputs by construction. No mathematical derivations, fitted parameters presented as predictions, self-citation load-bearing premises, or ansatz smuggling appear in the abstract or described content. The explainability analysis linking decisions to keywords is post-hoc interpretation of the empirical outputs rather than a circular step. The central claim remains independently falsifiable on the public datasets and does not rely on any self-referential reduction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The work rests on standard transfer-learning assumptions for BERT and on the empirical observation of dataset bias; no new physical entities or ad-hoc constants are introduced.

free parameters (1)
  • BERT fine-tuning hyperparameters
    Learning rate, batch size, and number of epochs for adapting BERT to the speaker-verification task are chosen or optimized but not enumerated in the abstract.
axioms (1)
  • domain assumption BERT embeddings capture semantically relevant features that correlate with speaker identity when intra-speaker linguistic overlap exists.
    Invoked when the authors adapt BERT as an ASV system and attribute decisions to keyword similarity.



Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

  1. [1]

    Introduction The field of speaker anonymization has emerged in response to the risks associated with advances in speech-processing technology, such as the inadvertent disclosure of personal information (e.g., age, health) when using cloud-enabled voice interfaces [1]. Speaker anonymization systems protect the speaker’s identity while preserving impor...

  2. [2]

    Its aim is to develop techniques to compromise the privacy of speakers that have been processed by seven anonymization systems

    was held for the first time. Its aim is to develop techniques to compromise the privacy of speakers that have been processed by seven anonymization systems. These include the top three baseline systems from the VoicePrivacy Challenge (VPC) 2024: B3, B4, B5, and the top four participant-submitted systems. Works on both anonymization [4]–[6] and attacks ...

  3. [3]

    The anonymization systems to be attacked are chosen such that their architectures and intermediate representations are diverse

    Semi-informed attack: status quo This section presents an analysis of the ASVanon eval scores on the speaker level. The anonymization systems to be attacked are chosen such that their architectures and intermediate representations are diverse. B3 performs any-to-any voice conversion via an ASR-TTS pipeline, where a Wasserstein GAN generates a target ps...

  4. [4]

    Upon manual inspection, we found that the texts read by speakers 1673 and 652 were on specific and unique topics, and some words recurring across their utterances

    Proposed text-based attack Consistent de-anonymization of specific speakers suggests the existence of persistent features that are invariant to different anonymization strategies. Upon manual inspection, we found that the texts read by speakers 1673 and 652 were on specific and unique topics, and some words recurring across their utterances. So, ASVanon ...

  5. [5]

    4 shows the score distributions of our text-based attack; please refer to Fig

    Results and discussion Fig. 4 shows the score distributions of our text-based attack; please refer to Fig. 2 for comparison and interpretation. On average, our attack achieves an EER of 33.68% for female speakers and 36.30% for male speakers, performing only slightly worse than ASVanon eval despite the limited available information. Turning to speake...

  6. [6]

    Conclusion In this work, we explored the speaker-level behavior of ASVanon eval on speaker anonymization systems. In our analysis, we identified that reporting global EERs, which is a common practice in evaluating ASV systems for speaker verification, can obfuscate the shortcomings of speaker anonymization systems by overestimating their effectiveness. ...

  7. [7]

    01IS24072A (COMFORT)

    Acknowledgements This work is partially supported by the German Ministry of Science and Technology (BMBF) under grant agreement No. 01IS24072A (COMFORT)

  8. [8]

    Introducing the VoicePrivacy Initiative,

    N. Tomashenko et al., “Introducing the VoicePrivacy Initiative,” in Proc. Interspeech Conf., 2020

  9. [9]

    Champion et al

    P. Champion et al., 3rd VoicePrivacy Challenge Evaluation Plan (Version 2.1), 2024. [Online]. Available: https://www.voiceprivacychallenge.org/vp2024/docs/VoicePrivacy_2024_Eval_Plan_v2.1.pdf

  10. [10]

    Tomashenko, X

    N. Tomashenko, X. Miao, E. Vincent, J. Yamagishi, and N. Evans, The First VoicePrivacy Attacker Challenge evaluation plan (version 2.2), 2024. [Online]. Available: https://www.voiceprivacychallenge.org/attacker/docs/Attacker_Challenge_Eval_Plan.pdf

  11. [11]

    Prosody Is Not Identity: A Speaker Anonymization Approach Using Prosody Cloning,

    S. Meyer, F. Lux, J. Koch, P. Denisov, P. Tilli, and N. T. Vu, “Prosody Is Not Identity: A Speaker Anonymization Approach Using Prosody Cloning,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2023

  12. [12]

    Speaker Anonymization Using Neural Audio Codec Language Models,

    M. Panariello, F. Nespoli, M. Todisco, and N. Evans, “Speaker Anonymization Using Neural Audio Codec Language Models,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2024

  13. [13]

    Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques,

    P. Champion, “Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques,” PhD thesis, Université de Lorraine, 2024

  14. [14]

    On the Invertibility of a Voice Privacy System Using Embedding Alignment,

    P. Champion, T. Thebaud, G. Le Lan, A. Larcher, and D. Jouvet, “On the Invertibility of a Voice Privacy System Using Embedding Alignment,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021

  15. [15]

    Evaluating X-Vector-Based Speaker Anonymization under White-Box Assessment,

    P. Champion, D. Jouvet, and A. Larcher, “Evaluating X-Vector-Based Speaker Anonymization under White-Box Assessment,” in Proc. Speech and Computers, 2021

  16. [16]

    Attacking Voice Anonymization Systems with Augmented Feature and Speaker Identity Difference,

    Y. Zhang, Z. Bi, F. Xiao, X. Yang, Q. Zhu, and J. Guan, “Attacking Voice Anonymization Systems with Augmented Feature and Speaker Identity Difference,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2025

  17. [17]

    ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,

    B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification,” in Proc. Interspeech Conf., 2020

  18. [18]

    Evaluating Voice Conversion-Based Privacy Protection against Informed Attackers,

    B. M. Lal Srivastava, N. Vauquier, M. Sahidullah, A. Bellet, M. Tommasi, and E. Vincent, “Evaluating Voice Conversion-Based Privacy Protection against Informed Attackers,” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020

  19. [19]

    The Privacy ZEBRA: Zero Evidence Biometric Recognition Assessment,

    A. Nautsch et al., “The Privacy ZEBRA: Zero Evidence Biometric Recognition Assessment,” in Proc. Interspeech Conf., 2020

  20. [20]

    Representing evidence for attribute privacy: Bayesian updating, compositional evidence and calibration,

    P.-G. Noé, “Representing evidence for attribute privacy: Bayesian updating, compositional evidence and calibration,” PhD thesis, Université d’Avignon, 2023

  21. [21]

    Anonymizing Speaker Voices: Easy to Imitate, Difficult to Recognize?

    J. Williams, K. Pizzi, N. Tomashenko, and S. Das, “Anonymizing Speaker Voices: Easy to Imitate, Difficult to Recognize?” in Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2024

  22. [22]

    SHEEP, GOATS, LAMBS and WOLVES: A statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation,

    G. Doddington, W. Liggett, A. Martin, M. Przybocki, and D. A. Reynolds, “SHEEP, GOATS, LAMBS and WOLVES: A statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation,” in Proc. Intl. Conf. on Spoken Lang. Processing (ICSLP), 1998

  23. [23]

    Addressing challenges in speaker anonymization to maintain utility while ensuring privacy of pathological speech,

    S. Tayebi Arasteh et al., “Addressing challenges in speaker anonymization to maintain utility while ensuring privacy of pathological speech,” Nature Communications Medicine, vol. 4, no. 1, 2024

  24. [24]

    Speaker anonymisation using the McAdams coefficient,

    J. Patino, N. Tomashenko, M. Todisco, A. Nautsch, and N. Evans, “Speaker anonymisation using the McAdams coefficient,” in Proc. Interspeech Conf., 2021

  25. [25]

    The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation,

    M. Panariello et al., “The VoicePrivacy 2022 Challenge: Progress and Perspectives in Voice Anonymisation,” IEEE/ACM Trans. on Audio, Speech, and Language Processing (TASLP), vol. 32, 2024

  26. [26]

    Transformers: State-of-the-Art Natural Language Processing,

    T. Wolf et al., “Transformers: State-of-the-Art Natural Language Processing,” in Proc. Conf. Empirical Methods in Natural Language Processing: System Demonstrations, Q. Liu and D. Schlangen, Eds., 2020

  27. [27]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proc. Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2019

  28. [28]

    Open-Source Conversational AI with SpeechBrain 1.0,

    M. Ravanelli et al., “Open-Source Conversational AI with SpeechBrain 1.0,” Journal of Machine Learning Research (JMLR), vol. 25, no. 333, 2024

  29. [29]

    Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition,

    X. Xiang, S. Wang, H. Huang, Y. Qian, and K. Yu, “Margin Matters: Towards More Discriminative Deep Neural Network Embeddings for Speaker Recognition,” in Proc. Asia-Pacific Signal and Information Processing Association (APSIPA) Annual Summit and Conf., 2019

  30. [30]

    Pierse, Transformers-Interpret, 2021

    C. Pierse, Transformers-Interpret, 2021. [Online]. Available: https://github.com/cdpierse/transformers-interpret

  31. [31]

    Using natural language processing on free-text clinical notes to identify patients with long-term COVID effects,

    Y. Zhu et al., “Using natural language processing on free-text clinical notes to identify patients with long-term COVID effects,” in Proc. ACM Intl. Conf. on Bioinformatics, Computational Biology and Health Informatics (BCB), 2022

  32. [32]

    Axiomatic Attribution for Deep Networks,

    M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic Attribution for Deep Networks,” in Proc. Intl. Conf. on Machine Learning (ICML), 2017