Perceptual implications of automatic anonymization in pathological speech

Soroosh Tayebi Arasteh, Saba Afza, Tri-Thien Nguyen, Lukas Buess, Maryam Parvin, Tomas Arias-Vergara, Paula Andrea Perez-Toro, Hiu Ching Hung, Mahshad Lotfinia, Thomas Gorges, Elmar Noeth, Maria Schuster, Seung Hee Yang, Andreas Maier

arXiv: 2505.00409 · v3 · pith:H66EGH2Cnew · submitted 2025-05-01 · eess.AS · cs.AI · cs.LG

Pith reviewed 2026-05-22 17:54 UTC · model grok-4.3

classification: eess.AS · cs.AI · cs.LG

keywords: automatic anonymization · pathological speech · perceptual evaluation · clinical severity rating · speech privacy · disorder-specific effects · listener perception

The pith

Automatic anonymization changes perceived quality and detectability of pathological speech without disrupting clinical severity ratings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors test how automatic anonymization changes the way people hear and judge speech from individuals with cleft lip and palate, dysarthria, dysglossia, and dysphonia, alongside adult and child controls. Listeners try to spot the anonymized versions and rate sound quality, and a senior phoniatrician rates clinical severity under blinded conditions. Listeners detect anonymization 91 percent of the time without prior exposure, and perceived quality drops by about 30 points on a 0-100 scale, yet severity ratings stay consistent for dysarthria, dysglossia, and dysphonia. The standout result is that the computational measure of how much privacy is gained does not correspond to how obvious the anonymization is to human ears. This matters for deciding whether such processed speech can be shared for research or clinical purposes while remaining useful.

Core claim

In a study of 180 speakers across four pathology groups (cleft lip and palate, dysarthria, dysglossia, dysphonia) and two control groups (adults and children), ten listeners detected automatic anonymization with 91% zero-shot and 93% few-shot accuracy. Perceived quality fell by 30 points on a 0-100 scale, which reordered how the groups ranked on quality. Clinical severity ratings agreed closely before and after anonymization, with quadratic-weighted Cohen's kappa of 0.87 to 0.94 for dysarthria, dysglossia, and dysphonia and no recording shifting by more than one grade. Perceptual results were decoupled from the computational privacy metric: the pathology with the strongest computational anonymization was the least perceptually conspicuous.

What carries the argument

A structured human listening protocol that includes zero-shot and few-shot discrimination, quality rating, and blinded clinical severity assessment by a phoniatrician.

If this is right

Severity ratings by clinicians remain reliable for dysarthria, dysglossia, and dysphonia after anonymization.
Quality perception decreases and the relative ordering of disorder groups changes.
Native language influences how easily anonymization is detected but not the quality drop.
Listener expertise influences quality degradation but not detection rates.
Disorder-stratified evaluation is needed because perceptual and computational measures do not align.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers of anonymization systems should test with actual clinicians and patients rather than relying solely on algorithms.
Future work could explore whether adjusting anonymization intensity per disorder improves the balance between privacy and usability.
Similar evaluations in other languages or with voice disorders not tested here might show different patterns.

Load-bearing premise

The specific anonymization method used and the results from ten listeners will hold for other anonymization approaches and for larger groups of clinicians and patients.

What would settle it

Finding a different anonymization technique where perceptual conspicuousness directly corresponds to the computational privacy score, or observing clinical severity ratings that shift by more than one grade in a blinded evaluation.

Figures

Figures reproduced from arXiv: 2505.00409 by Andreas Maier, Elmar Noeth, Hiu Ching Hung, Lukas Buess, Mahshad Lotfinia, Maria Schuster, Maryam Parvin, Paula Andrea Perez-Toro, Saba Afza, Seung Hee Yang, Soroosh Tayebi Arasteh, Thomas Gorges, Tomas Arias-Vergara, Tri-Thien Nguyen.

**Figure 2.** Perceptual discrimination accuracy across pathology groups. Box plots display listener accuracy (in %) in detecting which sample is the original in anonymized–original pairs across six speaker categories: Cleft Lip and Palate (CLP) (n=30), control adults (n=30), control children (n=30), Dysarthria (n=30), Dysglossia (n=30), and Dysphonia (n=30). Results are averaged across all listeners (n=10). (a) shows t…

**Figure 3.** Subjective quality ratings for original and anonymized speech. Bar plots show average perceived speech quality (normalized to a percentage scale) across six pathology groups: Cleft Lip and Palate (CLP) (n=30), control adults (n=30), control children (n=30), Dysarthria (n=30), Dysglossia (n=30), and Dysphonia (n=30). For each category, mean ratings, averaged across all samples and all listeners, are presente…

**Figure 4.** Correlations between human perceptual results and automatic anonymization metrics. Scatter plots depict the relationships between human perceptual metrics (discrimination and quality) and automatic anonymization metrics (EER and AUC) across five groups: Cleft Lip and Palate (n=30), Dysarthria (n=30), Dysglossia (n=30), Dysphonia (n=30), and overall patient average. Panel (a) shows results averaged across a…

**Figure 5.** Correlations between intelligibility and human perceptual results. Scatter plots depict the relationships between human perceptual metrics (discrimination and quality) and intelligibility metrics across five groups: Cleft Lip and Palate (n=30), Dysarthria (n=30), Dysglossia (n=30), Dysphonia (n=30), and overall patient average. Panel (a) shows results averaged across all listeners (n=10), panel (b) for non…
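The correlation panels in Figures 4 and 5 relate one per-group perceptual score to one per-group automatic score. As a rough illustration of that kind of comparison, the sketch below computes a Pearson correlation coefficient in plain Python. The function name `pearson_r` and the sample values are hypothetical placeholders chosen for illustration; they are not taken from the paper, and the authors' actual analysis code is not shown on this page.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered cross-products and squares.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # PLACEHOLDER values only (not from the paper): one entry per group.
    hypothetical_privacy_score = [1.0, 2.0, 3.0, 4.0]
    hypothetical_detection_score = [2.0, 1.0, 4.0, 3.0]
    print(pearson_r(hypothetical_privacy_score, hypothetical_detection_score))
```

With only a handful of groups, any such coefficient is a coarse summary, so a value near zero is better read as "no visible linear relationship in this sample" than as proof of independence.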

Original abstract

Automatic anonymization is increasingly used to enable ethical sharing of clinical speech, yet its perceptual and clinical consequences remain undercharacterized. We present a human-centered evaluation of automatically anonymized pathological speech, using a structured protocol with ten native and non-native German listeners spanning clinical and signal-processing expertise. The cohort comprised 180 German speakers from CLP, Dysarthria, Dysglossia, Dysphonia, and adult and child controls. Each original recording and its automatically-anonymized counterpart was evaluated on four tasks: zero-shot Turing-style discrimination, few-shot discrimination after brief familiarization, 5-point quality rating, and 4-point blinded clinical severity rating by a senior phoniatrician. Listeners detected anonymization at 91% zero-shot and 93% few-shot accuracy, with significant variation across disorders (p=0.008) that attenuated with familiarization. Perceived quality dropped by 30 ppts on a 0-100 scale (p<0.001), reorganizing the perceived-quality hierarchy across groups. Native language modulated detectability but not quality degradation, while domain expertise modulated quality degradation but not detectability, a double dissociation between the two listener attributes; speaker sex and age produced no detectable bias. Clinical severity ratings were preserved at near-perfect agreement in Dysarthria, Dysglossia, and Dysphonia (quadratic-weighted Cohen's kappa 0.87-0.94), with no recording shifting by more than one grade. Crucially, perceptual outcomes were decoupled from the standard computational privacy metric: the pathology with the strongest computational anonymization was the least perceptually conspicuous, and vice versa. These findings argue for disorder-stratified, listener-stratified, clinician-validated evaluation as the minimum standard for licensing anonymized speech for clinical use.
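The severity result is reported as quadratic-weighted Cohen's kappa between ratings of the original and the anonymized recordings. For readers unfamiliar with the statistic, here is a minimal sketch of how it can be computed for two lists of ordinal grades. It assumes grades are coded as integers from 0 to `num_grades - 1` (so a 4-point scale uses 0 to 3); that coding, the function name, and the example grades are assumptions for illustration, not details from the paper.

```python
from collections import Counter


def quadratic_weighted_kappa(ratings_a, ratings_b, num_grades):
    """Quadratic-weighted Cohen's kappa for two lists of ordinal grades.

    Grades are assumed to be integers in 0..num_grades-1.
    """
    if num_grades < 2:
        raise ValueError("need at least two grades")
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty lists of equal length")
    if any(not 0 <= g < num_grades for g in list(ratings_a) + list(ratings_b)):
        raise ValueError("grade outside 0..num_grades-1")

    n = len(ratings_a)
    observed = Counter(zip(ratings_a, ratings_b))  # joint counts
    marginal_a = Counter(ratings_a)
    marginal_b = Counter(ratings_b)
    max_sq_dist = (num_grades - 1) ** 2

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(num_grades):
        for j in range(num_grades):
            weight = (i - j) ** 2 / max_sq_dist  # 0 on the diagonal, 1 at the extremes
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * marginal_a[i] * marginal_b[j] / (n * n)

    if expected_disagreement == 0:
        # Both lists use one identical grade throughout; kappa is undefined.
        return float("nan")
    return 1.0 - observed_disagreement / expected_disagreement


if __name__ == "__main__":
    # Hypothetical grades for six recordings (not data from the paper).
    original = [0, 1, 1, 2, 3, 2]
    anonymized = [0, 1, 2, 2, 3, 2]
    print(quadratic_weighted_kappa(original, anonymized, num_grades=4))
```

Because the weights grow with the squared grade distance, a one-grade shift is penalized far less than a three-grade shift, which is why the statistic pairs naturally with the observation that no recording moved by more than one grade.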

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Anonymized pathological speech is easy to detect and loses considerable perceived quality, yet clinical severity ratings hold up and perceptual results decouple from computational privacy metrics. The call for new stratified standards, however, rests on one anonymization method and ten listeners.

The main takeaway is that listeners spot automatically anonymized pathological speech at 91-93% accuracy across zero-shot and few-shot tasks and perceived quality falls by 30 points, yet a phoniatrician's blinded severity ratings stay stable, with kappa of 0.87-0.94 and no recording shifting by more than one grade. The reported mismatch with computational privacy scores is the freshest angle: the disorder that scored best on the computational metric was the least noticeable to listeners.

Referee Report

2 major / 2 minor

Summary. The paper evaluates perceptual effects of automatic anonymization on pathological speech from 180 German speakers across CLP, Dysarthria, Dysglossia, Dysphonia, and controls. Using 10 listeners (native/non-native, clinical/signal-processing expertise), it reports 91-93% detection accuracy (zero- and few-shot), a 30-point drop in perceived quality, preserved clinical severity ratings (kappa 0.87-0.94 in three disorders), listener-attribute double dissociations, and a decoupling between computational privacy metrics and perceptual conspicuousness, arguing that disorder-stratified, listener-stratified, clinician-validated evaluation is required as the minimum standard for clinical licensing.

Significance. If the decoupling and preservation findings hold, the work supplies concrete evidence that computational privacy metrics can misalign with human perception in disordered speech, supporting more rigorous, stratified protocols for ethical data release. The structured multi-task protocol, statistical tests, and explicit double dissociation between language and expertise effects are strengths; the near-perfect severity agreement in selected disorders is a useful clinical anchor.

major comments (2)

[Results (decoupling paragraph)] Results section on decoupling: the claim that 'the pathology with the strongest computational anonymization was the least perceptually conspicuous' and the ensuing call for disorder-stratified evaluation as minimum standard rest on a single unspecified anonymization pipeline; without cross-technique replication or explicit confirmation that the privacy metric is pathology-invariant, the generalization does not follow from the data.
[Methods (listener protocol)] Methods and listener cohort: subgroup splits for native-language and expertise effects are necessarily small (n=10 total), and no power analysis or pre-registration of exclusion rules is referenced; this limits support for the reported p=0.008 disorder variation and the double-dissociation interpretation as generalizable findings.

minor comments (2)

[Abstract] Abstract and results: the 30-point quality drop is reported on a 0-100 scale but the exact rating instrument and anchoring are not restated, making direct comparison to prior work harder.
[Introduction or Methods] The manuscript refers to 'the standard computational privacy metric' without a brief equation or citation in the main text; adding this would clarify the cross-disorder comparison.
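The second minor comment asks for a brief definition of the computational privacy metric. The Figure 4 caption names EER and AUC as the automatic anonymization metrics, so the sketch below illustrates how an equal error rate is commonly computed from speaker-verification scores. It assumes higher scores mean "same speaker", and that a higher EER after anonymization is read as stronger privacy; the function name, the threshold sweep, and the toy scores are assumptions for illustration and may differ from the authors' implementation.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER from same-speaker and different-speaker trial scores.

    Sweeps every observed score as a threshold and returns the average of
    the false-acceptance and false-rejection rates at the threshold where
    the two are closest.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")

    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best_gap = float("inf")
    eer = None
    for t in thresholds:
        # A trial is accepted when its score is >= t.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores (not from the paper): well-separated trials give a low EER.
    print(equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.75]))
```

Computing this separately per disorder group, rather than once over all patients, is what makes a cross-disorder comparison of the kind the referee describes possible.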

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript. We address each major comment below and have made revisions to strengthen the paper where appropriate.


Referee: [Results (decoupling paragraph)] Results section on decoupling: the claim that 'the pathology with the strongest computational anonymization was the least perceptually conspicuous' and the ensuing call for disorder-stratified evaluation as minimum standard rest on a single unspecified anonymization pipeline; without cross-technique replication or explicit confirmation that the privacy metric is pathology-invariant, the generalization does not follow from the data.

Authors: We appreciate this observation. The anonymization pipeline is described in detail in the Methods (Section 2.3), using a standard voice-conversion approach whose implementation details and hyperparameters are provided. The computational privacy metric follows the protocol established in prior anonymization literature. We agree that the observed decoupling is specific to the evaluated pipeline and that broader claims would benefit from multi-technique replication. We have revised the Results and Discussion to explicitly qualify the finding as pertaining to this pipeline, to note that the privacy metric's pathology-invariance was not independently verified here, and to frame the call for disorder-stratified evaluation as a minimum standard motivated by the present data while recommending cross-technique studies for future work.
Revision: partial

Referee: [Methods (listener protocol)] Methods and listener cohort: subgroup splits for native-language and expertise effects are necessarily small (n=10 total), and no power analysis or pre-registration of exclusion rules is referenced; this limits support for the reported p=0.008 disorder variation and the double-dissociation interpretation as generalizable findings.

Authors: We acknowledge the constraints of the modest listener sample (n=10). The cohort was assembled to capture the four listener attributes of interest while remaining feasible for a multi-task perceptual protocol. The reported p=0.008 reflects the omnibus test across the full listener group; the double dissociation emerges from the distinct patterns of modulation by language versus expertise. Because the study was exploratory rather than hypothesis-confirmatory, pre-registration was not undertaken. We will expand the Discussion with an explicit limitations subsection that (i) states the sample-size limitation, (ii) includes post-hoc power estimates for the key statistical tests, and (iii) calls for larger, pre-registered replications to establish generalizability.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity: direct empirical listener study


The paper reports results from a structured human evaluation protocol involving 180 speakers and 10 listeners across discrimination, quality, and clinical severity tasks. All central claims (detection rates, quality drops, disorder-specific variation, decoupling from a computational privacy metric, and the call for stratified evaluation) rest on measured listener responses and standard statistical tests rather than any derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, ansatzes, or uniqueness theorems appear; the study is self-contained against its own data collection and does not reduce any result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical human-subjects study with no mathematical derivations, fitted parameters, or postulated new entities; it rests on standard assumptions about listener reliability and statistical testing.

axioms (1)

Domain assumption: Human listeners can perform reliable zero-shot and few-shot discrimination and rating tasks on short speech samples.
The protocol treats listener judgments as valid ground truth for perceptual and clinical outcomes.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: "Listeners demonstrated consistently high discrimination accuracy... repeated-measures ANOVA... Pearson correlation coefficients... EER vs. Turing (Zero-shot) r = –0.020"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · relation: unclear

Paper passage: "Anonymization consistently reduced perceived quality... one-way ANOVA p=0.0046"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

89 extracted references · 89 canonical work pages · 2 internal anchors

[1]

Kent, R. D. Hearing and Believing: Some Limits to the Auditory-Perceptual Assessment of Speech and Voice Disorders. Am J Speech Lang Pathol 5, 7–23 (1996)

work page 1996
[3]

& Dehak, N

Pappagari, R., Cho, J., Moro-Velázquez, L. & Dehak, N. Using State of the Art Speaker Recognition and Natural Language Processing Technologies to Detect Alzheimer’s Disease and Assess its Severity. in INTERSPEECH 2020 2177–2181 (ISCA, 2020). doi:10.21437/Interspeech.2020-2587

work page doi:10.21437/interspeech.2020-2587 2020
[4]

Riedhammer, K. et al. Medical Speech Processing for Diagnosis and Monitoring: Clinical Use Cases. in Fortschritte der Akustik - DAGA 1417–1420 (Hamburg, Germany, 2023)

work page 2023
[5]

Bayerl, S. P. et al. What can Speech and Language Tell us About the Working Alliance in Psychotherapy. in Interspeech 2022 2443–2447 (ISCA, 2022). doi:10.21437/Interspeech.2022-347

work page doi:10.21437/interspeech.2022-347 2022
[6]

& Tavel, J

Strimbu, K. & Tavel, J. A. What are biomarkers?: Current Opinion in HIV and AIDS 5, 463– 466 (2010)

work page 2010
[7]

Califf, R. M. Biomarker definitions and their applications. Exp Biol Med (Maywood) 243, 213–221 (2018)

work page 2018
[8]

C., Rowe, H

Ramanarayanan, V., Lammert, A. C., Rowe, H. P., Quatieri, T. F. & Green, J. R. Speech as a Biomarker: Opportunities, Interpretability, and Challenges. Perspect ASHA SIGs 7, 276– 283 (2022)

work page 2022
[9]

L., Lutz, O

Kröger, J. L., Lutz, O. H.-M. & Raschke, P. Privacy Implications of Voice and Speech Analysis – Information Disclosure by Inference. in Privacy and Identity Management. Data for Better Living: AI and Privacy (eds. Friedewald, M., Önen, M., Lievens, E., Krenn, S. & Fricker, S.) vol. 576 242–258 (Springer International Publishing, Cham, 2020)

work page 2020
[10]

Tayebi Arasteh, S. et al. Federated Learning for Secure Development of AI Models for Parkinson’s Disease Detection Using Speech from Different Languages. in INTERSPEECH 2023 5003--5007 (Dublin, Ireland, 2023). doi:10.21437/Interspeech.2023-2108

work page doi:10.21437/interspeech.2023-2108 2023
[11]

Khamsehashari, R. et al. Voice Privacy - leveraging multi-scale blocks with ECAPA-TDNN SE-Res2NeXt extension for speaker anonymization. in 2nd Symposium on Security and Privacy in Speech Communication 43–48 (ISCA, 2022). doi:10.21437/SPSC.2022-8

work page doi:10.21437/spsc.2022-8 2022
[14]

Nautsch, A. et al. Preserving privacy in speaker and speech characterisation. Computer Speech & Language 58, 441–480 (2019)

work page 2019
[18]

Qian, J. et al. Towards Privacy-Preserving Speech Data Publishing. in IEEE INFOCOM 2018 - IEEE Conference on Computer Communications 1079–1087 (IEEE, Honolulu, HI, 2018). doi:10.1109/INFOCOM.2018.8486250

work page doi:10.1109/infocom.2018.8486250 2018
[19]

Lal Srivastava, B. M. et al. Evaluating Voice Conversion-Based Privacy Protection against Informed Attackers. in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, 32 Speech and Signal Processing (ICASSP) 2802–2806 (IEEE, Barcelona, Spain, 2020). doi:10.1109/ICASSP40776.2020.9053868

work page doi:10.1109/icassp40776.2020.9053868 2020
[20]

Ghosh, S. et al. Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-Example. in Interspeech 2024 4438–4442 (ISCA, 2024). doi:10.21437/Interspeech.2024-328

work page doi:10.21437/interspeech.2024-328 2024
[21]

Srivastava, B. M. L. et al. Design Choices for X-Vector Based Speaker Anonymization. in INTERSPEECH 2020 1713–1717 (ISCA, 2020). doi:10.21437/Interspeech.2020-2692

work page doi:10.21437/interspeech.2020-2692 2020
[22]

O., Okada, S

Mawalim, C. O., Okada, S. & Unoki, M. Speaker anonymization by pitch shifting based on time-scale modification. in 2nd Symposium on Security and Privacy in Speech Communication 35–42 (ISCA, 2022). doi:10.21437/SPSC.2022-7

work page doi:10.21437/spsc.2022-7 2022
[23]

Tayebi Arasteh, S. et al. The effect of speech pathology on automatic speaker verification: a large-scale study. Sci Rep 13, 20476 (2023)

work page 2023
[24]

Tayebi Arasteh, S. et al. Differential privacy enables fair and accurate AI-based analysis of speech disorders while protecting patient data. Preprint at https://doi.org/10.48550/arXiv.2409.19078 (2024)

work page doi:10.48550/arxiv.2409.19078 2024
[25]

Srivastava, B. M. L. et al. Privacy and Utility of X-Vector Based Speaker Anonymization. IEEE/ACM Trans. Audio Speech Lang. Process. 30, 2383–2395 (2022)

work page 2022
[26]

& Haase, M

Siegert, I., Rech, S., Bäckström, T. & Haase, M. User Perspective on Anonymity in Voice Assistants – A comparison between Germany and Finland. in Proceedings of the Workshop on Legal and Ethical Issues in Human Language Technologies @ LREC-COLING 2024 (Turin, Italy, 2024)

work page 2024
[27]

J., Foster, N

Kluin, K. J., Foster, N. L., Berent, S. & Gilman, S. Perceptual analysis of speech disorders in progressive supranuclear palsy. Neurology 43, 563–566 (1993)

work page 1993
[28]

Sachin, S. et al. Clinical speech impairment in Parkinson’s disease, progressive supranuclear palsy, and multiple system atrophy. Neurol India 56, 122–126 (2008)

work page 2008
[29]

& Laganaro, M

Pernon, M., Assal, F., Kodrasi, I. & Laganaro, M. Perceptual Classification of Motor Speech Disorders: The Role of Severity, Speech Task, and Listener’s Expertise. J Speech Lang Hear Res 65, 2727–2747 (2022)

work page 2022
[30]

Turing, A. M. COMPUTING MACHINERY AND INTELLIGENCE. Mind LIX, 433–460 (1950)

work page 1950
[31]

Maier, A. et al. PEAKS – A system for the automatic evaluation of voice and speech disorders. Speech Communication 51, 425–437 (2009)

work page 2009
[32]

& Grunwell, P

Harding, A. & Grunwell, P. Characteristics of cleft palate speech. Intl J Lang & Comm Disor 31, 331–357 (1996)

work page 1996
[33]

& Richman, L

Millard, T. & Richman, L. C. Different Cleft Conditions, Facial Appearance, and Speech: Relationship to Psychological Variables. The Cleft Palate-Craniofacial Journal 38, 68–75 (2001)

work page 2001
[34]

& Schuster, M

Maier, A., Nöth, E., Batliner, A., Nkenke, E. & Schuster, M. Fully Automatic Assessment of Speech of Children with Cleft Lip and Palate. Informatica 30, 477–482 (2006)

work page 2006
[35]

Pathophysiology of Motor Speech Disorders (Dysarthria)

Hirose, H. Pathophysiology of Motor Speech Disorders (Dysarthria). Folia Phoniatr Logop 38, 61–88 (1986)

work page 1986
[36]

& Ziegler, W

Schröter-Morasch, H. & Ziegler, W. Rehabilitation of impaired speech function (dysarthria, dysglossia). GMS Curr Top Otorhinolaryngol Head Neck Surg 4, Doc15 (2005)

work page 2005
[37]

N., Price, S., Kelly, P

Sama, A., Carding, P. N., Price, S., Kelly, P. & Wilson, J. A. The Clinical Features of Functional Dysphonia. The Laryngoscope 111, 458–463 (2001)

work page 2001
[38]

Fox, A. V. PLAKSS : Psycholinguistische Analyse kindlicher Sprechstörungen. (Swets & Zeitlinger, Frankfurt a.M, Germany, 2002)

work page 2002
[40]

McAdams, S. E. Spectral Fusion, Spectral Parsing and the Formation of Auditory Images. (Ph.D. dissertation, Stanford University, 1984). 33

work page 1984
[41]

Common European Framework of Reference for Languages

Little, D. Common European Framework of Reference for Languages. in The TESOL Encyclopedia of English Language Teaching (eds. Liontas, J. I., International Association, T. & DelliCarpini, M.) 1–7 (Wiley, 2020). doi:10.1002/9781118784235.eelt0114.pub2

work page doi:10.1002/9781118784235.eelt0114.pub2 2020
[42]

Larson, M. G. Analysis of variance. Circulation 117, 115–121 (2008)

work page 2008
[43]

Sullivan, L. M. Repeated Measures. Circulation 117, 1238–1243 (2008)

work page 2008
[44]

Muhammad, L. N. Guidelines for repeated measures statistical analysis approaches with basic science research considerations. J Clin Invest 133, e171058 (2023)

work page 2023
[45]

& Hochberg, Y

Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B: Statistical Methodology 57, 289–300 (1995)

work page 1995
[46]

Mann, H. B. & Whitney, D. R. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. Ann. Math. Statist. 18, 50–60 (1947)

work page 1947
[47]

McKnight, P. E. & Najab, J. Mann‐Whitney U Test. in The Corsini Encyclopedia of Psychology (eds. Weiner, I. B. & Craighead, W. E.) 1–1 (Wiley, 2010). doi:10.1002/9780470479216.corpsy0524

work page doi:10.1002/9780470479216.corpsy0524 2010
[48]

Shapiro, S. S. & Wilk, M. B. An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52, 591 (1965)

work page 1965
[49]

A technique for the measurement of attitudes

Likert, A. A technique for the measurement of attitudes. Archives of Psychology 22, 140 (1932)

work page 1932
[50]

& Willson, V

Ross, A. & Willson, V. L. One-Way Anova. in Basic and Advanced Statistical Tests 21–24 (SensePublishers, Rotterdam, 2017). doi:10.1007/978-94-6351-086-8_5

work page doi:10.1007/978-94-6351-086-8_5 2017
[51]

Hansen, J. H. L. & Hasan, T. Speaker Recognition by Machines and Humans: A tutorial review. IEEE Signal Process. Mag. 32, 74–99 (2015)

work page 2015
[52]

Kinnunen, T. & Li, H. An overview of text-independent speaker recognition: From features to supervectors. Speech Communication 52, 12–40 (2010)

work page 2010
[53]

& Schmidhuber, J

Hochreiter, S. & Schmidhuber, J. Long Short-Term Memory. Neural Computation 9, 1735– 1780 (1997)

work page 1997
[54]

& Khudanpur, S

Panayotov, V., Chen, G., Povey, D. & Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 5206–5210 (IEEE, South Brisbane, Queensland, Australia, 2015). doi:10.1109/ICASSP.2015.7178964

work page doi:10.1109/icassp.2015.7178964 2015
[55]

& Moreno, I

Wan, L., Wang, Q., Papir, A. & Moreno, I. L. Generalized End-to-End Loss for Speaker Verification. in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 4879–4883 (IEEE, Calgary, AB, 2018). doi:10.1109/ICASSP.2018.8462665

work page doi:10.1109/icassp.2018.8462665 2018
[56]

Kingma, D. P. & Ba, J. Adam: A Method for Stochastic Optimization. in Proceedings of the 3rd International Conference for Learning Representations (ICLR) (San Diego, CA, USA, 2015)

work page 2015
[57]

Arasteh, S. T. An Empirical Study on Text-Independent Speaker Verification based on the GE2E Method. Preprint at http://arxiv.org/abs/2011.04896 (2022)

work page arXiv 2011
[58]

& Sainath, T

Prabhavalkar, R., Alvarez, R., Parada, C., Nakkiran, P. & Sainath, T. N. Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks. in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 4704–4708 (IEEE, South Brisbane, Queensland, Australia, 2015). doi:10.1109/...

work page doi:10.1109/icassp.2015.7178863 2015
[59]

Determining the initial states in forward-backward filtering

Gustafsson, F. Determining the initial states in forward-backward filtering. IEEE Trans. Signal Process. 44, 988–992 (1996)

work page 1996
[60]

& Krejtz, I

Chlasta, K., Wołk, K. & Krejtz, I. Automated speech-based screening of depression using deep convolutional neural networks. Procedia Computer Science 164, 618–628 (2019)

work page 2019
[61]

& Othmani, A

Muzammel, M., Salam, H., Hoffmann, Y., Chetouani, M. & Othmani, A. AudVowelConsNet: A phoneme-level based deep CNN architecture for clinical depression diagnosis. Machine Learning with Applications 2, 100005 (2020). 34

work page 2020
[62]

Deep residual learning for image recognition,

He, K., Zhang, X., Ren, S. & Sun, J. Deep Residual Learning for Image Recognition. in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, Las Vegas, NV, USA, 2016). doi:10.1109/CVPR.2016.90

work page doi:10.1109/cvpr.2016.90 2016
[63]

Deng, J. et al. ImageNet: A large-scale hierarchical image database. in 2009 IEEE Conference on Computer Vision and Pattern Recognition 248–255 (IEEE, Miami, FL, 2009). doi:10.1109/CVPR.2009.5206848

work page doi:10.1109/cvpr.2009.5206848 2009
[64]

Tayebi Arasteh, S. et al. Preserving fairness and diagnostic accuracy in private large-scale AI models for medical imaging. Commun Med 4, 46 (2024)

work page 2024
[65]

& Ferris, S

Mehrabian, A. & Ferris, S. R. Inference of attitudes from nonverbal communication in two channels. Journal of Consulting Psychology 31, 248–252 (1967)

work page 1967
[66]

& Scherer, K

Bänziger, T. & Scherer, K. R. The role of intonation in emotional expressions. Speech Communication 46, 252–267 (2005)

work page 2005
[67]

& Åhlander, V

Kitzing, P., Maier, A. & Åhlander, V. L. Automatic speech recognition (ASR) and its use as a tool for assessment or therapy of voice, speech, and language disorders. Logopedics Phoniatrics Vocology 34, 91–96 (2009)

work page 2009
[68]

& Dehak, N

Moro-Velazquez, L., Villalba, J. & Dehak, N. Using X-Vectors to Automatically Detect Parkinson’s Disease from Speech. in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1155–1159 (IEEE, Barcelona, Spain, 2020). doi:10.1109/ICASSP40776.2020.9053770

work page doi:10.1109/icassp40776.2020.9053770 2020
[69]

& Sha’abani, M

Jamal, N., Shanta, S., Mahmud, F. & Sha’abani, M. Automatic speech recognition (ASR) based approach for speech therapy of aphasic patients: A review. in 020028 (Johor, Malaysia, 2017). doi:10.1063/1.5002046

work page doi:10.1063/1.5002046 2017
[70]

M., Ricketts, T

Picou, E. M., Ricketts, T. A. & Hornsby, B. W. Y. How Hearing Aids, Background Noise, and Visual Cues Influence Objective Listening Effort. Ear & Hearing 34, e52–e64 (2013).

Supplementary information

Supplementary Note 1

Anonymization method

Speech anonymization methods are broadly categorized into two classes: signal processing-based methods and dee...

Spectral conversion: The waveform is converted into Mel-spectrograms or other time-frequency representations.

Feature disentanglement: Speaker identity features are extracted and modified or replaced.

Re-synthesis: The modified features are used to synthesize a new waveform via a vocoder or neural synthesizer.

The VoicePrivacy Challenge4–7 includes several DL-based baseline systems, which are explained below.

X-vector replacement with neural source-filter synthesis

This method 8 anonymizes speech by replacing the original speaker representation with a synthet...
[74]

Tayebi Arasteh, S. et al. Addressing challenges in speaker anonymization to maintain utility while ensuring privacy of pathological speech. Commun Med 4, 182 (2024)

[75]

McAdams, S. E. Spectral Fusion, Spectral Parsing and the Formation of Auditory Images. (Ph.D. dissertation, Stanford University, 1984)

[76]

Patino, J., Tomashenko, N., Todisco, M., Nautsch, A. & Evans, N. Speaker Anonymisation Using the McAdams Coefficient. in INTERSPEECH 2021 1099–1103 (2021). doi:10.21437/Interspeech.2021-1070

[77]

Tomashenko, N. et al. The VoicePrivacy 2022 Challenge Evaluation Plan. Preprint at http://arxiv.org/abs/2203.12468 (2022)

[78]

Tomashenko, N. et al. The VoicePrivacy 2020 Challenge: Results and findings. Computer Speech & Language 74, 101362 (2022)

[79]

Tomashenko, N. et al. Introducing the VoicePrivacy Initiative. in INTERSPEECH 2020 1693–1697 (ISCA, 2020). doi:10.21437/Interspeech.2020-1333

[80]

Tomashenko, N. et al. The VoicePrivacy 2024 Challenge Evaluation Plan. Preprint at https://doi.org/10.48550/arXiv.2404.02677 (2024)

[81]

Fang, F. et al. Speaker Anonymization Using X-vector and Neural Waveform Models. in 10th ISCA Speech Synthesis Workshop (Vienna, Austria, 2019)

[82]

Snyder, D., Garcia-Romero, D., Sell, G., Povey, D. & Khudanpur, S. X-Vectors: Robust DNN Embeddings for Speaker Recognition. in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 5329–5333 (IEEE, Calgary, AB, 2018). doi:10.1109/ICASSP.2018.8461375

[83]

Peddinti, V., Povey, D. & Khudanpur, S. A time delay neural network architecture for efficient modeling of long temporal contexts. in Interspeech 2015 (ISCA, 2015). doi:10.21437/interspeech.2015-647

[84]

Povey, D. et al. Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks. in Interspeech 2018 (ISCA, 2018). doi:10.21437/interspeech.2018-1417

[85]

Kong, J., Kim, J. & Bae, J. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. in NIPS’20: Proceedings of the 34th International Conference on Neural Information Processing Systems vol. 1428, pages 17022–17033 (2020)

[86]

Meyer, S. et al. Prosody Is Not Identity: A Speaker Anonymization Approach Using Prosody Cloning. in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1–5 (IEEE, Rhodes Island, Greece, 2023). doi:10.1109/icassp49357.2023.10096607

[87]

Meyer, S. et al. Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy. in 2022 IEEE Spoken Language Technology Workshop (SLT) 912–919 (IEEE, Doha, Qatar, 2023). doi:10.1109/SLT54892.2023.10022601

