DDPO-VC: Speaker De-Identification via Diffusion Denoising Policy Optimization

Cody Karjadi; James Glass; Liming Wang; Rhoda Au

arxiv: 2606.15313 · v2 · pith:RP5OFVLFnew · submitted 2026-06-13 · 📡 eess.AS · cs.SD

Liming Wang, Cody Karjadi, Rhoda Au, James Glass

Pith reviewed 2026-07-03 23:34 UTC · model grok-4.3

classification 📡 eess.AS cs.SD

keywords speaker de-identification · diffusion models · reinforcement learning · voice conversion · privacy preservation · cognitive utility · dementia speech · health signals

The pith

Reinforcement learning post-training of diffusion models enables speaker de-identification while preserving correlated cognitive utility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that disentanglement-based speaker de-identification fails when speaker identity correlates with utility variables such as cognitive health status, producing either privacy leaks or loss of downstream information. DDPO-VC addresses this by applying reinforcement learning to post-train diffusion models for voice conversion, using reward signals that combine outputs from separate privacy-focused and utility-focused teachers. The resulting models are tested on two dementia speech benchmarks, where they are reported to outperform strong baseline de-identification methods on both privacy preservation and cognitive utility metrics. A sympathetic reader would care because many practical speech applications involve precisely these correlations between identity and health signals.

Core claim

DDPO-VC performs speaker de-identification through reinforcement learning-based post-training of diffusion models for voice conversion, learning from combined reward signals supplied by privacy-focused and utility-focused teachers to manage correlations between speaker identity and cognitive status without leakage or utility loss.

What carries the argument

Diffusion denoising policy optimization (DDPO) for post-training voice conversion models with combined privacy and utility teacher rewards.
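
As a rough sketch of the machinery, the score-function form of DDPO described by Black et al. ("Training diffusion models with reinforcement learning", ICLR 2024) treats the reverse-diffusion chain as a multi-step decision process and rewards only the final sample. The linear fusion of the two teacher rewards below, and the weights \lambda_{\mathrm{priv}} and \lambda_{\mathrm{util}}, are assumptions made for illustration; the text here does not state how the paper actually combines them. The symbol c stands for whatever the model is conditioned on, such as the source utterance.

```latex
% Policy: the reverse-diffusion kernel p_\theta(x_{t-1} | x_t, c).
% Reward: given once, for the final denoised sample x_0.
\nabla_\theta J(\theta)
  = \mathbb{E}_{c,\; x_{T:0} \sim p_\theta}
    \left[ \sum_{t=1}^{T}
      \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c)\; r(x_0, c) \right]

% Assumed linear fusion of the two teacher rewards (weights are hypothetical):
r(x_0, c) = \lambda_{\mathrm{priv}}\, r_{\mathrm{priv}}(x_0, c)
          + \lambda_{\mathrm{util}}\, r_{\mathrm{util}}(x_0, c)
```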

If this is right

The approach outperforms strong de-identification baselines in privacy preservation on dementia speech benchmarks.
Cognitive utility is maintained at higher levels than baselines on the same benchmarks.
Correlations between identity and health status are handled without producing leakage or utility loss.
The framework offers a general post-training recipe for diffusion-based conversion tasks facing similar trade-offs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reward-combination strategy could extend to other modalities where identity correlates with sensitive attributes.
Tuning the relative weighting of the two teacher rewards may allow controllable privacy-utility operating points.
The method might apply to non-speech biometric data where identity and diagnostic signals are entangled.
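
To make the idea of controllable operating points concrete, here is a toy sketch. It assumes the two teacher rewards are fused by a convex combination with a single hypothetical `weight`, and it uses made-up reward values; it only shows how the preferred output shifts as the weight moves, and is not the paper's training procedure.

```python
def combined_reward(privacy, utility, weight):
    """Convex combination of two rewards (an assumed fusion rule).

    weight=1.0 scores privacy only; weight=0.0 scores utility only.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * privacy + (1.0 - weight) * utility


def best_candidate(candidates, weight):
    """Return the index of the candidate with the highest combined reward."""
    scores = [combined_reward(p, u, weight) for p, u in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical (privacy, utility) rewards for three converted utterances.
candidates = [(0.9, 0.2), (0.6, 0.6), (0.1, 0.95)]

# Sweep the weight to see which candidate each operating point prefers.
operating_points = {w: best_candidate(candidates, w) for w in (0.0, 0.5, 1.0)}
print(operating_points)
```

In an actual RL run the weight would shape the reward fed to the policy update rather than pick among fixed candidates, so each weight would correspond to a separately post-trained model.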

Load-bearing premise

Reward signals from separate privacy-focused and utility-focused teachers can be combined during RL post-training without causing private information leakage or loss of downstream utility.
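
One way to picture this premise is with two small reward functions, one per teacher. The sketch below is illustrative only: it assumes the privacy teacher yields speaker embeddings compared by cosine similarity and the utility teacher yields class probabilities for cognitive status. All function names and the choice of scoring rule are hypothetical, not taken from the paper. The final helper standardizes rewards within a batch, a common variance-reduction step in policy-gradient training.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def privacy_reward(source_emb, converted_emb):
    """Higher when the converted voice is less similar to the source speaker."""
    return 1.0 - cosine_similarity(source_emb, converted_emb)


def utility_reward(class_probs, true_label):
    """Log-probability the utility teacher assigns to the true label."""
    return math.log(class_probs[true_label])


def normalize_rewards(rewards, eps=1e-8):
    """Standardize a batch of rewards to zero mean and roughly unit variance."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]
```

If either teacher is miscalibrated, its error flows directly into the reward, which is why the premise is load-bearing.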

What would settle it

Running the method on the two dementia speech benchmarks and observing no gain over disentanglement baselines in either privacy or cognitive utility metrics would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2606.15313 by Cody Karjadi, James Glass, Liming Wang, Rhoda Au.

**Figure 1.** Model architecture of DDPO-VC.

**Figure 2.** Training loss curves for different reward models on ADReSS. DDPO-VC with the EfficientNet dementia teacher is less stable than with the Whisper teacher, suggesting that the richer linguistic representations in Whisper provide a smoother utility signal. Adding DPO also slows convergence, consistent with the fact that preference pairs are easier to define for speaker dissimilarity than for subtle …

Original abstract

A key challenge of speaker de-identification is the balance between privacy and utility. Many utility variables, such as the cognitive health status of the speaker, are correlated with the privacy variable, such as the speaker identity, violating the independence assumption held by the disentanglement-based approaches, causing leakage of private information and the loss of useful information for downstream tasks. To tackle this challenge, we propose a general framework, DDPO-VC, for speaker de-identification through reinforcement learning-based post-training with diffusion models. Learning from reward signals combining knowledge from privacy-focused and utility-focused teachers, our method outperforms various strong de-identification methods in both privacy preservation and cognitive utility on two commonly used dementia speech benchmarks. Please check out our code (https://github.com/cactuswiththoughts/DDPO-VC) and demo (https://cactuswiththoughts.github.io/SpeakerDeID-Demo/).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DDPO-VC adds RL post-training with dual privacy-utility teachers to diffusion models to handle correlated speaker identity and cognitive status in de-identification.

The paper's core move is to replace disentanglement assumptions with a dual-teacher RL setup on diffusion models. Privacy and utility signals are learned separately then combined as rewards during post-training, so the model can suppress identity while keeping cognitive markers even when the two are entangled in dementia speech.

What is new is the specific DDPO-VC pipeline that applies diffusion denoising policy optimization to this task and the explicit dual-teacher reward construction. The authors release code and a demo, which is useful for anyone who wants to check the implementation.

The work does well on motivation: the correlation problem is real in health speech data and disentanglement methods often leak or discard signal. Testing on two standard dementia benchmarks and claiming gains on both privacy and utility metrics gives a concrete target.

Soft spots are the usual ones for an empirical RL paper. The abstract gives no numbers, baselines, or ablation on how the two reward streams are balanced, so the full paper must show that the combination actually reduces leakage without utility collapse and that results are stable across seeds and datasets. If the teachers are themselves imperfect, the downstream gains could be fragile. No obvious circularity or invented metrics in the description.

This is for groups working on privacy-preserving speech for medical AI or voice conversion. Readers who need reproducible de-identification code on dementia data will find it worth looking at.

It should go to peer review. The problem is practical, the code path exists, and the central empirical claim can be checked.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes DDPO-VC, a general framework for speaker de-identification that applies reinforcement learning-based post-training to diffusion models. It combines reward signals from separate privacy-focused and utility-focused teachers to address correlations between speaker identity and cognitive status (e.g., dementia-related attributes), claiming superior performance over strong de-identification baselines in both privacy preservation and downstream cognitive utility on two dementia speech benchmarks. Code and a demo are provided for reproducibility.

Significance. If the empirical results hold, the work offers a practical alternative to disentanglement-based de-identification methods by directly optimizing a combined reward that respects attribute correlations without explicit independence assumptions. The RL post-training formulation on diffusion models and the open code repository constitute verifiable strengths that could support follow-up work in privacy-preserving speech processing for clinical applications.

minor comments (3)

The abstract states outperformance on benchmarks but does not name the specific metrics, baselines, or experimental design; adding a concise summary of these (e.g., EER for privacy, accuracy for cognitive utility) would improve readability without altering the technical content.
Notation for the combined reward (privacy + utility teachers) should be introduced with an explicit equation in the method section to clarify how the two signals are weighted or fused during policy optimization.
The paper references a GitHub repository and demo page; confirming that the released code includes the exact training configurations and evaluation scripts used for the reported benchmark numbers would strengthen reproducibility claims.
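
The first comment mentions equal error rate (EER) as an example privacy metric. For readers unfamiliar with it, below is a minimal sketch of how EER can be approximated from speaker-verification scores. The function name, the threshold sweep, and the score lists are illustrative assumptions, and whether the paper reports EER at all is not stated here. In de-identification work, a higher EER for the attacker's verifier is usually read as better privacy.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    target_scores: similarity scores for same-speaker trials.
    nontarget_scores: similarity scores for different-speaker trials.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for th in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < th for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical verification scores, for illustration only.
same_speaker = [0.9, 0.8, 0.7, 0.4]
different_speaker = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(same_speaker, different_speaker))  # 0.25
```

Production toolkits typically interpolate between thresholds; this version simply returns the midpoint of the two error rates at the threshold where they are closest.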

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary and recommendation of minor revision. No major comments were provided in the report, so we have no points to address point-by-point at this stage. We will incorporate any minor suggestions in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity detected

The paper's central contribution is an empirical RL post-training framework (DDPO-VC) that combines reward signals from separate privacy and utility teachers to optimize a diffusion model for de-identification. Claims of outperformance are grounded in benchmark results on dementia speech datasets rather than any closed derivation, self-referential definition, or fitted parameter renamed as prediction. No equations, uniqueness theorems, or self-citations are invoked in a load-bearing way within the provided text; the method is presented as a direct response to disentanglement limitations, with reproducible code supplied as an independent verification path. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the method description relies on standard diffusion and RL concepts without detailing new postulates.

pith-pipeline@v0.9.1-grok · 5727 in / 998 out tokens · 22010 ms · 2026-07-03T23:34:08.507891+00:00 · methodology


Vevo: Control- lable zero-shot voice imitation with self-supervised disentanglement,

X. Zhang, X. Zhang, K. Peng, Z. Tang, V . Manohar, Y . Liu, J. Hwang, D. Li, Y . Wang, J. Chan, Y . Huang, Z. Wu, and M. Ma, “Vevo: Control- lable zero-shot voice imitation with self-supervised disentanglement,” in ICLR, 2025

2025

[11] [11]

DiffVC+: Improving Diffusion-based V oice Conversion for Speaker Anonymization,

F. Huang, K. Zeng, and W. Zhu, “DiffVC+: Improving Diffusion-based V oice Conversion for Speaker Anonymization,” inInterspeech 2024, 2024, pp. 4453–4457

2024

[12] [12]

Diffusion model alignment using direct preference optimization,

B. Wallace, M. Dang, R. Rafailov, L. Zhou, A. Lou, S. Purushwalkam, S. Ermon, C. Xiong, S. Joty, and N. Naik, “Diffusion model alignment using direct preference optimization,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 8228–8238

2024

[13] [13]

TriAAN- VC: Triple Adaptive Attention Normalization for Any-to-Any V oice Conversion,

H. J. Park, S. W. Yang, J. S. Kim, W. Shin, and S. W. Han, “TriAAN- VC: Triple Adaptive Attention Normalization for Any-to-Any V oice Conversion,” inICASSP, 2023

2023

[14] [14]

LM-VC: Zero- shot voice conversion via speech generation based on language models,

Z. Wang, Y . Chen, L. Xie, Q. Tian, and Y . Wang, “LM-VC: Zero- shot voice conversion via speech generation based on language models,” IEEE Signal Processing Letters, vol. 30, pp. 1157–1161, 2023

2023

[15] [15]

Emo-stargan: A semi-supervised any-to-many non-parallel emotion- preserving voice conversion,

S. Ghosh, A. Das, Y . Sinha, I. Siegert, T. Polzehl, and S. Stober, “Emo-stargan: A semi-supervised any-to-many non-parallel emotion- preserving voice conversion,” inProceedings of Interspeech 2023, 2023, pp. 1498–1502. [Online]. Available: https://arxiv.org/abs/2309.07586

work page arXiv 2023

[16] [16]

Available: https://arxiv.org/abs/2403.03100

Z. Ju, Y . Wang, K. Shen, X. Tan, D. Xin, D. Yang, Y . Liu, Y . Leng, K. Song, S. Tanget al., “Naturalspeech 3: Zero-shot speech syn- thesis with factorized codec and diffusion models,”arXiv preprint arXiv:2403.03100, 2024

work page arXiv 2024

[17] [17]

Speaker anonymisation using the mcadams coefficient,

J. Patino, N. Tomashenko, M. Todisco, A. Nautsch, and N. Evans, “Speaker anonymisation using the mcadams coefficient,” inInterspeech 2021, 2021, pp. 1099–1103

2021

[18] [18]

Speaker anonymization using x-vector and neural waveform models,

F. Fang, X. Wang, J. Yamagishi, I. Echizen, M. Todisco, N. Evans, and J.-F. Bonastre, “Speaker anonymization using x-vector and neural waveform models,” inProceedings of the 10th ISCA Speech Synthesis Workshop, 2019

2019

[19] [19]

Evaluating voice conversion-based privacy protection against informed attackers,

B. M. L. Srivastava, N. Vauquier, M. Sahidullah, A. Bellet, M. Tommasi, and E. Vincent, “Evaluating voice conversion-based privacy protection against informed attackers,” inICASSP, 2020, pp. 2802–2806

2020

[20] [20]

Design choices for x-vector based speaker anonymization,

B. M. L. Srivastava, N. Tomashenko, X. Wang, E. Vincent, J. Yamagishi, M. Maouche, A. Bellet, and M. Tommasi, “Design choices for x-vector based speaker anonymization,” inInterspeech 2020, 2020, pp. 1693– 1697

2020

[21] [21]

Speaker anonymization with distribution-preserving x-vector generation for the voiceprivacy challenge 2020,

H. Turner, G. Lovisotto, and I. Martinovic, “Speaker anonymization with distribution-preserving x-vector generation for the voiceprivacy challenge 2020,” inProceedings of the V oicePrivacy 2020 Challenge Workshop, 2020, challenge system description / preprint

2020

[22] [22]

Asynchronous voice anonymization by learning from speaker-adversarial speech,

R. Wang, L. Chen, K. A. Lee, and Z.-H. Ling, “Asynchronous voice anonymization by learning from speaker-adversarial speech,”IEEE Signal Processing Letters, vol. 32, pp. 1905–1909, 2025

1905

[23] [23]

Are disentangled repre- sentations all you need to build speaker anonymization systems?

P. Champion, D. Jouvet, and A. Larcher, “Are disentangled repre- sentations all you need to build speaker anonymization systems?” in Interspeech 2022, 2022, pp. 821–825

2022

[24] [24]

Speaker anonymization using neural audio codec language models,

M. Panariello, F. Nespoli, M. Todisco, and N. Evans, “Speaker anonymization using neural audio codec language models,” inICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing, 2024, pp. 4725–4729

2024

[25] [25]

Speaker anonymization with phonetic intermediate representations,

S. Meyer, F. Lux, P. Denisov, J. Koch, P. Tilli, and N. T. Vu, “Speaker anonymization with phonetic intermediate representations,” inInter- speech 2022, 2022, pp. 1983–1987

2022

[26] [26]

EASY: Emotion-aware speaker anonymization via factorized distillation,

J. Yao, H. Liu, E. S. Chng, and L. Xie, “EASY: Emotion-aware speaker anonymization via factorized distillation,” inInterspeech 2025, 2025

2025

[27] [27]

Anonymiz- ing speech with generative adversarial networks to preserve speaker privacy,

S. Meyer, P. Tilli, P. Denisov, F. Lux, J. Koch, and N. T. Vu, “Anonymiz- ing speech with generative adversarial networks to preserve speaker privacy,” in2022 IEEE Spoken Language Technology Workshop, 2022, pp. 912–919

2022

[28] [28]

Private kNN-VC: Interpretable Anonymization of Converted Speech,

C. Franzreb, A. Das, T. Polzehl, and S. M ¨oller, “Private kNN-VC: Interpretable Anonymization of Converted Speech,” inInterspeech, 2025, pp. 3224–3228

2025

[29] [29]

Differentially private speaker anonymization,

A. S. Shamsabadi, B. M. L. Srivastava, A. Bellet, N. Vauquier, E. Vin- cent, M. Maouche, M. Tommasi, and N. Papernot, “Differentially private speaker anonymization,”Proceedings on Privacy Enhancing Technolo- gies, vol. 2023, no. 1, pp. 98–114, 2023

2023

[30] [30]

The third voiceprivacy challenge: Preserving emotional expressiveness and linguistic content in voice anonymization,

N. Tomashenko, X. Miao, P. Champion, S. Meyer, M. Panariello, X. Wang, N. Evans, E. Vincent, J. Yamagishi, and M. Todisco, “The third voiceprivacy challenge: Preserving emotional expressiveness and linguistic content in voice anonymization,” 2026

2026

[31] [31]

Why disentanglement-based speaker anonymization systems fail at preserving emotions?

¨U. E. Gaznepoglu and N. Peters, “Why disentanglement-based speaker anonymization systems fail at preserving emotions?” inICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, 2025, pp. 1–5

2025

[32] [32]

Ad- dressing challenges in speaker anonymization to maintain utility while ensuring privacy of pathological speech,

S. T. Arasteh, T. Arias-Vergara, P. A. P ´erez-Toro, T. Weise, K. Packh¨auser, M. Schuster, E. Noeth, A. Maier, S. H. Yanget al., “Ad- dressing challenges in speaker anonymization to maintain utility while ensuring privacy of pathological speech,”Communications Medicine, vol. 4, p. 182, 2024

2024

[33] [33]

Distinctive and natural speaker anonymization via singular value transformation-assisted matrix,

J. Yao, Q. Wang, P. Guo, Z. Ning, and L. Xie, “Distinctive and natural speaker anonymization via singular value transformation-assisted matrix,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2944–2956, 2024

2024

[34] [34]

Deep reinforcement learning from human preferences,

P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” inAdvances in Neural Information Processing Systems, 2017

2017

[35] [35]

Direct preference optimization: Your language model is secretly a reward model,

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” inAdvances in Neural Information Processing Systems, 2023

2023

[36] [36]

Reinforcement learning for emotional text-to-speech synthesis with improved emotion discriminability,

R. Liu, B. Sisman, and H. Li, “Reinforcement learning for emotional text-to-speech synthesis with improved emotion discriminability,” in Interspeech, 2021

2021

[37] [37]

Koel-TTS: Enhancing LLM based speech generation with preference alignment and classifier free guidance,

S. Hussain, P. Neekhara, X. Yang, E. Casanova, S. Ghosh, M. T. Desta, R. Fejgin, R. Valle, and J. Li, “Koel-TTS: Enhancing LLM based speech generation with preference alignment and classifier free guidance,”arXiv preprint arXiv:2502.05236, 2025

work page arXiv 2025

[38] [38]

Robust zero- shot text-to-speech synthesis with reverse inference optimization,

Y . Hu, C. Chen, S. Wang, E. S. Chng, and C. Zhang, “Robust zero- shot text-to-speech synthesis with reverse inference optimization,”arXiv preprint arXiv:2407.02243, 2024

work page arXiv 2024

[39] [39]

Re-ENACT: Reinforcement learn- ing for emotional speech generation using actor-critic strategy,

R. Shankar and A. Venkataraman, “Re-ENACT: Reinforcement learn- ing for emotional speech generation using actor-critic strategy,”arXiv preprint arXiv:2408.01892, 2024

work page arXiv 2024

[40] [40]

Pref- erence alignment improves language model-based tts,

J. Tian, C. Zhang, J. Shi, H. Zhang, J. Yu, S. Watanabe, and D. Yu, “Pref- erence alignment improves language model-based tts,”arXiv preprint arXiv:2409.12403, 2024

work page arXiv 2024

[41] [41]

Speechalign: Aligning speech generation to human preferences,

D. Zhang, Z. Li, S. Li, X. Zhang, P. Wang, Y . Zhou, and X. Qiu, “Speechalign: Aligning speech generation to human preferences,” in NeurIPS, 2024

2024

[42] [42]

Differentiable reward optimization for llm based tts system,

C. Gao, Z. Du, and S. Zhang, “Differentiable reward optimization for llm based tts system,” inInterspeech, 2025

2025

[43] [43]

Fine-grained preference optimization improves zero-shot text- to-speech,

J. Yao, Y . Yang, Y . Pan, Y . Feng, Z. Ning, J. Ye, H. Zhou, and L. Xie, “Fine-grained preference optimization improves zero-shot text- to-speech,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2025

2025

[44] [44]

Align-slm: Textless spoken language models with reinforcement learning from ai feedback,

G.-T. Lin, P. G. Shivakumar, A. Gourav, Y . Gu, A. Gandhe, H. yi Lee, and I. Bulyko, “Align-slm: Textless spoken language models with reinforcement learning from ai feedback,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025

2025

[45] [45]

Advancing zero-shot text-to-speech intelligibility across diverse domains via pref- erence alignment,

X. Zhang, Y . Wang, C. Wang, Z. Li, Z. Chen, and Z. Wu, “Advancing zero-shot text-to-speech intelligibility across diverse domains via pref- erence alignment,” inACL, 2025

2025

[46] [46]

MPO: Multidimensional preference optimization for language model-based text-to-speech,

K. Xia, X. Zhu, J. Yao, and L. Xie, “MPO: Multidimensional preference optimization for language model-based text-to-speech,”arXiv preprint arXiv:2509.00685, 2025

work page arXiv 2025

[47] [47]

Speechjudge: Towards human- level judgment for speech naturalness,

X. Zhang, C. Wang, H. Liao, Z. Li, Y . Wang, L. Wang, D. Jia, Y . Chen, X. Li, Z. Chen, and Z. Wu, “Speechjudge: Towards human- level judgment for speech naturalness,” inInternational Conference on Learning Representations, 2026

2026

[48] [48]

GSRM: Generative speech reward model for speech rlhf,

M. Shen, T. Jayashankar, O. Hanna, N. Kanda, Y . Wang, K. ˇZmol´ıkov´a, R. Xie, N. Moritz, A. Xu, Y . Gaur, G. Wornell, Q. He, and J. Wu, “GSRM: Generative speech reward model for speech rlhf,”arXiv preprint arXiv:2602.13891, 2026

work page arXiv 2026

[49] [49]

FlexiV oice: Enabling flexible style control in zero-shot TTS with natural language instructions,

D. Chen, X. Zhang, Y . Wang, K. Dai, L. Ma, and Z. Wu, “FlexiV oice: Enabling flexible style control in zero-shot TTS with natural language instructions,” inICLR, 2026

2026

[50] [50]

VGPO: Fine-tuning speech autoregressive diffusion models with value guided policy optimization,

Z. Liu, D. Jia, Y . A. Li, C. Du, X. Zhuang, Z. Chen, Y . Wang, Y . Wang, S. Wang, and H. Li, “VGPO: Fine-tuning speech autoregressive diffusion models with value guided policy optimization,” 2026, submitted to International Conference on Learning Representations. [Online]. Available: https://openreview.net/pdf?id=LLWIaUZvEu

2026

[51] [51]

Training diffusion models with reinforcement learning,

K. Black, M. Janner, Y . Du, I. Kostrikov, and S. Levine, “Training diffusion models with reinforcement learning,” inInternational Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=YCWjhGrJFD

2024

[52] [52]

DLPO: Diffusion model loss-guided reinforcement learning for fine-tuning text-to-speech diffusion models,

J. Chen, J.-S. Byun, M. Elsner, and A. Perrault, “DLPO: Diffusion model loss-guided reinforcement learning for fine-tuning text-to-speech diffusion models,”arXiv preprint arXiv:2405.14632, 2024

work page arXiv 2024

[53] [53]

Wavlm: Large-scale self-supervised pre-training for full stack speech processing,

S. Chen, C. Wang, Z. Chen, Y . Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y . Qian, Y . Qian, J. Wu, M. Zeng, and F. Wei, “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,”arXiv, 2021

2021

[54] [54]

Ecapa-tdnn: Em- phasized channel attention, propagation and aggregation in tdnn based speaker verification,

B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn: Em- phasized channel attention, propagation and aggregation in tdnn based speaker verification,” inInterspeech, 2020

2020

[55] [55]

Recognizing dementia from neuropsychological tests with state space models,

L. Wang, S. Bhati, C. Karjadi, R. Au, and J. Glass, “Recognizing dementia from neuropsychological tests with state space models,” in 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Honolulu, HI, USA: IEEE, 2025, pp. 1–7

2025

[56] [56]

Alzheimer’s dementia recognition through spontaneous speech: The ADReSS Challenge,

S. Luz, F. Haider, S. de la Fuente, D. Fromm, and B. MacWhinney, “Alzheimer’s dementia recognition through spontaneous speech: The ADReSS Challenge,” inInterspeech, Shanghai, China, 2020. [Online]. Available: https://arxiv.org/abs/2004.06833

work page arXiv 2020

[57] [57]

Role-specific language models for processing recorded neuropsychological exams,

T. Al Hanai, R. Au, and J. Glass, “Role-specific language models for processing recorded neuropsychological exams,” inProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, V olume 2 (Short Papers), M. Walker, H. Ji, and A. Stent, Eds. New Orleans, Louisiana: Associat...

2018

[58] [58]

Leveraging pretrained representations with task-related keywords for alzheimer’s disease detection,

J. Li, K. Song, J. Li, B. Zheng, D. Li, X. Wu, X. Liu, and H. Meng, “Leveraging pretrained representations with task-related keywords for alzheimer’s disease detection,” inArXiv, 2023. [Online]. Available: https://arxiv.org/pdf/2303.08019.pdf

work page arXiv 2023

[59] [59]

Robust speech recognition via large-scale weak super- vision,

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak super- vision,” inICML, 2023

2023

[60] [60]

Efficientnet: Rethinking model scaling for convolutional neural networks,

M. Tan and Q. V . Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” inICML, vol. 97. PMLR, 2019, pp. 6105–6114. [Online]. Available: http://proceedings.mlr.press/v97/ tan19a.html

2019

[61] [61]

Detecting dementia from long neuropsychological interviews,

N. Dawalatabad, Y . Gong, S. Khurana, R. Au, and J. Glass, “Detecting dementia from long neuropsychological interviews,” inFindings of the Association for Computational Linguistics: EMNLP 2022. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, 2022, pp. 5270–5283. [Online]. Available: https://aclanthology.org/ 2022.findings-emnlp.386/

2022

[62] [62]

The t05 system for the VoiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech,

K. Baba, W. Nakata, Y . Saito, and H. Saruwatari, “The t05 system for the VoiceMOS Challenge 2024: Transfer learning from deep image classifier to naturalness MOS prediction of high-quality synthetic speech,” inIEEE Spoken Language Technology Workshop (SLT), 2024, pp. 818–824

2024

[63] [63]

HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,

J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” inNeural Information Processing System, vol. 33, 2020, pp. 17 022–17 033

2020

[64] [64]

V oice Conversion With Just Nearest Neighbors,

M. Baas, B. van Niekerk, and H. Kamper, “V oice Conversion With Just Nearest Neighbors,” inInterspeech, 2023

2023

[65] [65]

LinearVC: Linear transformations of self-supervised features through the lens of voice conversion,

H. Kamper, B. van Niekerk, J. Za ¨ıdi, and M.-A. Carbonneau, “LinearVC: Linear transformations of self-supervised features through the lens of voice conversion,” inInterspeech, 2025

2025

[66] [66]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

C. Wang, S. Chen, Y . Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y . Liu, H. Wang, J. Li, L. He, S. Zhao, and F. Wei, “Neural codec language models are zero-shot text to speech synthesizers,”arXiv preprint arXiv:2301.02111, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023