DDSP-QbE++: Improving Speech Quality for Speech Anonymisation for Atypical Speech

Sebastian Stober; Suhita Ghosh; Yamini Sinha

arxiv: 2604.09246 · v1 · submitted 2026-04-10 · 💻 cs.SD · cs.AI

Pith reviewed 2026-05-10 16:32 UTC · model grok-4.3

keywords DDSP · speech anonymisation · voice conversion · PolyBLEP · voicing detection · subtractive synthesis · atypical speech · MOS

The pith

Two changes to the DDSP excitation stage replace abrupt phase wraps with smooth corrections and suppress harmonics in unvoiced regions to cut aliasing artefacts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the excitation generator inside a differentiable subtractive synthesizer used for voice conversion and anonymisation. A voicing detector now turns off the periodic component and substitutes filtered noise wherever the signal is unvoiced, preventing aliased harmonics from appearing in those segments. At the same time, a polynomial correction smooths the hard discontinuity that occurs each time the phase accumulator wraps around, removing the high-frequency components that cause buzz without requiring oversampling. These steps produce a cleaner spectral roll-off and higher mean opinion scores while adding no trainable parameters to the pipeline. The improvements matter for anonymising atypical speech, where listeners are especially sensitive to unnatural artefacts that can make the output sound robotic or distorted.

Core claim

In DDSP-QbE subtractive synthesis the phase-accumulated sawtooth excitation creates abrupt discontinuities that generate aliasing, perceived as buzziness and spectral distortion especially at higher fundamental frequencies. Adding explicit voicing detection to gate the harmonic excitation and replace it with filtered noise in unvoiced frames, together with Polynomial Band-Limited Step correction at each phase wrap, removes the alias-generating components. The resulting excitation yields a cleaner harmonic roll-off, reduced high-frequency artefacts, and measurably higher perceptual naturalness without extra learnable parameters or loss of differentiability.
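To make the PolyBLEP step concrete, here is a minimal NumPy sketch of a phase-accumulated sawtooth with and without the widely used two-sample polynomial residual. It is an illustration under stated assumptions, not the paper's implementation: the function names `polyblep` and `sawtooth` are hypothetical, the phase is normalised to [0, 1), and the phase increment is assumed to stay below 0.5 (f0 below the Nyquist frequency).

```python
import numpy as np

def polyblep(t, dt):
    """Two-sample polynomial residual around each phase wrap.

    t  : normalised phase in [0, 1), one value per sample
    dt : phase increment per sample (f0 / fs), assumed < 0.5
    """
    out = np.zeros_like(t)
    # Sample just after the wrap: 0 <= t < dt
    after = t < dt
    x = t[after] / dt[after]
    out[after] = x + x - x * x - 1.0
    # Sample just before the wrap: 1 - dt < t < 1
    before = t > 1.0 - dt
    x = (t[before] - 1.0) / dt[before]
    out[before] = x * x + x + x + 1.0
    return out

def sawtooth(f0, fs, correct=True):
    """Sawtooth in [-1, 1] from per-sample f0 (Hz) via phase accumulation."""
    dt = np.asarray(f0, dtype=float) / fs
    phase = np.cumsum(dt) % 1.0          # hard wrap once per period
    naive = 2.0 * phase - 1.0            # discontinuous at every wrap
    return naive - polyblep(phase, dt) if correct else naive
```

Only the two samples adjacent to each wrap are modified, and the correction uses nothing but arithmetic on the phase, so it introduces no learnable parameters. With `f0` fixed at 1234.5 Hz and `fs = 16000`, the 13th harmonic (16048.5 Hz) folds to 48.5 Hz in the naive waveform; comparing the spectra of the two outputs around that frequency is one way to see the attenuation.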

What carries the argument

The modified excitation stage that pairs voicing-gated harmonic-plus-noise generation with PolyBLEP (Polynomial Band-Limited Step) correction of the phase-accumulated oscillator.
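The gating half of that pairing can be sketched as a mask that selects between the periodic source and filtered noise. This is a minimal illustration under assumptions: `gated_excitation`, `noise_gain`, and the three-tap smoothing kernel are hypothetical placeholders, and the paper's actual noise filter and mixing rule are not given in this summary.

```python
import numpy as np

def gated_excitation(harmonic, voiced, noise_gain=0.1, kernel=None, seed=0):
    """Mix a periodic source with filtered noise under a per-sample voicing mask.

    harmonic : periodic excitation, shape (n,)
    voiced   : mask in [0, 1] (or boolean), shape (n,); 1 means voiced
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(harmonic.shape[0])
    if kernel is None:
        # Placeholder low-pass FIR; the paper's noise filter is not specified here.
        kernel = np.array([0.25, 0.5, 0.25])
    noise = np.convolve(noise, kernel, mode="same")
    v = np.asarray(voiced, dtype=float)
    # Voiced samples keep the harmonic source; unvoiced samples get noise only.
    return v * harmonic + (1.0 - v) * noise_gain * noise
```

A frame-level voicing decision can be expanded to sample rate with `np.repeat(frame_flags, hop)` before calling the function. The output is differentiable with respect to the harmonic signal wherever the mask is non-zero, and the gate itself adds no trainable weights.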

If this is right

The modified synthesizer integrates into the existing DDSP-QbE training pipeline with no added parameters.
High-frequency artefacts decrease because abrupt waveform discontinuities are replaced by smooth polynomial residuals.
Unvoiced regions avoid aliased harmonic content that would otherwise appear as buzz.
Mean opinion scores rise because the harmonic roll-off becomes cleaner across the spectrum.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same excitation fixes could be dropped into other DDSP-based audio generators that rely on phase accumulation.
For atypical speech, where voicing patterns are often irregular, the explicit voicing gate may reduce distortion more than it does for typical voices.
The parameter-free nature makes the method suitable for lightweight, on-device anonymisation systems.

Load-bearing premise

The reported gains in sound quality are produced by the voicing detector and PolyBLEP correction rather than by other unstated differences in training data, model capacity, or listening conditions.

What would settle it

A direct A/B listening test in which the same set of atypical-speech utterances is anonymised once with the original DDSP-QbE excitation and once with the two proposed changes, with listeners rating naturalness while all other pipeline elements remain identical.
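One way such a paired comparison could be analysed is sketched below. It assumes each listener rates the same utterances under both systems, and it reports the mean per-utterance MOS difference with a bootstrap confidence interval. The function name, the array layout, and the choice of bootstrap are assumptions for illustration; the paper's evaluation protocol is not described in this summary.

```python
import numpy as np

def paired_mos_difference(baseline, proposed, n_boot=2000, seed=0):
    """Mean paired MOS difference with a 95% bootstrap interval.

    baseline, proposed : ratings of shape (listeners, utterances),
                         same listeners and utterances in both arrays
    """
    baseline = np.asarray(baseline, dtype=float)
    proposed = np.asarray(proposed, dtype=float)
    # Average over listeners, then pair by utterance so that utterance
    # difficulty cancels out of the comparison.
    diff = (proposed - baseline).mean(axis=0)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    boot = diff[idx].mean(axis=1)        # resample utterances with replacement
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return diff.mean(), (lo, hi)
```

An interval that excludes zero would support attributing the difference to the excitation changes, provided everything else in the pipeline was held fixed as the test requires.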

Figures

Figures reproduced from arXiv: 2604.09246 by Sebastian Stober, Suhita Ghosh, Yamini Sinha.

**Figure 1.** Overview of the DDSP-QbE++ pipeline. The system comprises three stages: (1) Mapping – source features are extracted

Original abstract

Differentiable Digital Signal Processing (DDSP) pipelines for voice conversion rely on subtractive synthesis, where a periodic excitation signal is shaped by a learned spectral envelope to reconstruct the target voice. In DDSP-QbE, the excitation is generated via phase accumulation, producing a sawtooth-like waveform whose abrupt discontinuities introduce aliasing artefacts that manifest perceptually as buzziness and spectral distortion, particularly at higher fundamental frequencies. We propose two targeted improvements to the excitation stage of the DDSP-QbE subtractive synthesizer. First, we incorporate explicit voicing detection to gate the harmonic excitation, suppressing the periodic component in unvoiced regions and replacing it with filtered noise, thereby avoiding aliased harmonic content where it is most perceptually disruptive. Second, we apply Polynomial Band-Limited Step (PolyBLEP) correction to the phase-accumulated oscillator, substituting the hard waveform discontinuity at each phase wrap with a smooth polynomial residual that cancels alias-generating components without oversampling or spectral truncation. Together, these modifications yield a cleaner harmonic roll-off, reduced high-frequency artefacts, and improved perceptual naturalness, as measured by MOS. The proposed approach is lightweight, differentiable, and integrates seamlessly into the existing DDSP-QbE training pipeline with no additional learnable parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds voicing gating and PolyBLEP to the DDSP-QbE excitation stage to cut aliasing in atypical-speech anonymization, but the causal link to the claimed MOS gains still needs isolation.

The two changes here, voicing-gated noise for unvoiced regions and PolyBLEP correction on the phase accumulator, address aliasing in the subtractive synthesis part of DDSP-QbE. They target the sawtooth discontinuities that cause buzz at higher pitches, which is especially noticeable in atypical speech anonymization.

What the paper does well is keep the fix lightweight. Both additions require no extra learnable parameters and integrate directly into the training pipeline. This is useful because anonymization systems often need to run efficiently and preserve the base model's behavior while improving output quality.

The main concern is whether the reported perceptual improvements are really from these two modifications. The abstract mentions higher MOS and cleaner harmonics, but there are no ablations shown that isolate the voicing gate and the PolyBLEP from other possible differences in training or evaluation. Also, voicing detectors trained on typical speech can misfire on atypical patterns, potentially creating gating artifacts or leaving aliasing in place. That assumption needs checking in the full experiments.

This kind of work is for people in the speech processing community who build or adapt DDSP models for voice conversion tasks. A reader looking for incremental DSP enhancements rather than new architectures will find it relevant. It is not groundbreaking, but it tackles a concrete quality issue in a niche area.

The thinking is clear on the signal processing side and the citation pattern seems appropriate for the DDSP literature. It deserves a serious referee because the changes are reproducible and the problem is practical. I would recommend sending this to peer review. Reviewers can ask for the missing ablations and domain-specific tests on atypical speech to strengthen the causal claims.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes DDSP-QbE++, an enhancement to the DDSP-QbE subtractive synthesizer for speech anonymization of atypical speech. It adds two modifications to the excitation stage: (1) explicit voicing detection that gates the periodic harmonic excitation and substitutes filtered noise in unvoiced regions, and (2) Polynomial Band-Limited Step (PolyBLEP) correction applied to the phase-accumulated oscillator to smooth waveform discontinuities. These changes are described as lightweight, differentiable, and parameter-free; together they are claimed to produce cleaner harmonic roll-off, fewer high-frequency artefacts, and higher MOS scores for perceptual naturalness.

Significance. If the claimed perceptual gains can be isolated and replicated, the work supplies a practical, zero-parameter engineering improvement to DDSP-based voice conversion that directly targets aliasing artefacts known to be especially disruptive for atypical speech. The approach preserves the core DDSP advantages of differentiability and seamless pipeline integration while addressing a concrete limitation in the synthesis stage. This is a modest but useful incremental contribution rather than a foundational advance.

major comments (3)

[Abstract] The central claim that the two modifications 'yield a cleaner harmonic roll-off, reduced high-frequency artefacts, and improved perceptual naturalness, as measured by MOS' is asserted without experimental details, baselines, statistical tests, or ablation results. This prevents verification of the causal link between the excitation-stage changes and the reported gains.
[§4, Experiments] No controlled experiments or ablations are described that isolate the voicing-gated excitation and PolyBLEP correction while holding training data, model capacity, optimizer, and evaluation conditions fixed. Without such isolation, the MOS improvements cannot be attributed to the proposed changes rather than confounding factors.
[§3.2, Voicing detection] The manuscript provides no validation of the voicing detector's reliability or stability on atypical speech patterns (e.g., irregular phonation or breathy segments). Standard detectors frequently fail in this domain, which would either re-introduce aliasing or create unnatural gating artefacts and thereby undermine applicability to the target use case.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments identify important areas where additional clarity and experimental rigor will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Abstract] The central claim that the two modifications 'yield a cleaner harmonic roll-off, reduced high-frequency artefacts, and improved perceptual naturalness, as measured by MOS' is asserted without experimental details, baselines, statistical tests, or ablation results. This prevents verification of the causal link between the excitation-stage changes and the reported gains.

Authors: The abstract is a concise summary of the proposed changes and their claimed benefits. Supporting details, including the MOS evaluation protocol and comparison to the DDSP-QbE baseline, appear in Section 4. To make the causal connection explicit at the summary level, we will revise the abstract to reference the evaluation methodology and the specific perceptual improvements observed. (revision: yes)

Referee: [§4, Experiments] No controlled experiments or ablations are described that isolate the voicing-gated excitation and PolyBLEP correction while holding training data, model capacity, optimizer, and evaluation conditions fixed. Without such isolation, the MOS improvements cannot be attributed to the proposed changes rather than confounding factors.

Authors: The current experiments compare the full DDSP-QbE++ system against the original DDSP-QbE. We acknowledge that separate ablations isolating each modification were not reported. In the revised manuscript we will add controlled ablation studies: variants using only voicing-gated noise, only PolyBLEP correction, and the combined system, all trained and evaluated under identical conditions, to quantify the individual contributions to harmonic roll-off and MOS scores. (revision: yes)

Referee: [§3.2, Voicing detection] The manuscript provides no validation of the voicing detector's reliability or stability on atypical speech patterns (e.g., irregular phonation or breathy segments). Standard detectors frequently fail in this domain, which would either re-introduce aliasing or create unnatural gating artefacts and thereby undermine applicability to the target use case.

Authors: The voicing detector follows a standard energy- and periodicity-based approach integrated into the DDSP pipeline; its impact is assessed indirectly via overall MOS results. We agree that explicit validation on atypical speech is necessary. We will add to the revision a quantitative assessment of the detector on the atypical speech corpus (e.g., agreement with manual annotations or comparison to alternative detectors) together with a discussion of observed failure cases and their perceptual consequences. (revision: yes)
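For readers unfamiliar with the kind of detector the last response refers to, the following is a generic sketch of an energy- and periodicity-based voicing decision. It is not the authors' detector: the function name, frame sizes, pitch range, and both thresholds are assumptions chosen for illustration, and periodicity is measured here as the peak of the normalised autocorrelation within a plausible pitch-lag range.

```python
import numpy as np

def detect_voicing(x, fs, frame=1024, hop=256, fmin=60.0, fmax=400.0,
                   energy_thresh=1e-3, periodicity_thresh=0.5):
    """Return one boolean per frame: True where the frame looks voiced."""
    lag_min = int(fs / fmax)             # shortest pitch period in samples
    lag_max = int(fs / fmin)             # longest pitch period in samples
    flags = []
    for start in range(0, len(x) - frame + 1, hop):
        seg = x[start:start + frame]
        seg = seg - seg.mean()
        # Energy test: silent or near-silent frames are unvoiced.
        if np.mean(seg ** 2) < energy_thresh:
            flags.append(False)
            continue
        # Periodicity test: normalised autocorrelation peak in the pitch range.
        ac = np.correlate(seg, seg, mode="full")[frame - 1:]
        ac = ac / ac[0]
        peak = ac[lag_min:lag_max + 1].max()
        flags.append(bool(peak > periodicity_thresh))
    return np.array(flags)
```

Fixed thresholds like these are exactly what the referee's concern targets: breathy or irregular phonation lowers the autocorrelation peak, so a frame that is voiced can fall below `periodicity_thresh`. Comparing the per-frame flags against manual annotations, as the authors propose, would show how often that happens.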

Circularity Check

0 steps flagged

No circularity: modifications are direct parameter-free changes with external MOS evaluation

The paper proposes two explicit, differentiable alterations to the DDSP-QbE excitation stage (voicing-gated noise replacement and PolyBLEP discontinuity correction) and reports their effect via separate MOS listening tests. No equations, fitted parameters, or self-citations are presented that would make the claimed cleaner roll-off or perceptual gains equivalent to the inputs by construction. The improvements are described as lightweight additions that integrate without additional learnable parameters, and the evaluation relies on human ratings rather than any self-referential metric. This is a standard engineering modification paper with no load-bearing derivation chain that collapses into its own definitions or fits.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard assumptions of differentiable subtractive synthesis and the perceptual relevance of aliasing reduction; it introduces no new free parameters, axioms beyond domain conventions, or invented entities.

axioms (2)

domain assumption: Phase accumulation produces a sawtooth-like waveform whose discontinuities generate audible aliasing.
Stated as the starting point for the proposed corrections in the abstract.
domain assumption: Voicing detection can be performed reliably enough to gate excitation without introducing new artefacts.
Implicit in the claim that gating suppresses periodic content only where it is disruptive.
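The first assumption above can be checked with simple arithmetic: harmonics of an ideal sawtooth that lie above the Nyquist frequency fold back into the audible band at frequencies that are generally not multiples of the fundamental. The helper names below are hypothetical, and the 900 Hz / 16 kHz values are arbitrary illustrative choices, not figures from the paper.

```python
def folded_frequency(f, fs):
    """Frequency (Hz) at which a component at f appears after sampling at fs."""
    f = f % fs
    return fs - f if f > fs / 2 else f

def aliased_harmonics(f0, fs, n_harmonics):
    """List (harmonic index, folded frequency) for harmonics above Nyquist."""
    out = []
    for k in range(1, n_harmonics + 1):
        f = k * f0
        if f > fs / 2:
            out.append((k, folded_frequency(f, fs)))
    return out

print(aliased_harmonics(900.0, 16000.0, 10))
# → [(9, 7900.0), (10, 7000.0)]
```

Neither 7900 Hz nor 7000 Hz is a multiple of 900 Hz, so the folded components are inharmonic. The number of harmonics that fold grows with the fundamental, which is consistent with the claim that the artefacts are worse at higher fundamental frequencies.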

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1] J. Yamagishi and X. Wang, “Survey talk at interspeech 2024,” Interspeech 2024, Kos Island, Greece, 2024, survey Talk. [Online]. Available: https://interspeech2024.org/survey-talks/
[2] H. Hua, Z. Shang, X. Li, P. Shi, C. Yang, L. Wang, and P. Zhang, “Emotional speech anonymization: Preserving emotion characteristics in pseudo-speaker speech generation,” in Proc. SPSC 2024, 2024, pp. 55–60
[3] S. G. Rajesh, S. V. Madangarli, G. S. Pisharady, and R. Subrahmanyam, “Enhancement of virtual assistants through multimodal ai for emotion recognition,” IEEE Access, 2025
[4] B. Sisman, J. Yamagishi, S. King, and H. Li, “An overview of voice conversion and its challenges: From statistical modeling to deep learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 132–157, 2020
[5] H. Tang, X. Zhang, J. Wang, N. Cheng, and J. Xiao, “AVQVC: One-shot voice conversion by vector quantization with applying contrastive learning,” in Proc. of the IEEE ICASSP, 2022, pp. 4613–4617
[6] S. Ghosh, F. Dreyer, T. Thiele, F. Lorbeer, and S. Stober, “Improving voice quality in speech anonymization with just perception-informed losses,” in Audio Imagination: NeurIPS 2024 Workshop, 2024
[7] T. Kaneko and H. Kameoka, “CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks,” in 2018 26th European Signal Processing Conference (EUSIPCO), 2018, pp. 2100–2104
[8] T. Walczyna and Z. Piotrowski, “Overview of voice conversion methods based on deep learning,” Applied Sciences, vol. 13, no. 5, p. 3100, 2023
[9] A. Das, S. Ghosh, T. Polzehl, I. Siegert, and S. Stober, “StarGAN-VC++: Towards emotion preserving voice conversion using deep embeddings,” in 12th ISCA SSW2023, 2023, pp. 81–87
[10] M. Baas, B. van Niekerk, and H. Kamper, “Voice conversion with just nearest neighbors,” in Interspeech 2023, 2023, pp. 2053–2057
[11] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022
[12] S. Ghosh, M. Jouaiti, A. Das, Y. Sinha, T. Polzehl, I. Siegert, and S. Stober, “Anonymising elderly and pathological speech: Voice conversion using DDSP and Query-by-Example,” in Interspeech 2024, 2024, pp. 4438–4442
[13] V. Valimaki, “Discrete-time synthesis of the sawtooth waveform with reduced aliasing,” IEEE Signal Processing Letters, vol. 12, no. 3, pp. 214–217, 2005
[14] J. Engel, C. Gu, A. Roberts et al., “DDSP: Differentiable digital signal processing,” in ICLR, 2019
[15] C. Lea, V. Mitra, A. Joshi, S. Kajarekar, and J. P. Bigham, “SEP-28k: A dataset for stuttering event detection from podcasts with people who stutter,” ICASSP 2021
[16] S. P. Bayerl, D. Wagner, T. Bocklet, and K. Riedhammer, “The Influence of Dataset-Partitioning on Dysfluency Detection Systems,” in Text, Speech, and Dialogue, 2022
[17] S. Luz, F. Haider, S. de la Fuente Garcia, D. Fromm, and B. MacWhinney, “Alzheimer’s dementia recognition through spontaneous speech,” p. 780169, 2021
[18] K. Zhou, B. Sisman, R. Liu, and H. Li, “Emotional voice conversion: Theory, databases and ESD,” Speech Communication, vol. 137, pp. 1–18, 2022
[19] N. Tomashenko, X. Miao, P. Champion, S. Meyer, X. Wang, E. Vincent, M. Panariello, N. Evans, J. Yamagishi, and M. Todisco, “The VoicePrivacy 2024 challenge evaluation plan,” 2024.

Received 16 March 2026
