ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

· 2026 · eess.AS · arXiv 2604.13229

1 Pith paper cite this work. Polarity classification is still indexing.

1 Pith paper citing it

open full Pith review browse 1 citing papers arXiv PDF

abstract

Speech deepfake detection (SDD) systems perform well on standard benchmarks datasets but often fail to generalize to expressive and emotional spoofing attacks. Many methods rely on spoof-heavy training data, learning dataset-specific artifacts rather than transferable cues of natural speech. In contrast, humans internalize variability in real speech and detect fakes as deviations from it. We introduce ProSDD, a two-stage framework that enriches model embeddings through supervised masked prediction of speaker-conditioned prosodic variation based on pitch, voice activity, and energy. Stage I learns prosodic variability from real speech, and Stage II jointly optimizes this objective with spoof classification. ProSDD consistently outperforms baselines under both ASVspoof 2019 and 2024 training, reducing ASVspoof 2024 EER from 25.43% to 16.14% (2019-trained) and from 39.62% to 7.38% (2024-trained), while achieving 50% relative reductions on EmoFake and EmoSpoof-TTS.

representative citing papers

ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

eess.AS · 2026-04-14 · unverdicted · novelty 6.0

ProSDD learns speaker-conditioned prosodic variation from real speech via supervised masked prediction and jointly optimizes it with spoof detection, cutting EER substantially on ASVspoof 2024 and emotional datasets.

citing papers explorer

Showing 1 of 1 citing paper.

ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks eess.AS · 2026-04-14 · unverdicted · none · ref 1 · internal anchor
ProSDD learns speaker-conditioned prosodic variation from real speech via supervised masked prediction and jointly optimizes it with spoof detection, cutting EER substantially on ASVspoof 2024 and emotional datasets.

ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

fields

years

verdicts

representative citing papers

citing papers explorer