Forensic Similarity for Speech Deepfakes

Daniele Ugo Leonzio; Davide Salvi; Paolo Bestagini; Stefano Tubaro; Viola Negroni

arxiv: 2510.02864 · v2 · submitted 2025-10-03 · 💻 cs.SD

Pith reviewed 2026-05-18 10:24 UTC · model grok-4.3

classification 💻 cs.SD

keywords: forensic similarity, speech deepfakes, audio forensics, deepfake detection, source verification, splicing detection, Siamese network

The pith

Forensic similarity between audio pairs can reveal if two speech deepfakes come from the same model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper transfers the idea of forensic similarity from images to speech deepfakes. It builds a system that extracts features from audio using a Siamese network and then compares pairs with a similarity network to decide if they share the same forensic traces. The goal is to support tasks like checking if samples were generated by the same model and detecting if audio has been spliced from different sources. Results indicate the method handles new, unseen traces effectively, which is useful because deepfake techniques evolve quickly in practice.

Core claim

The central claim is that a two-stage deep learning framework with a Siamese-based feature extractor and a similarity network can map pairs of speech segments to a similarity score that indicates whether they contain identical or different forensic traces, allowing determination of whether samples originate from the same source or have been spliced.

What carries the argument

The similarity network that compares forensic features extracted by a Siamese network from pairs of audio segments to produce a score for shared traces.
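
The two-stage data flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `make_extractor`, `similarity_network`, and `forensic_similarity` are hypothetical names, the extractor is a random linear layer with a tanh, and the similarity network is replaced by a fixed function of the embedding distance, whereas the paper's similarity network is learned.

```python
import math
import random


def make_extractor(in_dim, emb_dim, seed=0):
    """Build a toy feature extractor f (hypothetical: random linear layer + tanh)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
               for _ in range(emb_dim)]

    def f(x):
        # One embedding value per weight row.
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return f


def similarity_network(emb_a, emb_b):
    """Hypothetical stand-in for the learned similarity network.

    Maps the Euclidean distance between two embeddings to a score in (0, 1].
    """
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    return math.exp(-dist)


def forensic_similarity(x_a, x_b, f):
    # The same f (same weights) embeds both inputs: this sharing is the Siamese part.
    return similarity_network(f(x_a), f(x_b))
```

In this sketch two identical inputs always score 1.0, and the score shrinks as the embeddings move apart. A trained system would replace both toy components with learned networks.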

If this is right

The method supports source verification to check if two samples were generated by the same deepfake model.
It can be used for audio splicing detection by identifying segments with mismatched traces.
The approach generalizes to previously unseen forensic traces from new models.
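
The splicing use case in the list above can be illustrated with a pairwise score applied to consecutive segments. The sketch below is an assumption about how such a check might be wired, not the procedure evaluated in the paper: `toy_similarity` and `find_splice_points` are hypothetical names, the embeddings are taken as given, and the 0.5 threshold is arbitrary.

```python
import math


def toy_similarity(emb_a, emb_b):
    """Hypothetical similarity score in (0, 1], standing in for a trained model."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    return math.exp(-dist)


def find_splice_points(segment_embeddings, similarity=toy_similarity, threshold=0.5):
    """Return the indices of segments whose trace differs from the previous segment."""
    boundaries = []
    for i in range(len(segment_embeddings) - 1):
        score = similarity(segment_embeddings[i], segment_embeddings[i + 1])
        if score < threshold:
            boundaries.append(i + 1)  # segment i + 1 starts a new trace
    return boundaries
```

For example, `find_splice_points([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])` flags index 2, where the embeddings jump. Comparing only neighbouring segments keeps the cost linear in the number of segments; comparing all pairs would be more robust to a single noisy segment, at quadratic cost.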

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar methods might be developed for other media like video to track consistent generator signatures across content.
Integrating this with existing deepfake detectors could create more robust verification pipelines for audio evidence.
Testing on real-world mixed audio from social media could show practical limits in noisy environments.

Load-bearing premise

Speech deepfakes leave consistent, model-specific forensic traces that can be reliably extracted and compared across samples.

What would settle it

Training the system on deepfakes from several models, then testing on a completely new, unseen model: the method should assign high similarity scores to pairs drawn from that model alone and low scores to pairs that mix it with any other source.
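
A held-out-generator test of this kind could be organised roughly as follows. The sketch is illustrative only: the function names, the `(generator_name, embedding)` sample format, and the distance-based `toy_similarity` are assumptions made here, and the paper's actual protocol and metrics are not reproduced.

```python
import itertools
import math


def toy_similarity(emb_a, emb_b):
    """Hypothetical similarity score in (0, 1], standing in for a trained model."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    return math.exp(-dist)


def split_by_generator(samples, held_out):
    """Split (generator_name, embedding) samples so held-out generators never reach training."""
    train = [s for s in samples if s[0] not in held_out]
    test = [s for s in samples if s[0] in held_out]
    return train, test


def unseen_pairs(samples, held_out):
    """All pairs involving at least one held-out sample, labelled True if same generator."""
    pairs = []
    for a, b in itertools.combinations(samples, 2):
        if a[0] in held_out or b[0] in held_out:
            pairs.append((a[1], b[1], a[0] == b[0]))
    return pairs


def pair_accuracy(pairs, similarity=toy_similarity, threshold=0.5):
    """Fraction of pairs whose thresholded score matches the same-source label."""
    correct = sum((similarity(a, b) >= threshold) == same for a, b, same in pairs)
    return correct / len(pairs)
```

The key design point is that the split is made by generator, not by utterance, so no audio from the held-out model can influence training. The test pairs then cover both cases that matter: two samples from the unseen model, and one unseen sample paired with a known source.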

Figures

Figures reproduced from arXiv: 2510.02864 by Daniele Ugo Leonzio, Davide Salvi, Paolo Bestagini, Stefano Tubaro, Viola Negroni.

**Figure 1.** The complete architecture of the proposed framework: xA and xB are a pair of input speech samples, while f(xA) and f(xB) are, respectively, the embeddings we extract from these samples. An embedding is a dense representation of the input that captures the most informative and discriminative features learned by the model, i.e., the feature extractor, reflecting the patterns relevant to the task it was…

Original abstract

In this paper, we introduce the concept of forensic similarity in the speech deepfake detection domain, which aims to determine whether two audio segments share the same underlying forensic traces. Our approach is inspired by prior work in the image domain. To transfer this idea to the audio domain, we propose a two-stage deep learning framework consisting of a Siamese-based feature extractor and a core decision module, referred to as the similarity network. The system goal to assess whether two speech samples originate from the same source by comparing their forensic characteristics. In practice, the model maps pairs of audio segments to a similarity score indicating whether they contain identical or different forensic traces. We evaluate the proposed method on the emerging task of source verification, demonstrating its ability to determine whether two speech samples were generated by the same model. In addition, we explore its applicability to audio splicing detection as a complementary use case. Experimental results show that the proposed approach generalizes well to previously unseen forensic traces, highlighting its robustness, flexibility, and practical relevance for digital audio forensics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces forensic similarity for speech deepfakes, proposing a two-stage framework with a Siamese feature extractor and similarity network to determine whether two audio segments share identical forensic traces. It evaluates the method on source verification (same-model detection) and audio splicing detection, claiming robust generalization to previously unseen forensic traces.

Significance. If the separation of model-specific traces from content and speaker features holds, the approach could provide a flexible, transferable tool for audio forensics that extends image-domain forensic similarity methods and addresses the practical challenge of emerging deepfake generators.

major comments (2)

[Proposed Method] The premise that the Siamese extractor isolates model-specific forensic traces separable from linguistic content or speaker identity underpins the generalization claim to unseen traces, yet the manuscript provides no ablations, embedding visualizations, or controls to verify this separation in the audio domain (see Proposed Method and Experimental Evaluation sections).
[Experimental Results] Claims of generalization to unseen forensic traces rest on experimental outcomes, but the evaluation lacks explicit reporting of datasets, baselines, metrics, error bars, or exclusion criteria, making it impossible to rule out post-hoc issues or confirm support for the central robustness assertions.

minor comments (2)

[Abstract] Grammatical issue in the abstract: 'The system goal to assess' should read 'The system's goal is to assess whether two speech samples originate from the same source'.
[Proposed Method] Notation for the similarity score and forensic trace comparison could be formalized with an equation in the method section for clarity.
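
One way the notation requested in the last comment could look is given below. Only x_A, x_B, and f come from the Figure 1 caption; the symbol S for the similarity network, the threshold \tau, and the assumption that the score lies in [0, 1] are introduced here for illustration and are not taken from the paper. The `cases` environment requires the amsmath package.

```latex
\[
  s(x_A, x_B) = S\bigl(f(x_A),\, f(x_B)\bigr) \in [0, 1],
  \qquad
  \hat{y} =
  \begin{cases}
    1 & \text{if } s(x_A, x_B) \ge \tau \quad \text{(same forensic trace)},\\
    0 & \text{otherwise (different forensic traces).}
  \end{cases}
\]
```

Here f is applied with the same weights to both inputs, and \hat{y} is the binary decision obtained by thresholding the similarity score.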

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below and describe the changes we will incorporate in the revised manuscript to improve clarity and support for our claims.

Referee: [Proposed Method] The premise that the Siamese extractor isolates model-specific forensic traces separable from linguistic content or speaker identity underpins the generalization claim to unseen traces, yet the manuscript provides no ablations, embedding visualizations, or controls to verify this separation in the audio domain (see Proposed Method and Experimental Evaluation sections).

Authors: We agree that explicit verification of the separation between model-specific forensic traces and content/speaker features would strengthen the generalization argument. In the revised manuscript we will add t-SNE visualizations of the extracted embeddings colored by model, speaker, and linguistic content, along with ablation studies that systematically vary speaker identity and textual content while holding the generation model fixed. These additions will be placed in the Experimental Evaluation section to directly address the concern. revision: yes
Referee: [Experimental Results] Claims of generalization to unseen forensic traces rest on experimental outcomes, but the evaluation lacks explicit reporting of datasets, baselines, metrics, error bars, or exclusion criteria, making it impossible to rule out post-hoc issues or confirm support for the central robustness assertions.

Authors: We acknowledge that the current presentation of the experimental protocol is insufficiently detailed. In the revision we will expand the Experimental Evaluation section to include: (i) complete dataset descriptions with train/validation/test splits and any exclusion criteria, (ii) explicit listing of all baselines and implementation details, (iii) full metric tables with standard deviations computed over multiple random seeds, and (iv) a clear statement of the evaluation protocol to preclude post-hoc selection concerns. These changes will make the robustness claims reproducible and verifiable. revision: yes
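
Item (iii) of the response, reporting metrics with standard deviations over multiple random seeds, amounts to a small aggregation step. A minimal sketch, assuming one scalar metric per seed; `summarize_runs` is a hypothetical helper, and any numbers passed to it in examples are placeholders, not results from the paper.

```python
import statistics


def summarize_runs(scores_by_seed):
    """Return (mean, sample standard deviation) of one metric across seeds.

    scores_by_seed maps a seed to the metric value obtained with that seed.
    """
    values = list(scores_by_seed.values())
    mean = statistics.mean(values)
    # The sample standard deviation is undefined for a single run.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std
```

Calling `summarize_runs({0: 0.90, 1: 0.92, 2: 0.94})` with these placeholder values gives a mean of 0.92 and a standard deviation of 0.02, up to floating-point rounding.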

Circularity Check

0 steps flagged

No circularity: new two-stage framework evaluated on independent experimental tasks

The paper introduces forensic similarity for speech deepfakes via a Siamese feature extractor plus similarity network, explicitly framed as a transfer from image-domain prior work to a new audio construction. Central claims of generalization to unseen traces rest on experimental results for source verification and splicing detection rather than any internal definitions, fitted parameters renamed as predictions, or self-citation chains. No equations or derivations are presented that reduce to their own inputs by construction; the method is self-contained against external benchmarks and falsifiable via held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract is available, so the ledger is necessarily incomplete. The central claim rests on the domain assumption that model-specific forensic traces exist and are comparable, plus standard deep-learning assumptions about feature extraction. No explicit free parameters or invented entities are named.

axioms (1)

Domain assumption: Speech deepfakes contain consistent, model-specific forensic traces that can be extracted and compared across samples.
This assumption is invoked when the paper defines forensic similarity and designs the Siamese-based feature extractor plus similarity network to transfer the concept from images to audio.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "two-stage deep learning framework consisting of a Siamese-based feature extractor and a core decision module, referred to as the similarity network... maps pairs of audio segments to a similarity score"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "generalizes well to previously unseen forensic traces"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
