
arxiv: 2606.21584 · v1 · pith:GZ7M5NSWnew · submitted 2026-06-19 · 💻 cs.SD

When EER Hides Deployment Failure: Auditing Threshold Transfer and Unlabeled Score Calibration for Speech Deepfake Detectors

Pith reviewed 2026-06-26 12:47 UTC · model grok-4.3

classification 💻 cs.SD
keywords speech deepfake detection · equal error rate · threshold transfer · score calibration · half total error rate · unlabeled evaluation · deployment failure · countermeasures

The pith

Any strictly increasing score transform leaves equal error rate unchanged, so monotone test-time calibration cannot lower a speech deepfake detector's EER, and EER alone cannot show whether a transferred threshold works.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that equal error rate (EER) is invariant under any strictly increasing transformation of detector scores, a class that includes z-norm, temperature and shift calibration, and embedding mean alignment under a frozen linear head. Such corrections, applied at test time on unlabeled data, therefore cannot alter EER. The metric also hides a second problem: when a threshold fixed on the labeled source data (ASVspoof 2019 LA) is transferred to a new corpus, the half total error rate (HTER) on In-the-Wild reaches 39.5 percent even though the In-the-Wild EER looks moderate at 11.2 percent. An audit of seven corrections on In-the-Wild and ASVspoof 2021 DF confirms the invariance empirically and exposes two further failure modes: AS-norm collapses when its cohort is drawn from unlabeled target data, and pseudo-label calibration degenerates on DF21, whose spoof prior is 96 percent. Standard EER reporting can therefore mask large practical failures once the oracle threshold is removed.

Core claim

The paper establishes that, for any detector, the equal error rate computed on a test set is identical before and after any strictly increasing score transform. An empirical audit of seven unlabeled corrections on In-the-Wild and ASVspoof 2021 DF finds that none reduces EER by more than 1 percent relative, that AS-norm with an unlabeled target cohort raises EER from 11.2 percent to 60.2 percent, and that pseudo-label calibration, which reduces HTER by 38 percent relative on In-the-Wild, degenerates to 50 percent HTER on DF21, whose spoof prior is 96 percent. The authors therefore recommend reporting half total error rate at a transferred threshold alongside EER.

What carries the argument

The invariance of equal error rate to any strictly increasing score transform, which blocks all monotonic corrections from changing the metric.
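The invariance can be checked numerically. The sketch below uses synthetic Gaussian scores (an assumption made here, not the paper's data) and a simple threshold sweep for EER; the function name `eer` and the particular transforms are illustrative choices. It applies several strictly increasing maps and, for contrast, one non-monotone map.

```python
import numpy as np

def eer(bona, spoof):
    """EER from a sweep over all observed scores (higher = more bona fide)."""
    thresholds = np.sort(np.concatenate([bona, spoof]))
    # FRR: bona fide scored below the threshold; FAR: spoof scored at or above it.
    frr = np.array([(bona < t).mean() for t in thresholds])
    far = np.array([(spoof >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(frr - far))
    return 0.5 * (frr[i] + far[i])

# Synthetic scores, purely for illustration.
rng = np.random.default_rng(0)
bona = rng.normal(1.0, 1.0, 1000)
spoof = rng.normal(-1.0, 1.0, 1000)
pooled = np.concatenate([bona, spoof])

# Strictly increasing transforms: each is one fixed map applied to every score.
monotone = {
    "z-norm": lambda s: (s - pooled.mean()) / pooled.std(),
    "temperature": lambda s: s / 2.5,
    "shift": lambda s: s + 3.0,
    "sigmoid": lambda s: 1.0 / (1.0 + np.exp(-s)),
}

base = eer(bona, spoof)
after = {name: eer(f(bona), f(spoof)) for name, f in monotone.items()}

# A non-monotone map (absolute value) scrambles the ordering and can change EER.
scrambled = eer(np.abs(bona), np.abs(spoof))

print(f"baseline EER: {base:.4f}")
for name, value in after.items():
    print(f"{name:12s} EER: {value:.4f}")
print(f"abs (non-monotone) EER: {scrambled:.4f}")
```

Every monotone row should print the same value as the baseline, because each transform preserves the rank order of the pooled scores; only the non-monotone map moves the metric.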

If this is right

  • Transferring the LA-calibrated threshold to In-the-Wild produces 39.5 percent HTER with 78.7 percent bona-fide rejection.
  • AS-norm using an unlabeled target cohort raises EER from 11.2 percent to 60.2 percent.
  • Pseudo-label calibration reduces HTER by 38 percent relative on In-the-Wild yet degenerates to 50 percent HTER on DF21.
  • No audited correction reduces EER by more than 1 percent relative on either corpus.
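The gap between a moderate target EER and a large HTER at a transferred threshold can be reproduced in miniature. The sketch below is a toy, not the paper's experiment: the source and target score distributions are hypothetical Gaussians chosen for illustration, and the helper names `rates` and `eer_point` are introduced here.

```python
import numpy as np

def rates(bona, spoof, tau):
    """FRR and FAR at threshold tau (accept as bona fide when score >= tau)."""
    frr = float((bona < tau).mean())
    far = float((spoof >= tau).mean())
    return frr, far

def eer_point(bona, spoof):
    """Return (EER, threshold) from a sweep over the observed scores."""
    cands = np.sort(np.concatenate([bona, spoof]))
    pairs = [rates(bona, spoof, t) for t in cands]
    i = int(np.argmin([abs(frr - far) for frr, far in pairs]))
    return 0.5 * sum(pairs[i]), float(cands[i])

rng = np.random.default_rng(1)
# Hypothetical source domain: classes are very well separated.
src_bona, src_spoof = rng.normal(5, 1, 1000), rng.normal(-5, 1, 1000)
# Hypothetical target domain: scores drift and the classes move closer together.
tgt_bona, tgt_spoof = rng.normal(0, 1, 1000), rng.normal(-2.5, 1, 1000)

# Threshold fixed in advance at the source EER point, then applied unchanged.
src_eer, tau = eer_point(src_bona, src_spoof)
frr, far = rates(tgt_bona, tgt_spoof, tau)
hter = 0.5 * (frr + far)

# Oracle EER on the target, which needs target labels to pick its threshold.
tgt_eer, _ = eer_point(tgt_bona, tgt_spoof)

print(f"source EER {src_eer:.4f}, transferred threshold {tau:.2f}")
print(f"target EER (oracle threshold) {tgt_eer:.4f}")
print(f"target HTER at transferred threshold {hter:.4f} (FRR {frr:.3f}, FAR {far:.3f})")
```

In this toy setting the transferred threshold sits well above most target bona fide scores, so the error is dominated by bona fide rejection while the oracle-threshold EER on the same target scores stays much lower.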

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar threshold-transfer audits may be needed for detectors in other audio or signal-classification tasks.
  • Training objectives could be extended to penalize sensitivity to threshold choice on shifted distributions.
  • Cohort selection for normalization methods requires explicit modeling of possible prior mismatch.

Load-bearing premise

The In-the-Wild corpus and ASVspoof 2021 DF represent realistic unlabeled deployment scenarios whose distribution shift from training data is typical of practical use.

What would settle it

A strictly increasing score transform that yields a different equal error rate on the same labeled test set, or an unlabeled correction that consistently lowers half total error rate at the transferred threshold across both corpora.

Original abstract

Speech deepfake countermeasures (CMs) are compared almost exclusively by equal error rate (EER), a metric computed at an oracle threshold chosen on the labeled test set. Deployed CMs enjoy no such oracle: a threshold must be fixed in advance and applied to unlabeled target data. We audit this gap with a frozen state-of-the-art SSL-AASIST detector trained on ASVspoof 2019 LA. While its in-domain EER is 0.21%, transferring its LA-calibrated threshold to the In-the-Wild corpus yields a half total error rate (HTER) of 39.5%, with 78.7% of bona fide speech rejected, even though the In-the-Wild EER (11.2%) appears moderate. We then test whether popular unlabeled test-time corrections close this gap, and first prove a simple proposition: any strictly increasing score transform, including z-norm, temperature/shift calibration, and embedding mean alignment under a frozen linear head, cannot change EER. An audit of seven corrections on In-the-Wild and ASVspoof 2021 DF confirms the proposition empirically and exposes two further failure modes: AS-norm with an unlabeled target cohort collapses (EER 11.2% to 60.2%), and pseudo-label calibration that reduces HTER by 38% relative on In-the-Wild degenerates to 50% HTER on DF21, whose spoof prior is 96%. No audited correction reduces EER by more than 1% relative. We recommend reporting HTER at a transferred threshold alongside EER.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that EER computed with an oracle threshold on labeled test data conceals deployment failures for speech deepfake detectors, as transferring a fixed threshold to unlabeled data yields high HTER (e.g., 39.5% on In-the-Wild). It proves that any strictly increasing score transform (z-norm, temperature/shift calibration, embedding mean alignment) cannot change EER, audits seven such corrections on In-the-Wild and ASVspoof 2021 DF confirming the invariance empirically while exposing failure modes like AS-norm collapse and pseudo-label degeneration, and recommends reporting HTER at a transferred threshold alongside EER.

Significance. If the result holds, the work is significant for exposing a systematic evaluation gap in speech deepfake detection. The central invariance result is a parameter-free derivation from the definition of EER (preservation of score ordering and thus the ROC curve), supported by reproducible experiments on fixed public benchmarks without fitting to outcomes. This provides a concrete, falsifiable basis for revising benchmarking practices.

minor comments (2)
  1. [Abstract] The abstract states that seven corrections were audited but does not enumerate them; adding a short list or reference to the methods section would improve immediate readability.
  2. [The proposition section] The proposition is described as 'simple' and 'sketched'; consider stating it formally as a numbered definition or lemma with the exact conditions (strictly increasing transform) for precision.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and for recommending acceptance. The referee's summary accurately reflects the manuscript's central claims regarding the limitations of EER for deployment evaluation of speech deepfake detectors.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The core proposition that strictly increasing score transforms preserve EER follows directly from the definition of EER (the point on the ROC curve where FAR equals FRR), which depends solely on the relative ordering of scores; that ordering is invariant under strictly increasing transforms. This is a self-contained mathematical observation with no reduction to fitted parameters, self-citations, or ansatzes. The empirical audit applies fixed public benchmarks (In-the-Wild, ASVspoof 2021 DF) to a frozen detector without fitting any parameter to the reported HTER or failure-mode outcomes. No load-bearing self-citation chains, uniqueness theorems, or renamings appear in the derivation. The derivation therefore does not assume what it sets out to show.
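The ordering argument can be written out in a few lines. The notation below ($B$ and $S$ for the bona fide and spoof test sets, $g$ for the transform, $\tau$ for the threshold) is chosen here for illustration and is not taken from the paper; scores follow the convention that higher means more bona fide.

```latex
\begin{align*}
\mathrm{FRR}_s(\tau) &= \frac{1}{|B|}\sum_{x \in B} \mathbf{1}\bigl[s(x) < \tau\bigr],
&
\mathrm{FAR}_s(\tau) &= \frac{1}{|S|}\sum_{x \in S} \mathbf{1}\bigl[s(x) \ge \tau\bigr].
\end{align*}

For strictly increasing $g$,
\[
s(x) < \tau \iff g(s(x)) < g(\tau),
\]
so
\[
\mathrm{FRR}_{g \circ s}\bigl(g(\tau)\bigr) = \mathrm{FRR}_s(\tau),
\qquad
\mathrm{FAR}_{g \circ s}\bigl(g(\tau)\bigr) = \mathrm{FAR}_s(\tau).
\]
```

Conversely, any threshold placed on the transformed scores splits the test samples exactly as some threshold on the original scores does, because $g$ preserves their order. The two score sets therefore trace the same set of (FAR, FRR) pairs, and the pair where the two rates meet, the EER, is the same.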

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on the standard definitions of EER and HTER from speaker verification literature together with the mathematical property that EER depends only on the ordering of scores.

axioms (2)
  • standard math: EER is the common value of the false acceptance and false rejection rates at the threshold where the two are equal, computed on a labeled test set.
    Standard definition used throughout binary classification and speaker verification.
  • standard math: HTER is half the sum of the false acceptance and false rejection rates at a threshold fixed in advance.
    Common deployment-oriented metric: the threshold is chosen without target labels, although computing the two rates still requires a labeled evaluation set.
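The two definitions combine into a short worked check. The symbols $\tau$ and $\tau^{*}$ are notation chosen here, and the implied FAR figure is derived from the abstract's rounded numbers, so it should be read as approximate.

```latex
\begin{align*}
\mathrm{HTER}(\tau) &= \tfrac{1}{2}\bigl(\mathrm{FAR}(\tau) + \mathrm{FRR}(\tau)\bigr), \\
\mathrm{HTER}(\tau^{*}) &= \mathrm{EER}
  \quad \text{when } \mathrm{FAR}(\tau^{*}) = \mathrm{FRR}(\tau^{*}), \\
\mathrm{FAR} &= 2\,\mathrm{HTER} - \mathrm{FRR} \approx 2(0.395) - 0.787 = 0.003
  \quad \text{(In-the-Wild, transferred threshold)}, \\
\mathrm{HTER} &= \tfrac{1}{2}(0 + 1) = 0.5
  \quad \text{if every trial is rejected (or every trial is accepted)}.
\end{align*}
```

Read this way, the 39.5% HTER on In-the-Wild is almost entirely bona fide rejection. An HTER of exactly 50% is what a constant decision produces whatever the class prior; whether that is the precise mechanism behind the DF21 figure is not stated in the text above.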

pith-pipeline@v0.9.1-grok · 5828 in / 1434 out tokens · 55990 ms · 2026-06-26T12:47:01.008398+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 3 linked inside Pith

  1. [1]

    transfer EER

    INTRODUCTION. Speech deepfake countermeasures have evolved rapidly, from hand-crafted spectral features [1] through end-to-end raw-waveform networks [2, 3] to self-supervised front-ends [4, 5] fine-tuned for spoofing detection [6, 7], with progress tracked by the ASVspoof challenge series [8, 9, 10, 11]. State-of-the-art systems now reach near-zero EER ...

  2. [2]

    THRESHOLD TRANSFER AND MONOTONE INVARIANCE. Setup. A CM assigns a scalar score s(x) ∈ R, higher meaning more bona fide; throughout, s = ℓ_bona − ℓ_spoof is the logit difference. On a labeled test set, FRR(τ) and FAR(τ) denote the bona fide rejection and spoof acceptance rates at threshold τ; EER is their common value at the thr...

  3. [3]

    AUDITED CORRECTIONS. All corrections use no target labels and keep all model weights frozen. For threshold transfer, each correction is applied identically to the source evaluation scores (or the source set is re-scored by the adapted model), and τ is re-derived at the source EER point in the corrected score space. (C1) z-norm. Global standardization of s with ...

  4. [4]

    EXPERIMENTS. 4.1. Setup. Model. SSL-AASIST [7]: a wav2vec 2.0 XLS-R 300M front-end [5] with an AASIST graph-attention back-end [3], using the authors’ released checkpoint trained on ASVspoof 2019 LA (LA model.pth, obtained from a public mirror of the official release). Inputs are 64,600 samples (about 4 s) at 16 kHz, cropped or tiled as in the original reci...

  5. [5]

    First, EER hides deployment failure: a 0.21%-EER detector operated at its source threshold rejects 78.7% of genuine In-the-Wild speech while its target EER still reads 11.2%

    CONCLUSIONS AND RECOMMENDATIONS. Auditing a frozen state-of-the-art CM yields three findings. First, EER hides deployment failure: a 0.21%-EER detector operated at its source threshold rejects 78.7% of genuine In-the-Wild speech while its target EER still reads 11.2%. Second, the most common unlabeled fixes are structurally unable to repair EER (Prop. 1)...

  6. [6]

    Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification,

    Massimiliano Todisco, Héctor Delgado, and Nicholas Evans, “Constant Q cepstral coefficients: A spoofing countermeasure for automatic speaker verification,” Computer Speech & Language, vol. 45, 2017

  7. [7]

    End-to-end anti-spoofing with RawNet2,

    Hemlata Tak, Jose Patino, Massimiliano Todisco, Andreas Nautsch, Nicholas Evans, and Anthony Larcher, “End-to-end anti-spoofing with RawNet2,” in Proc. IEEE ICASSP, 2021

  8. [8]

    AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks,

    Jee-weon Jung, Hee-Soo Heo, Hemlata Tak, Hye-jin Shim, Joon Son Chung, Bong-Jin Lee, Ha-Jin Yu, and Nicholas Evans, “AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in Proc. IEEE ICASSP, 2022

  9. [9]

    wav2vec 2.0: A framework for self-supervised learning of speech representations,

    Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, 2020

  10. [10]

    XLS-R: Self-supervised cross-lingual speech representation learning at scale,

    Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, and Michael Auli, “XLS-R: Self-supervised cross-lingual speech representation learning at scale,” in Proc. Interspeech, 2022

  11. [11]

    Investigating self-supervised front ends for speech spoofing countermeasures,

    Xin Wang and Junichi Yamagishi, “Investigating self-supervised front ends for speech spoofing countermeasures,” in Proc. Odyssey: The Speaker and Language Recognition Workshop, 2022

  12. [12]

    Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,

    Hemlata Tak, Massimiliano Todisco, Xin Wang, Jee-weon Jung, Junichi Yamagishi, and Nicholas Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in Proc. Odyssey: The Speaker and Language Recognition Workshop, 2022

  13. [13]

    ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge,

    Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Cemal Hanilçi, Md Sahidullah, and Aleksandr Sizov, “ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge,” in Proc. Interspeech, 2015

  14. [14]

    ASVspoof 2019: Future horizons in spoofed and fake audio detection,

    Massimiliano Todisco, Xin Wang, Ville Vestman, Md Sahidullah, Héctor Delgado, Andreas Nautsch, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, and Kong Aik Lee, “ASVspoof 2019: Future horizons in spoofed and fake audio detection,” in Proc. Interspeech, 2019

  15. [15]

    ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection,

    Junichi Yamagishi, Xin Wang, Massimiliano Todisco, Md Sahidullah, Jose Patino, Andreas Nautsch, Xuechen Liu, Kong Aik Lee, Tomi Kinnunen, Nicholas Evans, and Héctor Delgado, “ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection,” in Proc. ASVspoof 2021 Workshop, 2021

  16. [16]

    ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,

    Xin Wang, Héctor Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, and Junichi Yamagishi, “ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,” in Proc. ASVspoof Workshop, 2024

  17. [17]

    Spoofing and countermeasures for speaker verification: A survey,

    Zhizheng Wu, Nicholas Evans, Tomi Kinnunen, Junichi Yamagishi, Federico Alegre, and Haizhou Li, “Spoofing and countermeasures for speaker verification: A survey,” Speech Communication, vol. 66, 2015

  18. [18]

    Cross-database evaluation of audio-based spoofing detection systems,

    Pavel Korshunov and Sébastien Marcel, “Cross-database evaluation of audio-based spoofing detection systems,” in Proc. Interspeech, 2016

  19. [19]

    Generalization of audio deepfake detection,

    Tianxiang Chen, Avrosh Kumar, Parav Nagarsheth, Ganesh Sivaraman, and Elie Khoury, “Generalization of audio deepfake detection,” in Proc. Odyssey: The Speaker and Language Recognition Workshop, 2020

  20. [20]

    One-class learning towards synthetic voice spoofing detection,

    You Zhang, Fei Jiang, and Zhiyao Duan, “One-class learning towards synthetic voice spoofing detection,” IEEE Signal Processing Letters, vol. 28, 2021

  21. [21]

    Does audio deepfake detection generalize?,

    Nicolas M. Müller, Pavel Czempin, Franziska Diekmann, Adam Froghyar, and Konstantin Böttinger, “Does audio deepfake detection generalize?,” in Proc. Interspeech, 2022

  22. [22]

    MLAAD: The multi-language audio anti-spoofing dataset,

    Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, Edresson Casanova, Eren Gölge, Thorsten Müller, Piotr Syga, Philip Sperl, and Konstantin Böttinger, “MLAAD: The multi-language audio anti-spoofing dataset,” arXiv preprint arXiv:2401.09512, 2024

  23. [23]

    RawBoost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,

    Hemlata Tak, Madhu Kamble, Jose Patino, Massimiliano Todisco, and Nicholas Evans, “RawBoost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” in Proc. IEEE ICASSP, 2022

  24. [24]

    ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech,

    Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Héctor Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, et al., “ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech,” Computer Speech & Language, vol. 64, 2020

  25. [25]

    Score normalization for text-independent speaker verification systems,

    Roland Auckenthaler, Michael Carey, and Harvey Lloyd-Thomas, “Score normalization for text-independent speaker verification systems,” Digital Signal Processing, vol. 10, no. 1–3, 2000

  26. [26]

    Comparison of speaker recognition approaches for real applications,

    Sandro Cumani, Pier Domenico Batzu, Daniele Colibro, Claudio Vair, Pietro Laface, and Vasileios Vasilakakis, “Comparison of speaker recognition approaches for real applications,” in Proc. Interspeech, 2011

  27. [27]

    Analysis of score normalization in multilingual speaker recognition,

    Pavel Matějka, Ondřej Novotný, Oldřich Plchot, Lukáš Burget, Mireia Diez Sánchez, and Jan Černocký, “Analysis of score normalization in multilingual speaker recognition,” in Proc. Interspeech, 2017

  28. [28]

    On calibration of modern neural networks,

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger, “On calibration of modern neural networks,” in Proc. ICML, 2017

  29. [29]

    Return of frustratingly easy domain adaptation,

    Baochen Sun, Jiashi Feng, and Kate Saenko, “Return of frustratingly easy domain adaptation,” in Proc. AAAI, 2016

  30. [30]

    Improving robustness against common corruptions by covariate shift adaptation,

    Steffen Schneider, Evgenia Rusak, Luisa Eck, Oliver Bringmann, Wieland Brendel, and Matthias Bethge, “Improving robustness against common corruptions by covariate shift adaptation,” in Proc. NeurIPS, 2020

  31. [31]

    Tent: Fully test-time adaptation by entropy minimization,

    Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell, “Tent: Fully test-time adaptation by entropy minimization,” in Proc. ICLR, 2021

  32. [33]

    TRACE: Training-free partial audio deepfake detection via embedding trajectory analysis of speech foundation models,

    Awais Khan, Muhammad Umar Farooq, Kutub Uddin, and Khalid Malik, “TRACE: Training-free partial audio deepfake detection via embedding trajectory analysis of speech foundation models,” arXiv preprint arXiv:2604.01083, 2026

  33. [34]

    Think twice before adaptation: Improving adaptability of deepfake detection via online test-time adaptation,

    Hong-Hanh Nguyen-Le, Van-Tuan Tran, Dinh-Thuc Nguyen, and Nhien-An Le-Khac, “Think twice before adaptation: Improving adaptability of deepfake detection via online test-time adaptation,” in Proc. IJCAI, 2025

  34. [35]

    t-DCF: A detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification,

    Tomi Kinnunen, Kong Aik Lee, Héctor Delgado, Nicholas Evans, Massimiliano Todisco, Md Sahidullah, Junichi Yamagishi, and Douglas A. Reynolds, “t-DCF: A detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification,” in Proc. Odyssey: The Speaker and Language Recognition Workshop, 2018

  35. [36]

    Application-independent evaluation of speaker detection,

    Niko Brümmer and Johan du Preez, “Application-independent evaluation of speaker detection,” Computer Speech & Language, vol. 20, no. 2–3, 2006

  36. [37]

    When spoof detectors travel: Evaluation across 66 languages in the low-resource language spoofing corpus,

    K. Borodin, V. Kudryavtsev, M. Maslov, M. Gorodnichev, and G. Mkrtchian, “When spoof detectors travel: Evaluation across 66 languages in the low-resource language spoofing corpus,” arXiv preprint arXiv:2603.02364, 2026

  37. [38]

    Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure,

    Marco Saerens, Patrice Latinne, and Christine Decaestecker, “Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure,” Neural Computation, vol. 14, no. 1, 2002

  38. [39]

    Detecting and correcting for label shift with black box predictors,

    Zachary C. Lipton, Yu-Xiang Wang, and Alexander J. Smola, “Detecting and correcting for label shift with black box predictors,” in Proc. ICML, 2018