The Calibrated Deepfake Trust Score (CDTS): Competence-Coupled Trust Degradation Across Deepfake Detectors

Md Anas Biswas

arxiv: 2606.29484 · v1 · pith:XMSDWFADnew · submitted 2026-06-28 · 💻 cs.CR · cs.CV · cs.LG

Pith reviewed 2026-06-30 07:08 UTC · model grok-4.3

classification 💻 cs.CR · cs.CV · cs.LG

keywords deepfake detection · trust calibration · competence estimation · calibration degradation · detector trustworthiness · label-free monitoring · trust score

The pith

As deepfake detectors lose their ability to tell fakes from real, their output probabilities stop being well-calibrated trust measures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reports that the calibration of a deepfake detector's trust score worsens as the detector's ability to separate real from fake content decreases. This coupling between competence and calibration was measured across 32 configurations, held for three different detector architectures, and replicated on a held-out dataset the detectors never trained on. A reader should care because many moderation and verification systems treat the detector's raw probability as a dependable trust level when deciding authenticity. The work further shows that competence can be estimated without ground-truth labels, which lets a system flag when the trust score is likely to mislead.

Core claim

The authors identify a competence-calibration coupling in which the calibration of the trust score degrades as the detector's discriminative competence falls. This relation appears with a pooled Pearson correlation of -0.81 across 32 configurations, persists across three architecturally distinct detectors, and replicates on a fourth held-out dataset.
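
The two quantities behind this claim can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's code: it assumes an equal-width-bin ECE computed on the fake-class probability (the paper's exact ECE variant is not given here), and the `competence` and `ece_per_config` values are hypothetical placeholders rather than the paper's measurements.

```python
import math

def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE for binary fake-probabilities in [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(p for p, _ in b) / len(b)
        frac_fake = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(mean_conf - frac_fake)
    return ece

def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (competence, ECE) pairs, one per configuration.
competence = [0.55, 0.65, 0.75, 0.85, 0.95]      # e.g. AUC per configuration
ece_per_config = [0.30, 0.24, 0.15, 0.10, 0.04]  # calibrated ECE per configuration
r = pearson_r(competence, ece_per_config)
print(f"pooled r = {r:.2f}")
```

A negative `r` on real per-configuration measurements is what the coupling predicts; the toy numbers here only show the mechanics.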

What carries the argument

The competence-calibration coupling carries the argument. It organizes detector trustworthiness and is put into practice by the Calibrated Deepfake Trust Score wrapper, which turns detector output into a self-auditing trust instrument.

If this is right

A single calibrator frozen on in-domain data fails exactly on the low-competence generators the coupling identifies, with its error tracking competence at r = -0.98.
Competence estimated without labels can flag calibration risk on unseen generators.
Routing source batches by a reference-free competence estimate lowers overall AURC and improves the low-to-mid coverage region relative to confidence-based routing.
The same competence factor drives calibration inequity across demographic subgroups and affects explanation faithfulness.
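
The routing claim in the list above rests on the risk-coverage curve, which can be sketched as follows. This is an illustrative sketch under stated assumptions: AURC is taken as the mean selective risk over all coverage levels, and the two batches, their per-frame confidences, and the batch-level competence scores are contrived to show the mechanism, not drawn from the paper.

```python
def aurc(scores, errors):
    """Area under the risk-coverage curve.

    scores: higher means the sample is accepted earlier.
    errors: 1 if the detector's decision on the sample is wrong, else 0.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    cum_err = 0
    risks = []
    for k, i in enumerate(order, start=1):
        cum_err += errors[i]
        risks.append(cum_err / k)  # selective risk at coverage k / n
    return sum(risks) / len(risks)

# Hypothetical batch A: competent source, no errors, moderate confidence.
# Hypothetical batch B: low-competence source, overconfident when wrong.
errors = [0, 0, 0, 0, 1, 0, 1, 1]
frame_confidence = [0.80, 0.82, 0.78, 0.81, 0.95, 0.60, 0.97, 0.93]
batch_competence = [0.9, 0.9, 0.9, 0.9, 0.2, 0.2, 0.2, 0.2]  # one score per batch

aurc_conf = aurc(frame_confidence, errors)
aurc_comp = aurc(batch_competence, errors)
print(f"confidence routing AURC = {aurc_conf:.3f}")
print(f"competence routing AURC = {aurc_comp:.3f}")
```

In this contrived case the batch-level score accepts every frame from the competent source first, so errors enter the accepted set later and the curve sits lower. Whether that happens in practice depends on how well the competence estimate separates sources.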

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The coupling may appear in other classification settings where model probabilities are used as trust or confidence signals.
One could test whether deliberately lowering competence in unrelated models reproduces the same drop in calibration.
Extending the analysis to audio or video deepfakes would check whether competence remains the central driver outside image detectors.

Load-bearing premise

The label-free competence estimate remains independent of the calibration metric and is not confounded by dataset-specific artifacts or how competence is measured.
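
To make the premise concrete, here is a minimal sketch of one label-free monitor, mean predictive entropy over a batch, which the Figure 9 caption names as the portable reference-free signal. The function names and the two example batches are assumptions for illustration only.

```python
import math

def binary_entropy(p, eps=1e-12):
    """Entropy (in nats) of a Bernoulli fake-probability p."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def mean_predictive_entropy(probs):
    """Batch-level monitor: uses only detector outputs, never labels."""
    return sum(binary_entropy(p) for p in probs) / len(probs)

# Hypothetical output probabilities for two source batches.
confident_batch = [0.02, 0.97, 0.01, 0.99]
uncertain_batch = [0.45, 0.55, 0.60, 0.40]

print(mean_predictive_entropy(confident_batch))
print(mean_predictive_entropy(uncertain_batch))
```

The sketch also shows why the premise is load-bearing: the monitor is computed from the same output probabilities that a binned calibration metric consumes, so independence between the two has to be demonstrated rather than assumed.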

What would settle it

A new experiment on fresh deepfake generators that finds the correlation between measured competence and calibration error is near zero or positive.
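
One way to read "near zero or positive" operationally is to ask whether a bootstrap interval for the correlation excludes zero. The sketch below uses a percentile bootstrap over configurations, in the spirit of the interval quoted in the Figure 3 caption; the data points are hypothetical and the helper names are not from the paper.

```python
import math
import random

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def bootstrap_ci(xs, ys, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Pearson r, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue  # degenerate resample: r is undefined
        stats.append(pearson_r(bx, by))
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi

# Hypothetical measurements on fresh generators (not the paper's data).
competence = [0.55, 0.60, 0.66, 0.71, 0.78, 0.84, 0.90, 0.96]
ece = [0.31, 0.27, 0.25, 0.20, 0.16, 0.13, 0.08, 0.05]

lo, hi = bootstrap_ci(competence, ece)
print(f"r = {pearson_r(competence, ece):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

If the upper end of the interval stays below zero on new generators, the coupling survives the test; an interval that covers zero or sits above it would count against it.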

Figures

Figures reproduced from arXiv: 2606.29484 by Md Anas Biswas.

**Figure 1.** The CDTS wrapper. Any backbone's logit is mapped through a post-hoc calibrator to a trust score, augmented with a label-free competence monitor that drives a zero-trust routing policy. The equity audit and explanation-faithfulness analysis are evaluated on the trust score. Detection competence is the shared driver of these trust properties; the coupling and label-free monitoring hold across CNN and transfo…

**Figure 2.** Inducing low competence. Applying a mismatched (reenactment-trained) checkpoint to DF40 face-swaps craters AUC and inflates calibrated ECE on otherwise-detectable forgeries, isolating the role of competence from generator identity.

**Figure 3.** The competence–calibration coupling across 32 configurations (two convolutional architectures, two datasets, four forgery methods, three checkpoints). Calibrated ECE rises as competence falls; pooled r = −0.81 (95% bootstrap CI [−0.90, −0.70], 5,000 resamples over the 32 configurations), p = 2.5 × 10⁻⁸.

**Figure 4.** External validity on a fourth dataset. The competence–calibration line (dashed) is fit from the 32 configurations of the CDTS coupling set (grey). The three DFD points (stars), scored on a held-out dataset that contributed nothing to the line, fall on it: EfficientNet at competence 0.70 sits essentially on the prediction, the CLIP transformer at 0.83 is on the line, and Xception at 0.59 is just below it (d…

**Figure 5.** Deployable (transferred) calibration. A single calibrator is fit once on the high-competence in-domain generators and applied frozen to all generators. Left: raw, oracle (per-generator best case), and transferred (frozen, deployable) ECE against competence; the transferred calibrator tracks competence at r = −0.98 and approaches the oracle only where competence is …

**Figure 6.** Left: per-subgroup calibration error (worst-vs-pooled view), with low-power cells marked. Right: the pairwise subgroup contrast, accuracy-equity versus calibration-equity between two named groups per axis. Skin-tone groups are equally accurate but unequally calibrated, the disparity an accuracy-only audit misses.

**Figure 7.** Saturation-free explanation faithfulness against competence across 20 generators (r = +0.94). When a detector cannot discriminate, its explanation carries no faithful signal.

**Figure 8.** Per-region faithfulness. For competent detectors (left two panels), Grad-CAM importance predicts the actual fake-probability drop when a region is removed (rising trend, faithful); for incompetent detectors (right two), importance is uncorrelated with effect (flat, unfaithful). The metric is saturation-robust by construction.

**Figure 9.** Label-free detection of calibration risk. Predictive entropy, the portable reference-free monitor, flags high-ECE generators at ROC-AUC 0.76 using no labels. KS divergence scores higher here (0.86) but its orientation is Xception-specific: because KS reverses sign across architectures (Section 7.2, …

**Figure 10.** The unifying result on Xception. Left: principal-component analysis of the five trust signals; the first component explains 84.7% of the variance. Right: that component aligns with competence (r = 0.98). On this detector the trust signals share one competence factor; Section 8.2 tests how far this generalizes.

**Figure 11.** Trust-signal correlation matrix. Every signal co-varies with every other; competence (AUC) is the shared axis.

**Figure 12.** The intrinsic-signal competence factor across convolutional architectures. First principal component of the intrinsic trust signals (calibration and explanation faithfulness) against competence for Xception (PC1 = 90.7%, r = 0.95) and EfficientNet-B4 (PC1 = 86.8%, r = 0.85). The intrinsic trust properties share a competence-aligned factor on both backbones.

**Figure 13.** Risk-coverage. Routing on batch-level reference-free competence (score dispersion, green) achieves a lower overall risk-coverage curve than per-frame confidence (red), with the advantage concentrated at low-to-mid coverage and a small crossover at high coverage.

Original abstract

Modern deepfake detectors are rarely consumed as bare classifiers. In moderation, provenance, and verification pipelines their output probability is read as a degree of trust, so its calibration matters as much as raw accuracy. We reframe deepfake detection as a calibrated, self-auditing trust instrument, the Calibrated Deepfake Trust Score (CDTS), and identify what governs its trustworthiness. Our central finding is a competence-calibration coupling: the calibration of the trust score degrades as the detector's discriminative competence falls. We establish it across 32 configurations (pooled Pearson r = -0.81), demonstrate it within a single dataset, reinforce it by inducing low competence directly, and replicate it on a fourth held-out dataset the detectors never trained on. It holds across three architecturally distinct detectors, two convolutional networks and a CLIP vision transformer (r = -0.88, -0.83, -0.86). The result is also deployable: a single calibrator frozen on in-domain data fails on exactly the low-competence generators the coupling flags (its error tracks competence at r = -0.98), and competence is estimable without labels, so a label-free monitor flags calibration risk on unseen generators and routing source-batches on a reference-free competence estimate lowers overall AURC and improves the low-to-mid coverage operating region relative to confidence-based routing. The same competence factor also drives calibration inequity across demographic subgroups (distinct from accuracy inequity) and explanation faithfulness. We therefore argue that detector trustworthiness is organized by competence as a shared driver, that competence is the right quantity to estimate and condition on, and that trust scoring must be competence-aware. We offer the CDTS wrapper as the mechanism, and report openly where the unification is tight and where it is architecture-specific.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reports a pooled r=-0.81 link between falling detector competence and worsening calibration of trust scores, with a label-free proxy that could flag risk on new generators, but the proxy's independence from the calibration metric is not demonstrated.

The central observation is that deepfake detector calibration degrades as discriminative competence drops, shown across three architectures with per-detector correlations from -0.83 to -0.88 and a pooled r = -0.81 over 32 configurations. The paper backs this with direct induction of low competence, a test on a held-out dataset, and a frozen calibrator whose error tracks competence at r = -0.98. The practical piece is using a label-free competence estimate to route batches, which lowers overall AURC and helps most in the low-to-mid coverage region.

The work does a reasonable job of checking the pattern on multiple models and datasets rather than one cherry-picked case. The demographic inequity and explanation faithfulness angles are noted as downstream effects of the same competence factor.

The soft spot is the competence proxy. The abstract treats it as independent and label-free, yet if it draws from the same output statistics that feed the calibration error (Brier, ECE, etc.), the negative correlation can appear mechanically. No operational definition, ablation, or orthogonality check is visible in the provided material, and the held-out test does not resolve this if the proxy and metric share the detector's decision channel. That leaves the claim vulnerable to the circularity concern.

This is for people building or auditing deepfake moderation pipelines who already care about calibration. A reader who wants empirical patterns on trust scoring would find the replications useful. It deserves peer review because the deployment angle is concrete and the replications are there, even though the independence of the competence measure needs direct verification in the methods.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Calibrated Deepfake Trust Score (CDTS) wrapper for deepfake detectors and claims a competence-calibration coupling: calibration of the trust score degrades as the detector's discriminative competence falls. This is established across 32 configurations (pooled Pearson r = -0.81), holds across three architecturally distinct detectors (r = -0.88, -0.83, -0.86), is reinforced by direct induction of low competence, and is replicated on a held-out dataset never seen in training. The result is presented as deployable: a calibrator frozen on in-domain data has error that tracks competence at r = -0.98, and label-free competence estimation enables routing that lowers overall AURC. The same competence factor is also reported to drive calibration inequity across demographic subgroups.

Significance. If the independence of the label-free competence proxy from the calibration metric is demonstrated, the result would be significant for trust scoring in deepfake moderation and verification pipelines by identifying competence as a shared driver of calibration, inequity, and explanation faithfulness. The manuscript earns credit for its multiple replications, held-out dataset test, direct competence induction, and open reporting of where the unification is tight versus architecture-specific.

major comments (2)

[Abstract] The central claim of a competence-calibration coupling (pooled r = -0.81) depends on the label-free competence estimate being independent of the calibration metric (Brier score, ECE, or reliability diagrams). The abstract asserts independence and generalization to unseen generators but supplies no operational definition, ablation, or proof of orthogonality; if the proxy shares detector output statistics with the calibration error, the observed correlation can arise mechanically rather than reflect an independent property.
[Replication and held-out test sections] The held-out dataset replication and the low-competence induction are presented as strengthening the result, but the manuscript must explicitly show that the competence proxy construction and the calibration error computation remain independent under these interventions. Without this, the r = -0.98 tracking by the frozen calibrator does not resolve the potential circularity for the generalization claim.

minor comments (2)

[Abstract] The abstract introduces the CDTS but could more explicitly state its mathematical formulation as a wrapper (e.g., how the calibrated probability is computed from the base detector output) before discussing empirical results.
[Results tables/figures] Table or figure reporting the per-detector r values (-0.88, -0.83, -0.86) should include exact sample sizes and confidence intervals to allow assessment of the pooled r = -0.81.
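
As an illustration of the kind of formulation the first minor comment asks for, the sketch below wraps a detector logit with temperature scaling. This is an assumption chosen for concreteness: the provided material says only that the wrapper uses a post-hoc calibrator, so temperature scaling, the grid range, and the example logits are all stand-ins.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def nll(logits, labels, temperature, eps=1e-12):
    """Mean negative log-likelihood of labels under sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)

def fit_temperature(logits, labels):
    """Grid search for T on held-out in-domain data (assumed range 0.25 to 5.0)."""
    grid = [0.25 + 0.05 * k for k in range(96)]
    return min(grid, key=lambda t: nll(logits, labels, t))

def trust_score(logit, temperature):
    """Calibrated probability that the input is fake."""
    return sigmoid(logit / temperature)

# Hypothetical in-domain calibration set: confident logits, two of them wrong.
cal_logits = [4.0, 3.5, -4.0, -3.0, 3.8, -3.6, 4.2, -4.1]
cal_labels = [1, 0, 0, 1, 1, 0, 1, 0]

T = fit_temperature(cal_logits, cal_labels)  # fitted once, then frozen
print(f"T = {T:.2f}, raw = {sigmoid(4.0):.3f}, calibrated = {trust_score(4.0, T):.3f}")
```

Freezing `T` after fitting on in-domain data mirrors the "transferred" setting the manuscript describes: the same scalar is then applied to every generator, including ones where the detector is less competent.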

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's insightful comments on the need to explicitly demonstrate independence of the label-free competence proxy from calibration metrics. We address each point below and will revise the manuscript accordingly to strengthen these aspects.

Referee: [Abstract] The central claim of a competence-calibration coupling (pooled r = -0.81) depends on the label-free competence estimate being independent of the calibration metric (Brier score, ECE, or reliability diagrams). The abstract asserts independence and generalization to unseen generators but supplies no operational definition, ablation, or proof of orthogonality; if the proxy shares detector output statistics with the calibration error, the observed correlation can arise mechanically rather than reflect an independent property.

Authors: We agree that the manuscript would benefit from an explicit operational definition and ablation to rule out mechanical correlations arising from shared detector output statistics. The current abstract relies on the label-free construction of the proxy (which uses no ground-truth labels, unlike Brier/ECE) to claim independence, but does not provide the requested ablations or orthogonality tests. In revision we will add a dedicated methods subsection with the proxy definition and an ablation that orthogonalizes the proxy from output statistics before recomputing the pooled correlation with calibration error. revision: yes

Referee: [Replication and held-out test sections] The held-out dataset replication and the low-competence induction are presented as strengthening the result, but the manuscript must explicitly show that the competence proxy construction and the calibration error computation remain independent under these interventions. Without this, the r = -0.98 tracking by the frozen calibrator does not resolve the potential circularity for the generalization claim.

Authors: We concur that explicit independence checks are required under the held-out dataset and direct competence-induction settings to support the generalization and frozen-calibrator claims. The manuscript does not currently include these targeted verifications. In the revised version we will expand the relevant sections with partial-correlation analyses and independence tests between proxy and calibration error under each intervention, confirming that the r = -0.98 tracking is not due to circularity. revision: yes
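
The partial-correlation check promised in this response can be sketched with the standard first-order formula. The data below are constructed so that the proxy and the calibration error are linked only through a shared output statistic; they are hypothetical, in arbitrary units, and exist only to show what a circular result would look like.

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_r(xs, ys, zs):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical per-generator values (arbitrary units).
shared_stat = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # e.g. an output statistic both use
proxy = [1.3, 1.9, 2.8, 3.8, 4.9, 6.3]         # label-free competence proxy
cal_error = [6.05, 4.75, 4.2, 3.2, 1.75, 1.05] # calibration error

raw = pearson_r(proxy, cal_error)
partial = partial_r(proxy, cal_error, shared_stat)
print(f"raw r = {raw:.3f}, partial r given shared statistic = {partial:.3f}")
```

Here the raw correlation is strongly negative while the partial correlation is essentially zero, which is the pattern that would indicate a mechanical link. A coupling that reflects an independent property should stay clearly negative after conditioning.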

Circularity Check

0 steps flagged

No significant circularity; empirical correlations reported as independent measurements

The paper reports observed Pearson correlations (pooled r = -0.81 across 32 configurations, with per-detector values -0.88/-0.83/-0.86) between a label-free competence estimate and calibration degradation metrics. It further states that a frozen calibrator's error tracks competence at r = -0.98 and that competence is estimable without labels, with replication on held-out data. No equations, definitions, or self-citations are supplied in the provided text that reduce the competence proxy to the calibration error (or vice versa) by construction. The central claims rest on statistical associations across detectors and datasets rather than definitional equivalence or fitted-input renaming. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on standard statistical correlation measures and introduces CDTS as a new wrapper construct; no explicit free parameters or invented physical entities are described in the abstract.

axioms (1)

standard math: Pearson correlation is an appropriate measure for the relationship between competence and calibration degradation.
Used to quantify the coupling with reported r values across configurations.

invented entities (1)

CDTS (no independent evidence)
purpose: Wrapper mechanism to make deepfake trust scoring competence-aware
Introduced in the abstract as the deployable output of the work.

pith-pipeline@v0.9.1-grok · 5869 in / 1334 out tokens · 33151 ms · 2026-06-30T07:08:41.329948+00:00 · methodology
