pith. machine review for the scientific record.

arxiv: 2604.21478 · v1 · submitted 2026-04-23 · 💻 cs.CV

Recognition: unknown

Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords: face forgery detection · cross-domain evaluation · Cross-AUC metric · mixture of experts · semantic alignment · CLIP · generalization

The pith

Cross-AUC metric reveals that face forgery detectors suffer large performance drops when real and fake samples come from different datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard cross-dataset AUC evaluation for face forgery detectors is inadequate because it masks significant shifts in detection scores when data comes from different sources. To address this, the authors introduce Cross-AUC, which measures performance by pairing real faces from one dataset with manipulated faces from another and vice versa. Evaluations of existing detectors under this metric show clear drops that standard metrics overlook, pointing to an unaddressed robustness challenge. They further present the SFAM framework, built around patch-level image-text alignment to heighten sensitivity to forgery artifacts and a mixture-of-experts module that assigns different facial regions to specialized experts for more targeted analysis.

Core claim

The commonly used cross-dataset AUC metric fails to reveal an important failure mode: detection scores can shift significantly across data domains. The proposed Cross-AUC metric computes AUC across dataset pairs by contrasting real samples from one dataset with fake samples from another, and under it representative detectors show substantial performance drops. The SFAM framework, meanwhile, achieves superior performance compared with state-of-the-art methods across suitable metrics.
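To make the metric concrete, here is a minimal sketch of how Cross-AUC could be computed from a detector's per-sample scores. The function name, toy score distributions, and the use of scikit-learn are illustrative assumptions, not code or numbers from the paper.

```python
# Minimal sketch of the Cross-AUC idea; `cross_auc` and the toy score
# distributions below are illustrative, not taken from the paper.
import numpy as np
from sklearn.metrics import roc_auc_score

def cross_auc(real_scores_a, fake_scores_b):
    """AUC with real samples drawn from dataset A and fake samples from dataset B."""
    y_true = np.concatenate([np.zeros(len(real_scores_a)),   # label 0 = real
                             np.ones(len(fake_scores_b))])   # label 1 = fake
    y_score = np.concatenate([real_scores_a, fake_scores_b])
    return roc_auc_score(y_true, y_score)

# Toy detector scores: each dataset is separable in isolation, but
# dataset B's scores sit higher overall (a domain-level score shift).
rng = np.random.default_rng(0)
real_a, fake_a = rng.normal(0.2, 0.1, 500), rng.normal(0.8, 0.1, 500)
real_b, fake_b = rng.normal(0.7, 0.1, 500), rng.normal(0.95, 0.05, 500)

print(cross_auc(real_a, fake_a))  # within-dataset AUC on A: ~1.0
print(cross_auc(real_b, fake_b))  # within-dataset AUC on B: ~0.99
print(cross_auc(real_b, fake_a))  # Cross-AUC (real B, fake A): drops sharply
```

In this toy setup both within-dataset AUCs look excellent, yet pairing real samples from B with fakes from A collapses the ranking; that score-comparability failure is exactly what Cross-AUC is designed to surface.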

What carries the argument

Cross-AUC, which evaluates by mixing real samples from one dataset with fake samples from another to test score comparability, together with the SFAM framework's patch-level image-text alignment module that improves CLIP sensitivity to manipulation artifacts and its facial region mixture-of-experts module that routes region features to specialized experts.

If this is right

  • Detectors must maintain comparable score distributions across domains, not merely high accuracy on isolated test sets.
  • Patch-level alignment with textual descriptions of artifacts can increase sensitivity to subtle face manipulations that global features miss (a sketch of one plausible formulation follows this list).
  • Routing features from distinct facial regions to separate experts enables region-specific forgery pattern recognition.
  • More reliable cross-domain evaluation supports deployment of detectors in environments where training and test data sources differ.
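
The patch-level alignment in the second bullet above can be made concrete with a small sketch. The paper's actual loss and prompts are not spelled out in this review, so the CLIP-style cross-entropy over artifact text descriptions, the `patch_text_alignment_loss` name, and the temperature `tau` are all assumptions.

```python
# Hedged sketch of patch-level image-text alignment in the spirit of SFAM;
# the loss form, prompts, and temperature are assumptions for illustration.
import torch
import torch.nn.functional as F

def patch_text_alignment_loss(patch_emb, text_emb, patch_labels, tau=0.07):
    """
    patch_emb:    (N, D) embeddings of image patches
    text_emb:     (K, D) embeddings of K artifact/pristine descriptions,
                  e.g. "blending boundary", "blurred eye region", "pristine skin"
    patch_labels: (N,) index in [0, K) of the description matching each patch
    """
    patch_emb = F.normalize(patch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = patch_emb @ text_emb.t() / tau       # (N, K) patch-text similarities
    return F.cross_entropy(logits, patch_labels)  # pull each patch toward its text

# Toy usage with random tensors standing in for CLIP patch/text features.
patches = torch.randn(196, 512)        # e.g. 14x14 ViT patch tokens
texts = torch.randn(3, 512)            # 3 candidate descriptions
labels = torch.randint(0, 3, (196,))   # per-patch supervision
loss = patch_text_alignment_loss(patches, texts, labels)
```

The point of the patch granularity is that a global image-text similarity can average away localized artifacts; a per-patch objective cannot.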

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Calibration methods that normalize scores across domains could be combined with SFAM to further reduce the observed drops.
  • Testing the Cross-AUC metric on forgeries generated by newer diffusion models would clarify whether the robustness issue is model-agnostic.
  • The mixture-of-experts design may generalize to other region-sensitive tasks such as expression analysis or attribute editing detection (a routing sketch follows this list).
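
For the region-routing idea in the last bullet, the following is a hedged sketch of what a facial-region mixture-of-experts might look like. The paper's FaRMoE module is not reproduced here: the fixed region list, hard region-to-expert assignment, and uniform fusion are simplifying assumptions (a learned gate could replace the fixed routing).

```python
# Hedged sketch of region-to-expert routing; region names, hard assignment,
# and mean fusion are assumptions, not the paper's FaRMoE design.
import torch
import torch.nn as nn

class RegionMoE(nn.Module):
    """Routes pooled features from predefined facial regions to dedicated experts."""
    def __init__(self, dim, regions=("eyes", "nose", "mouth", "boundary")):
        super().__init__()
        self.regions = regions
        self.experts = nn.ModuleDict({
            r: nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for r in regions
        })

    def forward(self, region_feats):
        # region_feats: dict mapping region name -> (B, D) pooled features
        outputs = [self.experts[r](region_feats[r]) for r in self.regions]
        return torch.stack(outputs, dim=1).mean(dim=1)  # (B, D) fused feature

moe = RegionMoE(dim=512)
feats = {r: torch.randn(8, 512) for r in ("eyes", "nose", "mouth", "boundary")}
fused = moe(feats)  # (8, 512), ready for a downstream real/fake classifier
```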

Load-bearing premise

That mixing real samples from one dataset with fake samples from another produces a metric whose drops directly reflect a detector's real-world cross-domain robustness rather than introducing new confounding biases from dataset-specific statistics.

What would settle it

If a detector maintains high accuracy on mixed real-world forgeries collected from varied sources yet still shows large drops under Cross-AUC, or if SFAM fails to outperform baselines on a new unseen dataset pair despite strong Cross-AUC results.

Figures

Figures reproduced from arXiv: 2604.21478 by Decheng Liu, Tao Chen, Yuhan Luo.

Figure 1: Prediction score distributions of authentic and forged samples across FF++, WDF, and Celeb-DF datasets.
Figure 2: The workflow of the proposed SFAM framework.
Figure 3: The structure of the vision encoder with the proposed FaRMoE module.
Figure 4: t-SNE visualization results of four methods.
Original abstract

Nowadays, visual data forgery detection plays an increasingly important role in social and economic security with the rapid development of generative models. Existing face forgery detectors still can't achieve satisfactory performance because of poor generalization ability across datasets. The key factor that led to this phenomenon is the lack of suitable metrics: the commonly used cross-dataset AUC metric fails to reveal an important issue where detection scores may shift significantly across data domains. To explicitly evaluate cross-domain score comparability, we propose Cross-AUC, an evaluation metric that can compute AUC across dataset pairs by contrasting real samples from one dataset with fake samples from another (and vice versa). It is interesting to find that evaluating representative detectors under the Cross-AUC metric reveals substantial performance drops, exposing an overlooked robustness problem. Besides, we also propose the novel framework Semantic Fine-grained Alignment and Mixture-of-Experts (SFAM), consisting of a patch-level image-text alignment module that enhances CLIP's sensitivity to manipulation artifacts, and the facial region mixture-of-experts module, which routes features from different facial regions to specialized experts for region-aware forgery analysis. Extensive qualitative and quantitative experiments on the public datasets prove that the proposed method achieves superior performance compared with the state-of-the-art methods with various suitable metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper argues that standard cross-dataset AUC fails to capture score shifts in face forgery detectors across domains. It introduces Cross-AUC, which computes AUC by pairing real samples from one dataset with fake samples from another (and vice versa), and reports substantial performance drops for representative detectors, indicating an overlooked robustness issue. To mitigate this, the authors propose the SFAM framework comprising a patch-level image-text alignment module (to improve CLIP sensitivity to artifacts) and a facial region mixture-of-experts module (for region-aware analysis), claiming superior results over SOTA methods on public datasets under multiple metrics.

Significance. If Cross-AUC can be shown to isolate forgery-specific robustness rather than real-image domain statistics and if SFAM's gains prove reproducible with proper controls, the work would usefully highlight evaluation gaps in cross-domain face forgery detection and offer a concrete architectural direction for region-aware and semantically aligned models.

major comments (2)
  1. [Abstract / Cross-AUC definition] Pairing real samples from dataset A with fakes from B (and vice versa) without any normalization or matching of real-image statistics (resolution, compression, head-pose distribution, demographics) risks attributing non-forgery domain shifts to forgery robustness failures. No ablation that equalizes real-sample distributions before cross-pairing is described, so the headline claim that observed AUC drops expose an 'overlooked robustness problem' rests on an untested assumption.
  2. [Experimental results] The abstract asserts superiority 'with various suitable metrics' yet supplies no information on baseline selection criteria, statistical significance tests, variance across runs, or data filtering steps. This absence leaves the quantitative outperformance claims, which are load-bearing for the superiority conclusion, unverifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. These have prompted us to clarify the motivation and assumptions behind Cross-AUC and to strengthen the experimental reporting. We address each major comment below and indicate the corresponding revisions.

Point-by-point responses
  1. Referee: [Abstract / Cross-AUC definition] Pairing real samples from dataset A with fakes from B (and vice versa) without any normalization or matching of real-image statistics (resolution, compression, head-pose distribution, demographics) risks attributing non-forgery domain shifts to forgery robustness failures. No ablation that equalizes real-sample distributions before cross-pairing is described, so the headline claim that observed AUC drops expose an 'overlooked robustness problem' rests on an untested assumption.

    Authors: We appreciate the referee's identification of a potential confound. Cross-AUC is intended to simulate a realistic deployment scenario in which a detector must classify faces whose real and fake components originate from different data sources. Nevertheless, we agree that unaccounted differences in real-image statistics could inflate the observed drops. In the revised manuscript we have added a controlled ablation that matches real samples across datasets on resolution, compression level, and head-pose distribution (using available metadata and nearest-neighbor subsampling; a sketch of this matching step follows these responses). Under these matched conditions the Cross-AUC performance drops remain substantial, indicating that the robustness issue is not solely attributable to real-image domain statistics. We have also inserted a dedicated limitations paragraph that explicitly discusses the remaining unmatched factors (e.g., demographic distributions) and their possible influence. Full demographic equalization would require new annotations outside the scope of the current datasets; we therefore treat this as a limitation rather than a solved issue. revision: partial

  2. Referee: [Experimental results] The abstract asserts superiority 'with various suitable metrics' yet supplies no information on baseline selection criteria, statistical significance tests, variance across runs, or data filtering steps. This absence leaves the quantitative outperformance claims, which are load-bearing for the superiority conclusion, unverifiable.

    Authors: We fully agree that the experimental section must supply sufficient detail for independent verification. The revised manuscript now includes an expanded experimental protocol subsection that specifies: (i) baseline selection criteria (representative recent methods spanning CNN, transformer, and CLIP-based families, chosen for recency and public availability); (ii) statistical significance testing (paired t-tests with reported p-values); (iii) run-to-run variance (all tables report mean ± standard deviation over five independent runs with distinct random seeds); and (iv) data filtering steps (official splits retained, with additional face-detection confidence threshold of 0.9 and exclusion of images below 128×128 pixels). These additions render the superiority claims transparent and reproducible. revision: yes
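
The nearest-neighbor subsampling in response 1 might look like the sketch below; the metadata columns, scaling choice, and one-to-one matching are assumptions, and since the rebuttal is simulated this is illustration rather than the authors' code.

```python
# Hedged sketch of metadata-matched subsampling of real samples; the
# metadata columns and matching scheme are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def match_real_samples(meta_a, meta_b):
    """For each real sample in dataset A, find the closest real sample in
    dataset B by metadata (e.g. resolution, compression level, yaw, pitch)."""
    scaler = StandardScaler().fit(np.vstack([meta_a, meta_b]))
    nn_index = NearestNeighbors(n_neighbors=1).fit(scaler.transform(meta_b))
    _, idx = nn_index.kneighbors(scaler.transform(meta_a))
    return idx.ravel()  # indices into B forming the matched real subset

meta_a = np.random.rand(1000, 4)  # toy metadata for dataset A reals
meta_b = np.random.rand(5000, 4)  # toy metadata for dataset B reals
matched_b = match_real_samples(meta_a, meta_b)
```

Cross-AUC recomputed on the matched subset then isolates forgery-driven drops from real-image domain statistics, up to whatever factors the metadata does not capture.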
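Likewise, the significance protocol in response 2 reduces to a paired t-test over per-seed metrics; the run numbers below are placeholders, not results from the paper.

```python
# Hedged sketch of the reporting protocol (mean ± std over seeds plus a
# paired t-test); the five per-seed AUCs are placeholder values.
import numpy as np
from scipy import stats

sfam_auc = np.array([0.912, 0.908, 0.915, 0.910, 0.913])      # 5 seeds (toy)
baseline_auc = np.array([0.897, 0.901, 0.895, 0.899, 0.896])  # same seeds (toy)

t_stat, p_value = stats.ttest_rel(sfam_auc, baseline_auc)     # paired test
print(f"SFAM:     {sfam_auc.mean():.3f} ± {sfam_auc.std(ddof=1):.3f}")
print(f"baseline: {baseline_auc.mean():.3f} ± {baseline_auc.std(ddof=1):.3f}")
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```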

Circularity Check

0 steps flagged

No circularity: new metric and modules are independently defined and empirically evaluated

Full rationale

The paper defines Cross-AUC explicitly as AUC computed on cross-paired real/fake samples from different datasets, without reducing it to any prior fitted quantity or self-citation. SFAM is introduced as two novel modules (patch-level CLIP alignment and region-specific MoE) whose design is not derived from the metric by construction or from author prior work. Performance superiority is asserted via direct experiments on public datasets under multiple metrics, with no equations or claims that collapse the improvements to tautological redefinitions of inputs. The derivation chain is self-contained and does not invoke load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The work relies on standard deep-learning assumptions about CLIP's transferability and the utility of mixture-of-experts routing; no explicit free parameters are declared in the abstract, and the two invented entities below are architectural modules rather than physical posits.

axioms (2)
  • Domain assumption: CLIP embeddings can be fine-tuned at patch level to become sensitive to local manipulation artifacts.
    Invoked to justify the semantic fine-grained alignment module.
  • Domain assumption: Features from different facial regions benefit from routing to specialized expert networks.
    Basis for the mixture-of-experts component.
invented entities (2)
  • Patch-level image-text alignment module (no independent evidence)
    Purpose: enhance CLIP sensitivity to manipulation artifacts. New component introduced to address the generalization issue.
  • Facial region mixture-of-experts module (no independent evidence)
    Purpose: enable region-aware forgery analysis by routing to specialized experts. New architectural block for handling facial sub-regions.

pith-pipeline@v0.9.0 · 5543 in / 1627 out tokens · 48312 ms · 2026-05-09T21:43:01.083907+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 11 canonical work pages · 3 internal anchors

  1. Cao, J., Ma, C., Yao, T., Chen, S., Ding, S., Yang, X.: End-to-end reconstruction-classification learning for face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4113–4122 (June 2022)

  2. Chen, L., Zhang, Y., Song, Y., Liu, L., Wang, J.: Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18689–18698 (2022). https://doi.org/10.1109/CVPR52688.2022.01815

  3. Chen, S., Yao, T., Chen, Y., Ding, S., Li, J., Ji, R.: Local relation learning for face forgery detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1081–1088 (2021). https://doi.org/10.1609/aaai.v35i2.16193

  4. Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)

  5. Cui, X., Li, Y., Luo, A., Zhou, J., Dong, J.: Forensics adapter: Adapting CLIP for generalizable face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19207–19217 (June 2025)

  6. Dang, H., Liu, F., Stehouwer, J., Liu, X., Jain, A.K.: On the detection of digital face manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

  7. Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., Ferrer, C.C.: The DeepFake Detection Challenge (DFDC) dataset (2020). https://arxiv.org/abs/2006.07397

  8. Dolhansky, B., Howes, R., Pflaum, B., Baram, N., Ferrer, C.C.: The DeepFake Detection Challenge (DFDC) preview dataset (2019). https://arxiv.org/abs/1910.08854

  9. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. CoRR abs/2010.11929 (2020). https://arxiv.org/abs/2010.11929

  10. Fan, L., Krishnan, D., Isola, P., Katabi, D., Tian, Y.: Improving CLIP training with language rewrites. In: Advances in Neural Information Processing Systems, vol. 36, pp. 35544–35575. Curran Associates, Inc. (2023)

  11. van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(86), 2579–2605 (2008)

  12. Li, Y., Chang, M.C., Lyu, S.: In ictu oculi: Exposing AI created fake videos by detecting eye blinking. In: 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7 (2018). https://doi.org/10.1109/WIFS.2018.8630787

  13. Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S.: Celeb-DF: A large-scale challenging dataset for deepfake forensics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)

  14. Lin, Y., Song, W., Li, B., Li, Y., Ni, J., Chen, H., Li, Q.: Fake it till you make it: Curricular dynamic forgery augmentations towards general deepfake detection. In: Computer Vision – ECCV 2024, pp. 104–. Springer Nature Switzerland, Cham (2025)

  15. Luo, Y., Zhang, Y., Yan, J., Liu, W.: Generalizing face forgery detection with high-frequency features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16317–16326 (June 2021)

  16. Li, Y., Lyu, S.: Exposing DeepFake videos by detecting face warping artifacts. arXiv: Computer Vision and Pattern Recognition (Nov 2018)

  17. Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J.: Thinking in frequency: Face forgery detection by mining frequency-aware clues. In: Computer Vision – ECCV 2020, pp. 86–103. Springer International Publishing, Cham (2020)

  18. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139. PMLR (2021)

  19. Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Niessner, M.: FaceForensics++: Learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)

  20. Shiohara, K., Yamasaki, T.: Detecting deepfakes with self-blended images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18720–18729 (June 2022)

  21. Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: EVA-CLIP: Improved training techniques for CLIP at scale (2023). https://arxiv.org/abs/2303.15389

  22. Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 6105–6114. PMLR (2019). https://proceedings.mlr.press/v97/tan19a.html

  23. Thies, J., Zollhöfer, M., Nießner, M.: Deferred neural rendering: Image synthesis using neural textures. ACM Trans. Graph. 38(4) (Jul 2019). https://doi.org/10.1145/3306346.3323035

  24. Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C., Nießner, M.: Face2Face: Real-time face capture and reenactment of RGB videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)

  25. Yan, Z., Wang, J., Jin, P., Zhang, K.Y., Liu, C., Chen, S., Yao, T., Ding, S., Wu, B., Yuan, L.: Orthogonal subspace decomposition for generalizable AI-generated image detection (2025). https://arxiv.org/abs/2411.15633

  26. Yan, Z., Zhang, Y., Fan, Y., Wu, B.: UCF: Uncovering common features for generalizable deepfake detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 22412–22423 (October 2023)

  27. Zhang, B., Zhang, P., Dong, X., Zang, Y., Wang, J.: Long-CLIP: Unlocking the long-text capability of CLIP. In: Computer Vision – ECCV 2024, pp. 310–325. Springer Nature Switzerland, Cham (2025)

  28. Zhang, Y., Wang, T., Yu, Z., Gao, Z., Shen, L., Chen, S.: MFCLIP: Multi-modal fine-grained CLIP for generalizable diffusion face forgery detection. IEEE Transactions on Information Forensics and Security 20, 5888–5903 (2025). https://doi.org/10.1109/TIFS.2025.3576577

  29. Zhang, Y., Colman, B., Guo, X., Shahriyari, A., Bharaj, G.: Common sense reasoning for deepfake detection. In: Computer Vision – ECCV 2024, pp. 399–415. Springer Nature Switzerland, Cham (2025)

  30. Zhao, H., Zhou, W., Chen, D., Wei, T., Zhang, W., Yu, N.: Multi-attentional deepfake detection. arXiv: Computer Vision and Pattern Recognition (Mar 2021)

  31. Zhou, P., Han, X., Morariu, V.I., Davis, L.S.: Two-stream neural networks for tampered face detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1831–1839 (2017). https://doi.org/10.1109/CVPRW.2017.229