Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts
Pith reviewed 2026-05-09 21:43 UTC · model grok-4.3
The pith
The Cross-AUC metric reveals that face forgery detectors suffer large performance drops when real and fake samples come from different datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The commonly used cross-dataset AUC metric fails to reveal an important issue: detection scores can shift significantly across data domains. The proposed Cross-AUC metric computes AUC across dataset pairs by contrasting real samples from one dataset with fake samples from another, and it exposes substantial performance drops in representative detectors. The proposed SFAM framework, in turn, outperforms state-of-the-art methods across suitable metrics.
What carries the argument
Cross-AUC, which tests score comparability by mixing real samples from one dataset with fake samples from another, together with the two SFAM modules: a patch-level image-text alignment module that sharpens CLIP's sensitivity to manipulation artifacts, and a facial region mixture-of-experts module that routes region features to specialized experts.
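As described, Cross-AUC simply scores real samples from one dataset against fake samples from another. A minimal numpy-only sketch of that computation; the synthetic Gaussian score distributions and variable names are illustrative assumptions, not numbers from the paper:

```python
import numpy as np

def cross_auc(real_scores, fake_scores):
    """Exact AUC via the Mann-Whitney form: the probability that a
    randomly drawn fake outscores a randomly drawn real (ties count 0.5)."""
    real = np.asarray(real_scores)[:, None]
    fake = np.asarray(fake_scores)[None, :]
    return float(np.mean((fake > real) + 0.5 * (fake == real)))

# A hypothetical detector whose scores shift between domains: separation
# is fine within each dataset, but cross-pairing exposes the shift.
rng = np.random.default_rng(0)
real_a = rng.normal(0.20, 0.05, 500)   # dataset A reals score low
fake_a = rng.normal(0.55, 0.05, 500)   # dataset A fakes
real_b = rng.normal(0.60, 0.05, 500)   # dataset B reals (shifted upward)
fake_b = rng.normal(0.90, 0.05, 500)   # dataset B fakes

within_a = cross_auc(real_a, fake_a)   # near 1.0: within-domain is fine
within_b = cross_auc(real_b, fake_b)   # near 1.0
cross_ba = cross_auc(real_b, fake_a)   # collapses: B reals outscore A fakes
```

The pairwise form avoids any metric-library dependency; with a real detector the scores would come from model outputs rather than simulated Gaussians.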
If this is right
- Detectors must maintain comparable score distributions across domains, not merely high accuracy on isolated test sets.
- Patch-level alignment with textual descriptions of artifacts can increase sensitivity to subtle face manipulations that global features miss.
- Routing features from distinct facial regions to separate experts enables region-specific forgery pattern recognition.
- More reliable cross-domain evaluation supports deployment of detectors in environments where training and test data sources differ.
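The patch-level alignment idea in the second bullet can be caricatured as a per-patch similarity map against textual artifact descriptions. This is a hedged toy sketch under assumed shapes (a 14x14 CLIP-like patch grid, four hypothetical artifact prompts, random embeddings standing in for encoder outputs), not the paper's actual module:

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def patch_text_similarity(patch_emb, text_emb):
    """Cosine similarity between every image patch and every text prompt.

    patch_emb: (num_patches, dim) patch embeddings from a vision encoder.
    text_emb:  (num_prompts, dim) embeddings of artifact descriptions,
               e.g. "blurred blending boundary", "inconsistent skin texture".
    Returns a (num_patches, num_prompts) similarity map, so supervision can
    act on individual patches instead of one global image embedding.
    """
    return l2_normalize(patch_emb) @ l2_normalize(text_emb).T

rng = np.random.default_rng(1)
patches = rng.normal(size=(196, 512))   # 14x14 grid of CLIP-like patches
prompts = rng.normal(size=(4, 512))     # four hypothetical artifact prompts
sim = patch_text_similarity(patches, prompts)
best_patch = sim.argmax(axis=0)         # patch best matching each prompt
```

A training objective would then push fake-region patches toward their matching artifact text, which is the mechanism by which local cues that a global feature averages away could stay visible.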
Where Pith is reading between the lines
- Calibration methods that normalize scores across domains could be combined with SFAM to further reduce the observed drops.
- Testing the Cross-AUC metric on forgeries generated by newer diffusion models would clarify whether the robustness issue is model-agnostic.
- The mixture-of-experts design may generalize to other region-sensitive tasks such as expression analysis or attribute editing detection.
Load-bearing premise
That mixing real samples from one dataset with fake samples from another produces a metric whose drops directly reflect a detector's real-world cross-domain robustness rather than introducing new confounding biases from dataset-specific statistics.
What would settle it
If a detector maintains high accuracy on mixed real-world forgeries collected from varied sources yet still shows large drops under Cross-AUC, or if SFAM fails to outperform baselines on a new unseen dataset pair despite strong Cross-AUC results.
Original abstract
Nowadays, visual data forgery detection plays an increasingly important role in social and economic security with the rapid development of generative models. Existing face forgery detectors still can't achieve satisfactory performance because of poor generalization ability across datasets. The key factor that led to this phenomenon is the lack of suitable metrics: the commonly used cross-dataset AUC metric fails to reveal an important issue where detection scores may shift significantly across data domains. To explicitly evaluate cross-domain score comparability, we propose Cross-AUC, an evaluation metric that can compute AUC across dataset pairs by contrasting real samples from one dataset with fake samples from another (and vice versa). It is interesting to find that evaluating representative detectors under the Cross-AUC metric reveals substantial performance drops, exposing an overlooked robustness problem. Besides, we also propose the novel framework Semantic Fine-grained Alignment and Mixture-of-Experts (SFAM), consisting of a patch-level image-text alignment module that enhances CLIP's sensitivity to manipulation artifacts, and the facial region mixture-of-experts module, which routes features from different facial regions to specialized experts for region-aware forgery analysis. Extensive qualitative and quantitative experiments on the public datasets prove that the proposed method achieves superior performance compared with the state-of-the-art methods with various suitable metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that standard cross-dataset AUC fails to capture score shifts in face forgery detectors across domains. It introduces Cross-AUC, which computes AUC by pairing real samples from one dataset with fake samples from another (and vice versa), and reports substantial performance drops for representative detectors, indicating an overlooked robustness issue. To mitigate this, the authors propose the SFAM framework comprising a patch-level image-text alignment module (to improve CLIP sensitivity to artifacts) and a facial region mixture-of-experts module (for region-aware analysis), claiming superior results over SOTA methods on public datasets under multiple metrics.
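The facial region mixture-of-experts module summarized above routes per-region features through a gate to specialized experts. A minimal sketch under illustrative assumptions; the region list, linear experts, and softmax gate are all invented for exposition, not taken from the paper:

```python
import numpy as np

REGIONS = ["eyes", "nose", "mouth", "cheeks"]  # assumed region partition

class RegionMoE:
    """Toy mixture-of-experts: each region feature is transformed by a
    gate-weighted combination of shared linear experts."""

    def __init__(self, dim, num_experts, seed=0):
        rng = np.random.default_rng(seed)
        # Linear "experts"; in a trained model these would be learned.
        self.experts = [rng.normal(size=(dim, dim)) / np.sqrt(dim)
                        for _ in range(num_experts)]
        self.gate = rng.normal(size=(dim, num_experts)) / np.sqrt(dim)

    def forward(self, region_feats):
        """region_feats: dict of region name -> (dim,) feature vector."""
        out = {}
        for name, x in region_feats.items():
            logits = x @ self.gate
            w = np.exp(logits - logits.max())
            w /= w.sum()                       # softmax over experts
            out[name] = sum(wi * (x @ E) for wi, E in zip(w, self.experts))
        return out

rng = np.random.default_rng(1)
feats = {name: rng.normal(size=64) for name in REGIONS}
moe = RegionMoE(dim=64, num_experts=3)
fused = moe.forward(feats)                     # one output vector per region
```

Because the gate is driven by the feature itself, different regions can settle on different expert mixtures, which is the mechanism behind the "region-specific forgery pattern" claim.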
Significance. If Cross-AUC can be shown to isolate forgery-specific robustness rather than real-image domain statistics and if SFAM's gains prove reproducible with proper controls, the work would usefully highlight evaluation gaps in cross-domain face forgery detection and offer a concrete architectural direction for region-aware and semantically aligned models.
major comments (2)
- [Abstract / Cross-AUC] Pairing real samples from dataset A with fakes from B (and vice versa) without any normalization or matching of real-image statistics (resolution, compression, head-pose distribution, demographics) risks attributing non-forgery domain shifts to forgery robustness failures. No ablation that equalizes real-sample distributions before cross-pairing is described, so the headline claim that the observed AUC drops expose an 'overlooked robustness problem' rests on an untested assumption.
- [Experimental results] The abstract asserts superiority 'with various suitable metrics' yet supplies no information on baseline selection criteria, statistical significance tests, variance across runs, or data filtering steps. This absence makes the quantitative outperformance claims unverifiable, and those claims are load-bearing for the superiority conclusion.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. These have prompted us to clarify the motivation and assumptions behind Cross-AUC and to strengthen the experimental reporting. We address each major comment below and indicate the corresponding revisions.
Point-by-point responses
-
Referee: [Abstract / Cross-AUC] Pairing real samples from dataset A with fakes from B (and vice versa) without any normalization or matching of real-image statistics (resolution, compression, head-pose distribution, demographics) risks attributing non-forgery domain shifts to forgery robustness failures. No ablation that equalizes real-sample distributions before cross-pairing is described, so the headline claim that the observed AUC drops expose an 'overlooked robustness problem' rests on an untested assumption.
Authors: We appreciate the referee's identification of a potential confound. Cross-AUC is intended to simulate a realistic deployment scenario in which a detector must classify faces whose real and fake components originate from different data sources. Nevertheless, we agree that unaccounted differences in real-image statistics could inflate the observed drops. In the revised manuscript we have added a controlled ablation that matches real samples across datasets on resolution, compression level, and head-pose distribution (using available metadata and nearest-neighbor subsampling). Under these matched conditions the Cross-AUC performance drops remain substantial, indicating that the robustness issue is not solely attributable to real-image domain statistics. We have also inserted a dedicated limitations paragraph that explicitly discusses the remaining unmatched factors (e.g., demographic distributions) and their possible influence. Full demographic equalization would require new annotations outside the scope of the current datasets; we therefore treat this as a limitation rather than a solved issue. revision: partial
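The matching ablation described here (nearest-neighbor subsampling on available metadata) could look roughly like the following sketch. The metadata features and the greedy one-nearest-neighbor matching without replacement are assumptions for illustration, not the authors' exact procedure:

```python
import numpy as np

def match_real_samples(meta_a, meta_b):
    """Greedy nearest-neighbor matching without replacement.

    meta_a, meta_b: (n, k) arrays of per-sample metadata, one row per real
    image (e.g. resolution, compression level, head-pose angle).
    Returns indices into meta_b selecting one match per row of meta_a, so the
    two real sets have comparable metadata distributions before cross-pairing.
    """
    # z-score each metadata dimension on the pooled data so that no single
    # unit (pixels vs. degrees) dominates the distance
    pooled = np.vstack([meta_a, meta_b])
    mu, sd = pooled.mean(0), pooled.std(0) + 1e-8
    a = (meta_a - mu) / sd
    b = (meta_b - mu) / sd

    available = np.ones(len(b), dtype=bool)
    matches = []
    for row in a:
        d = np.linalg.norm(b - row, axis=1)
        d[~available] = np.inf          # each B sample used at most once
        j = int(d.argmin())
        available[j] = False
        matches.append(j)
    return np.array(matches)

rng = np.random.default_rng(3)
meta_a = rng.normal(size=(50, 3))       # hypothetical metadata for A reals
meta_b = rng.normal(size=(200, 3))      # larger pool of B reals to draw from
idx = match_real_samples(meta_a, meta_b)  # matched subset of B
```

If Cross-AUC drops persist on the matched subsets, the confound the referee raises is at least partially ruled out; optimal (Hungarian) matching or propensity scores would be stricter alternatives to this greedy pass.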
-
Referee: [Experimental results] The abstract asserts superiority 'with various suitable metrics' yet supplies no information on baseline selection criteria, statistical significance tests, variance across runs, or data filtering steps. This absence makes the quantitative outperformance claims unverifiable, and those claims are load-bearing for the superiority conclusion.
Authors: We fully agree that the experimental section must supply sufficient detail for independent verification. The revised manuscript now includes an expanded experimental protocol subsection that specifies: (i) baseline selection criteria (representative recent methods spanning CNN, transformer, and CLIP-based families, chosen for recency and public availability); (ii) statistical significance testing (paired t-tests with reported p-values); (iii) run-to-run variance (all tables report mean ± standard deviation over five independent runs with distinct random seeds); and (iv) data filtering steps (official splits retained, with additional face-detection confidence threshold of 0.9 and exclusion of images below 128×128 pixels). These additions render the superiority claims transparent and reproducible. revision: yes
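The reporting protocol in (ii) and (iii) amounts to mean ± standard deviation over five seeded runs plus a paired t-test between methods. A hedged numpy-only sketch with made-up AUC values; a p-value would come from the Student-t CDF (e.g. scipy.stats.ttest_rel), which is omitted here to keep the example dependency-free:

```python
import numpy as np

def mean_std(runs):
    """Sample mean and standard deviation (ddof=1) over per-seed scores."""
    runs = np.asarray(runs, dtype=float)
    return runs.mean(), runs.std(ddof=1)

def paired_t_statistic(runs_a, runs_b):
    """Paired t statistic over per-seed score differences (same seeds
    for both methods, so each difference is a matched pair)."""
    d = np.asarray(runs_a, dtype=float) - np.asarray(runs_b, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

# Five seeds per method; these AUC values are illustrative, not from the paper
ours     = [0.912, 0.908, 0.915, 0.910, 0.913]
baseline = [0.874, 0.869, 0.880, 0.872, 0.877]

m, s = mean_std(ours)                 # reported as "mean ± std" in tables
t = paired_t_statistic(ours, baseline)  # large |t| over 4 dof => significant
```

Pairing on seeds rather than running an unpaired test removes seed-to-seed variance from the comparison, which is why the protocol fixes the same five seeds for every method.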
Circularity Check
No circularity: new metric and modules are independently defined and empirically evaluated
full rationale
The paper defines Cross-AUC explicitly as AUC computed on cross-paired real/fake samples from different datasets, without reducing it to any prior fitted quantity or self-citation. SFAM is introduced as two novel modules (patch-level CLIP alignment and region-specific MoE) whose design is not derived from the metric by construction or from author prior work. Performance superiority is asserted via direct experiments on public datasets under multiple metrics, with no equations or claims that collapse the improvements to tautological redefinitions of inputs. The derivation chain is self-contained and does not invoke load-bearing self-citations or ansatz smuggling.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption CLIP embeddings can be fine-tuned at patch level to become sensitive to local manipulation artifacts
- domain assumption Features from different facial regions benefit from routing to specialized expert networks
invented entities (2)
- Patch-level image-text alignment module (no independent evidence)
- Facial region mixture-of-experts module (no independent evidence)
Reference graph
Works this paper leans on
- [1] Cao, J., Ma, C., Yao, T., Chen, S., Ding, S., Yang, X.: End-to-end reconstruction-classification learning for face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4113–4122 (June 2022)
- [2] Chen, L., Zhang, Y., Song, Y., Liu, L., Wang, J.: Self-supervised learning of adversarial example: Towards good generalizations for deepfake detection. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18689–18698 (2022). https://doi.org/10.1109/CVPR52688.2022.01815
- [3] Chen, S., Yao, T., Chen, Y., Ding, S., Li, J., Ji, R.: Local relation learning for face forgery detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 1081–1088 (2021). https://doi.org/10.1609/aaai.v35i2.16193
- [4] Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
- [5] Cui, X., Li, Y., Luo, A., Zhou, J., Dong, J.: Forensics adapter: Adapting CLIP for generalizable face forgery detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 19207–19217 (June 2025)
- [6] Dang, H., Liu, F., Stehouwer, J., Liu, X., Jain, A.K.: On the detection of digital face manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)
- [7] Dolhansky, B., Bitton, J., Pflaum, B., Lu, J., Howes, R., Wang, M., Ferrer, C.C.: The deepfake detection challenge (DFDC) dataset (2020). https://arxiv.org/abs/2006.07397
- [8]
- [9] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. CoRR abs/2010.11929 (2020). https://arxiv.org/abs/2010.11929
- [10] Fan, L., Krishnan, D., Isola, P., Katabi, D., Tian, Y.: Improving CLIP training with language rewrites. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems, vol. 36, pp. 35544–35575. Curran Associates, Inc. (2023)
- [11] Hinton, G., Maaten, L.v.d.: Visualizing Data using t-SNE. Journal of Machine Learning Research 9(86), 2579–2605 (Jan 2008)
- [12] Li, Y., Chang, M.C., Lyu, S.: In ictu oculi: Exposing AI created fake videos by detecting eye blinking. In: 2018 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–7 (2018). https://doi.org/10.1109/WIFS.2018.8630787
- [13] Li, Y., Yang, X., Sun, P., Qi, H., Lyu, S.: Celeb-DF: A large-scale challenging dataset for deepfake forensics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2020)
- [14] Lin, Y., Song, W., Li, B., Li, Y., Ni, J., Chen, H., Li, Q.: Fake it till you make it: Curricular dynamic forgery augmentations towards general deepfake detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024, pp. 104– (2024)
- [15] Springer Nature Switzerland, Cham (2025)
- [16] Luo, Y., Zhang, Y., Yan, J., Liu, W.: Generalizing face forgery detection with high-frequency features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16317–16326 (June 2021)
- [17] Lyu, S., Li, Y.: Exposing DeepFake Videos By Detecting Face Warping Artifacts. arXiv: Computer Vision and Pattern Recognition (Nov 2018)
- [18] Qian, Y., Yin, G., Sheng, L., Chen, Z., Shao, J.: Thinking in frequency: Face forgery detection by mining frequency-aware clues. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020, pp. 86–103. Springer International Publishing, Cham (2020)
- [19] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Meila, M., Zhang, T. (eds.) Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research (2021)
- [20] Rossler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., Niessner, M.: FaceForensics++: Learning to detect manipulated facial images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (October 2019)
- [21] Shiohara, K., Yamasaki, T.: Detecting deepfakes with self-blended images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18720–18729 (June 2022)
- [22] Sun, Q., Fang, Y., Wu, L., Wang, X., Cao, Y.: EVA-CLIP: Improved training techniques for CLIP at scale (2023). https://arxiv.org/abs/2303.15389
- [23] Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 97, pp. 6105–6114. PMLR (Jun 2019). https://proceedings.mlr.press/v97/tan19a.html
- [24] Thies, J., Zollhöfer, M., Nießner, M.: Deferred neural rendering: Image synthesis using neural textures. ACM Trans. Graph. 38(4) (Jul 2019). https://doi.org/10.1145/3306346.3323035
- [25] Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., Niessner, M.: Face2Face: Real-time face capture and reenactment of RGB videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016)
- [26]
- [27] Yan, Z., Zhang, Y., Fan, Y., Wu, B.: UCF: Uncovering common features for generalizable deepfake detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 22412–22423 (October 2023)
- [28] Zhang, B., Zhang, P., Dong, X., Zang, Y., Wang, J.: Long-CLIP: Unlocking the long-text capability of CLIP. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024, pp. 310–325. Springer Nature Switzerland, Cham (2025)
- [29] Zhang, Y., Wang, T., Yu, Z., Gao, Z., Shen, L., Chen, S.: MFCLIP: Multi-modal fine-grained CLIP for generalizable diffusion face forgery detection. IEEE Transactions on Information Forensics and Security 20, 5888–5903 (2025). https://doi.org/10.1109/TIFS.2025.3576577
- [30] Zhang, Y., Colman, B., Guo, X., Shahriyari, A., Bharaj, G.: Common sense reasoning for deepfake detection. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV 2024, pp. 399–415. Springer Nature Switzerland, Cham (2025)
- [31] Zhao, H., Zhou, W., Chen, D., Wei, T., Zhang, W., Yu, N.: Multi-attentional Deepfake Detection. arXiv: Computer Vision and Pattern Recognition (Mar 2021)
- [32] Zhou, P., Han, X., Morariu, V.I., Davis, L.S.: Two-stream neural networks for tampered face detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1831–1839 (2017). https://doi.org/10.1109/CVPRW.2017.229