
arxiv: 2605.13108 · v1 · pith:2UX53KM2new · submitted 2026-05-13 · 💻 cs.CV

Flow Augmentation and Knowledge Distillation for Lightweight Face Presentation Attack Detection

Pith reviewed 2026-05-14 19:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords: face presentation attack detection, knowledge distillation, optical flow, lightweight model, motion cues, teacher-student, real-time inference, spoof detection

The pith

Lightweight RGB-only student learns motion cues via distillation from a flow-augmented teacher, matching detection accuracy without computing optical flow at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that optical flow can augment training for face presentation attack detection by supplying explicit motion information to a dual-branch teacher model. Logit distillation then transfers this motion sensitivity to a simpler RGB-only student, so the student implicitly captures temporal consistency and micro-motions using only appearance cues. The resulting student matches or exceeds the teacher on standard benchmarks while cutting parameters and FLOPs enough to reach 52 FPS on an edge device. A reader would care because the method removes the usual speed-accuracy trade-off for real-time security applications on constrained hardware.

Core claim

A dual-branch teacher that fuses RGB frames with colorwheel-encoded optical flow produces motion-aware representations that can be transferred through logit distillation to an RGB-only student; the student thereby learns to detect presentation attacks by implicitly modeling motion and temporal consistency without any flow computation or extra blocks at inference time.
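The colorwheel encoding of optical flow mentioned above can be illustrated with a minimal sketch. The summary does not specify the paper's exact encoding, so this example assumes a simple HSV mapping (direction to hue, normalized magnitude to saturation, zero motion to white); the function names `flow_to_rgb` and `encode_flow_field` are hypothetical and not taken from the paper.

```python
import colorsys
import math


def flow_to_rgb(u, v, max_mag):
    """Map one flow vector (u, v) to an 8-bit RGB triple.

    Assumed convention (not from the paper): hue encodes direction,
    saturation encodes magnitude relative to max_mag, value is fixed at 1.
    """
    mag = math.hypot(u, v)
    angle = math.atan2(v, u)  # radians in [-pi, pi]
    hue = ((angle + math.pi) / (2 * math.pi)) % 1.0
    sat = min(mag / max_mag, 1.0) if max_mag > 0 else 0.0
    r, g, b = colorsys.hsv_to_rgb(hue, sat, 1.0)
    return (round(r * 255), round(g * 255), round(b * 255))


def encode_flow_field(flow):
    """Encode a flow field given as rows of (u, v) tuples into rows of RGB triples."""
    max_mag = max(
        (math.hypot(u, v) for row in flow for (u, v) in row),
        default=0.0,
    )
    return [[flow_to_rgb(u, v, max_mag) for (u, v) in row] for row in flow]


if __name__ == "__main__":
    # A still pixel next to a pixel moving one unit to the right.
    print(encode_flow_field([[(0.0, 0.0), (1.0, 0.0)]]))
```

An image built this way has the same three-channel shape as an RGB frame, which is one plausible reason such an encoding suits a second convolutional branch in the teacher.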

What carries the argument

Logit distillation that transfers motion-aware knowledge from the flow-augmented dual-branch teacher to the lightweight RGB-only student.
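As a rough illustration of logit distillation, the sketch below combines a temperature-scaled KL term between teacher and student outputs with a cross-entropy term on the ground-truth label. This is the common Hinton-style formulation and is an assumption here: the summary does not give the paper's exact loss, and the `temperature` and `alpha` defaults are hypothetical values chosen only for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target KL divergence.

    temperature and alpha are illustrative defaults, not values from the paper.
    """
    # Soft term: KL(teacher || student) on temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    # Hard term: cross-entropy of the student against the true class index.
    ce = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft-term gradient scale comparable across temperatures.
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl


if __name__ == "__main__":
    # Two classes (bona fide, attack); the teacher is confident, the student is not.
    print(kd_loss([0.0, 0.0], [2.0, -2.0], label=0))
```

In this setup the teacher would see RGB plus flow while the student sees RGB only, so the soft targets are the only channel through which motion information could reach the student.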

If this is right

  • The student reaches 0.0 percent HTER on both Replay-Attack and Replay-Mobile datasets.
  • It records 0.94 percent HTER on ROSE-Youtu and 0.42 percent ACER on OULU-NPU.
  • Parameter and FLOP counts drop sharply compared with the teacher while inference speed reaches 52 FPS on an NVIDIA Jetson Orin Nano.
  • The same RGB-only architecture handles 2D print, replay, 3D mask, makeup, and occlusion attacks under varied capture conditions.
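For readers unfamiliar with the metrics quoted above, the following sketch shows how they are conventionally computed: HTER is the mean of the false acceptance and false rejection rates, and ACER is the mean of APCER (taken as the worst case over attack types) and BPCER. The helper names, the score convention (higher means more likely bona fide), and the sample numbers are assumptions for illustration and are not results from the paper.

```python
def far_frr(scores, labels, threshold):
    """Return (FAR, FRR) for liveness scores.

    labels: 1 for bona fide, 0 for attack.
    A sample is accepted as bona fide when score >= threshold.
    """
    attacks = [s for s, y in zip(scores, labels) if y == 0]
    bona_fide = [s for s, y in zip(scores, labels) if y == 1]
    far = sum(s >= threshold for s in attacks) / len(attacks)
    frr = sum(s < threshold for s in bona_fide) / len(bona_fide)
    return far, frr


def hter(far, frr):
    """Half Total Error Rate: average of FAR and FRR."""
    return (far + frr) / 2


def acer(apcer_by_attack, bpcer):
    """Average Classification Error Rate using the worst-case APCER."""
    return (max(apcer_by_attack.values()) + bpcer) / 2


if __name__ == "__main__":
    # Made-up scores: three bona fide samples followed by two attacks.
    scores = [0.9, 0.8, 0.4, 0.3, 0.6]
    labels = [1, 1, 1, 0, 0]
    far, frr = far_frr(scores, labels, threshold=0.5)
    print(hter(far, frr))
    print(acer({"print": 0.0, "replay": 0.02}, bpcer=0.01))
```

A reported HTER of 0.0 percent therefore means that, at the chosen threshold, no test attack was accepted and no bona fide sample was rejected.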

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same training-time flow augmentation plus distillation pattern could be tested on other video tasks where motion matters but real-time RGB inference is required.
  • Extending the student to accept occasional depth or infrared frames at training time might further improve robustness without changing the inference footprint.
  • If the distillation generalizes, similar teacher-student pairs could reduce the cost of temporal modeling in surveillance or medical video analysis.

Load-bearing premise

That the teacher's logit outputs alone carry enough motion-discriminative signal for the student to replicate accurate behavior from RGB inputs only.

What would settle it

A large rise in the student's HTER relative to the teacher's, on a replay-attack dataset that emphasizes subtle motion differences, would show that distillation did not transfer the necessary motion features.

Figures

Figures reproduced from arXiv: 2605.13108 by Kejie Huang, Muhammad Shahid Jabbar, Muhammad Sohail Ibrahim, Shujaat Khan, Taha Hasan Masood Siddique.

Figure 1: Proposed dual-branch (RGB+Flow) teacher and distilled RGB-only student for FacePAD.
Figure 2: Grad-CAM overlays for image-only (I), flow-augmented (I&F), and distilled student (DS) models, showing sharper motion-aware activations with flow guidance.
Original abstract

Face presentation attack detection (FacePAD) remains challenging under diverse spoofing representation, including 2D print and replay, 3D mask-based spoofing, makeup-induced appearance manipulation, and physical occlusions, as well as under varying capture conditions. Motion cues are highly discriminative for FacePAD but typically require explicit optical flow estimation, which introduces substantial computational overhead and limits real-time deployment. In this work, we leverage optical flow to enhance motion representation during training while eliminating the need for flow computation at inference. We propose a dual-branch teacher model that fuses appearance cues from RGB frames with motion cues derived from colorwheel-encoded optical flow, enabling effective modeling of micro-motions and temporal consistency. To enable efficient deployment, we introduce a knowledge distillation framework that transfers motion-aware knowledge from the flow-augmented teacher to a lightweight RGB-only student via logit distillation. As a result, the student implicitly learns motion-sensitive representations without requiring explicit flow estimation or additional feature extraction blocks at inference. Extensive experiments demonstrate strong performance across multiple benchmarks, achieving 0.0% HTER on Replay-Attack and Replay-Mobile, 0.94% HTER on ROSE-Youtu, 5.65% HTER on SiW-Mv2, and 0.42% ACER on OULU-NPU. The distilled student achieves performance comparable to or better than the teacher while significantly reducing parameters and FLOPs, achieving 52 FPS on an NVIDIA Jetson Orin Nano, indicating its suitability for real-time and resource-constrained FacePAD deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a dual-branch teacher model that fuses RGB appearance features with motion cues from colorwheel-encoded optical flow to improve FacePAD. It then applies logit distillation to transfer this knowledge to a lightweight RGB-only student, allowing the student to implicitly capture motion-sensitive representations without explicit flow computation or extra blocks at inference. The student reportedly achieves 0.0% HTER on Replay-Attack and Replay-Mobile, 0.94% HTER on ROSE-Youtu, 5.65% HTER on SiW-Mv2, and 0.42% ACER on OULU-NPU, while running at 52 FPS on an NVIDIA Jetson Orin Nano with reduced parameters and FLOPs compared to the teacher.

Significance. If the logit distillation successfully equips the RGB student with motion-discriminative capability, the approach would provide a practical route to real-time, resource-efficient FacePAD on edge devices by eliminating flow overhead at inference while matching or exceeding teacher performance. The reported benchmark numbers indicate potential utility for deployment under diverse spoofing conditions, though the significance depends on confirming that the gains stem from the proposed flow-augmented transfer rather than the student architecture or data alone.

major comments (2)
  1. [Experiments] Experiments section: the manuscript does not include an ablation comparing the distilled RGB-only student against an identical student architecture trained with standard supervision on RGB frames alone (no teacher logits). This comparison is load-bearing for the central claim, as the 0.0% HTER on Replay-Attack/Replay-Mobile could otherwise arise from the student's own temporal modeling of RGB sequences or dataset biases rather than successful transfer of motion cues from the flow-augmented teacher.
  2. [Method] Method and Experiments sections: training protocols, data splits, hyperparameter settings, and exact distillation loss formulation are not detailed sufficiently to allow reproduction or to verify that the student truly acquires implicit motion sensitivity beyond what a plain RGB baseline would achieve.
minor comments (2)
  1. [Abstract] Abstract: the abstract omits any mention of training protocols, data splits, or ablation studies, which would better contextualize the strong reported HTER values.
  2. [Figures] Figures: ensure the dual-branch teacher architecture and the distillation pipeline diagrams include explicit labels for all input branches, fusion points, and loss terms to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects for strengthening the central claims of our work on flow-augmented distillation for lightweight FacePAD. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the manuscript does not include an ablation comparing the distilled RGB-only student against an identical student architecture trained with standard supervision on RGB frames alone (no teacher logits). This comparison is load-bearing for the central claim, as the 0.0% HTER on Replay-Attack/Replay-Mobile could otherwise arise from the student's own temporal modeling of RGB sequences or dataset biases rather than successful transfer of motion cues from the flow-augmented teacher.

    Authors: We agree that this ablation is essential to isolate the contribution of the logit distillation. In the revised manuscript, we will add a direct comparison of the identical RGB-only student architecture trained under standard supervision (cross-entropy loss on RGB frames) versus the distilled version using teacher logits. The updated Experiments section will report the resulting HTER/ACER metrics on the same benchmarks, demonstrating that distillation yields measurable gains attributable to implicit motion sensitivity transferred from the flow-augmented teacher. revision: yes

  2. Referee: [Method] Method and Experiments sections: training protocols, data splits, hyperparameter settings, and exact distillation loss formulation are not detailed sufficiently to allow reproduction or to verify that the student truly acquires implicit motion sensitivity beyond what a plain RGB baseline would achieve.

    Authors: We acknowledge the need for greater detail to support reproducibility. The revised manuscript will expand both the Method and Experiments sections to include: (i) the precise distillation loss formulation (temperature-scaled KL divergence between teacher and student logits, combined with any auxiliary terms), (ii) full training protocols including optimizer, learning rate schedule, batch size, and number of epochs, (iii) exact data splits and preprocessing for each benchmark dataset, and (iv) all hyperparameter values. These additions will enable verification that performance improvements stem from the proposed knowledge transfer rather than baseline RGB training alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical claims on external benchmarks

Full rationale

The paper contains no equations, derivations, or parameter-fitting steps that could reduce to inputs by construction. All performance numbers (HTER, ACER, FPS) are reported as direct experimental outcomes on public datasets (Replay-Attack, Replay-Mobile, ROSE-Youtu, SiW-Mv2, OULU-NPU). The knowledge-distillation procedure is described at the level of training protocol rather than any closed-form identity or self-referential prediction. No self-citation is invoked to justify uniqueness or to substitute for an independent result. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Approach rests on standard assumptions of deep learning optimization and knowledge distillation effectiveness; no explicit free parameters, axioms, or invented entities are introduced beyond conventional neural network components.

pith-pipeline@v0.9.0 · 5597 in / 1094 out tokens · 40072 ms · 2026-05-14T19:50:23.360938+00:00 · methodology


