arxiv: 2604.16486 · v1 · submitted 2026-04-13 · 💻 cs.CV · cs.LG

Aletheia: Physics-Conditioned Localized Artifact Attention (PhyLAA-X) for End-to-End Generalizable and Robust Deepfake Video Detection

Devendra Ghori

Pith reviewed 2026-05-10 14:56 UTC · model grok-4.3

keywords deepfake detection · physics-conditioned attention · optical flow · rPPG · adversarial robustness · cross-generator generalization · video forgery · localized artifact attention

The pith

Physics-derived features injected into attention make deepfake detectors generalize across generators while resisting attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix the drop in performance that deepfake detectors show when moving to new video generators or facing adversarial changes. It does so by extending localized artifact attention with three physics-based volumes computed from the input video: optical-flow curl, specular-reflectance skewness, and spatially upsampled rPPG power spectra. These volumes are fed into the attention layers through cross-attention gating and a resonance consistency loss so the network learns to flag regions where both semantic artifacts and physical violations appear together. An ensemble of three efficient backbones with uncertainty-weighted fusion then produces the final decision. Experiments report strong numbers on FaceForensics++, Celeb-DF v2, and DFDC plus retained accuracy under PGD attacks.

Core claim

PhyLAA-X conditions the localized artifact attention computation on three end-to-end differentiable physics-derived feature volumes—optical-flow curl, specular-reflectance skewness, and spatially-upsampled rPPG power spectra—via cross-attention gating and a resonance consistency loss. This forces the model to attend to manipulation boundaries where semantic inconsistencies and physical violations co-occur, regions that current generative models cannot replicate consistently.

What carries the argument

PhyLAA-X, the physics-conditioned extension of Localized Artifact Attention that injects optical-flow curl, specular-reflectance skewness, and rPPG spectra into attention computation through cross-attention gating and a resonance consistency loss.

If this is right

Accuracy reaches 97.2 percent and AUC 0.992 on FaceForensics++ c23 compression.
Cross-generator gains of 4.1 to 7.3 percent over the prior LAA-Net baseline on Celeb-DF and DFDC.
Accuracy remains 79.4 percent under epsilon=0.02 PGD-10 adversarial attacks.
Single-backbone ablations alone deliver a 4.2 percent cross-dataset AUC improvement.
Uncertainty-aware ensemble weighting further stabilizes decisions across the three backbones.
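The 79.4 percent figure refers to accuracy after a 10-step projected gradient descent (PGD-10) attack with a perturbation budget of epsilon = 0.02. As a rough illustration of what such an attack does, the sketch below runs a PGD loop against a toy logistic classifier. The L-infinity norm, the step-size heuristic, the [0, 1] input range, the lack of a random start, and the toy model (`w`, `b`) are all assumptions made for illustration; the paper's actual attack protocol is not described in this summary.

```python
import math

def predict(x, w, b):
    """Toy stand-in for a detector: p(fake) = sigmoid(w . x + b)."""
    t = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-t))

def loss_grad_wrt_input(x, y, w, b):
    """Gradient of binary cross-entropy with respect to the input x."""
    p = predict(x, w, b)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_linf(x0, y, w, b, eps=0.02, steps=10, step_size=None):
    """Maximize the loss while staying within an L-infinity ball of radius eps."""
    if step_size is None:
        step_size = 2.5 * eps / steps  # assumed heuristic, not from the paper
    x = list(x0)
    for _ in range(steps):
        g = loss_grad_wrt_input(x, y, w, b)
        # Ascent step along the sign of the gradient.
        x = [xi + step_size * sign(gi) for xi, gi in zip(x, g)]
        # Project back into the eps-ball around x0 and the valid range [0, 1].
        x = [min(max(xi, x0i - eps, 0.0), x0i + eps, 1.0)
             for xi, x0i in zip(x, x0)]
    return x

# Hypothetical toy model and input (not from the paper).
w, b = [1.5, -2.0, 0.5], -0.1
x_clean, label = [0.5, 0.2, 0.8], 1
x_adv = pgd_linf(x_clean, label, w, b)
```

In this toy setting the attack lowers the score of the true class while moving no input coordinate by more than eps. A real evaluation would take gradients through the full detector rather than a linear model.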

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Generators that explicitly simulate optical flow and rPPG would need to be tested to see whether the performance gap closes.
The same physics-gating pattern could be applied to still-image forgery or audio-video lip-sync detection.
Removing the resonance loss or the physics volumes in isolation would quantify how much each component contributes to robustness.
The method suggests that future detectors should treat physical consistency as a first-class signal rather than a post-hoc check.

Load-bearing premise

Generative models cannot reliably reproduce physical invariants such as optical flow discontinuities, specular reflection patterns, and cardiac-modulated reflectance at the same time as semantic artifacts.

What would settle it

A new generator that produces videos whose optical-flow curl, specular-reflectance skewness, and rPPG spectra match those of real videos at manipulation boundaries would cause cross-generator accuracy to fall to near-random levels.

Figures

Figures reproduced from arXiv: 2604.16486 by Devendra Ghori.

**Figure 2.** Detailed PhyLAA-X Module.

3.3 Uncertainty-Aware Ensemble Fusion. Logits $z_i$ are fused with weights

$$w_i = \frac{\exp(\eta_i/\tau)\,(1 - u_i)\,r_i}{\sum_j \exp(\eta_j/\tau)\,(1 - u_j)\,r_j} \tag{5}$$

where $\eta_i$ is the validation AUC-ROC, $u_i$ the Monte-Carlo dropout entropy (K = 32), $r_i$ the physics-resonance agreement, and $\tau = 0.4$. ECE drops to 0.029.

3.4 Explainability and Production Inference. PhyLAA-X produces more localized and physically meaningful GradC…
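The fusion rule in Eq. (5) of the Figure 2 excerpt can be sketched in a few lines. This is a minimal illustration, assuming the entropy values are normalized to [0, 1] and that the fused logit is the weighted sum of per-backbone logits; the function names and the numeric inputs are hypothetical and do not come from the paper.

```python
import math

def fusion_weights(val_auc, mc_entropy, resonance, tau=0.4):
    """Eq. (5): w_i is proportional to exp(eta_i / tau) * (1 - u_i) * r_i."""
    scores = [
        math.exp(eta / tau) * (1.0 - u) * r
        for eta, u, r in zip(val_auc, mc_entropy, resonance)
    ]
    total = sum(scores)
    return [s / total for s in scores]

def fuse_logits(logits, weights):
    """Weighted sum of per-backbone logits (assumed fusion form)."""
    return sum(w * z for w, z in zip(weights, logits))

# Hypothetical values for the three backbones (not from the paper).
weights = fusion_weights(
    val_auc=[0.98, 0.97, 0.96],     # eta_i: validation AUC-ROC
    mc_entropy=[0.10, 0.20, 0.15],  # u_i: MC-dropout entropy, assumed in [0, 1]
    resonance=[0.90, 0.80, 0.85],   # r_i: physics-resonance agreement
)
fused = fuse_logits([2.1, 1.4, 1.8], weights)
```

The weights sum to one by construction, and a backbone is down-weighted when its dropout entropy is high or its resonance agreement is low. With tau = 0.4, small differences in validation AUC are amplified only moderately.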

**Figure 3.** Example PhyLAA-X-Enhanced GradCAM++ Visualization.

Loss. Focal loss (α = 0.25, γ = 2) + auxiliary LAA-X segmentation loss + $L_{\text{res}}$ (weight 0.3).

Optimization. AdamW (LR = $3 \times 10^{-4}$, cosine annealing with warm restarts), mixed precision, DDP (4–8 GPUs), gradient checkpointing, effective batch 256. Augmentations include random temporal crop, HEVC compression, and 20% adversarial samples.

Attack Protocols. PGD-10…
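The loss recipe quoted in the Figure 3 excerpt can be made concrete with a small sketch. It uses the standard binary focal loss with α = 0.25 and γ = 2 and combines it with the other two terms. Treating "fake" as the positive class, giving the segmentation term a unit weight, and the function names are assumptions; only the focal-loss parameters and the 0.3 weight on the resonance term appear in the excerpt.

```python
import math

def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Standard binary focal loss for one prediction p = P(y = 1)."""
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing factor
    p_t = min(max(p_t, eps), 1.0)               # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def total_loss(focal, seg, res, res_weight=0.3):
    """Focal + auxiliary segmentation (unit weight assumed) + weighted L_res."""
    return focal + seg + res_weight * res

easy = binary_focal_loss(0.9, 1)  # confident, correct prediction
hard = binary_focal_loss(0.1, 1)  # confident, wrong prediction
```

The `(1 - p_t) ** gamma` factor shrinks the contribution of well-classified samples, so `hard` is much larger than `easy`. Setting `gamma=0` and `alpha=1` recovers plain cross-entropy for the positive class.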

**Figure 4.** Ablation Bar Chart: Cross-Dataset AUC Gain.

5.4 Detailed Ablations

PhyLAA-X Conditioning
• Standard LAA-X (no physics): 0.923 cross-AUC
• Post-hoc concatenation: 0.944 (+2.1%)
• PhyLAA-X (cross-attention + $L_{\text{res}}$): 0.951 (+6.8%)

Per-Physics Contribution (remove one conditioner)
• −Flow curl: −3.9%
• −Specular: −2.7%
• −rPPG: −4.1%

Ensemble Weight Sensitivity. Fixed equal weights drop AUC by 1.8%; uncertaint…

Original abstract

State-of-the-art deepfake detectors achieve near-perfect in-domain accuracy yet degrade under cross-generator shifts, heavy compression, and adversarial perturbations. The core limitation remains the decoupling of semantic artifact learning from physical invariants: optical-flow discontinuities, specular-reflection inconsistencies, and cardiac-modulated reflectance (rPPG) are treated either as post-hoc features or ignored. We introduce PhyLAA-X, a novel physics-conditioned extension of Localized Artifact Attention (LAA-X). PhyLAA-X injects three end-to-end differentiable physics-derived feature volumes - optical-flow curl, specular-reflectance skewness, and spatially-upsampled rPPG power spectra - directly into the LAA-X attention computation via cross-attention gating and a resonance consistency loss. This forces the network to learn manipulation boundaries where semantic inconsistencies and physical violations co-occur - regions inherently harder for generative models to replicate consistently. PhyLAA-X is embedded across an efficient spatiotemporal ensemble (EfficientNet-B4+BiLSTM, ResNeXt-101+Transformer, Xception+causal Conv1D) with uncertainty-aware adaptive weighting. On FaceForensics++ (c23), Aletheia reaches 97.2% accuracy / 0.992 AUC-ROC; on Celeb-DF v2, 94.9% / 0.981; on DFDC, 90.8% / 0.966 - outperforming the strongest published baseline (LAA-Net [1]) by 4.1-7.3% in cross-generator settings and maintaining 79.4% accuracy under epsilon = 0.02 PGD-10 attacks. Single-backbone ablations confirm PhyLAA-X alone delivers a 4.2% cross-dataset AUC gain. The full production system is open-sourced at https://github.com/devghori1264/Aletheia (v1.2, April 2026) with pretrained weights, the adversarial corpus (referred to as ADC-2026 in this work), and complete reproducibility artifacts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds three physics features to artifact attention for deepfake detection and reports clear benchmark gains plus adversarial robustness, but the end-to-end conditioning claim needs checking against the actual implementation.

The main takeaway is that PhyLAA-X conditions localized artifact attention on optical-flow curl, specular-reflectance skewness, and rPPG spectra through cross-attention gating and a resonance consistency loss. It reports 97.2% accuracy on FaceForensics++ c23, 94.9% on Celeb-DF v2, and 90.8% on DFDC, outperforming LAA-Net by 4.1-7.3% in cross-generator settings while retaining 79.4% accuracy under epsilon=0.02 PGD-10 attacks. The code, pretrained weights, and adversarial corpus (ADC-2026) are released, which helps reproducibility.

Referee Report

2 major / 1 minor

Summary. The paper introduces Aletheia (PhyLAA-X), a physics-conditioned extension of Localized Artifact Attention for deepfake video detection. It injects three claimed end-to-end differentiable physics-derived feature volumes (optical-flow curl, specular-reflectance skewness, and spatially-upsampled rPPG power spectra) into the LAA-X attention via cross-attention gating and a resonance consistency loss. This is embedded in a spatiotemporal ensemble (EfficientNet-B4+BiLSTM, ResNeXt-101+Transformer, Xception+causal Conv1D) with uncertainty-aware weighting. The method reports strong results: 97.2% acc / 0.992 AUC on FaceForensics++ (c23), 94.9% / 0.981 on Celeb-DF v2, 90.8% / 0.966 on DFDC, outperforming LAA-Net by 4.1-7.3% in cross-generator settings, plus 79.4% accuracy under PGD-10 attacks, with single-backbone ablations showing 4.2% cross-dataset AUC gain from PhyLAA-X. The full system is open-sourced.

Significance. If the central mechanism holds, the work could meaningfully advance generalizable and robust deepfake detection by explicitly coupling semantic artifact attention with physical invariants that generative models struggle to replicate consistently. The reported gains in cross-generator and adversarial settings, combined with the open release of code, pretrained weights, and the ADC-2026 adversarial corpus, would support reproducibility and further research in the field.

major comments (2)

[Abstract] Abstract: The central claim that the three physics-derived feature volumes are 'end-to-end differentiable' and that the resonance consistency loss 'forces the network to learn manipulation boundaries where semantic inconsistencies and physical violations co-occur' is load-bearing for the proposed mechanism. Standard extraction of optical-flow curl, rPPG power spectra (involving frequency binning), and reflectance skewness typically includes non-differentiable operations (iterative solvers, argmax, post-processing). The manuscript provides no explicit description of how differentiability is achieved (e.g., via surrogate gradients, straight-through estimators, or fully differentiable approximations), so it is unclear whether back-propagation actually enforces consistency with the physical invariants or whether the method reduces to non-conditioned feature concatenation.
[Ablation studies (referenced in Abstract)] The single-backbone ablations are cited as confirming a 4.2% cross-dataset AUC gain attributable to PhyLAA-X. However, without a dedicated ablation table or section detailing the exact configurations (e.g., LAA-X baseline vs. PhyLAA-X with/without each physics volume, resonance loss weight, and cross-attention gating), it is difficult to isolate the contribution of the physics conditioning from the ensemble architecture or other design choices.

minor comments (1)

[Abstract] The GitHub link references v1.2 dated April 2026; please correct the date to the actual release.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and outline the revisions we will make to improve the manuscript.

Referee: [Abstract] Abstract: The central claim that the three physics-derived feature volumes are 'end-to-end differentiable' and that the resonance consistency loss 'forces the network to learn manipulation boundaries where semantic inconsistencies and physical violations co-occur' is load-bearing for the proposed mechanism. Standard extraction of optical-flow curl, rPPG power spectra (involving frequency binning), and reflectance skewness typically includes non-differentiable operations (iterative solvers, argmax, post-processing). The manuscript provides no explicit description of how differentiability is achieved (e.g., via surrogate gradients, straight-through estimators, or fully differentiable approximations), so it is unclear whether back-propagation actually enforces consistency with the physical invariants or whether the method reduces to non-conditioned feature concatenation.

Authors: We agree that the abstract lacks sufficient detail on differentiability, which is a valid concern. We will revise the abstract to note that the physics volumes are computed via fully differentiable approximations (finite-difference curl operator for optical flow, smoothed skewness formula for reflectance, and differentiable DFT approximation for rPPG spectra). The resonance consistency loss is a standard differentiable L2 term applied after cross-attention gating. This ensures end-to-end gradient flow enforces physical consistency rather than simple concatenation. A new paragraph will be added to the Methods section with implementation specifics. revision: yes
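The three approximations named in this response can be illustrated with a small sketch. It is written in plain Python for readability and assumes a dense flow field given as two 2-D lists `u` and `v`, a flat list of reflectance values, and a 1-D rPPG trace. The function names, the central-difference stencil, the `eps` smoothing term, and the zero-valued border are assumptions rather than details from the paper. Each formula uses only smooth arithmetic, so it could in principle be ported to an autodiff framework; whether the flow estimator feeding it is differentiable is a separate question.

```python
import cmath
import math

def flow_curl(u, v):
    """Central-difference curl dv/dx - du/dy; border pixels are left at 0."""
    h, w = len(u), len(u[0])
    curl = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dv_dx = (v[y][x + 1] - v[y][x - 1]) / 2.0
            du_dy = (u[y + 1][x] - u[y - 1][x]) / 2.0
            curl[y][x] = dv_dx - du_dy
    return curl

def smoothed_skewness(values, eps=1e-6):
    """Third standardized moment; eps keeps the ratio finite on flat patches."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / (m2 + eps) ** 1.5

def power_spectrum(signal):
    """One-sided DFT power spectrum, |X_k|^2 / N for k = 0 .. N // 2."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(signal))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum

# Rigid rotation (u = -y, v = x) has constant curl 2 in the interior.
u = [[-float(y) for _ in range(5)] for y in range(5)]
v = [[float(x) for x in range(5)] for _ in range(5)]
curl = flow_curl(u, v)
```

A pure rotation is a convenient sanity check for the curl operator, and a single sinusoid should put nearly all of its power in one spectral bin. A practical implementation would use an FFT and vectorized tensor operations instead of Python loops.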
Referee: [Ablation studies (referenced in Abstract)] The single-backbone ablations are cited as confirming a 4.2% cross-dataset AUC gain attributable to PhyLAA-X. However, without a dedicated ablation table or section detailing the exact configurations (e.g., LAA-X baseline vs. PhyLAA-X with/without each physics volume, resonance loss weight, and cross-attention gating), it is difficult to isolate the contribution of the physics conditioning from the ensemble architecture or other design choices.

Authors: The referee is correct that the current manuscript does not include a dedicated ablation table breaking down the contributions. We will add a new ablation subsection and table in the revised manuscript. This table will report single-backbone cross-dataset AUC for the LAA-X baseline, PhyLAA-X with each physics volume added individually, with/without the resonance loss, and with/without cross-attention gating, to isolate the physics conditioning effects. revision: yes

Circularity Check

0 steps flagged

No circularity: physics features and attention gating are independent architectural additions

The paper's core contribution is an architectural modification that injects externally computed physics-derived volumes (optical-flow curl, specular skewness, rPPG spectra) into an existing LAA-X attention block via cross-attention and an auxiliary loss. No derivation chain, equation, or performance claim reduces to a self-definition, a fitted parameter renamed as prediction, or a self-citation that alone justifies the central result. The cited LAA-Net baseline serves only for empirical comparison; the claimed gains are presented as empirical outcomes of the new conditioning, not as logical consequences of prior self-work. The differentiability of the physics extractors is asserted but not derived from the model itself, so no self-referential loop exists in the stated mechanism.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that physics features provide an independent signal not captured by semantic learning alone, and introduces a new attention mechanism conditioned on them.

free parameters (1)

resonance consistency loss weight
Likely a hyperparameter tuned for the model, though not specified in abstract.

axioms (1)

domain assumption: Generative models struggle to consistently replicate physical invariants such as optical-flow discontinuities, specular-reflection inconsistencies, and cardiac-modulated reflectance.
Invoked in the abstract as the reason why the method works.

Reference graph

Works this paper leans on

9 extracted references · 3 canonical work pages · 1 internal anchor

[1] Nguyen et al. LAA-Net: Localized Artifact Attention Network for Deepfake Detection. In CVPR, 2024.

[2] Yan et al. DF40: Toward Next-Generation Deepfake Detection. In NeurIPS, 2024.

[3] Fei et al. Exploring Specular Reflection Inconsistency for Deepfake Detection. arXiv:2602.06452, 2026.

[4] Kolay. BioVerify: Invariant Deepfake Detection via Remote Photoplethysmography. TechRxiv, 2026.

[5] Hernandez-Ortega et al. DeepFakes Detection based on Heart Rate Estimation. 2020.

[6] Rekavandi et al. Certified Adversarial Robustness via Randomized α-Smoothing. In NeurIPS, 2024.

[7] Yang et al. Towards More General Video-based Deepfake Detection. In CVPR, 2025.

[8] Guo et al. Deepfake Detection that Generalizes Across Benchmarks. arXiv:2508.06248, 2025.

[9] Chandra et al. Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark. arXiv:2503.02857, 2025.