pith. sign in

arxiv: 2604.12341 · v1 · submitted 2026-04-14 · 💻 cs.CV

Bridging the Micro--Macro Gap: Frequency-Aware Semantic Alignment for Image Manipulation Localization

Pith reviewed 2026-05-10 15:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords image manipulation localizationfrequency analysissemantic alignmentdiffusion modelsforensic detectionCLIPtamper localizationdeepfake detection
0
0 comments X p. Extension

The pith

FASA combines frequency cues and semantic alignment to localize both traditional and diffusion-generated image manipulations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FASA as a single framework that handles image edits ranging from obvious traditional changes to seamless diffusion-generated ones. It pulls manipulation-sensitive signals from an adaptive dual-band DCT module and captures semantic differences through patch-level contrastive alignment on frozen CLIP features. These elements are fused by a semantic-frequency side adapter inside a hierarchical frequency pathway and decoded by a prototype-guided, frequency-gated mask decoder to produce the tampered region map. Readers would care because existing approaches that rely on either low-level artifacts or high-level semantics alone leave a gap that modern generators exploit, and a unified method can improve detection reliability.

Core claim

FASA extracts manipulation-sensitive frequency cues through an adaptive dual-band DCT module and learns manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP representations. These priors are injected into a hierarchical frequency pathway through a semantic-frequency side adapter for multi-scale feature interaction, and a prototype-guided, frequency-gated mask decoder integrates semantic consistency with boundary-aware localization to predict tampered regions, achieving state-of-the-art performance on OpenSDI and traditional benchmarks along with cross-generator and cross-dataset generalization.

What carries the argument

The semantic-frequency side adapter that injects patch-level CLIP semantic priors into the hierarchical frequency pathway to enable multi-scale interaction between low-level cues and semantic consistency.

If this is right

  • A single model can localize both traditional manipulations with visible forensic artifacts and realistic diffusion-generated edits.
  • The framework achieves state-of-the-art localization accuracy on OpenSDI and multiple traditional manipulation benchmarks.
  • Performance generalizes across different generators and datasets without requiring retraining for each.
  • Robustness holds under common image degradations such as compression and noise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Because the CLIP backbone remains frozen, the method could be swapped to newer vision-language models to gain better semantic priors with minimal retraining.
  • The dual-band DCT structure could be extended to video by adding a temporal frequency dimension to localize edits across frames.
  • The prototype-guided decoder may support interactive refinement where a user provides a few example tampered patches to improve localization on specific images.

Load-bearing premise

The adaptive dual-band DCT cues and patch-level CLIP contrastive alignment will remain manipulation-sensitive and semantically inconsistent even for future unseen generators.

What would settle it

A new benchmark built from diffusion generators released after OpenSDI on which FASA's localization accuracy drops below that of prior methods trained on comparable data.

Figures

Figures reproduced from arXiv: 2604.12341 by Wei Lu, Xiaojie Liang, Zhimin Chen, Ziqi Sheng.

Figure 1
Figure 1. Figure 1: Comparison of manipulation traces. (a) Traditional [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Localization performance on the OpenSDI bench [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the proposed Frequency-Aware Semantic Alignment (FASA) framework. Given an input image, FASA [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison of localization results on diffusion-generated and traditional manipulation datasets. Existing [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Robustness comparison of different methods under Gaussian blur and JPEG compression. The first row reports results [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

As generative image editing advances, image manipulation localization (IML) must handle both traditional manipulations with conspicuous forensic artifacts and diffusion-generated edits that appear locally realistic. Existing methods typically rely on either low-level forensic cues or high-level semantics alone, leading to a fundamental micro--macro gap. To bridge this gap, we propose FASA, a unified framework for localizing both traditional and diffusion-generated manipulations. Specifically, we extract manipulation-sensitive frequency cues through an adaptive dual-band DCT module and learn manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP representations. We then inject these priors into a hierarchical frequency pathway through a semantic-frequency side adapter for multi-scale feature interaction, and employ a prototype-guided, frequency-gated mask decoder to integrate semantic consistency with boundary-aware localization for tampered region prediction. Extensive experiments on OpenSDI and multiple traditional manipulation benchmarks demonstrate state-of-the-art localization performance, strong cross-generator and cross-dataset generalization, and robust performance under common image degradations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FASA, a unified framework for image manipulation localization (IML) that extracts manipulation-sensitive frequency cues via an adaptive dual-band DCT module and learns manipulation-aware semantic priors through patch-level contrastive alignment on frozen CLIP representations. These are fused using a semantic-frequency side adapter for multi-scale interaction and a prototype-guided, frequency-gated mask decoder for tampered region prediction. The central claim is that this bridges the micro-macro gap, achieving state-of-the-art localization on OpenSDI and traditional manipulation benchmarks, with strong cross-generator/cross-dataset generalization and robustness to degradations.

Significance. If the empirical results and generalization hold under rigorous verification, the work would be significant for computer vision and digital forensics by providing a practical unified approach to IML that combines low-level forensic cues with high-level semantics, addressing limitations of prior methods focused on either traditional or generative manipulations alone.

major comments (2)
  1. [Abstract] Abstract: the central claim of 'strong cross-generator and cross-dataset generalization' is load-bearing but rests on empirical extrapolation; no derivation, ablation, or explicit test in the method or experiments demonstrates why the dual-band DCT cues and CLIP alignment remain discriminative for future unseen generators lacking obvious frequency artifacts or semantic breaks.
  2. [Experiments] Experiments section (implied by performance claims): the SOTA, generalization, and robustness assertions lack visible quantitative support such as specific F1/IoU metrics, error bars, dataset splits, ablation tables on the side-adapter fusion, or failure-case analysis, preventing verification of the performance claims against baselines.
minor comments (2)
  1. [Method] The free parameters (DCT cutoffs, contrastive temperature/margin) are noted but their selection process and sensitivity analysis are not detailed, which could be clarified for reproducibility.
  2. [Method] Notation for the prototype-guided decoder and frequency-gated mask could be made more explicit with equations to aid understanding of the integration step.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive review. We address the major comments point by point below, with honest indications of where the manuscript will be revised.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of 'strong cross-generator and cross-dataset generalization' is load-bearing but rests on empirical extrapolation; no derivation, ablation, or explicit test in the method or experiments demonstrates why the dual-band DCT cues and CLIP alignment remain discriminative for future unseen generators lacking obvious frequency artifacts or semantic breaks.

    Authors: We agree that generalization claims to entirely novel future generators are necessarily empirical rather than theoretically derived. The paper demonstrates strong cross-generator and cross-dataset results on the current OpenSDI benchmark (covering multiple diffusion models) and traditional manipulation datasets, with ablations confirming the roles of the dual-band DCT module and CLIP alignment. We will revise the abstract to read 'strong cross-generator and cross-dataset generalization on existing benchmarks' and add a limitations paragraph noting that performance on future generators without frequency or semantic artifacts cannot be guaranteed. revision: partial

  2. Referee: [Experiments] Experiments section (implied by performance claims): the SOTA, generalization, and robustness assertions lack visible quantitative support such as specific F1/IoU metrics, error bars, dataset splits, ablation tables on the side-adapter fusion, or failure-case analysis, preventing verification of the performance claims against baselines.

    Authors: The manuscript's Experiments section (Section 4) contains the requested details: Table 1 reports F1 and IoU scores with baseline comparisons on OpenSDI and traditional datasets; Table 2 shows cross-generator and cross-dataset results; Table 3 covers robustness under degradations with error bars from three independent runs; dataset splits are specified in Section 4.1; and Table 4 provides ablations including the semantic-frequency side adapter. We will revise to add a dedicated failure-case analysis subsection with additional visualizations and ensure all tables are cross-referenced more explicitly in the text. revision: partial

standing simulated objections not resolved
  • No explicit test or derivation is possible for performance on future unseen generators that do not yet exist.

Circularity Check

0 steps flagged

No significant circularity in the proposed FASA framework

full rationale

The paper presents FASA as an empirical engineering combination of existing components (adaptive dual-band DCT for frequency cues, frozen CLIP for patch-level contrastive semantic alignment, side adapter fusion, and prototype-guided decoder) without any mathematical derivations, first-principles predictions, or equations that reduce claimed performance to quantities defined by the paper's own fitted parameters. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the method description or abstract. The central claims rest on experimental results on OpenSDI and traditional benchmarks rather than internal self-reference, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on the assumption that frequency bands and CLIP patch semantics are independently informative for manipulation detection and that their fusion via the side adapter produces additive gains; several module-specific hyperparameters are implicitly present but not enumerated in the abstract.

free parameters (2)
  • dual-band DCT frequency cutoffs
    Adaptive bands are chosen to capture manipulation-sensitive cues; exact thresholds are data-dependent.
  • contrastive alignment temperature and margin
    Standard contrastive hyperparameters that control how strongly semantic inconsistency is enforced.
axioms (2)
  • domain assumption Frozen CLIP representations encode manipulation-aware semantic priors at patch level
    Invoked when the paper states it learns manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP.
  • domain assumption Frequency and semantic streams can be fused without destructive interference via a side adapter
    Central to the hierarchical frequency pathway and multi-scale feature interaction claim.

pith-pipeline@v0.9.0 · 5475 in / 1494 out tokens · 50468 ms · 2026-05-10T15:19:10.716929+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

  1. [1]

    Xinru Chen, Chengbo Dong, Jiaqi Ji, Juan Cao, and Xirong Li. 2021. Image manipulation detection by multi-view multi-scale supervision. InProceedings of the IEEE/CVF international conference on computer vision. 14185–14193

  2. [2]

    Jing Dong, Wei Wang, and Tieniu Tan. 2013. Casia image tampering detection evaluation database. In2013 IEEE China summit and international conference on signal and information processing. IEEE, 422–426

  3. [3]

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. 2024. Scaling rectified flow transformers for high-resolution image synthesis. InForty- first international conference on machine learning

  4. [4]

    Haiying Guan, Mark Kozak, Eric Robertson, Yooyoung Lee, Amy N Yates, Andrew Delgado, Daniel Zhou, Timothee Kheyrkhah, Jeff Smith, and Jonathan Fiscus

  5. [5]

    In2019 IEEE Winter Applications of Computer Vision Workshops (W ACVW)

    MFC datasets: Large-scale benchmark datasets for media forensic chal- lenge evaluation. In2019 IEEE Winter Applications of Computer Vision Workshops (W ACVW). IEEE, 63–72

  6. [6]

    Fabrizio Guillaro, Davide Cozzolino, Avneesh Sud, Nicholas Dufour, and Luisa Verdoliva. 2023. Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 20606–20615

  7. [7]

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick

  8. [8]

    InProceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 16000–16009

  9. [9]

    Gurpreet Kaur, Navdeep Singh, and Munish Kumar. 2023. Image forgery tech- niques: a review.Artificial Intelligence Review56, 2 (2023), 1577–1625

  10. [10]

    Vladimir V Kniaz, Vladimir Knyaz, and Fabio Remondino. 2019. The point where reality meets fantasy: Mixed adversarial generators for image splice detection. Advances in neural information processing systems32 (2019)

  11. [11]

    Christos Koutlis and Symeon Papadopoulos. 2024. Leveraging representations from intermediate encoder-blocks for synthetic image detection. InEuropean Conference on computer vision. Springer, 394–411

  12. [12]

    Myung-Joon Kwon, Seung-Hun Nam, In-Jae Yu, Heung-Kyu Lee, and Changick Kim. 2022. Learning jpeg compression artifacts for image manipulation detection and localization.International Journal of Computer Vision130, 8 (2022), 1875– 1895

  13. [13]

    Black Forest Labs. 2024. FLUX. https://github.com/black-forest-labs/flux

  14. [14]

    Fengyong Li, Zhenjia Pei, Xinpeng Zhang, and Chuan Qin. 2022. Image manipu- lation localization using multi-scale feature fusion and adaptive edge supervision. IEEE Transactions on Multimedia25 (2022), 7851–7866

  15. [15]

    Xiaohong Liu, Yaojie Liu, Jun Chen, and Xiaoming Liu. 2022. PSCC-Net: Progres- sive spatio-channel correlation network for image manipulation detection and localization.IEEE Transactions on Circuits and Systems for Video Technology32, 11 (2022), 7505–7517

  16. [16]

    Xuntao Liu, Yuzhou Yang, Haoyue Wang, Qichao Ying, Zhenxing Qian, Xinpeng Zhang, and Sheng Li. 2024. Multi-view feature extraction via tunable prompts is enough for image manipulation localization. InProceedings of the 32nd ACM International Conference on Multimedia. 9999–10007

  17. [17]

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. 2022. A convnet for the 2020s. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 11976–11986

  18. [18]

    Zhengzhe Liu, Xiaojuan Qi, and Philip HS Torr. 2020. Global texture enhancement for fake face detection in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8060–8069

  19. [19]

    Xiaochen Ma, Bo Du, Zhuohang Jiang, Ahmed Y Al Hammadi, and Jizhe Zhou

  20. [20]

    IML-ViT: Benchmarking image manipulation localization by vision trans- former.arXiv preprint arXiv:2307.14863(2023)

  21. [21]

    Xiaochen Ma, Xuekang Zhu, Lei Su, Bo Du, Zhuohang Jiang, Bingkui Tong, Zeyu Lei, Xinyu Yang, Chi-Man Pun, Jiancheng Lv, et al . 2024. Imdl-benco: A comprehensive benchmark and codebase for image manipulation detection & localization.Advances in Neural Information Processing Systems37 (2024), 134591–134613

  22. [22]

    Fatemeh Zare Mehrjardi, Ali Mohammad Latif, Mohsen Sardari Zarchi, and Razieh Sheikhpour. 2023. A survey on deep learning-based image forgery detection. Pattern Recognition144 (2023), 109778

  23. [23]

    Tian-Tsong Ng, Jessie Hsu, and Shih-Fu Chang. 2009. Columbia image splicing detection evaluation dataset.DVMM lab. Columbia Univ CalPhotos Digit Libr (2009)

  24. [24]

    Adam Novozamsky, Babak Mahdian, and Stanislav Saic. 2020. IMD2020: A large- scale annotated dataset tailored for detecting manipulated images. InProceedings of the IEEE/CVF winter conference on applications of computer vision workshops. 71–80

  25. [25]

    Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. 2023. Towards universal fake image detectors that generalize across generative models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 24480–24489

  26. [26]

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. 2023. Sdxl: Improving latent diffusion models for high-resolution image synthesis.arXiv preprint arXiv:2307.01952 (2023)

  27. [27]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. InInternational conference on machine learning. PmLR, 8748–8763

  28. [28]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695

  29. [29]

    Ziqi Sheng, Zuomin Qu, Wei Lu, Xiaochun Cao, and Jiwu Huang. 2024. DiRLoc: disentanglement representation learning for robust image forgery localization. IEEE Transactions on Dependable and Secure Computing22, 3 (2024), 2841–2854

  30. [30]

    Zenan Shi, Xuanjing Shen, Haipeng Chen, and Yingda Lyu. 2023. PL-GNet: Pixel Level Global Network for detection and localization of image forgeries.Signal Processing: Image Communication119 (2023), 117029

  31. [31]

    Stefan Smeu, Elisabeta Oneata, and Dan Oneata. 2025. DeCLIP: Decoding CLIP representations for deepfake localization. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV). IEEE, 149–159

  32. [32]

    Lei Su, Xiaochen Ma, Xuekang Zhu, Chaoqun Niu, Zeyu Lei, and Ji-Zhe Zhou

  33. [33]

    InProceedings of the AAAI conference on artificial intelligence, Vol

    Can we get rid of handcrafted feature extractors? sparsevit: Nonsemantics- centered, parameter-efficient image manipulation localization through spare- coding transformer. InProceedings of the AAAI conference on artificial intelligence, Vol. 39. 7024–7032

  34. [34]

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. 2024. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 5052–5060

  35. [35]

    Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. 2024. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 28130–28139

  36. [36]

    Junke Wang, Zuxuan Wu, Jingjing Chen, Xintong Han, Abhinav Shrivastava, Ser-Nam Lim, and Yu-Gang Jiang. 2022. Objectformer for image manipulation detection and localization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2364–2373

  37. [37]

    Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. 2020. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8695–8704

  38. [38]

    Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. 2025. Opensdi: Spotting diffusion-generated images in the open world. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4291–4301

  39. [39]

    Bihan Wen, Ye Zhu, Ramanathan Subramanian, Tian-Tsong Ng, Xuanjing Shen, and Stefan Winkler. 2016. COVERAGE—A novel database for copy-move forgery detection. In2016 IEEE international conference on image processing (ICIP). Ieee, 161–165

  40. [40]

    Haiwei Wu, Jiantao Zhou, Jinyu Tian, and Jun Liu. 2022. Robust image forgery detection over online social network shared images. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 13440–13449

  41. [41]

    Yue Wu, Wael AbdAlmageed, and Premkumar Natarajan. 2019. Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9543–9552

  42. [42]

    Marcello Zanardelli, Fabrizio Guerrini, Riccardo Leonardi, and Nicola Adami

  43. [43]

    Multimedia Tools and Applications82, 12 (2023), 17521–17566

    Image forgery detection: a survey of recent deep-learning approaches. Multimedia Tools and Applications82, 12 (2023), 17521–17566

  44. [44]

    Kunlun Zeng, Ri Cheng, Weimin Tan, and Bo Yan. 2024. MGQFormer: Mask- guided query-based transformer for image manipulation localization. InProceed- ings of the AAAI Conference on Artificial Intelligence, Vol. 38. 6944–6952. Conference’17, July 2017, Washington, DC, USA Liang et al

  45. [45]

    Tianyi Zhang, Qinglong Lin, Yang Hu, Pengming Feng, and Rubo Zhang. 2025. Edge-aware Affinity Enhancement for Image Manipulation Localization. InPro- ceedings of the 33rd ACM International Conference on Multimedia. 324–332

  46. [46]

    Haochen Zhu, Gang Cao, and Xianglin Huang. 2023. Progressive feedback- enhanced transformer for image forgery localization.arXiv preprint arXiv:2311.08910(2023)

  47. [47]

    Xuekang Zhu, Xiaochen Ma, Lei Su, Zhuohang Jiang, Bo Du, Xiwen Wang, Zeyu Lei, Wentao Feng, Chi-Man Pun, and Ji-Zhe Zhou. 2025. Mesoscopic insights: orchestrating multi-scale & hybrid architecture for image manipulation localization. InProceedings of the AAAI conference on artificial intelligence, Vol. 39. 11022–11030