Bridging the Micro-Macro Gap: Frequency-Aware Semantic Alignment for Image Manipulation Localization
Pith reviewed 2026-05-10 15:19 UTC · model grok-4.3
The pith
FASA combines frequency cues and semantic alignment to localize both traditional and diffusion-generated image manipulations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FASA extracts manipulation-sensitive frequency cues through an adaptive dual-band DCT module and learns manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP representations. These priors are injected into a hierarchical frequency pathway through a semantic-frequency side adapter for multi-scale feature interaction, and a prototype-guided, frequency-gated mask decoder integrates semantic consistency with boundary-aware localization to predict tampered regions, achieving state-of-the-art performance on OpenSDI and traditional benchmarks along with cross-generator and cross-dataset generalization.
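To make the frequency side of the claim concrete, here is a minimal sketch of a dual-band DCT split on a single square image block. It is not the paper's module: the paper describes the band selection as adaptive, whereas this sketch uses a fixed, hand-picked `cutoff`, and the function names (`dct2`, `idct2`, `split_bands`) and the `u + v < cutoff` band rule are assumptions made for illustration.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def dct2(block):
    """2D DCT of a square block: D * X * D^T."""
    d = dct_matrix(len(block))
    return matmul(matmul(d, block), transpose(d))

def idct2(coeffs):
    """Inverse 2D DCT; the basis is orthonormal, so the inverse is D^T * C * D."""
    d = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(d), coeffs), d)

def split_bands(block, cutoff):
    """Split a square block into low- and high-frequency images.

    Assumption: coefficient (u, v) belongs to the low band when u + v < cutoff.
    A learned or adaptive cutoff would replace this fixed rule.
    """
    coeffs = dct2(block)
    n = len(coeffs)
    low = [[coeffs[u][v] if u + v < cutoff else 0.0 for v in range(n)]
           for u in range(n)]
    high = [[coeffs[u][v] - low[u][v] for v in range(n)] for u in range(n)]
    return idct2(low), idct2(high)
```

Calling `split_bands(block, 4)` on an 8x8 block returns two 8x8 images that sum back to the input, so the two bands can be fed to separate pathways without losing information.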
What carries the argument
The semantic-frequency side adapter that injects patch-level CLIP semantic priors into the hierarchical frequency pathway to enable multi-scale interaction between low-level cues and semantic consistency.
If this is right
- A single model can localize both traditional manipulations with visible forensic artifacts and realistic diffusion-generated edits.
- The framework achieves state-of-the-art localization accuracy on OpenSDI and multiple traditional manipulation benchmarks.
- Performance generalizes across different generators and datasets without requiring retraining for each.
- Robustness holds under common image degradations such as compression and noise.
Where Pith is reading between the lines
- Because the CLIP backbone remains frozen, it could be swapped for a newer vision-language model to gain better semantic priors with minimal retraining.
- The dual-band DCT structure could be extended to video by adding a temporal frequency dimension to localize edits across frames.
- The prototype-guided decoder may support interactive refinement where a user provides a few example tampered patches to improve localization on specific images.
Load-bearing premise
The adaptive dual-band DCT cues will remain manipulation-sensitive, and patch-level CLIP contrastive alignment will keep exposing semantic inconsistencies, even for future unseen generators.
What would settle it
A new benchmark built from diffusion generators released after OpenSDI on which FASA's localization accuracy drops below that of prior methods trained on comparable data.
Original abstract
As generative image editing advances, image manipulation localization (IML) must handle both traditional manipulations with conspicuous forensic artifacts and diffusion-generated edits that appear locally realistic. Existing methods typically rely on either low-level forensic cues or high-level semantics alone, leading to a fundamental micro-macro gap. To bridge this gap, we propose FASA, a unified framework for localizing both traditional and diffusion-generated manipulations. Specifically, we extract manipulation-sensitive frequency cues through an adaptive dual-band DCT module and learn manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP representations. We then inject these priors into a hierarchical frequency pathway through a semantic-frequency side adapter for multi-scale feature interaction, and employ a prototype-guided, frequency-gated mask decoder to integrate semantic consistency with boundary-aware localization for tampered region prediction. Extensive experiments on OpenSDI and multiple traditional manipulation benchmarks demonstrate state-of-the-art localization performance, strong cross-generator and cross-dataset generalization, and robust performance under common image degradations.
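The abstract does not spell out the patch-level contrastive objective. As a rough illustration only, the sketch below uses a generic supervised contrastive (InfoNCE-style) loss over patch features, where patches sharing a tampered or authentic label attract and the rest repel. The function name, the `temperature` default, and the choice of this particular loss are assumptions, not details taken from the paper; feature vectors are assumed to be nonzero.

```python
import math

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def patch_contrastive_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss over patch features.

    features: list of patch embeddings (e.g. projected frozen-backbone tokens).
    labels:   0 for authentic patches, 1 for tampered patches.
    Patches with no same-label partner are skipped.
    """
    total, count = 0.0, 0
    n = len(features)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue
        logits = {j: cosine(features[i], features[j]) / temperature for j in others}
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0
```

A lower temperature sharpens the penalty on hard negatives, which is one reason the referee's request for a sensitivity analysis of this parameter matters.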
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FASA, a unified framework for image manipulation localization (IML) that extracts manipulation-sensitive frequency cues via an adaptive dual-band DCT module and learns manipulation-aware semantic priors through patch-level contrastive alignment on frozen CLIP representations. These are fused using a semantic-frequency side adapter for multi-scale interaction and a prototype-guided, frequency-gated mask decoder for tampered region prediction. The central claim is that this bridges the micro-macro gap, achieving state-of-the-art localization on OpenSDI and traditional manipulation benchmarks, with strong cross-generator/cross-dataset generalization and robustness to degradations.
Significance. If the empirical results and generalization hold under rigorous verification, the work would be significant for computer vision and digital forensics by providing a practical unified approach to IML that combines low-level forensic cues with high-level semantics, addressing limitations of prior methods focused on either traditional or generative manipulations alone.
major comments (2)
- [Abstract] Abstract: the central claim of 'strong cross-generator and cross-dataset generalization' is load-bearing but rests on empirical extrapolation; no derivation, ablation, or explicit test in the method or experiments demonstrates why the dual-band DCT cues and CLIP alignment remain discriminative for future unseen generators lacking obvious frequency artifacts or semantic breaks.
- [Experiments] Experiments section (implied by performance claims): the SOTA, generalization, and robustness assertions lack visible quantitative support such as specific F1/IoU metrics, error bars, dataset splits, ablation tables on the side-adapter fusion, or failure-case analysis, preventing verification of the performance claims against baselines.
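For readers unfamiliar with the metrics named in this comment, pixel-level F1 and IoU on binary masks can be sketched as follows. The paper's thresholding and averaging protocol is not stated here, so this shows only the per-image definitions; returning 1.0 when both masks are empty is a convention assumed for this sketch.

```python
def f1_and_iou(pred, target):
    """Pixel-level F1 and IoU for binary masks given as nested lists of 0/1."""
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1  # tampered pixel correctly flagged
            elif p and not t:
                fp += 1  # authentic pixel wrongly flagged
            elif t and not p:
                fn += 1  # tampered pixel missed
    union = tp + fp + fn
    if union == 0:
        # Assumed convention: empty prediction on an empty target counts as perfect.
        return 1.0, 1.0
    f1 = 2 * tp / (2 * tp + fp + fn)
    iou = tp / union
    return f1, iou
```

Because F1 equals 2·IoU / (1 + IoU) for a single image, the two numbers carry the same per-image ranking; they diverge only after averaging across a dataset.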
minor comments (2)
- [Method] The free parameters (DCT cutoffs, contrastive temperature/margin) are noted but their selection process and sensitivity analysis are not detailed, which could be clarified for reproducibility.
- [Method] Notation for the prototype-guided decoder and frequency-gated mask could be made more explicit with equations to aid understanding of the integration step.
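As an indication of the kind of notation this comment asks for, one hypothetical way to write a prototype-guided, frequency-gated prediction for patch i is given below. Every symbol is an assumption introduced for illustration and none is taken from the manuscript: f_i is a semantic patch feature, h_i a frequency feature, p_tam and p_auth are tampered and authentic prototypes, and w, u, b, alpha are learned parameters.

```latex
\begin{align}
  s_i &= \cos\!\left(f_i, p_{\mathrm{tam}}\right) - \cos\!\left(f_i, p_{\mathrm{auth}}\right)
      && \text{(prototype-guided semantic score)} \\
  g_i &= \sigma\!\left(w^{\top} h_i + b\right)
      && \text{(frequency gate in } (0,1)\text{)} \\
  \hat{m}_i &= \sigma\!\left(\alpha\, s_i + g_i \, u^{\top} h_i\right)
      && \text{(predicted tampering probability)}
\end{align}
```

In this form the gate g_i controls how much boundary-sensitive frequency evidence is added to the semantic score, which is the integration step the comment says should be made explicit.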
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address the major comments point by point below, with honest indications of where the manuscript will be revised.
- Referee: [Abstract] Abstract: the central claim of 'strong cross-generator and cross-dataset generalization' is load-bearing but rests on empirical extrapolation; no derivation, ablation, or explicit test in the method or experiments demonstrates why the dual-band DCT cues and CLIP alignment remain discriminative for future unseen generators lacking obvious frequency artifacts or semantic breaks.
Authors: We agree that generalization claims to entirely novel future generators are necessarily empirical rather than theoretically derived. The paper demonstrates strong cross-generator and cross-dataset results on the current OpenSDI benchmark (covering multiple diffusion models) and traditional manipulation datasets, with ablations confirming the roles of the dual-band DCT module and CLIP alignment. We will revise the abstract to read 'strong cross-generator and cross-dataset generalization on existing benchmarks' and add a limitations paragraph noting that performance on future generators without frequency or semantic artifacts cannot be guaranteed. revision: partial
- Referee: [Experiments] Experiments section (implied by performance claims): the SOTA, generalization, and robustness assertions lack visible quantitative support such as specific F1/IoU metrics, error bars, dataset splits, ablation tables on the side-adapter fusion, or failure-case analysis, preventing verification of the performance claims against baselines.
Authors: The manuscript's Experiments section (Section 4) contains the requested details: Table 1 reports F1 and IoU scores with baseline comparisons on OpenSDI and traditional datasets; Table 2 shows cross-generator and cross-dataset results; Table 3 covers robustness under degradations with error bars from three independent runs; dataset splits are specified in Section 4.1; and Table 4 provides ablations including the semantic-frequency side adapter. We will revise to add a dedicated failure-case analysis subsection with additional visualizations and ensure all tables are cross-referenced more explicitly in the text. revision: partial
- No explicit test or derivation is possible for performance on future unseen generators that do not yet exist.
Circularity Check
No significant circularity in the proposed FASA framework
The paper presents FASA as an empirical engineering combination of existing components (adaptive dual-band DCT for frequency cues, frozen CLIP for patch-level contrastive semantic alignment, side adapter fusion, and prototype-guided decoder) without any mathematical derivations, first-principles predictions, or equations that reduce claimed performance to quantities defined by the paper's own fitted parameters. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the method description or abstract. The central claims rest on experimental results on OpenSDI and traditional benchmarks rather than internal self-reference, so they are checked against external benchmarks rather than against the paper's own definitions.
Axiom & Free-Parameter Ledger
free parameters (2)
- dual-band DCT frequency cutoffs
- contrastive alignment temperature and margin
axioms (2)
- domain assumption: Frozen CLIP representations encode manipulation-aware semantic priors at patch level
- domain assumption: Frequency and semantic streams can be fused without destructive interference via a side adapter