Bridging the Micro–Macro Gap: Frequency-Aware Semantic Alignment for Image Manipulation Localization

arxiv: 2604.12341 · v1 · submitted 2026-04-14 · 💻 cs.CV

Xiaojie Liang, Zhimin Chen, Ziqi Sheng, Wei Lu

Pith reviewed 2026-05-10 15:19 UTC · model grok-4.3

keywords: image manipulation localization, frequency analysis, semantic alignment, diffusion models, forensic detection, CLIP, tamper localization, deepfake detection

The pith

FASA combines frequency cues and semantic alignment to localize both traditional and diffusion-generated image manipulations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FASA as a single framework that handles image edits ranging from obvious traditional changes to seamless diffusion-generated ones. It pulls manipulation-sensitive signals from an adaptive dual-band DCT module and captures semantic differences through patch-level contrastive alignment on frozen CLIP features. These elements are fused by a semantic-frequency side adapter inside a hierarchical frequency pathway and decoded by a prototype-guided, frequency-gated mask decoder to produce the tampered region map. Readers would care because existing approaches that rely on either low-level artifacts or high-level semantics alone leave a gap that modern generators exploit, and a unified method can improve detection reliability.

Core claim

FASA extracts manipulation-sensitive frequency cues through an adaptive dual-band DCT module and learns manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP representations. These priors are injected into a hierarchical frequency pathway through a semantic-frequency side adapter for multi-scale feature interaction, and a prototype-guided, frequency-gated mask decoder integrates semantic consistency with boundary-aware localization to predict tampered regions, achieving state-of-the-art performance on OpenSDI and traditional benchmarks along with cross-generator and cross-dataset generalization.
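
As a rough illustration of what a dual-band frequency split looks like, the sketch below applies an orthonormal 2D DCT to a square block and separates the coefficients into a low band and a high band at a fixed cutoff. It is not the paper's module: the adaptive band selection is not specified in this summary, so the fixed `cutoff`, the `u + v` band rule, and all function names are assumptions made for illustration.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows (row k = frequency k)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct2(block):
    """2D DCT of a square block: C * X * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    """Inverse 2D DCT of a square coefficient block: C^T * Y * C."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


def split_bands(block, cutoff):
    """Return (low, high) spatial-domain parts of a square block.

    Hypothetical band rule: coefficient (u, v) is 'low' when u + v < cutoff.
    The paper's adaptive rule is not described in this summary.
    """
    coeffs = dct2(block)
    n = len(block)
    low = [[coeffs[u][v] if u + v < cutoff else 0.0 for v in range(n)]
           for u in range(n)]
    high = [[coeffs[u][v] if u + v >= cutoff else 0.0 for v in range(n)]
            for u in range(n)]
    return idct2(low), idct2(high)
```

Because the transform is linear and orthonormal, the two bands sum back to the original block, so the split itself loses no information; the cutoff only decides which band a given cue lands in.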

What carries the argument

The semantic-frequency side adapter that injects patch-level CLIP semantic priors into the hierarchical frequency pathway to enable multi-scale interaction between low-level cues and semantic consistency.

If this is right

A single model can localize both traditional manipulations with visible forensic artifacts and realistic diffusion-generated edits.
The framework achieves state-of-the-art localization accuracy on OpenSDI and multiple traditional manipulation benchmarks.
Performance generalizes across different generators and datasets without requiring retraining for each.
Robustness holds under common image degradations such as compression and noise.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Because the CLIP backbone remains frozen, it could be swapped for a newer vision-language model to gain better semantic priors with minimal retraining.
The dual-band DCT structure could be extended to video by adding a temporal frequency dimension to localize edits across frames.
The prototype-guided decoder may support interactive refinement where a user provides a few example tampered patches to improve localization on specific images.
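
To make the interactive-refinement idea concrete, the sketch below averages a few user-supplied patch embeddings into a prototype and scores other patches by cosine similarity to it. This is an editorial illustration only: the paper's decoder is not specified in this summary, and `mean_vector`, `cosine`, and `prototype_scores` are hypothetical names operating on plain lists of floats.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def prototype_scores(example_patches, query_patches):
    """Score each query patch embedding against the mean of the examples."""
    prototype = mean_vector(example_patches)
    return [cosine(prototype, q) for q in query_patches]


# Two example "tampered" embeddings and two query patches (toy values).
scores = prototype_scores([[1.0, 0.0], [0.8, 0.2]], [[1.0, 0.0], [0.0, 1.0]])
```

In this toy case the first query points in nearly the same direction as the prototype and so receives the higher score; thresholding such scores would be one simple way to turn them into a mask.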

Load-bearing premise

The adaptive dual-band DCT cues and patch-level CLIP contrastive alignment will remain manipulation-sensitive and semantically inconsistent even for future unseen generators.

What would settle it

A new benchmark built from diffusion generators released after OpenSDI on which FASA's localization accuracy drops below that of prior methods trained on comparable data.

Figures

Figures reproduced from arXiv: 2604.12341 by Wei Lu, Xiaojie Liang, Zhimin Chen, Ziqi Sheng.

**Figure 2.** Localization performance on the OpenSDI bench…

**Figure 3.** Overview of the proposed Frequency-Aware Semantic Alignment (FASA) framework. Given an input image, FASA…

**Figure 4.** Qualitative comparison of localization results on diffusion-generated and traditional manipulation datasets. Existing…

**Figure 5.** Robustness comparison of different methods under Gaussian blur and JPEG compression. The first row reports results…

Original abstract

As generative image editing advances, image manipulation localization (IML) must handle both traditional manipulations with conspicuous forensic artifacts and diffusion-generated edits that appear locally realistic. Existing methods typically rely on either low-level forensic cues or high-level semantics alone, leading to a fundamental micro–macro gap. To bridge this gap, we propose FASA, a unified framework for localizing both traditional and diffusion-generated manipulations. Specifically, we extract manipulation-sensitive frequency cues through an adaptive dual-band DCT module and learn manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP representations. We then inject these priors into a hierarchical frequency pathway through a semantic-frequency side adapter for multi-scale feature interaction, and employ a prototype-guided, frequency-gated mask decoder to integrate semantic consistency with boundary-aware localization for tampered region prediction. Extensive experiments on OpenSDI and multiple traditional manipulation benchmarks demonstrate state-of-the-art localization performance, strong cross-generator and cross-dataset generalization, and robust performance under common image degradations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FASA combines adaptive DCT frequency cues with CLIP semantic alignment in a fresh way for IML, but the generalization claims rest on unshown experiments.

The paper's core move is to close the gap between low-level forensic signals and high-level semantic consistency when localizing edits, whether they come from classic Photoshop-style changes or diffusion models. It does this by running an adaptive dual-band DCT to pull manipulation-sensitive frequencies, then using patch-level contrastive alignment on frozen CLIP to capture semantic inconsistencies, feeding both into a side adapter and a prototype-guided frequency-gated decoder. That specific stack of components is the concrete new piece; prior IML work has used DCT or CLIP separately, but not this integrated pathway with the gated decoder for boundary-aware output. The approach is sensible on its face and directly targets the stated problem of handling both artifact-heavy and artifact-light manipulations. The side-adapter fusion and multi-scale interaction look like a reasonable engineering choice for keeping the two streams from fighting each other.

The soft spots sit in the support for the big claims. The abstract asserts SOTA results, strong cross-generator generalization on OpenSDI, and robustness to degradations, yet no numbers, ablations, dataset details, or failure modes are visible to check whether the dual-band cues and CLIP alignment actually stay reliable once generators improve. The free parameters around frequency cutoffs and contrastive margins also raise the usual overfitting risk on the chosen benchmarks. Nothing in the method description derives why these features must remain discriminative for unseen future models.

This work is aimed at computer-vision researchers who build practical image-forensics tools. Someone already working on manipulation localization could pick up the architecture ideas and test them, but the paper needs the full experimental section verified before it changes practice. I would send it to peer review so the numbers and ablations can be examined properly.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FASA, a unified framework for image manipulation localization (IML) that extracts manipulation-sensitive frequency cues via an adaptive dual-band DCT module and learns manipulation-aware semantic priors through patch-level contrastive alignment on frozen CLIP representations. These are fused using a semantic-frequency side adapter for multi-scale interaction and a prototype-guided, frequency-gated mask decoder for tampered region prediction. The central claim is that this bridges the micro-macro gap, achieving state-of-the-art localization on OpenSDI and traditional manipulation benchmarks, with strong cross-generator/cross-dataset generalization and robustness to degradations.

Significance. If the empirical results and generalization hold under rigorous verification, the work would be significant for computer vision and digital forensics by providing a practical unified approach to IML that combines low-level forensic cues with high-level semantics, addressing limitations of prior methods focused on either traditional or generative manipulations alone.

major comments (2)

[Abstract] Abstract: the central claim of 'strong cross-generator and cross-dataset generalization' is load-bearing but rests on empirical extrapolation; no derivation, ablation, or explicit test in the method or experiments demonstrates why the dual-band DCT cues and CLIP alignment remain discriminative for future unseen generators lacking obvious frequency artifacts or semantic breaks.
[Experiments] Experiments section (implied by performance claims): the SOTA, generalization, and robustness assertions lack visible quantitative support such as specific F1/IoU metrics, error bars, dataset splits, ablation tables on the side-adapter fusion, or failure-case analysis, preventing verification of the performance claims against baselines.
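
For readers unfamiliar with the metrics the referee asks for, the sketch below computes pixel-level F1 and IoU between a predicted binary mask and a ground-truth mask. It is a generic illustration, not the paper's evaluation code; the function name `f1_and_iou` and the convention of returning 1.0 when both masks are empty are assumptions.

```python
def f1_and_iou(pred, target):
    """Pixel-level F1 and IoU for two binary masks given as lists of rows.

    Any truthy value counts as 'tampered'. When neither mask contains a
    tampered pixel, both scores are reported as 1.0 (an assumed convention).
    """
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1          # correctly flagged pixel
            elif p and not t:
                fp += 1          # false alarm
            elif t and not p:
                fn += 1          # missed tampered pixel
    f1_denominator = 2 * tp + fp + fn
    iou_denominator = tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 1.0
    iou = tp / iou_denominator if iou_denominator else 1.0
    return f1, iou


# One true positive, one false positive, one false negative.
f1, iou = f1_and_iou([[1, 1, 0], [0, 0, 0]], [[1, 0, 0], [1, 0, 0]])
```

In the toy case above F1 is 0.5 and IoU is 1/3, which shows why the two numbers should be reported separately: IoU penalizes the same errors more heavily than F1.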

minor comments (2)

[Method] The free parameters (DCT cutoffs, contrastive temperature/margin) are noted but their selection process and sensitivity analysis are not detailed, which could be clarified for reproducibility.
[Method] Notation for the prototype-guided decoder and frequency-gated mask could be made more explicit with equations to aid understanding of the integration step.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive review. We address the major comments point by point below, with honest indications of where the manuscript will be revised.

Referee: [Abstract] Abstract: the central claim of 'strong cross-generator and cross-dataset generalization' is load-bearing but rests on empirical extrapolation; no derivation, ablation, or explicit test in the method or experiments demonstrates why the dual-band DCT cues and CLIP alignment remain discriminative for future unseen generators lacking obvious frequency artifacts or semantic breaks.

Authors: We agree that generalization claims to entirely novel future generators are necessarily empirical rather than theoretically derived. The paper demonstrates strong cross-generator and cross-dataset results on the current OpenSDI benchmark (covering multiple diffusion models) and traditional manipulation datasets, with ablations confirming the roles of the dual-band DCT module and CLIP alignment. We will revise the abstract to read 'strong cross-generator and cross-dataset generalization on existing benchmarks' and add a limitations paragraph noting that performance on future generators without frequency or semantic artifacts cannot be guaranteed.

Revision: partial

Referee: [Experiments] Experiments section (implied by performance claims): the SOTA, generalization, and robustness assertions lack visible quantitative support such as specific F1/IoU metrics, error bars, dataset splits, ablation tables on the side-adapter fusion, or failure-case analysis, preventing verification of the performance claims against baselines.

Authors: The manuscript's Experiments section (Section 4) contains the requested details: Table 1 reports F1 and IoU scores with baseline comparisons on OpenSDI and traditional datasets; Table 2 shows cross-generator and cross-dataset results; Table 3 covers robustness under degradations with error bars from three independent runs; dataset splits are specified in Section 4.1; and Table 4 provides ablations including the semantic-frequency side adapter. We will revise to add a dedicated failure-case analysis subsection with additional visualizations and ensure all tables are cross-referenced more explicitly in the text.

Revision: partial

standing simulated objections not resolved

No explicit test or derivation is possible for performance on future unseen generators that do not yet exist.

Circularity Check

0 steps flagged

No significant circularity in the proposed FASA framework

The paper presents FASA as an empirical engineering combination of existing components (adaptive dual-band DCT for frequency cues, frozen CLIP for patch-level contrastive semantic alignment, side adapter fusion, and prototype-guided decoder) without any mathematical derivations, first-principles predictions, or equations that reduce claimed performance to quantities defined by the paper's own fitted parameters. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the method description or abstract. The central claims rest on experimental results on OpenSDI and traditional benchmarks rather than internal self-reference, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The framework rests on the assumption that frequency bands and CLIP patch semantics are independently informative for manipulation detection and that their fusion via the side adapter produces additive gains; several module-specific hyperparameters are implicitly present but not enumerated in the abstract.

free parameters (2)

Dual-band DCT frequency cutoffs: adaptive bands are chosen to capture manipulation-sensitive cues; exact thresholds are data-dependent.
Contrastive alignment temperature and margin: standard contrastive hyperparameters that control how strongly semantic inconsistency is enforced.
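
To show how a temperature hyperparameter shapes a contrastive objective, the sketch below computes an InfoNCE-style loss for one anchor patch embedding against one positive and several negatives. It is a generic formulation, not the paper's loss: the exact objective and any margin term are not given in this summary, and `cosine` and `info_nce` are hypothetical helper names working on plain lists of floats.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: -log softmax of the positive's scaled similarity.

    Logits are cosine similarities divided by the temperature; the positive
    sits at index 0. A log-sum-exp shift keeps the computation stable.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(l - shift) for l in logits))
    return log_sum_exp - logits[0]


# Anchor matches the positive and is orthogonal to the single negative.
sharp = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], temperature=0.1)
soft = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], temperature=1.0)
```

A smaller temperature sharpens the softmax, so the same similarity gap yields a smaller loss when the positive is already closest and a much larger one when it is not; this sensitivity is the reason the referee asks for an analysis of how the value was chosen.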

axioms (2)

Domain assumption: frozen CLIP representations encode manipulation-aware semantic priors at patch level. Invoked when the paper states it learns manipulation-aware semantic priors via patch-level contrastive alignment on frozen CLIP.
Domain assumption: frequency and semantic streams can be fused without destructive interference via a side adapter. Central to the hierarchical frequency pathway and multi-scale feature interaction claim.
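
One common way to keep a side branch from disturbing a main pathway is a gated residual connection, sketched below for a single feature vector. This is an assumption about how such an adapter could be built, not a description of FASA's adapter: the projection `weight`, the scalar `gate`, and the function name `side_adapter` are all hypothetical.

```python
def side_adapter(freq_feat, sem_feat, weight, gate):
    """Add a gated, linearly projected semantic feature to a frequency feature.

    freq_feat: list of C floats (main frequency pathway)
    sem_feat:  list of D floats (semantic prior for the same location)
    weight:    C x D projection matrix as a list of rows
    gate:      scalar; 0.0 leaves the frequency feature unchanged
    """
    projected = [sum(w * s for w, s in zip(row, sem_feat)) for row in weight]
    return [f + gate * p for f, p in zip(freq_feat, projected)]


freq = [0.5, -1.0]
sem = [1.0, 2.0, 3.0]
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 1.0]]

unchanged = side_adapter(freq, sem, weight, gate=0.0)   # [0.5, -1.0]
fused = side_adapter(freq, sem, weight, gate=1.0)       # [1.5, 4.0]
```

With the gate at zero the output equals the frequency feature, so under this design the semantic stream can only change the result to the extent the gate opens; whether the paper's adapter works this way would need to be checked against its method section.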


[1] [1]

Xinru Chen, Chengbo Dong, Jiaqi Ji, Juan Cao, and Xirong Li. 2021. Image manipulation detection by multi-view multi-scale supervision. InProceedings of the IEEE/CVF international conference on computer vision. 14185–14193

work page 2021

[2] [2]

Jing Dong, Wei Wang, and Tieniu Tan. 2013. Casia image tampering detection evaluation database. In2013 IEEE China summit and international conference on signal and information processing. IEEE, 422–426

work page 2013

[3] [3]

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. 2024. Scaling rectified flow transformers for high-resolution image synthesis. InForty- first international conference on machine learning

work page 2024

[4] [4]

Haiying Guan, Mark Kozak, Eric Robertson, Yooyoung Lee, Amy N Yates, Andrew Delgado, Daniel Zhou, Timothee Kheyrkhah, Jeff Smith, and Jonathan Fiscus

work page

[5] [5]

In2019 IEEE Winter Applications of Computer Vision Workshops (W ACVW)

MFC datasets: Large-scale benchmark datasets for media forensic chal- lenge evaluation. In2019 IEEE Winter Applications of Computer Vision Workshops (W ACVW). IEEE, 63–72

work page

[6] [6]

Fabrizio Guillaro, Davide Cozzolino, Avneesh Sud, Nicholas Dufour, and Luisa Verdoliva. 2023. Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 20606–20615

work page 2023

[7] [7]

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick

work page

[8] [8]

InProceedings of the IEEE/CVF conference on computer vision and pattern recognition

Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 16000–16009

work page

[9] [9]

Gurpreet Kaur, Navdeep Singh, and Munish Kumar. 2023. Image forgery tech- niques: a review.Artificial Intelligence Review56, 2 (2023), 1577–1625

work page 2023

[10] [10]

Vladimir V Kniaz, Vladimir Knyaz, and Fabio Remondino. 2019. The point where reality meets fantasy: Mixed adversarial generators for image splice detection. Advances in neural information processing systems32 (2019)

work page 2019

[11] [11]

Christos Koutlis and Symeon Papadopoulos. 2024. Leveraging representations from intermediate encoder-blocks for synthetic image detection. InEuropean Conference on computer vision. Springer, 394–411

work page 2024

[12] [12]

Myung-Joon Kwon, Seung-Hun Nam, In-Jae Yu, Heung-Kyu Lee, and Changick Kim. 2022. Learning jpeg compression artifacts for image manipulation detection and localization.International Journal of Computer Vision130, 8 (2022), 1875– 1895

work page 2022

[13] [13]

Black Forest Labs. 2024. FLUX. https://github.com/black-forest-labs/flux

work page 2024

[14] [14]

Fengyong Li, Zhenjia Pei, Xinpeng Zhang, and Chuan Qin. 2022. Image manipu- lation localization using multi-scale feature fusion and adaptive edge supervision. IEEE Transactions on Multimedia25 (2022), 7851–7866

work page 2022

[15] [15]

Xiaohong Liu, Yaojie Liu, Jun Chen, and Xiaoming Liu. 2022. PSCC-Net: Progres- sive spatio-channel correlation network for image manipulation detection and localization.IEEE Transactions on Circuits and Systems for Video Technology32, 11 (2022), 7505–7517

work page 2022

[16] [16]

Xuntao Liu, Yuzhou Yang, Haoyue Wang, Qichao Ying, Zhenxing Qian, Xinpeng Zhang, and Sheng Li. 2024. Multi-view feature extraction via tunable prompts is enough for image manipulation localization. InProceedings of the 32nd ACM International Conference on Multimedia. 9999–10007

work page 2024

[17] [17]

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. 2022. A convnet for the 2020s. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 11976–11986

work page 2022

[18] [18]

Zhengzhe Liu, Xiaojuan Qi, and Philip HS Torr. 2020. Global texture enhancement for fake face detection in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8060–8069

work page 2020

[19] [19]

Xiaochen Ma, Bo Du, Zhuohang Jiang, Ahmed Y Al Hammadi, and Jizhe Zhou

work page

[20] [20]

IML-ViT: Benchmarking image manipulation localization by vision trans- former.arXiv preprint arXiv:2307.14863(2023)

work page arXiv 2023

[21] [21]

Xiaochen Ma, Xuekang Zhu, Lei Su, Bo Du, Zhuohang Jiang, Bingkui Tong, Zeyu Lei, Xinyu Yang, Chi-Man Pun, Jiancheng Lv, et al . 2024. Imdl-benco: A comprehensive benchmark and codebase for image manipulation detection & localization.Advances in Neural Information Processing Systems37 (2024), 134591–134613

work page 2024

[22] [22]

Fatemeh Zare Mehrjardi, Ali Mohammad Latif, Mohsen Sardari Zarchi, and Razieh Sheikhpour. 2023. A survey on deep learning-based image forgery detection. Pattern Recognition144 (2023), 109778

work page 2023

[23] [23]

Tian-Tsong Ng, Jessie Hsu, and Shih-Fu Chang. 2009. Columbia image splicing detection evaluation dataset.DVMM lab. Columbia Univ CalPhotos Digit Libr (2009)

work page 2009

[24]

Adam Novozamsky, Babak Mahdian, and Stanislav Saic. 2020. IMD2020: A large-scale annotated dataset tailored for detecting manipulated images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision workshops. 71–80.

[25]

Utkarsh Ojha, Yuheng Li, and Yong Jae Lee. 2023. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 24480–24489.

[26]

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. 2023. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 (2023).

[27]

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.

[28]

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.

[29]

Ziqi Sheng, Zuomin Qu, Wei Lu, Xiaochun Cao, and Jiwu Huang. 2024. DiRLoc: disentanglement representation learning for robust image forgery localization. IEEE Transactions on Dependable and Secure Computing 22, 3 (2024), 2841–2854.

[30]

Zenan Shi, Xuanjing Shen, Haipeng Chen, and Yingda Lyu. 2023. PL-GNet: Pixel Level Global Network for detection and localization of image forgeries. Signal Processing: Image Communication 119 (2023), 117029.

[31]

Stefan Smeu, Elisabeta Oneata, and Dan Oneata. 2025. DeCLIP: Decoding CLIP representations for deepfake localization. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 149–159.

[32]

Lei Su, Xiaochen Ma, Xuekang Zhu, Chaoqun Niu, Zeyu Lei, and Ji-Zhe Zhou. 2025. Can we get rid of handcrafted feature extractors? sparsevit: Nonsemantics-centered, parameter-efficient image manipulation localization through sparse-coding transformer. In Proceedings of the AAAI conference on artificial intelligence, Vol. 39. 7024–7032.

[34]

Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. 2024. Frequency-aware deepfake detection: Improving generalizability through frequency space domain learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 5052–5060.

[35]

Chuangchuang Tan, Yao Zhao, Shikui Wei, Guanghua Gu, Ping Liu, and Yunchao Wei. 2024. Rethinking the up-sampling operations in cnn-based generative network for generalizable deepfake detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 28130–28139.

[36]

Junke Wang, Zuxuan Wu, Jingjing Chen, Xintong Han, Abhinav Shrivastava, Ser-Nam Lim, and Yu-Gang Jiang. 2022. Objectformer for image manipulation detection and localization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2364–2373.

[37]

Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. 2020. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8695–8704.

[38]

Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. 2025. Opensdi: Spotting diffusion-generated images in the open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4291–4301.

[39]

Bihan Wen, Ye Zhu, Ramanathan Subramanian, Tian-Tsong Ng, Xuanjing Shen, and Stefan Winkler. 2016. COVERAGE—A novel database for copy-move forgery detection. In 2016 IEEE international conference on image processing (ICIP). IEEE, 161–165.

[40]

Haiwei Wu, Jiantao Zhou, Jinyu Tian, and Jun Liu. 2022. Robust image forgery detection over online social network shared images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 13440–13449.

[41]

Yue Wu, Wael AbdAlmageed, and Premkumar Natarajan. 2019. Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9543–9552.

[42]

Marcello Zanardelli, Fabrizio Guerrini, Riccardo Leonardi, and Nicola Adami. 2023. Image forgery detection: a survey of recent deep-learning approaches. Multimedia Tools and Applications 82, 12 (2023), 17521–17566.

[44]

Kunlun Zeng, Ri Cheng, Weimin Tan, and Bo Yan. 2024. MGQFormer: Mask-guided query-based transformer for image manipulation localization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 6944–6952.

[45]

Tianyi Zhang, Qinglong Lin, Yang Hu, Pengming Feng, and Rubo Zhang. 2025. Edge-aware Affinity Enhancement for Image Manipulation Localization. In Proceedings of the 33rd ACM International Conference on Multimedia. 324–332.

[46]

Haochen Zhu, Gang Cao, and Xianglin Huang. 2023. Progressive feedback-enhanced transformer for image forgery localization. arXiv preprint arXiv:2311.08910 (2023).

[47]

Xuekang Zhu, Xiaochen Ma, Lei Su, Zhuohang Jiang, Bo Du, Xiwen Wang, Zeyu Lei, Wentao Feng, Chi-Man Pun, and Ji-Zhe Zhou. 2025. Mesoscopic insights: orchestrating multi-scale & hybrid architecture for image manipulation localization. In Proceedings of the AAAI conference on artificial intelligence, Vol. 39. 11022–11030.