Revisiting Image Manipulation Localization under Realistic Manipulation Scenarios

Chenfan Qu; Jian Liu; Ji-Zhe Zhou; Kaiwen Feng; Liting Zhou; Xiwen Wang; Xuekang Zhu; Yunfei Wang

arxiv: 2509.20006 · v3 · submitted 2025-09-24 · 💻 cs.CV

Xuekang Zhu, Ji-Zhe Zhou, Kaiwen Feng, Chenfan Qu, Xiwen Wang, Yunfei Wang, Liting Zhou, Jian Liu

Pith reviewed 2026-05-18 14:25 UTC · model grok-4.3

classification 💻 cs.CV

keywords image manipulation localization · conditional sequence prediction · HSIM benchmark · forgery detection · hierarchical modeling · generalization · robustness · computer vision

The pith

Reformulating image manipulation localization as conditional sequence prediction captures editing hierarchies and improves generalization on complex cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing one-shot methods for image manipulation localization suffer from dimensional collapse because they compress multi-step edits into a single binary mask and discard order and structure. By recasting the task as conditional sequence prediction, the RITA framework generates manipulated-region predictions layer by layer, feeding each result forward as a condition for the next step. This preserves temporal dependencies and hierarchical relations among operations. A sympathetic reader would care because today's realistic deceptions are built through sequences of edits rather than single actions, so ignoring that structure makes detectors brittle on real data.
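The loss of order described above can be made concrete with a toy example. The sketch below is not from the paper: it assumes each edit step is represented as a set of touched pixel coordinates, and the names `collapse_to_binary_mask`, `splice`, and `retouch` are hypothetical. It shows that two different editing histories can collapse to the same binary mask.

```python
# Toy illustration (not the paper's code): each edit step is the set of pixel
# coordinates it touched. All names and layouts here are hypothetical.

def collapse_to_binary_mask(steps):
    """Union all per-step regions into one binary mask (a one-shot target)."""
    mask = set()
    for region in steps:
        mask |= region
    return mask


# Two edit regions that overlap at pixel (1, 1).
splice = {(0, 0), (0, 1), (1, 0), (1, 1)}
retouch = {(1, 1), (1, 2)}

history_a = [splice, retouch]  # splice first, then retouch
history_b = [retouch, splice]  # retouch first, then splice

# The ordered sequences differ, but the collapsed masks are identical, so a
# single binary mask cannot tell the two histories apart.
same_mask = collapse_to_binary_mask(history_a) == collapse_to_binary_mask(history_b)
same_history = history_a == history_b
print(same_mask, same_history)  # True False
```

Whether this loss of information actually causes the overfitting the paper reports is an empirical question that the toy example does not address.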

Core claim

The central claim is that image manipulation localization should be treated as a conditional sequence prediction task in which manipulated regions are predicted in ordered layers, with each step conditioned on the output of the prior step. This modeling of temporal dependencies and hierarchical structures among editing operations is enabled by synthesizing multi-step data to form the HSIM benchmark and by introducing the HSS metric to measure sequential and hierarchical alignment. Experiments establish that the resulting approach attains state-of-the-art generalization and robustness on conventional benchmarks while remaining computationally efficient.

What carries the argument

RITA, the conditional sequence-prediction framework that generates manipulated-region masks layer by layer using each prior prediction as the conditioning input for the next.
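The layer-by-layer conditioning can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict_layers` and `toy_step` are hypothetical names, `toy_step` stands in for a learned model, and stopping on an empty mask is an assumed stop rule that the source does not specify.

```python
from typing import Callable, List

Mask = List[List[int]]  # binary mask as nested lists; 1 marks a manipulated pixel


def predict_layers(image, step_fn: Callable[[object, Mask], Mask],
                   height: int, width: int, max_steps: int) -> List[Mask]:
    """Call step_fn repeatedly, feeding each predicted mask into the next call."""
    previous: Mask = [[0] * width for _ in range(height)]  # empty condition for step 0
    layers: List[Mask] = []
    for _ in range(max_steps):
        current = step_fn(image, previous)
        if not any(any(row) for row in current):
            break  # assumed stop rule: an empty mask ends the sequence
        layers.append(current)
        previous = current  # this prediction conditions the next step
    return layers


def toy_step(image, previous: Mask) -> Mask:
    """Hypothetical stand-in for a learned model: grow the mask by one column, twice."""
    filled = sum(previous[0])  # columns already marked in the first row
    if filled >= 2:
        return [[0] * len(row) for row in previous]
    return [[1 if col <= filled else 0 for col in range(len(row))] for row in previous]


layers = predict_layers(image=None, step_fn=toy_step, height=2, width=3, max_steps=5)
print(len(layers))  # 2
print(layers[-1])   # [[1, 1, 0], [1, 1, 0]]
```

The point of the sketch is only the data flow: the output of step t is the conditioning input of step t+1, so the cost grows with the number of predicted layers.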

Load-bearing premise

The multi-step manipulation sequences synthesized for the HSIM benchmark accurately reflect the hierarchical structures and temporal dependencies of real-world editing processes.

What would settle it

Evaluating RITA on a set of real multi-step manipulated images created independently of the authors' synthesis pipeline and finding that it loses its reported advantage over one-shot baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2509.20006 by Chenfan Qu, Jian Liu, Ji-Zhe Zhou, Kaiwen Feng, Liting Zhou, Xiwen Wang, Xuekang Zhu, Yunfei Wang.

**Figure 2.** Illustration of the synthetic multi-step manipulated process. (A) Sequential application of ...

**Figure 3.** Overall architecture of our proposed framework. Given the manipulated input image, the ...

**Figure 4.** Qualitative analysis of SOTA models on conventional datasets. We randomly selected ...

**Figure 5.** Qualitative results on the proposed HSIM dataset. Each row corresponds to one sample, ...

Original abstract

With the large models easing the labor-intensive manipulation process, image manipulations in today's real scenarios often entail a complex manipulation process, comprising a series of editing operations to create a deceptive image. However, existing IML methods remain manipulation-process-agnostic, directly producing localization masks in a one-shot prediction paradigm without modeling the underlying editing steps. This one-shot paradigm compresses the high-dimensional compositional space into a single binary mask, inducing severe dimensional collapse, which forces the model to discard essential structural cues and ultimately leads to overfitting and degraded generalization. To address this, we are the first to reformulate image manipulation localization as a conditional sequence prediction task, proposing the RITA framework. RITA predicts manipulated regions layer-by-layer in an ordered manner, using each step's prediction as the condition for the next, thereby explicitly modeling temporal dependencies and hierarchical structures among editing operations. To enable training and evaluation, we synthesize multi-step manipulation data and construct a new benchmark HSIM. We further propose the HSS metric to assess sequential order and hierarchical alignment. Extensive experiments show that: 1) RITA achieves SOTA generalization and robustness on traditional benchmarks; 2) it remains computationally efficient despite explicitly modeling multi-step sequences; and 3) it establishes a viable foundation for hierarchical, process-aware manipulation localization. Code and dataset are available at https://github.com/scu-zjz/RITA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RITA reframes IML as ordered sequence prediction with a new benchmark, but the gains rest on synthesized data whose match to real edits is unproven.

The main point is that this paper moves image manipulation localization away from one-shot mask output toward predicting a sequence of editing steps, with each prediction conditioning the next. They call the model RITA, build HSIM for multi-step data, and add the HSS metric to check order and hierarchy. That shift directly targets the problem that complex, multi-operation fakes are now common and that single-mask methods lose structural information in the process.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes the RITA framework, which reformulates image manipulation localization (IML) as a conditional sequence prediction task to explicitly model temporal dependencies and hierarchical structures among multi-step editing operations. It introduces the HSIM benchmark constructed from synthesized multi-step manipulation data and the HSS metric to evaluate sequential order and hierarchical alignment. The authors claim that RITA achieves SOTA generalization and robustness on traditional IML benchmarks while remaining computationally efficient despite the sequential modeling.

Significance. If the results hold, this work could meaningfully advance IML by shifting from one-shot to process-aware localization, better addressing complex real-world manipulations. The HSIM benchmark and HSS metric would provide useful new resources for the field, and the reported efficiency alongside explicit multi-step modeling would be a notable strength if reproducible and substantiated by ablations.

major comments (3)

[Abstract and data synthesis paragraph] Abstract and benchmark construction paragraph: The central generalization and robustness claims on traditional benchmarks rest on training with HSIM's synthesized multi-step sequences accurately capturing real-world hierarchical editing structures and temporal dependencies. The synthesis via ordered successive operations (e.g., splicing followed by retouching) is not validated against distributions of actual deceptive imagery, which is load-bearing for the claim that this drives improved performance rather than encoding synthesis artifacts.
[§3] §3 (RITA framework): No details are provided on the model architecture, how previous-step predictions are integrated as conditions for subsequent steps, the loss functions, or training procedure and hyperparameters. This absence directly undermines verification of the efficiency claim and the assertion that the approach avoids dimensional collapse.
[Experiments section] Experiments section (performance tables): The SOTA results lack error bars, standard deviations across runs, or statistical significance tests, making it impossible to assess whether the reported gains over baselines are reliable or could be due to variance.

minor comments (2)

[§4] The HSS metric would benefit from an explicit equation or pseudocode definition to clarify computation of order and alignment scores.
[Figures] Figure captions describing sequence predictions should include more detail on visualization conventions and what each layer represents.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We appreciate the opportunity to clarify key aspects of our work and have prepared point-by-point responses below. Revisions will be incorporated in the next version of the manuscript to address the concerns.

Referee: [Abstract and data synthesis paragraph] Abstract and benchmark construction paragraph: The central generalization and robustness claims on traditional benchmarks rest on training with HSIM's synthesized multi-step sequences accurately capturing real-world hierarchical editing structures and temporal dependencies. The synthesis via ordered successive operations (e.g., splicing followed by retouching) is not validated against distributions of actual deceptive imagery, which is load-bearing for the claim that this drives improved performance rather than encoding synthesis artifacts.

Authors: We agree that explicit validation of the synthesized sequences against real-world deceptive imagery distributions would further support the claims. Obtaining such large-scale, annotated real-world multi-step data remains difficult due to ethical and privacy constraints. Our synthesis procedure applies ordered sequences of standard editing operations drawn from established image forensics practices to create hierarchical structures. The fact that RITA trained on HSIM generalizes to real benchmarks (CASIA, NIST, etc.) indicates that the data captures useful process-aware cues beyond synthesis artifacts. In the revision we will expand the benchmark construction section with additional details on the synthesis pipeline, its design rationale, and an explicit limitations discussion acknowledging the lack of direct real-world distribution matching. revision: yes

Referee: [§3] §3 (RITA framework): No details are provided on the model architecture, how previous-step predictions are integrated as conditions for subsequent steps, the loss functions, or training procedure and hyperparameters. This absence directly undermines verification of the efficiency claim and the assertion that the approach avoids dimensional collapse.

Authors: We apologize for the insufficient architectural and training details in the submitted version. The RITA model employs a conditional transformer decoder in which each step's predicted mask is encoded and fused with image features via cross-attention to condition the next prediction. The per-step loss combines binary cross-entropy with a Dice term plus an auxiliary ordering consistency regularizer. Training uses the Adam optimizer with a cosine annealing schedule and specific hyperparameters (learning rate, batch size, number of steps). We will revise Section 3 to include a detailed architecture diagram, pseudocode for the sequential conditioning mechanism, full loss equations, and a complete hyperparameter table so that the efficiency and dimensional-collapse claims can be independently verified. revision: yes
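The per-step loss named in this response, binary cross-entropy plus a Dice term, can be sketched as follows. This is an illustrative sketch, not the authors' code: the weighting `dice_weight`, the smoothing constant, and the averaging over steps are assumptions, and the auxiliary ordering consistency regularizer is omitted because its form is not given.

```python
import math
from typing import Sequence


def bce_loss(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over flattened pixel probabilities."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def dice_loss(probs: Sequence[float], targets: Sequence[int], smooth: float = 1.0) -> float:
    """Soft Dice loss: 1 - (2|P∩T| + s) / (|P| + |T| + s)."""
    intersection = sum(p * y for p, y in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def step_loss(probs: Sequence[float], targets: Sequence[int],
              dice_weight: float = 1.0) -> float:
    """BCE + weighted Dice for one predicted layer (weight is an assumption)."""
    return bce_loss(probs, targets) + dice_weight * dice_loss(probs, targets)


def sequence_loss(per_step_probs, per_step_targets) -> float:
    """Average the per-step loss over all layers in a sequence (assumed reduction)."""
    losses = [step_loss(p, t) for p, t in zip(per_step_probs, per_step_targets)]
    return sum(losses) / len(losses)


# A perfect prediction scores lower than an uncertain one.
print(step_loss([1.0, 0.0], [1, 0]) < step_loss([0.5, 0.5], [1, 0]))  # True
```

In a real training setup these terms would be computed on tensors with automatic differentiation; plain Python lists are used here only to keep the sketch dependency-free.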

Referee: [Experiments section] Experiments section (performance tables): The SOTA results lack error bars, standard deviations across runs, or statistical significance tests, making it impossible to assess whether the reported gains over baselines are reliable or could be due to variance.

Authors: We concur that variability measures are necessary to establish the reliability of the reported improvements. In the revised manuscript we will repeat all experiments across multiple random seeds (minimum of three runs) and report mean performance together with standard deviations in the tables. We will also add paired statistical significance tests (e.g., t-tests) comparing RITA against the strongest baselines to quantify whether the observed gains are statistically meaningful. revision: yes
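The reporting described in this response (mean and standard deviation across seeds, plus a paired t-test) can be sketched with the Python standard library. The function names are hypothetical, and the numbers in the usage lines are placeholders chosen for easy arithmetic, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of one method's scores across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Paired t statistic for two methods evaluated on the same seeds.

    Compare the result against a t distribution with len(scores_a) - 1
    degrees of freedom to obtain a p-value (not computed here).
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long score lists with at least 2 runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = statistics.stdev(diffs)
    if spread == 0.0:
        raise ValueError("all paired differences are equal; t is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder values only (NOT results from the paper): three seeds per method.
placeholder_method = [3.0, 5.0, 7.0]
placeholder_baseline = [2.0, 3.0, 4.0]
print(mean_and_std(placeholder_method))
print(paired_t_statistic(placeholder_method, placeholder_baseline))
```

With only three runs the test has two degrees of freedom, so even large t values correspond to fairly wide uncertainty; more seeds would give a more informative comparison.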

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces RITA as a novel reformulation of image manipulation localization into a conditional sequence prediction task that explicitly models temporal dependencies via layer-by-layer predictions. It constructs the HSIM benchmark through synthesis of multi-step manipulations and defines the HSS metric for sequential evaluation. These are presented as independent methodological contributions. SOTA claims rest on empirical results obtained by training on HSIM and testing on separate traditional one-shot benchmarks, without any quoted equations or steps in which a prediction reduces by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation chain. The derivation chain therefore remains self-contained with external falsifiability on held-out benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The approach depends on the validity of synthesized multi-step data representing real edits and on the assumption that explicit sequence modeling captures the key failure mode of prior methods.

axioms (1)

domain assumption: Synthesized multi-step manipulation data in HSIM accurately represents real-world complex editing processes and their hierarchical structures.
Used to train the model and construct the new benchmark for evaluation.

invented entities (3)

RITA framework (no independent evidence)
purpose: To perform layer-by-layer conditional sequence prediction of manipulated regions
Core proposed model architecture

HSIM benchmark (no independent evidence)
purpose: Dataset of multi-step manipulations for training and evaluation
New synthesized dataset

HSS metric (no independent evidence)
purpose: To assess sequential order and hierarchical alignment of predictions
New evaluation metric

pith-pipeline@v0.9.0 · 5795 in / 1278 out tokens · 45985 ms · 2026-05-18T14:25:41.467965+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: reformulate image manipulation localization as a conditional sequence prediction task... progressive containment property: M_t ⊆ M_{t+1}

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: tree-structured reverse sampling... hierarchical Sequential Score (HSS)
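
The first passage quoted above mentions a progressive containment property, M_t ⊆ M_{t+1}. A minimal check of that property on a sequence of masks might look like the sketch below. It assumes masks are stored as sets of pixel coordinates and that the sequence is already ordered by step; the function name is hypothetical, and the source does not say how the paper enforces or measures this property.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]


def is_progressively_contained(masks: List[Set[Pixel]]) -> bool:
    """True when each mask is a subset of the next one (M_t ⊆ M_{t+1})."""
    return all(earlier <= later for earlier, later in zip(masks, masks[1:]))


nested = [{(0, 0)}, {(0, 0), (0, 1)}, {(0, 0), (0, 1), (1, 1)}]
broken = [{(0, 0), (2, 2)}, {(0, 0), (0, 1)}]  # pixel (2, 2) disappears at step 2

print(is_progressively_contained(nested))  # True
print(is_progressively_contained(broken))  # False
```

A sequence with a single mask, or an empty sequence, passes trivially because there is no adjacent pair to compare.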

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Venus-DeFakerOne: Unified Fake Image Detection & Localization
cs.CV · 2026-05 · unverdicted · novelty 6.0

DeFakerOne integrates InternVL2 and SAM2 into a single model that achieves state-of-the-art results on 39 detection and 9 localization benchmarks for unified fake image detection and localization.

PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks
cs.CR · 2026-05 · unverdicted · novelty 6.0

PASA is a semantic-level watermarking method for LLM text that uses embedding-space clusters and synchronized randomness to remain detectable after paraphrasing while preserving text quality.


[1] [1]

Walk in the cloud: Learning curves for point clouds shape analysis, pp

Xinru Chen, Chengbo Dong, Jiaqi Ji, Juan Cao, and Xirong Li. Image manipulation detection by multi-view multi-scale supervision. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp.\ 14165–14173, Montreal, QC, Canada, Oct 2021. IEEE. ISBN 978-1-66542-812-5. doi:10.1109/ICCV48922.2021.01392. URL https://ieeexplore.ieee.org/document/9710015/

work page doi:10.1109/iccv48922.2021.01392 2021

[2] [2]

Towards reliable identification of diffusion-based image manipulations

Alex Costanzino, Woody Bayliss, Juil Sock, Marc Gorriz Blanch, Danijela Horak, Ivan Laptev, Philip Torr, and Fabio Pizzati. Towards reliable identification of diffusion-based image manipulations. arXiv preprint arXiv:2506.05466, 2025

work page arXiv 2025

[3] [3]

URLhttps://doi.org/10.1109/chinasip.2013.6625374

Jing Dong, Wei Wang, and Tieniu Tan. Casia image tampering detection evaluation database. In 2013 IEEE China Summit and International Conference on Signal and Information Processing, pp.\ 422–426, Beijing, China, Jul 2013. IEEE. ISBN 978-1-4799-1043-4. doi:10.1109/ChinaSIP.2013.6625374. URL http://ieeexplore.ieee.org/document/6625374/

work page doi:10.1109/chinasip.2013.6625374 2013

[4] [4]

Forensichub: A unified benchmark & codebase for all- domain fake image detection and localization.arXiv preprint arXiv:2505.11003, 2025

Bo Du, Xuekang Zhu, Xiaochen Ma, Chenfan Qu, Kaiwen Feng, Zhe Yang, Chi-Man Pun, Jian Liu, and Jizhe Zhou. Forensichub: A unified benchmark & codebase for all-domain fake image detection and localization. arXiv preprint arXiv:2505.11003, 2025

work page arXiv 2025

[5] [5]

N., Delgado , A., Zhou , D., Kheyrkhah , T., Smith , J., and Fiscus , J

Haiying Guan, Mark Kozak, Eric Robertson, Yooyoung Lee, Amy N. Yates, Andrew Delgado, Daniel Zhou, Timothee Kheyrkhah, Jeff Smith, and Jonathan Fiscus. Mfc datasets: Large-scale benchmark datasets for media forensic challenge evaluation. In 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), pp.\ 63–72, Waikoloa Village, HI, USA, Jan 2019....

work page doi:10.1109/wacvw.2019.00018 2019

[6] [6]

Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization

Fabrizio Guillaro, Davide Cozzolino, Avneesh Sud, Nicholas Dufour, and Luisa Verdoliva. Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 20606--20615, 2023

work page 2023

[7] [7]

Deep residual learning for image recognition,

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 770–778, Las Vegas, NV, USA, Jun 2016. IEEE. ISBN 978-1-4673-8851-1. doi:10.1109/CVPR.2016.90. URL http://ieeexplore.ieee.org/document/7780459/

work page doi:10.1109/cvpr.2016.90 2016

[8] [8]

Detecting image splicing using geometry invariants and camera characteristics consistency

Yu-feng Hsu and Shih-fu Chang. Detecting image splicing using geometry invariants and camera characteristics consistency. In 2006 IEEE International Conference on Multimedia and Expo, pp. 549–552, Toronto, ON, Canada, Jul 2006. IEEE. ISBN 978-1-4244-0367-7. doi:10.1109/ICME.2006.262447. URL http://ieeexplore.ieee.org/document/4036658/

work page doi:10.1109/icme.2006.262447 2006

[9] [9]

Autosplice: A text-prompt manipulated image dataset for media forensics

Shan Jia, Mingzhen Huang, Zhou Zhou, Yan Ju, Jialing Cai, and Siwei Lyu. Autosplice: A text-prompt manipulated image dataset for media forensics. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 893–903, 2023

work page 2023

[10] [10]

Learning jpeg compression artifacts for image manipulation detection and localization

Myung-Joon Kwon, Seung-Hun Nam, In-Jae Yu, Heung-Kyu Lee, and Changick Kim. Learning jpeg compression artifacts for image manipulation detection and localization. International Journal of Computer Vision, 130(8): 1875–1895, 2022

work page 2022

[11] [11]

Safire: Segment any forged image region

Myung-Joon Kwon, Wonjun Lee, Seung-Hun Nam, Minji Son, and Changick Kim. Safire: Segment any forged image region. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 4437–4445, 2025

work page 2025

[12] [12]

Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization

Xiaohong Liu, Yaojie Liu, Jun Chen, and Xiaoming Liu. Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization. IEEE Transactions on Circuits and Systems for Video Technology, 32(11): 7505–7517, 2022a

work page 2022

[13] [13]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9992–10002, Montreal, QC, Canada, Oct 2021. IEEE. ISBN 978-1-66542-812-5. doi:10.1109/ICCV48922.2021.00986. URL https:...

work page doi:10.1109/iccv48922.2021.00986 2021

[14] [14]

A convnet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11976–11986, 2022b

work page 2022

[15] [15]

SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv:1608.03983, May 2017. URL http://arxiv.org/abs/1608.03983. arXiv:1608.03983 [cs, math]

work page arXiv 2017

[16] [16]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv:1711.05101, Jan 2019. URL http://arxiv.org/abs/1711.05101. arXiv:1711.05101 [cs, math]

work page arXiv 2019

[17] [17]

Iml-vit: Image manipulation localization by vision transformer

Xiaochen Ma, Bo Du, Xianggen Liu, Ahmed Y Al Hammadi, and Jizhe Zhou. Iml-vit: Image manipulation localization by vision transformer. arXiv preprint arXiv:2307.14863, 2023

work page arXiv 2023

[18] [18]

Imdl-benco: A comprehensive benchmark and codebase for image manipulation detection & localization

Xiaochen Ma, Xuekang Zhu, Lei Su, Bo Du, Zhuohang Jiang, Bingkui Tong, Zeyu Lei, Xinyu Yang, Chi-Man Pun, Jiancheng Lv, and Jizhe Zhou. Imdl-benco: A comprehensive benchmark and codebase for image manipulation detection & localization, 2024

work page 2024

[19] [19]

Imd2020: A large-scale annotated dataset tailored for detecting manipulated images

Adam Novozamsky, Babak Mahdian, and Stanislav Saic. Imd2020: A large-scale annotated dataset tailored for detecting manipulated images. In 2020 IEEE Winter Applications of Computer Vision Workshops (WACVW), pp. 71–80, Snowmass Village, CO, USA, March 2020. IEEE. ISBN 978-1-72817-162-3. doi:10.1109/WACVW50321.2020.9096940. URL https://ieeexplore.ieee.org/...

work page doi:10.1109/wacvw50321.2020.9096940 2020

[20] [20]

Can we get rid of handcrafted feature extractors? sparsevit: Nonsemantics-centered, parameter-efficient image manipulation localization through spare-coding transformer

Lei Su, Xiaochen Ma, Xuekang Zhu, Chaoqun Niu, Zeyu Lei, and Ji-Zhe Zhou. Can we get rid of handcrafted feature extractors? sparsevit: Nonsemantics-centered, parameter-efficient image manipulation localization through spare-coding transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 7024–7032, 2025

work page 2025

[21] [21]

Objectformer for image manipulation detection and localization

Junke Wang, Zuxuan Wu, Jingjing Chen, Xintong Han, Abhinav Shrivastava, Ser-Nam Lim, and Yu-Gang Jiang. Objectformer for image manipulation detection and localization. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2354–2363, New Orleans, LA, USA, Jun 2022. IEEE. ISBN 978-1-66546-946-3. doi:10.1109/CVPR52688.2022.00240...

work page doi:10.1109/cvpr52688.2022.00240 2022

[22] [22]

Coverage — a novel database for copy-move forgery detection

Bihan Wen, Ye Zhu, Ramanathan Subramanian, Tian-Tsong Ng, Xuanjing Shen, and Stefan Winkler. Coverage — a novel database for copy-move forgery detection. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 161–165, Phoenix, AZ, USA, Sep 2016. IEEE. ISBN 978-1-4673-9961-6. doi:10.1109/ICIP.2016.7532339. URL http://ieeexplore.ieee.org/doc...

work page doi:10.1109/icip.2016.7532339 2016

[23] [23]

Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features

Yue Wu et al. Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9535–9544, Long Beach, CA, USA, Jun 2019. IEEE. ISBN 978-1-72813-293-8. doi:10.1109/CVPR.2019.00977. URL https://ieeexplore.ieee.org/document/8953774/

work page doi:10.1109/cvpr.2019.00977 2019

[24] [24]

Segformer: Simple and efficient design for semantic segmentation with transformers

Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems, 34: 12077–12090, 2021

work page 2021

[25] [25]

Pre-training-free image manipulation localization through non-mutually exclusive contrastive learning

Jizhe Zhou, Xiaochen Ma, Xia Du, Ahmed Y Alhammadi, and Wentao Feng. Pre-training-free image manipulation localization through non-mutually exclusive contrastive learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22346–22356, 2023

work page 2023

[26] [26]

Mesoscopic insights: Orchestrating multi-scale & hybrid architecture for image manipulation localization

Xuekang Zhu, Xiaochen Ma, Lei Su, Zhuohang Jiang, Bo Du, Xiwen Wang, Zeyu Lei, Wentao Feng, Chi-Man Pun, and Ji-Zhe Zhou. Mesoscopic insights: Orchestrating multi-scale & hybrid architecture for image manipulation localization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 11022–11030, 2025

work page 2025
