arxiv: 2604.14703 · v1 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

The Courtroom Trial of Pixels: Robust Image Manipulation Localization via Adversarial Evidence and Reinforcement Learning Judgment

Songlin Li , Zhiqing Guo , Dan Ma , Changtao Miao , Gaobo Yang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 11:54 UTC · model grok-4.3

classification 💻 cs.CV

keywords image manipulation localizationadversarial evidencereinforcement learning judgedual-hypothesis segmentationedge priorsevidence confrontationprosecution defense streams

0 comments

The pith

A courtroom-style framework with opposing evidence streams and a reinforcement learning judge localizes image manipulations more reliably than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing image manipulation localization methods treat authenticity information as an auxiliary training signal instead of explicit opposing evidence. This paper models the task as a confrontation between a prosecution stream that asserts manipulation and a defense stream that asserts authenticity, both built on a shared encoder with edge-guided fusion and suppression mechanisms. A reinforcement learning judge then refines uncertain regions through strategic re-inference, trained with advantage rewards and calibrated by entropy plus cross-hypothesis consistency. If correct, the approach yields superior average performance on benchmarks by producing more reliable masks even when traces are subtle or post-processed. A sympathetic reader cares because trustworthy localization helps verify whether an image has been altered in high-stakes settings.

Core claim

The paper establishes that regarding image manipulation localization as an adversarial trial of evidence produces better results. It introduces a dual-hypothesis segmentation architecture with a prosecution stream asserting manipulation and a defense stream asserting authenticity, generating evidence via cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement under edge priors. A reinforcement learning judge model then performs re-inference on uncertain regions using advantage-based rewards and a soft-IoU objective, with reliability calibrated through entropy and cross-hypothesis consistency.

What carries the argument

The dual-hypothesis segmentation architecture with prosecution and defense streams that generate opposing evidence through edge-guided fusion and suppression, plus the reinforcement learning judge that refines predictions on ambiguous areas.

If this is right

The model produces more reliable localization masks in ambiguous regions where manipulation traces are weak or degraded.
Explicit modeling of opposing evidence allows direct comparison of manipulated versus authentic regions rather than indirect sensitivity training.
The reinforcement learning judge improves predictions on uncertain pixels through strategic re-inference and consistency calibration.
Overall average performance exceeds current state-of-the-art image manipulation localization methods on standard evaluation datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The evidence-confrontation structure could be adapted to localize anomalies in other media such as video frames or audio waveforms.
Training the judge with entropy-based calibration might increase resistance to adversarial perturbations aimed at fooling manipulation detectors.
The framework suggests a path toward detectors that remain effective on real-world social media images whose editing history is unknown and varied.
Integrating the prosecution-defense dynamic with generative models could help produce training data that forces detectors to learn more robust distinctions.

Load-bearing premise

The prosecution and defense streams can reliably generate opposing evidence for subtle or post-processed manipulations using edge priors and bidirectional suppression so the judge can improve the output.

What would settle it

A benchmark test set of images containing only very subtle manipulations after heavy compression, noise, and other post-processing where the proposed model's localization accuracy falls below that of leading prior methods.

Figures

Figures reproduced from arXiv: 2604.14703 by Changtao Miao, Dan Ma, Gaobo Yang, Songlin Li, Zhiqing Guo.

**Figure 2.** Figure 2: Overview of our courtroom-style adjudication framework, which consists of two main stages: the courtroom debate stage and the judge’s ruling stage. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Visualization comparison of our method compared with other SOTA IML methods. GT represents ground truth. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Robustness analysis under standard perturbations on the CASIAv1 dataset. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

read the original abstract

Although some existing image manipulation localization (IML) methods incorporate authenticity-related supervision, this information is typically utilized merely as an auxiliary training signal to enhance the model's sensitivity to manipulation artifacts, rather than being explicitly modeled as localization evidence opposing the manipulated regions. Consequently, when manipulation traces are subtle or degraded by post-processing and noise, these methods struggle to explicitly compare manipulated and authentic evidence, resulting in unreliable predictions in ambiguous areas. To address these issues, we propose a courtroom-style adjudication framework that regards IML task as the confrontation of evidence followed by judgment. The framework comprises a prosecution stream, a defense stream, and a judge model. We first build a dual-hypothesis segmentation architecture on a shared multi-scale encoder, in which the prosecution stream asserts manipulation and the defense stream asserts authenticity. Guided by edge priors, it produces evidence for manipulated and authentic regions through cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement. We further develop a reinforcement learning judge model that performs strategic re-inference and refinement on uncertain regions, yielding a manipulated-region mask. The judge model is trained with advantage-based rewards and a soft-IoU objective, and reliability is calibrated via entropy and cross-hypothesis consistency. Experimental results show that our model achieves superior average performance compared with SOTA IML methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a courtroom-style framework for image manipulation localization (IML) that models the task as confrontation of evidence followed by judgment. It introduces a dual-hypothesis segmentation architecture with a prosecution stream asserting manipulation and a defense stream asserting authenticity, both built on a shared multi-scale encoder and guided by edge priors through cascaded multi-level fusion, bidirectional disagreement suppression, and dynamic debate refinement. A reinforcement learning judge model then performs strategic re-inference on uncertain regions, trained via advantage-based rewards and a soft-IoU objective while calibrated by entropy and cross-hypothesis consistency. The authors claim this yields superior average performance over state-of-the-art IML methods, particularly for subtle or post-processed manipulations.

Significance. If the dual-stream evidence generation and RL adjudication prove effective, the work could advance robust IML by explicitly contrasting manipulated versus authentic evidence rather than treating authenticity as auxiliary supervision, potentially improving localization reliability in ambiguous regimes where current methods degrade.

major comments (2)

Abstract: the central claim that the model 'achieves superior average performance compared with SOTA IML methods' is stated without any quantitative metrics, datasets, baselines, or ablation results, which is load-bearing for evaluating whether the prosecution/defense streams and RL judge deliver the promised robustness gains.
Method (prosecution and defense streams): the description of edge-prior-guided bidirectional disagreement suppression and dynamic debate refinement does not demonstrate or test that opposing manipulated/authentic evidence is reliably produced when traces are subtle or degraded by post-processing; if suppression collapses both streams to the same weak signal, the RL judge's entropy/consistency refinement has no distinct hypotheses to adjudicate, directly undermining the framework's motivation.

minor comments (2)

The manuscript would benefit from explicit equations or pseudocode defining the cascaded multi-level fusion, the advantage-based reward, and the soft-IoU objective to support reproducibility.
Clarify how the RL judge's strategic re-inference is implemented (e.g., policy network architecture or action space) beyond the high-level description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and indicate the revisions we will make.

read point-by-point responses

Referee: Abstract: the central claim that the model 'achieves superior average performance compared with SOTA IML methods' is stated without any quantitative metrics, datasets, baselines, or ablation results, which is load-bearing for evaluating whether the prosecution/defense streams and RL judge deliver the promised robustness gains.

Authors: We agree that the abstract presents the performance claim without supporting numbers. In the revised manuscript we will update the abstract to include concrete quantitative results drawn from the experimental section, specifically the average F1-score and IoU improvements across the evaluated datasets together with the primary baselines used. revision: yes
Referee: Method (prosecution and defense streams): the description of edge-prior-guided bidirectional disagreement suppression and dynamic debate refinement does not demonstrate or test that opposing manipulated/authentic evidence is reliably produced when traces are subtle or degraded by post-processing; if suppression collapses both streams to the same weak signal, the RL judge's entropy/consistency refinement has no distinct hypotheses to adjudicate, directly undermining the framework's motivation.

Authors: This concern is well-founded and directly touches the motivation for the dual-stream design. The current manuscript contains component ablations and qualitative visualizations on post-processed images, yet these do not explicitly quantify stream divergence on subtle-manipulation subsets. We will add a targeted analysis in the revision that measures disagreement between the prosecution and defense streams (via divergence metrics) on challenging subsets, reports failure cases of evidence collapse, and discusses the downstream effect on the RL judge. revision: yes

Circularity Check

0 steps flagged

No significant circularity; architectural proposal with independent experimental claims

full rationale

The paper introduces a novel courtroom-style framework consisting of prosecution and defense streams on a shared encoder, guided by edge priors with cascaded fusion, bidirectional suppression, and dynamic refinement, plus an RL judge trained via advantage rewards and soft-IoU. No equations, derivations, or fitted parameters are described that reduce by construction to the inputs (e.g., no self-definitional ratios or predictions of the same quantity used for fitting). Performance superiority is asserted via external experimental comparison to SOTA methods rather than any internal consistency loop. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. This is a standard non-circular finding for an architectural ML innovation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Ledger populated from abstract description only; no specific free parameters, axioms, or entities quantified.

axioms (1)

domain assumption Dual streams can generate reliable opposing evidence for manipulated and authentic regions via edge-guided fusion and disagreement suppression.
Core premise of the prosecution/defense architecture stated in the abstract.

invented entities (2)

Prosecution stream and defense stream no independent evidence
purpose: To assert manipulation and authenticity respectively in a dual-hypothesis setup.
New architectural components introduced to model evidence confrontation.
Reinforcement learning judge model no independent evidence
purpose: To perform strategic re-inference on uncertain regions using advantage-based rewards and soft-IoU.
Novel judge component trained with entropy and consistency calibration.

pith-pipeline@v0.9.0 · 5543 in / 1231 out tokens · 51938 ms · 2026-05-10T11:54:29.959630+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 4 canonical work pages

[1]

Pscc-net: Progressive spatio- channel correlation network for image manipulation detection and localization,

X. Liu, Y . Liu, J. Chen, and X. Liu, “Pscc-net: Progressive spatio- channel correlation network for image manipulation detection and localization,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 11, pp. 7505–7517, 2022

2022
[2]

Image manipulation localization using spatial–channel fusion excitation and fine-grained feature enhance- ment,

F. Li, H. Zhai, X. Zhang, and C. Qin, “Image manipulation localization using spatial–channel fusion excitation and fine-grained feature enhance- ment,”IEEE Transactions on Instrumentation and Measurement, vol. 73, pp. 1–14, 2024

2024
[3]

Beyond fully supervised pixel annotations: Scribble-driven weakly-supervised framework for image manipulation localization,

S. Li, G. Yu, Z. Guo, Y . Diao, D. Ma, and G. Yang, “Beyond fully supervised pixel annotations: Scribble-driven weakly-supervised framework for image manipulation localization,” inProceedings of the AAAI Conference on Artificial Intelligence, 2026

2026
[4]

From passive perception to active memory: A weakly supervised image manipulation localization framework driven by coarse-grained annotations,

Z. Guo, D. Xi, S. Li, and G. Yang, “From passive perception to active memory: A weakly supervised image manipulation localization framework driven by coarse-grained annotations,” inProceedings of the AAAI Conference on Artificial Intelligence, 2026

2026
[5]

Mvss-net: Multi- view multi-scale supervised networks for image manipulation detec- tion,

C. Dong, X. Chen, R. Hu, J. Cao, and X. Li, “Mvss-net: Multi- view multi-scale supervised networks for image manipulation detec- tion,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3539–3553, 2022

2022
[6]

Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization,

F. Guillaro, D. Cozzolino, A. Sud, N. Dufour, and L. Verdoliva, “Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 20 606–20 615

2023
[7]

Ean: Edge-aware network for image manipulation localization,

Y . Chen, H. Cheng, H. Wang, X. Liu, F. Chen, F. Li, X. Zhang, and M. Wang, “Ean: Edge-aware network for image manipulation localization,”IEEE Transactions on Circuits and Systems for Video Technology, 2024

2024
[8]

Pre-training- free image manipulation localization through non-mutually exclusive contrastive learning,

J. Zhou, X. Ma, X. Du, A. Y . Alhammadi, and W. Feng, “Pre-training- free image manipulation localization through non-mutually exclusive contrastive learning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 346–22 356

2023
[9]

At- tentive and contrastive image manipulation localization with boundary guidance,

W. Liu, H. Zhang, X. Lin, Q. Zhang, Q. Li, X. Liu, and Y . Cao, “At- tentive and contrastive image manipulation localization with boundary guidance,”IEEE Transactions on Information Forensics and Security, 2024

2024
[10]

A novel copy-move detection and location technique based on tamper detection and similarity feature fusion,

G. He, X. Zhang, F. Wang, and Z. Fu, “A novel copy-move detection and location technique based on tamper detection and similarity feature fusion,”Int. J. Auton. Adapt. Commun. Syst., vol. 17, no. 6, p. 514–529, Jan. 2024. [Online]. Available: https://doi.org/10.1504/ijaacs.2024.142523

work page doi:10.1504/ijaacs.2024.142523 2024
[11]

Deepfake detection and localisation based on illumination inconsistency,

F. Gu, Y . Dai, J. Fei, and X. Chen, “Deepfake detection and localisation based on illumination inconsistency,”Int. J. Auton. Adapt. Commun. Syst., vol. 17, no. 4, p. 352–368, Jan. 2024. [Online]. Available: https://doi.org/10.1504/ijaacs.2024.139383

work page doi:10.1504/ijaacs.2024.139383 2024
[12]

Mesoscopic insights: orchestrating multi-scale & hybrid architecture for image manipulation localization,

X. Zhu, X. Ma, L. Su, Z. Jiang, B. Du, X. Wang, Z. Lei, W. Feng, C.-M. Pun, and J.-Z. Zhou, “Mesoscopic insights: orchestrating multi-scale & hybrid architecture for image manipulation localization,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 10, 2025, pp. 11 022–11 030

2025
[13]

Auto-generating neural networks with reinforcement learning for multi-purpose image forensics,

Y . Wei, Y . Chen, X. Kang, Z. J. Wang, and L. Xiao, “Auto-generating neural networks with reinforcement learning for multi-purpose image forensics,” in2020 IEEE International Conference on Multimedia and Expo (ICME), 2020, pp. 1–6

2020
[14]

Automated design of neural network architectures with reinforcement learning for detection of global manipulations,

Y . Chen, Z. Wang, Z. J. Wang, and X. Kang, “Automated design of neural network architectures with reinforcement learning for detection of global manipulations,”IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 5, pp. 997–1011, 2020

2020
[15]

Video splicing detection and localization based on multi-level deep feature fusion and reinforcement learning,

X. Jin, Z. He, J. Xu, Y . Wang, and Y . Su, “Video splicing detection and localization based on multi-level deep feature fusion and reinforcement learning,”Multimedia Tools and Applications, vol. 81, no. 28, pp. 40 993–41 011, 2022

2022
[16]

Employing reinforcement learning to construct a decision-making environment for image forgery localization,

R. Peng, S. Tan, X. Mo, B. Li, and J. Huang, “Employing reinforcement learning to construct a decision-making environment for image forgery localization,”IEEE Transactions on Information Forensics and Security, vol. 19, pp. 4820–4834, 2024

2024
[17]

Cbam: Convolutional block attention module,

S. Woo, J. Park, J.-Y . Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” inProceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19

2018
[18]

Boundary-guided camouflaged object detection,

Y . Sun, S. Wang, C. Chen, and T.-Z. Xiang, “Boundary-guided camou- flaged object detection,”arXiv preprint arXiv:2207.00794, 2022

work page arXiv 2022
[19]

Casia image tampering detection evaluation database,

J. Dong, W. Wang, and T. Tan, “Casia image tampering detection evaluation database,” in2013 IEEE China summit and international conference on signal and information processing. IEEE, 2013, pp. 422–426

2013
[20]

Coverage—a novel database for copy-move forgery detection,

B. Wen, Y . Zhu, R. Subramanian, T.-T. Ng, X. Shen, and S. Winkler, “Coverage—a novel database for copy-move forgery detection,” in2016 IEEE international conference on image processing (ICIP). IEEE, 2016, pp. 161–165

2016
[21]

Mfc datasets: Large- scale benchmark datasets for media forensic challenge evaluation,

H. Guan, M. Kozak, E. Robertson, Y . Lee, A. N. Yates, A. Delgado, D. Zhou, T. Kheyrkhah, J. Smith, and J. Fiscus, “Mfc datasets: Large- scale benchmark datasets for media forensic challenge evaluation,” in2019 IEEE Winter Applications of Computer Vision Workshops (WACVW). IEEE, 2019, pp. 63–72

2019
[22]

Columbia uncompressed image splicing detection evaluation dataset,

J. Hsu and S. Chang, “Columbia uncompressed image splicing detection evaluation dataset,”Columbia DVMM Research Lab, vol. 6, 2006

2006
[23]

Evaluation of random field models in multi- modal unsupervised tampering localization,

P. Korus and J. Huang, “Evaluation of random field models in multi- modal unsupervised tampering localization,” in2016 IEEE international workshop on information forensics and security (WIFS). IEEE, 2016, pp. 1–6

2016
[24]

Exposing digital image forgeries by illumination color classification,

T. J. De Carvalho, C. Riess, E. Angelopoulou, H. Pedrini, and A. de Rezende Rocha, “Exposing digital image forgeries by illumination color classification,”IEEE Transactions on Information Forensics and Security, vol. 8, no. 7, pp. 1182–1194, 2013

2013
[25]

Imd2020: A large-scale annotated dataset tailored for detecting manipulated images,

A. Novozamsky, B. Mahdian, and S. Saic, “Imd2020: A large-scale annotated dataset tailored for detecting manipulated images,” in2020 IEEE Winter Applications of Computer Vision Workshops (WACVW), March 2020, pp. 71–80

2020
[26]

F 3net: fusion, feedback and focus for salient object detection,

J. Wei, S. Wang, and Q. Huang, “F 3net: fusion, feedback and focus for salient object detection,” inProceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 12 321–12 328

2020
[27]

Iml-vit: Benchmarking image manipulation localization by vision transformer,

X. Ma, B. Du, Z. Jiang, X. Du, A. Y . A. Hammadi, and J. Zhou, “Iml-vit: Benchmarking image manipulation localization by vision transformer,”
[28]

arXiv preprint arXiv:2307.14863 , year=

[Online]. Available: https://arxiv.org/abs/2307.14863

work page arXiv
[29]

Mfi- net: Multi-feature fusion identification networks for artificial intelligence manipulation,

R. Ren, Q. Hao, S. Niu, K. Xiong, J. Zhang, and M. Wang, “Mfi- net: Multi-feature fusion identification networks for artificial intelligence manipulation,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 2, pp. 1266–1280, 2023

2023
[30]

Can we get rid of handcrafted feature extractors? sparsevit: Nonsemantics- centered, parameter-efficient image manipulation localization through spare-coding transformer,

L. Su, X. Ma, X. Zhu, C. Niu, Z. Lei, and J.-Z. Zhou, “Can we get rid of handcrafted feature extractors? sparsevit: Nonsemantics- centered, parameter-efficient image manipulation localization through spare-coding transformer,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 7, 2025, pp. 7024–7032

2025
[31]

Pixel- inconsistency modeling for image manipulation localization,

C. Kong, A. Luo, S. Wang, H. Li, A. Rocha, and A. C. Kot, “Pixel- inconsistency modeling for image manipulation localization,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025
[32]

Imdl-benco: A comprehensive benchmark and codebase for image manipulation detection & localization,

X. Ma, X. Zhu, L. Su, B. Du, Z. Jiang, B. Tong, Z. Lei, X. Yang, C.-M. Pun, J. Lvet al., “Imdl-benco: A comprehensive benchmark and codebase for image manipulation detection & localization,”Advances in Neural Information Processing Systems, vol. 37, pp. 134 591–134 613, 2025

2025