arxiv: 2606.22285 · v1 · pith:Q5DZGRQBnew · submitted 2026-06-21 · 💻 cs.CV

Efficient Document Tampering Localization with Multi-Level Discrepancy Features and Unified DCT-Quantization Embedding

Pith reviewed 2026-06-26 11:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords: document tampering localization, cross-domain evaluation, DCT embedding, discrepancy features, forgery detection, RGB-DCT fusion, multi-level features, quantization embedding

The pith

DiffNet achieves state-of-the-art document tampering localization on cross-domain and human-made forgeries using multi-level discrepancy features and DCT-quantization embedding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Document tampering often leaves only subtle traces that are hard for both humans and models to spot, particularly when forgeries are created by people or come from document sources and pipelines different from training data. Prior approaches have shown gains mainly on synthetic benchmarks that match the training distribution but translate poorly to real cases. This paper introduces DiffNet, an RGB-DCT early-fusion network that applies a lightweight multi-level discrepancy transformation at each backbone stage, replacing features with magnitude-only responses to learned zero-sum filters, and pairs it with a frequency-index-aware DCT-quantization joint embedding. The design aims to steer the decoder toward multi-scale inconsistency evidence rather than content activations. If the approach works, it would enable more reliable and faster detection in practical settings where distribution shift is common.

Core claim

The authors propose DiffNet, an RGB-DCT early-fusion architecture driven by two choices: a lightweight multi-level discrepancy transformation at the output of each backbone stage that replaces features with magnitude-only responses to learned zero-sum filters, and a lightweight frequency-index-aware DCT-quantization joint embedding for the DCT-domain backbone. This yields state-of-the-art performance on cross-domain and human-made document tampering localization, outperforming prior methods by around 30 percent with up to 7 times higher throughput than the previous best model.
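
To make the first design choice concrete, here is a minimal sketch in plain Python. It assumes that a zero-sum filter can be obtained by subtracting the mean from a kernel's weights and that "magnitude-only" means taking the absolute value of the filter response. The function names, the 3×3 kernel, and the toy feature map are illustrative and are not taken from the paper.

```python
def make_zero_sum(kernel):
    """Subtract the mean so the kernel weights sum to zero (an assumed parameterization)."""
    flat = [w for row in kernel for w in row]
    mean = sum(flat) / len(flat)
    return [[w - mean for w in row] for row in kernel]


def discrepancy_response(feature_map, kernel):
    """Valid cross-correlation with the zero-sum kernel, followed by absolute value."""
    zs = make_zero_sum(kernel)
    kh, kw = len(zs), len(zs[0])
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += zs[di][dj] * feature_map[i + di][j + dj]
            row.append(abs(acc))  # keep the magnitude, drop the sign
        out.append(row)
    return out


# Hypothetical kernel weights standing in for learned parameters.
kernel = [[0.2, -0.1, 0.4], [0.0, 1.0, -0.3], [0.5, 0.1, -0.2]]

# A locally constant region and the same region with one inconsistent value.
flat_region = [[5.0] * 5 for _ in range(5)]
spiked = [row[:] for row in flat_region]
spiked[2][2] = 9.0

flat_out = discrepancy_response(flat_region, kernel)
spiked_out = discrepancy_response(spiked, kernel)
```

Because the weights sum to zero, a locally constant region yields a response of zero (up to floating-point error), while the single inconsistent value produces a clearly nonzero magnitude. This is the sense in which such a filter responds to local discrepancies rather than to absolute content levels.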

What carries the argument

Lightweight multi-level discrepancy transformation that replaces backbone-stage features with magnitude-only responses to learned zero-sum filters, together with frequency-index-aware DCT-quantization joint embedding.
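
The page does not spell out how the frequency-index-aware DCT-quantization joint embedding is built. As a rough illustration of the inputs such an embedding would see, the sketch below pairs each quantized coefficient of an 8×8 block with its position in the block and the quantization step at that position. The tuple layout, the clipping bound, and the toy quantization table are assumptions made for illustration; only the dequantization rule (coefficient times step at the same frequency) is standard JPEG behavior.

```python
def joint_tokens(quantized_block, quant_table, clip=20):
    """Pair each quantized DCT coefficient with its frequency index and quantization step.

    Both arguments are 8x8 lists of ints. The clipping bound is an
    illustrative assumption, not a value from the paper.
    """
    tokens = []
    for u in range(8):
        for v in range(8):
            freq_index = 8 * u + v  # row-major position inside the 8x8 block
            coeff = max(-clip, min(clip, quantized_block[u][v]))
            step = quant_table[u][v]
            tokens.append((freq_index, coeff, step))
    return tokens


def dequantize(quantized_block, quant_table):
    """Standard JPEG dequantization: coefficient times the step at the same frequency."""
    return [[quantized_block[u][v] * quant_table[u][v] for v in range(8)]
            for u in range(8)]


# Toy block with three nonzero coefficients.
block = [[0] * 8 for _ in range(8)]
block[0][0], block[0][1], block[1][0] = 52, -3, 2

# Toy table for illustration only, not a real JPEG quantization table.
table = [[16 + 2 * (u + v) for v in range(8)] for u in range(8)]

tokens = joint_tokens(block, table)
restored = dequantize(block, table)
```

Keeping the frequency index alongside the coefficient and its step lets a downstream embedding treat the same integer value differently at different frequencies, which is one plausible reading of "frequency-index-aware".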

If this is right

  • The decoder aggregates multi-scale inconsistency evidence rather than operating on raw content-heavy activations.
  • Performance improves on distribution shifts arising from new document sources and tampering pipelines.
  • Throughput reaches up to 7 times that of the previous best model while maintaining higher accuracy.
  • Results on human-made forgeries exceed those of methods tuned primarily on synthetic data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same discrepancy transformation could be tested on other subtle anomaly detection tasks such as medical image inspection.
  • Emphasizing zero-sum filter responses might reduce the need for elaborate downstream fusion modules in related multimodal detection problems.
  • The efficiency gains suggest the architecture could support real-time verification pipelines on modest hardware.

Load-bearing premise

That the multi-level discrepancy transformation will cause the decoder to aggregate multi-scale inconsistency evidence rather than content-heavy activations even under distribution shift from new tampering pipelines and document sources.

What would settle it

A new test set of human-made forgeries drawn from previously unseen document sources and tampering tools where the accuracy improvement over prior methods falls below 10 percent or the throughput gain disappears.
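
The test above can be written as a small check. The sketch assumes that "accuracy improvement" is a relative gain over the baseline score and that the throughput gain "disappears" when the speedup ratio is at or below 1; the page does not state either convention. All numbers are hypothetical placeholders, not reported results.

```python
def relative_gain(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score."""
    return (new_score - baseline_score) / baseline_score


def claim_survives(new_score, baseline_score,
                   new_throughput, baseline_throughput, min_gain=0.10):
    """True if the gain stays at or above min_gain and the model is still faster."""
    gain = relative_gain(new_score, baseline_score)
    speedup = new_throughput / baseline_throughput
    return gain >= min_gain and speedup > 1.0


# Hypothetical scores and throughputs (images per second).
holds = claim_survives(0.65, 0.50, 70.0, 10.0)  # 30% relative gain, 7x speedup
fails = claim_survives(0.52, 0.50, 70.0, 10.0)  # 4% relative gain
```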

Figures

Figures reproduced from arXiv: 2606.22285 by Aymen Shabou, Mohamed Dhouib, Sonia Vanier, Ye Zhu.

Figure 1. Overview of the proposed DiffNet architecture.
Adjacent paper text (Section 3.1, Overview): Our model follows a two-branch early-fusion design. The RGB branch processes the input image with a four-stage ConvNeXt-V2 [34] backbone, producing a multi-scale feature pyramid. In parallel, the DCT branch encodes the quantized DCT coefficients together with the quantization table to produce a frequency feature map. We fuse the DCT featu…
Figure 2. Activation intensity maps before and after the proposed zero-sum filters.
Adjacent paper text (Relation to constrained convolution residual modalities): Several works in image forensics use high-pass convolutions to extract a residual signal as an additional input modality. Common examples include fixed residual features such as SRM [10] and constrained convolution layers that enforce a residual form, as in Bayar convolutions…
Figure 3. Comparison on two examples of our model and the previous state-of-the-art method, FFDN, with both models pretrained on TDoc-2.8M (Section 4.5, Qualitative analysis).
Figure 4. Illustration of hard cases where our model struggles to recover the full manipulated region.
Original abstract

Localizing document tampering is extremely challenging, as manipulations are crafted to appear visually consistent and often leave only subtle traces that are nearly invisible to the human eye. In prior work, evaluation has been largely dominated by synthetic benchmarks that closely match the training distribution, and methods have shown steady progress under this setting. However, these gains often translate poorly to human-made forgeries and to cross-domain evaluation, where both the source documents and the tampering pipeline can change, leading to a distribution shift. In addition, since the introduction of the Frequency Perception Head for the discrete cosine transform (DCT) modality, it has become a standard choice, and subsequent work has largely focused on downstream modules and fusion strategies rather than revisiting the backbone itself. To help close this gap in cross-domain performance and improve the DCT backbone design, we propose DiffNet, a relatively simple yet effective RGB-DCT early-fusion architecture driven by two key design choices. First, to ensure that the decoder aggregates multi-scale inconsistency evidence rather than operating on raw, content-heavy activations, we apply a lightweight multi-level discrepancy transformation at the output of each backbone stage, replacing features with magnitude-only responses to learned zero-sum filters. Second, we design an efficient DCT-domain backbone that relies on a lightweight frequency-index-aware DCT-quantization joint embedding. Our approach achieves state-of-the-art performance on cross-domain and human-made document tampering localization, outperforming prior methods by around 30%, with up to 7× higher throughput than the previous best model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes DiffNet, an RGB-DCT early-fusion architecture for document tampering localization. Key innovations include a lightweight multi-level discrepancy transformation that replaces backbone-stage features with magnitude-only responses to learned zero-sum filters, and a frequency-index-aware DCT-quantization joint embedding. It claims state-of-the-art performance on cross-domain and human-made document tampering localization, outperforming prior methods by around 30% with up to 7× higher throughput.

Significance. If the performance claims are substantiated, the work would address a recognized limitation in the field by improving generalization beyond synthetic benchmarks to more realistic human-made forgeries and cross-domain settings. The efficiency gains could also be practically relevant for deployment.

major comments (2)
  1. [Abstract] The central claim of ~30% performance gains and 7× throughput on cross-domain and human-made forgeries is asserted without any reference to datasets, metrics, ablation studies, or error analysis, rendering verification of the SOTA result impossible from the provided text.
  2. [Abstract] The multi-level discrepancy transformation is motivated as ensuring the decoder aggregates multi-scale inconsistency evidence rather than content-heavy activations under distribution shift, but no experimental validation, ablation, or analysis is supplied to confirm that the learned zero-sum filters and magnitude-only replacement remain effective when source documents or tampering pipelines change.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments on the abstract. We agree that the abstract can be strengthened to better support the performance claims and design motivations with explicit references to the experimental setup. We will revise the abstract accordingly in the next version of the manuscript.

point-by-point responses
  1. Referee: [Abstract] The central claim of ~30% performance gains and 7× throughput on cross-domain and human-made forgeries is asserted without any reference to datasets, metrics, ablation studies, or error analysis, rendering verification of the SOTA result impossible from the provided text.

    Authors: We agree that the abstract would benefit from more concrete references to enable verification. The full manuscript evaluates the ~30% gains and 7× throughput on specific cross-domain and human-made forgery benchmarks (detailed in Section 4), using pixel-level F1-score and AUC metrics, with full comparisons in Tables 2-4 and error analysis in Section 4.5. We will revise the abstract to briefly cite these datasets and metrics while retaining its summary nature. revision: yes

  2. Referee: [Abstract] The multi-level discrepancy transformation is motivated as ensuring the decoder aggregates multi-scale inconsistency evidence rather than content-heavy activations under distribution shift, but no experimental validation, ablation, or analysis is supplied to confirm that the learned zero-sum filters and magnitude-only replacement remain effective when source documents or tampering pipelines change.

    Authors: Section 3.2 motivates the transformation and Section 4.3 reports its effectiveness via cross-domain experiments that vary source documents and tampering pipelines, while Section 4.4 provides ablations isolating the zero-sum filters and magnitude-only responses. These results demonstrate maintained gains under distribution shift. We will revise the abstract to explicitly reference these supporting experiments. revision: yes
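
The simulated rebuttal names pixel-level F1-score as one of the metrics. For readers unfamiliar with it, the sketch below shows one common way to compute it on binary tampering masks. The function name and the convention of returning 0.0 when there are no true positives are assumptions; the paper's exact evaluation protocol (thresholding, averaging across images) is not given on this page.

```python
def pixel_f1(pred_mask, true_mask):
    """Pixel-level F1 between two binary masks given as equally sized 2-D lists."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1  # tampered pixel correctly flagged
            elif p and not t:
                fp += 1  # clean pixel wrongly flagged
            elif t and not p:
                fn += 1  # tampered pixel missed
    if tp == 0:
        return 0.0  # assumed convention when nothing is correctly flagged
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy masks: 1 marks a tampered pixel.
pred = [[0, 1, 1],
        [0, 1, 0]]
true = [[0, 1, 0],
        [0, 1, 1]]
score = pixel_f1(pred, true)  # tp=2, fp=1, fn=1
```

In the toy example, precision and recall are both 2/3, so the F1 score is also 2/3.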

Circularity Check

0 steps flagged

No circularity: architecture and performance claims are independent of fitted inputs or self-citations

full rationale

The paper introduces DiffNet with two explicit design choices (multi-level discrepancy transformation using magnitude-only responses to learned zero-sum filters, and frequency-index-aware DCT-quantization embedding) and reports empirical SOTA results on cross-domain and human-made forgeries. No equations or sections reduce a claimed prediction to a fitted parameter by construction, nor does any load-bearing premise rest on a self-citation chain. The derivation chain consists of novel architectural components whose effectiveness is evaluated externally via benchmarks, satisfying the criteria for a self-contained, non-circular presentation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

Performance claims rest on the unverified effectiveness of two newly introduced learned components whose generalization is assumed rather than demonstrated in the abstract; standard deep-learning training is the only external grounding.

free parameters (2)
  • learned zero-sum filters
    Filters are trained to produce magnitude-only discrepancy responses at each backbone stage.
  • network weights and hyperparameters
    All model parameters are fitted to training data in the usual supervised manner.
axioms (1)
  • domain assumption: Replacing backbone features with magnitude-only responses to learned zero-sum filters causes the decoder to aggregate multi-scale inconsistency evidence rather than content-heavy activations.
    Explicitly stated as the purpose of the first key design choice.
invented entities (2)
  • multi-level discrepancy transformation (no independent evidence)
    purpose: Ensure decoder aggregates multi-scale inconsistency evidence
    New component introduced to replace raw activations with magnitude responses from zero-sum filters.
  • frequency-index-aware DCT-quantization joint embedding (no independent evidence)
    purpose: Efficient DCT-domain backbone
    New embedding design for the DCT pathway.



Reference graph

Works this paper leans on

36 extracted references · 6 canonical work pages

  [1] Artaud, C., Doucet, A., Ogier, J.M., d’Andecy, V.P.: Receipt dataset for fraud detection. In: First International Workshop on Computational Document Forensics (2017)

  [2] Bayar, B., Stamm, M.C.: A deep learning approach to universal image manipulation detection using a new convolutional layer. In: Proceedings of the 4th ACM workshop on information hiding and multimedia security. pp. 5–10 (2016)

  [3] Bayar, B., Stamm, M.C.: Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection. IEEE Transactions on Information Forensics and Security 13(11), 2691–2706 (2018)

  [4] Bibi, M., Hamid, A., Moetesum, M., Siddiqi, I.: Document forgery detection using printer source identification—a text-independent approach. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). vol. 8, pp. 7–12. IEEE (2019)

  [5] Chen, X., Dong, C., Ji, J., Cao, J., Li, X.: Image manipulation detection by multi-view multi-scale supervision. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 14185–14193 (2021)

  [6] Chen, Z., Chen, S., Yao, T., Sun, K., Ding, S., Lin, X., Cao, L., Ji, R.: Enhancing tampered text detection through frequency feature fusion and decomposition. In: European Conference on Computer Vision. pp. 200–217. Springer (2024)

  [7] Dhouib, M., Buscaldi, D., Vanier, S., Shabou, A.: Leveraging contrastive learning for a similarity-guided tampered document data generation pipeline (2026), https://arxiv.org/abs/2602.17322

  [8] Dong, L., Liang, W., Wang, R.: Robust text image tampering localization via forgery traces enhancement and multiscale attention. IEEE Transactions on Consumer Electronics (2024)

  [9] Du, B., Zhu, X., Ma, X., Qu, C., Feng, K., Yang, Z., Pun, C.M., liu, J., Zhou, J.Z.: Forensichub: A unified benchmark & codebase for all-domain fake image detection and localization. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2025), https://openreview.net/forum?id=IKK0mEUTfE

  [10] Fridrich, J., Kodovsky, J.: Rich models for steganalysis of digital images. IEEE Transactions on information Forensics and Security 7(3), 868–882 (2012)

  [11] Guillaro, F., Cozzolino, D., Sud, A., Dufour, N., Verdoliva, L.: Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20606–20615 (2023)

  [12] Huang, F., Huang, J., Shi, Y.Q.: Detecting double jpeg compression with the same quantization matrix. IEEE Transactions on Information Forensics and Security 5(4), 848–856 (2010)

  [13] Kwon, M.J., Nam, S.H., Yu, I.J., Lee, H.K., Kim, C.: Learning jpeg compression artifacts for image manipulation detection and localization. International Journal of Computer Vision 130(8), 1875–1895 (May 2022). https://doi.org/10.1007/s11263-022-01617-5, http://dx.doi.org/10.1007/s11263-022-01617-5

  [14] Lampert, C.H., Mei, L., Breuel, T.M.: Printing technique classification for document counterfeit detection. In: 2006 International Conference on Computational Intelligence and Security. vol. 1, pp. 639–644. IEEE (2006)

  [15] Lewis, D., Agam, G., Argamon, S., Frieder, O., Grossman, D., Heard, J.: Building a test collection for complex document information processing. In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. pp. 665–666 (2006)

  [16] Li, L., Zhang, K., Lu, J., Zhang, S., Chu, N.: Multiclassification tampering detection algorithm based on spatial-frequency fusion and swin-t. IET Image Processing 19(1), e70007 (2025)

  [17] Liu, X., Liu, Y., Chen, J., Liu, X.: Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization. IEEE Transactions on Circuits and Systems for Video Technology 32(11), 7505–7517 (2022)

  [18] Luo, D., Liu, Y., Yang, R., Liu, X., Zeng, J., Zhou, Y., Bai, X.: Toward real text manipulation detection: New dataset and new solution. Pattern Recognition 157, 110828 (2025)

  [19] Ma, X., Du, B., Jiang, Z., Hammadi, A.Y.A., Zhou, J.: Iml-vit: Benchmarking image manipulation localization by vision transformer. arXiv preprint arXiv:2307.14863 (2023)

  [20] Nguyen, A.D., Kim, H.Y., Nguyen, H.N.: Taliu: A novel decoder and augmentation strategy for boosting tampered document image detection. IEEE Access (2025)

  [21] Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: Film: Visual reasoning with a general conditioning layer. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)

  [22] Pevny, T., Fridrich, J.: Detection of double-compression in jpeg images for applications in steganography. IEEE Transactions on information forensics and security 3(2), 247–258 (2008)

  [23] Popescu, A.C., Farid, H.: Statistical tools for digital forensics. In: International workshop on information hiding. pp. 128–147. Springer (2004)

  [24] Qu, C., Liu, C., Liu, Y., Chen, X., Peng, D., Guo, F., Jin, L.: Towards robust tampered text detection in document image: New dataset and new solution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5937–5946 (2023)

  [25] Qu, C., Zhong, Y., Guo, F., Jin, L.: Omni-iml: towards unified image manipulation localization. arXiv preprint arXiv:2411.14823 (2024)

  [26] Qu, C., Zhong, Y., Guo, F., Jin, L.: Revisiting tampered scene text detection in the era of generative ai. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 694–702 (2025)

  [27] Shang, S., Kong, X., You, X.: Document forgery detection using distortion mutation of geometric parameters in characters. Journal of Electronic Imaging 24(2), 023008–023008 (2015)

  [28] Song, Y., Jiang, W., Chai, X., Gan, Z., Zhou, M., Chen, L.: Cross-attention based two-branch networks for document image forgery localization in the metaverse. ACM Transactions on Multimedia Computing, Communications and Applications 21(2), 1–24 (2025)

  [29] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning. pp. 6105–6114. PMLR (2019)

  [30] Tornés, B.M., Taburet, T., Boros, E., Rouis, K., Doucet, A., Gomez-Krämer, P., Sidere, N., d’Andecy, V.P.: Receipt dataset for document forgery detection. In: International Conference on Document Analysis and Recognition. pp. 454–469. Springer (2023)

  [31] Wang, Y., Xie, H., Xing, M., Wang, J., Zhu, S., Zhang, Y.: Detecting tampered scene text in the wild. In: European Conference on Computer Vision. pp. 215–232. Springer (2022)

  [32] Wang, Y., Zhang, B., Xie, H., Zhang, Y.: Tampered text detection via rgb and frequency relationship modeling. Chinese Journal of Network and Information Security 8(3), 29–39 (2022). https://doi.org/10.11959/j.issn.2096-109x.2022035, http://www.infocomm-journal.com/cjnis/CN/abstract/article_172502.shtml

  [33] Wong, K., Zhou, J., Wu, H., Si, Y.W., Zhou, J.: Adcd-net: Robust document image forgery localization via adaptive dct feature and hierarchical content disentanglement. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19280–19289 (2025)

  [34] Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I.S., Xie, S.: Convnext v2: Co-designing and scaling convnets with masked autoencoders. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16133–16142 (2023)

  [35] Wu, L., Zhang, C., Liu, J., Han, J., Liu, J., Ding, E., Bai, X.: Editing text in the wild (2019), https://arxiv.org/abs/1908.03047

  [36] Zhu, X., Ma, X., Su, L., Jiang, Z., Du, B., Wang, X., Lei, Z., Feng, W., Pun, C.M., Zhou, J.Z.: Mesoscopic insights: orchestrating multi-scale & hybrid architecture for image manipulation localization. In: Proceedings of the AAAI conference on artificial intelligence. vol. 39, pp. 11022–11030 (2025)