Efficient Document Tampering Localization with Multi-Level Discrepancy Features and Unified DCT-Quantization Embedding
Pith reviewed 2026-06-26 11:03 UTC · model grok-4.3
The pith
DiffNet achieves state-of-the-art document tampering localization on cross-domain and human-made forgeries using multi-level discrepancy features and DCT-quantization embedding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors propose DiffNet, an RGB-DCT early-fusion architecture driven by two choices: a lightweight multi-level discrepancy transformation at the output of each backbone stage that replaces features with magnitude-only responses to learned zero-sum filters, and a lightweight frequency-index-aware DCT-quantization joint embedding for the DCT-domain backbone. This yields state-of-the-art performance on cross-domain and human-made document tampering localization, outperforming prior methods by around 30 percent with up to 7 times higher throughput than the previous best model.
What carries the argument
Lightweight multi-level discrepancy transformation that replaces backbone-stage features with magnitude-only responses to learned zero-sum filters, together with frequency-index-aware DCT-quantization joint embedding.
If this is right
- The decoder aggregates multi-scale inconsistency evidence rather than operating on raw content-heavy activations.
- Performance improves on distribution shifts arising from new document sources and tampering pipelines.
- Throughput reaches up to 7 times that of the previous best model while maintaining higher accuracy.
- Results on human-made forgeries exceed those of methods tuned primarily on synthetic data.
Where Pith is reading between the lines
- The same discrepancy transformation could be tested on other subtle anomaly detection tasks such as medical image inspection.
- Emphasizing zero-sum filter responses might reduce the need for elaborate downstream fusion modules in related multimodal detection problems.
- The efficiency gains suggest the architecture could support real-time verification pipelines on modest hardware.
Load-bearing premise
That the multi-level discrepancy transformation will cause the decoder to aggregate multi-scale inconsistency evidence rather than content-heavy activations even under distribution shift from new tampering pipelines and document sources.
What would settle it
A new test set of human-made forgeries drawn from previously unseen document sources and tampering tools where the accuracy improvement over prior methods falls below 10 percent or the throughput gain disappears.
Figures
read the original abstract
Localizing document tampering is extremely challenging, as manipulations are crafted to appear visually consistent and often leave only subtle traces that are nearly invisible to the human eye. In prior work, evaluation has been largely dominated by synthetic benchmarks that closely match the training distribution, and methods have shown steady progress under this setting. However, these gains often translate poorly to human-made forgeries and to cross-domain evaluation, where both the source documents and the tampering pipeline can change, leading to a distribution shift. In addition, since the introduction of the Frequency Perception Head for the discrete cosine transform (DCT) modality, it has become a standard choice, and subsequent work has largely focused on downstream modules and fusion strategies rather than revisiting the backbone itself. To help close this gap in cross-domain performance and improve the DCT backbone design, we propose \textbf{DiffNet}, a relatively simple yet effective RGB--DCT early-fusion architecture driven by two key design choices. First, to ensure that the decoder aggregates multi-scale inconsistency evidence rather than operating on raw, content-heavy activations, we apply a lightweight multi-level discrepancy transformation at the output of each backbone stage, replacing features with magnitude-only responses to learned zero-sum filters. Second, we design an efficient DCT-domain backbone that relies on a lightweight frequency-index-aware DCT--quantization joint embedding. Our approach achieves state-of-the-art performance on cross-domain and human-made document tampering localization, outperforming prior methods by around 30\%, with up to $7\times$ higher throughput than the previous best model.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DiffNet, an RGB-DCT early-fusion architecture for document tampering localization. Key innovations include a lightweight multi-level discrepancy transformation that replaces backbone-stage features with magnitude-only responses to learned zero-sum filters, and a frequency-index-aware DCT-quantization joint embedding. It claims state-of-the-art performance on cross-domain and human-made document tampering localization, outperforming prior methods by around 30% with up to 7× higher throughput.
Significance. If the performance claims are substantiated, the work would address a recognized limitation in the field by improving generalization beyond synthetic benchmarks to more realistic human-made forgeries and cross-domain settings. The efficiency gains could also be practically relevant for deployment.
major comments (2)
- [Abstract] Abstract: The central claim of ~30% performance gains and 7× throughput on cross-domain and human-made forgeries is asserted without any reference to datasets, metrics, ablation studies, or error analysis, rendering verification of the SOTA result impossible from the provided text.
- [Abstract] Abstract: The multi-level discrepancy transformation is motivated as ensuring the decoder aggregates multi-scale inconsistency evidence rather than content-heavy activations under distribution shift, but no experimental validation, ablation, or analysis is supplied to confirm that the learned zero-sum filters and magnitude-only replacement remain effective when source documents or tampering pipelines change.
Simulated Author's Rebuttal
We thank the referee for the comments on the abstract. We agree that the abstract can be strengthened to better support the performance claims and design motivations with explicit references to the experimental setup. We will revise the abstract accordingly in the next version of the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim of ~30% performance gains and 7× throughput on cross-domain and human-made forgeries is asserted without any reference to datasets, metrics, ablation studies, or error analysis, rendering verification of the SOTA result impossible from the provided text.
Authors: We agree that the abstract would benefit from more concrete references to enable verification. The full manuscript evaluates the ~30% gains and 7× throughput on specific cross-domain and human-made forgery benchmarks (detailed in Section 4), using pixel-level F1-score and AUC metrics, with full comparisons in Tables 2-4 and error analysis in Section 4.5. We will revise the abstract to briefly cite these datasets and metrics while retaining its summary nature. revision: yes
-
Referee: [Abstract] Abstract: The multi-level discrepancy transformation is motivated as ensuring the decoder aggregates multi-scale inconsistency evidence rather than content-heavy activations under distribution shift, but no experimental validation, ablation, or analysis is supplied to confirm that the learned zero-sum filters and magnitude-only replacement remain effective when source documents or tampering pipelines change.
Authors: Section 3.2 motivates the transformation and Section 4.3 reports its effectiveness via cross-domain experiments that vary source documents and tampering pipelines, while Section 4.4 provides ablations isolating the zero-sum filters and magnitude-only responses. These results demonstrate maintained gains under distribution shift. We will revise the abstract to explicitly reference these supporting experiments. revision: yes
Circularity Check
No circularity: architecture and performance claims are independent of fitted inputs or self-citations
full rationale
The paper introduces DiffNet with two explicit design choices (multi-level discrepancy transformation using magnitude-only responses to learned zero-sum filters, and frequency-index-aware DCT-quantization embedding) and reports empirical SOTA results on cross-domain and human-made forgeries. No equations or sections reduce a claimed prediction to a fitted parameter by construction, nor does any load-bearing premise rest on a self-citation chain. The derivation chain consists of novel architectural components whose effectiveness is evaluated externally via benchmarks, satisfying the criteria for a self-contained, non-circular presentation.
Axiom & Free-Parameter Ledger
free parameters (2)
- learned zero-sum filters
- network weights and hyperparameters
axioms (1)
- domain assumption Replacing backbone features with magnitude-only responses to learned zero-sum filters causes the decoder to aggregate multi-scale inconsistency evidence rather than content-heavy activations
invented entities (2)
-
multi-level discrepancy transformation
no independent evidence
-
frequency-index-aware DCT-quantization joint embedding
no independent evidence
Reference graph
Works this paper leans on
-
[1]
In: First International Workshop on Computational Document Forensics (2017)
Artaud, C., Doucet, A., Ogier, J.M., d’Andecy, V.P.: Receipt dataset for fraud detection. In: First International Workshop on Computational Document Forensics (2017)
2017
-
[2]
In: Proceedings of the 4th ACM workshop on information hiding and multimedia security
Bayar, B., Stamm, M.C.: A deep learning approach to universal image manipula- tion detection using a new convolutional layer. In: Proceedings of the 4th ACM workshop on information hiding and multimedia security. pp. 5–10 (2016)
2016
-
[3]
IEEE Transactions on Information Forensics and Security13(11), 2691–2706 (2018)
Bayar, B., Stamm, M.C.: Constrained convolutional neural networks: A new ap- proach towards general purpose image manipulation detection. IEEE Transactions on Information Forensics and Security13(11), 2691–2706 (2018)
2018
-
[4]
In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW)
Bibi, M., Hamid, A., Moetesum, M., Siddiqi, I.: Document forgery detection using printer source identification—a text-independent approach. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). vol. 8, pp. 7–12. IEEE (2019)
2019
-
[5]
In: Proceedings of the IEEE/CVF international con- ference on computer vision
Chen, X., Dong, C., Ji, J., Cao, J., Li, X.: Image manipulation detection by multi- view multi-scale supervision. In: Proceedings of the IEEE/CVF international con- ference on computer vision. pp. 14185–14193 (2021)
2021
-
[6]
In: European Conference on Computer Vision
Chen, Z., Chen, S., Yao, T., Sun, K., Ding, S., Lin, X., Cao, L., Ji, R.: Enhancing tampered text detection through frequency feature fusion and decomposition. In: European Conference on Computer Vision. pp. 200–217. Springer (2024)
2024
- [7]
-
[8]
IEEE Transactions on Con- sumer Electronics (2024) 16 M
Dong, L., Liang, W., Wang, R.: Robust text image tampering localization via forgery traces enhancement and multiscale attention. IEEE Transactions on Con- sumer Electronics (2024) 16 M. Dhouib et al
2024
-
[9]
In: The Thirty-ninth Annual Conference on Neural Information ProcessingSystemsDatasetsandBenchmarksTrack(2025), https://openreview
Du, B., Zhu, X., Ma, X., Qu, C., Feng, K., Yang, Z., Pun, C.M., liu, J., Zhou, J.Z.: Forensichub: A unified benchmark & codebase for all-domain fake image detection and localization. In: The Thirty-ninth Annual Conference on Neural Information ProcessingSystemsDatasetsandBenchmarksTrack(2025), https://openreview. net/forum?id=IKK0mEUTfE
2025
-
[10]
IEEE Transactions on information Forensics and Security7(3), 868–882 (2012)
Fridrich, J., Kodovsky, J.: Rich models for steganalysis of digital images. IEEE Transactions on information Forensics and Security7(3), 868–882 (2012)
2012
-
[11]
In: Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition
Guillaro, F., Cozzolino, D., Sud, A., Dufour, N., Verdoliva, L.: Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization. In: Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20606–20615 (2023)
2023
-
[12]
IEEE Transactions on Information Forensics and Security 5(4), 848–856 (2010)
Huang, F., Huang, J., Shi, Y.Q.: Detecting double jpeg compression with the same quantization matrix. IEEE Transactions on Information Forensics and Security 5(4), 848–856 (2010)
2010
-
[13]
Kwon, M.J., Nam, S.H., Yu, I.J., Lee, H.K., Kim, C.: Learning jpeg compression artifacts for image manipulation detection and localization. International Journal of Computer Vision130(8), 1875–1895 (May 2022).https://doi.org/10.1007/ s11263-022-01617-5, http://dx.doi.org/10.1007/s11263-022-01617-5
-
[14]
In: 2006 International Conference on Computational Intelligence and Security
Lampert, C.H., Mei, L., Breuel, T.M.: Printing technique classification for docu- ment counterfeit detection. In: 2006 International Conference on Computational Intelligence and Security. vol. 1, pp. 639–644. IEEE (2006)
2006
-
[15]
In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Lewis, D., Agam, G., Argamon, S., Frieder, O., Grossman, D., Heard, J.: Building a test collection for complex document information processing. In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. pp. 665–666 (2006)
2006
-
[16]
IET Image Processing 19(1), e70007 (2025)
Li, L., Zhang, K., Lu, J., Zhang, S., Chu, N.: Multiclassification tampering detec- tion algorithm based on spatial-frequency fusion and swin-t. IET Image Processing 19(1), e70007 (2025)
2025
-
[17]
IEEE Transactions on Circuits and Systems for Video Technology32(11), 7505–7517 (2022)
Liu, X., Liu, Y., Chen, J., Liu, X.: Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization. IEEE Transactions on Circuits and Systems for Video Technology32(11), 7505–7517 (2022)
2022
-
[18]
Pattern Recognition157, 110828 (2025)
Luo, D., Liu, Y., Yang, R., Liu, X., Zeng, J., Zhou, Y., Bai, X.: Toward real text manipulation detection: New dataset and new solution. Pattern Recognition157, 110828 (2025)
2025
-
[19]
arXiv preprint arXiv:2307.14863 (2023)
Ma, X., Du, B., Jiang, Z., Hammadi, A.Y.A., Zhou, J.: Iml-vit: Bench- marking image manipulation localization by vision transformer. arXiv preprint arXiv:2307.14863 (2023)
-
[20]
IEEE Access (2025)
Nguyen, A.D., Kim, H.Y., Nguyen, H.N.: Taliu: A novel decoder and augmentation strategy for boosting tampered document image detection. IEEE Access (2025)
2025
-
[21]
In: Proceedings of the AAAI conference on artificial intelligence
Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: Film: Visual rea- soning with a general conditioning layer. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)
2018
-
[22]
IEEE Transactions on information forensics and security 3(2), 247–258 (2008)
Pevny, T., Fridrich, J.: Detection of double-compression in jpeg images for appli- cations in steganography. IEEE Transactions on information forensics and security 3(2), 247–258 (2008)
2008
-
[23]
In: International workshop on information hiding
Popescu, A.C., Farid, H.: Statistical tools for digital forensics. In: International workshop on information hiding. pp. 128–147. Springer (2004)
2004
-
[24]
In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition
Qu, C., Liu, C., Liu, Y., Chen, X., Peng, D., Guo, F., Jin, L.: Towards robust tampered text detection in document image: New dataset and new solution. In: ProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecog- nition. pp. 5937–5946 (2023) Efficient Document Tampering Localization 17
2023
-
[25]
arXiv preprint arXiv:2411.14823 (2024)
Qu, C., Zhong, Y., Guo, F., Jin, L.: Omni-iml: towards unified image manipulation localization. arXiv preprint arXiv:2411.14823 (2024)
-
[26]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Qu, C., Zhong, Y., Guo, F., Jin, L.: Revisiting tampered scene text detection in the era of generative ai. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 694–702 (2025)
2025
-
[27]
Journal of Electronic Imaging24(2), 023008–023008 (2015)
Shang, S., Kong, X., You, X.: Document forgery detection using distortion muta- tion of geometric parameters in characters. Journal of Electronic Imaging24(2), 023008–023008 (2015)
2015
-
[28]
ACM Transactions on Multimedia Computing, Communications and Applications 21(2), 1–24 (2025)
Song, Y., Jiang, W., Chai, X., Gan, Z., Zhou, M., Chen, L.: Cross-attention based two-branch networks for document image forgery localization in the metaverse. ACM Transactions on Multimedia Computing, Communications and Applications 21(2), 1–24 (2025)
2025
-
[29]
In: International conference on machine learning
Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning. pp. 6105–6114. PMLR (2019)
2019
-
[30]
In: International Conference on Document Analysis and Recognition
Tornés, B.M., Taburet, T., Boros, E., Rouis, K., Doucet, A., Gomez-Krämer, P., Sidere, N., d’Andecy, V.P.: Receipt dataset for document forgery detection. In: International Conference on Document Analysis and Recognition. pp. 454–469. Springer (2023)
2023
-
[31]
In: European Conference on Computer Vision
Wang, Y., Xie, H., Xing, M., Wang, J., Zhu, S., Zhang, Y.: Detecting tampered scene text in the wild. In: European Conference on Computer Vision. pp. 215–232. Springer (2022)
2022
-
[32]
Wang, Y., Zhang, B., Xie, H., Zhang, Y.: Tampered text detection via rgb and frequency relationship modeling. Chinese Journal of Network and Information Security 8(3), 29–39 (2022).https://doi.org/10.11959/j.issn.2096- 109x. 2022035, http://www.infocomm- journal.com/cjnis/CN/abstract/article_ 172502.shtml
-
[33]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Wong, K., Zhou, J., Wu, H., Si, Y.W., Zhou, J.: Adcd-net: Robust document image forgery localization via adaptive dct feature and hierarchical content disentangle- ment. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19280–19289 (2025)
2025
-
[34]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I.S., Xie, S.: Convnext v2: Co-designing and scaling convnets with masked autoencoders. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16133– 16142 (2023)
2023
- [35]
-
[36]
In: Proceedings of the AAAI conference on artificial intelligence
Zhu, X., Ma, X., Su, L., Jiang, Z., Du, B., Wang, X., Lei, Z., Feng, W., Pun, C.M., Zhou, J.Z.: Mesoscopic insights: orchestrating multi-scale & hybrid architecture for image manipulation localization. In: Proceedings of the AAAI conference on artificial intelligence. vol. 39, pp. 11022–11030 (2025) 18 M. Dhouib et al. A Custom CUDA kernel for zero-sum di...
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.