
arxiv: 2604.16083 · v1 · submitted 2026-04-17 · 💻 cs.CV

DINOv3 Beats Specialized Detectors: A Simple Foundation Model Baseline for Image Forensics

Pith reviewed 2026-05-10 09:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords baseline, lora, across, average, backbone, dinov3, fine-tuning, full

The pith

DINOv3 with LoRA adaptation and a lightweight convolutional decoder achieves higher average pixel-level F1 scores than previous state-of-the-art specialized methods for image manipulation localization on multiple benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Image forensics aims to detect and localize the regions of an image that have been altered, for example by generative models. Prior methods typically rely on complex, task-specific designs, yet they often fail to generalize across manipulation types and imaging conditions. This work instead starts from DINOv3, a large vision transformer pre-trained on massive unlabeled data through self-supervision. The authors freeze the backbone and inject low-rank adaptation (LoRA) into its QKV projections, so only a small number of parameters are updated. A lightweight convolutional decoder then outputs a per-pixel map of likely manipulated regions. Under the CAT-Net evaluation protocol, the best version improves average pixel-level F1 by 17.0 points over the previous state of the art on four standard benchmarks while training only 9.1 million parameters, and even the smallest variant beats all prior specialized detectors. Under the data-scarce MVSS-Net protocol, the LoRA version reaches an average F1 of 0.774 versus 0.530 for the strongest prior method, whereas full fine-tuning becomes highly unstable. The model also holds up well when images are degraded by Gaussian noise, JPEG re-compression, or Gaussian blur. The authors release code so that others can reproduce and build on the baseline.
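The LoRA idea in the summary above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the class name `LoRALinear`, the helper `matvec`, the toy dimensions, and the 0.01 initialization scale are all assumptions made for illustration. It only shows the general mechanism, a frozen weight plus a scaled low-rank update whose second factor starts at zero.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, W, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen pre-trained weight, never updated in this sketch
        self.scale = alpha / rank
        # A starts with small random values and B with zeros, so the
        # low-rank update contributes nothing at initialization.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank path
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_trainable(self):
        """Count only the entries of A and B; W is frozen."""
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)


# Toy example: a 2x3 frozen weight adapted with rank 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
layer = LoRALinear(W, rank=1, alpha=2.0)
x = [1.0, 2.0, 3.0]
print(layer.forward(x))       # matches the frozen output W @ x at initialization
print(layer.num_trainable())  # entries in A and B only
```

Because B starts at zero, the adapted layer reproduces the frozen layer exactly before any training, which is one plausible reading of why adaptation of this kind preserves pre-trained features rather than overwriting them.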

Core claim

Under the CAT-Net protocol, our best model improves average pixel-level F1 by 17.0 points over the previous state of the art on four standard benchmarks using only 9.1M trainable parameters on top of a frozen ViT-L backbone, and even our smallest variant surpasses all prior specialized methods.
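For readers unfamiliar with the metric in the core claim, here is a minimal sketch of pixel-level F1 on binary masks and of averaging across datasets. The function names, the zero-score convention for empty masks, and the placeholder dataset scores are assumptions for illustration; the paper's exact thresholding and averaging conventions are not stated on this page.

```python
def pixel_f1(pred_mask, gt_mask):
    """F1 over pixels of two binary masks given as equal-sized nested lists of 0/1."""
    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1      # manipulated pixel correctly flagged
            elif p and not g:
                fp += 1      # authentic pixel wrongly flagged
            elif g and not p:
                fn += 1      # manipulated pixel missed
    denom = 2 * tp + fp + fn
    # Assumed convention: score 0.0 when neither mask has positive pixels.
    return 2 * tp / denom if denom else 0.0


def average_f1(per_dataset_f1):
    """Unweighted mean of per-dataset F1 scores."""
    return sum(per_dataset_f1.values()) / len(per_dataset_f1)


pred = [[1, 1, 0],
        [0, 1, 0]]
gt = [[1, 0, 0],
      [0, 1, 1]]
print(pixel_f1(pred, gt))  # 2 true positives, 1 false positive, 1 false negative

# Hypothetical placeholder scores, not results from the paper.
print(average_f1({"dataset_a": 0.5, "dataset_b": 0.7}))
```

Under this reading, an improvement of 17.0 points corresponds to a difference of 0.170 in the averaged F1 when scores are expressed on a 0 to 1 scale.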

Load-bearing premise

That the general visual representations learned by DINOv3 on natural images already encode sufficient forensic traces of manipulations, so that minimal adaptation suffices without domain-specific pre-training or architectural priors tailored to image forensics.

Figures

Figures reproduced from arXiv: 2604.16083 by Jieming Yu, Qiuxiao Feng, Xiaochen Ma, Zhuohan Wang.

Figure 1: Overview of our framework. A frozen DINOv3 ViT backbone with LoRA injected on QKV projections produces dense patch …
Original abstract

With the rapid advancement of deep generative models, realistic fake images have become increasingly accessible, yet existing localization methods rely on complex designs and still struggle to generalize across manipulation types and imaging conditions. We present a simple but strong baseline based on DINOv3 with LoRA adaptation and a lightweight convolutional decoder. Under the CAT-Net protocol, our best model improves average pixel-level F1 by 17.0 points over the previous state of the art on four standard benchmarks using only 9.1M trainable parameters on top of a frozen ViT-L backbone, and even our smallest variant surpasses all prior specialized methods. LoRA consistently outperforms full fine-tuning across all backbone scales. Under the data-scarce MVSS-Net protocol, LoRA reaches an average F1 of 0.774 versus 0.530 for the strongest prior method, while full fine-tuning becomes highly unstable, suggesting that pre-trained representations encode forensic information that is better preserved than overwritten. The baseline also exhibits strong robustness to Gaussian noise, JPEG re-compression, and Gaussian blur. We hope this work can serve as a reliable baseline for the research community and a practical starting point for future image-forensic applications. Code is available at https://github.com/Irennnne/DINOv3-IML.
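The robustness claim in the abstract concerns images degraded before inference. The sketch below shows two of the three named degradations, Gaussian noise and Gaussian blur, on a tiny grayscale image stored as nested lists. All function names and severity values here are assumptions, and the paper's actual perturbation settings are not given on this page. JPEG re-compression is left out because it needs an image codec library.

```python
import math
import random


def add_gaussian_noise(img, sigma, seed=0):
    """Add zero-mean Gaussian noise to each pixel and clamp to [0, 255]."""
    rng = random.Random(seed)
    return [[min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in img]


def gaussian_kernel(radius, sigma):
    """Normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(img, kernel):
    """Convolve every row with the kernel, replicating border pixels."""
    radius = len(kernel) // 2
    out = []
    for row in img:
        n = len(row)
        new_row = []
        for i in range(n):
            acc = 0.0
            for k, w in enumerate(kernel):
                j = min(n - 1, max(0, i + k - radius))
                acc += w * row[j]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def gaussian_blur(img, radius, sigma):
    """Separable Gaussian blur: one horizontal pass, then one vertical pass."""
    kernel = gaussian_kernel(radius, sigma)
    horizontal = blur_rows(img, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


image = [[0.0, 128.0, 255.0],
         [10.0, 20.0, 30.0]]
noisy = add_gaussian_noise(image, sigma=25.0)            # hypothetical severity
blurred = gaussian_blur(image, radius=1, sigma=1.0)      # hypothetical severity
print(noisy)
print(blurred)
```

A robustness evaluation in this spirit would run the same model on the clean and degraded copies and compare pixel-level F1 at each severity level.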

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard transfer-learning assumptions from self-supervised vision models; no new entities are postulated and free parameters are limited to standard LoRA hyperparameters whose exact values are not detailed in the abstract.

free parameters (1)
  • LoRA rank and scaling
    Standard low-rank adaptation hyperparameters required for efficient fine-tuning; exact values not stated in abstract but implicitly fitted or chosen to achieve reported results.
axioms (1)
  • domain assumption: Frozen DINOv3 ViT-L features contain transferable information relevant to pixel-level manipulation localization
    Invoked by using a frozen backbone plus minimal adaptation rather than training a forensics-specific model from scratch.
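
To make the role of the LoRA rank in the ledger above concrete, the following sketch counts the trainable parameters LoRA adds as a function of rank. Everything specific here is an assumption: the function names are hypothetical, the hidden size of 1024 and depth of 24 are the usual ViT-L configuration rather than values stated on this page, LoRA is assumed to wrap a single fused QKV projection per layer, and the ranks shown are illustrative, since the paper's rank is not given in the abstract.

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def backbone_lora_params(hidden_dim, num_layers, rank):
    """Total LoRA parameters, assuming one fused QKV projection
    (hidden_dim -> 3 * hidden_dim) is adapted in every layer."""
    return num_layers * lora_params(hidden_dim, 3 * hidden_dim, rank)


# Assumed ViT-L style configuration; ranks are illustrative only.
for rank in (8, 16, 32):
    print(rank, backbone_lora_params(hidden_dim=1024, num_layers=24, rank=rank))
```

The count grows linearly with rank. The reported 9.1M trainable parameters also include the convolutional decoder, and the split between adapter and decoder is not stated here, so this sketch cannot recover the rank the authors used.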


