Resolution scaling governs DINOv3 transfer performance in chest radiograph classification
Pith reviewed 2026-05-18 09:01 UTC · model grok-4.3
The pith
DINOv3 improves adult chest X-ray classification most at 512 by 512 pixels using ConvNeXt backbones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For adult chest radiograph classification, DINOv3 provides its most reliable benefit at 512 x 512 pixels, particularly with ConvNeXt-B, outperforming both DINOv2 and supervised ImageNet initialization under full fine-tuning while delivering the strongest gains on small focal and boundary-dependent abnormalities.
What carries the argument
Resolution scaling to 512 pixels with DINOv3 high-resolution adaptation on ConvNeXt-B, which together improve fine-grained feature transfer under full fine-tuning.
If this is right
- Full fine-tuning at 512 pixels with DINOv3 and ConvNeXt-B gives the best performance-cost trade-off compared with 1024-pixel inputs or parameter-efficient adaptation alone.
- External validation sets preserve the 512-pixel DINOv3 advantage for adult cohorts.
- Improvements concentrate on small focal and boundary-dependent abnormalities while large-structure findings change little.
- ConvNeXt-B stays superior to ViT-B/16 under both full and parameter-efficient adaptation.
Where Pith is reading between the lines
- If the resolution benefit generalizes, many existing 224-pixel medical imaging benchmarks may systematically undervalue newer self-supervised models.
- Repeating the same 512-pixel protocol on other body-part imaging tasks would test whether the advantage is specific to chest radiographs or a broader scaling phenomenon.
- Because label-noise experiments ruled out simple robustness as the explanation, the benefit may stem from better capture of subtle texture cues that only become visible at mid-range resolutions.
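The label-noise experiments mentioned above can be pictured with a small sketch. It assumes symmetric, independent flips of binary labels at a fixed rate. The paper's actual corruption protocol is not described on this page, so the function name `corrupt_labels`, the `flip_rate` parameter, and the seeding are hypothetical choices for illustration only.

```python
import random
from typing import List


def corrupt_labels(labels: List[List[int]], flip_rate: float, seed: int = 0) -> List[List[int]]:
    """Return a copy of a binary multi-label matrix with entries flipped at random.

    Assumption: each entry is flipped independently with probability flip_rate.
    This is an illustrative scheme, not the protocol used in the paper.
    """
    if not 0.0 <= flip_rate <= 1.0:
        raise ValueError("flip_rate must lie in [0, 1]")
    rng = random.Random(seed)  # local generator keeps runs reproducible
    return [
        [1 - value if rng.random() < flip_rate else value for value in row]
        for row in labels
    ]


clean = [[1, 0, 1], [0, 0, 1]]
noisy = corrupt_labels(clean, flip_rate=0.2, seed=42)
print(noisy)
```

Training each initialization on `noisy` labels at several flip rates and evaluating on clean test labels would show how quickly each one degrades, which is the kind of comparison a noise-robustness claim would need.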
Load-bearing premise
The seven chosen datasets and the protocol of averaging AUROC across labels after full fine-tuning are representative enough to support general statements about DINOv3 transfer performance.
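To make the averaging protocol concrete, here is a minimal sketch of mean AUROC across labels for a multi-label classifier. It assumes an unweighted mean over labels and computes each AUROC as the fraction of positive/negative pairs ranked correctly, with ties counted as one half. The function names and the toy data are hypothetical, and the paper's own evaluation code may differ.

```python
from typing import List, Sequence, Tuple


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC is undefined without both classes")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auroc(labels: List[List[int]], scores: List[List[float]]) -> Tuple[float, List[float]]:
    """Unweighted mean of per-label AUROCs; rows are images, columns are labels."""
    n_labels = len(labels[0])
    per_label = [
        auroc([row[j] for row in labels], [row[j] for row in scores])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels, per_label


# Toy example: four images, two labels (values are made up for illustration).
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.1, 0.6], [0.8, 0.3], [0.4, 0.5]]
mean_value, per_label_values = mean_auroc(toy_labels, toy_scores)
print(mean_value, per_label_values)  # 0.875 [1.0, 0.75]
```

Because the mean weights every label equally, a large gain on one rare finding and a small loss on several common ones can cancel out, which is why the premise about this summary metric matters.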
What would settle it
A new adult chest radiograph dataset in which DINOv3 at 512 pixels no longer outperforms DINOv2, or in which mean AUROC peaks at 224 pixels instead, would falsify the main claim.
Original abstract
Self-supervised learning (SSL) has improved visual representation learning, but its value in chest radiography remains uncertain. DINOv3 extends earlier SSL models through Gram-anchored self-distillation and explicit high-resolution adaptation. Whether these changes improve transfer learning for chest radiograph classification has not been established. We benchmarked DINOv3 against DINOv2 and supervised ImageNet initialization across seven chest radiograph datasets comprising 816,183 radiographs from pediatric and adult cohorts. ViT-B/16 and ConvNeXt-B were evaluated under full fine-tuning at 224 and 512 pixels, with targeted 1024 experiments on three cohorts. Additional analyses examined parameter-efficient adaptation, synthetic label corruption, external validation, frozen 7B features, and computational efficiency. The primary outcome was mean AUROC across labels. In adult cohorts, DINOv3 did not consistently outperform DINOv2 at 224 x 224 pixels, but became the strongest initialization at 512 x 512, especially with ConvNeXt-B. Gains were greatest for small focal and boundary-dependent abnormalities, whereas large-structure findings changed little. The pediatric cohort showed no significant benefit from DINOv3, higher resolution, or backbone choice. Scaling to 1024 x 1024 rarely improved performance and markedly increased computational cost. ConvNeXt-B remained superior to ViT-B/16 under both full and parameter-efficient adaptation. External validation preserved the 512 x 512 DINOv3 advantage, whereas synthetic label corruption showed that this benefit should not be interpreted simply as superior noise robustness. For adult chest radiograph classification, DINOv3 provides its most reliable benefit at 512 x 512 pixels, particularly with ConvNeXt-B. Fully adapted mid-sized models at 512 x 512 pixels provided the best performance-cost trade-off in our benchmark.
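A rough back-of-the-envelope sketch helps explain why 1024 x 1024 inputs are so much more expensive for ViT-B/16. It counts non-overlapping 16-pixel patches and compares the number of token pairs that self-attention has to consider. This is an assumption-laden simplification: it ignores class or register tokens, the MLP layers, and any efficient attention kernels, and it does not reproduce the paper's measured costs. The function names are hypothetical.

```python
def vit_patch_tokens(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping patch tokens for a square input."""
    if image_size <= 0 or patch_size <= 0 or image_size % patch_size:
        raise ValueError("image_size must be a positive multiple of patch_size")
    side = image_size // patch_size
    return side * side


def relative_attention_pairs(image_size: int, base_size: int = 224) -> float:
    """Ratio of token-pair counts (tokens squared) versus the base resolution."""
    return (vit_patch_tokens(image_size) / vit_patch_tokens(base_size)) ** 2


for size in (224, 512, 1024):
    tokens = vit_patch_tokens(size)
    ratio = relative_attention_pairs(size)
    print(f"{size}px: {tokens} patch tokens, {ratio:.1f}x token pairs vs 224px")
```

Under these assumptions the token count goes from 196 at 224 pixels to 1,024 at 512 pixels and 4,096 at 1024 pixels, so the pairwise attention term grows far faster than the pixel count. ConvNeXt-B cost grows roughly with pixel count instead, which this sketch does not model.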
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript benchmarks DINOv3 against DINOv2 and supervised ImageNet pretraining for chest radiograph classification across seven datasets (816,183 radiographs from adult and pediatric cohorts). It evaluates ViT-B/16 and ConvNeXt-B under full fine-tuning at 224x224 and 512x512, with targeted 1024x1024 experiments on three cohorts. The primary outcome is mean AUROC across labels. Key findings: in adult cohorts DINOv3 becomes the strongest initialization at 512x512 (especially with ConvNeXt-B), with the largest gains on small focal and boundary-dependent abnormalities; the pediatric cohort shows no significant benefit; 1024x1024 rarely improves performance and raises cost; ConvNeXt-B outperforms ViT-B/16; external validation preserves the 512x512 advantage, while synthetic label corruption indicates that the advantage is not simply superior noise robustness.
Significance. If the central empirical claims hold, the work supplies practical guidance on resolution and backbone choice for SSL transfer in chest radiography, showing that mid-resolution (512) with ConvNeXt-B yields the best performance-cost trade-off. Strengths include multi-dataset evaluation, external validation, and robustness checks via synthetic label corruption; these elements provide concrete evidence that DINOv3 benefits are resolution-dependent rather than uniform.
major comments (1)
- Abstract and results sections: the claim that DINOv3 provides its most reliable benefit at 512x512 (and that scaling to 1024 rarely improves performance) is based on 1024-resolution experiments limited to three cohorts, while 224 and 512 results cover all seven. Because the primary outcome is mean AUROC across adult cohorts and the optimality conclusion is asserted for the full set, any cohort-specific continued gains or reversals at 1024 would directly weaken the scaling-sweet-spot conclusion. The manuscript should either extend the 1024 experiments or qualify the claim to the tested subsets.
minor comments (2)
- Methods: additional detail on exact train/validation/test splits, hyperparameter search ranges, and the statistical procedure used to compare AUROCs across initializations would improve reproducibility and allow readers to assess whether post-hoc choices influenced the reported ordering.
- Figure clarity: ensure that error bars or confidence intervals are shown on all AUROC bar plots so that the magnitude of reported gains can be evaluated against variability.
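One common way to attach uncertainty to an AUROC comparison, as the minor comments request, is a paired percentile bootstrap over test images. The sketch below is a generic illustration for a single label and two models scored on the same cases. It is not taken from the manuscript, and the function names, the default of 1,000 resamples, and the seed are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive scores above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_ci(
    y_true: Sequence[int],
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile interval for AUROC(a) - AUROC(b) using shared resampled cases."""
    if len(set(y_true)) < 2:
        raise ValueError("y_true must contain both classes")
    rng = random.Random(seed)
    n = len(y_true)
    diffs: List[float] = []
    while len(diffs) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # sample cases with replacement
        y = [y_true[i] for i in idx]
        if len(set(y)) < 2:
            continue  # skip resamples that lost one class entirely
        a = auroc(y, [scores_a[i] for i in idx])
        b = auroc(y, [scores_b[i] for i in idx])
        diffs.append(a - b)
    diffs.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, max(lo_idx, int((1 - alpha / 2) * n_resamples) - 1))
    return diffs[lo_idx], diffs[hi_idx]


y = [1, 0, 1, 0, 1, 0, 1, 0]
model_a = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]
model_b = [0.6, 0.5, 0.4, 0.3, 0.7, 0.6, 0.5, 0.2]
print(paired_bootstrap_ci(y, model_a, model_b, n_resamples=500))
```

Resampling the same image indices for both models keeps the comparison paired, so the interval reflects the difference between initializations rather than case-mix variation. An interval that excludes zero would indicate a gain larger than resampling variability.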
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. The comment regarding the scope of the 1024-resolution experiments is well taken, and we agree that it requires a qualification of our claims to avoid overgeneralization. We address this point directly below and will incorporate the necessary revisions.
Point-by-point responses
- Referee: Abstract and results sections: the claim that DINOv3 provides its most reliable benefit at 512x512 (and that scaling to 1024 rarely improves performance) is based on 1024-resolution experiments limited to three cohorts, while 224 and 512 results cover all seven. Because the primary outcome is mean AUROC across adult cohorts and the optimality conclusion is asserted for the full set, any cohort-specific continued gains or reversals at 1024 would directly weaken the scaling-sweet-spot conclusion. The manuscript should either extend the 1024 experiments or qualify the claim to the tested subsets.
Authors: We agree with the referee that the 1024×1024 experiments were performed on only three of the seven cohorts (specifically, two adult and one pediatric dataset) owing to the substantial computational cost of full fine-tuning at this resolution. Our core finding, that DINOv3 at 512×512 with ConvNeXt-B yields the strongest performance-cost trade-off, is supported by results across all seven datasets. The statement that scaling to 1024 “rarely improved performance” is accurate for the three cohorts tested, but we acknowledge that this does not constitute evidence for the remaining four cohorts. To prevent any implication that the 1024 results apply to the full set, we will revise the abstract, results, and discussion sections to explicitly state that the 1024-resolution findings are limited to the three evaluated cohorts. We will also add a sentence noting the computational constraints that precluded 1024 experiments on the full collection. These changes will be implemented in the revised manuscript.
Revision: yes
Circularity Check
No circularity: pure empirical benchmark with measured outcomes on held-out data
The paper is an empirical benchmarking study that reports measured mean AUROC values for DINOv3, DINOv2, and supervised initializations across seven chest radiograph datasets under full fine-tuning at multiple resolutions. The central claims (DINOv3 advantage at 512x512 in adult cohorts, limited gains at 1024x1024) are direct summaries of these held-out performance numbers rather than quantities derived from equations or prior fitted parameters within the paper. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the results remain falsifiable against the external test sets and do not reduce to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: The selected chest radiograph datasets are representative of real-world clinical distributions for the evaluated tasks.
- Domain assumption: Mean AUROC across labels is a sufficient summary metric for comparing initialization quality in multi-label classification.