arxiv: 2510.07191 · v3 · submitted 2025-10-08 · 💻 cs.CV · cs.AI · cs.LG

Resolution scaling governs DINOv3 transfer performance in chest radiograph classification

Pith reviewed 2026-05-18 09:01 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords self-supervised learning · chest radiograph classification · DINOv3 · resolution scaling · transfer learning · ConvNeXt · AUROC · medical imaging

The pith

DINOv3 improves adult chest X-ray classification most at 512 by 512 pixels using ConvNeXt backbones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The study tests DINOv3, an updated self-supervised vision model with better high-resolution handling, on seven chest radiograph collections totaling over 800,000 images from both children and adults. At the common 224-pixel size DINOv3 shows no consistent edge over DINOv2 in adult cohorts, yet at 512 pixels it becomes the strongest starting point, especially when paired with ConvNeXt-B. The extra detail helps most with small focal or boundary-dependent findings, while performance on large structures stays largely unchanged. The pediatric cohort shows no significant benefit from the newer model, the higher resolution, or the choice of backbone. Scaling further to 1024 pixels, tested on three cohorts only, rarely improves accuracy and sharply raises compute cost, which points to 512 pixels as a practical sweet spot for real medical imaging workflows.

Core claim

For adult chest radiograph classification, DINOv3 provides its most reliable benefit at 512 x 512 pixels, particularly with ConvNeXt-B, outperforming both DINOv2 and supervised ImageNet initialization under full fine-tuning while delivering the strongest gains on small focal and boundary-dependent abnormalities.

What carries the argument

Resolution scaling to 512 pixels with DINOv3 high-resolution adaptation on ConvNeXt-B, which together improve fine-grained feature transfer under full fine-tuning.

If this is right

  • Full fine-tuning at 512 pixels with DINOv3 and ConvNeXt-B gives the best performance-cost trade-off compared with 1024-pixel inputs or parameter-efficient adaptation alone.
  • External validation sets preserve the 512-pixel DINOv3 advantage for adult cohorts.
  • Improvements concentrate on small focal and boundary-dependent abnormalities while large-structure findings change little.
  • ConvNeXt-B stays superior to ViT-B/16 under both full and parameter-efficient adaptation.
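
The cost side of the trade-off in the first point can be illustrated with a back-of-envelope count. The sketch below is not taken from the paper: it assumes square inputs, non-overlapping 16-pixel patches as in ViT-B/16, and no class or register tokens. The helper names `patch_tokens` and `relative_attention_pairs` are hypothetical.

```python
# Back-of-envelope illustration (assumption, not the paper's cost model):
# how input resolution changes the number of ViT-B/16 patch tokens.

PATCH = 16  # patch side length in pixels for ViT-B/16


def patch_tokens(side: int, patch: int = PATCH) -> int:
    """Number of patch tokens for a square side x side input."""
    if side % patch != 0:
        raise ValueError("side must be divisible by the patch size")
    return (side // patch) ** 2


def relative_attention_pairs(side: int, base_side: int = 224) -> float:
    """Token-pair count in self-attention relative to the base resolution.

    Self-attention compares every token with every other token, so this
    term grows with the square of the token count.
    """
    return (patch_tokens(side) / patch_tokens(base_side)) ** 2


for side in (224, 512, 1024):
    print(side, patch_tokens(side), round(relative_attention_pairs(side), 1))
```

Under these assumptions a 224-pixel input yields 196 patch tokens, a 512-pixel input 1,024, and a 1024-pixel input 4,096. Real training cost also depends on the MLP blocks, memory traffic, and batch size, and the cost of a convolutional backbone such as ConvNeXt-B grows roughly linearly with pixel count rather than quadratically.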

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the resolution benefit generalizes, many existing 224-pixel medical imaging benchmarks may systematically undervalue newer self-supervised models.
  • Repeating the same 512-pixel protocol on other body-part imaging tasks would test whether the advantage is specific to chest radiographs or a broader scaling phenomenon.
  • Because the synthetic label-corruption experiments argue against simple noise robustness as the explanation, the benefit may instead stem from better capture of subtle texture cues that only become visible at mid-range resolutions.

Load-bearing premise

The seven chosen datasets and the protocol of averaging AUROC across labels after full fine-tuning are representative enough to support general statements about DINOv3 transfer performance.

What would settle it

A new adult chest radiograph dataset in which DINOv3 at 512 pixels no longer outperforms DINOv2 or in which accuracy peaks at 224 pixels instead would falsify the main claim.

read the original abstract

Self-supervised learning (SSL) has improved visual representation learning, but its value in chest radiography remains uncertain. DINOv3 extends earlier SSL models through Gram-anchored self-distillation and explicit high-resolution adaptation. Whether these changes improve transfer learning for chest radiograph classification has not been established. We benchmarked DINOv3 against DINOv2 and supervised ImageNet initialization across seven chest radiograph datasets comprising 816,183 radiographs from pediatric and adult cohorts. ViT-B/16 and ConvNeXt-B were evaluated under full fine-tuning at 224 and 512 pixels, with targeted 1024 experiments on three cohorts. Additional analyses examined parameter-efficient adaptation, synthetic label corruption, external validation, frozen 7B features, and computational efficiency. The primary outcome was mean AUROC across labels. In adult cohorts, DINOv3 did not consistently outperform DINOv2 at 224 x 224 pixels, but became the strongest initialization at 512 x 512, especially with ConvNeXt-B. Gains were greatest for small focal and boundary-dependent abnormalities, whereas large-structure findings changed little. The pediatric cohort showed no significant benefit from DINOv3, higher resolution, or backbone choice. Scaling to 1024 x 1024 rarely improved performance and markedly increased computational cost. ConvNeXt-B remained superior to ViT-B/16 under both full and parameter-efficient adaptation. External validation preserved the 512 x 512 DINOv3 advantage, whereas synthetic label corruption showed that this benefit should not be interpreted simply as superior noise robustness. For adult chest radiograph classification, DINOv3 provides its most reliable benefit at 512 x 512 pixels, particularly with ConvNeXt-B. Fully adapted mid-sized models at 512 x 512 pixels provided the best performance-cost trade-off in our benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript benchmarks DINOv3 against DINOv2 and supervised ImageNet pretraining for chest radiograph classification across seven datasets (816,183 radiographs, adult and pediatric cohorts). It evaluates ViT-B/16 and ConvNeXt-B under full fine-tuning at 224x224 and 512x512, with targeted 1024x1024 experiments on three cohorts. The primary outcome is mean AUROC across labels. Key findings: in adult cohorts DINOv3 becomes the strongest initialization at 512x512 (especially with ConvNeXt-B), with the largest gains on small focal and boundary-dependent abnormalities; the pediatric cohort shows no significant benefit; 1024x1024 rarely improves performance and raises cost; ConvNeXt-B outperforms ViT-B/16; external validation preserves the 512x512 advantage, while synthetic label corruption indicates that this advantage is not simply superior noise robustness.

Significance. If the central empirical claims hold, the work supplies practical guidance on resolution and backbone choice for SSL transfer in chest radiography, showing that mid-resolution (512) with ConvNeXt-B yields the best performance-cost trade-off. Strengths include multi-dataset evaluation, external validation, and robustness checks via synthetic label corruption; these elements provide concrete evidence that DINOv3 benefits are resolution-dependent rather than uniform.
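
A synthetic label corruption check of the kind mentioned here can be set up by flipping each binary label independently with a fixed probability before training. This is a sketch only: the material above does not describe the paper's corruption scheme, so symmetric flipping, the function name `corrupt_labels`, and the `flip_prob` and `seed` parameters are all assumptions.

```python
import random
from typing import List


def corrupt_labels(labels: List[List[int]], flip_prob: float, seed: int = 0) -> List[List[int]]:
    """Return a copy of a multi-label matrix with symmetric flip noise.

    Rows are images, columns are binary labels. Each entry is flipped
    independently with probability flip_prob (assumed noise model).
    """
    if not 0.0 <= flip_prob <= 1.0:
        raise ValueError("flip_prob must lie in [0, 1]")
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    return [
        [1 - y if rng.random() < flip_prob else y for y in row]
        for row in labels
    ]


clean = [[1, 0, 0], [0, 1, 1], [0, 0, 0]]
noisy = corrupt_labels(clean, flip_prob=0.2, seed=42)
```

Training one model per initialization on `noisy` labels at several values of `flip_prob`, then evaluating on clean test labels, shows how quickly each initialization degrades as noise increases.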

major comments (1)
  1. Abstract and results sections: the claim that DINOv3 provides its most reliable benefit at 512x512 (and that scaling to 1024 rarely improves performance) is based on 1024-resolution experiments limited to three cohorts, while 224 and 512 results cover all seven. Because the primary outcome is mean AUROC across labels and the optimality conclusion is asserted for the full set, any cohort-specific continued gains or reversals at 1024 would directly weaken the scaling-sweet-spot conclusion. The manuscript should either extend the 1024 experiments or qualify the claim to the tested subsets.
minor comments (2)
  1. Methods: additional detail on exact train/validation/test splits, hyperparameter search ranges, and the statistical procedure used to compare AUROCs across initializations would improve reproducibility and allow readers to assess whether post-hoc choices influenced the reported ordering.
  2. Figure clarity: ensure that error bars or confidence intervals are shown on all AUROC bar plots so that the magnitude of reported gains can be evaluated against variability.
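
The confidence intervals requested in the second minor comment could be obtained with a percentile bootstrap over test images. The sketch below is illustrative and is not the authors' statistical procedure; the function names, the default of 1,000 resamples, and the choice to skip resamples that lack one of the two classes are assumptions.

```python
import random
from typing import Sequence, Tuple


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Binary AUROC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(
    y_true: Sequence[int],
    y_score: Sequence[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for AUROC, resampling images with replacement."""
    n = len(y_true)
    if not 0 < sum(y_true) < n:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        if 0 < sum(yt) < n:  # skip resamples that contain only one class
            stats.append(auroc(yt, [y_score[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * (n_resamples - 1))]
    upper = stats[int((1 - alpha / 2) * (n_resamples - 1))]
    return lower, upper
```

Calling `bootstrap_ci(y_true, y_score)` once per label and per initialization gives the intervals that could be drawn as error bars on the AUROC bar plots.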

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comment regarding the scope of the 1024-resolution experiments is well taken, and we agree that it requires a qualification of our claims to avoid overgeneralization. We address this point directly below and will incorporate the necessary revisions.

read point-by-point responses
  1. Referee: Abstract and results sections: the claim that DINOv3 provides its most reliable benefit at 512x512 (and that scaling to 1024 rarely improves performance) is based on 1024-resolution experiments limited to three cohorts, while 224 and 512 results cover all seven. Because the primary outcome is mean AUROC across labels and the optimality conclusion is asserted for the full set, any cohort-specific continued gains or reversals at 1024 would directly weaken the scaling-sweet-spot conclusion. The manuscript should either extend the 1024 experiments or qualify the claim to the tested subsets.

    Authors: We agree with the referee that the 1024×1024 experiments were performed on only three of the seven cohorts (specifically, two adult and one pediatric dataset) owing to the substantial computational cost of full fine-tuning at this resolution. Our core finding, that DINOv3 at 512×512 with ConvNeXt-B yields the strongest performance-cost trade-off, is supported by results across all seven datasets. The statement that scaling to 1024 “rarely improved performance” is accurate for the three cohorts tested, but we acknowledge that this does not constitute evidence for the remaining four cohorts. To prevent any implication that the 1024 results apply to the full set, we will revise the abstract, results, and discussion sections to explicitly state that the 1024-resolution findings are limited to the three evaluated cohorts. We will also add a sentence noting the computational constraints that precluded 1024 experiments on the full collection. These changes will be implemented in the revised manuscript.

    revision: yes

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark with measured outcomes on held-out data

full rationale

The paper is an empirical benchmarking study that reports measured mean AUROC values for DINOv3, DINOv2, and supervised initializations across seven chest radiograph datasets under full fine-tuning at multiple resolutions. The central claims (DINOv3 advantage at 512x512 in adult cohorts, limited gains at 1024x1024) are direct summaries of these held-out performance numbers rather than quantities derived from equations or prior fitted parameters within the paper. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the results remain falsifiable against the external test sets and do not reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

This is an empirical benchmarking study whose central claim rests on standard transfer-learning assumptions rather than new axioms or invented entities.

axioms (2)
  • domain assumption: The selected chest radiograph datasets are representative of real-world clinical distributions for the evaluated tasks.
    Invoked implicitly when generalizing from the seven cohorts to broader claims about adult and pediatric performance.
  • domain assumption: Mean AUROC across labels is a sufficient summary metric for comparing initialization quality in multi-label classification.
    Used as the primary outcome without further justification in the abstract.
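
To make the second assumption concrete, the sketch below computes one AUROC per label and takes their unweighted mean. It is a minimal illustration: the abstract only says "mean AUROC across labels", so the unweighted (macro) average, the rank-based estimator, and the function names are assumptions.

```python
from typing import List, Sequence


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Binary AUROC from the Mann-Whitney U statistic, using average ranks for ties."""
    n = len(y_true)
    n_pos = sum(y_true)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC is undefined unless both classes are present")
    order = sorted(range(n), key=lambda i: y_score[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the current group of tied scores.
        while j + 1 < n and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    rank_sum_pos = sum(r for r, t in zip(ranks, y_true) if t == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def mean_auroc(labels: List[List[int]], scores: List[List[float]]) -> float:
    """Unweighted mean of per-label AUROCs; rows are images, columns are labels."""
    n_labels = len(labels[0])
    per_label = [
        auroc([row[j] for row in labels], [row[j] for row in scores])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.4], [0.1, 0.8], [0.8, 0.3], [0.3, 0.1]]
print(mean_auroc(labels, scores))  # per-label AUROCs are 1.0 and 0.75, so this prints 0.875
```

Because every label counts equally in this average, a rare finding with few positive cases influences the summary as much as a common one. That is part of why the choice of summary metric is listed here as an assumption.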


