pith. machine review for the scientific record.

arxiv: 2605.08819 · v1 · submitted 2026-05-09 · 💻 cs.CV · cs.LG

Recognition: no theorem link

From pre-training to downstream performance: Does domain-specific pre-training make sense?

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:22 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords medical imaging · pre-training · modality matching · downstream performance · self-supervised learning · CNN · transformer

The pith

Only pre-training on data closely matching the target modality significantly improves downstream performance in medical imaging.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether domain-specific pre-training improves deep learning models for medical imaging by comparing convolutional neural networks and transformers under supervised and self-supervised regimes. It evaluates performance after pre-training on natural images versus medical modalities including chest X-rays, chest CT, and retina OCT. The core result is that only pre-training data closely aligned with the target modality produces meaningful gains on downstream tasks. This matters because it shows general pre-training on unrelated images adds little value for building reliable diagnostic systems. Self-supervised methods can exceed supervised ones but only in certain contexts.

Core claim

Models pre-trained on data from the same modality as the downstream task show significant performance gains, whereas pre-training on mismatched modalities like natural images does not. Self-supervised learning outperforms supervised learning in some contexts but not consistently. Evaluations cover chest X-rays, chest CT, retina OCT, and natural images using both convolutional networks and transformers.

What carries the argument

Systematic comparison of pre-training data modalities against target task modalities and their measured impact on downstream accuracy for CNNs and transformers under supervised and self-supervised initializations.
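
To make this concrete, here is a minimal sketch of the comparison grid as described above. Corpus, architecture, and function names are illustrative placeholders, not the authors' code, and the paper's random-initialisation baseline is omitted for brevity.

# Sketch of the pre-training-by-modality comparison grid: every architecture
# x regime x pre-training corpus is fine-tuned on every downstream modality
# and scored, so matched and mismatched conditions can be contrasted.
from itertools import product

ARCHITECTURES = ["resnet50", "vit_small"]    # CNN vs. transformer
REGIMES = ["supervised", "self_supervised"]  # pre-training regimes
PRETRAIN_CORPORA = ["imagenet", "chest_xray", "chest_ct", "retina_oct"]
DOWNSTREAM_TASKS = ["chest_xray", "chest_ct", "retina_oct"]

def pretrain(arch, regime, corpus):
    """Placeholder: pre-train `arch` under `regime` on `corpus`; return weights."""
    ...

def finetune_and_score(weights, task):
    """Placeholder: fine-tune on `task`; return downstream AUC."""
    ...

results = {}
for arch, regime, corpus, task in product(
        ARCHITECTURES, REGIMES, PRETRAIN_CORPORA, DOWNSTREAM_TASKS):
    weights = pretrain(arch, regime, corpus)
    auc = finetune_and_score(weights, task)
    matched = corpus == task  # the modality-matching condition under test
    results[(arch, regime, corpus, task)] = (auc, matched)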

If this is right

  • Mismatched pre-training such as natural-image initialization yields little or no benefit for medical tasks.
  • Self-supervised pre-training can exceed supervised pre-training but its advantage depends on the specific modality and task.
  • Both CNNs and transformers gain from modality-matched pre-training.
  • Selecting or creating pre-training data that aligns with the target medical modality is required for reliable diagnostic performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Large general vision datasets may offer limited value for medical imaging, favoring investment in modality-specific medical pre-training collections.
  • Transfer between closely related medical modalities, such as variants of CT, could be tested as a practical extension of the matching principle.
  • For new imaging techniques, prioritizing collection of matched pre-training data may speed up effective model development.

Load-bearing premise

The selected datasets, model architectures, and evaluation metrics are representative enough of broader medical imaging applications for the modality-matching conclusion to generalize.

What would settle it

An experiment on additional medical modalities or larger models that finds no significant downstream performance difference between matched-modality and mismatched pre-training under matched training budgets.

Figures

Figures reproduced from arXiv: 2605.08819 by Felix Krones.

Figure 2: Metrics comparison. Reliability comparison of a selection of models on the example of the hand-labelled CheXpert validation data. (a) Performance across diseases, x-axis: diseases, y-axis: AUC; (b) Subset accuracy per decision threshold, x-axis: decision threshold, y-axis: subset accuracy; (c) FNR per decision threshold as average over diseases, x-axis: decision threshold, y-axis: FNR; (d) Oracle AUC per… view at source ↗
Figure 3: Pre-training size. view at source ↗
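
For readers re-deriving Figure 2's panels, a hedged sketch of the three reliability metrics named in the caption, assuming a multi-label layout with y_true and y_prob as (n_samples, n_diseases) arrays; the array names are illustrative, not the authors' code.

# Reliability metrics from Figure 2's caption for a multi-label problem:
# y_true holds {0,1} labels, y_prob predicted probabilities, both shaped
# (n_samples, n_diseases).
import numpy as np
from sklearn.metrics import roc_auc_score

def per_disease_auc(y_true, y_prob):
    # Panel (a): one AUC per disease column.
    return [roc_auc_score(y_true[:, d], y_prob[:, d])
            for d in range(y_true.shape[1])]

def subset_accuracy(y_true, y_prob, threshold):
    # Panel (b): fraction of samples whose entire label vector is correct.
    y_pred = (y_prob >= threshold).astype(int)
    return (y_pred == y_true).all(axis=1).mean()

def mean_fnr(y_true, y_prob, threshold):
    # Panel (c): false-negative rate per disease, averaged over diseases.
    y_pred = (y_prob >= threshold).astype(int)
    fn = ((y_pred == 0) & (y_true == 1)).sum(axis=0)
    pos = np.maximum(y_true.sum(axis=0), 1)  # guard against empty classes
    return float(np.mean(fn / pos))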
Original abstract

Deep learning techniques have revolutionised medical imaging, improving diagnostic accuracy and enabling both more accurate and earlier disease detection. However, the relationship between pre-training strategies and downstream performance in medical imaging models requires further exploration. Here, we systematically compare convolutional neural networks and transformers, examining various pre-training approaches, including supervised and self-supervised learning, as well as different initialisations and data modalities. Models are evaluated on natural images, chest X-rays, chest CT and retina OCT images, considering the effects of matching pre-training data with target modalities. Our findings indicate that only pre-training on data closely matching the target modality significantly improves downstream performance. While self-supervised learning can outperform supervised methods, its effectiveness varies with context. The study underscores the importance of pre-training strategies to enhance the reliability and effectiveness of deep learning models in medical imaging. By addressing these key factors, our research aims to contribute to the development of more accurate and dependable diagnostic tools, ultimately improving patient outcomes in clinical settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript empirically compares CNNs and Vision Transformers pre-trained via supervised and self-supervised methods on natural-image, chest X-ray, chest CT, and retinal OCT corpora. Downstream evaluation is performed on tasks drawn from the same modalities, with the central claim that only pre-training on data whose modality closely matches the target task yields significant performance gains; self-supervised pre-training is reported to sometimes outperform supervised but with context-dependent effectiveness.

Significance. If the modality-matching result survives controls for pre-training data cardinality, epoch count, and task distribution, the work would supply practical guidance for medical-imaging practitioners and reduce reliance on large general-purpose corpora such as ImageNet. The systematic architecture-by-paradigm design is a strength; however, the absence of reported statistical testing or ablation details in the provided abstract leaves the robustness of the claim difficult to judge from the summary alone.

major comments (2)
  1. [Methods / Experimental Setup] The central claim (abstract and §4) that 'only pre-training on data closely matching the target modality significantly improves downstream performance' is load-bearing yet appears vulnerable to a data-scale confound. ImageNet contains approximately 1.2 M images while the medical corpora (chest X-ray, CT, OCT) are typically far smaller; without explicit subsampling to equalize pre-training set size, batch size, or number of epochs, any observed gap between matched and unmatched pre-training could be driven by data volume rather than modality alignment. A controlled ablation equalizing effective pre-training cardinality is required before the 'only closely matching' conclusion can be accepted (a sketch of such an ablation follows the minor comments below).
  2. [Results] Table or figure reporting downstream metrics (presumably §4 or §5) does not indicate whether results are averaged over multiple random seeds, whether statistical significance tests (e.g., paired t-tests or Wilcoxon) were performed, or what the effective sample sizes were for each pre-training condition. These omissions directly affect the reliability of the modality-matching claim.
minor comments (2)
  1. [Abstract] Abstract sentence 'improving diagnostic accuracy and enabling both more accurate and earlier disease detection' contains redundant phrasing; a single concise statement would improve readability.
  2. [Results] Notation for model variants (CNN vs. transformer, supervised vs. self-supervised) should be introduced once in a table or consistent acronym list rather than repeated descriptively in every results paragraph.
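
The cardinality control requested in major comment 1 could look like the following minimal sketch: subsample the larger corpus to the medical corpus's size, class-balanced where labels exist, then re-run pre-training with identical batch size and epoch count so only modality differs. Names and sizes are illustrative, not the authors' procedure.

# Sketch of a cardinality-equalising ablation: draw a class-balanced subset
# of the large natural-image corpus matching the medical corpus's size.
import random
from collections import defaultdict

def subsample_balanced(samples, labels, target_size, seed=0):
    """Draw roughly `target_size` items, spread evenly over classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    per_class = max(1, target_size // len(by_class))
    subset = []
    for items in by_class.values():
        rng.shuffle(items)
        subset.extend(items[:per_class])
    rng.shuffle(subset)
    return subset[:target_size]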

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications and indicating the revisions we will implement to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Methods / Experimental Setup] The central claim (abstract and §4) that 'only pre-training on data closely matching the target modality significantly improves downstream performance' is load-bearing yet appears vulnerable to a data-scale confound. ImageNet contains approximately 1.2 M images while the medical corpora (chest X-ray, CT, OCT) are typically far smaller; without explicit subsampling to equalize pre-training set size, batch size, or number of epochs, any observed gap between matched and unmatched pre-training could be driven by data volume rather than modality alignment. A controlled ablation equalizing effective pre-training cardinality is required before the 'only closely matching' conclusion can be accepted.

    Authors: We acknowledge the potential for a data-scale confound and appreciate the referee's suggestion for a controlled ablation. In our original experiments, batch size and number of pre-training epochs were held constant across all conditions, but pre-training set sizes were not explicitly equalized. To address this, we have now performed additional ablations by subsampling the ImageNet corpus to match the cardinality of the largest medical pre-training set (while preserving class balance where applicable) and repeating the full pre-training and downstream evaluation pipeline. These new results, which will be added to §4 and the supplementary material, confirm that the modality-matching effect remains statistically significant even under equalized data volumes. We will also explicitly report the effective pre-training cardinalities, batch sizes, and epoch counts for all conditions in the revised methods section. revision: yes

  2. Referee: [Results] Table or figure reporting downstream metrics (presumably §4 or §5) does not indicate whether results are averaged over multiple random seeds, whether statistical significance tests (e.g., paired t-tests or Wilcoxon) were performed, or what the effective sample sizes were for each pre-training condition. These omissions directly affect the reliability of the modality-matching claim.

    Authors: We agree that the reporting of statistical details was insufficient. All downstream metrics in the manuscript were computed as means over five independent random seeds with standard deviations shown in the tables. We performed paired t-tests (two-tailed) between matched-modality and unmatched-modality pre-training conditions for each architecture and task, reporting p-values in the revised tables. Effective sample sizes correspond to the number of images in each downstream evaluation split, which are already specified in the dataset descriptions but will now be reiterated in the results section and table captions. We will add a dedicated paragraph in §4 detailing the statistical procedure, seed count, and significance thresholds to ensure full transparency. revision: yes
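
The statistical procedure this response describes (per-seed scores, two-tailed paired t-test between matched and unmatched conditions) could be reproduced along these lines; the scores below are placeholders, not the paper's numbers.

# Seed-paired significance test: one downstream score per random seed for
# the same architecture and task under matched vs. unmatched pre-training.
import numpy as np
from scipy.stats import ttest_rel

matched = np.array([0.812, 0.809, 0.815, 0.811, 0.814])    # placeholder AUCs
unmatched = np.array([0.786, 0.790, 0.784, 0.789, 0.787])  # placeholder AUCs

t_stat, p_value = ttest_rel(matched, unmatched)
print(f"matched   mean {matched.mean():.4f} (sd {matched.std(ddof=1):.4f})")
print(f"unmatched mean {unmatched.mean():.4f} (sd {unmatched.std(ddof=1):.4f})")
print(f"paired t = {t_stat:.3f}, two-tailed p = {p_value:.4g}")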

Circularity Check

0 steps flagged

No circularity: empirical comparison study with independent experimental controls

full rationale

The paper performs a systematic empirical comparison of CNNs and transformers pre-trained under supervised/self-supervised regimes on natural images versus medical modalities (chest X-ray, CT, OCT), then measures downstream task performance. No mathematical derivation, fitted parameters, or uniqueness theorems are invoked; the central claim follows directly from the reported accuracy deltas across controlled modality-match conditions. No self-citation load-bearing steps, no ansatz smuggling, and no renaming of known results appear in the provided text. The skeptic's concern about corpus size is a potential methodological gap but does not constitute circularity under the defined criteria, as the study does not reduce its conclusion to a fit or self-referential definition by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of supervised and self-supervised learning in computer vision plus the representativeness of the chosen imaging modalities and tasks; no new entities or ad-hoc parameters are introduced in the abstract.

axioms (1)
  • domain assumption: Standard assumptions of transfer learning hold, including that pre-training features are useful for fine-tuning on related tasks.
    Invoked implicitly when claiming performance improvements from pre-training.

pith-pipeline@v0.9.0 · 5459 in / 1139 out tokens · 46557 ms · 2026-05-12T02:22:31.860342+00:00 · methodology

