arxiv: 2605.19359 · v1 · pith:QCUSWUMXnew · submitted 2026-05-19 · 💻 cs.CV · cs.LG

MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

Pith reviewed 2026-05-20 06:31 UTC · model grok-4.3

keywords: mammography, BI-RADS, vision-language pretraining, contrastive learning, medical imaging, atlas, classification

The pith

Pretraining vision encoders on mammography atlas image-text pairs improves BI-RADS classification performance

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper reports that contrastive learning on pairs of mammography images and their descriptive atlas captions gives the vision encoder better knowledge of findings. When this pretrained encoder is fine-tuned to predict BI-RADS categories, it outperforms models that skip the atlas pretraining step, and the advantage grows as the fine-tuning dataset shrinks. The authors also report that 2K atlas image-text pairs can be more informative than 2K labeled images for the prediction task.

Core claim

The authors curate image-text pairs from mammography atlases and train a vision-language model with contrastive loss to align image and text representations. This allows the vision component to absorb information from the captions about mammography findings. Fine-tuning the resulting vision encoder on BI-RADS labeled datasets then produces superior classification results compared to standard training, with the largest gains observed in data-scarce conditions.
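
The alignment step described above can be sketched as a symmetric cross-entropy over a batch of image and caption embeddings. This is a minimal pure-Python illustration that assumes a CLIP-style formulation, which the summary here does not confirm; it is not the authors' implementation. The function names, the toy two-dimensional embeddings, and the temperature of 0.07 are all placeholders chosen for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: image-to-text plus text-to-image, averaged.

    temperature=0.07 is an assumed placeholder, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity of every image with every caption, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))


# Toy batch: matched pairs should give a lower loss than mismatched pairs.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In a real pipeline the embeddings would come from the vision encoder and the text encoder, and the loss would be computed with a tensor library so that gradients flow back into the vision encoder.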

What carries the argument

Contrastive pretraining on image-caption pairs from mammography atlases to align visual and textual embeddings, enabling the vision encoder to incorporate descriptive knowledge for improved BI-RADS prediction upon fine-tuning

If this is right

  • The vision encoder gains richer understanding of mammography patterns from textual descriptions
  • Classification performance improves more when labeled data for fine-tuning is limited
  • A small set of atlas pairs can provide more value than the same number of labeled samples for BI-RADS prediction
  • Textual information from atlases serves as a valuable resource for enhancing medical image models

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This pretraining method could be applied to other medical imaging modalities with available atlases and reports
  • It may help reduce reliance on large amounts of manually labeled data in developing diagnostic AI systems
  • Combining atlas pretraining with other self-supervised techniques on images could yield even stronger models

Load-bearing premise

The captions provide accurate and relevant descriptions of mammography findings that can be effectively transferred to BI-RADS classification through contrastive image-text alignment

What would settle it

A direct comparison on a held-out BI-RADS test set in which the atlas-pretrained vision encoder performs no better than a vision encoder trained without atlas data would disprove the value of the proposed pretraining

Figures

Figures reproduced from arXiv: 2605.19359 by Halil Ibrahim Gulluk, Olivier Gevaert.

Figure 1: Model overview. We first extract mammogram images and their corresponding captions [...]
Figure 2: Number of samples per class for the two classification datasets. The EMBED dataset is [...]
Figure 3: Classification results on the TEKNOFEST dataset. For every number of training samples [...]
Figure 4: Classification results on the EMBED dataset. For every number of training samples, our [...]
Original abstract

Deep learning methods have demonstrated promising results in predicting BI-RADS scores from mammography images. However, the interpretation of these images can vary, leading to discrepancies even among radiologists. Given the inherent complexity of mammograms, training classification models solely on image labels often yields limited performance. To address this challenge, we curated 2313 mammogram images and their corresponding captions from two mammography atlases. Our proposed approach employs a multi-modal model that uses a pretrained PubMedBERT as the language component. By training this model on image-text pairs with contrastive learning, we enable the vision encoder to absorb the rich information contained in the captions, thereby improving its understanding of mammography findings. We then fine-tune the vision encoder on two datasets for BI-RADS prediction, achieving superior performance compared with models trained without this pretraining, particularly when labeled samples are scarce. The improvement in the 3-class average F1 score ranges from +1% to +14%: a +1% increase with 40K training samples, and a +14% increase with 1K samples. Furthermore, our experiments reveal that 2K image-text pairs from mammography atlases can be more informative than 2K labeled samples for label prediction, with an average margin of +1.1% when more than 10K training samples are available. Overall, our work provides a vision-language model for mammography and highlights the value of textual information from mammography atlases. In addition, we publicly release preprocessed mammography images of the TEKNOFEST dataset. The training code, pre-trained model weights, data extraction scripts, and the released dataset are publicly available at: https://github.com/igulluk/MAM-CLIP
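
The headline metric in the abstract is a 3-class average F1 score. As a rough guide to what such a number measures, the sketch below computes a macro-averaged F1, assuming that "average" means an unweighted mean of per-class F1 scores; the abstract does not spell this out. The class labels 0, 1, 2 and the toy predictions are placeholders, not data from the paper.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores (assumed reading of 'average F1')."""
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with three placeholder classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, classes=[0, 1, 2])
print(round(score, 4))
```

Because every class counts equally under this reading, a gain on a rare class moves the score as much as a gain on a common one, which matters when class counts are imbalanced.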

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces MAM-CLIP, a vision-language model that pretrains a vision encoder on 2313 mammography image-caption pairs curated from two atlases using contrastive learning with PubMedBERT as the text encoder. The pretrained vision encoder is subsequently fine-tuned for BI-RADS classification on target datasets, yielding 3-class average F1 improvements of +1% (with 40K samples) to +14% (with 1K samples) relative to models without atlas pretraining. The work further reports that 2K atlas image-text pairs outperform 2K labeled samples by an average +1.1% margin when more than 10K training samples are available, and publicly releases the preprocessed TEKNOFEST dataset along with code and model weights.

Significance. If the attribution to caption semantics holds, the result would demonstrate that descriptive text from mammography atlases can inject finding-specific knowledge (e.g., mass shape, calcification morphology) into vision encoders, offering a practical route to stronger performance in low-labeled-data regimes common to medical imaging. The public release of dataset, code, and weights is a clear strength that supports reproducibility.

major comments (1)
  1. [§4 (Experiments)] In the paragraph reporting the +1.1% margin for 2K image-text pairs versus 2K labeled samples, the comparison does not include a control that replaces atlas captions with shuffled or generic text. Without this ablation, the observed gains cannot be confidently attributed to the semantic content of the captions (as asserted in the abstract) rather than the mere presence of a contrastive objective or additional unlabeled images. This control is load-bearing for the central claim that the vision encoder absorbs 'rich information contained in the captions'.
minor comments (3)
  1. [§4] The manuscript does not report statistical significance testing (e.g., standard errors, p-values, or bootstrap intervals) for the F1 improvements across sample sizes; adding these would strengthen the empirical claims.
  2. [§4] Details on the exact train/validation/test splits and sampling procedure for the 40K and 1K regimes on the target BI-RADS datasets are not fully specified; this information is needed to assess the low-data experiments.
  3. [§3 (Method)] The value chosen for the contrastive temperature and its sensitivity analysis are not reported, even though it is a free hyperparameter in the method.
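
The bootstrap intervals requested in the first minor comment could be produced along the following lines. This is a sketch under stated assumptions, not a procedure taken from the paper: the function names are hypothetical, the labels and predictions are synthetic, and plain accuracy stands in for the metric to keep the example short (a macro-F1 function with the same signature could be passed instead).

```python
import random


def accuracy(y_true, y_pred):
    """Stand-in metric; any function metric(y_true, y_pred) -> float works."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def paired_bootstrap_ci(y_true, pred_a, pred_b, metric,
                        n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for metric(A) - metric(B) on a shared test set."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_resamples):
        # Resample test cases with replacement, using the same cases for both models.
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(metric(yt, a) - metric(yt, b))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Synthetic data: model A is always right, model B errs on every fifth case.
y_true = [i % 3 for i in range(300)]
pred_pretrained = list(y_true)
pred_baseline = [(t + 1) % 3 if i % 5 == 0 else t for i, t in enumerate(y_true)]

ci_low, ci_high = paired_bootstrap_ci(y_true, pred_pretrained, pred_baseline, accuracy)
print(ci_low, ci_high)
```

An interval that excludes zero would indicate that the gap between the two models is unlikely to be an artifact of which test cases happened to be sampled.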

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment below and outline the planned revision to strengthen the manuscript.

point-by-point responses
  1. Referee: the comparison does not include a control that replaces atlas captions with shuffled or generic text. Without this ablation, the observed gains cannot be confidently attributed to the semantic content of the captions (as asserted in the abstract) rather than the mere presence of a contrastive objective or additional unlabeled images. This control is load-bearing for the central claim that the vision encoder absorbs 'rich information contained in the captions'.

    Authors: We thank the referee for highlighting this point. We agree that the current comparison of 2K image-text pairs versus 2K labeled samples demonstrates the practical value of atlas pretraining but does not fully isolate the contribution of caption semantics from the contrastive objective or the addition of extra images. To more directly support the abstract claim that the vision encoder absorbs rich information from the captions, we will add the suggested control in the revised Section 4. Specifically, we will retrain the model on the same 2K images paired with (i) shuffled captions and (ii) generic text such as 'a mammogram image', then report the downstream 3-class BI-RADS F1 scores for comparison against the original descriptive-caption results. This ablation will clarify whether the observed +1.1% margin is attributable to caption content.
    revision: yes
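
The two control conditions named in the response above could be built roughly as follows. This is a hypothetical sketch: the function names, file names, and placeholder captions are invented for illustration, and only the generic caption string 'a mammogram image' comes from the text above.

```python
import random


def shuffled_caption_pairs(pairs, seed=0):
    """Reassign captions so that no image keeps the caption at its own position."""
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to shuffle captions")
    rng = random.Random(seed)
    order = list(range(len(pairs)))
    # Rejection sampling: reshuffle until no index maps to itself.
    while True:
        rng.shuffle(order)
        if all(i != j for i, j in enumerate(order)):
            break
    return [(pairs[i][0], pairs[j][1]) for i, j in enumerate(order)]


def generic_caption_pairs(pairs, caption="a mammogram image"):
    """Replace every caption with the same uninformative text."""
    return [(image, caption) for image, _ in pairs]


# Placeholder (image, caption) pairs standing in for atlas data.
pairs = [(f"img_{k}.png", f"caption {k}") for k in range(6)]
shuffled = shuffled_caption_pairs(pairs, seed=0)
generic = generic_caption_pairs(pairs)
print(shuffled[0], generic[0])
```

The shuffle works by position only, so if an atlas contains identical captions for different images, a shuffled pair may still carry matching text; a stricter control would also compare caption strings.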

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline uses distinct data sources

The paper presents an empirical workflow of contrastive pretraining on 2313 separate atlas image-text pairs using PubMedBERT, followed by fine-tuning the vision encoder on independent labeled BI-RADS datasets. Reported F1 gains (+1% to +14%) are measured on held-out target test sets and do not reduce via any equation or self-citation to quantities defined by the same fitted parameters. No load-bearing step invokes a uniqueness theorem, renames a known result, or smuggles an ansatz through prior self-work; the central claim rests on experimental comparison rather than tautological re-derivation of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard contrastive learning assumptions plus the domain-specific premise that atlas captions encode transferable mammography knowledge; no new entities are postulated and hyperparameters follow common practice.

free parameters (1)
  • contrastive temperature
    Standard hyperparameter in contrastive objectives; chosen during pretraining but not central to the reported gains.
axioms (1)
  • domain assumption: Captions from mammography atlases provide accurate and rich descriptions of image findings that align with BI-RADS categories.
    Invoked when claiming that contrastive training allows the vision encoder to absorb information from the captions.
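
To make the role of the contrastive temperature listed in the ledger more concrete, the sketch below applies a temperature-scaled softmax to a fixed set of similarity scores. The similarity values and the temperatures are hypothetical and chosen only to show the trend; none of them come from the paper.

```python
import math


def softmax(scores, temperature):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities of one image to three captions;
# index 0 plays the role of the matched caption.
similarities = [0.6, 0.5, 0.1]

matched_prob = {}
for tau in (1.0, 0.1, 0.01):
    matched_prob[tau] = softmax(similarities, tau)[0]
    print(tau, round(matched_prob[tau], 3))
```

Lower temperatures concentrate probability on the most similar caption, so small similarity gaps are penalized more sharply, which is one reason a sensitivity analysis over this hyperparameter can be informative.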

pith-pipeline@v0.9.0 · 5850 in / 1412 out tokens · 40291 ms · 2026-05-20T06:31:03.452606+00:00 · methodology
