pith. machine review for the scientific record.

arXiv: 2512.00198 · v3 · submitted 2025-11-28 · 💻 cs.CV

Mammo-FM: Breast-specific foundational model for Integrated Mammographic Diagnosis, Prognosis, and Reporting

Pith reviewed 2026-05-17 03:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords mammography · foundation model · breast cancer · diagnosis · prognosis · structured reporting · medical imaging · domain-specific AI

The pith

A breast-specific foundation model outperforms larger generalist models on diagnosis, prognosis, and reporting while using one-third the parameters

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Mammo-FM as the first foundation model built specifically for mammography. It is pretrained on what the authors describe as the largest mammography dataset to date, covering over 140,000 patients and 800,000 images from four U.S. institutions. The model supports a range of tasks, including detecting cancer, predicting risk, and generating reports, all from aligned image and text data. A reader would care because it suggests specialized models can deliver better results in medicine while being smaller and more practical than broad AI systems.

Core claim

Mammo-FM is introduced as the first mammography-specific foundation model, pretrained on 140,677 patients and 821,326 mammograms from four U.S. institutions. It provides a single framework for cancer diagnosis, pathology localization, structured report generation, and cancer risk prognosis by aligning images with text for improved interpretability. The model operates on native-resolution mammograms and uses only one-third the parameters of state-of-the-art generalist foundation models while consistently outperforming them on multiple benchmarks.

What carries the argument

The image-text aligned foundation model pretrained on native-resolution mammograms that unifies diagnosis, localization, reporting, and prognosis in one representation.
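The Figure 2 caption describes how this alignment is used for zero-shot classification: a finding is predicted by measuring image-text embedding similarity, without task-specific training. The following is a minimal sketch of that idea only, not the authors' implementation. The toy 3-dimensional embeddings, the prompt labels, and the function names `cosine_similarity` and `zero_shot_predict` are all hypothetical; in a real system the vectors would come from the model's image and text encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_embedding, prompt_embeddings):
    """Pick the text prompt whose embedding is closest to the image embedding.

    prompt_embeddings maps a label (e.g. "mass") to the embedding of a
    text prompt describing that label. Returns (best_label, all_scores).
    """
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in prompt_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical toy embeddings standing in for encoder outputs.
image_embedding = [0.9, 0.1, 0.2]
prompt_embeddings = {
    "mass": [1.0, 0.0, 0.0],
    "no mass": [0.0, 1.0, 0.0],
}

label, scores = zero_shot_predict(image_embedding, prompt_embeddings)
print(label, scores)
```

Because no classifier head is trained, the quality of such predictions depends entirely on how well the pretraining aligned the two embedding spaces, which is why the zero-shot results carry weight in the paper's argument.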

If this is right

  • The image-text alignment allows clinicians to audit model decisions through linked visual and textual explanations.
  • A single smaller model can replace separate tools for diagnosis, risk assessment, and report writing in breast imaging workflows.
  • Operating at native image resolution preserves fine details that downsampled generalist models might lose.
  • Performance gains on out-of-distribution tests indicate the model handles data from different institutions better than broader alternatives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Domain-specific pretraining could prove more efficient than scaling up general models for other specialized medical imaging tasks such as chest X-rays or MRIs.
  • The results point to the value of multi-institution datasets for building medical AI that generalizes across equipment and patient populations.
  • If the efficiency advantage holds, clinical deployment of advanced mammography AI could require less specialized hardware than current general models.

Load-bearing premise

The multi-institutional dataset of 140,677 patients is representative of real-world mammogram variations and free of biases or data leakage that could affect generalization.

What would settle it

Testing Mammo-FM on mammograms from a hospital or region outside the four training institutions and checking whether it still outperforms generalist models on the same tasks.

Figures

Figures reproduced from arXiv: 2512.00198 by Abhishek Varshney, Alex Tang, Aya Kassem, Clare B. Poynton, Hari M. Trivedi, Ho Cheung Aiden Wong, Imon Banerjee, Judy Wawira Gichoya, Katelyn C. Morrison, Kayhan Batmanghelich, Param Budhraja, Payel Basak, Rayan Syed, Shantanu Ghosh, Shyam Visweswaran, Vedant Parthesh Joshi, Weicheng Dai.

Figure 1. Overview of the Mammo-FM foundation model and downstream applications. a. Schematic of the Mammo-FM framework. High-resolution mammographic views (CC, MLO) and paired radiology reports are jointly used for multi-view contrastive pretraining, aligning image and text representations within a shared embedding space (see Methods). b. Composition of multi-institutional pretraining datasets from the Mayo Clinic,…
Figure 2. Evaluation of Mammo-FM representations for diagnosing mammographic findings. a. Schematic overview of zero-shot classification, where aligned image and text encoders predict the presence of mammographic findings (e.g., mass, calcification) by measuring image–text embedding similarity, without task-specific training. b. Zero-shot evaluation of breast cancer and finding classification performance on in-dist…
Figure 3. Integrating Mammo-FM representations with risk prediction pipelines and interpretable modeling. a. Training pipeline of the risk predictors – MIRAI w/ Mammo-FM and AsymMIRAI w/ Mammo-FM using knowledge distillation from 1–5-year risk outputs of the original MIRAI model. We freeze the image encoder throughout the training. b. We replace the standard ResNet-18 encoder in MIRAI and AsymMIRAI with a Mammo-FM-…
Figure 4. Overview of Mammo-GRG. a. Schematic of the Mammo-GRG architecture. Four-view screening mammograms (LCC, LMLO, RCC, RMLO) are encoded by dedicated Mammo-FM vision encoders and projected to the latent space of a Llama-3.1-8B language model via a multimodal projector. View-specific tokens and positional embeddings preserve spatial and semantic distinctions, enabling cross-view reasoning. b. Clinical grounding…
Figure 5. Distributions of radiology report lengths across datasets. a. Violin plots showing the distribution of report word counts for breast imaging datasets from BU and UPMC. Each violin displays the full distribution of report lengths, with internal boxes denoting the interquartile range and median. Reports from UPMC were generally longer and exhibited higher variance compared with BU. b. Histogram and kernel de…
Figure 6. Example of report-like sentence generation for the attribute "mass"…
Figure 7. Integration of Mammo-FM representations into risk prediction frameworks. a. MIRAI w/ Mammo-FM. Each of the four standard mammographic views (LCC, LMLO, RCC, RMLO) passes through an independent, frozen Mammo-FM image encoder to produce a 2,048-dimensional feature. A lightweight transformer aggregator fuses view-specific representations enriched with side, view, and time embeddings. The aggregated features a…
Figure 8. Selection pipeline of the UPMC dataset used for pretraining Mammo-FM. [Flowchart: 86,050 exams (2010-2024) → 69,651 exams → 13,854 exams retrieved from PACS → 13,679 complete exams → 13,480 exams in the dataset. Exclusions: 14,978 exams that lack follow-up; 879 exams for history of breast cancer; 542 exams for history of breast implants; 175 exams that lack complete imaging data (i.e., 4 images per exam); …]
Figure 9. Selection pipeline of the BU dataset used for pretraining Mammo-FM.
Figure 10. Selection pipeline of the EMBED dataset used for pretraining Mammo-FM. [Plot labels extracted with this caption: BU: long_answer 153361, multiple_choice 32763, report_generation 10921. UPMC: long_answer 127009, multiple_choice 26769, report_generation 8923. Counts: long_answer 383535, multiple_choice 59532, report_generation 19844.]
Figure 11. Distribution of question-answer (QA) pairs within the Mammo-Instruct dataset – used to train Mammo-GRG – across the UPMC and BU subsets.
Figure 12. Evaluation (Precision and Recall) of diagnostic performance across training regimes and validation settings. a. Zero-shot evaluation of Mammo-FM and baseline models across out-of-distribution (OOD) and in-distribution (ID) validation sets. b. Linear probe evaluation of the same models, where only the classification layer was trained on labeled data while image–text encoders remained frozen. c. Full fine-t…
Figure 13. Accuracies of breast density classification across all the pre-trained image encoders. Mammo-FM (multi-institution) achieves the highest accuracy across both in-distribution (EMBED) and out-of-distribution (RSNA, VinDr) datasets under linear probe and full fine-tuning settings. Its consistent gains over Mammo-FM (UPMC), MedSigLIP, DINOv3, and CXR-CLIP-RN50 highlight the benefits of large-scale, domain-spe…
Figure 14. Zero-shot AUROC performance across out-of-distribution (OOD) datasets using different text encoders of Mammo-FM pretrained on the UPMC dataset. The plot compares multiple text encoder variants under zero-shot settings, highlighting the impact of the language backbone. Finetuned and non-finetuned ModernBERT variants indicate whether the text encoder is finetuned on 200,000 UPMC radiology reports, which ser…
Figure 15. Accuracies of BI-RADS classification performance across report generation models: Mammo-GRG, Med-Gemma (fine-tuned), Med-Gemma, LLaVA-med (fine-tuned), and LLaVA-med. Across both datasets (UPMC and BU), Mammo-GRG demonstrates superior performance compared with all generalist baselines, indicating more clinically reliable BI-RADS classification from generated reports. For this evaluation, we group BI-R…
Figure 16. Qualitative examples of text-grounded interpretability of the MIRAI w/ Mammo-FM risk predictor on BU dataset samples for 1-year cancer risk prediction. Each heatmap (left) localizes high-risk activations driving the model’s prediction. We rank the report sentences (right) by probabilities derived from the causal ablation distribution, computed as the normalized drop in cosine similarity between the origin…
Figure 17. Effect of Mammo-FM–based zero-shot grounding on Mammo-GRG. a. Recall across key clinical findings (mass, asymmetry, calcification) for BU and UPMC datasets. Mammo-GRG with Mammo-FM zero-shot grounding achieves consistently higher recall across all categories, demonstrating improved factual accuracy and clinical relevance. b-c. GREEN factuality scores on BU and UPMC datasets. Mammo-GRG with Mammo-FM ground…
Figure 18. Accuracies of key mammographic findings (e.g., mass, calcification, and asymmetry) extracted from generated versus reference reports across datasets and laterality. Across both datasets (UPMC and BU) and for both left and right breasts, Mammo-GRG consistently outperforms all generalist baselines, demonstrating superior accuracy and more clinically reliable BI-RADS classification from generated reports.
Figure 19. Comparison of Mammo-GRG fact-checking performance using different Mammo-FM configurations. Evaluation of Mammo-GRG report verification using fully fine-tuned (FT), linear probe (LP), and zero-shot (ZS) variants of Mammo-FM across in-distribution (EMBED) and out-of-distribution (VinDr) datasets. Recall is reported for mass and calcification detection. The fully fine-tuned Mammo-FM achieves the highest rec…
Figure 20. Ablation study of Mammo-GRG using different large language models for report generation. a. Recall for key diagnostic findings (BI-RADS category, mass, calcification, and asymmetry) extracted from reports generated by Mammo-GRG models based on different LLM backbones (Llama-3.1-8B, Vicuna-13B, and Mistral-7B) is shown for the BU and UPMC datasets, separately for overall, left-breast, and right-breast evalua…
Original abstract

Breast cancer is one of the leading causes of death among women worldwide. We introduce Mammo-FM, the first foundation model specifically for mammography, pretrained on the largest and most diverse dataset to date - 140,677 patients (821,326 mammograms) across four U.S. institutions. Mammo-FM provides a unified foundation for core clinical tasks in breast imaging, including cancer diagnosis, pathology localization, structured report generation, and cancer risk prognosis within a single framework. Its alignment between images and text enables both visual and textual interpretability, improving transparency and clinical auditability, which are essential for real-world adoption. We rigorously evaluate Mammo-FM across diagnosis, prognosis, and report-generation tasks in in- and out-of-distribution datasets. Despite operating on native-resolution mammograms and using only one-third of the parameters of state-of-the-art generalist FMs, Mammo-FM consistently outperforms them across multiple public and private benchmarks. These results highlight the efficiency and value of domain-specific foundation models designed around the full spectrum of tasks within a clinical domain and emphasize the importance of rigorous, domain-aligned evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Mammo-FM, a breast-specific foundation model pretrained on the largest reported mammography dataset to date (140,677 patients / 821,326 images across four U.S. institutions). It presents a single image-text aligned model that jointly supports cancer diagnosis, pathology localization, structured report generation, and cancer risk prognosis. The central empirical claim is that Mammo-FM, operating at native resolution with only one-third the parameters of leading generalist foundation models, consistently outperforms those baselines on both public and private in- and out-of-distribution benchmarks.

Significance. If the performance advantages survive rigorous controls for data leakage and identical fine-tuning protocols, the work would usefully demonstrate that domain-specific pretraining can deliver efficiency and multi-task coverage advantages in medical imaging. The scale of the multi-institutional pretraining corpus and the explicit integration of visual and textual outputs for interpretability are concrete strengths that would be of interest to both the computer-vision and clinical-breast-imaging communities.

major comments (2)
  1. [§4 and §3.2] §4 (Experiments) and §3.2 (Dataset Construction): the manuscript does not describe patient-level deduplication, temporal splits, or institution-level separation between the 140,677-patient pretraining corpus and the private test cohorts. Because the headline claim of consistent outperformance on out-of-distribution data rests on the assumption that these test sets are truly unseen, the absence of these controls is load-bearing for the generalization argument.
  2. [Results tables] Results tables (e.g., Tables 2–4): no statistical significance tests, confidence intervals, or ablation studies isolating the contribution of domain-specific pretraining versus native-resolution input or fine-tuning protocol are reported. Without these, it is impossible to verify that the reported gains are attributable to the model rather than evaluation asymmetry.
minor comments (2)
  1. [Abstract and §3.1] The exact parameter count comparison to the cited generalist models should be presented in a dedicated table rather than stated only in the abstract.
  2. [Figures 3–5] Figure captions for localization and report-generation examples would benefit from explicit annotation of ground-truth versus model output to aid clinical interpretability.
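Major comment 1 turns on whether any patient appears on both sides of the pretraining/evaluation boundary. A check of that kind can be sketched as below. This is an illustrative sketch under stated assumptions, not the procedure used in the paper: the record layout (dicts with a `patient_id` key), the helper names `find_patient_overlap` and `assign_split`, and the hash-based split are all hypothetical.

```python
import hashlib


def find_patient_overlap(pretrain_records, test_records):
    """Return the set of patient IDs present in both record collections."""
    pretrain_ids = {record["patient_id"] for record in pretrain_records}
    test_ids = {record["patient_id"] for record in test_records}
    return pretrain_ids & test_ids


def assign_split(patient_id, test_fraction=0.2):
    """Deterministically assign a patient to 'test' or 'pretrain'.

    Hashing the patient ID (rather than the exam or image ID) keeps every
    exam from the same patient on the same side of the split.
    """
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "pretrain"


# Hypothetical toy records: patient "p2" leaks into the test cohort.
pretrain_records = [
    {"patient_id": "p1", "exam_id": "e1"},
    {"patient_id": "p2", "exam_id": "e2"},
]
test_records = [
    {"patient_id": "p2", "exam_id": "e3"},
    {"patient_id": "p3", "exam_id": "e4"},
]

leaked = find_patient_overlap(pretrain_records, test_records)
print(sorted(leaked))
```

An ID-based check only catches leakage when identifiers are consistent across institutions; patients who were imaged at more than one site under different identifiers would need a separate linkage step.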

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments identify important areas where additional methodological transparency and statistical rigor will strengthen the manuscript. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [§4 and §3.2] §4 (Experiments) and §3.2 (Dataset Construction): the manuscript does not describe patient-level deduplication, temporal splits, or institution-level separation between the 140,677-patient pretraining corpus and the private test cohorts. Because the headline claim of consistent outperformance on out-of-distribution data rests on the assumption that these test sets are truly unseen, the absence of these controls is load-bearing for the generalization argument.

    Authors: We agree that explicit documentation of these controls is essential for supporting the out-of-distribution claims. Patient-level deduplication was performed using unique patient identifiers across all institutions, and the private test cohorts were drawn from held-out institutions and later acquisition periods not represented in the pretraining corpus. We will revise §3.2 to provide a clear description of the deduplication process, the institution-level separation, and any temporal considerations used to ensure no patient or image overlap between pretraining and evaluation sets. revision: yes

  2. Referee: [Results tables] Results tables (e.g., Tables 2–4): no statistical significance tests, confidence intervals, or ablation studies isolating the contribution of domain-specific pretraining versus native-resolution input or fine-tuning protocol are reported. Without these, it is impossible to verify that the reported gains are attributable to the model rather than evaluation asymmetry.

    Authors: We concur that statistical significance testing and confidence intervals are necessary to substantiate the performance differences. We will recompute and report p-values and 95% confidence intervals for all key metrics in Tables 2–4. In addition, we will add an ablation study in §4 that isolates the contribution of domain-specific pretraining from the effects of native-resolution input and fine-tuning protocol, using controlled comparisons on the same evaluation sets. revision: yes
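The confidence intervals requested in major comment 2 could be obtained in several ways; one common option is a paired bootstrap over test cases. The sketch below assumes that approach and is not taken from the paper. The function names `auroc` and `paired_bootstrap_ci`, the resample count, and the toy labels and scores are all hypothetical, and the pairwise AUROC is written for clarity rather than speed.

```python
import random


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


def paired_bootstrap_ci(labels, scores_a, scores_b,
                        n_resamples=1000, alpha=0.05, seed=0):
    """Percentile CI for AUROC(model A) - AUROC(model B) on the same cases."""
    if len(set(labels)) < 2:
        raise ValueError("labels must contain both classes")
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # skip resamples that contain a single class
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(auroc(y, a) - auroc(y, b))
    diffs.sort()
    lower = diffs[int(alpha / 2 * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical toy data: model A separates the classes, model B does not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores_a = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
scores_b = [0.1, 0.6, 0.3, 0.7, 0.4, 0.2, 0.8, 0.9]

lower, upper = paired_bootstrap_ci(labels, scores_a, scores_b)
print(auroc(labels, scores_a), auroc(labels, scores_b), (lower, upper))
```

Resampling the same case indices for both models keeps the comparison paired, so case-level difficulty that affects both models cancels out of the difference. With several exams per patient, resampling would more appropriately be done at the patient level.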

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation chain

Full rationale

The paper describes pretraining a domain-specific foundation model on a large multi-institutional dataset of 140,677 patients followed by empirical evaluation on diagnosis, prognosis, localization, and report-generation tasks using in- and out-of-distribution benchmarks. All central claims rest on comparative performance numbers against external generalist models rather than any closed mathematical derivation, fitted parameter renamed as prediction, or self-referential definition. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the abstract or described methodology; the reported outperformance is presented as an observable result of the training and evaluation protocol, not a quantity forced by the paper's own equations. The setup is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard deep-learning training assumptions plus the domain assumption that the collected multi-institutional mammogram corpus is sufficiently diverse and unbiased to support the reported generalization; no new physical entities or ad-hoc mathematical constructs are introduced.

free parameters (1)
  • Pretraining hyperparameters and architecture choices
    Standard transformer-scale choices (layers, heads, learning rate schedule) that are fitted during pretraining on the mammography corpus.
axioms (1)
  • domain assumption: The multi-institutional dataset of 140,677 patients supplies adequate diversity and quality for robust pretraining and out-of-distribution generalization.
    Invoked to support claims of consistent outperformance on in- and out-of-distribution benchmarks.

pith-pipeline@v0.9.0 · 5580 in / 1422 out tokens · 71015 ms · 2026-05-17T03:17:04.179195+00:00 · methodology


