Multimodal Fusion for Fine-Grained Classification of Breast Fibroadenoma and Phyllodes Tumors
Pith reviewed 2026-07-03 15:41 UTC · model grok-4.3
The pith
A multimodal fusion method using ultrasound images, clinical attributes, and diagnostic descriptions classifies breast fibroadenoma from phyllodes tumors at 77.64 percent accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a clinically guided multimodal framework outperforms representative CNN-, Transformer-, and vision-language-based baselines at the patient level on the binary distinction between fibroadenoma and phyllodes tumor when all three data streams are available. The framework combines DenseNet visual encoding, CLIP-inspired text encoding, and lightweight clinical encoding with clinical-conditioned adaptive modulation, cross-modal Transformer fusion, and dual-path representation learning.
What carries the argument
A clinically guided multimodal framework that encodes each modality separately, then applies clinical-conditioned adaptive modulation, cross-modal Transformer fusion, and dual-path representation learning to improve feature alignment and interaction.
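The text reviewed here does not say how clinical-conditioned adaptive modulation is computed. One common way to condition one modality on another is feature-wise scale-and-shift (FiLM-style) modulation, and the sketch below illustrates only that general idea. It is an assumption, not the paper's implementation: the names `linear` and `modulate`, the weight matrices, and the feature sizes are all hypothetical.

```python
import math

def linear(x, weights, bias):
    """Dense layer: one output value per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def modulate(image_feat, clinical_feat, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift image features with parameters predicted from clinical features.

    ASSUMPTION: FiLM-style form, out = (1 + gamma) * x + beta, with gamma
    squashed by tanh so the scale stays in (0, 2). The paper may differ.
    """
    gamma = [math.tanh(g) for g in linear(clinical_feat, w_gamma, b_gamma)]
    beta = linear(clinical_feat, w_beta, b_beta)
    return [(1.0 + g) * x + b for x, g, b in zip(image_feat, gamma, beta)]

# Hypothetical 3-dim image feature and 2-dim clinical feature.
image_feat = [0.5, -1.0, 2.0]
clinical_feat = [1.0, 0.0]
zero_w = [[0.0, 0.0] for _ in range(3)]
zero_b = [0.0, 0.0, 0.0]

# With all-zero weights the clinical branch contributes nothing,
# so the image features pass through unchanged.
unchanged = modulate(image_feat, clinical_feat, zero_w, zero_b, zero_w, zero_b)
```

Writing the scale as `1 + gamma` means an untrained or uninformative clinical branch leaves the visual features intact, which is one plausible reason to prefer this form when clinical attributes are sparse.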
If this is right
- Three-modality fusion raises accuracy, F1-score, and AUC above any single-modality or two-modality ablation on the same patient-level splits.
- Clinical-conditioned adaptive modulation and cross-modal Transformer each measurably improve alignment between visual and textual features.
- The constructed FAPT-M dataset supplies a high-quality, pathology-confirmed benchmark for future multimodal breast-ultrasound studies.
- Class-balanced evaluations confirm that the performance lift holds when class imbalance is controlled.
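The first implication amounts to comparing the full three-modality model against every smaller modality subset on the same splits. A minimal sketch of that bookkeeping follows. The modality labels, `ablation_configs`, and `fusion_gain` are hypothetical names introduced for illustration, and the scores passed in would come from the reader's own runs, not from this page.

```python
from itertools import combinations

# Hypothetical labels for the three data streams described above.
MODALITIES = ("image", "clinical", "text")

def ablation_configs(modalities=MODALITIES):
    """Return every non-empty subset of modalities, smallest subsets first."""
    configs = []
    for size in range(1, len(modalities) + 1):
        configs.extend(combinations(modalities, size))
    return configs

def fusion_gain(scores, modalities=MODALITIES):
    """Full-model score minus the best score among the ablated subsets.

    `scores` maps a modality tuple (as produced by ablation_configs) to a
    metric value measured on identical patient-level splits.
    """
    full = tuple(modalities)
    best_ablation = max(v for k, v in scores.items() if k != full)
    return scores[full] - best_ablation

configs = ablation_configs()  # 3 single, 3 pairwise, 1 full configuration
```

A positive `fusion_gain` for accuracy, F1-score, and AUC alike is what the first bullet predicts; a gain at or below zero on any of them would weaken it.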
Where Pith is reading between the lines
- If deployed in a preoperative setting, the method could reduce the rate at which borderline phyllodes tumors are misclassified as fibroadenoma and thereby change surgical planning.
- The same clinical-conditioning and dual-path design might transfer to other ultrasound-based tasks that also mix images with free-text reports.
- Scaling the dataset beyond 910 patients while preserving the same strict pathology review would provide a direct test of whether the reported margins persist.
Load-bearing premise
The three modalities supply complementary information that the listed fusion components can combine without adding harmful noise or redundancy.
What would settle it
An independent test set in which the full multimodal model shows no accuracy gain over the strongest single-modality baseline would falsify the claim that the fusion components exploit useful complementary signals.
Original abstract
Breast fibroadenoma (FA) and phyllodes tumor (PT) are fibroepithelial breast lesions with highly overlapping appearances on B-mode ultrasound, making benign and borderline PT prone to being misclassified as FA and complicating preoperative decision-making. Existing computer-aided diagnosis methods commonly rely on single-modal imaging features and insufficiently exploit complementary clinical and textual information. To address this limitation, we construct the FAPT-M Dataset, a pathology-confirmed multimodal dataset comprising 910 patients with strictly reviewed ultrasound images, structured clinical attributes, and ultrasound diagnostic descriptions. Based on this dataset, we propose a clinically guided multimodal framework that integrates DenseNet-based visual encoding, CLIP-inspired text encoding, and lightweight clinical encoding, and further introduces clinical-conditioned adaptive modulation, cross-modal Transformer fusion, and dual-path representation learning to improve feature alignment and multimodal interaction. Under patient-level five-fold cross-validation, the proposed method achieves an accuracy of 77.64%, F1-score of 73.38%, and AUC of 89.74%, outperforming representative CNN-, Transformer-, and vision-language-based baselines. Ablation studies and class-balanced evaluations further confirm the contribution of three-modality fusion and the key architectural components. Overall, this work provides an effective multimodal approach for fine-grained FA-PT classification and establishes a high-quality benchmark for multimodal breast ultrasound analysis.
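The evaluation protocol in the abstract rests on two pieces: folds that never split one patient across train and test, and the three reported metrics. The sketch below shows one way to write both using only the standard library. It is an illustration under assumptions, not the authors' code: the function names are hypothetical, the split is not stratified by class (the abstract does not say whether theirs is), and label `1` is assumed to be the positive class.

```python
import random
from collections import defaultdict

def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Group sample indices into folds so each patient lands in exactly one fold."""
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique)}
    folds = defaultdict(list)
    for idx, pid in enumerate(patient_ids):
        folds[fold_of[pid]].append(idx)
    return [folds[f] for f in range(n_folds)]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Splitting on patient identifiers rather than on images matters whenever a patient contributes more than one image, because image-level splits would let near-duplicate views of the same lesion appear on both sides of the split.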
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript constructs the FAPT-M dataset of 910 pathology-confirmed patients with ultrasound images, structured clinical attributes, and diagnostic descriptions. It proposes a multimodal framework combining DenseNet visual encoding, CLIP-inspired text encoding, and clinical encoding, together with clinical-conditioned adaptive modulation, cross-modal Transformer fusion, and dual-path representation learning. Under patient-level five-fold cross-validation the method reports 77.64% accuracy, 73.38% F1-score, and 89.74% AUC, outperforming CNN-, Transformer-, and vision-language-based baselines; ablation studies and class-balanced evaluations are cited to confirm the value of three-modality fusion.
Significance. If the empirical claims hold, the work supplies a new high-quality multimodal benchmark for a clinically relevant fine-grained task where single-modality imaging is known to be insufficient. The explicit ablation studies addressing modality contribution constitute a concrete strength that directly supports the central claim of complementary information exploitation.
major comments (1)
- [Results section] Performance tables and text: the reported accuracy, F1, and AUC values are presented without statistical significance tests, confidence intervals, or any description of baseline training protocols and hyper-parameter search, which makes the outperformance claim difficult to evaluate.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The single major comment identifies a clear gap in the presentation of results that we agree requires addressing to strengthen the empirical claims.
Point-by-point responses
Referee: [Results section] Performance tables and text: the reported accuracy, F1, and AUC values are presented without statistical significance tests, confidence intervals, or any description of baseline training protocols and hyper-parameter search, which makes the outperformance claim difficult to evaluate.
Authors: We agree that the absence of statistical significance testing and confidence intervals limits the strength of the outperformance claims. In the revised manuscript we will add (i) 95% confidence intervals computed via patient-level bootstrap resampling over the five folds and (ii) paired statistical tests (McNemar’s test for accuracy/F1 and DeLong’s test for AUC) with p-values reported in the main tables. We will also expand the Methods and supplementary material to document the exact hyper-parameter search protocol (grid ranges for learning rate, batch size, fusion-layer depth, etc.) and training schedules used for every baseline, ensuring full reproducibility of the comparisons.
Revision: yes
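Two of the statistics named in this response can be sketched with the standard library alone: a percentile bootstrap interval over patients and an exact McNemar test on paired per-patient correctness. This is a hedged illustration, not the analysis the authors will run. The function names, the choice of 2000 resamples, and the use of the exact binomial form of McNemar's test are assumptions, and DeLong's test for AUC is not covered here.

```python
import math
import random

def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy.

    `correct` holds one 0/1 entry per patient (1 = classified correctly),
    pooled over the held-out folds. Patients are resampled with replacement.
    """
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(
        sum(correct[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for two models scored on the same patients.

    Only discordant patients (one model right, the other wrong) carry
    information; under the null they split 50/50 between the two models.
    """
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

McNemar's test as written compares paired correctness, so it speaks to accuracy directly; a paired comparison of F1-score would need a different procedure, such as a paired bootstrap of the F1 difference.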
Circularity Check
No significant circularity detected
The paper presents an empirical ML study: construction of a new multimodal dataset (910 patients), a fusion architecture with DenseNet, CLIP-inspired, and clinical encoders plus three specific modules, and evaluation via patient-level five-fold cross-validation yielding accuracy 77.64%, F1 73.38%, and AUC 89.74%. Ablation studies are cited to confirm the value of three-modality fusion. No equations, derivations, fitted parameters re-labeled as predictions, or self-citation chains appear in the provided text. The performance claims are obtained on held-out folds and are externally falsifiable, so the reported outputs do not reduce to the inputs by construction.