Multimodal Fusion of Histopathology Images and Electronic Health Records for Early Breast Cancer Diagnosis

Aditya Shribhagwan Khandelwal; Asra Aslam; Mohammad Samar Ansari

arxiv: 2604.17122 · v1 · submitted 2026-04-18 · 💻 cs.CV

Pith reviewed 2026-05-10 06:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal fusion · histopathology images · electronic health records · breast cancer diagnosis · intermediate fusion · AUC · mitosis detection · interpretability

The pith

Intermediate fusion of histopathology images and EHR data reaches 0.997 macro AUC for breast cancer diagnosis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that combining patch-level histopathology features with structured electronic health records through intermediate fusion produces higher diagnostic accuracy than either data source alone. A ResNet-18 model processes the images while XGBoost and multilayer perceptrons handle the tabular records; their latent outputs are simply concatenated and passed to a final classifier. This yields a macro-average AUC of 0.997 and the strongest gains on the mitosis category, which is both clinically critical and class-imbalanced. The result matters because breast cancer survival improves with earlier, more reliable identification of cell-division patterns that are hard to catch from images or records in isolation. Grad-CAM and SHAP maps further show that the fused decisions track known pathological and clinical markers.

Core claim

The authors demonstrate that an intermediate-fusion model, formed by concatenating latent vectors from a ResNet-18 convolutional network processing histopathology patches and from tabular models on electronic health records, achieves a macro-average AUC of 0.997 on breast cancer classification tasks. This exceeds the performance of unimodal image models like ResNet-18 and unimodal EHR models like XGBoost. The largest gains appear in the mitosis category, which is class-imbalanced yet diagnostically important, reaching an AUC of 0.994. Interpretability via Grad-CAM and SHAP confirms alignment with pathological and clinical standards.

What carries the argument

The intermediate fusion step that concatenates latent representations extracted separately from the image CNN and the EHR tabular model before feeding them into a final classifier.
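
As a rough illustration of the concatenation step described above, the following minimal sketch joins two latent vectors and passes the result through a linear softmax head. The vector sizes, the weight values, and the names `concat_fusion` and `linear_head` are hypothetical placeholders chosen for this example; they are not the authors' architecture or trained parameters.

```python
import math

def concat_fusion(image_latent, ehr_latent):
    """Intermediate fusion by plain concatenation of two latent vectors."""
    return list(image_latent) + list(ehr_latent)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_head(fused, weights, biases):
    """One linear layer followed by softmax; one weight row per class."""
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

# Hypothetical stand-ins: a 4-d image embedding and a 2-d tabular embedding.
image_latent = [0.2, -1.0, 0.5, 0.0]
ehr_latent = [1.5, -0.3]
fused = concat_fusion(image_latent, ehr_latent)

# Hypothetical (untrained) weights for a three-class head.
weights = [[0.1] * 6, [0.0] * 6, [-0.1] * 6]
biases = [0.0, 0.0, 0.0]
probs = linear_head(fused, weights, biases)
print(len(fused), probs)
```

The sketch makes one property of this design visible: the head sees the two modalities only as adjacent coordinates of one vector, so any cross-modal interaction has to be learned by the final classifier.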

Load-bearing premise

That the latent representations from the image and tabular models are complementary, and that their simple concatenation captures joint information without introducing redundancy or noise from mismatched data distributions.

What would settle it

An independent test set where the intermediate fusion model shows no AUC improvement over the best unimodal baseline on the mitosis category would falsify the claim of meaningful multimodal gains.
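
The comparison described above needs a per-class (one-vs-rest) AUC for each model, plus the macro average. A minimal sketch follows, assuming each model outputs one probability row per sample; the class names, toy probabilities, and function names are illustrative and not taken from the paper.

```python
def binary_auc(scores, positives):
    """AUC as the chance a positive outranks a negative (ties count one half)."""
    pos = [s for s, p in zip(scores, positives) if p]
    neg = [s for s, p in zip(scores, positives) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(prob_rows, labels, classes):
    """Return (macro average, per-class dict) of one-vs-rest AUCs."""
    per_class = {}
    for k, name in enumerate(classes):
        scores = [row[k] for row in prob_rows]
        per_class[name] = binary_auc(scores, [y == name for y in labels])
    return sum(per_class.values()) / len(classes), per_class

# Toy predictions (hypothetical), one probability row per sample.
classes = ["mitosis", "non_tumour", "tumour"]
labels = ["mitosis", "non_tumour", "tumour", "tumour"]
prob_rows = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
    [0.1, 0.3, 0.6],
]
macro, per_class = macro_ovr_auc(prob_rows, labels, classes)
print(macro, per_class["mitosis"])
```

Running the same function on the fusion model and on the best unimodal baseline, over the same held-out samples, gives the two mitosis AUCs whose difference the test is about.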

Figures

Figures reproduced from arXiv: 2604.17122 by Aditya Shribhagwan Khandelwal, Asra Aslam, Mohammad Samar Ansari.

**Figure 1.** BreCaHAD patch extraction workflow: original whole-slide image with expert dot annotations, tiling into 64 × 64 non-overlapping patches, and class-sorted patches.

**Figure 2.** Grad-CAM visualisations for ResNet across three BreCaHAD classes (rows: mitosis, non-tumour nuclei, tumour nuclei; columns: example patches). Activation heatmaps are consistently localized to the relevant cellular structures.

**Figure 3.** Left: confusion matrix for ResNet on the BreCaHAD test set. Tumour nuclei: 22,091 correct / 1,578 misclassified as non-tumour. Non-tumour nuclei: 1,796 correct / 9 misclassified. Mitosis: 163 correct / 176 misclassified as non-tumour / 15 as tumour. Errors concentrate between morphologically similar classes, as expected. Right: ROC curves (one-vs-rest) for Simple CNN and ResNet-18. Both achieve high overall AUC. Class…

**Figure 4.** Left: confusion matrix for MLP on MIMIC-IV. Right: ROC curves for MLP (one-vs-rest). Strong majority-class performance alongside near-diagonal minority-class ROC curves illustrates the MLP's sensitivity to class imbalance in structured clinical data, motivating the use of XGBoost as a more robust tabular baseline.

**Figure 5.** SHAP analyses for XGBoost on MIMIC-IV. First two: local waterfall plots for a correctly classified high-risk patient (left) and a low-risk patient (right), showing individual feature contributions to each prediction. Last two: global beeswarm plot showing the distribution of SHAP values across all patients (left), and mean SHAP bar chart summarizing overall feature importance (right). Age, comorbidity cou…

**Figure 6.** (a) Confusion matrix for the fusion model. Tumour nuclei and non-tumour nuclei are classified well. The mitosis class remains the primary challenge, with most errors concentrated between mitosis and non-tumour nuclei, reflecting morphological ambiguity and class imbalance, though the shift in error patterns compared to ResNet alone reflects a contribution from the tabular clinical branch. (b) ROC curves for…

**Figure 7.** Comparative performance of the ResNet-18 image-only model (blue) and the intermediate fusion model (orange) across five evaluation metrics. The Fusion model consistently matches or exceeds ResNet across all metrics, with the most pronounced gains in recall and ROC-AUC, the metrics most critical for minimizing missed breast cancer diagnoses in clinical practice.

**Figure 8.** Training and validation accuracy (left) and loss (right) for Simple CNN vs. ResNet-18. ResNet achieves faster initial convergence and maintains lower, more stable validation loss, confirming the benefit of residual learning and transfer learning over a shallow convolutional baseline.
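
The counts quoted in the Figure 3 caption are enough for a quick per-class recall check. The short calculation below is a sketch that assumes the listed counts are the complete row for each true class; the caption does not confirm this, and the dictionary layout is simply a convenience for the example.

```python
# Counts copied from the Figure 3 caption; assumed complete per true class.
confusion = {
    "tumour": {"correct": 22091, "errors": 1578},
    "non_tumour": {"correct": 1796, "errors": 9},
    "mitosis": {"correct": 163, "errors": 176 + 15},
}

def per_class_recall(table):
    """Recall for each true class: correct / (correct + errors)."""
    return {name: row["correct"] / (row["correct"] + row["errors"])
            for name, row in table.items()}

recall = per_class_recall(confusion)
macro_recall = sum(recall.values()) / len(recall)

total_correct = sum(row["correct"] for row in confusion.values())
total = sum(row["correct"] + row["errors"] for row in confusion.values())
overall_accuracy = total_correct / total

for name, value in recall.items():
    print(f"{name}: recall = {value:.3f}")
print(f"macro recall = {macro_recall:.3f}, overall accuracy = {overall_accuracy:.3f}")
```

Under that assumption, mitosis recall is 163 of 354 patches, well below the recall of the two majority classes, which matches the caption's remark that errors concentrate on the mitosis class.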

Original abstract

Breast cancer is a leading cause of cancer-related mortality worldwide, and timely accurate diagnosis is critical to improving survival outcomes. While convolutional neural networks (CNNs) have demonstrated strong performance on histopathology image classification, and machine learning models on structured electronic health records (EHR) have shown utility for clinical risk stratification, most existing work treats these modalities in isolation. This paper presents a systematic multimodal framework that integrates patch-level histopathology features from the BreCaHAD dataset with structured clinical data from MIMIC-IV. We train and evaluate unimodal image models (a simple CNN baseline and ResNet-18 with transfer learning), unimodal tabular models (XGBoost and a multilayer perceptron), and an intermediate-fusion model that concatenates latent representations from both modalities. ResNet-18 achieves near-perfect accuracy (1.000) and AUC (1.000) on three-class patch-level classification, while XGBoost achieves 98% accuracy on the EHR prediction task. The intermediate fusion model yields a macro-average AUC of 0.997, outperforming all unimodal baselines and delivering the largest improvements on the diagnostically critical but class-imbalanced mitosis category (AUC 0.994). Grad-CAM and SHAP interpretability analyses validate that model decisions align with established pathological and clinical criteria. Our results demonstrate that multimodal integration delivers meaningful improvements in both predictive performance and clinical transparency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The fusion results are invalid because BreCaHAD images and MIMIC-IV records have no patient overlap or alignment.

The central problem is that this paper trains an image model on BreCaHAD patches and a tabular model on MIMIC-IV records, then concatenates their latent vectors for the fusion stage. Those two datasets come from entirely separate patient cohorts with no shared identifiers or matching procedure described. Any reported gain from the intermediate fusion model therefore rests on artificially combined features rather than actual multimodal patient data. That undercuts the main claim about improved performance on the mitosis class and the macro AUC of 0.997. The numbers may be reproducible on the given splits, but they do not demonstrate cross-modal learning on real clinical instances.

On the positive side, the authors run straightforward baselines (ResNet-18 on images, XGBoost on tabular data) against public datasets and add Grad-CAM plus SHAP explanations. Those steps are transparent and the focus on the imbalanced mitosis category is clinically sensible. The work also stays within standard techniques without overclaiming new architectures.

The soft spots are not minor. There are no details on train-test splits, cross-validation, or checks for data leakage between the image and tabular branches. The near-perfect image accuracy lacks error bars or multiple-run statistics, which is unusual for patch-level histopathology. The tabular task itself is only loosely specified in the abstract.

This paper is mainly useful to people who want to see how off-the-shelf fusion performs on these two specific public resources, but the data mismatch makes the multimodal conclusions unreliable. A reader interested in deployable diagnostic aids would not get actionable evidence here. I would not send it for peer review without a major revision that either obtains paired multimodal records or reframes the experiment as a controlled simulation rather than a clinical fusion study.

Referee Report

1 major / 1 minor

Summary. The paper presents a multimodal framework for early breast cancer diagnosis that integrates patch-level histopathology images from the BreCaHAD dataset with structured EHR data from MIMIC-IV. It evaluates unimodal baselines (CNN and ResNet-18 on images; XGBoost and MLP on tabular data) and an intermediate-fusion model that concatenates latent representations from both modalities, reporting a macro-average AUC of 0.997 for fusion (with AUC 0.994 on the mitosis class) that outperforms unimodal models, along with Grad-CAM and SHAP interpretability analyses.

Significance. If the results hold on properly aligned multimodal patient data, the work would demonstrate the value of intermediate fusion for boosting performance on diagnostically critical but imbalanced classes and for providing clinically aligned interpretability. The systematic comparison of unimodal and fusion strategies on public datasets is a strength that could guide future multimodal medical imaging research.

major comments (1)

[Abstract and dataset description] The intermediate-fusion model concatenates latents from a ResNet-18 trained on BreCaHAD patches and an XGBoost/MLP trained on MIMIC-IV records, yet these source datasets derive from completely disjoint patient cohorts with no shared identifiers, slide-level metadata, or alignment procedure described anywhere in the manuscript. Consequently the concatenated vectors do not correspond to real clinical instances, so the reported performance gains (macro AUC 0.997, mitosis AUC 0.994) cannot be interpreted as evidence of genuine cross-modal interaction learning.
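
One way to probe this concern is a permutation control: re-evaluate the fused model after shuffling which EHR vector is attached to which image, and compare against the original pairing. The sketch below is a hypothetical check, not a procedure from the manuscript; `permutation_control`, `toy_evaluate`, and the toy data are assumptions made for illustration, and `evaluate` stands for whatever scoring routine a reader already has.

```python
import random

def permutation_control(image_feats, ehr_feats, labels, evaluate,
                        n_shuffles=20, seed=0):
    """Score the given pairing, then the mean score over shuffled pairings."""
    rng = random.Random(seed)
    paired_score = evaluate(image_feats, ehr_feats, labels)
    shuffled_scores = []
    for _ in range(n_shuffles):
        order = list(range(len(ehr_feats)))
        rng.shuffle(order)
        shuffled_ehr = [ehr_feats[j] for j in order]
        shuffled_scores.append(evaluate(image_feats, shuffled_ehr, labels))
    return paired_score, sum(shuffled_scores) / n_shuffles

def toy_evaluate(image_feats, ehr_feats, labels):
    """Toy scorer: accuracy of thresholding one EHR feature, ignoring images."""
    preds = [1 if e > 0.5 else 0 for e in ehr_feats]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Toy data (hypothetical): the pairing was built so the EHR value encodes the label.
labels = [0, 1] * 50
image_feats = [[0.0] for _ in labels]
ehr_feats = [float(y) for y in labels]

paired, shuffled_mean = permutation_control(
    image_feats, ehr_feats, labels, toy_evaluate)
print(paired, shuffled_mean)
```

In this toy setup the score collapses once the pairing is shuffled, which shows that the apparent signal came from how rows were paired rather than from any patient-level link. Whether something similar happens in the paper's pipeline cannot be determined from the text reviewed here.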

minor comments (1)

[Results section] The perfect accuracy/AUC of 1.000 on the image task and 0.997 on fusion are reported without error bars, confidence intervals, explicit train-test split ratios, class-balancing details, or leakage checks; these omissions make the unusually high figures difficult to evaluate even if the pairing issue were resolved.
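
For the missing confidence intervals, a percentile bootstrap over test samples is one common option. The sketch below is a generic illustration under that assumption, not the authors' evaluation code; the scores, labels, and the function names `binary_auc` and `bootstrap_auc_ci` are made up for the example.

```python
import random

def binary_auc(scores, positives):
    """AUC as the chance a positive outranks a negative (ties count one half)."""
    pos = [s for s, p in zip(scores, positives) if p]
    neg = [s for s, p in zip(scores, positives) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(scores, positives, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC, resampling test items."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        flags = [positives[i] for i in idx]
        if all(flags) or not any(flags):
            continue  # resample lacks one class, so AUC is undefined
        stats.append(binary_auc([scores[i] for i in idx], flags))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

# Toy one-vs-rest scores for a single class (hypothetical values).
scores = [0.95, 0.9, 0.85, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1, 0.15, 0.05]
positives = [True] * 5 + [False] * 7

point = binary_auc(scores, positives)
lower, upper = bootstrap_auc_ci(scores, positives)
print(f"AUC = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

For patch-level data, resampling individual patches may understate the uncertainty when many patches come from the same slide; resampling at the slide or patient level would be the more conservative choice.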

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive review. We address the single major comment below and describe the revisions that will be made to the manuscript.

Referee: [Abstract and dataset description] The intermediate-fusion model concatenates latents from a ResNet-18 trained on BreCaHAD patches and an XGBoost/MLP trained on MIMIC-IV records, yet these source datasets derive from completely disjoint patient cohorts with no shared identifiers, slide-level metadata, or alignment procedure described anywhere in the manuscript. Consequently the concatenated vectors do not correspond to real clinical instances, so the reported performance gains (macro AUC 0.997, mitosis AUC 0.994) cannot be interpreted as evidence of genuine cross-modal interaction learning.

Authors: We agree with the referee that BreCaHAD and MIMIC-IV are drawn from completely disjoint patient cohorts and that the manuscript contains no alignment procedure or shared identifiers. The original work concatenated independently extracted latent vectors without any patient-level matching, which means the reported fusion results do not reflect genuine cross-modal interaction on aligned clinical cases. We will revise the abstract, methods, results, and discussion sections to explicitly state this limitation, remove any implication of clinical multimodal fusion, and reframe the contribution as a controlled proof-of-concept study of feature-level concatenation across two public datasets. The unimodal baselines and interpretability analyses will remain, but all performance claims will be qualified accordingly. These changes will be reflected in the next manuscript version.

revision: yes

Circularity Check

0 steps flagged

No circularity; results are measured empirical outcomes on public datasets

The paper reports directly measured performance metrics (AUC, accuracy) from training CNN/ResNet on BreCaHAD patches and XGBoost/MLP on MIMIC-IV records, followed by standard latent concatenation for fusion. No equations, predictions, or derivations reduce to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The central claims rest on external benchmark evaluation rather than self-referential fitting or renaming.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The claim rests on standard assumptions that CNNs extract useful features from histopathology patches and that EHR tabular data contains predictive clinical signals, plus the untested premise that their latent spaces are directly concatenable without domain-specific alignment.

free parameters (2)

- Latent dimension for each modality before concatenation: chosen by model architecture (ResNet-18 and MLP output sizes) and not derived from data or theory.
- Class weighting or decision threshold for mitosis category: required to handle imbalance but not specified in abstract.
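
To make the class-weighting parameter concrete, the sketch below computes inverse-frequency ("balanced") weights, total / (number of classes × class count), from the per-class totals implied by the Figure 3 caption. Both the choice of heuristic and the use of those counts are assumptions for illustration; the source does not say which weighting, if any, the authors applied.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * count_for_class)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}

# Per-class test totals derived from the Figure 3 caption (assumed complete).
counts = {
    "tumour": 22091 + 1578,
    "non_tumour": 1796 + 9,
    "mitosis": 163 + 176 + 15,
}

weights = balanced_class_weights(counts)
for name, w in weights.items():
    print(f"{name}: weight = {w:.3f}")
```

With these counts the mitosis class receives by far the largest weight, so the chosen scheme (or a decision threshold used in its place) can move the mitosis metrics substantially and would need to be reported for the results to be reproducible.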

axioms (2)

- [domain assumption] Patch-level histopathology images from BreCaHAD contain sufficient information for reliable three-class classification. Invoked by training ResNet-18 to 1.000 accuracy on the dataset.
- [domain assumption] Structured fields in MIMIC-IV are relevant and aligned with breast cancer diagnostic labels. MIMIC-IV is a general critical-care database; relevance to breast cancer is assumed rather than demonstrated.
