
arxiv: 2605.15235 · v1 · pith:4PRMUGOXnew · submitted 2026-05-13 · 💻 cs.LG

MuteBench: Modality Unavailability Tolerance Evaluation for Incomplete Multimodal Fusion

Pith reviewed 2026-05-19 16:29 UTC · model grok-4.3

classification 💻 cs.LG
keywords multimodal fusion · missing modalities · robustness · clinical AI · benchmark · sensor failure · incomplete data

The pith

Architecture family predicts robustness to missing modalities better than model size in clinical fusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MuteBench to evaluate how multimodal fusion models handle real-world sensor failures in clinical data, where entire channels or time segments can go missing. By testing six architectures across nine datasets from different clinical domains, it establishes that the architecture family is the main driver of tolerance to these failures, more so than the number of parameters or specific training adjustments. Channel-independent designs cope well with complete modality loss but often falter when short sequences lose time segments. The work also shows that dropout training only shields models up to the rates seen in training, and that data properties like channel count and sequence length decide which failure type hurts more. A case study hints that imputation can recover performance for the most vulnerable models.

Core claim

MuteBench systematically applies controlled modality missing and within-modality missing to six fusion architectures on nine clinical datasets. The central finding is that architecture family is the strongest predictor of robustness, outweighing parameter count. Channel-independent models tolerate modality missing well but remain sensitive to within-modality missing on short sequences. Curriculum modality dropout provides protection only up to the highest dropout rate used during training. Channel count, sequence length, and modality alignment together determine which missing-data mode creates the larger threat. Diffusion-based imputation improves downstream classification under within-modality missing.

What carries the argument

MuteBench benchmark that tests fusion architectures under controlled levels of modality missing and within-modality missing across multiple clinical datasets.
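The two failure modes can be illustrated with a short NumPy sketch. It follows the description given on this page (whole modalities dropped with probability p, contiguous time segments masked independently per channel) but is not the benchmark's own code. The function names, the zero-fill convention, the use of exactly one masked segment per channel, and the representation of modalities as channel slices are all assumptions made for illustration.

```python
import numpy as np

def drop_modalities(x, modality_slices, p, rng):
    """Modality missing: drop each modality independently with probability p.

    x has shape (channels, time). modality_slices is a list of slice objects,
    one per modality (hypothetical grouping). Dropped values are zero-filled,
    which is an assumption, not necessarily the benchmark's convention.
    """
    out = x.copy()
    keep = np.ones(x.shape, dtype=bool)
    for sl in modality_slices:
        if rng.random() < p:
            out[sl, :] = 0.0
            keep[sl, :] = False
    return out, keep

def mask_segments(x, rate, rng):
    """Within-modality missing: mask one contiguous segment per channel.

    Each channel loses a segment covering `rate` of its time steps, with the
    start position drawn independently per channel. Assumes 0 <= rate <= 1.
    """
    n_channels, n_steps = x.shape
    seg_len = int(round(rate * n_steps))
    out = x.copy()
    keep = np.ones(x.shape, dtype=bool)
    if seg_len == 0:
        return out, keep
    for c in range(n_channels):
        start = rng.integers(0, n_steps - seg_len + 1)
        out[c, start:start + seg_len] = 0.0
        keep[c, start:start + seg_len] = False
    return out, keep
```

Returning the boolean keep-mask alongside the corrupted signal lets a downstream model or imputer distinguish true zeros from masked values, which matters for signals that legitimately cross zero.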

If this is right

  • Channel-independent architectures provide reliable tolerance when an entire sensor channel disappears.
  • Modality dropout during training only guarantees protection up to the maximum rate applied in that training.
  • Short sequences make within-modality missing more damaging than full channel loss.
  • Imputation helps most for models whose internal routing depends heavily on clean inputs.
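The point about modality dropout can be made concrete with a small sketch. The page does not state the paper's actual curriculum, so the linear ramp below is an assumed schedule, and `curriculum_dropout_rate` and `is_within_trained_range` are hypothetical helper names.

```python
def curriculum_dropout_rate(epoch, total_epochs, p_max):
    """Linearly ramp the modality-dropout rate from 0 to p_max over training.

    The linear shape is an assumption; the summary only says that a
    curriculum is used and that it has a maximum rate.
    """
    if total_epochs <= 1:
        return p_max
    frac = min(max(epoch / (total_epochs - 1), 0.0), 1.0)
    return p_max * frac

def is_within_trained_range(test_rate, p_max):
    """True when an evaluation missing rate was covered during training."""
    return test_rate <= p_max
```

Under this reading, a model trained with `p_max = 0.3` has seen missing rates up to 30%, so an evaluation at 50% falls outside the range where the summary says protection is reliable.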

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers could combine channel-independent processing with explicit temporal-gap handling to address both failure modes at once.
  • Robustness claims should be re-checked on datasets that vary sequence length and channel alignment independently.
  • Deployment decisions may benefit from matching model type to expected failure statistics of the target clinical setting.

Load-bearing premise

The nine clinical datasets and six fusion architectures represent the typical range of real-world multimodal physiological signals and sensor-failure patterns.

What would settle it

A new dataset with different channel counts or sequence lengths where the robustness ranking by architecture family reverses or disappears.

Figures

Figures reproduced from arXiv: 2605.15235 by Chen Chen, Song Wang, Tianlong Chen, Wugeng Zheng, Ziwen Kan.

Figure 1: Overview of MuteBench. We evaluate 9 datasets spanning 7 clinical domains and three …
Figure 2: Evaluation of missing-data conditions. Left (Complete): Original, fully observed signals. Middle (Modality missing): Entire modalities (e.g., B and E) are dropped with probability p, simulating whole-sensor failures. Right (Within-modality missing): Contiguous time segments are masked independently per channel, simulating transient interruptions like motion artifacts.
Figure 3: Degradation analysis. Left: Radar chart of AUROC drop (∆AUROC = clean − missing, averaged over three seeds) across all 9 datasets under modality and within-modality missing at 20% and 50% rates; larger area indicates greater overall sensitivity, and each axis corresponds to one dataset. Right: Detailed degradation trajectory of Flex-MoE on PPG-DaLiA: both AUROC and Macro-F1 decline steeply as missing rate …
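The degradation metric in the Figure 3 caption, ∆AUROC = clean − missing averaged over seeds, can be written as a small helper. This is a minimal sketch: the function name is hypothetical, and it assumes the clean and missing scores are paired by seed.

```python
import numpy as np

def delta_auroc(clean_by_seed, missing_by_seed):
    """Mean AUROC drop across seeds: average of (clean - missing).

    Inputs are per-seed AUROC values in matching order. A larger result
    means the model is more sensitive to the missing-data condition.
    """
    clean = np.asarray(clean_by_seed, dtype=float)
    missing = np.asarray(missing_by_seed, dtype=float)
    if clean.shape != missing.shape:
        raise ValueError("clean and missing scores must be paired by seed")
    return float(np.mean(clean - missing))
```

A call such as `delta_auroc([0.90, 0.91, 0.89], [0.80, 0.82, 0.78])` uses made-up scores purely to show the input shape; none of these numbers come from the paper.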
Original abstract

Multimodal physiological data powers clinical AI systems from intensive care units to wearable devices, but sensors routinely fail in practice. Two failure modes are common: modality missing, where an entire channel is absent, and within-modality missing, where a contiguous time segment is lost. No existing benchmark evaluates multiple fusion architectures under both failure modes at controlled severity levels across diverse clinical datasets. We present MuteBench, a benchmark covering 9 datasets from 7 clinical domains, 6 fusion architectures, and 2 missing-data modes over 125,000 samples. Through this benchmark, we find that architecture family is the strongest predictor of robustness, outweighing parameter count. Channel-independent models tolerate modality missing well but can be sensitive to within-modality missing, especially on short sequences. Curriculum modality dropout protects reliably only up to the maximum dropout rate used in training. We also find that channel count, sequence length, and modality alignment jointly determine which failure mode poses the greater threat. Finally, a PTB-XL case study suggests that diffusion-based imputation can improve downstream classification under within-modality missing, with the largest gains for models whose expert routing is most sensitive to corrupted inputs, though broader validation across datasets remains an open direction. MuteBench provides practitioners with concrete guidance for both selecting existing architectures and informing the design of future robust multimodal fusion methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MuteBench, a benchmark for evaluating multimodal fusion architectures under modality missing and within-modality missing conditions. It covers 9 clinical datasets from 7 domains, 6 fusion architectures, and over 125,000 samples across two failure modes. The central empirical finding is that architecture family is the strongest predictor of robustness to these failures, outweighing parameter count; additional results address curriculum modality dropout limits and the potential benefits of diffusion imputation for sensitive routing models.

Significance. If the robustness rankings hold after controlling for design choices, the benchmark supplies concrete, practitioner-oriented guidance for selecting fusion methods in clinical settings where sensor dropouts are routine. The scale of the evaluation and the explicit comparison of failure modes across domains represent a useful contribution to reproducible multimodal robustness research.

major comments (2)
  1. [Abstract] Abstract and results: the claim that architecture family is the strongest predictor of robustness (outweighing parameter count) is not isolated from confounding differences in missing-data handling. The abstract itself notes that curriculum modality dropout protects only up to the training rate and that diffusion imputation helps models with sensitive routing; without an ablation that equalizes these mechanisms across families, variance attributed to 'family' may instead reflect built-in masking, expert routing, or imputation strategies.
  2. [Abstract] Abstract: the assertion that the nine datasets and six architectures 'sufficiently represent' real-world multimodal physiological signals and sensor-failure patterns is stated without supporting evidence or sensitivity analysis. This assumption is load-bearing for the generalization of the robustness rankings.
minor comments (2)
  1. [Abstract] The abstract reports '125,000 samples' but does not break down the distribution across datasets, architectures, or missing-data severity levels.
  2. Clarify the precise statistical procedure used to rank predictors (architecture family vs. parameter count) and report effect sizes or confidence intervals for the ranking.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript introducing MuteBench. We appreciate the referee's focus on potential confounders in our robustness analysis and the generalizability of the benchmark. We address each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract] Abstract and results: the claim that architecture family is the strongest predictor of robustness (outweighing parameter count) is not isolated from confounding differences in missing-data handling. The abstract itself notes that curriculum modality dropout protects only up to the training rate and that diffusion imputation helps models with sensitive routing; without an ablation that equalizes these mechanisms across families, variance attributed to 'family' may instead reflect built-in masking, expert routing, or imputation strategies.

    Authors: We acknowledge that differences in missing-data handling mechanisms (such as built-in masking, expert routing, or imputation) are inherent to the architecture families evaluated and could contribute to the observed robustness patterns. These mechanisms form part of what distinguishes the families in practical deployments, and our experiments compared representative implementations as they are commonly used. Parameter counts were varied within families where possible to support the family-level finding. To address the concern directly, we will revise the discussion section to explicitly note this potential confounding and highlight the need for future controlled ablations that equalize handling strategies across families. revision: partial

  2. Referee: [Abstract] Abstract: the assertion that the nine datasets and six architectures 'sufficiently represent' real-world multimodal physiological signals and sensor-failure patterns is stated without supporting evidence or sensitivity analysis. This assumption is load-bearing for the generalization of the robustness rankings.

    Authors: The nine datasets were chosen to cover seven distinct clinical domains with differences in channel counts, sequence lengths, sampling rates, and modality alignments, aiming to reflect common physiological signal characteristics and sensor failure scenarios. The total of over 125,000 samples provides scale for the comparisons. We do not claim the selection is exhaustive or perfectly representative of all possible real-world cases. In the revision, we will add a dedicated subsection in the datasets description justifying the selection criteria with a summary table of key characteristics and include a brief sensitivity check by reporting robustness rankings on dataset subsets to assess stability. revision: yes
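The sensitivity check promised in the second response (reporting robustness rankings on dataset subsets) could take a leave-one-dataset-out form. The sketch below is one possible reading, not the authors' procedure: the function names, the use of mean AUROC drop as the ranking score, and the exact-match stability criterion are all assumptions.

```python
import numpy as np

def rank_architectures(drops):
    """Order architectures from most to least robust.

    drops is an (n_architectures, n_datasets) array of AUROC drops;
    a smaller mean drop counts as more robust.
    """
    return np.argsort(drops.mean(axis=1), kind="stable").tolist()

def leave_one_dataset_out_rankings(drops):
    """Recompute the ranking once per dataset, with that dataset removed."""
    n_datasets = drops.shape[1]
    return [
        rank_architectures(np.delete(drops, d, axis=1))
        for d in range(n_datasets)
    ]

def ranking_is_stable(drops):
    """True when every leave-one-out ranking equals the full ranking."""
    full = rank_architectures(drops)
    return all(r == full for r in leave_one_dataset_out_rankings(drops))
```

Exact-match stability is a strict criterion; a rank correlation across subsets would be a softer alternative when neighbouring architectures have nearly equal mean drops.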

Circularity Check

0 steps flagged

No circularity: empirical benchmark derives claims from external dataset comparisons

full rationale

The paper introduces MuteBench as an empirical benchmark evaluating 6 fusion architectures across 9 clinical datasets under controlled missing-data conditions. The central claim that architecture family is the strongest predictor of robustness is obtained directly from experimental results on these external datasets rather than from any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations. No derivation chain reduces to its own inputs by construction; the findings remain falsifiable via replication on the benchmark. This is the expected outcome for a purely empirical evaluation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard machine-learning benchmark assumptions about dataset representativeness and controlled experimental conditions rather than introducing new fitted parameters or invented entities.

axioms (1)
  • domain assumption: The selected clinical datasets and fusion architectures are representative of broader multimodal physiological data scenarios.
    This assumption underpins the generalizability of the reported robustness rankings.

pith-pipeline@v0.9.0 · 5780 in / 1219 out tokens · 53354 ms · 2026-05-19T16:29:45.622394+00:00 · methodology


Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 1 internal anchor

  1. [1]

    doi: 10.1038/s41598-026-39035-z

    Benchmarking imputation strategies for missing time-series data in critical care using real- world-inspired scenarios.Scientific Reports, 2026. doi: 10.1038/s41598-026-39035-z

  2. [2]

    Nat Med28, 1773–1784 (2022) https://doi.org/10.1038/s41591-022-01981-2

    Julián N Acosta, Guido J Falcone, Pranav Rajpurkar, and Eric J Topol. Multimodal biomedical AI.Nature Medicine, 28(9):1773–1784, 2022. doi: 10.1038/s41591-022-01981-2. URL https://www.nature.com/articles/s41591-022-01981-2

  3. [3]

    Ehrxqa: A multi-modal ques- tion answering dataset for electronic health records with chest x-ray images

    Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei Ji, Eric Chang, Tackeun Kim, and Edward Choi. Ehrxqa: A multi-modal ques- tion answering dataset for electronic health records with chest x-ray images. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information ...

  4. [4]

    Curriculum learning,

    Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InProceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 41–48, New York, NY , USA, 2009. Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374.1553380

  5. [5]

    Recurrent neural networks for multivariate time series with missing values.Scientific Reports, 8(1):6085,

    Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values.Scientific Reports, 8(1):6085,

  6. [6]

    Recurrent

    doi: 10.1038/s41598-018-24271-9. URL https://www.nature.com/articles/ s41598-018-24271-9

  7. [7]

    Multimodal clinical benchmark for emergency care (mc-bec): A comprehensive benchmark for evaluating foundation models in emergency medicine

    Emma Chen, Aman Kansal, Julie Chen, Boyang Tom Jin, Julia Rachel Reisler, David A Kim, and Pranav Rajpurkar. Multimodal clinical benchmark for emergency care (mc-bec): A comprehensive benchmark for evaluating foundation models in emergency medicine. 2023. URLhttps://arxiv.org/abs/2311.04937

  8. [8]

    Computational approaches to alleviate alarm fatigue in intensive care medicine: a systematic literature review.Frontiers in Digital Health, 4:843747, 2022

    Jonas Chromik, S A I Klopfenstein, Bjarne Pfitzner, Zeynab C Sinno, Bert Arnrich, Felix Balzer, and Akira-Sebastian Poncette. Computational approaches to alleviate alarm fatigue in intensive care medicine: a systematic literature review.Frontiers in Digital Health, 4:843747, 2022. doi: 10.3389/fdgth.2022.843747. URLhttps://doi.org/10.3389/fdgth.2022.843747

  9. [9]

    CLIMB: Data foundations for large scale multimodal clinical foundation models

    Wei Dai, Peilin Chen, Malinda Lu, Daniel A Li, Haowen Wei, Hejie Cui, and Paul Pu Liang. CLIMB: Data foundations for large scale multimodal clinical foundation models. InForty- second International Conference on Machine Learning, 2025. URL https://openreview. net/forum?id=TcvjOSePic

  10. [10]

    Wearable sensors enable personalized predictions of clinical laboratory measurements

    Jessilyn Dunn, Lukasz Kidzinski, Ryan Runge, Daniel Witt, Jennifer L Hicks, Sophia Miryam Schüssler-Fiorenza Rose, Xiao Li, Amir Bahmani, Scott L Delp, Trevor Hastie, and Michael P Snyder. Wearable sensors enable personalized predictions of clinical laboratory measurements. Nature Medicine, 27(6):1105–1112, 2021. doi: 10.1038/s41591-021-01339-0. URL https...

  11. [11]

    Autonomous medical evaluation for guideline adherence of large language models.NPJ Digital Medicine, 7(1):358, 2024

    Dennis Fast, Lisa C Adams, Felix Busch, Conor Fallon, Marc Huppertz, Robert Siepmann, Philipp Prucker, Nadine Bayerl, Daniel Truhn, Marcus Makowski, et al. Autonomous medical evaluation for guideline adherence of large language models.NPJ Digital Medicine, 7(1):358, 2024. 10

  12. [12]

    Finlayson, Adarsh Subbaswamy, Karandeep Singh, John Bowers, Annabel Kupke, Jonathan Zittrain, Isaac S

    Samuel G. Finlayson, Adarsh Subbaswamy, Karandeep Singh, John Bowers, Annabel Kupke, Jonathan Zittrain, Isaac S. Kohane, and Suchi Saria. The clinician and dataset shift in artificial intelligence.New England Journal of Medicine, 385(3):283–286, 2021. doi: 10.1056/NEJMc2104626

  13. [13]

    PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals,

    Ary L. Goldberger, Luis A. N. Amaral, Leon Glass, Jeffrey M. Hausdorff, Plamen Ch. Ivanov, Roger G. Mark, Joseph E. Mietus, George B. Moody, Chung-Kang Peng, and H. Eugene Stanley. Physiobank, physiotoolkit, and physionet.Circulation, 101(23):e215–e220, 2000. doi: 10.1161/01.CIR.101.23.e215

  14. [14]

    Fusemoe: Mixture-of- experts transformers for fleximodal fusion.arXiv preprint arXiv:2402.03226, 2024

    Xing Han, Huy Nguyen, Carl Harris, Nhat Ho, and Suchi Saria. Fusemoe: Mixture-of- experts transformers for fleximodal fusion.arXiv preprint arXiv:2402.03226, 2024. URL https://arxiv.org/abs/arXiv:2402.03226

  15. [15]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pages 6840–6851. Curran Associates, Inc., 2020

  16. [16]

    VBench: Comprehensive benchmark suite for video gener- ative models

    Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Om- nimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22170–22183, 2024. doi: 10.1109/CVPR52733.2024.02093

  17. [17]

    Modality compe- tition: What makes joint training of multi-modal network fail in deep learning? (Provably)

    Yu Huang, Junyang Lin, Chang Zhou, Hongxia Yang, and Longbo Huang. Modality compe- tition: What makes joint training of multi-modal network fail in deep learning? (Provably). InProceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 9226–9259. PMLR, 17–23 Jul 2022. URL https://pro...

  18. [18]

    M3CoTBench: Benchmark chain-of-thought of MLLMs in medical image understanding.arXiv preprint arXiv:2601.08758, 2026

    Juntao Jiang, Jiangning Zhang, Yali Bi, Jinsheng Bai, Weixuan Liu, Weiwei Jin, Zhucun Xue, Yong Liu, Xiaobin Hu, and Shuicheng Yan. M3CoTBench: Benchmark chain-of-thought of MLLMs in medical image understanding.arXiv preprint arXiv:2601.08758, 2026

  19. [19]

    What disease does this patient have? A large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14), 2021. ISSN 2076-3417. doi: 10.3390/app11146421. URLhttps://www.mdpi.com/2076-3417/11/14/6421

  20. [20]

    Mimic-cxr-jpg: Chest radiographs with structured labels.PhysioNet, 2019

    Alistair Johnson, Matthew Lungren, Yifan Peng, Zhiyong Lu, Roger Mark, Seth Berkowitz, and Steven Horng. Mimic-cxr-jpg: Chest radiographs with structured labels.PhysioNet, 2019. doi: 10.13026/8360-t248

  21. [21]

    MIMIC-IV.PhysioNet, October 2024

    Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Brian Gow, Benjamin Moody, Steven Horng, Leo Anthony Celi, and Roger Mark. MIMIC-IV.PhysioNet, October 2024. doi: 10.13026/ kpb9-mt58. URLhttps://doi.org/10.13026/kpb9-mt58. Version 3.1

  22. [22]

    Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the eeg,

    B. Kemp, A.H. Zwinderman, B. Tuk, H.A.C. Kamphuisen, and J.J.L. Oberye. Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the eeg.IEEE Transactions on Biomedical Engineering, 47(9):1185–1194, 2000. doi: 10.1109/10.867928. URLhttps://physionet.org/content/sleep-edfx/1.0.0/

  23. [23]

    Multimodal prompting with missing modalities for visual recognition

    Yi-Lun Lee, Yi-Hsuan Tsai, Wei-Chen Chiu, and Chen-Yu Lee. Multimodal prompting with missing modalities for visual recognition. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  24. [24]

    MIRA: Medical time series foundation model for real-world health data.arXiv preprint arXiv:2506.07584, 2025

    Hao Li, Bowen Deng, Chang Xu, Zhiyuan Feng, Viktor Schlegel, Yu-Hao Huang, Yizheng Sun, Jingyuan Sun, Kailai Yang, Yiyao Yu, et al. MIRA: Medical time series foundation model for real-world health data.arXiv preprint arXiv:2506.07584, 2025. URL https: //arxiv.org/abs/2506.07584

  25. [25]

    MedGUIDE: Benchmarking clinical decision-making in large language models.arXiv preprint arXiv:2505.11613, 2025

    Xiaomin Li, Mingye Gao, Yuexing Hao, Taoran Li, Guangya Wan, Zihan Wang, and Yijun Wang. MedGUIDE: Benchmarking clinical decision-making in large language models.arXiv preprint arXiv:2505.11613, 2025. 11

  26. [26]

    MULTIZOO & MULTIBENCH: A standardized toolkit for multimodal deep learning.Journal of Machine Learning Research, 24:1–7, 2023

    Paul Pu Liang, Yiwei Lyu, Xiang Fan, Arav Agarwal, Yun Cheng, Louis-Philippe Morency, and Ruslan Salakhutdinov. MULTIZOO & MULTIBENCH: A standardized toolkit for multimodal deep learning.Journal of Machine Learning Research, 24:1–7, 2023

  27. [27]

    SMIL: Multimodal learning with severely missing modality

    Mengmeng Ma, Jian Ren, Long Zhao, Sergey Tulyakov, Cathy Wu, and Xi Peng. SMIL: Multimodal learning with severely missing modality. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 2302–2310, 2021. doi: 10.1609/aaai.v35i3.16330. URLhttps://ojs.aaai.org/index.php/AAAI/article/view/16330

  28. [28]

    Up-fall detection dataset: A multimodal approach

    Lourdes Martínez-Villaseñor, Hiram Ponce, Jorge Brieva, Ernesto Moya-Albor, José Núñez- Martínez, and Carlos Peñafort-Asturiano. Up-fall detection dataset: A multimodal approach. Sensors, 19(9), 2019. ISSN 1424-8220. doi: 10.3390/s19091988. URL https://www.mdpi. com/1424-8220/19/9/1988

  29. [29]

    The CirCor DigiScope Phonocardiogram Dataset.PhysioNet, May 2022

    Jorge Oliveira, Francesco Renna, Paulo Costa, Marcelo Nogueira, Ana Cristina Oliveira, Andoni Elola, Carlos Ferreira, Alipio Jorge, Ali Bahrami Rad, Matthew Reyna, Reza Sameni, Gari Clifford, and Miguel Coimbra. The CirCor DigiScope Phonocardiogram Dataset.PhysioNet, May 2022. doi: 10.13026/tshs-mw03. URL https://doi.org/10.13026/tshs-mw03. Version 1.0.3

  30. [30]

    Maestro: Adaptive sparse attention and robust learning for multimodal dynamic time series

    Akash Pandey Payal Mohapatra, Yueyuan Sui, Stephen Xia, and Qi Zhu. Maestro: Adaptive sparse attention and robust learning for multimodal dynamic time series. InNeurIPS, 2025

  31. [31]

    PPG-DaLiA

    Attila Reiss, Ina Indlekofer, and Philip Schmidt. PPG-DaLiA. 2019. URL https://archive. ics.uci.edu/dataset/495/ppg+dalia. DOI: https://doi.org/10.24432/C53890

  32. [32]

    Introducing wesad, a multimodal dataset for wearable stress and affect detection,

    Philip Schmidt, Attila Reiss, Robert Duerichen, Claus Marberger, and Kristof Van Laerhoven. Introducing wesad, a multimodal dataset for wearable stress and affect detection. InProceedings of the 20th ACM International Conference on Multimodal Interaction, ICMI ’18, page 400–408, New York, NY , USA, 2018. Association for Computing Machinery. ISBN 978145035...

  33. [33]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

  34. [34]

    DeepSOFA: a continuous acuity score for critically ill patients using clinically interpretable deep learning.Scientific Reports, 9(1):1879, 2019

    Benjamin Shickel, Tyler J Loftus, Lasith Adhikari, Tezcan Ozrazgat-Baslanti, Azra Bihorac, and Parisa Rashidi. DeepSOFA: a continuous acuity score for critically ill patients using clinically interpretable deep learning.Scientific Reports, 9(1):1879, 2019. doi: 10.1038/ s41598-019-38491-0. URLhttps://pmc.ncbi.nlm.nih.gov/articles/PMC6372608/

  35. [35]

    Multi-time attention networks for irregularly sampled time series

    Satya Narayan Shukla and Benjamin M Marlin. Multi-time attention networks for irregularly sampled time series. InInternational Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=4c0J6lwQ4_

  36. [36]

    Predicting in- hospital mortality of icu patients: The physionet/computing in cardiology challenge 2012

    Ikaro Silva, George Moody, Daniel J Scott, Leo A Celi, and Roger G Mark. Predicting in- hospital mortality of icu patients: The physionet/computing in cardiology challenge 2012. In2012 Computing in Cardiology, pages 245–248, 2012. URL https://physionet.org/ content/challenge-2012/1.0.0/

  37. [37]

    Large language models encode clinical knowledge.Nature, 2023

    Karan Singhal, Shekoofeh Azizi, Tao Tu, Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Perry Payne, Stephen Pfohl, Martin Seneviratne, Paul Gamble, Christopher Kelly, Abubakr Abdelrazig Hassan Babiker, Nathanael Schaerli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Aguera-Arcas, Dale Webst...

  38. [38]

    Integrated multimodal artificial intelligence framework for healthcare applications.NPJ Digital Medicine, 5(1):149, 2022

    Luis R Soenksen, Yu Ma, Cynthia Zeng, Leonard Boussioux, Kimberly Villalobos Carballo, Liangyuan Na, Holly M Wiberg, Michael L Li, Ignacio Fuentes, and Dimitris Bertsimas. Integrated multimodal artificial intelligence framework for healthcare applications.NPJ Digital Medicine, 5(1):149, 2022. doi: 10.1038/s41746-022-00689-4. URL https://www.nature. com/ar...

  39. [39]

    Available: https://doi.org/10.1109/JBHI.2020.3022989

    Nils Strodthoff, Patrick Wagner, Tobias Schaeffter, and Wojciech Samek. Deep learning for ecg analysis: Benchmarks and insights from ptb-xl.IEEE Journal of Biomedical and Health Informatics, 25(5):1519–1528, 2021. doi: 10.1109/JBHI.2020.3022989

  40. [40]

    Csdi: Conditional score-based diffusion models for probabilistic time series imputation.Advances in neural information processing systems, 34:24804–24816, 2021

    Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. Csdi: Conditional score-based diffusion models for probabilistic time series imputation.Advances in neural information processing systems, 34:24804–24816, 2021

  41. [41]

    NEJM AI1(3), 2300138 (2024) https://doi.org/10.1056/AIoa2300138 https://ai.nejm.org/doi/pdf/10.1056/AIoa2300138

    Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, Anil Palepu, Basil Mustafa, Aakanksha Chowdhery, Yun Liu, Simon Kornblith, David Fleet, Philip Mansfield, Sushant Prakash, Renee Wong, Sunny Virmani, Christopher Semturs, S. Sara Mahdavi, Bradley Green, Ewa Dominow...

  42. [42]

    Real-time quality index to control data loss in real-life cardiac monitoring applications.Sensors, 21(16):5357,

    Guillaume Vila, Clément Godin, Sylvie Charbonnier, and Aurélie Campagne. Real-time quality index to control data loss in real-life cardiac monitoring applications.Sensors, 21(16):5357,

  43. [43]

    URLhttps://doi.org/10.3390/s21165357

    doi: 10.3390/s21165357. URLhttps://doi.org/10.3390/s21165357

  44. [44]

    2020 , url =

    Patrick Wagner, Nils Strodthoff, Ralf-Dieter Bousseljot, Wojciech Samek, and Tobias Schaeffter. PTB-XL, a large publicly available electrocardiography dataset.PhysioNet, April 2020. doi: 10.13026/x4td-x982. URLhttps://doi.org/10.13026/x4td-x982. Version 1.0.1

  45. [45]

    Multi- modal learning with missing modality via shared-specific feature modelling

    Hu Wang, Yuanhong Chen, Congbo Ma, Jodie Avery, Louise Hull, and Gustavo Carneiro. Multi- modal learning with missing modality via shared-specific feature modelling. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15878–15887, 2023

  46. [46]

    Learning fused pixel and feature-based view reconstructions for light fields,

    Weiyao Wang, Du Tran, and Matt Feiszli. What makes training multi-modal classification networks hard? In2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12692–12702, 2020. doi: 10.1109/CVPR42600.2020.01271

  47. [47]

    Multimodal risk prediction with physiolog- ical signals, medical images and clinical notes.Heliyon, 10(5):e26772, 2024

    Yuanlong Wang, Changchang Yin, and Ping Zhang. Multimodal risk prediction with physiolog- ical signals, medical images and clinical notes.Heliyon, 10(5):e26772, 2024. ISSN 2405-8440. doi: https://doi.org/10.1016/j.heliyon.2024.e26772

  48. [48]

    Narayan, Errol Colak, Adewole Adamson, Laura Heacock, Geoffrey H

    Kathryn Wantlin, Chenwei Wu, Shih-Cheng Huang, Oishi Banerjee, Farah Dadabhoy, Veeral Vipin Mehta, Ryan Wonhee Han, Fang Cao, Raja R. Narayan, Errol Colak, Adewole Adamson, Laura Heacock, Geoffrey H. Tison, Alex Tamkin, and Pranav Rajpurkar. Benchmd: A benchmark for modality-agnostic learning on medical images and sensors, 2023

  49. [49]

    Multimodal machine learning in image-based and clini- cal biomedicine: survey and prospects.International Journal of Computer Vision, 132(9): 3753–3769, 2024

    Elisa Warner, Joonsang Lee, William Hsu, Tanveer Syeda-Mahmood, Charles E Kahn Jr, Olivier Gevaert, and Arvind Rao. Multimodal machine learning in image-based and clini- cal biomedicine: survey and prospects.International Journal of Computer Vision, 132(9): 3753–3769, 2024. doi: 10.1007/s11263-024-02032-8. URL https://link.springer.com/ article/10.1007/s1...

  50. [50]

    DrFuse: Learning disentangled representation for clinical multi-modal fusion with missing modality and modal inconsistency

    Wenfang Yao, Kejing Yin, William K Cheung, Jia Liu, and Jing Qin. DrFuse: Learning disentangled representation for clinical multi-modal fusion with missing modality and modal inconsistency. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 16416–16424, 2024. doi: 10.1609/aaai.v38i15.29578. URL https://ojs.aaai.org/ index.ph...

  51. [51]

    Sukwon Yun, Inyoung Choi, Jie Peng, Yangfan Wu, Jingxuan Bao, Qiyiwen Zhang, Jiayi Xin, Qi Long, and Tianlong Chen. Flex-moe: Modeling arbitrary modality combination via the flexible mixture-of-experts, 2024. URL https://arxiv.org/abs/2410.08245

  52. [52]

    Chaohe Zhang, Xu Chu, Liantao Ma, Yinghao Zhu, Yasha Wang, Jiangtao Wang, and Junfeng Zhao. M3Care: Learning with missing modalities in multimodal healthcare data. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’22, pages 2418–2428, 2022. doi: 10.1145/3534678.3539388.

  53. [53]

    Xiang Zhang, Marko Zeman, Theodoros Tsiligkaridis, and Marinka Zitnik. Graph-guided network for irregularly sampled multivariate time series. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Kwm8I7dU-l5

  54. [54]

    Jianwei Zheng, Hangyuan Guo, and Huimin Chu. A large scale 12-lead electrocardiogram database for arrhythmia study. PhysioNet, August 2022. doi: 10.13026/wgex-er52. URL https://doi.org/10.13026/wgex-er52. Version 1.0.0.

A Broader Impacts

MuteBench provides practitioners with concrete, dataset-aware guidance for selecting multimodal fusion architectures tha...

  55. [55]

    Clinical time series (C = 30, T = 48): Vital signs and laboratory values aggregated into 1-hour bins over the first 48 hours of ICU admission. The 30 channels include heart rate, systolic/diastolic/mean arterial blood pressure, respiratory rate, body temperature, SpO2, and key biochemical markers such as glucose, creatinine, potassium, sodium, and bicarbonate.

  56. [56]

    Chest X-ray features (1024-D static vector): Visual embeddings pre-extracted from the most recent chest radiograph sourced from MIMIC-CXR-JPG [19], following the multimodal configuration of Han et al. [13], using a pretrained thoracic image encoder, capturing structural lung and cardiac pathology including effusions, cardiomegaly, and consolidations.

  57. [57]

    ECG features (256-D static vector): Temporal embeddings pre-extracted from the 12-lead ECG recording closest to ICU admission time, encoding arrhythmia and ischaemia patterns in a compact representation.

  58. [58]

    Clinical text features (768-D static vector): Semantic embeddings pre-extracted from clinical notes (nursing notes, discharge summaries) using a pretrained BERT-based clinical language model, encoding free-text observations not captured by structured variables. Modalities 2–4 are static vectors with no time axis and cannot be aligned with the hourly time series, placing this dataset in Type 3 (heterogeneous and unaligned).
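
The four modalities described above can be pictured as one array of shape (C, T) = (30, 48) plus three static vectors. The following is a minimal sketch of the hourly binning step, not the authors' preprocessing code. It assumes that observations falling in the same 1-hour bin are averaged (the text says only "aggregated") and that empty bins are left as NaN. The function name `bin_hourly` and the dictionary keys are hypothetical, and the zero vectors are placeholders for the pre-extracted embeddings.

```python
import numpy as np


def bin_hourly(times_h, values, channels, C=30, T=48):
    """Aggregate irregular observations into a (C, T) grid of hourly means.

    times_h:  hours since ICU admission, one entry per observation
    values:   observed values
    channels: channel index in [0, C) for each observation
    Bins with no observation are left as NaN.
    Mean aggregation is an assumption of this sketch.
    """
    times_h = np.asarray(times_h, dtype=float)
    values = np.asarray(values, dtype=float)
    channels = np.asarray(channels, dtype=int)

    bins = np.floor(times_h).astype(int)
    keep = (bins >= 0) & (bins < T)  # keep only the first T hours

    sums = np.zeros((C, T))
    counts = np.zeros((C, T))
    # np.add.at accumulates correctly when an index pair repeats.
    np.add.at(sums, (channels[keep], bins[keep]), values[keep])
    np.add.at(counts, (channels[keep], bins[keep]), 1)

    grid = np.full((C, T), np.nan)
    np.divide(sums, counts, out=grid, where=counts > 0)
    return grid


# One sample with the four modalities (hypothetical key names).
sample = {
    "clinical_ts": bin_hourly([0.2, 0.7, 5.5], [80.0, 90.0, 37.1], [0, 0, 5]),
    "cxr": np.zeros(1024),   # placeholder for the chest X-ray embedding
    "ecg": np.zeros(256),    # placeholder for the ECG embedding
    "text": np.zeros(768),   # placeholder for the clinical text embedding
}
```

Only `clinical_ts` has a time axis, which is why the three static vectors cannot be aligned with it.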

  59. [59]

    Compute the block length range: $\ell_{\min} = \lceil \texttt{block\_n} \cdot T \rceil$, $\ell_{\max} = \lceil \texttt{block\_n\_max} \cdot T \rceil$. We use block_n = 0.05 and block_n_max = 0.10, so each block covers 5–10% of $T$.

  60. [60]

    Estimate the number of blocks required to cover fraction block_m of the sequence: $k = \dfrac{\texttt{block\_m} \cdot T}{(\ell_{\min} + \ell_{\max})/2}$.

  61. [61]

    For each block, uniformly sample a start position and check for overlap with already-placed blocks. If an overlap is found, resample up to 64 times; if no valid position is found, stop placing further blocks for this channel

  62. [62]

    Set mask[c, start:end] ← 0 for each placed block. Each channel uses an independent sub-generator: before iterating over channels, the shared rng draws one 64-bit seed per channel upfront, and each channel’s block placement proceeds from its own np.random.default_rng. This ensures that different channels miss different time windows while the entire per-sample p...
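
The block-placement steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation. Rounding k to the nearest integer, drawing each block length uniformly from [ℓmin, ℓmax], treating the 64 resamples as 64 attempts in total, and drawing seeds from the non-negative int64 range are all choices made for this sketch. The function name `block_mask` and the argument `max_tries` are hypothetical.

```python
import math

import numpy as np


def block_mask(C, T, block_m, block_n=0.05, block_n_max=0.10,
               rng=None, max_tries=64):
    """Return a (C, T) mask where 1 = observed and 0 = missing."""
    rng = np.random.default_rng() if rng is None else rng

    # Block length range.
    l_min = math.ceil(block_n * T)
    l_max = math.ceil(block_n_max * T)

    # Number of blocks. Rounding to an integer is an assumption.
    k = int(round(block_m * T / ((l_min + l_max) / 2)))

    # One seed per channel, drawn upfront from the shared generator.
    # Assumption: the non-negative int64 range stands in for "64-bit".
    seeds = rng.integers(0, np.iinfo(np.int64).max, size=C, dtype=np.int64)

    mask = np.ones((C, T), dtype=np.int8)
    for c in range(C):
        ch_rng = np.random.default_rng(int(seeds[c]))
        placed = []  # half-open (start, end) intervals already masked
        for _ in range(k):
            # Assumption: block length is uniform on [l_min, l_max].
            length = min(int(ch_rng.integers(l_min, l_max + 1)), T)
            position = None
            for _ in range(max_tries):
                start = int(ch_rng.integers(0, T - length + 1))
                end = start + length
                if all(end <= s or start >= e for s, e in placed):
                    position = (start, end)
                    break
            if position is None:
                break  # no valid position: stop placing for this channel
            placed.append(position)
            mask[c, position[0]:position[1]] = 0
    return mask


# With T = 48: l_min = ceil(2.4) = 3, l_max = ceil(4.8) = 5.
# A hypothetical block_m = 0.5 gives k = 24 / 4 = 6 blocks per channel.
mask = block_mask(C=30, T=48, block_m=0.5, rng=np.random.default_rng(0))
```

Because every per-channel seed comes from the shared generator before any block is placed, reusing the same top-level seed reproduces the same mask, while channels still lose different time windows.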

  63. [63]

    All datasets were collected under IRB approval or equivalent ethical review by their original data providers; this paper only reuses fully de-identified, publicly released data

    Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...