MuteBench: Modality Unavailability Tolerance Evaluation for Incomplete Multimodal Fusion

Chen Chen; Song Wang; Tianlong Chen; Wugeng Zheng; Ziwen Kan

arxiv: 2605.15235 · v1 · pith:4PRMUGOXnew · submitted 2026-05-13 · 💻 cs.LG

Pith reviewed 2026-05-19 16:29 UTC · model grok-4.3

keywords: multimodal fusion, missing modalities, robustness, clinical AI, benchmark, sensor failure, incomplete data

The pith

Architecture family predicts robustness to missing modalities better than model size in clinical fusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MuteBench to evaluate how multimodal fusion models handle real-world sensor failures in clinical data, where entire channels or time segments can go missing. By testing six architectures across nine datasets from different clinical domains, it establishes that the architecture family is the main driver of tolerance to these failures, more so than the number of parameters or specific training adjustments. Channel-independent designs cope well with complete modality loss but often falter when short sequences lose time segments. The work also shows that dropout training only shields models up to the rates seen in training, and that data properties like channel count and sequence length decide which failure type hurts more. A case study hints that imputation can recover performance for the most vulnerable models.

Core claim

MuteBench systematically applies controlled modality missing and within-modality missing to six fusion architectures on nine clinical datasets. The central finding is that architecture family is the strongest predictor of robustness, outweighing parameter count. Channel-independent models tolerate modality missing well but remain sensitive to within-modality missing on short sequences. Curriculum modality dropout provides protection only up to the highest dropout rate used during training. Channel count, sequence length, and modality alignment together determine which missing-data mode creates the larger threat. In a PTB-XL case study, diffusion-based imputation improves downstream classification under within-modality missing.

What carries the argument

MuteBench benchmark that tests fusion architectures under controlled levels of modality missing and within-modality missing across multiple clinical datasets.

If this is right

Channel-independent architectures provide reliable tolerance when an entire sensor channel disappears.
Modality dropout during training only guarantees protection up to the maximum rate applied in that training.
Short sequences make within-modality missing more damaging than full channel loss.
Imputation helps most for models whose internal routing depends heavily on clean inputs.
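
The dropout point above can be made concrete with a small sketch. The paper's actual curriculum schedule is not described on this page, so the linear ramp, the `warmup_epochs` parameter, and both function names are assumptions for illustration only. The sketch shows why the training signal never covers missing rates above `p_max`.

```python
def curriculum_dropout_rate(epoch: int, warmup_epochs: int, p_max: float) -> float:
    """Linearly ramp the modality dropout rate from 0 to p_max, then hold.

    Assumption: a linear ramp; the benchmark may use a different schedule.
    """
    if warmup_epochs <= 0:
        return p_max
    return p_max * min(1.0, epoch / warmup_epochs)


def covered_by_training(test_rate: float, p_max: float) -> bool:
    """True when a test-time missing rate lies inside the trained range."""
    return test_rate <= p_max


# Hypothetical settings: ramp to a 30% dropout rate over 10 epochs.
schedule = [curriculum_dropout_rate(e, warmup_epochs=10, p_max=0.3) for e in range(15)]
```

Under these assumed settings the schedule tops out at 0.3, so a 50% test-time missing rate falls outside anything the model saw during training.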

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Designers could combine channel-independent processing with explicit temporal-gap handling to address both failure modes at once.
Robustness claims should be re-checked on datasets that vary sequence length and channel alignment independently.
Deployment decisions may benefit from matching model type to expected failure statistics of the target clinical setting.

Load-bearing premise

The nine clinical datasets and six fusion architectures represent the typical range of real-world multimodal physiological signals and sensor-failure patterns.

What would settle it

A new dataset with different channel counts or sequence lengths where the robustness ranking by architecture family reverses or disappears.

Figures

Figures reproduced from arXiv: 2605.15235 by Chen Chen, Song Wang, Tianlong Chen, Wugeng Zheng, Ziwen Kan.

**Figure 2.** Evaluation of missing-data conditions. Left (Complete): Original, fully observed signals. Middle (Modality missing): Entire modalities (e.g., B and E) are dropped with probability p, simulating whole-sensor failures. Right (Within-modality missing): Contiguous time segments are masked independently per channel, simulating transient interruptions like motion artifacts.
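
The two conditions in this caption can be sketched in a few lines of standard-library Python. This is an illustration, not the benchmark's code: the `Sample` layout, the use of `None` as the mask value, and the choice of one segment per channel covering `rate` times the sequence length are all assumptions. The sketch also does not guard against every modality being dropped at once, which a real pipeline might need to handle.

```python
import random
from typing import Dict, List, Optional

Signal = List[Optional[float]]      # one channel; None marks a masked sample (assumed encoding)
Sample = Dict[str, List[Signal]]    # modality name -> list of channels (assumed layout)


def drop_modalities(sample: Sample, p: float, rng: random.Random) -> Sample:
    """Modality missing: mask every channel of a modality with probability p."""
    out: Sample = {}
    for name, channels in sample.items():
        if rng.random() < p:
            out[name] = [[None] * len(ch) for ch in channels]
        else:
            out[name] = [list(ch) for ch in channels]
    return out


def mask_segments(sample: Sample, rate: float, rng: random.Random) -> Sample:
    """Within-modality missing: mask one contiguous segment per channel.

    The segment spans round(rate * length) samples, and its start position
    is drawn independently for each channel.
    """
    out: Sample = {}
    for name, channels in sample.items():
        masked = []
        for ch in channels:
            n = len(ch)
            span = round(rate * n)
            start = rng.randint(0, n - span)  # randint is inclusive on both ends
            masked.append(
                [None if start <= t < start + span else v for t, v in enumerate(ch)]
            )
        out[name] = masked
    return out
```

Both functions return copies and leave the input untouched, so the same clean sample can be reused across missing rates and seeds.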

**Figure 3.** Degradation analysis. Left: Radar chart of AUROC drop (∆AUROC = clean − missing, averaged over three seeds) across all 9 datasets under modality and within-modality missing at 20% and 50% rates; larger area indicates greater overall sensitivity, and each axis corresponds to one dataset. Right: Detailed degradation trajectory of Flex-MoE on PPG-DaLiA: both AUROC and Macro-F1 decline steeply as missing rate …
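
The ∆AUROC quantity in this caption is simple to compute. The sketch below assumes a binary task and uses the pairwise (Mann-Whitney) form of AUROC; the function names are illustrative, and the paper's own evaluation code, including how it handles multi-class datasets, is not shown on this page.

```python
from statistics import mean
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Binary AUROC: probability that a positive outscores a negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def delta_auroc(clean_runs: Sequence[float], missing_runs: Sequence[float]) -> float:
    """Mean over seeds of (clean AUROC - missing AUROC); larger means less robust."""
    return mean(c - m for c, m in zip(clean_runs, missing_runs))
```

The pairwise form is quadratic in the number of samples, which is fine for a sketch; a rank-based implementation would be the usual choice at benchmark scale.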

Original abstract

Multimodal physiological data powers clinical AI systems from intensive care units to wearable devices, but sensors routinely fail in practice. Two failure modes are common: modality missing, where an entire channel is absent, and within-modality missing, where a contiguous time segment is lost. No existing benchmark evaluates multiple fusion architectures under both failure modes at controlled severity levels across diverse clinical datasets. We present MuteBench, a benchmark covering 9 datasets from 7 clinical domains, 6 fusion architectures, and 2 missing-data modes over 125,000 samples. Through this benchmark, we find that architecture family is the strongest predictor of robustness, outweighing parameter count. Channel-independent models tolerate modality missing well but can be sensitive to within-modality missing, especially on short sequences. Curriculum modality dropout protects reliably only up to the maximum dropout rate used in training. We also find that channel count, sequence length, and modality alignment jointly determine which failure mode poses the greater threat. Finally, a PTB-XL case study suggests that diffusion-based imputation can improve downstream classification under within-modality missing, with the largest gains for models whose expert routing is most sensitive to corrupted inputs, though broader validation across datasets remains an open direction. MuteBench provides practitioners with concrete guidance for both selecting existing architectures and informing the design of future robust multimodal fusion methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MuteBench is a new benchmark for clinical multimodal robustness under missing data, with practical observations on model behavior, though the architecture-family claim risks confounding from built-in handling strategies.

Hi, the key takeaway is that this paper creates MuteBench to test multimodal fusion models on clinical physiological data when sensors drop out. It runs nine datasets across seven domains, six architectures, and two missing modes (full channel or within-channel segments) at controlled rates over 125k samples. That setup fills a real gap for ICU and wearable applications where incomplete inputs are routine.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MuteBench, a benchmark for evaluating multimodal fusion architectures under modality missing and within-modality missing conditions. It covers 9 clinical datasets from 7 domains, 6 fusion architectures, and over 125,000 samples across two failure modes. The central empirical finding is that architecture family is the strongest predictor of robustness to these failures, outweighing parameter count; additional results address curriculum modality dropout limits and the potential benefits of diffusion imputation for sensitive routing models.

Significance. If the robustness rankings hold after controlling for design choices, the benchmark supplies concrete, practitioner-oriented guidance for selecting fusion methods in clinical settings where sensor dropouts are routine. The scale of the evaluation and the explicit comparison of failure modes across domains represent a useful contribution to reproducible multimodal robustness research.

major comments (2)

[Abstract] Abstract and results: the claim that architecture family is the strongest predictor of robustness (outweighing parameter count) is not isolated from confounding differences in missing-data handling. The abstract itself notes that curriculum modality dropout protects only up to the training rate and that diffusion imputation helps models with sensitive routing; without an ablation that equalizes these mechanisms across families, variance attributed to 'family' may instead reflect built-in masking, expert routing, or imputation strategies.
[Abstract] Abstract: the assertion that the nine datasets and six architectures 'sufficiently represent' real-world multimodal physiological signals and sensor-failure patterns is stated without supporting evidence or sensitivity analysis. This assumption is load-bearing for the generalization of the robustness rankings.

minor comments (2)

[Abstract] The abstract reports '125,000 samples' but does not break down the distribution across datasets, architectures, or missing-data severity levels.
Clarify the precise statistical procedure used to rank predictors (architecture family vs. parameter count) and report effect sizes or confidence intervals for the ranking.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript introducing MuteBench. We appreciate the referee's focus on potential confounders in our robustness analysis and the generalizability of the benchmark. We address each major comment below, indicating planned revisions where appropriate.

Referee: [Abstract] Abstract and results: the claim that architecture family is the strongest predictor of robustness (outweighing parameter count) is not isolated from confounding differences in missing-data handling. The abstract itself notes that curriculum modality dropout protects only up to the training rate and that diffusion imputation helps models with sensitive routing; without an ablation that equalizes these mechanisms across families, variance attributed to 'family' may instead reflect built-in masking, expert routing, or imputation strategies.

Authors: We acknowledge that differences in missing-data handling mechanisms (such as built-in masking, expert routing, or imputation) are inherent to the architecture families evaluated and could contribute to the observed robustness patterns. These mechanisms form part of what distinguishes the families in practical deployments, and our experiments compared representative implementations as they are commonly used. Parameter counts were varied within families where possible to support the family-level finding. To address the concern directly, we will revise the discussion section to explicitly note this potential confounding and highlight the need for future controlled ablations that equalize handling strategies across families. revision: partial

Referee: [Abstract] Abstract: the assertion that the nine datasets and six architectures 'sufficiently represent' real-world multimodal physiological signals and sensor-failure patterns is stated without supporting evidence or sensitivity analysis. This assumption is load-bearing for the generalization of the robustness rankings.

Authors: The nine datasets were chosen to cover seven distinct clinical domains with differences in channel counts, sequence lengths, sampling rates, and modality alignments, aiming to reflect common physiological signal characteristics and sensor failure scenarios. The total of over 125,000 samples provides scale for the comparisons. We do not claim the selection is exhaustive or perfectly representative of all possible real-world cases. In the revision, we will add a dedicated subsection in the datasets description justifying the selection criteria with a summary table of key characteristics and include a brief sensitivity check by reporting robustness rankings on dataset subsets to assess stability. revision: yes
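
One way such a subset sensitivity check could look is a leave-one-dataset-out re-ranking. The sketch below is an assumption about the form of the check, not the authors' procedure: the `Drops` layout, the function names, and the ranking by mean AUROC drop are all hypothetical, and real inputs would come from the benchmark runs.

```python
from statistics import mean
from typing import Dict, List

# drops[model][dataset] = AUROC drop under a given missing condition (assumed layout)
Drops = Dict[str, Dict[str, float]]


def rank_models(drops: Drops, datasets: List[str]) -> List[str]:
    """Order models from most to least robust (smallest mean drop first)."""
    return sorted(drops, key=lambda m: mean(drops[m][d] for d in datasets))


def leave_one_out_rankings(drops: Drops) -> Dict[str, List[str]]:
    """Re-rank the models once per held-out dataset."""
    datasets = sorted(next(iter(drops.values())))
    return {
        held_out: rank_models(drops, [d for d in datasets if d != held_out])
        for held_out in datasets
    }
```

If the ordering stays the same no matter which dataset is held out, the ranking does not hinge on any single dataset; a ranking that flips would point to the dataset driving the result.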

Circularity Check

0 steps flagged

No circularity: empirical benchmark derives claims from external dataset comparisons

The paper introduces MuteBench as an empirical benchmark evaluating 6 fusion architectures across 9 clinical datasets under controlled missing-data conditions. The central claim that architecture family is the strongest predictor of robustness is obtained directly from experimental results on these external datasets rather than from any self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations. No derivation chain reduces to its own inputs by construction; the findings remain falsifiable via replication on the benchmark. This is the expected outcome for a purely empirical evaluation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard machine-learning benchmark assumptions about dataset representativeness and controlled experimental conditions rather than introducing new fitted parameters or invented entities.

axioms (1)

domain assumption: The selected clinical datasets and fusion architectures are representative of broader multimodal physiological data scenarios.
This assumption underpins the generalizability of the reported robustness rankings.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · 1 internal anchor

[1]

doi: 10.1038/s41598-026-39035-z

Benchmarking imputation strategies for missing time-series data in critical care using real- world-inspired scenarios.Scientific Reports, 2026. doi: 10.1038/s41598-026-39035-z

work page doi:10.1038/s41598-026-39035-z 2026
[2]

Nat Med28, 1773–1784 (2022) https://doi.org/10.1038/s41591-022-01981-2

Julián N Acosta, Guido J Falcone, Pranav Rajpurkar, and Eric J Topol. Multimodal biomedical AI.Nature Medicine, 28(9):1773–1784, 2022. doi: 10.1038/s41591-022-01981-2. URL https://www.nature.com/articles/s41591-022-01981-2

work page doi:10.1038/s41591-022-01981-2 2022
[3]

Ehrxqa: A multi-modal ques- tion answering dataset for electronic health records with chest x-ray images

Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei Ji, Eric Chang, Tackeun Kim, and Edward Choi. Ehrxqa: A multi-modal ques- tion answering dataset for electronic health records with chest x-ray images. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information ...

work page 2023
[4]

Curriculum learning,

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. InProceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 41–48, New York, NY , USA, 2009. Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374.1553380

work page doi:10.1145/1553374.1553380 2009
[5]

Recurrent neural networks for multivariate time series with missing values.Scientific Reports, 8(1):6085,

Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values.Scientific Reports, 8(1):6085,

work page
[6]

Recurrent

doi: 10.1038/s41598-018-24271-9. URL https://www.nature.com/articles/ s41598-018-24271-9

work page doi:10.1038/s41598-018-24271-9
[7]

Multimodal clinical benchmark for emergency care (mc-bec): A comprehensive benchmark for evaluating foundation models in emergency medicine

Emma Chen, Aman Kansal, Julie Chen, Boyang Tom Jin, Julia Rachel Reisler, David A Kim, and Pranav Rajpurkar. Multimodal clinical benchmark for emergency care (mc-bec): A comprehensive benchmark for evaluating foundation models in emergency medicine. 2023. URLhttps://arxiv.org/abs/2311.04937

work page arXiv 2023
[8]

Computational approaches to alleviate alarm fatigue in intensive care medicine: a systematic literature review.Frontiers in Digital Health, 4:843747, 2022

Jonas Chromik, S A I Klopfenstein, Bjarne Pfitzner, Zeynab C Sinno, Bert Arnrich, Felix Balzer, and Akira-Sebastian Poncette. Computational approaches to alleviate alarm fatigue in intensive care medicine: a systematic literature review.Frontiers in Digital Health, 4:843747, 2022. doi: 10.3389/fdgth.2022.843747. URLhttps://doi.org/10.3389/fdgth.2022.843747

work page doi:10.3389/fdgth.2022.843747 2022
[9]

CLIMB: Data foundations for large scale multimodal clinical foundation models

Wei Dai, Peilin Chen, Malinda Lu, Daniel A Li, Haowen Wei, Hejie Cui, and Paul Pu Liang. CLIMB: Data foundations for large scale multimodal clinical foundation models. InForty- second International Conference on Machine Learning, 2025. URL https://openreview. net/forum?id=TcvjOSePic

work page 2025
[10]

Wearable sensors enable personalized predictions of clinical laboratory measurements

Jessilyn Dunn, Lukasz Kidzinski, Ryan Runge, Daniel Witt, Jennifer L Hicks, Sophia Miryam Schüssler-Fiorenza Rose, Xiao Li, Amir Bahmani, Scott L Delp, Trevor Hastie, and Michael P Snyder. Wearable sensors enable personalized predictions of clinical laboratory measurements. Nature Medicine, 27(6):1105–1112, 2021. doi: 10.1038/s41591-021-01339-0. URL https...

work page doi:10.1038/s41591-021-01339-0 2021
[11]

Autonomous medical evaluation for guideline adherence of large language models.NPJ Digital Medicine, 7(1):358, 2024

Dennis Fast, Lisa C Adams, Felix Busch, Conor Fallon, Marc Huppertz, Robert Siepmann, Philipp Prucker, Nadine Bayerl, Daniel Truhn, Marcus Makowski, et al. Autonomous medical evaluation for guideline adherence of large language models.NPJ Digital Medicine, 7(1):358, 2024. 10

work page 2024
[12]

Finlayson, Adarsh Subbaswamy, Karandeep Singh, John Bowers, Annabel Kupke, Jonathan Zittrain, Isaac S

Samuel G. Finlayson, Adarsh Subbaswamy, Karandeep Singh, John Bowers, Annabel Kupke, Jonathan Zittrain, Isaac S. Kohane, and Suchi Saria. The clinician and dataset shift in artificial intelligence.New England Journal of Medicine, 385(3):283–286, 2021. doi: 10.1056/NEJMc2104626

work page doi:10.1056/nejmc2104626 2021
[13]

PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals,

Ary L. Goldberger, Luis A. N. Amaral, Leon Glass, Jeffrey M. Hausdorff, Plamen Ch. Ivanov, Roger G. Mark, Joseph E. Mietus, George B. Moody, Chung-Kang Peng, and H. Eugene Stanley. Physiobank, physiotoolkit, and physionet.Circulation, 101(23):e215–e220, 2000. doi: 10.1161/01.CIR.101.23.e215

work page doi:10.1161/01.cir.101.23.e215 2000
[14]

Fusemoe: Mixture-of- experts transformers for fleximodal fusion.arXiv preprint arXiv:2402.03226, 2024

Xing Han, Huy Nguyen, Carl Harris, Nhat Ho, and Suchi Saria. Fusemoe: Mixture-of- experts transformers for fleximodal fusion.arXiv preprint arXiv:2402.03226, 2024. URL https://arxiv.org/abs/arXiv:2402.03226

work page arXiv 2024
[15]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pages 6840–6851. Curran Associates, Inc., 2020

work page 2020
[16]

VBench: Comprehensive benchmark suite for video gener- ative models

Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Om- nimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22170–22183, 2024. doi: 10.1109/CVPR52733.2024.02093

work page doi:10.1109/cvpr52733.2024.02093 2024
[17]

Modality compe- tition: What makes joint training of multi-modal network fail in deep learning? (Provably)

Yu Huang, Junyang Lin, Chang Zhou, Hongxia Yang, and Longbo Huang. Modality compe- tition: What makes joint training of multi-modal network fail in deep learning? (Provably). InProceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 9226–9259. PMLR, 17–23 Jul 2022. URL https://pro...

work page 2022
[18]

M3CoTBench: Benchmark chain-of-thought of MLLMs in medical image understanding.arXiv preprint arXiv:2601.08758, 2026

Juntao Jiang, Jiangning Zhang, Yali Bi, Jinsheng Bai, Weixuan Liu, Weiwei Jin, Zhucun Xue, Yong Liu, Xiaobin Hu, and Shuicheng Yan. M3CoTBench: Benchmark chain-of-thought of MLLMs in medical image understanding.arXiv preprint arXiv:2601.08758, 2026

work page arXiv 2026
[19]

What disease does this patient have? A large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14), 2021. ISSN 2076-3417. doi: 10.3390/app11146421. URLhttps://www.mdpi.com/2076-3417/11/14/6421

work page doi:10.3390/app11146421 2021
[20]

Mimic-cxr-jpg: Chest radiographs with structured labels.PhysioNet, 2019

Alistair Johnson, Matthew Lungren, Yifan Peng, Zhiyong Lu, Roger Mark, Seth Berkowitz, and Steven Horng. Mimic-cxr-jpg: Chest radiographs with structured labels.PhysioNet, 2019. doi: 10.13026/8360-t248

work page doi:10.13026/8360-t248 2019
[21]

MIMIC-IV.PhysioNet, October 2024

Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Brian Gow, Benjamin Moody, Steven Horng, Leo Anthony Celi, and Roger Mark. MIMIC-IV.PhysioNet, October 2024. doi: 10.13026/ kpb9-mt58. URLhttps://doi.org/10.13026/kpb9-mt58. Version 3.1

work page doi:10.13026/kpb9-mt58 2024
[22]

Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the eeg,

B. Kemp, A.H. Zwinderman, B. Tuk, H.A.C. Kamphuisen, and J.J.L. Oberye. Analysis of a sleep-dependent neuronal feedback loop: the slow-wave microcontinuity of the eeg.IEEE Transactions on Biomedical Engineering, 47(9):1185–1194, 2000. doi: 10.1109/10.867928. URLhttps://physionet.org/content/sleep-edfx/1.0.0/

work page doi:10.1109/10.867928 2000
[23]

Multimodal prompting with missing modalities for visual recognition

Yi-Lun Lee, Yi-Hsuan Tsai, Wei-Chen Chiu, and Chen-Yu Lee. Multimodal prompting with missing modalities for visual recognition. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

work page 2023
[24]

MIRA: Medical time series foundation model for real-world health data.arXiv preprint arXiv:2506.07584, 2025

Hao Li, Bowen Deng, Chang Xu, Zhiyuan Feng, Viktor Schlegel, Yu-Hao Huang, Yizheng Sun, Jingyuan Sun, Kailai Yang, Yiyao Yu, et al. MIRA: Medical time series foundation model for real-world health data.arXiv preprint arXiv:2506.07584, 2025. URL https: //arxiv.org/abs/2506.07584

work page arXiv 2025
[25]

MedGUIDE: Benchmarking clinical decision-making in large language models.arXiv preprint arXiv:2505.11613, 2025

Xiaomin Li, Mingye Gao, Yuexing Hao, Taoran Li, Guangya Wan, Zihan Wang, and Yijun Wang. MedGUIDE: Benchmarking clinical decision-making in large language models.arXiv preprint arXiv:2505.11613, 2025. 11

work page arXiv 2025
[26]

MULTIZOO & MULTIBENCH: A standardized toolkit for multimodal deep learning.Journal of Machine Learning Research, 24:1–7, 2023

Paul Pu Liang, Yiwei Lyu, Xiang Fan, Arav Agarwal, Yun Cheng, Louis-Philippe Morency, and Ruslan Salakhutdinov. MULTIZOO & MULTIBENCH: A standardized toolkit for multimodal deep learning.Journal of Machine Learning Research, 24:1–7, 2023

work page 2023
[27]

SMIL: Multimodal learning with severely missing modality

Mengmeng Ma, Jian Ren, Long Zhao, Sergey Tulyakov, Cathy Wu, and Xi Peng. SMIL: Multimodal learning with severely missing modality. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 2302–2310, 2021. doi: 10.1609/aaai.v35i3.16330. URLhttps://ojs.aaai.org/index.php/AAAI/article/view/16330

work page doi:10.1609/aaai.v35i3.16330 2021
[28]

Up-fall detection dataset: A multimodal approach

Lourdes Martínez-Villaseñor, Hiram Ponce, Jorge Brieva, Ernesto Moya-Albor, José Núñez- Martínez, and Carlos Peñafort-Asturiano. Up-fall detection dataset: A multimodal approach. Sensors, 19(9), 2019. ISSN 1424-8220. doi: 10.3390/s19091988. URL https://www.mdpi. com/1424-8220/19/9/1988

work page doi:10.3390/s19091988 2019
[29]

The CirCor DigiScope Phonocardiogram Dataset.PhysioNet, May 2022

Jorge Oliveira, Francesco Renna, Paulo Costa, Marcelo Nogueira, Ana Cristina Oliveira, Andoni Elola, Carlos Ferreira, Alipio Jorge, Ali Bahrami Rad, Matthew Reyna, Reza Sameni, Gari Clifford, and Miguel Coimbra. The CirCor DigiScope Phonocardiogram Dataset.PhysioNet, May 2022. doi: 10.13026/tshs-mw03. URL https://doi.org/10.13026/tshs-mw03. Version 1.0.3

work page doi:10.13026/tshs-mw03 2022
[30]

Maestro: Adaptive sparse attention and robust learning for multimodal dynamic time series

Akash Pandey Payal Mohapatra, Yueyuan Sui, Stephen Xia, and Qi Zhu. Maestro: Adaptive sparse attention and robust learning for multimodal dynamic time series. InNeurIPS, 2025

work page 2025
[31]

PPG-DaLiA

Attila Reiss, Ina Indlekofer, and Philip Schmidt. PPG-DaLiA. 2019. URL https://archive. ics.uci.edu/dataset/495/ppg+dalia. DOI: https://doi.org/10.24432/C53890

work page doi:10.24432/c53890 2019
[32]

Introducing wesad, a multimodal dataset for wearable stress and affect detection,

Philip Schmidt, Attila Reiss, Robert Duerichen, Claus Marberger, and Kristof Van Laerhoven. Introducing wesad, a multimodal dataset for wearable stress and affect detection. InProceedings of the 20th ACM International Conference on Multimodal Interaction, ICMI ’18, page 400–408, New York, NY , USA, 2018. Association for Computing Machinery. ISBN 978145035...

work page doi:10.1145/3242969.3242985 2018
[33]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[34]

DeepSOFA: a continuous acuity score for critically ill patients using clinically interpretable deep learning.Scientific Reports, 9(1):1879, 2019

Benjamin Shickel, Tyler J Loftus, Lasith Adhikari, Tezcan Ozrazgat-Baslanti, Azra Bihorac, and Parisa Rashidi. DeepSOFA: a continuous acuity score for critically ill patients using clinically interpretable deep learning.Scientific Reports, 9(1):1879, 2019. doi: 10.1038/ s41598-019-38491-0. URLhttps://pmc.ncbi.nlm.nih.gov/articles/PMC6372608/

work page 2019
[35]

Multi-time attention networks for irregularly sampled time series

Satya Narayan Shukla and Benjamin M Marlin. Multi-time attention networks for irregularly sampled time series. InInternational Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=4c0J6lwQ4_

work page 2021
[36]

Predicting in- hospital mortality of icu patients: The physionet/computing in cardiology challenge 2012

Ikaro Silva, George Moody, Daniel J Scott, Leo A Celi, and Roger G Mark. Predicting in- hospital mortality of icu patients: The physionet/computing in cardiology challenge 2012. In2012 Computing in Cardiology, pages 245–248, 2012. URL https://physionet.org/ content/challenge-2012/1.0.0/

work page 2012
[37]

Large language models encode clinical knowledge.Nature, 2023

Karan Singhal, Shekoofeh Azizi, Tao Tu, Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Perry Payne, Stephen Pfohl, Martin Seneviratne, Paul Gamble, Christopher Kelly, Abubakr Abdelrazig Hassan Babiker, Nathanael Schaerli, Aakanksha Chowdhery, Philip Mansfield, Dina Demner-Fushman, Blaise Aguera-Arcas, Dale Webst...

work page 2023
[38]

Integrated multimodal artificial intelligence framework for healthcare applications.NPJ Digital Medicine, 5(1):149, 2022

Luis R Soenksen, Yu Ma, Cynthia Zeng, Leonard Boussioux, Kimberly Villalobos Carballo, Liangyuan Na, Holly M Wiberg, Michael L Li, Ignacio Fuentes, and Dimitris Bertsimas. Integrated multimodal artificial intelligence framework for healthcare applications.NPJ Digital Medicine, 5(1):149, 2022. doi: 10.1038/s41746-022-00689-4. URL https://www.nature. com/ar...

work page doi:10.1038/s41746-022-00689-4 2022
[39]

Available: https://doi.org/10.1109/JBHI.2020.3022989

Nils Strodthoff, Patrick Wagner, Tobias Schaeffter, and Wojciech Samek. Deep learning for ecg analysis: Benchmarks and insights from ptb-xl.IEEE Journal of Biomedical and Health Informatics, 25(5):1519–1528, 2021. doi: 10.1109/JBHI.2020.3022989

work page doi:10.1109/jbhi.2020.3022989 2021
[40]

Csdi: Conditional score-based diffusion models for probabilistic time series imputation.Advances in neural information processing systems, 34:24804–24816, 2021

Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. Csdi: Conditional score-based diffusion models for probabilistic time series imputation.Advances in neural information processing systems, 34:24804–24816, 2021

work page 2021
[41]

NEJM AI1(3), 2300138 (2024) https://doi.org/10.1056/AIoa2300138 https://ai.nejm.org/doi/pdf/10.1056/AIoa2300138

Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, Anil Palepu, Basil Mustafa, Aakanksha Chowdhery, Yun Liu, Simon Kornblith, David Fleet, Philip Mansfield, Sushant Prakash, Renee Wong, Sunny Virmani, Christopher Semturs, S. Sara Mahdavi, Bradley Green, Ewa Dominow...

work page doi:10.1056/aioa2300138 2024
[42]

Real-time quality index to control data loss in real-life cardiac monitoring applications.Sensors, 21(16):5357,

Guillaume Vila, Clément Godin, Sylvie Charbonnier, and Aurélie Campagne. Real-time quality index to control data loss in real-life cardiac monitoring applications.Sensors, 21(16):5357,

work page
[43]

URLhttps://doi.org/10.3390/s21165357

doi: 10.3390/s21165357. URLhttps://doi.org/10.3390/s21165357

work page doi:10.3390/s21165357
[44]

2020 , url =

Patrick Wagner, Nils Strodthoff, Ralf-Dieter Bousseljot, Wojciech Samek, and Tobias Schaeffter. PTB-XL, a large publicly available electrocardiography dataset.PhysioNet, April 2020. doi: 10.13026/x4td-x982. URLhttps://doi.org/10.13026/x4td-x982. Version 1.0.1

work page doi:10.13026/x4td-x982 2020
[45]

Multi- modal learning with missing modality via shared-specific feature modelling

Hu Wang, Yuanhong Chen, Congbo Ma, Jodie Avery, Louise Hull, and Gustavo Carneiro. Multi- modal learning with missing modality via shared-specific feature modelling. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15878–15887, 2023

work page 2023
[46]

Learning fused pixel and feature-based view reconstructions for light fields,

Weiyao Wang, Du Tran, and Matt Feiszli. What makes training multi-modal classification networks hard? In2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12692–12702, 2020. doi: 10.1109/CVPR42600.2020.01271

work page doi:10.1109/cvpr42600.2020.01271 2020
[47]

Yuanlong Wang, Changchang Yin, and Ping Zhang. Multimodal risk prediction with physiological signals, medical images and clinical notes. Heliyon, 10(5):e26772, 2024. ISSN 2405-8440. doi: 10.1016/j.heliyon.2024.e26772

[48]

Kathryn Wantlin, Chenwei Wu, Shih-Cheng Huang, Oishi Banerjee, Farah Dadabhoy, Veeral Vipin Mehta, Ryan Wonhee Han, Fang Cao, Raja R. Narayan, Errol Colak, Adewole Adamson, Laura Heacock, Geoffrey H. Tison, Alex Tamkin, and Pranav Rajpurkar. BenchMD: A benchmark for modality-agnostic learning on medical images and sensors, 2023

[49]

Elisa Warner, Joonsang Lee, William Hsu, Tanveer Syeda-Mahmood, Charles E Kahn Jr, Olivier Gevaert, and Arvind Rao. Multimodal machine learning in image-based and clinical biomedicine: survey and prospects. International Journal of Computer Vision, 132(9):3753–3769, 2024. doi: 10.1007/s11263-024-02032-8. URL https://link.springer.com/article/10.1007/s1...

[50]

Wenfang Yao, Kejing Yin, William K Cheung, Jia Liu, and Jing Qin. DrFuse: Learning disentangled representation for clinical multi-modal fusion with missing modality and modal inconsistency. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 16416–16424, 2024. doi: 10.1609/aaai.v38i15.29578. URL https://ojs.aaai.org/index.ph...

[51]

Sukwon Yun, Inyoung Choi, Jie Peng, Yangfan Wu, Jingxuan Bao, Qiyiwen Zhang, Jiayi Xin, Qi Long, and Tianlong Chen. Flex-MoE: Modeling arbitrary modality combination via the flexible mixture-of-experts, 2024. URL https://arxiv.org/abs/2410.08245

[52]

Chaohe Zhang, Xu Chu, Liantao Ma, Yinghao Zhu, Yasha Wang, Jiangtao Wang, and Junfeng Zhao. M3Care: Learning with missing modalities in multimodal healthcare data. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’22, pages 2418–2428, 2022. doi: 10.1145/3534678.3539388

[53]

Xiang Zhang, Marko Zeman, Theodoros Tsiligkaridis, and Marinka Zitnik. Graph-guided network for irregularly sampled multivariate time series. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Kwm8I7dU-l5

[54]

Jianwei Zheng, Hangyuan Guo, and Huimin Chu. A large scale 12-lead electrocardiogram database for arrhythmia study. PhysioNet, August 2022. doi: 10.13026/wgex-er52. URL https://doi.org/10.13026/wgex-er52. Version 1.0.0.

A Broader Impacts

MuteBench provides practitioners with concrete, dataset-aware guidance for selecting multimodal fusion architectures tha...

[55]

Clinical time series (C = 30, T = 48): Vital signs and laboratory values aggregated into 1-hour bins over the first 48 hours of ICU admission. The 30 channels include heart rate, systolic/diastolic/mean arterial blood pressure, respiratory rate, body temperature, SpO2, and key biochemical markers such as glucose, creatinine, potassium, sodium, and bicarbonate.

[56]

Chest X-ray features (1024-D static vector): Visual embeddings pre-extracted from the most recent chest radiograph sourced from MIMIC-CXR-JPG [19], following the multimodal configuration of Han et al. [13], using a pretrained thoracic image encoder, capturing structural lung and cardiac pathology including effusions, cardiomegaly, and consolidations.

[57]

ECG features (256-D static vector): Temporal embeddings pre-extracted from the 12-lead ECG recording closest to ICU admission time, encoding arrhythmia and ischaemia patterns in a compact representation.

[58]

Clinical text features (768-D static vector): Semantic embeddings pre-extracted from clinical notes (nursing notes, discharge summaries) using a pretrained BERT-based clinical language model, encoding free-text observations not captured by structured variables. Modalities 2–4 are static vectors with no time axis and cannot be aligned with the hourly time series, placing this dataset in Type 3 (heterogeneous and unaligned).
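As a rough illustration of the layout described above, the following sketch builds one synthetic sample with the stated shapes and simulates modality-level missingness. The dictionary keys, the `drop_modalities` helper, and the zero-filling of an unavailable modality are all assumptions made for illustration; the source does not state how a missing modality is represented.

```python
import numpy as np

rng = np.random.default_rng(0)

# One synthetic sample with the shapes given in the text (values are random).
sample = {
    "clinical_ts": rng.normal(size=(30, 48)),  # C = 30 channels, T = 48 hourly bins
    "cxr": rng.normal(size=1024),              # chest X-ray embedding, static
    "ecg": rng.normal(size=256),               # ECG embedding, static
    "text": rng.normal(size=768),              # clinical text embedding, static
}


def drop_modalities(sample, missing):
    """Return (masked_sample, available) for a set of missing modality names.

    Assumption: an unavailable modality is zero-filled and flagged False,
    so a downstream model can tell "missing" apart from "observed zeros".
    """
    masked, available = {}, {}
    for name, values in sample.items():
        available[name] = name not in missing
        masked[name] = values.copy() if available[name] else np.zeros_like(values)
    return masked, available


masked, available = drop_modalities(sample, {"cxr", "text"})
```

Because the three static vectors have no time axis, only whole-modality removal applies to them in this sketch; within-modality (time-segment) masking would apply to `clinical_ts` alone.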
[59]

Compute the block length range: ℓmin = ⌈block_n · T⌉, ℓmax = ⌈block_n_max · T⌉. We use block_n = 0.05 and block_n_max = 0.10, so each block covers 5–10% of T.

[60]

Estimate the number of blocks required to cover fraction block_m of the sequence: k = (block_m · T) / ((ℓmin + ℓmax) / 2).

[61]

For each block, uniformly sample a start position and check for overlap with already-placed blocks. If an overlap is found, resample up to 64 times; if no valid position is found, stop placing further blocks for this channel.

[62]

Set mask[c, start:end] ← 0 for each placed block. Each channel uses an independent sub-generator: before iterating over channels, the shared rng draws one 64-bit seed per channel upfront, and each channel’s block placement proceeds from its own np.random.default_rng. This ensures that different channels miss different time windows while the entire per-sample p...
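The block-masking steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `block_missing_mask`, rounding k to the nearest integer (with a floor of 1), drawing each block length uniformly from [ℓmin, ℓmax], and keeping that length fixed while resampling the start position are all choices the source does not confirm. For T = 48 and a hypothetical block_m = 0.3, the formulas give ℓmin = ⌈2.4⌉ = 3, ℓmax = ⌈4.8⌉ = 5, and k = 14.4 / 4 = 3.6.

```python
import math

import numpy as np


def block_missing_mask(C, T, block_m, rng, block_n=0.05, block_n_max=0.10,
                       max_resamples=64):
    """Return a (C, T) mask where 1 = observed and 0 = missing."""
    l_min = math.ceil(block_n * T)
    l_max = math.ceil(block_n_max * T)
    # Assumption: k is rounded to the nearest integer, with at least one block.
    k = max(1, round(block_m * T / ((l_min + l_max) / 2)))

    mask = np.ones((C, T), dtype=np.int8)
    # One 64-bit seed per channel, drawn upfront from the shared generator.
    seeds = rng.integers(0, np.iinfo(np.uint64).max, size=C,
                         dtype=np.uint64, endpoint=True)

    for c in range(C):
        ch_rng = np.random.default_rng(int(seeds[c]))
        placed = []  # (start, end) pairs, end exclusive
        for _ in range(k):
            # Assumption: block length is uniform on [l_min, l_max].
            length = int(ch_rng.integers(l_min, l_max + 1))
            # One initial draw plus up to `max_resamples` redraws.
            for _ in range(1 + max_resamples):
                start = int(ch_rng.integers(0, T - length + 1))
                end = start + length
                if all(end <= s or start >= e for s, e in placed):
                    placed.append((start, end))
                    mask[c, start:end] = 0
                    break
            else:
                # No valid position found: stop placing blocks for this channel.
                break
    return mask


mask = block_missing_mask(C=30, T=48, block_m=0.3, rng=np.random.default_rng(0))
```

Seeding each channel from the shared generator keeps the whole mask reproducible from a single seed while letting channels lose different time windows, which matches the intent described in the last step.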
[63]

Institutional review board (IRB) approvals or equivalent for research with human subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

All datasets were collected under IRB approval or equivalent ethical review by their original data providers; this paper only reuses fully de-identified, publicly released data.

