pith. machine review for the scientific record.

arxiv: 2604.04614 · v2 · submitted 2026-04-06 · 💻 cs.LG · cs.AI

Recognition: unknown

A Clinical Point Cloud Paradigm for In-Hospital Mortality Prediction from Multi-Level Incomplete Multimodal EHRs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords clinical point cloud · incomplete EHR · multimodal data · mortality prediction · relational attention · healthcare risk prediction · self-supervised learning · missing modalities

The pith

Representing clinical events as points in 4D space enables accurate mortality prediction from incomplete multimodal EHRs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that current deep learning methods for EHR-based risk prediction break down when faced with real-world incompleteness such as irregular timing, absent data types, and few outcome labels. It proposes treating every clinical event as a point located by its content, timestamp, modality, and patient identifier, then modeling all pairwise interactions through a low-rank attention layer that scales efficiently across the four dimensions. A hierarchical sampling step keeps computation feasible while still allowing fine-grained self-supervision on unlabeled records and recovery of missing modalities. If the approach succeeds, hospitals could use far more of their existing incomplete data for predictions instead of discarding records or forcing artificial completeness.
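The point-cloud construction can be made concrete with a minimal sketch. Everything below is illustrative notation of ours, not the paper's encoders or coordinate scaling: each event is packed into one vector carrying a content embedding plus continuous time, modality, and case coordinates, so irregular gaps never need imputation into a grid.

```python
import numpy as np

# Hypothetical sketch of the 4D clinical point cloud. All names here are
# illustrative assumptions, not the paper's definitions.
def make_point(content_vec, t, modality_id, case_id):
    """Pack one clinical event into a single point vector."""
    return np.concatenate([content_vec, [t, modality_id, case_id]])

# Three irregularly timed events from two patients and two modalities.
events = [
    (np.array([0.2, -1.3]), 0.5, 0, 0),   # lab result, t = 0.5 h, case 0
    (np.array([1.1,  0.4]), 3.2, 1, 0),   # clinical note, t = 3.2 h, case 0
    (np.array([-0.7, 0.9]), 1.0, 0, 1),   # lab result, t = 1.0 h, case 1
]
cloud = np.stack([make_point(*e) for e in events])
print(cloud.shape)  # (3, 5): a set of points, no grid, so missing slots simply never exist
```

The set-of-points framing is what lets the model skip the alignment and imputation steps the paper criticizes.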

Core claim

HealthPoint (HP) is a unified paradigm that represents heterogeneous clinical events as points in a continuous 4D space defined by content, time, modality, and case. Low-Rank Relational Attention efficiently captures high-order dependencies across arbitrary point pairs in these dimensions. A hierarchical interaction and sampling strategy balances fine-grained modeling with computational cost. The resulting framework supports flexible event-level interaction, fine-grained self-supervision, robust modality recovery, and effective use of unlabeled data for in-hospital mortality prediction.

What carries the argument

The 4D clinical point cloud (content, time, modality, case) together with Low-Rank Relational Attention that models interactions between any pair of points while remaining computationally tractable.
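The paper's exact Low-Rank Relational Attention is not reproduced in this review. As a hedged stand-in, a generic low-rank attention (a Linformer-style projection of keys and values down to r landmarks) illustrates the complexity argument: the score matrix shrinks from n × n to n × r, so cost falls from O(n²) to O(n·r).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_attention(X, Wq, Wk, Wv, E, F):
    """Generic low-rank attention (our stand-in, not the paper's LRRA):
    project the n keys/values to r landmarks, giving n x r scores."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    K_r, V_r = E @ K, F @ V                      # (r, d) compressed keys/values
    scores = Q @ K_r.T / np.sqrt(Q.shape[-1])    # (n, r) instead of (n, n)
    return softmax(scores) @ V_r                 # (n, d), in O(n * r)

rng = np.random.default_rng(0)
n, d, r = 64, 8, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
E, F = rng.normal(size=(r, n)), rng.normal(size=(r, n))
out = low_rank_attention(X, Wq, Wk, Wv, E, F)
print(out.shape)  # (64, 8)
```

Whether the paper's variant factorizes this way across all four dimensions is exactly what the referee's first major comment asks to see ablated.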

If this is right

  • HP reaches state-of-the-art performance on large-scale EHR datasets for in-hospital mortality prediction.
  • The model maintains strong robustness when incompleteness varies across irregular sampling, missing modalities, and sparse labels.
  • Fine-grained self-supervision allows effective use of unlabeled records without requiring complete outcome data.
  • Low-rank attention enables modality recovery while avoiding rigid temporal or modal alignment steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same point-cloud construction could be tested on other clinical time-series tasks such as length-of-stay or readmission prediction where incompleteness is also common.
  • If the 4D representation generalizes, it may reduce reliance on separate imputation pipelines that currently dominate healthcare ML preprocessing.
  • Hospitals could experiment with deploying the model on streaming records that arrive with variable completeness rather than waiting for batch completion.

Load-bearing premise

Representing clinical events as points in a continuous 4D space and modeling their interactions with low-rank attention preserves raw clinical semantics without distortion under irregular sampling, missing modalities, and sparse labels.

What would settle it

Run HP and standard imputation or alignment baselines on the same large EHR dataset with artificially increased missingness rates; if HP's mortality prediction AUC falls below the baselines at high incompleteness levels, the claim that the 4D point representation avoids semantic distortion would be refuted.
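The shape of that falsification test can be mocked up on synthetic data. This is entirely illustrative: the linear scorer, the entry-wise masking scheme, and the thresholded labels are our assumptions, not the paper's setup; the point is only that sweeping the missingness rate and tracking AUC is cheap to operationalize.

```python
import numpy as np

def auc(y, scores):
    """Rank-based AUC: probability a random positive outscores a random negative."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(1)
n, d = 2000, 10
X = rng.normal(size=(n, d))
w = rng.normal(size=d)                                  # synthetic "risk" weights
y = (X @ w + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Mask feature entries at rate p before scoring: a representation that truly
# "avoids semantic distortion" should show a slow, graceful AUC decay.
aucs = {}
for p in (0.0, 0.3, 0.6):
    keep = rng.random(size=X.shape) >= p
    aucs[p] = auc(y, (X * keep) @ w)
print({p: round(a, 3) for p, a in aucs.items()})
```

Running HP and the imputation baselines through the same sweep, on real EHR cohorts, is the experiment that would settle the claim.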

Figures

Figures reproduced from arXiv: 2604.04614 by Bohao Li, Bowen Du, Junchen Ye, Tao Zou, Yan Gong.

Figure 1
Figure 1: Irregular sampling, missing modalities, and sparse labels jointly result in multi-level incomplete multimodal clinical data. HealthPoint addresses these…
Figure 2
Figure 2: The framework of HP.
Figure 3
Figure 3: Robustness analysis under varying missing rates.
Figure 4
Figure 4: Case study. Along the case dimension, the attention map highlights strong dependencies between Case 1 and Case 8, whose trajectories are semantically similar (High-risk → Intervention → Stabilization), suggesting that LRRL quantifies high-order patient-case similarity to leverage historical priors.
Figure 5
Figure 5: Effect of sampling interval: larger intervals reduce inference time (ms), but overly coarse sampling degrades AUPRC (%) and F1 (%), underscoring the need for fine-grained temporal modeling.
Original abstract

Deep learning-based modeling of multimodal Electronic Health Records (EHRs) has become an important approach for clinical diagnosis and risk prediction. However, due to diverse clinical workflows and privacy constraints, raw EHRs are inherently multi-level incomplete, including irregular sampling, missing modalities, and sparse labels. These issues cause temporal misalignment, modality imbalance, and limited supervision. Most existing multimodal methods assume relatively complete data, and even methods designed for incompleteness usually address only one or two of these issues in isolation. As a result, they often rely on rigid temporal/modal alignment or discard incomplete data, which may distort raw clinical semantics. To address this problem, we propose HealthPoint (HP), a unified clinical point cloud paradigm for multi-level incomplete EHRs. HP represents heterogeneous clinical events as points in a continuous 4D space defined by content, time, modality, and case. To model interactions between arbitrary point pairs, we introduce a Low-Rank Relational Attention mechanism that efficiently captures high-order dependencies across these four dimensions. We further develop a hierarchical interaction and sampling strategy to balance fine-grained modeling and computational efficiency. Built on this framework, HP enables flexible event-level interaction and fine-grained self-supervision, supporting robust modality recovery and effective use of unlabeled data. Experiments on large-scale EHR datasets for risk prediction show that HP consistently achieves state-of-the-art performance and strong robustness under varying degrees of incompleteness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes HealthPoint (HP), a unified clinical point cloud paradigm for in-hospital mortality prediction from multi-level incomplete multimodal EHRs. It represents heterogeneous clinical events as points in a continuous 4D space (content, time, modality, case), models pairwise interactions with a Low-Rank Relational Attention mechanism, and employs a hierarchical interaction and sampling strategy to support event-level modeling, modality recovery, and self-supervision on unlabeled data. Experiments on large-scale EHR datasets are reported to show state-of-the-art performance and robustness under varying degrees of incompleteness (irregular sampling, missing modalities, sparse labels).

Significance. If the empirical claims hold, the work would be significant for clinical machine learning by providing a single framework that simultaneously addresses multiple incompleteness issues common in real-world EHRs, avoiding the distortions from rigid alignments or data discarding seen in prior methods. The adaptation of point-cloud and low-rank attention ideas to clinical events, plus the emphasis on flexible self-supervision, represents a substantive modeling advance with potential for broader applicability in risk prediction tasks.

major comments (2)
  1. [§3.2] §3.2 (Low-Rank Relational Attention): The central robustness claim under high incompleteness rests on the low-rank factorization recovering high-order 4D dependencies even when many points are absent or irregularly spaced. No approximation-error bound or ablation isolating the rank choice versus full attention is provided, leaving open the possibility that the mechanism systematically loses sparse but clinically informative interactions precisely in the regimes the method targets.
  2. [§4.3] §4.3 (Ablation and incompleteness experiments): The hierarchical sampling strategy is asserted to balance fine-grained modeling with efficiency without discarding informative events, yet the reported ablations do not quantify information loss (e.g., via event-type retention rates or downstream performance drop when sampling is disabled) across the full range of incompleteness levels. This directly affects whether the 4D embedding preserves raw clinical semantics.
minor comments (2)
  1. [Abstract] The abstract and §4.1 refer to 'large-scale EHR datasets' without naming them or reporting cohort sizes and missingness statistics; adding these would strengthen reproducibility and allow readers to judge the severity of the incompleteness regimes tested.
  2. [§3.1] Notation for the 4D point coordinates (content, time, modality, case) is introduced descriptively but lacks an explicit embedding equation; a single displayed equation would clarify how discrete heterogeneous events are mapped into the continuous space.
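For concreteness, the displayed equation requested in the second minor comment might read as follows. This is our suggested notation, not the paper's definition: an event $i$ with raw content $x_i$, timestamp $t_i$, modality $m_i$, and case $s_i$ maps to a point

```latex
\mathbf{p}_i \;=\; \phi_{\mathrm{c}}(x_i) \,\oplus\, \phi_{\mathrm{t}}(t_i)
  \,\oplus\, \phi_{\mathrm{m}}(m_i) \,\oplus\, \phi_{\mathrm{s}}(s_i)
  \;\in\; \mathbb{R}^{\,d_{\mathrm{c}} + d_{\mathrm{t}} + d_{\mathrm{m}} + d_{\mathrm{s}}},
```

where $\phi_{\mathrm{c}}$ is a modality-specific content encoder, $\phi_{\mathrm{t}}$ a continuous (e.g. sinusoidal) time embedding, $\phi_{\mathrm{m}}$ and $\phi_{\mathrm{s}}$ learned lookup embeddings, and $\oplus$ denotes concatenation.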

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of the Low-Rank Relational Attention mechanism and the evaluation of the hierarchical sampling strategy. We address each point below and will revise the manuscript to incorporate additional analyses where feasible.

Point-by-point responses
  1. Referee: §3.2 (Low-Rank Relational Attention): The central robustness claim under high incompleteness rests on the low-rank factorization recovering high-order 4D dependencies even when many points are absent or irregularly spaced. No approximation-error bound or ablation isolating the rank choice versus full attention is provided, leaving open the possibility that the mechanism systematically loses sparse but clinically informative interactions precisely in the regimes the method targets.

    Authors: We agree that a theoretical approximation-error bound would further strengthen the robustness claims. Deriving a tight bound for high-order 4D dependencies under irregular sampling and missing points is non-trivial and beyond the current scope; we note this limitation explicitly in the revision. Empirically, the manuscript already includes performance comparisons showing that the low-rank mechanism achieves state-of-the-art results with substantially lower computational cost than full attention, and maintains accuracy across high incompleteness regimes. To directly address the concern about sparse interactions, we will add a targeted ablation in the revised version that (i) varies the rank hyperparameter and (ii) evaluates retention of performance on subsets containing rare but clinically critical events. This will isolate the effect of the factorization on sparse signals. revision: partial

  2. Referee: §4.3 (Ablation and incompleteness experiments): The hierarchical sampling strategy is asserted to balance fine-grained modeling with efficiency without discarding informative events, yet the reported ablations do not quantify information loss (e.g., via event-type retention rates or downstream performance drop when sampling is disabled) across the full range of incompleteness levels. This directly affects whether the 4D embedding preserves raw clinical semantics.

    Authors: We appreciate this suggestion for more granular evaluation. The hierarchical sampling is designed to retain events according to temporal density and modality relevance, but the current ablations primarily report end-to-end predictive metrics. In the revision we will add explicit quantification of information preservation: event-type retention rates (broken down by modality and clinical category) and the performance delta when the sampling module is disabled, evaluated across the full spectrum of incompleteness levels already tested in the paper. These additions will provide direct evidence that the 4D point-cloud representation maintains raw clinical semantics under the proposed sampling strategy. revision: yes
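The promised rank ablation has a cheap offline proxy. The truncated-SVD check below is our stand-in, not the authors' LRRL factorization: by the Eckart–Young theorem, the relative Frobenius error of the best rank-r approximation falls monotonically with r, so sweeping r on a dense attention-like matrix shows how much rank the interaction pattern actually needs.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
logits = rng.normal(size=(n, n))
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row-stochastic "attention"

# Best rank-r approximation via truncated SVD (Eckart-Young): the relative
# error measures how much of the pattern a rank-r factorization retains.
U, S, Vt = np.linalg.svd(A)
errs = {}
for r in (2, 8, 32):
    A_r = (U[:, :r] * S[:r]) @ Vt[:r]            # reassemble from top-r components
    errs[r] = np.linalg.norm(A - A_r) / np.linalg.norm(A)
print({r: round(e, 3) for r, e in errs.items()})
```

The clinically interesting version of this sweep, as the referee notes, is run on attention patterns restricted to rare but critical events rather than on random matrices.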

Circularity Check

0 steps flagged

No significant circularity; framework is an explicit modeling choice validated empirically

Full rationale

The paper introduces HealthPoint as a proposed paradigm that represents clinical events as 4D points and applies low-rank relational attention plus hierarchical sampling to handle multi-level incompleteness in EHRs. These elements are described as design decisions to enable flexible interactions and self-supervision, not as quantities derived from prior results or fitted parameters that are then relabeled as predictions. No equations or steps in the abstract reduce by construction to the inputs (e.g., no self-definitional mapping where the output is the fit itself, no uniqueness theorem imported from self-citation, and no renaming of known patterns as new unification). Performance claims rest on experimental results across datasets rather than tautological equivalence, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that clinical events can be faithfully embedded as points in a 4D continuous space and that low-rank attention can recover high-order dependencies across missing dimensions without external validation of that embedding.

axioms (1)
  • domain assumption Heterogeneous clinical events can be represented as points in a continuous 4D space defined by content, time, modality, and case without loss of clinical semantics.
    This is the foundational modeling decision stated in the abstract.
invented entities (2)
  • HealthPoint (HP) paradigm no independent evidence
    purpose: Unified clinical point cloud framework for multi-level incomplete multimodal EHRs
    Newly introduced modeling approach.
  • Low-Rank Relational Attention mechanism no independent evidence
    purpose: Efficiently capture high-order dependencies across the four dimensions for arbitrary point pairs
    Novel attention variant proposed in the paper.

pith-pipeline@v0.9.0 · 5562 in / 1253 out tokens · 56057 ms · 2026-05-10T19:41:02.682818+00:00 · methodology


Reference graph

Works this paper leans on

66 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    Mimic-iii, a freely accessible critical care database,

    A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, and R. G. Mark, “Mimic-iii, a freely accessible critical care database,”Scientific data, vol. 3, no. 1, pp. 1–9, 2016

  2. [2]

    Artificial intelligence- based methods for fusion of electronic health records and imaging data,

    F. Mohsen, H. Ali, N. El Hajj, and Z. Shah, “Artificial intelligence- based methods for fusion of electronic health records and imaging data,” Scientific Reports, vol. 12, no. 1, p. 17981, 2022

  3. [3]

    Multimodal pretraining of medical time series and notes,

    R. King, T. Yang, and B. J. Mortazavi, “Multimodal pretraining of medical time series and notes,” inMachine Learning for Health (ML4H). PMLR, 2023, pp. 244–255

  4. [4]

    The future of multimodal artificial intelligence models for integrating imaging and clinical metadata: a narrative review,

    B. D. Simon, K. B. Ozyoruk, D. G. Gelikman, S. A. Harmon, and B. T ¨urkbey, “The future of multimodal artificial intelligence models for integrating imaging and clinical metadata: a narrative review,” Diagnostic and Interventional Radiology, vol. 31, no. 4, p. 303, 2025

  5. [5]

    M3care: Learning with missing modalities in multimodal healthcare data,

    C. Zhang, X. Chu, L. Ma, Y . Zhu, Y . Wang, J. Wang, and J. Zhao, “M3care: Learning with missing modalities in multimodal healthcare data,” inProceedings of the 28th ACM SIGKDD Conference on Knowl- edge Discovery and Data Mining, 2022, pp. 2418–2428

  6. [6]

    Improving medical predictions by irregular multimodal electronic health records modeling,

    X. Zhang, S. Li, Z. Chen, X. Yan, and L. R. Petzold, “Improving medical predictions by irregular multimodal electronic health records modeling,” inInternational Conference on Machine Learning. PMLR, 2023, pp. 41 300–41 313

  7. [7]

    Prime: Pretraining for patient condition representation with irregular multimodal electronic health records,

    B. Li, B. Du, and J. Ye, “Prime: Pretraining for patient condition representation with irregular multimodal electronic health records,”ACM Transactions on Knowledge Discovery from Data, vol. 19, no. 7, pp. 1– 39, 2025

  8. [8]

    Multimodal missing data in healthcare: A comprehensive review and future directions,

    L. P. Le, T. Nguyen, M. A. Riegler, P. Halvorsen, and B. T. Nguyen, “Multimodal missing data in healthcare: A comprehensive review and future directions,”Computer Science Review, vol. 56, p. 100720, 2025

  9. [9]

    Hierarchical pretraining on multimodal electronic health records,

    X. Wang, J. Luo, J. Wang, Z. Yin, S. Cui, Y . Zhong, Y . Wang, and F. Ma, “Hierarchical pretraining on multimodal electronic health records,” inProceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, vol. 2023, 2023, p. 2839

  10. [10]

    Recurrent neural networks for multivariate time series with missing values,

    Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y . Liu, “Recurrent neural networks for multivariate time series with missing values,” Scientific reports, vol. 8, no. 1, p. 6085, 2018

  11. [11]

    Multimodal patient representation learning with missing modalities and labels,

    Z. Wu, A. Dadu, N. Tustison, B. Avants, M. Nalls, J. Sun, and F. Faghri, “Multimodal patient representation learning with missing modalities and labels,” inThe Twelfth International Conference on Learning Representations, 2024

  12. [12]

    Redcore: Rela- tive advantage aware cross-modal representation learning for missing modalities with imbalanced missing rates,

    J. Sun, X. Zhang, S. Han, Y .-P. Ruan, and T. Li, “Redcore: Rela- tive advantage aware cross-modal representation learning for missing modalities with imbalanced missing rates,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 13, 2024, pp. 15 173– 15 182

  13. [13]

    Diffmv: A unified diffusion framework for healthcare predictions with random missing views and view laziness,

    C. Zhao, H. Tang, H. Zhao, and X. Li, “Diffmv: A unified diffusion framework for healthcare predictions with random missing views and view laziness,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, 2025, pp. 3933–3944

  14. [14]

    Drfuse: Learning disentangled representation for clinical multi-modal fusion with missing modality and modal inconsistency,

    W. Yao, K. Yin, W. K. Cheung, J. Liu, and J. Qin, “Drfuse: Learning disentangled representation for clinical multi-modal fusion with missing modality and modal inconsistency,” inProceedings of the AAAI confer- ence on artificial intelligence, vol. 38, no. 15, 2024, pp. 16 416–16 424

  15. [15]

    Flex- care: Leveraging cross-task synergy for flexible multimodal healthcare prediction,

    M. Xu, Z. Zhu, Y . Li, S. Zheng, Y . Zhao, K. He, and Y . Zhao, “Flex- care: Leveraging cross-task synergy for flexible multimodal healthcare prediction,” inProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 3610–3620

  16. [16]

    Self-supervised multimodal learning: A survey,

    Y . Zong, O. Mac Aodha, and T. M. Hospedales, “Self-supervised multimodal learning: A survey,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 7, pp. 5299–5318, 2024

  17. [17]

    Vecocare: Visit sequences-clinical notes joint learning for diagnosis prediction in healthcare data

    Y . Xu, K. Yang, C. Zhang, P. Zou, Z. Wang, H. Ding, J. Zhao, Y . Wang, and B. Xie, “Vecocare: Visit sequences-clinical notes joint learning for diagnosis prediction in healthcare data.” inIJCAI, vol. 23, 2023, pp. 4921–4929

  18. [18]

    Learn- ing missing modal electronic health records with unified multi-modal data embedding and modality-aware attention,

    K. Lee, S. Lee, S. Hahn, H. Hyun, E. Choi, B. Ahn, and J. Lee, “Learn- ing missing modal electronic health records with unified multi-modal data embedding and modality-aware attention,” inMachine Learning for Healthcare Conference. PMLR, 2023, pp. 423–442

  19. [19]

    The false hope of current approaches to explainable artificial intelligence in health care,

    M. Ghassemi, L. Oakden-Rayner, and A. L. Beam, “The false hope of current approaches to explainable artificial intelligence in health care,” The lancet digital health, vol. 3, no. 11, pp. e745–e750, 2021

  20. [20]

    Learning the natural history IEEE TRANSACTIONS ON MULTIMEDIA, UNDER REVIEW 10 of human disease with generative transformers,

    A. Shmatko, A. W. Jung, K. Gaurav, S. Brunak, L. H. Mortensen, E. Birney, T. Fitzgerald, and M. Gerstung, “Learning the natural history IEEE TRANSACTIONS ON MULTIMEDIA, UNDER REVIEW 10 of human disease with generative transformers,”Nature, vol. 647, no. 8088, pp. 248–256, 2025

  21. [21]

    Large language models forecast patient health trajectories enabling digital twins,

    N. Makarov, M. Bordukova, P. Quengdaeng, D. Garger, R. Rodriguez- Esteban, F. Schmich, and M. P. Menden, “Large language models forecast patient health trajectories enabling digital twins,”npj Digital Medicine, vol. 8, no. 1, p. 588, 2025

  22. [22]

    Pointnet++: Deep hierarchical feature learning on point sets in a metric space,

    C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,”Advances in neural information processing systems, vol. 30, 2017

  23. [23]

    Point transformer,

    H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V . Koltun, “Point transformer,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 16 259–16 268

  24. [24]

    Multilayer feedforward networks are universal approximators,

    K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,”Neural networks, vol. 2, no. 5, pp. 359–366, 1989

  25. [25]

    Wehbe and Faraz S

    Y . Li, R. M. Wehbe, F. S. Ahmad, H. Wang, and Y . Luo, “Clinical- longformer and clinical-bigbird: Transformers for long clinical se- quences,”arXiv preprint arXiv:2201.11838, 2022

  26. [26]

    On the limits of cross-domain generalization in automated x-ray prediction,

    J. P. Cohen, M. Hashir, R. Brooks, and H. Bertrand, “On the limits of cross-domain generalization in automated x-ray prediction,” inMedical Imaging with Deep Learning, 2020. [Online]. Available: https://arxiv.org/abs/2002.02497

  27. [27]

    Warpformer: A multi-scale modeling approach for irregular clinical time series,

    J. Zhang, S. Zheng, W. Cao, J. Bian, and J. Li, “Warpformer: A multi-scale modeling approach for irregular clinical time series,” in Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 3273–3285

  28. [28]

    Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

    K. Cho, B. Van Merri ¨enboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y . Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,”arXiv preprint arXiv:1406.1078, 2014

  29. [29]

    Tensor decompositions and applications,

    T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, pp. 455–500, 2009

  30. [30]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in neural information processing systems, vol. 30, 2017

  31. [31]

    Multimodal machine learning: A survey and taxonomy,

    T. Baltru ˇsaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine learning: A survey and taxonomy,”IEEE transactions on pattern anal- ysis and machine intelligence, vol. 41, no. 2, pp. 423–443, 2018

  32. [32]

    A simple framework for contrastive learning of visual representations,

    T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” inInternational conference on machine learning. PmLR, 2020, pp. 1597–1607

  33. [33]

    A mathematical theory of communication,

    C. E. Shannon, “A mathematical theory of communication,”The Bell system technical journal, vol. 27, no. 3, pp. 379–423, 1948

  34. [34]

    Learning Confidence for Out -of-Distribution Detection in Neural Networks,

    T. DeVries and G. W. Taylor, “Learning confidence for out- of-distribution detection in neural networks,”arXiv preprint arXiv:1802.04865, 2018

  35. [35]

    Mimic- iv, a freely accessible electronic health record dataset,

    A. E. Johnson, L. Bulgarelli, L. Shen, A. Gayles, A. Shammout, S. Horng, T. J. Pollard, S. Hao, B. Moody, B. Gowet al., “Mimic- iv, a freely accessible electronic health record dataset,”Scientific data, vol. 10, no. 1, p. 1, 2023

  36. [36]

    Multitask learning and benchmarking with clinical time series data,

    H. Harutyunyan, H. Khachatrian, D. C. Kale, G. Ver Steeg, and A. Galstyan, “Multitask learning and benchmarking with clinical time series data,”Scientific data, vol. 6, no. 1, p. 96, 2019

  37. [37]

    Heart: Learning better representation of ehr data with a heterogeneous relation-aware transformer,

    T. Huang, S. A. Rizvi, R. K. Thakur, V . Socrates, M. Gupta, D. van Dijk, R. A. Taylor, and R. Ying, “Heart: Learning better representation of ehr data with a heterogeneous relation-aware transformer,”Journal of Biomedical Informatics, vol. 159, p. 104741, 2024

  38. [38]

    Multi-task heterogeneous graph learning on electronic health records,

    T. H. Chan, G. Yin, K. Bae, and L. Yu, “Multi-task heterogeneous graph learning on electronic health records,”Neural Networks, vol. 180, p. 106644, 2024

  39. [39]

    Towards robust multimodal representation: A unified approach with adaptive experts and alignment,

    N. Moradinasab, S. Sengupta, J. Liu, S. Syed, and D. E. Brown, “Towards robust multimodal representation: A unified approach with adaptive experts and alignment,”arXiv preprint arXiv:2503.09498, 2025

  40. [40]

    Ehrcon: Dataset for checking consistency between unstructured notes and structured tables in electronic health records,

    Y . Kwon, J. Kim, G. Lee, S. Bae, D. Kyung, W. Cha, T. Pollard, A. Johnson, and E. Choi, “Ehrcon: Dataset for checking consistency between unstructured notes and structured tables in electronic health records,”Advances in Neural Information Processing Systems, vol. 37, pp. 89 334–89 345, 2024

  41. [41]

    Decoupled weight decay regularization,

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” inInternational Conference on Learning Representations, 2019

  42. [42]

    Multimodal transformer for unaligned multimodal language sequences,

    Y .-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” inProceedings of the 57th annual meeting of the association for computational linguistics, 2019, pp. 6558–6569

  43. [43]

    Con- trastive learning of medical visual representations from paired images and text,

    Y . Zhang, H. Jiang, Y . Miura, C. D. Manning, and C. P. Langlotz, “Con- trastive learning of medical visual representations from paired images and text,” inMachine learning for healthcare conference. PMLR, 2022, pp. 2–25

  44. [44]

    Moe-health: A mixture of experts framework for robust multimodal healthcare prediction,

    X. Wang and C. Yang, “Moe-health: A mixture of experts framework for robust multimodal healthcare prediction,” inProceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 2025, pp. 1–9

  45. [45]

    Large language models encode clinical knowledge,

    K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohlet al., “Large language models encode clinical knowledge,”Nature, vol. 620, no. 7972, pp. 172– 180, 2023

  46. [46]

    Towards generalist biomedical ai,

    T. Tu, S. Azizi, D. Driess, M. Schaekermann, M. Amin, P.-C. Chang, A. Carroll, C. Lau, R. Tanno, I. Ktenaet al., “Towards generalist biomedical ai,”Nejm Ai, vol. 1, no. 3, p. AIoa2300138, 2024

  47. [47]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day,

    C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, “Llava-med: Training a large language-and-vision assistant for biomedicine in one day,”Advances in Neural Information Processing Systems, vol. 36, pp. 28 541–28 564, 2023

  48. [48]

    Building a knowledge graph to enable precision medicine,

    P. Chandak, K. Huang, and M. Zitnik, “Building a knowledge graph to enable precision medicine,”Scientific Data, vol. 10, no. 1, p. 67, 2023

  49. [49]

    Kerprint: local-global knowledge graph enhanced diagnosis prediction for retrospective and prospective interpretations,

    K. Yang, Y . Xu, P. Zou, H. Ding, J. Zhao, Y . Wang, and B. Xie, “Kerprint: local-global knowledge graph enhanced diagnosis prediction for retrospective and prospective interpretations,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 4, 2023, pp. 5357–5365

  50. [50]

    From observation to concept: A flexible multi-view paradigm for medical report generation,

    Z. Liu, Z. Zhu, S. Zheng, Y . Zhao, K. He, and Y . Zhao, “From observation to concept: A flexible multi-view paradigm for medical report generation,”IEEE Transactions on Multimedia, vol. 26, pp. 5987– 5995, 2023

  51. [51]

    Y. Chen, K. Ren, Y. Wang, Y. Fang, W. Sun, and D. Li, “ContiFormer: Continuous-time transformer for irregular time series modeling,” Advances in Neural Information Processing Systems, vol. 36, 2024.

  52. [52]

    Z. Song, Q. Lu, H. Zhu, D. Buckeridge, and Y. Li, “TrajGPT: Irregular time-series representation learning of health trajectory,” IEEE Journal of Biomedical and Health Informatics, 2025.

  53. [53]

    H. Karami, D. Atienza, and A. Ionescu, “TEE4EHR: Transformer event encoder for better representation learning in electronic health records,” Artificial Intelligence in Medicine, vol. 154, p. 102903, 2024.

  54. [54]

    L. N. Zheng, Z. Li, C. G. Dong, W. E. Zhang, L. Yue, M. Xu, O. Maennel, and W. Chen, “Irregularity-informed time series analysis: Adaptive modelling of spatial and temporal dynamics,” in Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 2024, pp. 3405–3414.

  55. [55]

    X. Zhang, M. Zeman, T. Tsiligkaridis, and M. Zitnik, “Graph-guided network for irregularly sampled multivariate time series,” arXiv preprint arXiv:2110.05357, 2021.

  56. [56]

    F. Wang, F. Wu, Y. Tang, and L. Yu, “CTPD: Cross-modal temporal pattern discovery for enhanced multimodal electronic health records analysis,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 6783–6799.

  57. [57]

    K. R. Park, H. J. Lee, and J. U. Kim, “Learning trimodal relation for audio-visual question answering with missing modality,” in European Conference on Computer Vision. Springer, 2024, pp. 42–59.

  58. [58]

    Z. Zhi, Z. Liu, M. Elbadawi, A. Daneshmend, M. Orlu, A. Basit, A. Demosthenous, and M. Rodrigues, “Borrowing treasures from neighbors: In-context learning for multimodal learning with missing modalities and data scarcity,” Neurocomputing, p. 130502, 2025.

  59. [59]

    J. Lang, R. Hong, Z. Cheng, T. Zhong, Y. Wang, and F. Zhou, “Redeeming modality information loss: Retrieval-guided conditional generation for severely modality missing learning,” in Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, 2025, pp. 1241–1252.

  60. [60]

    B. Li, B. Du, and J. Ye, “Learning multimodal representations for incomplete EHRs with retrieval-augmented personalized modality recovery,” Information Fusion, p. 104347, 2026.

  61. [61]

    K. Zhang, Y. Yang, J. Yu, H. Jiang, J. Fan, Q. Huang, and W. Han, “Multi-task paired masking with alignment modeling for medical vision-language pre-training,” IEEE Transactions on Multimedia, vol. 26, pp. 4706–4721, 2023.

  62. [62]

    E. Choi, C. Xiao, W. Stewart, and J. Sun, “MiME: Multilevel medical embedding of electronic health records for predictive healthcare,” Advances in Neural Information Processing Systems, vol. 31, 2018.

  63. [63]

    S. Khadanga, K. Aggarwal, S. Joty, and J. Srivastava, “Using clinical notes with time series data for ICU management,” arXiv preprint arXiv:1909.09702, 2019.

IEEE TRANSACTIONS ON MULTIMEDIA, UNDER REVIEW

APPENDIX A
THEORETICAL JUSTIFICATION OF LOW-RANK COUPLING

In this section, we show that the proposed Low-Rank Coupling (Eq. 4) is a CP-based low-rank appro...
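To make the CP-based reading of the coupling concrete, the sketch below shows what a rank-R CP (CANDECOMP/PARAFAC) factorization of a 4-way interaction looks like over the content, time, modality, and case dimensions. Eq. 4 itself is not reproduced in this excerpt, so the factor matrices `U_c`, `U_t`, `U_m`, `U_p`, the embedding sizes, and the scoring function are illustrative assumptions, not the paper's actual parameterization; only the rank R = 8 matches the reported hyperparameter choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding sizes for the four point dimensions
# (content, time, modality, case); not taken from the paper.
d_c, d_t, d_m, d_p = 32, 16, 8, 8
R = 8  # CP rank; the paper selects R = 8 for both datasets

# One factor matrix per dimension, R components each.
U_c = rng.standard_normal((R, d_c))
U_t = rng.standard_normal((R, d_t))
U_m = rng.standard_normal((R, d_m))
U_p = rng.standard_normal((R, d_p))

def coupling_score(c, t, m, p):
    """Rank-R coupling: sum_r <u_r,c> <v_r,t> <w_r,m> <z_r,p>.

    Equivalent to contracting the CP tensor
    W = sum_r u_r (x) v_r (x) w_r (x) z_r with (c, t, m, p),
    at O(R * (d_c + d_t + d_m + d_p)) cost instead of the
    O(d_c * d_t * d_m * d_p) cost of a dense 4-way tensor."""
    return float(np.sum((U_c @ c) * (U_t @ t) * (U_m @ m) * (U_p @ p)))

# Sanity check against the explicit dense tensor contraction.
c = rng.standard_normal(d_c)
t = rng.standard_normal(d_t)
m = rng.standard_normal(d_m)
p = rng.standard_normal(d_p)
W = np.einsum('ra,rb,rc,rd->abcd', U_c, U_t, U_m, U_p)
dense = float(np.einsum('abcd,a,b,c,d->', W, c, t, m, p))
assert np.isclose(coupling_score(c, t, m, p), dense, rtol=1e-6, atol=1e-6)
```

The factored form never materializes the dense tensor `W` (built here only to verify correctness), which is what makes pairwise interaction across all four dimensions tractable at scale.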

Rank R. We vary R ∈ {4, 8, 16}, and the results are shown in Table XVI. Based on the overall performance, we select R = 8 for both datasets.

Sampling Intervals. We evaluate different sampling interval settings for each dataset, and the results are reported in Table XVII. For MIMIC-III, we select 1-4|4-12; for MIMIC-IV, we select 1-4|-|12-12.

Loss Weights (λa and λr). We further study the effects of the loss weights for Fine-grained Alignment and Fine-grained Reconstruction. The results are reported in Table XVIII and Table XIX. According to the overall performance, we use λa = 0.002 and λr = 10 for MIMIC-III, and λa = 0.0001 and λr = 5 for MIMIC-IV.

TABLE XV ABLATION STUDY ON ENTROPY-BASED INFE...