arxiv: 2503.14998 · v3 · submitted 2025-03-19 · 💻 cs.CV

Tables Guide Vision: Learning to See the Heart through Tabular Data

Pith reviewed 2026-05-22 23:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords contrastive learning · tabular data · cardiac MR · representation learning · zero-shot prediction · medical imaging · patient similarity

The pith

Tabular data guides contrastive learning to build stronger visual representations from cardiac MR images without joint embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a contrastive learning approach that uses tabular clinical data to identify similar patients and form positive pairs, instead of depending only on image augmentations or multimodal embeddings. This targets the issue of false negatives where semantically related samples are incorrectly treated as negatives, a problem acute in cardiology because demographic and clinical factors shape disease interpretation. The method constructs representations from short-axis cardiac MR images alone while still incorporating tabular guidance for pair selection, then adapts k-NN for zero-shot prediction. Downstream evaluations on fine-tuning, linear probing, and zero-shot tasks for cardiovascular artery diseases and cardiac phenotypes show gains over image-only and combined-embedding baselines. The approach also transfers to a natural-image car advertisement dataset.

Core claim

The tabular-guided contrastive learning framework leverages clinically relevant tabular data to identify patient-level similarities and construct more meaningful pairs, enabling semantically aligned representation learning without requiring joint embeddings across modalities. Evaluation on a large cohort of cardiac MR images shows that incorporating tabular data guidance yields stronger visual representations than conventional methods that rely solely on image augmentation or combined image-tabular embeddings, with corresponding improvements in fine-tuning, linear probing, and zero-shot prediction of cardiovascular artery diseases and cardiac phenotypes.

What carries the argument

Tabular-guided contrastive learning framework that uses tabular data to identify patient-level similarities and construct positive pairs for representation learning.
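
The pair-construction step can be sketched as follows. This is a minimal illustration, not the authors' procedure: the z-scoring, the cosine similarity, the fixed threshold, and the name `tabular_positive_mask` are all assumptions made for the example, since this page does not state how the paper measures tabular similarity.

```python
import numpy as np

def tabular_positive_mask(tabular, threshold=0.9):
    """Return a boolean (n, n) mask marking pairs of patients treated as positives.

    Assumptions (not from the paper): columns are z-scored, similarity is
    cosine similarity, and a fixed threshold decides which pairs are positives.
    """
    x = np.asarray(tabular, dtype=float)
    # Standardize each column so no attribute dominates because of its units.
    std = x.std(axis=0)
    std[std == 0] = 1.0
    z = (x - x.mean(axis=0)) / std
    # Cosine similarity between every pair of rows.
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = z / norms
    sim = unit @ unit.T
    mask = sim >= threshold
    # A sample is not its own tabular positive; augmented views cover that case.
    np.fill_diagonal(mask, False)
    return mask

# Toy cohort with made-up values: age (years), height (cm), LVEF (%).
cohort = np.array([
    [45.0, 170.0, 60.0],
    [46.0, 171.0, 61.0],
    [80.0, 150.0, 30.0],
    [79.0, 151.0, 31.0],
])
mask = tabular_positive_mask(cohort, threshold=0.9)
print(mask.astype(int))
```

In this toy cohort the first two patients form one positive pair and the last two form another, so a contrastive loss using the mask would no longer push their image embeddings apart.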

If this is right

  • Tabular data enables more effective distinction between patient subgroups in cardiac MR images.
  • The representations improve results on fine-tuning, linear probing, and zero-shot prediction of cardiovascular artery diseases and cardiac phenotypes.
  • The framework generalizes beyond medical images, as demonstrated on a car advertisement dataset.
  • Adapting k-NN to the learned representations supports zero-shot prediction without multimodal training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tabular-pairing idea could be tested on other imaging modalities where metadata such as age or scanner type correlates with visual appearance.
  • By sidestepping joint embeddings the method may allow pretraining on larger image-only archives that still carry auxiliary tabular records.
  • If the tabular similarity signal is weak or noisy the performance edge may shrink, suggesting a natural ablation that removes or perturbs the tabular component.
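
One way to set up the perturbation ablation suggested in the last bullet is to shuffle the tabular rows, so that each image is paired with another patient's record, and then check how far the resulting positive sets drift from the real ones. The sketch below is an assumption-laden illustration: it defines positives as k nearest neighbours in tabular space, uses synthetic data, and every function name is hypothetical.

```python
import numpy as np

def knn_positive_sets(tabular, k):
    """For each row, return the set of its k nearest rows (Euclidean), excluding itself."""
    x = np.asarray(tabular, dtype=float)
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a row is never its own neighbour
    idx = np.argsort(dist, axis=1)[:, :k]
    return [set(row.tolist()) for row in idx]

def shuffled_control(tabular, seed=0):
    """Permute rows so that image i is matched with a different patient's record."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(tabular))
    return np.asarray(tabular)[perm], perm

def mean_jaccard(sets_a, sets_b):
    """Average Jaccard overlap between two lists of positive sets."""
    scores = [len(a & b) / len(a | b) for a, b in zip(sets_a, sets_b)]
    return float(np.mean(scores))

rng = np.random.default_rng(0)
table = rng.normal(size=(200, 5))  # synthetic stand-in for clinical attributes
real_sets = knn_positive_sets(table, k=5)
control_table, perm = shuffled_control(table, seed=1)
control_sets = knn_positive_sets(control_table, k=5)
overlap = mean_jaccard(real_sets, control_sets)
print(f"mean Jaccard overlap of positive sets: {overlap:.3f}")
```

A low overlap confirms that the control really breaks the image-to-record correspondence. Pretraining once with each set of positives would then show whether the reported gains depend on the tabular signal being correct.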

Load-bearing premise

Clinically relevant tabular data can reliably identify patient-level similarities that align with semantically meaningful visual features in the cardiac MR images.

What would settle it

A controlled experiment on the same cardiac MR cohort in which tabular-guided training shows no gain or a loss relative to standard image-augmentation contrastive learning on all downstream tasks.

Figures

Figures reproduced from arXiv: 2503.14998 by Julia Schnabel, Keno Bressem, Marta Hasny, Maxime Di Folco.

Figure 1: Contrastive learning methods typically assume a one ...
Figure 2: Comparison of the proposed tabular guidance (TGV: Tables Guide Vision) approach against other contrastive learning approaches.
Figure 3: Performance of the models under different amounts of training data on LVEF prediction (left, lower is better) and multilabel CAD ...
Figure 4: t-SNE visualization of the feature embedding generated with TGV. Sex, LVEF, and LVEDV have been included as attributes for ...
Figure 5: Performance of all evaluated methods on LVEF prediction and CAD classification using zero-shot prediction, and linear probing ...
Figure 6: Zero-shot CAD classification and LVEF prediction performance as a function of the percentage of samples used for the mean ...
Figure 7: t-SNE visualization of sex, LVEDV, LVEF, and height for a) MMCL ...
original abstract

Contrastive learning methods in computer vision typically rely on augmented views of the same image or multimodal pretraining strategies that align paired modalities. However, these approaches often overlook semantic relationships between distinct instances, leading to false negatives when semantically similar samples are treated as negatives. This limitation is especially critical in medical imaging domains such as cardiology, where demographic and clinical attributes play a critical role in assessing disease risk and patient outcomes. We introduce a tabular-guided contrastive learning framework that leverages clinically relevant tabular data to identify patient-level similarities and construct more meaningful pairs, enabling semantically aligned representation learning without requiring joint embeddings across modalities. Additionally, we adapt the k-NN algorithm for zero-shot prediction to overcome the lack of zero-shot capability in unimodal representations. We demonstrate the strength of our methods using a large cohort of short-axis cardiac MR images and clinical attributes, where tabular data helps to more effectively distinguish between patient subgroups. Evaluation on downstream tasks, including fine-tuning, linear probing, and zero-shot prediction of cardiovascular artery diseases and cardiac phenotypes, shows that incorporating tabular data guidance yields stronger visual representations than conventional methods that rely solely on image augmentation or combined image-tabular embeddings. Further, we show that our method can generalize to natural images by evaluating it on a car advertisement dataset. Code is available at https://github.com/marteczkah/tables_guide_vision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a tabular-guided contrastive learning framework for short-axis cardiac MR images that selects positive pairs using patient similarities computed from clinical tabular data (demographics and attributes) rather than image augmentations alone. This is claimed to reduce false negatives and yield stronger unimodal visual representations without joint multimodal embeddings. An adapted k-NN is used for zero-shot prediction. Downstream evaluations on fine-tuning, linear probing, and zero-shot tasks for cardiovascular artery disease and cardiac phenotypes show gains over image-only contrastive methods and combined image-tabular embeddings; generalization is demonstrated on a car advertisement dataset. Code is released.

Significance. If the central claim holds, the work offers a practical way to exploit readily available tabular clinical data to improve self-supervised visual representations in medical imaging, avoiding the need for paired multimodal training. The public code supports reproducibility. Significance is tempered by the need to confirm that tabular proximity reliably proxies visual semantic similarity rather than providing an auxiliary signal through other mechanisms.

major comments (3)
  1. [Abstract / Method] Abstract and method description: the claim that tabular proximity identifies 'semantically aligned' pairs (reducing false negatives due to visual similarity) is load-bearing, yet the manuscript provides no direct validation such as visual inspection of selected pairs, feature-space overlap metrics, or ablation comparing tabular-selected pairs against randomly selected pairs with matched tabular distance.
  2. [Experiments] Experiments section (downstream task results): reported gains on fine-tuning, linear probing, and zero-shot tasks are consistent with the claim but do not isolate whether improvements arise from semantic visual alignment versus any auxiliary tabular signal reducing false-negative rate; no control experiment (e.g., random tabular labels or non-clinical tabular features) is described.
  3. [Zero-shot prediction] Zero-shot prediction subsection: the k-NN adaptation is presented as enabling zero-shot capability for unimodal representations, but the description does not specify whether tabular data is required at query time or how the distance metric is defined, leaving unclear whether the method remains strictly unimodal.
minor comments (2)
  1. [Method] Notation for the tabular similarity function and contrastive loss weighting is introduced without an explicit equation or pseudocode, making the exact pair-construction procedure difficult to replicate from text alone.
  2. [Figures] Figure captions for qualitative results on cardiac MR and the car dataset could more explicitly state whether displayed pairs were selected by the tabular method or by baselines.
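
For reference, the kind of explicit equation the first minor comment asks for could take the following generic form. This is a multi-positive contrastive loss in the style of supervised contrastive learning, with the positive set defined by a tabular similarity threshold. All symbols here are illustrative assumptions, and the paper's actual similarity function and weighting may differ.

```latex
% Illustrative notation only; not taken from the paper.
% z_i: L2-normalized image embedding of sample i
% t_i: tabular attribute vector of sample i
% s(.,.): tabular similarity, \delta: similarity threshold, \tau: temperature
P(i) = \{\, j \neq i : s(t_i, t_j) \ge \delta \,\}
\qquad
\mathcal{L}_i = -\frac{1}{|P(i)|} \sum_{j \in P(i)}
  \log \frac{\exp\left(z_i^\top z_j / \tau\right)}
            {\sum_{k \neq i} \exp\left(z_i^\top z_k / \tau\right)}
```

Under this form, samples in $P(i)$ move from the denominator-only role of negatives into the numerator, which is the mechanism by which tabular guidance would reduce false negatives.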

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, agreeing where revisions are needed to strengthen the claims.

point-by-point responses
  1. Referee: [Abstract / Method] Abstract and method description: the claim that tabular proximity identifies 'semantically aligned' pairs (reducing false negatives due to visual similarity) is load-bearing, yet the manuscript provides no direct validation such as visual inspection of selected pairs, feature-space overlap metrics, or ablation comparing tabular-selected pairs against randomly selected pairs with matched tabular distance.

    Authors: We agree that direct validation of semantic alignment would strengthen the paper. In the revised manuscript, we will add visual inspection of selected pairs, feature-space overlap metrics, and an ablation comparing tabular-selected pairs to randomly selected pairs with matched tabular distance. revision: yes

  2. Referee: [Experiments] Experiments section (downstream task results): reported gains on fine-tuning, linear probing, and zero-shot tasks are consistent with the claim but do not isolate whether improvements arise from semantic visual alignment versus any auxiliary tabular signal reducing false-negative rate; no control experiment (e.g., random tabular labels or non-clinical tabular features) is described.

    Authors: We acknowledge that control experiments would better isolate the effect. We will add experiments using random tabular labels and non-clinical features to the revised experiments section. revision: yes

  3. Referee: [Zero-shot prediction] Zero-shot prediction subsection: the k-NN adaptation is presented as enabling zero-shot capability for unimodal representations, but the description does not specify whether tabular data is required at query time or how the distance metric is defined, leaving unclear whether the method remains strictly unimodal.

    Authors: The adapted k-NN uses only the learned visual embeddings at query time with Euclidean distance in embedding space and does not require tabular data during inference. We will revise the subsection to state this explicitly. revision: yes
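
The inference path described in the third response can be sketched as below. Only the use of Euclidean distance on image embeddings comes from the response above; the labelled representative set, the plain neighbour mean, the binary vote, and the name `knn_zero_shot` are assumptions for illustration, and the paper's adapted k-NN may aggregate neighbours differently.

```python
import numpy as np

def knn_zero_shot(query_emb, ref_emb, ref_targets, k=5, task="regression"):
    """Predict targets for query embeddings from their k nearest reference embeddings.

    No tabular data is used here: both inputs are image embeddings, and
    ref_targets holds the known value or 0/1 label of each reference sample.
    """
    q = np.asarray(query_emb, dtype=float)
    r = np.asarray(ref_emb, dtype=float)
    y = np.asarray(ref_targets)
    # Euclidean distance from every query to every reference embedding.
    dist = np.linalg.norm(q[:, None, :] - r[None, :, :], axis=-1)
    nearest = np.argsort(dist, axis=1)[:, :k]
    neighbor_targets = y[nearest]  # shape (n_query, k)
    if task == "regression":
        return neighbor_targets.mean(axis=1)
    # Binary classification: majority vote over 0/1 labels.
    return (neighbor_targets.mean(axis=1) >= 0.5).astype(int)

# Toy 2-D embeddings with made-up LVEF values (%) and CAD labels.
ref_emb = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
ref_lvef = np.array([60.0, 62.0, 58.0, 30.0, 32.0, 28.0])
ref_cad = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[0.2, 0.2], [10.5, 10.5]])

lvef_pred = knn_zero_shot(queries, ref_emb, ref_lvef, k=3, task="regression")
cad_pred = knn_zero_shot(queries, ref_emb, ref_cad, k=3, task="classification")
print(lvef_pred, cad_pred)
```

Because the reference targets are only looked up and never used to update the encoder, the image model stays frozen and unimodal at query time, which is the property the referee asked the authors to make explicit.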

Circularity Check

0 steps flagged

No significant circularity; framework is an independent design choice validated empirically

full rationale

The paper introduces a tabular-guided contrastive learning method that selects positive pairs via patient similarities in tabular space rather than image augmentations. No equations, derivations, or self-referential definitions are present that would reduce the claimed performance gains to fitted parameters or inputs by construction. The central premise—that tabular proximity proxies semantic visual similarity—is an explicit modeling assumption tested via downstream fine-tuning, linear probing, and zero-shot tasks on cardiac MR and a car advertisement dataset. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the text; the method is presented as a self-contained alternative to SimCLR-style or joint-embedding baselines, with results offered as external evidence rather than tautological outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only; the central claim rests on the domain assumption that tabular attributes encode visual semantic similarity. No free parameters or invented entities are explicitly introduced in the provided text.

axioms (1)
  • domain assumption: Tabular clinical data provides reliable signals for patient-level semantic similarity relevant to cardiac MR image features.
    This premise is required to justify constructing positive pairs from tabular matches rather than image augmentation alone.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 2 internal anchors

  1. [1]

    Donna K Arnett, Roger S Blumenthal, Michelle A Albert, Andrew B Buroker, Zachary D Goldberger, Ellen J Hahn, Cheryl Dennison Himmelfarb, Amit Khera, Donald Lloyd-Jones, J William McEvoy, et al. 2019 ACC/AHA guideline on the primary prevention of cardiovascular disease: a report of the American College of Cardiology/American Heart Association task force...

  2. [2]

    Automated cardiovascular magnetic resonance image analysis with fully convolutional networks

    Wenjia Bai, Matthew Sinclair, Giacomo Tarroni, Ozan Oktay, Martin Rajchl, Ghislain Vaillant, Aaron M Lee, Nay Aung, Elena Lukaschuk, Mihir M Sanghvi, et al. Automated cardiovascular magnetic resonance image analysis with fully convolutional networks. Journal of Cardiovascular Magnetic Resonance, 20(1):65, 2018.

  3. [3]

    A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets

    Khaled Bayoudh, Raja Knani, Fayçal Hamdaoui, and Abdellatif Mtibaa. A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets. The Visual Computer, 38(8):2939–2970, 2022.

  4. [4]

    FFF: Fixing flawed foundations in contrastive pre-training results in very strong vision-language models

    Adrian Bulat, Yassine Ouali, and Georgios Tzimiropoulos. FFF: Fixing flawed foundations in contrastive pre-training results in very strong vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14172–14182, 2024.

  5. [5]

    MONAI: An open-source framework for deep learning in healthcare

    M Jorge Cardoso, Wenqi Li, Richard Brown, Nic Ma, Eric Kerfoot, Yiheng Wang, Benjamin Murrey, Andriy Myronenko, Can Zhao, Dong Yang, et al. MONAI: An open-source framework for deep learning in healthcare. arXiv preprint arXiv:2211.02701, 2022.

  6. [6]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.

  7. [7]

    Incremental false negative detection for contrastive learning

    Tsai-Shien Chen, Wei-Chih Hung, Hung-Yu Tseng, Shao-Yi Chien, and Ming-Hsuan Yang. Incremental false negative detection for contrastive learning. arXiv preprint arXiv:2106.03719, 2021.

  8. [8]

    Exploring simple siamese representation learning

    Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.

  9. [9]

    TIP: Tabular-image pre-training for multimodal classification with incomplete data

    Siyi Du, Shaoming Zheng, Yinsong Wang, Wenjia Bai, Declan P. O’Regan, and Chen Qin. TIP: Tabular-image pre-training for multimodal classification with incomplete data. In 18th European Conference on Computer Vision (ECCV 2024), 2024.

  10. [10]

    STiL: Semi-supervised tabular-image learning for comprehensive task-relevant information exploration in multimodal classification

    Siyi Du, Xinzhe Luo, Declan P. O’Regan, and Chen Qin. STiL: Semi-supervised tabular-image learning for comprehensive task-relevant information exploration in multimodal classification. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 15549–15559, 2025.

  11. [11]

    Time and the patient–physician relationship

    David C Dugdale, Ronald Epstein, and Steven Z Pantilat. Time and the patient–physician relationship. Journal of General Internal Medicine, 14(Suppl 1):S34, 1999.

  12. [12]

    General cardiovascular risk profile for use in primary care: the Framingham Heart Study

    Ralph B D’Agostino Sr, Ramachandran S Vasan, Michael J Pencina, Philip A Wolf, Mark Cobain, Joseph M Massaro, and William B Kannel. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation, 117(6):743–753, 2008.

  13. [13]

    Learning visual representations via language-guided sampling

    Mohamed El Banani, Karan Desai, and Justin Johnson. Learning visual representations via language-guided sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19208–19220, 2023.

  14. [14]

    Scaling language-free visual representation learning

    David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, et al. Scaling language-free visual representation learning. arXiv preprint arXiv:2504.01017,

  15. [15]

    SoftCLIP: Softer cross-modal alignment makes CLIP stronger

    Yuting Gao, Jinfeng Liu, Zihan Xu, Tong Wu, Enwei Zhang, Ke Li, Jie Yang, Wei Liu, and Xing Sun. SoftCLIP: Softer cross-modal alignment makes CLIP stronger. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1860–1868, 2024.

  16. [16]

    Bootstrap your own latent - a new approach to self-supervised learning

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent - a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.

  17. [17]

    Dimensionality reduction by learning an invariant mapping

    Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), pages 1735–1742. IEEE, 2006.

  18. [18]

    Best of both worlds: Multimodal contrastive learning with tabular and imaging data

    Paul Hager, Martin J Menten, and Daniel Rueckert. Best of both worlds: Multimodal contrastive learning with tabular and imaging data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23924–23935, 2023.

  19. [19]

    Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?

    Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018.

  20. [20]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

  21. [21]

    Metadata-enhanced contrastive learning from retinal optical coherence tomography images

    Robbie Holland, Oliver Leingang, Hrvoje Bogunović, Sophie Riedl, Lars Fritsche, Toby Prevost, Hendrik PN Scholl, Ursula Schmidt-Erfurth, Sobha Sivaprasad, Andrew J Lotery, et al. Metadata-enhanced contrastive learning from retinal optical coherence tomography images. Medical Image Analysis, 97:103296, 2024.

  22. [22]

    A comprehensive survey on contrastive learning

    Haigen Hu, Xiaoyuan Wang, Yan Zhang, Qi Chen, and Qiu Guan. A comprehensive survey on contrastive learning. Neurocomputing, page 128645, 2024.

  23. [23]

    DVM-CAR: A large-scale automotive dataset for visual marketing research and applications

    Jingmin Huang, Bowei Chen, Lan Luo, Shigang Yue, and Iadh Ounis. DVM-CAR: A large-scale automotive dataset for visual marketing research and applications, 2023.

  24. [24]

    Boosting contrastive self-supervised learning with false negative cancellation

    Tri Huynh, Simon Kornblith, Matthew R Walter, Michael Maire, and Maryam Khademi. Boosting contrastive self-supervised learning with false negative cancellation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2785–2795, 2022.

  25. [25]

    Audio-visual contrastive learning with temporal self-supervision

    Simon Jenni, Alexander Black, and John Collomosse. Audio-visual contrastive learning with temporal self-supervision, 2023.

  26. [26]

    Stephanie M Lopez-Neyman, Kathleen Davis, Namvar Zohoori, K Shane Broughton, Carolyn E Moore, and Derek Miketinas. Racial disparities and prevalence of cardiovascular disease risk factors, cardiometabolic risk factors, and cardiovascular health metrics among US adults: NHANES 2011–2018. Scientific Reports, 12(1):19475, 2022.

  27. [27]

    Active contrastive learning of audio-visual video representations

    Shuang Ma, Zhaoyang Zeng, Daniel McDuff, and Yale Song. Active contrastive learning of audio-visual video representations. arXiv preprint arXiv:2009.09805, 2020.

  28. [28]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

  29. [29]

    Foundation model for cancer imaging biomarkers

    Suraj Pai, Dennis Bontempi, Ibrahim Hadzic, Vasco Prudente, Mateo Sokač, Tafadzwa L Chaunzwa, Simon Bernatz, Ahmed Hosny, Raymond H Mak, Nicolai J Birkbak, et al. Foundation model for cancer imaging biomarkers. Nature Machine Intelligence, 6(3):354–367, 2024.

  30. [30]

    Exploring scalable medical image encoders beyond text supervision

    Fernando Pérez-García, Harshita Sharma, Sam Bond-Taylor, Kenza Bouzid, Valentina Salvatelli, Maximilian Ilse, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Matthew P Lungren, et al. Exploring scalable medical image encoders beyond text supervision. Nature Machine Intelligence, pages 1–12, 2025.

  31. [31]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  32. [32]

    Sex matters: a comprehensive comparison of female and male hearts

    Sarah R St. Pierre, Mathias Peirlinck, and Ellen Kuhl. Sex matters: a comprehensive comparison of female and male hearts. Frontiers in Physiology, 13:831179, 2022.

  33. [33]

    UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age

    Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779, 2015.

  34. [34]

    Lightly

    Aleksandar Susmelj et al. Lightly. https://github.com/lightly-ai/lightly, 2020.

  35. [35]

    ContIG: Self-supervised multimodal contrastive learning for medical imaging with genetics

    Aiham Taleb, Matthias Kirchler, Remo Monti, and Christoph Lippert. ContIG: Self-supervised multimodal contrastive learning for medical imaging with genetics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20908–20921, 2022.

  36. [36]

    Why tabular foundation models should be a research priority

    Boris Van Breugel and Mihaela Van Der Schaar. Why tabular foundation models should be a research priority. arXiv preprint arXiv:2405.01147, 2024.

  37. [37]

    MedCLIP: Contrastive learning from unpaired medical images and text

    Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. MedCLIP: Contrastive learning from unpaired medical images and text. arXiv preprint arXiv:2210.10163, 2022.

  38. [38]

    Relaxing binary constraints in contrastive vision-language medical representation learning

    Xiaoyang Wei, Camille Kurtz, and Florence Cloppet. Relaxing binary constraints in contrastive vision-language medical representation learning. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4462–4471. IEEE, 2025.

  39. [39]

    Cardiovascular diseases (CVDs),

    World Health Organization. Cardiovascular diseases (CVDs),

  40. [40]

    Accessed: 2025-02-10. 1

  41. [41]

    Barlow twins: Self-supervised learning via redundancy reduction

    Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pages 12310–12320. PMLR, 2021.

  42. [42]

    Contrastive learning of medical visual representations from paired images and text

    Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive learning of medical visual representations from paired images and text. In Machine Learning for Healthcare Conference, pages 2–25. PMLR, 2022.

  43. [43]

    MGI: Multimodal contrastive pre-training of genomic and medical imaging

    Jiaying Zhou, Mingzhou Jiang, Junde Wu, Jiayuan Zhu, Ziyue Wang, and Yueming Jin. MGI: Multimodal contrastive pre-training of genomic and medical imaging. arXiv preprint arXiv:2406.00631, 2024.

  44. [44]

    Self-supervised multimodal learning: A survey

    Yongshuo Zong, Oisin Mac Aodha, and Timothy Hospedales. Self-supervised multimodal learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

  45. [45]

    Detailed Data Description: Tabular Attributes

    Detailed Data Description. 1.1. Tabular Attributes. Table 5 presents a comprehensive list of tabular attributes from the UK Biobank that were used for tabular similarity calculation during pretraining. These attributes were consistently used across all baseline methods that incorporated tabular data during pretraining. Attributes marked as extracted were ...

  46. [46]

    Implementation Details. 2.1. Baselines. We compare TGV against a mean-guess baseline (used only for cardiac phenotype prediction), a supervised 3D ResNet-50 model [19], four image-based contrastive learning approaches, and one image-tabular contrastive learning method. This section details the implementation of each baseline. Mean-guess. The mean-guess bas...

  47. [47]

    Performance under Low-Data Regimes (Complete)

    Additional Cardiac Experiments. 3.1. Performance under Low-Data Regimes (Complete). Fig. 5 presents the results on CAD classification and LVEF prediction under low-data regimes for all the baselines, which were omitted for clarity in the main body of the paper. TGV outperforms the other methods on nearly all the data regimes and all tasks, with some exc...

  48. [48]

    and SimCLR [6] are typically the second best approach, while BYOL [16], Barlow Twins [40], and SimSiam [8] report the worst overall performance. 3.2. Evaluating Robustness of the Zero-shot Predictions. We evaluate the robustness of our zero-shot approach in terms of two conditions: (1) how changing the size of the representative set impacts performan...

  49. [49]

    Robustness to representative set size

    Robustness to representative set size. We evaluate the robustness of the zero-shot predictions under different sizes of the representative set P. The experiment is performed using the image encoder pretrained with TGV and the results are reported in Table 7. We consider N=2000 as the baseline and report the changes in the performance against it. Reduc...

  50. [50]

    Robustness across different representative sets

    Robustness across different representative sets. Table 8 reports the mean and standard deviation of zero-shot prediction performance across three different representative sets P. CAD prediction shows the highest standard deviation, which is reflective of the small number of CAD positive cases in the UK Biobank. Generally, methods with lower overall perf...

  51. [51]

    Assessing TGV’s Generalizability: Dataset

    Assessing TGV’s Generalizability. 4.1. Dataset. To assess whether TGV can generalize to other domains and datasets, we use the Data Visual Marketing (DVM) car dataset [23]. The dataset contains 1,451,784 images and their corresponding attributes of cars at varying degree angles. Model performance is evaluated on two tasks, car model classification (286 cl...