Tables Guide Vision: Learning to See the Heart through Tabular Data

Julia Schnabel; Keno Bressem; Marta Hasny; Maxime Di Folco

arxiv: 2503.14998 · v3 · submitted 2025-03-19 · 💻 cs.CV

Pith reviewed 2026-05-22 23:03 UTC · model grok-4.3

keywords: contrastive learning, tabular data, cardiac MR, representation learning, zero-shot prediction, medical imaging, patient similarity

The pith

Tabular data guides contrastive learning to build stronger visual representations from cardiac MR images without joint embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a contrastive learning approach that uses tabular clinical data to identify similar patients and form positive pairs, instead of depending only on image augmentations or multimodal embeddings. This targets the issue of false negatives where semantically related samples are incorrectly treated as negatives, a problem acute in cardiology because demographic and clinical factors shape disease interpretation. The method constructs representations from short-axis cardiac MR images alone while still incorporating tabular guidance for pair selection, then adapts k-NN for zero-shot prediction. Downstream evaluations on fine-tuning, linear probing, and zero-shot tasks for cardiovascular artery diseases and cardiac phenotypes show gains over image-only and combined-embedding baselines. The approach also transfers to a natural-image car advertisement dataset.

Core claim

The tabular-guided contrastive learning framework leverages clinically relevant tabular data to identify patient-level similarities and construct more meaningful pairs, enabling semantically aligned representation learning without requiring joint embeddings across modalities. Evaluation on a large cohort of cardiac MR images shows that incorporating tabular data guidance yields stronger visual representations than conventional methods that rely solely on image augmentation or combined image-tabular embeddings, with corresponding improvements in fine-tuning, linear probing, and zero-shot prediction of cardiovascular artery diseases and cardiac phenotypes.

What carries the argument

Tabular-guided contrastive learning framework that uses tabular data to identify patient-level similarities and construct positive pairs for representation learning.
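One way to picture the pair-construction step is the sketch below. It is a minimal illustration under stated assumptions, not the authors' implementation: the k-nearest-neighbour rule, the standardised Euclidean distance, the function names `tabular_positive_mask` and `multi_positive_info_nce`, and the temperature value are all introduced here for illustration. The code marks each patient's closest tabular neighbours as positives, then scores image embeddings with an InfoNCE-style loss averaged over those positives.

```python
import numpy as np

def tabular_positive_mask(tab, k=2):
    """Mark each sample's k nearest tabular neighbours as positives (assumed rule)."""
    # Standardise columns so that no single attribute dominates the distance.
    z = (tab - tab.mean(axis=0)) / (tab.std(axis=0) + 1e-8)
    # Pairwise squared Euclidean distances in tabular space.
    d = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)  # a sample is never its own positive
    nearest = np.argsort(d, axis=1)[:, :k]
    mask = np.zeros(d.shape, dtype=bool)
    mask[np.arange(len(tab))[:, None], nearest] = True
    return mask

def multi_positive_info_nce(emb, pos_mask, temperature=0.1):
    """InfoNCE-style loss averaged over every marked positive of each anchor.

    Assumes each row of pos_mask contains at least one positive.
    """
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    logits = e @ e.T / temperature
    np.fill_diagonal(logits, -np.inf)  # exclude self-similarity from the softmax
    # Numerically stable row-wise log-softmax.
    m = logits.max(axis=1, keepdims=True)
    log_prob = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    per_anchor = -np.where(pos_mask, log_prob, 0.0).sum(axis=1) / pos_mask.sum(axis=1)
    return float(per_anchor.mean())
```

In this toy form the tabular values never enter the encoder. They only decide which entries of the mask are true, which is the sense in which no joint embedding is required.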

If this is right

Tabular data enables more effective distinction between patient subgroups in cardiac MR images.
The representations improve results on fine-tuning, linear probing, and zero-shot prediction of cardiovascular artery diseases and cardiac phenotypes.
The framework generalizes beyond medical images, as demonstrated on a car advertisement dataset.
Adapting k-NN to the learned representations supports zero-shot prediction without multimodal training.
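The k-NN point can be made concrete with a short sketch. The paper's exact adaptation is not spelled out on this page, so everything below is an assumption: the function name `knn_predict`, the use of Euclidean distance, the choice of `k`, and averaging the neighbours' targets. The idea shown is only that a labelled bank of image embeddings can stand in for a text prompt when predicting for a new image.

```python
import numpy as np

def knn_predict(bank_emb, bank_targets, query_emb, k=3):
    """Predict each query's target as the mean target of its k nearest bank embeddings."""
    # Euclidean distance between every query and every bank embedding.
    d = np.linalg.norm(query_emb[:, None, :] - bank_emb[None, :, :], axis=-1)
    nearest = np.argsort(d, axis=1)[:, :k]
    # For a continuous target the mean is a regression estimate;
    # for a 0/1 target it can be thresholded (for example at 0.5).
    return bank_targets[nearest].mean(axis=1)

# Toy usage: two well-separated groups of bank embeddings.
bank = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
targets = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
queries = np.array([[0.2, 0.2], [10.5, 10.5]])
pred = knn_predict(bank, targets, queries, k=3)
```

Only image embeddings are compared at query time in this sketch, so no tabular record is needed for the new image.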

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tabular-pairing idea could be tested on other imaging modalities where metadata such as age or scanner type correlates with visual appearance.
By sidestepping joint embeddings, the method may allow pretraining on larger image-only archives that still carry auxiliary tabular records.
If the tabular similarity signal is weak or noisy, the performance edge may shrink, which suggests a natural ablation that removes or perturbs the tabular component.
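A perturbation ablation of this kind could start from a sketch like the following. It is hypothetical throughout: the helper names, the nearest-neighbour pairing rule, and the way rows are reassigned are assumptions made for illustration, not the paper's protocol. The code reassigns a chosen fraction of tabular rows to other patients and reports how many patients keep the same nearest tabular neighbour, a rough proxy for how much of the pairing signal survives.

```python
import numpy as np

def nearest_tabular_neighbour(tab):
    """Index of each row's nearest other row under Euclidean distance."""
    d = np.linalg.norm(tab[:, None, :] - tab[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.argmin(axis=1)

def perturb_tabular(tab, fraction, rng):
    """Shuffle the tabular rows of a random subset of patients among themselves."""
    out = tab.copy()
    n_swap = int(round(fraction * len(tab)))
    idx = rng.choice(len(tab), size=n_swap, replace=False)
    out[idx] = tab[rng.permutation(idx)]
    return out

def pair_agreement(tab, fraction, seed=0):
    """Share of patients whose nearest tabular neighbour is unchanged after perturbation."""
    rng = np.random.default_rng(seed)
    clean = nearest_tabular_neighbour(tab)
    noisy = nearest_tabular_neighbour(perturb_tabular(tab, fraction, rng))
    return float((clean == noisy).mean())
```

Pretraining once per perturbation level and plotting downstream performance against `pair_agreement` would then show how quickly the gain decays as the tabular signal is degraded.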

Load-bearing premise

Clinically relevant tabular data can reliably identify patient-level similarities that align with semantically meaningful visual features in the cardiac MR images.

What would settle it

A controlled experiment on the same cardiac MR cohort in which tabular-guided training shows no gain or a loss relative to standard image-augmentation contrastive learning on all downstream tasks.

Figures

Figures reproduced from arXiv: 2503.14998 by Julia Schnabel, Keno Bressem, Marta Hasny, Maxime Di Folco.

**Figure 2.** Comparison of the proposed tabular guidance (TGV: Tables Guide Vision) approach against other contrastive learning approaches.

**Figure 3.** Performance of the models under different amounts of training data on LVEF prediction (left, lower is better) and multilabel CAD...

**Figure 4.** t-SNE visualization of the feature embedding generated with TGV. Sex, LVEF, and LVEDV have been included as attributes for...

**Figure 5.** Performance of all evaluated methods on LVEF prediction and CAD classification using zero-shot prediction, and linear probing...

**Figure 6.** Zero-shot CAD classification and LVEF prediction performance as a function of the percentage of samples used for the mean...

**Figure 7.** t-SNE visualization of sex, LVEDV, LVEF, and height for a) MMCL [...

Original abstract

Contrastive learning methods in computer vision typically rely on augmented views of the same image or multimodal pretraining strategies that align paired modalities. However, these approaches often overlook semantic relationships between distinct instances, leading to false negatives when semantically similar samples are treated as negatives. This limitation is especially critical in medical imaging domains such as cardiology, where demographic and clinical attributes play a critical role in assessing disease risk and patient outcomes. We introduce a tabular-guided contrastive learning framework that leverages clinically relevant tabular data to identify patient-level similarities and construct more meaningful pairs, enabling semantically aligned representation learning without requiring joint embeddings across modalities. Additionally, we adapt the k-NN algorithm for zero-shot prediction to overcome the lack of zero-shot capability in unimodal representations. We demonstrate the strength of our methods using a large cohort of short-axis cardiac MR images and clinical attributes, where tabular data helps to more effectively distinguish between patient subgroups. Evaluation on downstream tasks, including fine-tuning, linear probing, and zero-shot prediction of cardiovascular artery diseases and cardiac phenotypes, shows that incorporating tabular data guidance yields stronger visual representations than conventional methods that rely solely on image augmentation or combined image-tabular embeddings. Further, we show that our method can generalize to natural images by evaluating it on a car advertisement dataset. Code is available at https://github.com/marteczkah/tables_guide_vision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main move—using tabular clinical data to pick positive pairs for unimodal contrastive learning on cardiac MR—looks workable and avoids multimodal fusion, but the claim that this reliably captures visual semantics rests on an assumption that needs tighter checks.

The punchline is that tabular attributes like demographics and clinical variables are used to define patient similarities, which then become the positive pairs in contrastive training instead of relying only on image augmentations. This keeps everything unimodal while trying to cut down on false negatives where two different patients actually share relevant heart features. They also adapt k-NN for zero-shot prediction on the learned features and test on both cardiac MR and a car advertisement dataset for generalization.

Referee Report

3 major / 2 minor

Summary. The paper introduces a tabular-guided contrastive learning framework for short-axis cardiac MR images that selects positive pairs using patient similarities computed from clinical tabular data (demographics and attributes) rather than image augmentations alone. This is claimed to reduce false negatives and yield stronger unimodal visual representations without joint multimodal embeddings. An adapted k-NN is used for zero-shot prediction. Downstream evaluations on fine-tuning, linear probing, and zero-shot tasks for cardiovascular artery disease and cardiac phenotypes show gains over image-only contrastive methods and combined image-tabular embeddings; generalization is demonstrated on a car advertisement dataset. Code is released.

Significance. If the central claim holds, the work offers a practical way to exploit readily available tabular clinical data to improve self-supervised visual representations in medical imaging, avoiding the need for paired multimodal training. The public code supports reproducibility. Significance is tempered by the need to confirm that tabular proximity reliably proxies visual semantic similarity rather than providing an auxiliary signal through other mechanisms.

major comments (3)

[Abstract / Method] Abstract and method description: the claim that tabular proximity identifies 'semantically aligned' pairs (reducing false negatives due to visual similarity) is load-bearing, yet the manuscript provides no direct validation such as visual inspection of selected pairs, feature-space overlap metrics, or ablation comparing tabular-selected pairs against randomly selected pairs with matched tabular distance.
[Experiments] Experiments section (downstream task results): reported gains on fine-tuning, linear probing, and zero-shot tasks are consistent with the claim but do not isolate whether improvements arise from semantic visual alignment versus any auxiliary tabular signal reducing false-negative rate; no control experiment (e.g., random tabular labels or non-clinical tabular features) is described.
[Zero-shot prediction] Zero-shot prediction subsection: the k-NN adaptation is presented as enabling zero-shot capability for unimodal representations, but the description does not specify whether tabular data is required at query time or how the distance metric is defined, leaving unclear whether the method remains strictly unimodal.
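The first major comment asks for a direct check that tabular-selected pairs are visually closer than arbitrary pairs. One possible form of that check is sketched below. It is an assumption-laden illustration rather than anything described in the manuscript: the function names, the use of cosine similarity, and the idea of taking embeddings from an independently trained image encoder are all hypothetical.

```python
import numpy as np

def mean_pair_cosine(emb, pairs):
    """Mean cosine similarity over a list of (i, j) index pairs."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return float((e[pairs[:, 0]] * e[pairs[:, 1]]).sum(axis=1).mean())

def random_pairs(n, n_pairs, rng):
    """Sample index pairs (a, b) with a != b; requires n >= 2."""
    a = rng.integers(0, n, size=n_pairs)
    # An offset in [1, n-1] guarantees that b differs from a.
    b = (a + rng.integers(1, n, size=n_pairs)) % n
    return np.stack([a, b], axis=1)

def alignment_gap(emb, tab_pairs, seed=0):
    """Similarity of tabular-selected pairs minus that of equally many random pairs."""
    rng = np.random.default_rng(seed)
    rand = random_pairs(len(emb), len(tab_pairs), rng)
    return mean_pair_cosine(emb, tab_pairs) - mean_pair_cosine(emb, rand)
```

A gap near zero would suggest that tabular proximity carries little visual information for that encoder, while a clearly positive gap would support the premise. The matched-distance control the referee mentions would need a further step that is not shown here.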

minor comments (2)

[Method] Notation for the tabular similarity function and contrastive loss weighting is introduced without an explicit equation or pseudocode, making the exact pair-construction procedure difficult to replicate from text alone.
[Figures] Figure captions for qualitative results on cardiac MR and the car dataset could more explicitly state whether displayed pairs were selected by the tabular method or by baselines.
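To illustrate what an explicit statement of the pair-construction procedure might look like, here is one hedged possibility. None of the symbols below come from the manuscript: $t_i$ (tabular vector of patient $i$), $s(\cdot,\cdot)$ (a tabular similarity), $\delta$ (a similarity threshold), $P(i)$ (the resulting positive set), $z_i$ (the normalised image embedding), and $\tau$ (a temperature) are all assumptions chosen for this sketch.

```latex
% Positive set: other patients whose tabular similarity exceeds a threshold.
P(i) = \{\, j \neq i \;:\; s(t_i, t_j) \geq \delta \,\}

% Multi-positive InfoNCE-style loss over a batch of N image embeddings.
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{|P(i)|}
  \sum_{j \in P(i)} \log
  \frac{\exp\!\left( z_i^{\top} z_j / \tau \right)}
       {\sum_{k \neq i} \exp\!\left( z_i^{\top} z_k / \tau \right)}
```

Under this reading, the tabular data appears only in the definition of $P(i)$, and the loss itself is computed from image embeddings alone. An anchor with an empty $P(i)$ would need separate handling, for example by falling back to its own augmented view.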

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, agreeing where revisions are needed to strengthen the claims.

Referee: [Abstract / Method] Abstract and method description: the claim that tabular proximity identifies 'semantically aligned' pairs (reducing false negatives due to visual similarity) is load-bearing, yet the manuscript provides no direct validation such as visual inspection of selected pairs, feature-space overlap metrics, or ablation comparing tabular-selected pairs against randomly selected pairs with matched tabular distance.

Authors: We agree that direct validation of semantic alignment would strengthen the paper. In the revised manuscript, we will add visual inspection of selected pairs, feature-space overlap metrics, and an ablation comparing tabular-selected pairs to randomly selected pairs with matched tabular distance. revision: yes
Referee: [Experiments] Experiments section (downstream task results): reported gains on fine-tuning, linear probing, and zero-shot tasks are consistent with the claim but do not isolate whether improvements arise from semantic visual alignment versus any auxiliary tabular signal reducing false-negative rate; no control experiment (e.g., random tabular labels or non-clinical tabular features) is described.

Authors: We acknowledge that control experiments would better isolate the effect. We will add experiments using random tabular labels and non-clinical features to the revised experiments section. revision: yes
Referee: [Zero-shot prediction] Zero-shot prediction subsection: the k-NN adaptation is presented as enabling zero-shot capability for unimodal representations, but the description does not specify whether tabular data is required at query time or how the distance metric is defined, leaving unclear whether the method remains strictly unimodal.

Authors: The adapted k-NN uses only the learned visual embeddings at query time with Euclidean distance in embedding space and does not require tabular data during inference. We will revise the subsection to state this explicitly. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework is an independent design choice validated empirically

The paper introduces a tabular-guided contrastive learning method that selects positive pairs via patient similarities in tabular space rather than image augmentations. No equations, derivations, or self-referential definitions are present that would reduce the claimed performance gains to fitted parameters or inputs by construction. The central premise—that tabular proximity proxies semantic visual similarity—is an explicit modeling assumption tested via downstream fine-tuning, linear probing, and zero-shot tasks on cardiac MR and a car advertisement dataset. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the text; the method is presented as a self-contained alternative to SimCLR-style or joint-embedding baselines, with results offered as external evidence rather than tautological outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only; the central claim rests on the domain assumption that tabular attributes encode visual semantic similarity. No free parameters or invented entities are explicitly introduced in the provided text.

axioms (1)

domain assumption: Tabular clinical data provides reliable signals for patient-level semantic similarity relevant to cardiac MR image features.
This premise is required to justify constructing positive pairs from tabular matches rather than image augmentation alone.

pith-pipeline@v0.9.0 · 5777 in / 1189 out tokens · 142633 ms · 2026-05-22T23:03:26.834715+00:00 · methodology

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 2 internal anchors

[1] Donna K Arnett, Roger S Blumenthal, Michelle A Albert, Andrew B Buroker, Zachary D Goldberger, Ellen J Hahn, Cheryl Dennison Himmelfarb, Amit Khera, Donald Lloyd-Jones, J William McEvoy, et al. 2019 acc/aha guideline on the primary prevention of cardiovascular disease: a report of the american college of cardiology/american heart association task force...

[2] Wenjia Bai, Matthew Sinclair, Giacomo Tarroni, Ozan Oktay, Martin Rajchl, Ghislain Vaillant, Aaron M Lee, Nay Aung, Elena Lukaschuk, Mihir M Sanghvi, et al. Automated cardiovascular magnetic resonance image analysis with fully convolutional networks. Journal of cardiovascular magnetic resonance, 20(1):65, 2018.

[3] Khaled Bayoudh, Raja Knani, Fayçal Hamdaoui, and Abdellatif Mtibaa. A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets. The Visual Computer, 38(8):2939–2970, 2022.

[4] Adrian Bulat, Yassine Ouali, and Georgios Tzimiropoulos. Fff: Fixing flawed foundations in contrastive pre-training results in very strong vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14172–14182, 2024.

[5] M Jorge Cardoso, Wenqi Li, Richard Brown, Nic Ma, Eric Kerfoot, Yiheng Wang, Benjamin Murrey, Andriy Myronenko, Can Zhao, Dong Yang, et al. Monai: An open-source framework for deep learning in healthcare. arXiv preprint arXiv:2211.02701, 2022.

[6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.

[7] Tsai-Shien Chen, Wei-Chih Hung, Hung-Yu Tseng, Shao-Yi Chien, and Ming-Hsuan Yang. Incremental false negative detection for contrastive learning. arXiv preprint arXiv:2106.03719, 2021.

[8] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15750–15758, 2021.
[9] Siyi Du, Shaoming Zheng, Yinsong Wang, Wenjia Bai, Declan P. O’Regan, and Chen Qin. TIP: Tabular-image pre-training for multimodal classification with incomplete data. In 18th European Conference on Computer Vision (ECCV 2024), 2024.

[10] Siyi Du, Xinzhe Luo, Declan P. O’Regan, and Chen Qin. Stil: Semi-supervised tabular-image learning for comprehensive task-relevant information exploration in multimodal classification. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 15549–15559, 2025.

[11] David C Dugdale, Ronald Epstein, and Steven Z Pantilat. Time and the patient–physician relationship. Journal of general internal medicine, 14(Suppl 1):S34, 1999.

[12] Ralph B D’Agostino Sr, Ramachandran S Vasan, Michael J Pencina, Philip A Wolf, Mark Cobain, Joseph M Massaro, and William B Kannel. General cardiovascular risk profile for use in primary care: the framingham heart study. Circulation, 117(6):743–753, 2008.

[13] Mohamed El Banani, Karan Desai, and Justin Johnson. Learning visual representations via language-guided sampling. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19208–19220, 2023.

[14] David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, et al. Scaling language-free visual representation learning. arXiv preprint arXiv:2504.01017,

[15] Yuting Gao, Jinfeng Liu, Zihan Xu, Tong Wu, Enwei Zhang, Ke Li, Jie Yang, Wei Liu, and Xing Sun. Softclip: Softer cross-modal alignment makes clip stronger. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1860–1868, 2024.

[16] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
[17] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), pages 1735–1742. IEEE, 2006.

[18] Paul Hager, Martin J Menten, and Daniel Rueckert. Best of both worlds: Multimodal contrastive learning with tabular and imaging data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23924–23935, 2023.

[19] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018.

[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[21] Robbie Holland, Oliver Leingang, Hrvoje Bogunović, Sophie Riedl, Lars Fritsche, Toby Prevost, Hendrik PN Scholl, Ursula Schmidt-Erfurth, Sobha Sivaprasad, Andrew J Lotery, et al. Metadata-enhanced contrastive learning from retinal optical coherence tomography images. Medical Image Analysis, 97:103296, 2024.

[22] Haigen Hu, Xiaoyuan Wang, Yan Zhang, Qi Chen, and Qiu Guan. A comprehensive survey on contrastive learning. Neurocomputing, page 128645, 2024.

[23] Jingmin Huang, Bowei Chen, Lan Luo, Shigang Yue, and Iadh Ounis. Dvm-car: A large-scale automotive dataset for visual marketing research and applications, 2023.

[24] Tri Huynh, Simon Kornblith, Matthew R Walter, Michael Maire, and Maryam Khademi. Boosting contrastive self-supervised learning with false negative cancellation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2785–2795, 2022.
[25] Simon Jenni, Alexander Black, and John Collomosse. Audio-visual contrastive learning with temporal self-supervision, 2023.

[26] Stephanie M Lopez-Neyman, Kathleen Davis, Namvar Zohoori, K Shane Broughton, Carolyn E Moore, and Derek Miketinas. Racial disparities and prevalence of cardiovascular disease risk factors, cardiometabolic risk factors, and cardiovascular health metrics among us adults: Nhanes 2011–2018. Scientific reports, 12(1):19475, 2022.

[27] Shuang Ma, Zhaoyang Zeng, Daniel McDuff, and Yale Song. Active contrastive learning of audio-visual video representations. arXiv preprint arXiv:2009.09805, 2020.

[28] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[29] Suraj Pai, Dennis Bontempi, Ibrahim Hadzic, Vasco Prudente, Mateo Sokač, Tafadzwa L Chaunzwa, Simon Bernatz, Ahmed Hosny, Raymond H Mak, Nicolai J Birkbak, et al. Foundation model for cancer imaging biomarkers. Nature machine intelligence, 6(3):354–367, 2024.

[30] Fernando Pérez-García, Harshita Sharma, Sam Bond-Taylor, Kenza Bouzid, Valentina Salvatelli, Maximilian Ilse, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Matthew P Lungren, et al. Exploring scalable medical image encoders beyond text supervision. Nature Machine Intelligence, pages 1–12, 2025.

[31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

[32] Sarah R St. Pierre, Mathias Peirlinck, and Ellen Kuhl. Sex matters: a comprehensive comparison of female and male hearts. Frontiers in Physiology, 13:831179, 2022.

[33] Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, et al. Uk biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS medicine, 12(3):e1001779, 2015.
[34] Aleksandar Susmelj et al. Lightly. https://github.com/lightly-ai/lightly, 2020.

[35] Aiham Taleb, Matthias Kirchler, Remo Monti, and Christoph Lippert. Contig: Self-supervised multimodal contrastive learning for medical imaging with genetics. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20908–20921, 2022.

[36] Boris Van Breugel and Mihaela Van Der Schaar. Why tabular foundation models should be a research priority. arXiv preprint arXiv:2405.01147, 2024.

[37] Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. arXiv preprint arXiv:2210.10163, 2022.

[38] Xiaoyang Wei, Camille Kurtz, and Florence Cloppet. Relaxing binary constraints in contrastive vision-language medical representation learning. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4462–4471. IEEE, 2025.

[39] World Health Organization. Cardiovascular diseases (cvds),

[40] Accessed: 2025-02-10.

[41] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In International conference on machine learning, pages 12310–12320. PMLR, 2021.

[42] Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive learning of medical visual representations from paired images and text. In Machine learning for healthcare conference, pages 2–25. PMLR, 2022.

[43] Jiaying Zhou, Mingzhou Jiang, Junde Wu, Jiayuan Zhu, Ziyue Wang, and Yueming Jin. Mgi: Multimodal contrastive pre-training of genomic and medical imaging. arXiv preprint arXiv:2406.00631, 2024.

[44] Yongshuo Zong, Oisin Mac Aodha, and Timothy Hospedales. Self-supervised multimodal learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[45]

Tabular Attributes Table 5 presents a comprehensive list of tabular attributes from the UK Biobank that were used for tabular similarity calculation during pretraining

Detailed Data Description 1.1. Tabular Attributes Table 5 presents a comprehensive list of tabular attributes from the UK Biobank that were used for tabular similarity calculation during pretraining. These attributes were con- sistently used across all baseline methods that incorporated tabular data during pretraining. Attributes marked asex- tractedwere ...

work page
[46]

Implementation Details 2.1. Baselines We compare TGV against a mean-guess baseline (used only for cardiac phenotype prediction), a supervised 3D ResNet-50 model [19], four image-based contrastive learn- ing approaches, and one image-tabular contrastive learning method. This section details the implementation of each baseline. Mean-guess.The mean-guess bas...

work page 2000
[47]

Performance under Low-Data Regimes (Com- plete) Fig

Additional Cardiac Experiments 3.1. Performance under Low-Data Regimes (Com- plete) Fig. 5 presents the results on CAD classification and LVEF prediction under low-data regimes for all the baselines, which were omitted for clarity in the main body of the pa- per. TGV outperforms the other methods on nearly all the data regimes and all tasks, with some exc...

work page
[48]

and SimCLR [6] are typically the second best ap- proach, while BYOL [16], Barlow Twins [40], and Sim- Siam [8] report the worst overall performance. 3.2. Evaluating Robustness of the Zero-shot Predic- tions We evaluate the robustness of our zero-shot approach in terms of two conditions: (1) how changing the size of the representative set impacts performan...

work page
[49]

The experiment is performed us- ing the image encoder pretrained with TGV and the results are reported in Table 7

Robustness to representative set size.We evaluate the robustness of the zero-shot predictions under different sizes of the representative setP. The experiment is performed us- ing the image encoder pretrained with TGV and the results are reported in Table 7. We consider the N=2000 as the baseline and report the changes in the performance against it. Reduc...

Robustness across different representative sets. Table 8 reports the mean and standard deviation of zero-shot prediction performance across three different representative sets P. CAD prediction shows the highest standard deviation, which reflects the small number of CAD-positive cases in the UK Biobank. Generally, methods with lower overall perf...
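The aggregation over representative sets amounts to a mean and a standard deviation over one score per set. The minimal sketch below assumes three scores per task; whether the population or sample standard deviation is used is not stated, so `ddof` is left as a parameter. The function name and the toy scores are hypothetical.

```python
import numpy as np

def summarize_across_sets(scores_per_set, ddof=0):
    """Mean and standard deviation of one metric over several representative sets.

    ddof=0 gives the population standard deviation, ddof=1 the sample one;
    which of the two applies here is an assumption.
    """
    scores = np.asarray(scores_per_set, dtype=float)
    return float(scores.mean()), float(scores.std(ddof=ddof))

# Toy scores from three hypothetical representative sets (not real results).
mean_score, std_score = summarize_across_sets([0.70, 0.72, 0.74])
```

With only three sets, the two conventions differ noticeably, so the choice of `ddof` is worth stating alongside any reported spread.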

Assessing TGV’s Generalizability

4.1. Dataset

To assess whether TGV can generalize to other domains and datasets, we use the Data Visual Marketing (DVM) car dataset [23]. The dataset contains 1,451,784 images and their corresponding attributes of cars at varying degree angles. Model performance is evaluated on two tasks, car model classification (286 cl...

[1] Donna K Arnett, Roger S Blumenthal, Michelle A Albert, Andrew B Buroker, Zachary D Goldberger, Ellen J Hahn, Cheryl Dennison Himmelfarb, Amit Khera, Donald Lloyd-Jones, J William McEvoy, et al. 2019 ACC/AHA guideline on the primary prevention of cardiovascular disease: a report of the American College of Cardiology/American Heart Association task force...

[2] Wenjia Bai, Matthew Sinclair, Giacomo Tarroni, Ozan Oktay, Martin Rajchl, Ghislain Vaillant, Aaron M Lee, Nay Aung, Elena Lukaschuk, Mihir M Sanghvi, et al. Automated cardiovascular magnetic resonance image analysis with fully convolutional networks. Journal of Cardiovascular Magnetic Resonance, 20(1):65, 2018.

[3] Khaled Bayoudh, Raja Knani, Fayçal Hamdaoui, and Abdellatif Mtibaa. A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets. The Visual Computer, 38(8):2939–2970, 2022.

[4] Adrian Bulat, Yassine Ouali, and Georgios Tzimiropoulos. FFF: Fixing flawed foundations in contrastive pre-training results in very strong vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14172–14182, 2024.

[5] M Jorge Cardoso, Wenqi Li, Richard Brown, Nic Ma, Eric Kerfoot, Yiheng Wang, Benjamin Murrey, Andriy Myronenko, Can Zhao, Dong Yang, et al. MONAI: An open-source framework for deep learning in healthcare. arXiv preprint arXiv:2211.02701, 2022.

[6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.

[7] Tsai-Shien Chen, Wei-Chih Hung, Hung-Yu Tseng, Shao-Yi Chien, and Ming-Hsuan Yang. Incremental false negative detection for contrastive learning. arXiv preprint arXiv:2106.03719, 2021.

[8] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.

[9] Siyi Du, Shaoming Zheng, Yinsong Wang, Wenjia Bai, Declan P. O’Regan, and Chen Qin. TIP: Tabular-image pre-training for multimodal classification with incomplete data. In 18th European Conference on Computer Vision (ECCV 2024), 2024.

[10] Siyi Du, Xinzhe Luo, Declan P. O’Regan, and Chen Qin. STiL: Semi-supervised tabular-image learning for comprehensive task-relevant information exploration in multimodal classification. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 15549–15559, 2025.

[11] David C Dugdale, Ronald Epstein, and Steven Z Pantilat. Time and the patient–physician relationship. Journal of General Internal Medicine, 14(Suppl 1):S34, 1999.

[12] Ralph B D’Agostino Sr, Ramachandran S Vasan, Michael J Pencina, Philip A Wolf, Mark Cobain, Joseph M Massaro, and William B Kannel. General cardiovascular risk profile for use in primary care: the Framingham Heart Study. Circulation, 117(6):743–753, 2008.

[13] Mohamed El Banani, Karan Desai, and Justin Johnson. Learning visual representations via language-guided sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19208–19220, 2023.

[14] David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, et al. Scaling language-free visual representation learning. arXiv preprint arXiv:2504.01017.

[15] Yuting Gao, Jinfeng Liu, Zihan Xu, Tong Wu, Enwei Zhang, Ke Li, Jie Yang, Wei Liu, and Xing Sun. SoftCLIP: Softer cross-modal alignment makes CLIP stronger. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1860–1868, 2024.

[16] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.

[17] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), pages 1735–1742. IEEE, 2006.

[18] Paul Hager, Martin J Menten, and Daniel Rueckert. Best of both worlds: Multimodal contrastive learning with tabular and imaging data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23924–23935, 2023.

[19] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018.

[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[21] Robbie Holland, Oliver Leingang, Hrvoje Bogunović, Sophie Riedl, Lars Fritsche, Toby Prevost, Hendrik PN Scholl, Ursula Schmidt-Erfurth, Sobha Sivaprasad, Andrew J Lotery, et al. Metadata-enhanced contrastive learning from retinal optical coherence tomography images. Medical Image Analysis, 97:103296, 2024.

[22] Haigen Hu, Xiaoyuan Wang, Yan Zhang, Qi Chen, and Qiu Guan. A comprehensive survey on contrastive learning. Neurocomputing, page 128645, 2024.

[23] Jingmin Huang, Bowei Chen, Lan Luo, Shigang Yue, and Iadh Ounis. DVM-CAR: A large-scale automotive dataset for visual marketing research and applications, 2023.

[24] Tri Huynh, Simon Kornblith, Matthew R Walter, Michael Maire, and Maryam Khademi. Boosting contrastive self-supervised learning with false negative cancellation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2785–2795, 2022.

[25] Simon Jenni, Alexander Black, and John Collomosse. Audio-visual contrastive learning with temporal self-supervision, 2023.

[26] Stephanie M Lopez-Neyman, Kathleen Davis, Namvar Zohoori, K Shane Broughton, Carolyn E Moore, and Derek Miketinas. Racial disparities and prevalence of cardiovascular disease risk factors, cardiometabolic risk factors, and cardiovascular health metrics among US adults: NHANES 2011–2018. Scientific Reports, 12(1):19475, 2022.

[27] Shuang Ma, Zhaoyang Zeng, Daniel McDuff, and Yale Song. Active contrastive learning of audio-visual video representations. arXiv preprint arXiv:2009.09805, 2020.

[28] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[29] Suraj Pai, Dennis Bontempi, Ibrahim Hadzic, Vasco Prudente, Mateo Sokač, Tafadzwa L Chaunzwa, Simon Bernatz, Ahmed Hosny, Raymond H Mak, Nicolai J Birkbak, et al. Foundation model for cancer imaging biomarkers. Nature Machine Intelligence, 6(3):354–367, 2024.

[30] Fernando Pérez-García, Harshita Sharma, Sam Bond-Taylor, Kenza Bouzid, Valentina Salvatelli, Maximilian Ilse, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Matthew P Lungren, et al. Exploring scalable medical image encoders beyond text supervision. Nature Machine Intelligence, pages 1–12, 2025.

[31] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[32] Sarah R St. Pierre, Mathias Peirlinck, and Ellen Kuhl. Sex matters: a comprehensive comparison of female and male hearts. Frontiers in Physiology, 13:831179, 2022.

[33] Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779, 2015.

[34] Aleksandar Susmelj et al. Lightly. https://github.com/lightly-ai/lightly, 2020.

[35] Aiham Taleb, Matthias Kirchler, Remo Monti, and Christoph Lippert. ContIG: Self-supervised multimodal contrastive learning for medical imaging with genetics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20908–20921, 2022.

[36] Boris Van Breugel and Mihaela Van Der Schaar. Why tabular foundation models should be a research priority. arXiv preprint arXiv:2405.01147, 2024.

[37] Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. MedCLIP: Contrastive learning from unpaired medical images and text. arXiv preprint arXiv:2210.10163, 2022.

[38] Xiaoyang Wei, Camille Kurtz, and Florence Cloppet. Relaxing binary constraints in contrastive vision-language medical representation learning. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 4462–4471. IEEE, 2025.

[39] World Health Organization. Cardiovascular diseases (CVDs). Accessed: 2025-02-10.

[40] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow Twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning, pages 12310–12320. PMLR, 2021.

[41] Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive learning of medical visual representations from paired images and text. In Machine Learning for Healthcare Conference, pages 2–25. PMLR, 2022.

[42] Jiaying Zhou, Mingzhou Jiang, Junde Wu, Jiayuan Zhu, Ziyue Wang, and Yueming Jin. MGI: Multimodal contrastive pre-training of genomic and medical imaging. arXiv preprint arXiv:2406.00631, 2024.

[43] Yongshuo Zong, Oisin Mac Aodha, and Timothy Hospedales. Self-supervised multimodal learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
