Cross-Modal Iteration Distillation for Robust IHD Screening: The IDNet Framework and A New Benchmark

Hongfei Zhang; Jia Mu; Junjie Pang; Shaojie Li; Shuaiyu Yang; Xichao Jia; Yongchang Gao; Yusheng Yang

arxiv: 2606.30027 · v1 · pith:4XEZOAVVnew · submitted 2026-06-29 · 💻 cs.CV

Yongchang Gao, Junjie Pang, Shuaiyu Yang, Yusheng Yang, Xichao Jia, Shaojie Li, Hongfei Zhang, Jia Mu

Pith reviewed 2026-06-30 06:20 UTC · model grok-4.3

classification 💻 cs.CV

keywords: ischemic heart disease screening · multimodal fusion · color fundus photography · cross-modal distillation · UK Biobank benchmark · learnable queries


The pith

IDNet uses learnable queries in a Cross-Modal Distillation Aggregator to fuse left-eye, right-eye, and clinical features for improved ischemic heart disease screening.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a multimodal model called IDNet can more effectively combine color fundus photographs with sparse clinical variables than image-only or clinical-only approaches for screening ischemic heart disease. The central mechanism is the Cross-Modal Distillation Aggregator, which applies learnable queries to balance high-dimensional image data against low-dimensional tabular inputs. The authors also release a large, reproducible benchmark drawn from the UK Biobank containing 50,410 images from 25,205 subjects. If the approach holds, routine eye photography could become a practical, low-cost entry point for heart-disease risk assessment.

Core claim

IDNet with its Cross-Modal Distillation Aggregator outperforms image-only, clinical-only, and several multimodal baselines on the UK Biobank benchmark, and the aggregator functions as a plug-in module that improves performance when attached to multiple visual encoders.

What carries the argument

The Cross-Modal Distillation Aggregator (CDA), which deploys learnable queries to sequentially integrate left-eye image features, right-eye image features, and clinical variables.
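A minimal sketch of how query-based sequential fusion might look, in plain Python for illustration only. It is not the authors' implementation: the residual update, the omitted (identity) projections, the token counts, and the names `cross_attend` and `fuse` are all assumptions. What it shows is that a fixed set of queries yields a fixed-size output regardless of how many tokens each modality supplies, which is one way a handful of clinical tokens can sit alongside many image tokens.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """One residual cross-attention step: every query attends over the tokens.

    Assumption: key/value/query projections are left out (identity) to keep
    the sketch short; a trained module would learn them.
    """
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        context = [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def fuse(queries, left_tokens, right_tokens, clinical_tokens):
    """Pass the same queries through left-eye, right-eye, then clinical tokens."""
    for tokens in (left_tokens, right_tokens, clinical_tokens):
        queries = cross_attend(queries, tokens)
    return queries


rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4  # hypothetical sizes


def rand_tokens(n):
    return [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(n)]


queries = rand_tokens(NUM_QUERIES)  # stands in for learned query vectors
left, right = rand_tokens(49), rand_tokens(49)  # e.g. a 7x7 feature map per eye
clinical = rand_tokens(5)  # e.g. five embedded clinical variables

fused = fuse(queries, left, right, clinical)
print(len(fused), len(fused[0]))  # 4 8
```

The output has 4 vectors of width 8 whether a modality contributes 49 tokens or 5, so a downstream classifier sees the same shape for every input combination.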

If this is right

IDNet achieves higher screening accuracy than models that use only retinal images or only clinical data.
The CDA module can be attached to existing visual encoders to raise their multimodal performance without retraining the entire encoder.
The open UK Biobank curation pipeline supplies a standardized dataset for comparing future multimodal IHD methods.
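The benchmark's counts work out to exactly two images per subject (50,410 / 25,205 = 2), so a fair comparison of future methods would likely need splits made at the subject level, with both eyes of a person kept in the same partition. The abstract does not describe the split protocol; the sketch below is an assumption about one reasonable way to do it, and `split_by_subject`, the record fields, and the 20% test fraction are hypothetical.

```python
import random


def split_by_subject(records, test_fraction=0.2, seed=0):
    """Split image records so that no subject appears in both partitions."""
    subjects = sorted({r["subject_id"] for r in records})
    random.Random(seed).shuffle(subjects)
    n_test = round(len(subjects) * test_fraction)
    test_ids = set(subjects[:n_test])
    train = [r for r in records if r["subject_id"] not in test_ids]
    test = [r for r in records if r["subject_id"] in test_ids]
    return train, test


# Toy records: one left and one right image for each of 10 subjects.
records = [{"subject_id": s, "eye": eye} for s in range(10) for eye in ("left", "right")]
train, test = split_by_subject(records)
print(len(train), len(test))  # 16 4
```

Splitting on subject IDs rather than on images prevents one eye of a person landing in training while the other lands in testing.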

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Query-based sequential fusion may transfer to other medical tasks that combine images with sparse tabular records.
If CDA proves robust, eye clinics could add a lightweight clinical-data module to existing retinal cameras for preliminary heart-risk triage.
The benchmark size supports stratified testing across age, sex, and comorbidity subgroups to check generalization.

Load-bearing premise

The learnable queries can integrate high-dimensional visual features with low-dimensional clinical inputs without the visual features dominating or introducing bias.

What would settle it

A controlled experiment that permutes clinical variables relative to their matched images and measures whether IDNet performance drops to the level of an image-only model.
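The permutation check described above can be sketched as follows. Everything here is a toy under stated assumptions: `predict` is a hypothetical stand-in for a trained multimodal model, the features are synthetic one-dimensional scores, and the AUC is computed by direct pairwise comparison rather than with any particular library.

```python
import random


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def predict(image_feat, clinical_feat):
    # Hypothetical multimodal scorer; a real test would call the trained model.
    return 0.25 * image_feat + clinical_feat


def evaluate(image_feats, clinical_feats, labels):
    scores = [predict(i, c) for i, c in zip(image_feats, clinical_feats)]
    return auc(labels, scores)


# Synthetic data: the image feature is weakly informative, the clinical one strongly.
rng = random.Random(1)
labels = [i % 2 for i in range(200)]
image_feats = [y + rng.gauss(0, 2.0) for y in labels]
clinical_feats = [y + rng.gauss(0, 0.5) for y in labels]

# Break the image-to-clinical pairing while leaving images and labels aligned.
shuffled = clinical_feats[:]
random.Random(2).shuffle(shuffled)

matched_auc = evaluate(image_feats, clinical_feats, labels)
permuted_auc = evaluate(image_feats, shuffled, labels)
image_only_auc = auc(labels, image_feats)
print(f"matched={matched_auc:.3f} permuted={permuted_auc:.3f} image-only={image_only_auc:.3f}")
```

If the model really uses the clinical inputs, the permuted AUC should fall toward the image-only figure; if it barely moves, the visual features are doing most of the work.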

Figures

Figures reproduced from arXiv: 2606.30027 by Hongfei Zhang, Jia Mu, Junjie Pang, Shaojie Li, Shuaiyu Yang, Xichao Jia, Yongchang Gao, Yusheng Yang.

**Figure 2.** Overview of the IDNet framework, where the blue section represents the overall workflow encompassing the entire processing chain from data …

**Figure 3.** Detailed schematic of the Cross-Modal Distillation Aggregator

**Figure 4.** Interpretability Analysis. Saliency maps (Left) highlight the model’s …

Original abstract

Color Fundus Photography (CFP) offers a low-cost and non-invasive route for ischemic heart disease (IHD) screening, but current studies are limited by scarce public benchmarks and ineffective fusion of retinal images with sparse clinical variables. We propose IDNet, a multimodal framework with a Cross-Modal Distillation Aggregator (CDA) that uses learnable queries to sequentially integrate left-eye, right-eye, and clinical features, mitigating the imbalance between high-dimensional visual features and low-dimensional tabular inputs. We also construct a reproducible UK Biobank benchmark with open-source curation and quality-control pipelines, yielding 50,410 images from 25,205 subjects. On this benchmark, IDNet outperforms image-only, clinical-only, and several multimodal baselines, and CDA consistently improves multiple visual encoders as a plug-in fusion module.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

IDNet adds a query-based aggregator for fusing bilateral CFP with clinical data and releases a new UK Biobank benchmark, but the abstract contains no metrics or statistics, so the performance claims cannot be judged yet.


Hi,

The main takeaway is a new multimodal framework, IDNet, that uses a Cross-Modal Distillation Aggregator with learnable queries to combine left-eye, right-eye, and low-dimensional clinical features for IHD screening from inexpensive fundus photos. It comes with an open UK Biobank benchmark of 50,410 images and the associated curation pipelines.

The benchmark construction is the clearest positive. Releasing reproducible quality-control code and a sizable labeled set from UK Biobank gives the community something concrete to test against, which is more useful than most incremental fusion papers. The CDA design itself is a straightforward application of query-based aggregation to handle modality imbalance; treating it as a plug-in module that can be dropped onto different visual encoders is a practical choice.

The soft spot is the complete lack of numbers. The abstract claims outperformance over image-only, clinical-only, and multimodal baselines, yet supplies no AUCs, accuracies, confidence intervals, p-values, split details, or baseline re-implementation notes. Without those it is impossible to tell whether the reported gains are real, whether the new benchmark was curated in a way that favors the method, or whether the queries actually prevent visual features from dominating. The assumption that sequential query integration balances the modalities therefore stays untested on the evidence given.

This paper is aimed at researchers working on retinal biomarkers for systemic disease and on multimodal fusion for imbalanced medical data. A reader who needs a new public benchmark or wants to try query-based fusion in a similar setting could extract value from the dataset release and the aggregator description.

It deserves peer review because the benchmark is new and the application has clear healthcare relevance; the experiments will need close scrutiny on metrics, ablations, and statistical rigor, but that is exactly what referees are for.

Recommendation: send it out rather than desk-reject.

Referee Report

1 major / 1 minor

Summary. The paper proposes IDNet, a multimodal framework for ischemic heart disease (IHD) screening from color fundus photography (CFP) combined with clinical variables. It introduces a Cross-Modal Distillation Aggregator (CDA) that employs learnable queries to sequentially fuse left-eye, right-eye, and clinical features in order to mitigate imbalance between high-dimensional visual and low-dimensional tabular inputs. The work also constructs and releases a reproducible UK Biobank benchmark comprising 50,410 images from 25,205 subjects together with open-source curation and quality-control pipelines, claiming that IDNet outperforms image-only, clinical-only, and several multimodal baselines while CDA improves multiple visual encoders as a plug-in module.

Significance. If the performance claims are substantiated by detailed, statistically rigorous experiments with ablations and reproducible baselines, the open benchmark and the CDA fusion module would constitute a useful contribution to multimodal medical imaging, addressing the noted scarcity of public datasets for IHD screening from retinal images.

major comments (1)

[Abstract] The central claim that 'IDNet outperforms image-only, clinical-only, and several multimodal baselines' is asserted without any reported metrics (e.g., AUC, accuracy), error bars, statistical tests, baseline implementation details, or data-split information, making it impossible to evaluate whether the experimental results support the claim.
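To make concrete what "metrics with error bars" could mean here, the sketch below computes an AUC point estimate with a percentile bootstrap confidence interval. It is only an illustration: the function names, the 95% level, the number of resamples, and the synthetic scores are assumptions, and nothing in it reflects the paper's actual evaluation protocol.

```python
import random


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower bound, upper bound) via percentile bootstrap."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return auc(labels, scores), lower, upper


# Synthetic scores standing in for model outputs on a held-out set.
rng = random.Random(3)
labels = [i % 2 for i in range(60)]
scores = [y + rng.gauss(0, 1.0) for y in labels]

point, lower, upper = bootstrap_auc_ci(labels, scores)
print(f"AUC = {point:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Reporting the interval alongside the point estimate lets a reader judge whether a gap between two models exceeds resampling noise.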

minor comments (1)

[Abstract] The abstract refers to 'several multimodal baselines' without naming them or indicating whether they are re-implemented or taken from prior literature.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive comment on the abstract. We address the point below and will revise the manuscript accordingly.


Referee: [Abstract] The central claim that 'IDNet outperforms image-only, clinical-only, and several multimodal baselines' is asserted without any reported metrics (e.g., AUC, accuracy), error bars, statistical tests, baseline implementation details, or data-split information, making it impossible to evaluate whether the experimental results support the claim.

Authors: We agree that the abstract would be strengthened by including key quantitative results to support the performance claim. While the full manuscript provides AUC, accuracy, statistical significance tests, baseline details, and data-split information in the Experiments and Results sections (including ablations and comparisons on the 50,410-image benchmark), the abstract itself is currently limited to a qualitative statement. In the revised version, we will update the abstract to report the primary AUC improvements (with error bars where applicable) and note the data splits used.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity


The abstract and available description introduce IDNet with CDA and a new benchmark but contain no equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work. The claimed outperformance is an empirical result on the constructed benchmark rather than a derivation that reduces to its own inputs by construction. No load-bearing step matches any of the enumerated circularity patterns, so the chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities can be identified from the given information.



