ConvFormer3D-TAP: Phase/Uncertainty-Aware Front-End Fusion for Cine CMR View Classification Pipelines

arxiv: 2604.11389 · v1 · submitted 2026-04-13 · cs.CV

Nafiseh Ghaffar Nia, Vinesh Appadurai, Suchithra V., Chinmay Rane, Daniel Pittman, James Carr, Adrienne Kline

Pith reviewed 2026-05-10 15:52 UTC · model grok-4.3

keywords: cine cardiac MRI, view classification, spatiotemporal deep learning, uncertainty estimation, cardiac imaging, ConvFormer, multi-clip fusion, masked reconstruction

The pith

ConvFormer3D-TAP integrates 3D convolutional tokenization and multiscale self-attention with uncertainty-weighted multi-clip fusion to classify six standard cine cardiac MRI views at 96 percent validation accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a specialized deep learning architecture for identifying which of six standard views a given cine cardiac MRI sequence shows. Each view dictates which cardiac structures are visible and which quantitative measurements can be trusted, so misclassification can silently invalidate later steps such as ventricular segmentation or strain analysis. The model fuses local anatomical detail from 3D convolutions with longer-range cycle dynamics from hierarchical attention, and it is trained to reconstruct masked spatiotemporal patches while weighting clips according to estimated uncertainty. On a large real-world collection of 150,974 clinical sequences the system reaches 96 percent validation accuracy, F1 scores of at least 0.94 on every class, and well-calibrated predictions, with the remaining errors confined to anatomically adjacent view pairs.
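To make "masked spatiotemporal patches" concrete, here is a minimal sketch of random patch masking over a token grid. The grid shape (8 x 7 x 7), the 0.5 mask ratio, and the function name `random_patch_mask` are illustrative assumptions; this summary does not state the masking scheme the paper actually uses.

```python
import random


def random_patch_mask(t_tokens, h_tokens, w_tokens, mask_ratio=0.5, seed=0):
    """Return a T x H x W nested list of booleans; True marks a masked token.

    Hypothetical helper: masks a fixed fraction of tokens uniformly at random.
    """
    total = t_tokens * h_tokens * w_tokens
    n_to_mask = int(round(total * mask_ratio))
    rng = random.Random(seed)
    masked_ids = set(rng.sample(range(total), n_to_mask))
    return [
        [
            [((t * h_tokens + h) * w_tokens + w) in masked_ids
             for w in range(w_tokens)]
            for h in range(h_tokens)
        ]
        for t in range(t_tokens)
    ]


# Assumed token grid: 8 temporal x 7 x 7 spatial tokens.
mask = random_patch_mask(8, 7, 7)
n_masked = sum(cell for frame in mask for row in frame for cell in row)
```

During reconstruction training, a model would see only the unmasked tokens and be asked to predict the content of the masked ones.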

Core claim

ConvFormer3D-TAP is a spatiotemporal network that combines 3D convolutional tokenization with multiscale self-attention. It is trained by masked spatiotemporal reconstruction together with uncertainty-weighted multi-clip fusion so that it learns both static anatomy and dynamic cardiac-cycle information while down-weighting ambiguous temporal segments. On 150,974 clinically acquired cine sequences spanning six standard views the model attains 96 percent validation accuracy, per-class F1 scores of 0.94 or higher, and low calibration error (ECE 0.025, Brier 0.040), with residual mistakes concentrated in long-axis and LVOT/AV pairs that share natural prescription overlap.
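For readers unfamiliar with the two calibration figures quoted above, the following sketch shows one common way to compute them. The equal-width 10-bin ECE and the multi-class Brier score summed over classes are assumptions; the summary does not say which variants the paper uses, and the toy inputs are made up for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def brier_score(probabilities, labels):
    """Mean over samples of the squared error between probabilities and one-hot labels."""
    total = 0.0
    for probs, label in zip(probabilities, labels):
        total += sum((p - (1.0 if k == label else 0.0)) ** 2
                     for k, p in enumerate(probs))
    return total / len(labels)


# Toy two-class example (illustrative values only).
probs = [[0.95, 0.05], [0.15, 0.85], [0.65, 0.35]]
labels = [0, 1, 1]
preds = [max(range(len(p)), key=lambda k: p[k]) for p in probs]
ece = expected_calibration_error([max(p) for p in probs],
                                 [pr == y for pr, y in zip(preds, labels)])
brier = brier_score(probs, labels)
```

Lower is better for both metrics; a perfectly calibrated, perfectly confident classifier would score 0 on each.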

What carries the argument

The argument rests on the ConvFormer3D-TAP architecture itself: 3D convolutional tokenization followed by multiscale self-attention, trained with masked spatiotemporal reconstruction plus uncertainty-weighted multi-clip fusion so that it captures complementary local and long-range cardiac cues.
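As an illustration of what uncertainty-weighted multi-clip fusion can look like, the sketch below weights each clip's class probabilities by `exp(-entropy)`, so confident clips count for more. The entropy-based weighting and the function names are assumptions chosen for clarity; the paper's actual uncertainty estimate and weighting rule are not described in this summary.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fuse_clips(clip_probs):
    """Fuse per-clip class probabilities with weights proportional to exp(-entropy).

    Hypothetical weighting rule: low-entropy (confident) clips get larger weights.
    Returns the fused probability vector and the normalized clip weights.
    """
    raw = [math.exp(-entropy(p)) for p in clip_probs]
    total = sum(raw)
    weights = [w / total for w in raw]
    n_classes = len(clip_probs[0])
    fused = [sum(w * p[k] for w, p in zip(weights, clip_probs))
             for k in range(n_classes)]
    return fused, weights


# One confident clip and one ambiguous clip (illustrative values).
fused, weights = fuse_clips([[0.90, 0.05, 0.05], [0.34, 0.33, 0.33]])
```

Compared with a plain average, the ambiguous clip pulls the fused prediction toward uniform less strongly, which is the behavior the summary describes as down-weighting ambiguous temporal segments.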

If this is right

Cine sequences can be routed automatically to the correct downstream analysis pipeline without manual view labeling.
Quality-control filters can flag low-confidence or misclassified studies before they reach segmentation or strain modules.
The same front-end can scale to routine clinical volumes given its performance on more than 150,000 sequences.
Calibrated uncertainty scores allow ambiguous cases to be escalated to human readers rather than silently propagating errors.
Error patterns limited to anatomically adjacent views indicate that view-specific priors could be added without redesigning the core model.
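The routing and escalation consequences listed above can be sketched as a simple confidence gate. The view names, the 0.9 threshold, and the function `route_sequence` are hypothetical placeholders; the summary names neither the six views in full nor any operating threshold.

```python
# Placeholder labels; the six actual view names are not listed in this summary.
VIEWS = ["view_1", "view_2", "view_3", "view_4", "view_5", "view_6"]


def route_sequence(view_probs, view_names=VIEWS, min_confidence=0.9):
    """Return ("route", view) for confident predictions, else ("escalate", view).

    min_confidence is an assumed threshold that a site would tune on its own data.
    """
    best = max(range(len(view_probs)), key=lambda k: view_probs[k])
    action = "route" if view_probs[best] >= min_confidence else "escalate"
    return action, view_names[best]


confident = route_sequence([0.97, 0.01, 0.01, 0.005, 0.003, 0.002])
ambiguous = route_sequence([0.55, 0.40, 0.02, 0.01, 0.01, 0.01])
```

A gate like this is only as trustworthy as the calibration behind it, which is why the reported ECE and Brier scores matter for the downstream claims.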

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The architecture could be adapted to other dynamic cardiac modalities such as echocardiography or cardiac CT if similar spatiotemporal cues dominate.
Integration into an end-to-end pipeline might eliminate the need for any manual view selection step in high-volume centers.
The concentration of errors in adjacent long-axis views suggests that adding explicit plane-prescription metadata as an auxiliary input would be a low-cost next improvement.
Strong calibration opens the possibility of using the model inside larger ensemble or active-learning loops where only uncertain cases receive additional review.

Load-bearing premise

The uncertainty-weighted fusion and masked reconstruction training produce genuine robustness to scanner, protocol, and artifact variation without the large clinical cohort containing hidden selection biases or data leakage.

What would settle it

Running the trained model on an independent external cohort acquired on different scanner vendors or with altered protocols and finding either accuracy below 90 percent or ECE above 0.05 would falsify the robustness claim.
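The test above amounts to a two-threshold check, sketched here. The thresholds come from the sentence above; the function name and the example numbers are illustrative, and the external-cohort values are invented, not results.

```python
def robustness_claim_survives(external_accuracy, external_ece,
                              min_accuracy=0.90, max_ece=0.05):
    """True when external-cohort metrics clear both thresholds stated above."""
    return external_accuracy >= min_accuracy and external_ece <= max_ece


# Hypothetical external-cohort outcomes (not measured results).
holds = robustness_claim_survives(0.93, 0.040)
fails_on_accuracy = robustness_claim_survives(0.88, 0.030)
fails_on_calibration = robustness_claim_survives(0.95, 0.080)
```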

Figures

Figures reproduced from arXiv: 2604.11389 by Adrienne Kline, Chinmay Rane, Daniel Pittman, James Carr, Nafiseh Ghaffar Nia, Suchithra V., Vinesh Appadurai.

**Figure 1.** Overview of the six standardized cine CMR views used in this study. Representative frames illustrate the ...

**Figure 2.** Architecture of ConvFormer3D-TAP. Phase-aware sampling selects 16-frame clips from 25-frame cine loops, ...

**Figure 3.** Row-normalized training confusion matrix produced by the evaluation pipeline. Diagonal concentration ...

**Figure 4.** Row-normalized validation confusion matrix for the best-performing checkpoint. Most errors occur between ...

**Figure 5.** Learning curves for cine cMRI view classification. Training and validation accuracy across epochs show rapid ...
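The Figure 2 caption mentions sampling 16-frame clips from 25-frame cine loops. A minimal sketch of one way to do that follows: evenly spaced indices that wrap around the loop, with the starting frame standing in for the cardiac phase offset. The cyclic wrap, the even spacing, and the function name are assumptions; the truncated caption does not describe the paper's phase-aware rule.

```python
def sample_clip_indices(n_frames=25, clip_len=16, start=0):
    """Pick clip_len roughly evenly spaced frame indices from a cyclic loop.

    Assumes the cine loop covers one cardiac cycle, so indices wrap modulo n_frames.
    """
    step = n_frames / clip_len
    return [(start + int(i * step)) % n_frames for i in range(clip_len)]


# Three clips starting at different (assumed) phase offsets of the same loop.
clips = [sample_clip_indices(start=s) for s in (0, 8, 16)]
first_clip = clips[0]
```

Sampling several clips at different offsets yields the multiple per-sequence predictions that a multi-clip fusion step would then combine.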

Original abstract

Reliable recognition of standard cine cardiac MRI views is essential because each view determines which cardiac anatomy is visualized and which quantitative analyses can be performed. Incorrect view identification, whether by a human reader or an automated deep learning system, can propagate errors into segmentation, volumetric assessment, strain analysis, and valve evaluation. However, accurate view classification remains challenging under routine clinical variability in scanner vendor, acquisition protocol, motion artifacts, and plane prescription. We present ConvFormer3D-TAP, a cine-specific spatiotemporal architecture that integrates 3D convolutional tokenization with multiscale self-attention. The model is trained using masked spatiotemporal reconstruction and uncertainty-weighted multi-clip fusion to enhance robustness across cardiac phases and ambiguous temporal segments. The design captures complementary cues: local anatomical structure through convolutional priors and long-range cardiac-cycle dynamics through hierarchical attention. On a cohort of 150,974 clinically acquired cine sequences spanning six standard cine cardiac MRI views, ConvFormer3D-TAP achieved 96% validation accuracy with per-class F1-scores >= 0.94 and strong calibration (ECE = 0.025; Brier = 0.040). Error analysis shows that residual confusions are concentrated in anatomically adjacent long-axis and LVOT/AV view pairs, consistent with intrinsic prescription overlap. These results support ConvFormer3D-TAP as a scalable front-end for view routing, filtering and quality control in end-to-end cMRI workflows.
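The error analysis in the abstract rests on row-normalized confusion matrices (Figures 3 and 4). The sketch below shows how such a matrix is built; the tiny two-class label lists are made up for illustration and the function name is an assumption.

```python
def row_normalized_confusion(true_labels, pred_labels, n_classes):
    """Confusion matrix whose row i gives the fraction of true class i predicted as each class."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, pred_labels):
        counts[t][p] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        # Rows for classes with no samples are left as zeros.
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Illustrative labels: class 0 is confused with class 1 once in four cases.
matrix = row_normalized_confusion([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 1], 2)
```

With row normalization the diagonal entries are per-class recall, so off-diagonal mass between two adjacent views shows directly how often one is mistaken for the other.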

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ConvFormer3D-TAP reports strong accuracy and calibration on a large cine CMR cohort for view classification, but the lack of any information on patient-level data splits or baselines leaves the generalization claims hard to evaluate.

The paper's main contribution is a ConvFormer3D-TAP model that combines 3D convolutional tokenization with multiscale self-attention for classifying six standard cine cardiac MRI views. It adds masked spatiotemporal reconstruction during training and uncertainty-weighted fusion across clips to handle phase ambiguity and temporal variability. On 150,974 sequences the model hits 96% accuracy, F1 scores at or above 0.94, and low calibration error. That scale and the focus on clinical robustness to scanner and protocol differences are the practical strengths here. The error patterns it flags between adjacent long-axis and LVOT views also line up with what one would expect from anatomy overlap.

What is missing is any description of how the data were split. Cine datasets commonly have multiple sequences per patient, so without explicit patient- or study-disjoint partitioning the numbers could reflect memorization rather than true robustness to motion artifacts or vendor differences. The abstract also gives no baselines, no ablation results on the uncertainty or masking components, and no external validation set. Those gaps make it difficult to judge how much the new architecture actually moves the needle over simpler 3D CNN or transformer baselines.

This work is aimed at groups building automated pipelines for cardiac MRI analysis who need a reliable front-end view classifier. It is worth sending to peer review because the task matters and the reported metrics are high enough to merit scrutiny, but referees will have to insist on the missing experimental details before any stronger claims can be accepted.

Referee Report

2 major / 0 minor

Summary. The paper introduces ConvFormer3D-TAP, a 3D convolutional tokenization plus multiscale self-attention architecture for classifying six standard cine cardiac MRI views. It incorporates masked spatiotemporal reconstruction and uncertainty-weighted multi-clip fusion during training to handle phase ambiguity and temporal variability. On a cohort of 150,974 clinically acquired sequences, the model reports 96% validation accuracy, per-class F1-scores >=0.94, ECE=0.025, and Brier score=0.040, with residual errors concentrated in anatomically adjacent long-axis and LVOT/AV views.

Significance. If the reported metrics are supported by patient-disjoint splits and proper controls, the work would offer a practical, high-accuracy front-end for automated view routing and quality control in large-scale CMR pipelines, potentially reducing downstream errors in segmentation and quantification. The scale of the cohort and explicit handling of uncertainty and cardiac-phase cues are notable strengths.

major comments (2)

[Abstract] The central performance claim (96% accuracy, F1>=0.94, ECE=0.025) is presented without any description of the data splitting strategy, patient count, study-disjointness, multi-center distribution, or external validation set. Because cine CMR datasets routinely contain multiple sequences per patient with correlated acquisition characteristics, the absence of explicit patient-level partitioning leaves open the possibility of leakage that would inflate both accuracy and the claimed robustness to scanner/protocol/motion variability.
[Abstract] No baseline comparisons (e.g., to standard 2D/3D CNNs or vanilla transformers) or ablation studies on the masked reconstruction and uncertainty-weighted fusion components are reported. Without these controls it is impossible to determine whether the observed gains are attributable to the proposed ConvFormer3D-TAP design or to dataset scale and training tricks.
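The patient-level partitioning that the first comment asks about can be sketched as follows: every sequence is assigned to a split based only on a hash of its patient identifier, so no patient contributes to both sides. The identifier format, the 20 percent validation fraction, and the function names are assumptions for illustration; this is not the authors' splitting procedure.

```python
import hashlib


def assign_split(patient_id, val_fraction=0.2):
    """Deterministically map a patient ID to "train" or "val" via a hash bucket."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "val" if bucket < val_fraction else "train"


def split_sequences(sequences, val_fraction=0.2):
    """sequences: iterable of (sequence_id, patient_id) pairs."""
    splits = {"train": [], "val": []}
    for sequence_id, patient_id in sequences:
        splits[assign_split(patient_id, val_fraction)].append(sequence_id)
    return splits


# Synthetic example: 50 hypothetical patients with 3 sequences each.
sequences = [(f"seq_{p}_{i}", f"patient_{p}") for p in range(50) for i in range(3)]
splits = split_sequences(sequences)
```

Splitting at the sequence level instead would let near-duplicate acquisitions from one patient land on both sides, which is the leakage scenario the comment describes.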

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. These points highlight opportunities to improve the self-contained nature of the abstract and to strengthen the evidence for our architectural contributions. We address each major comment below.

Referee: [Abstract] The central performance claim (96% accuracy, F1>=0.94, ECE=0.025) is presented without any description of the data splitting strategy, patient count, study-disjointness, multi-center distribution, or external validation set. Because cine CMR datasets routinely contain multiple sequences per patient with correlated acquisition characteristics, the absence of explicit patient-level partitioning leaves open the possibility of leakage that would inflate both accuracy and the claimed robustness to scanner/protocol/motion variability.

Authors: We agree that the abstract should be more self-contained on this point. The full manuscript details a patient-level disjoint split of the 150,974 sequences drawn from a multi-center cohort to eliminate leakage between training and validation. No separate external validation set was employed. To directly address the concern, we will revise the abstract to include a concise statement on the patient-disjoint partitioning, approximate patient count, and multi-center distribution. This will make the robustness claims transparent without altering the reported metrics.
Revision: yes

Referee: [Abstract] No baseline comparisons (e.g., to standard 2D/3D CNNs or vanilla transformers) or ablation studies on the masked reconstruction and uncertainty-weighted fusion components are reported. Without these controls it is impossible to determine whether the observed gains are attributable to the proposed ConvFormer3D-TAP design or to dataset scale and training tricks.

Authors: We acknowledge that the absence of explicit baselines and ablations in the reported results makes it harder to attribute gains specifically to the proposed components. The current manuscript focuses on end-to-end performance but does not present these controls. In the revised version we will add a results subsection containing baseline comparisons against standard 2D/3D CNNs and vanilla transformers, together with ablations that isolate the masked spatiotemporal reconstruction and uncertainty-weighted multi-clip fusion. These additions will clarify the incremental value of the ConvFormer3D-TAP design.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical metrics from held-out validation on large cohort

The paper presents ConvFormer3D-TAP as a trained spatiotemporal model whose reported 96% accuracy, F1-scores, ECE, and Brier score are direct empirical measurements on a validation split of 150,974 sequences. No equations, parameters, or first-principles derivations are claimed; the architecture (3D conv tokenization + multiscale attention) and training (masked reconstruction + uncertainty-weighted fusion) are standard data-driven procedures whose outputs are evaluated on held-out data rather than being algebraically forced by the inputs. No self-definitional loops, fitted-input-as-prediction, or load-bearing self-citations appear in the provided text.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the central performance claim rests on empirical training and validation rather than theoretical derivation. Key unstated assumptions include data representativeness and absence of leakage in the large cohort.

free parameters (1)

model architecture hyperparameters
Layer counts, attention heads, fusion weights, and reconstruction loss coefficients are chosen or fitted using the dataset.

axioms (1)

domain assumption: The 150,974-sequence cohort is representative of routine clinical variability across vendors and protocols
Invoked to support generalization claims for the reported accuracy and calibration.

pith-pipeline@v0.9.0 · 5590 in / 1432 out tokens · 53973 ms · 2026-05-10T15:52:57.192171+00:00 · methodology

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages

[1]

Cardiovascular diseases (cvds) fact sheet

World Health Organization. Cardiovascular diseases (cvds) fact sheet. https://www.who.int/news-room/ fact-sheets/detail/cardiovascular-diseases-(cvds), 2022

work page 2022
[2]

Yuchi Han, Tiffany Chen, Jennifer Bryant, Chiara Bucciarelli-Ducci, Christopher Dyke, Michael D Elliott, Victor A Ferrari, Matthias G Friedrich, Chris Lawton, Warren J Manning, et al. Society for cardiovascular magnetic resonance (scmr) guidance for the practice of cardiovascular magnetic resonance during the covid-19 pandemic.Journal of Cardiovascular Ma...

work page 2020
[3]

Uk biobank’s cardiovascular magnetic resonance protocol.Journal of cardiovascular magnetic resonance, 18(1):8, 2016

Steffen E Petersen, Paul M Matthews, Jane M Francis, Matthew D Robson, Filip Zemrak, Redha Boubertakh, Alistair A Young, Sarah Hudson, Peter Weale, Steve Garratt, et al. Uk biobank’s cardiovascular magnetic resonance protocol.Journal of cardiovascular magnetic resonance, 18(1):8, 2016

work page 2016
[4]

Role of cardiac magnetic resonance imaging in heart failure.Heart Failure Clinics, 17(2):207–221, 2021

Carla Contaldi, Santo Dellegrottaglie, Ciro Mauro, Francesco Ferrara, Luigia Romano, Alberto M Marra, Brigida Ranieri, Andrea Salzano, Salvatore Rega, Alessandra Scatteia, et al. Role of cardiac magnetic resonance imaging in heart failure.Heart Failure Clinics, 17(2):207–221, 2021

work page 2021
[5]

Kawakubo

Motoya et al. Kawakubo. Right ventricular conduction delay diagnosis using deep learning on cardiac mri.Journal of Magnetic Resonance Imaging, 2022

work page 2022
[6]

Automated segmentation of 3d cine cardiovascular magnetic resonance imaging.Frontiers in Cardiovascular Medicine, 10:1167500, 2023

Soroosh Tayebi Arasteh, Jennifer Romanowicz, Danielle F Pace, Polina Golland, Andrew J Powell, Andreas K Maier, Daniel Truhn, Tom Brosch, Juergen Weese, Mahshad Lotfinia, et al. Automated segmentation of 3d cine cardiovascular magnetic resonance imaging.Frontiers in Cardiovascular Medicine, 10:1167500, 2023

work page 2023
[7]

Fine-tuned convolutional neural nets for cardiac mri acquisition plane recognition.Computerized Medical Imaging and Graphics, 56:50–61, 2017

Jasmina Margeta, Antonio Criminisi, Roberto Cabrera-Lozoya, Daniel C Lee, and Nicholas Ayache. Fine-tuned convolutional neural nets for cardiac mri acquisition plane recognition.Computerized Medical Imaging and Graphics, 56:50–61, 2017. 12 Physics-Informed and Neuro-Symbolic AI in Health

work page 2017
[8]

Vincenzo Vergani, Chiara Gastaldi, Pier Giorgio Masci, et al. Deep learning for classification and selection of cine cmr images to achieve fully automated quality-controlled cmr analysis from scanner to report.Frontiers in Cardiovascular Medicine, 8:710, 2021

work page 2021
[9]

Sébastien et al. Leclerc. Deep learning for cardiac mri plane classification. InMICCAI, 2019

work page 2019
[10]

A closer look at spatiotemporal convolutions for action recognition

Du Tran et al. A closer look at spatiotemporal convolutions for action recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6450–6459, 2018

work page 2018
[11]

Ruijsink

Bram et al. Ruijsink. Deep learning for automated cardiac view classification in mri.Medical Image Analysis, 2020

work page 2020
[12]

Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification

Saining Xie et al. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. InECCV, pages 305–321, 2018

work page 2018
[13]

Long-term recurrent convolutional networks for visual recognition and description

Jeff Donahue et al. Long-term recurrent convolutional networks for visual recognition and description. InCVPR, pages 2625–2634, 2015

work page 2015
[15]

Big self-supervised models advance medical image classification

Shekoofeh Azizi et al. Big self-supervised models advance medical image classification. InICCV, 2021

work page 2021
[16]

Spatio-temporal graph convolution for video-based action recognition

Maosen Li et al. Spatio-temporal graph convolution for video-based action recognition. InAAAI, 2021

work page 2021
[17]

Deep learning-based automatic analysis of cardiac mri.Radiology, 295(3):575–584, 2020

Elad Cohen et al. Deep learning-based automatic analysis of cardiac mri.Radiology, 295(3):575–584, 2020

work page 2020
[18]

Interobserver variability of manual cardiac mri contouring.International Journal of Cardiovas- cular Imaging, 36:1821–1830, 2020

M Böttcher et al. Interobserver variability of manual cardiac mri contouring.International Journal of Cardiovas- cular Imaging, 36:1821–1830, 2020

work page 2020
[19]

Uk biobank’s cardiovascular magnetic resonance protocol, quality control and reproducibility.European Heart Journal - Cardiovascular Imaging, 23(4):400–412, 2022

Steffen E Petersen, Paul M Matthews, Jefferies M Francis, et al. Uk biobank’s cardiovascular magnetic resonance protocol, quality control and reproducibility.European Heart Journal - Cardiovascular Imaging, 23(4):400–412, 2022

work page 2022
[20]

Automated quality control and standardization of cardiac mri.Medical Image Analysis, 70:101996, 2021

Hamed Ebadi et al. Automated quality control and standardization of cardiac mri.Medical Image Analysis, 70:101996, 2021

work page 2021
[21]

Automated cardiovascular magnetic resonance image analysis with fully convolutional networks.Journal of cardiovascular magnetic resonance, 20(1):65, 2018

Wenjia Bai, Matthew Sinclair, Giacomo Tarroni, Ozan Oktay, Martin Rajchl, Ghislain Vaillant, Aaron M Lee, Nay Aung, Elena Lukaschuk, Mihir M Sanghvi, et al. Automated cardiovascular magnetic resonance image analysis with fully convolutional networks.Journal of cardiovascular magnetic resonance, 20(1):65, 2018

work page 2018
[22]

Cardiac mri segmentation with strong anatomical guarantees.IEEE Transactions on Medical Imaging, 41(6):1466– 1477, 2022

Neale Painchaud, Yasmeen Skandarani, Shane Judge, Odile Bernard, Alain Lalande, and Pierre-Marc Jodoin. Cardiac mri segmentation with strong anatomical guarantees.IEEE Transactions on Medical Imaging, 41(6):1466– 1477, 2022

work page 2022
[23]

Victor M. et al. Campello. Multi-task deep learning for workflow automation in cardiac mri.Medical Image Analysis, 2021

work page 2021
[24]

An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021

work page 2021
[25]

Is space-time attention all you need for video understanding? InInternational Conference on Machine Learning (ICML), pages 813–824, 2021

Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? InInternational Conference on Machine Learning (ICML), pages 813–824, 2021

work page 2021
[26]

Video swin transformer

Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer.arXiv preprint arXiv:2106.13230, 2022

work page arXiv 2022
[27]

Vivit: A video vision transformer

Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Luˇci´c, and Cordelia Schmid. Vivit: A video vision transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6836–6846, October 2021

work page 2021
[28]

X3d: Exploring the limits of efficient video recognition

Christoph Feichtenhofer. X3d: Exploring the limits of efficient video recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 203–213, 2020

work page 2020
[29]

Uniformer: Uni- fied transformer for efficient spatiotemporal representation learning

Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. Uniformer: Uni- fied transformer for efficient spatiotemporal representation learning. InInternational Conference on Learning Representations (ICLR), 2022. OpenReview paper

work page 2022
[30]

Multiscale vision transformers

Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichten- hofer. Multiscale vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6824–6835, October 2021. 13 Physics-Informed and Neuro-Symbolic AI in Health

work page 2021
[31]

Quo vadis, action recognition? a new model and the kinetics dataset

Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

work page 2017
[32]

A closer look at spatiotemporal convolutions for action recognition

Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018

work page 2018
[33]

Slowfast networks for video recognition

Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. InProceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019

work page 2019
[34]

Non-local neural networks

Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018

work page 2018
[35]

Tsm: Temporal shift module for efficient video understanding

Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. InProceedings of the IEEE/CVF international conference on computer vision, pages 7083–7093, 2019

work page 2019
[36]

Multiscale vision transformers

Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichten- hofer. Multiscale vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 6824–6835, 2021

work page 2021
[37]

Mvitv2: Improved multiscale vision transformers for classification and detection

Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4804–4814, 2022

work page 2022
[38]

Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Yu Qiao. Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer.arXiv preprint arXiv:2211.09552, 2022

work page arXiv 2022
[39]

Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training.Advances in neural information processing systems, 35:10078–10093, 2022

Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training.Advances in neural information processing systems, 35:10078–10093, 2022

work page 2022
[40]

Momentum contrast for unsupervised visual representation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020

work page 2020
[41]

Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning.Advances in neural information processing systems, 33:21271–21284, 2020

work page 2020
[42]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on machine learning, pages 1597–1607. PmLR, 2020

work page 2020
[43]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

work page 2021
[44]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

work page 2022
[45]

Markus Huellebrand, Matthias Ivantsits, Lennart Tautz, Sebastian Kelle, and Anja Hennemuth. A collaborative approach for the development and application of machine learning solutions for cmr-based cardiac disease classification.Frontiers in Cardiovascular Medicine, 9:829512, 2022

work page 2022
[46]

Deep learning for diagnosis of chronic myocardial infarction on nonenhanced cardiac cine mri.Radiology, 291(3):606–617, 2019

Nan Zhang, Guang Yang, Zhifan Gao, Chenchu Xu, Yanping Zhang, Rui Shi, Jennifer Keegan, Lei Xu, Heye Zhang, Zhanming Fan, and David Firmin. Deep learning for diagnosis of chronic myocardial infarction on nonenhanced cardiac cine mri.Radiology, 291(3):606–617, 2019

work page 2019
[47]

Besser, Simran Anand, Ravi Madduri, Neil Getty, Sebastian Kelle, Keigo Kawaji, Victor Mor-Avi, and Amit R

Daksh Chauhan, Emeka Anyanwu, Jacob Goes, Stephanie A. Besser, Simran Anand, Ravi Madduri, Neil Getty, Sebastian Kelle, Keigo Kawaji, Victor Mor-Avi, and Amit R. Patel. Comparison of machine learning and deep learning for view identification from cardiac magnetic resonance images.Clinical Imaging, 82:121–126, 2022

work page 2022
[48]

Jacob, Teodora Chitiboi, U

Athira J. Jacob, Teodora Chitiboi, U. Joseph Schoepf, Puneet Sharma, Jonathan Aldinger, Charles Baker, Carla Lautenschlager, Tilman Emrich, and Akos Varga-Szemes. Deep-learning-based disease classification in patients undergoing cine cardiac mri.Journal of Magnetic Resonance Imaging, 61(4):1635–1647, 2025. 14

work page 2025

[1] [1]

Cardiovascular diseases (cvds) fact sheet

World Health Organization. Cardiovascular diseases (cvds) fact sheet. https://www.who.int/news-room/ fact-sheets/detail/cardiovascular-diseases-(cvds), 2022

work page 2022

[2] [2]

Yuchi Han, Tiffany Chen, Jennifer Bryant, Chiara Bucciarelli-Ducci, Christopher Dyke, Michael D Elliott, Victor A Ferrari, Matthias G Friedrich, Chris Lawton, Warren J Manning, et al. Society for cardiovascular magnetic resonance (scmr) guidance for the practice of cardiovascular magnetic resonance during the covid-19 pandemic.Journal of Cardiovascular Ma...

work page 2020

[3] [3]

Uk biobank’s cardiovascular magnetic resonance protocol.Journal of cardiovascular magnetic resonance, 18(1):8, 2016

Steffen E Petersen, Paul M Matthews, Jane M Francis, Matthew D Robson, Filip Zemrak, Redha Boubertakh, Alistair A Young, Sarah Hudson, Peter Weale, Steve Garratt, et al. Uk biobank’s cardiovascular magnetic resonance protocol.Journal of cardiovascular magnetic resonance, 18(1):8, 2016

work page 2016

[4] Carla Contaldi, Santo Dellegrottaglie, Ciro Mauro, Francesco Ferrara, Luigia Romano, Alberto M Marra, Brigida Ranieri, Andrea Salzano, Salvatore Rega, Alessandra Scatteia, et al. Role of cardiac magnetic resonance imaging in heart failure. Heart Failure Clinics, 17(2):207–221, 2021

[5] Motoya Kawakubo et al. Right ventricular conduction delay diagnosis using deep learning on cardiac MRI. Journal of Magnetic Resonance Imaging, 2022

[6] Soroosh Tayebi Arasteh, Jennifer Romanowicz, Danielle F Pace, Polina Golland, Andrew J Powell, Andreas K Maier, Daniel Truhn, Tom Brosch, Juergen Weese, Mahshad Lotfinia, et al. Automated segmentation of 3D cine cardiovascular magnetic resonance imaging. Frontiers in Cardiovascular Medicine, 10:1167500, 2023

[7] Jasmina Margeta, Antonio Criminisi, Roberto Cabrera-Lozoya, Daniel C Lee, and Nicholas Ayache. Fine-tuned convolutional neural nets for cardiac MRI acquisition plane recognition. Computerized Medical Imaging and Graphics, 56:50–61, 2017

[8] Vincenzo Vergani, Chiara Gastaldi, Pier Giorgio Masci, et al. Deep learning for classification and selection of cine CMR images to achieve fully automated quality-controlled CMR analysis from scanner to report. Frontiers in Cardiovascular Medicine, 8:710, 2021

[9] Sébastien Leclerc et al. Deep learning for cardiac MRI plane classification. In MICCAI, 2019

[10] Du Tran et al. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6450–6459, 2018

[11] Bram Ruijsink et al. Deep learning for automated cardiac view classification in MRI. Medical Image Analysis, 2020

[12] Saining Xie et al. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In ECCV, pages 305–321, 2018

[13] Jeff Donahue et al. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634, 2015

[15] Shekoofeh Azizi et al. Big self-supervised models advance medical image classification. In ICCV, 2021

[16] Maosen Li et al. Spatio-temporal graph convolution for video-based action recognition. In AAAI, 2021

[17] Elad Cohen et al. Deep learning-based automatic analysis of cardiac MRI. Radiology, 295(3):575–584, 2020

[18] M Böttcher et al. Interobserver variability of manual cardiac MRI contouring. International Journal of Cardiovascular Imaging, 36:1821–1830, 2020

[19] Steffen E Petersen, Paul M Matthews, Jefferies M Francis, et al. UK Biobank’s cardiovascular magnetic resonance protocol, quality control and reproducibility. European Heart Journal - Cardiovascular Imaging, 23(4):400–412, 2022

[20] Hamed Ebadi et al. Automated quality control and standardization of cardiac MRI. Medical Image Analysis, 70:101996, 2021

[21] Wenjia Bai, Matthew Sinclair, Giacomo Tarroni, Ozan Oktay, Martin Rajchl, Ghislain Vaillant, Aaron M Lee, Nay Aung, Elena Lukaschuk, Mihir M Sanghvi, et al. Automated cardiovascular magnetic resonance image analysis with fully convolutional networks. Journal of Cardiovascular Magnetic Resonance, 20(1):65, 2018

[22] Neale Painchaud, Yasmeen Skandarani, Shane Judge, Odile Bernard, Alain Lalande, and Pierre-Marc Jodoin. Cardiac MRI segmentation with strong anatomical guarantees. IEEE Transactions on Medical Imaging, 41(6):1466–1477, 2022

[23] Victor M. Campello et al. Multi-task deep learning for workflow automation in cardiac MRI. Medical Image Analysis, 2021

[24] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021

[25] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In International Conference on Machine Learning (ICML), pages 813–824, 2021

[26] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video Swin Transformer. arXiv preprint arXiv:2106.13230, 2022

[27] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6836–6846, October 2021

[28] Christoph Feichtenhofer. X3D: Exploring the limits of efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 203–213, 2020

[29] Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. UniFormer: Unified transformer for efficient spatiotemporal representation learning. In International Conference on Learning Representations (ICLR), 2022. OpenReview paper

[30] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6824–6835, October 2021

[31] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

[32] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018

[33] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6202–6211, 2019

[34] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018

[35] Ji Lin, Chuang Gan, and Song Han. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7083–7093, 2019

[36] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6824–6835, 2021

[37] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. MViTv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804–4814, 2022

[38] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Limin Wang, and Yu Qiao. UniFormerV2: Spatiotemporal learning by arming image ViTs with video UniFormer. arXiv preprint arXiv:2211.09552, 2022

[39] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in Neural Information Processing Systems, 35:10078–10093, 2022

[40] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020

[41] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020

[42] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020

[43] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021

[44] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022

[45] Markus Huellebrand, Matthias Ivantsits, Lennart Tautz, Sebastian Kelle, and Anja Hennemuth. A collaborative approach for the development and application of machine learning solutions for CMR-based cardiac disease classification. Frontiers in Cardiovascular Medicine, 9:829512, 2022

[46] Nan Zhang, Guang Yang, Zhifan Gao, Chenchu Xu, Yanping Zhang, Rui Shi, Jennifer Keegan, Lei Xu, Heye Zhang, Zhanming Fan, and David Firmin. Deep learning for diagnosis of chronic myocardial infarction on nonenhanced cardiac cine MRI. Radiology, 291(3):606–617, 2019

[47] Daksh Chauhan, Emeka Anyanwu, Jacob Goes, Stephanie A. Besser, Simran Anand, Ravi Madduri, Neil Getty, Sebastian Kelle, Keigo Kawaji, Victor Mor-Avi, and Amit R. Patel. Comparison of machine learning and deep learning for view identification from cardiac magnetic resonance images. Clinical Imaging, 82:121–126, 2022

[48] Athira J. Jacob, Teodora Chitiboi, U. Joseph Schoepf, Puneet Sharma, Jonathan Aldinger, Charles Baker, Carla Lautenschlager, Tilman Emrich, and Akos Varga-Szemes. Deep-learning-based disease classification in patients undergoing cine cardiac MRI. Journal of Magnetic Resonance Imaging, 61(4):1635–1647, 2025