
arxiv: 2604.11389 · v1 · submitted 2026-04-13 · 💻 cs.CV

ConvFormer3D-TAP: Phase/Uncertainty-Aware Front-End Fusion for Cine CMR View Classification Pipelines

Pith reviewed 2026-05-10 15:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords cine cardiac MRI · view classification · spatiotemporal deep learning · uncertainty estimation · cardiac imaging · ConvFormer · multi-clip fusion · masked reconstruction

The pith

ConvFormer3D-TAP integrates 3D convolutional tokenization and multiscale self-attention with uncertainty-weighted multi-clip fusion to classify standard cine cardiac MRI views with 96 percent validation accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a specialized deep learning architecture for identifying which of six standard views a given cine cardiac MRI sequence shows. Each view dictates which cardiac structures are visible and which quantitative measurements can be trusted, so misclassification can silently invalidate later steps such as ventricular segmentation or strain analysis. The model fuses local anatomical detail from 3D convolutions with longer-range cycle dynamics from hierarchical attention, and it is trained to reconstruct masked spatiotemporal patches while weighting clips according to estimated uncertainty. On a large real-world collection of 150,974 clinical sequences the system reaches 96 percent validation accuracy, F1 scores of at least 0.94 on every class, and well-calibrated predictions (ECE 0.025, Brier 0.040), with the remaining errors concentrated in anatomically adjacent view pairs.

Core claim

ConvFormer3D-TAP is a spatiotemporal network that combines 3D convolutional tokenization with multiscale self-attention. It is trained by masked spatiotemporal reconstruction together with uncertainty-weighted multi-clip fusion so that it learns both static anatomy and dynamic cardiac-cycle information while down-weighting ambiguous temporal segments. On 150,974 clinically acquired cine sequences spanning six standard views the model attains 96 percent validation accuracy, per-class F1 scores of 0.94 or higher, and low calibration error (ECE 0.025, Brier 0.040), with residual mistakes concentrated in long-axis and LVOT/AV pairs that share natural prescription overlap.
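For readers less familiar with the two calibration figures quoted above, the following is a minimal sketch of how ECE and a multiclass Brier score are commonly computed. The function names, the choice of 10 equal-width bins, and the Brier variant that sums squared errors over classes are assumptions made for illustration; the page does not state which binning or normalization the paper used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: sum over bins of (n_b / N) * |acc_b - conf_b|.

    confidences: top-class probability per sample.
    correct: True where the top-class prediction matched the label.
    n_bins=10 is an assumed default, not a value taken from the paper.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def multiclass_brier(probs, labels):
    """Mean squared distance between each probability vector and its one-hot label.

    Assumed variant: squared errors are summed over classes, then averaged
    over samples. Some definitions also divide by the number of classes.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(probs)
```

Lower is better for both scores; a model whose stated confidence matches its empirical accuracy in every bin has an ECE of zero.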

What carries the argument

The argument rests on ConvFormer3D-TAP itself: an architecture that performs 3D convolutional tokenization followed by multiscale self-attention, trained with masked spatiotemporal reconstruction plus uncertainty-weighted multi-clip fusion to capture complementary local and long-range cardiac cues.
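The idea of uncertainty-weighted multi-clip fusion can be illustrated with a small sketch. The weighting rule below (weights proportional to the exponential of negative predictive entropy) is an assumption chosen for illustration, and `entropy` and `fuse_clips` are hypothetical names; the paper's actual fusion rule is not described on this page.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one clip's class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fuse_clips(clip_probs):
    """Fuse per-clip probabilities into one prediction for the sequence.

    Assumed rule: each clip gets weight exp(-entropy), so confident clips
    count more and ambiguous clips count less. Weights are normalized so
    the fused vector still sums to one.
    """
    weights = [math.exp(-entropy(p)) for p in clip_probs]
    total = sum(weights)
    n_classes = len(clip_probs[0])
    return [
        sum(w * p[k] for w, p in zip(weights, clip_probs)) / total
        for k in range(n_classes)
    ]
```

With one confident clip and one near-uniform clip, the fused probability for the confident clip's class ends up above the plain average of the two clips, which is the down-weighting behavior the description refers to.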

If this is right

  • Cine sequences can be routed automatically to the correct downstream analysis pipeline without manual view labeling.
  • Quality-control filters can flag low-confidence or misclassified studies before they reach segmentation or strain modules.
  • The same front-end can scale to routine clinical volumes given its performance on more than 150,000 sequences.
  • Calibrated uncertainty scores allow ambiguous cases to be escalated to human readers rather than silently propagating errors.
  • Error patterns concentrated in anatomically adjacent views indicate that view-specific priors could be added without redesigning the core model.
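The routing and escalation behavior in the list above could be sketched as a simple confidence gate. Everything here is hypothetical: the function name `route_sequence`, the 0.90 threshold, and the placeholder view names are illustrative assumptions, not values from the paper.

```python
def route_sequence(probs, view_names, min_confidence=0.90):
    """Return ("auto", view) when the top class is confident enough,
    otherwise ("review", view) so a human reader can check it.

    min_confidence=0.90 is an assumed threshold; in practice it would be
    tuned on held-out data against the acceptable review workload.
    """
    best = max(range(len(probs)), key=lambda k: probs[k])
    decision = "auto" if probs[best] >= min_confidence else "review"
    return decision, view_names[best]


# Placeholder labels; the page does not enumerate the six view names.
names = ["view_%d" % k for k in range(6)]
print(route_sequence([0.95, 0.01, 0.01, 0.01, 0.01, 0.01], names))
print(route_sequence([0.50, 0.40, 0.025, 0.025, 0.025, 0.025], names))
```

A gate like this only helps if the confidences are well calibrated, which is why the calibration figures matter for the quality-control use case.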

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The architecture could be adapted to other dynamic cardiac modalities such as echocardiography or cardiac CT if similar spatiotemporal cues dominate.
  • Integration into an end-to-end pipeline might eliminate the need for any manual view selection step in high-volume centers.
  • The concentration of errors in adjacent long-axis views suggests that adding explicit plane-prescription metadata as an auxiliary input would be a low-cost next improvement.
  • Strong calibration opens the possibility of using the model inside larger ensemble or active-learning loops where only uncertain cases receive additional review.

Load-bearing premise

The uncertainty-weighted fusion and masked reconstruction training produce genuine robustness to scanner, protocol, and artifact variation without the large clinical cohort containing hidden selection biases or data leakage.

What would settle it

Running the trained model on an independent external cohort acquired on different scanner vendors or with altered protocols would settle it: accuracy below 90 percent or ECE above 0.05 on that cohort would falsify the robustness claim.

Figures

Figures reproduced from arXiv: 2604.11389 by Adrienne Kline, Chinmay Rane, Daniel Pittman, James Carr, Nafiseh Ghaffar Nia, Suchithra V., Vinesh Appadurai.

Figure 1: Overview of the six standardized cine CMR views used in this study. Representative frames illustrate the ...
Figure 2: Architecture of ConvFormer3D-TAP. Phase-aware sampling selects 16-frame clips from 25-frame cine loops, ...
Figure 3: Row-normalized training confusion matrix produced by the evaluation pipeline. Diagonal concentration ...
Figure 4: Row-normalized validation confusion matrix for the best-performing checkpoint. Most errors occur between ...
Figure 5: Learning curves for cine cMRI view classification. Training and validation accuracy across epochs show rapid ...
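
The row-normalized confusion matrices in Figures 3 and 4, and the per-class F1 scores quoted elsewhere on this page, follow from the same count matrix. A minimal sketch, with hypothetical function names and integer class indices assumed as labels:

```python
def confusion_matrix(true_labels, pred_labels, n_classes):
    """counts[t][p] = number of samples with true class t predicted as p."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, pred_labels):
        counts[t][p] += 1
    return counts


def row_normalize(counts):
    """Divide each row by its sum, so row t shows how true class t was distributed."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def per_class_f1(counts):
    """F1 for class k is 2*TP / (2*TP + FP + FN), read off the count matrix."""
    n = len(counts)
    scores = []
    for k in range(n):
        tp = counts[k][k]
        fn = sum(counts[k]) - tp
        fp = sum(counts[r][k] for r in range(n)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores
```

Row normalization puts per-class recall on the diagonal, which makes confusions between adjacent views visible even when class sizes differ.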

Original abstract

Reliable recognition of standard cine cardiac MRI views is essential because each view determines which cardiac anatomy is visualized and which quantitative analyses can be performed. Incorrect view identification, whether by a human reader or an automated deep learning system, can propagate errors into segmentation, volumetric assessment, strain analysis, and valve evaluation. However, accurate view classification remains challenging under routine clinical variability in scanner vendor, acquisition protocol, motion artifacts, and plane prescription. We present ConvFormer3D-TAP, a cine-specific spatiotemporal architecture that integrates 3D convolutional tokenization with multiscale self-attention. The model is trained using masked spatiotemporal reconstruction and uncertainty-weighted multi-clip fusion to enhance robustness across cardiac phases and ambiguous temporal segments. The design captures complementary cues: local anatomical structure through convolutional priors and long-range cardiac-cycle dynamics through hierarchical attention. On a cohort of 150,974 clinically acquired cine sequences spanning six standard cine cardiac MRI views, ConvFormer3D-TAP achieved 96% validation accuracy with per-class F1-scores >= 0.94 and strong calibration (ECE = 0.025; Brier = 0.040). Error analysis shows that residual confusions are concentrated in anatomically adjacent long-axis and LVOT/AV view pairs, consistent with intrinsic prescription overlap. These results support ConvFormer3D-TAP as a scalable front-end for view routing, filtering and quality control in end-to-end cMRI workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces ConvFormer3D-TAP, a 3D convolutional tokenization plus multiscale self-attention architecture for classifying six standard cine cardiac MRI views. It incorporates masked spatiotemporal reconstruction and uncertainty-weighted multi-clip fusion during training to handle phase ambiguity and temporal variability. On a cohort of 150,974 clinically acquired sequences, the model reports 96% validation accuracy, per-class F1-scores >=0.94, ECE=0.025, and Brier score=0.040, with residual errors concentrated in anatomically adjacent long-axis and LVOT/AV views.

Significance. If the reported metrics are supported by patient-disjoint splits and proper controls, the work would offer a practical, high-accuracy front-end for automated view routing and quality control in large-scale CMR pipelines, potentially reducing downstream errors in segmentation and quantification. The scale of the cohort and explicit handling of uncertainty and cardiac-phase cues are notable strengths.

major comments (2)
  1. [Abstract] The central performance claim (96% accuracy, F1>=0.94, ECE=0.025) is presented without any description of the data splitting strategy, patient count, study-disjointness, multi-center distribution, or external validation set. Because cine CMR datasets routinely contain multiple sequences per patient with correlated acquisition characteristics, the absence of explicit patient-level partitioning leaves open the possibility of leakage that would inflate both accuracy and the claimed robustness to scanner/protocol/motion variability.
  2. [Abstract] No baseline comparisons (e.g., to standard 2D/3D CNNs or vanilla transformers) or ablation studies on the masked reconstruction and uncertainty-weighted fusion components are reported. Without these controls it is impossible to determine whether the observed gains are attributable to the proposed ConvFormer3D-TAP design or to dataset scale and training tricks.
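
The leakage concern in the first comment is usually addressed by assigning whole patients, not individual sequences, to a split. The sketch below shows one common way to do that with a deterministic hash of the patient identifier. The field name `patient_id`, the 20 percent validation fraction, and both function names are assumptions for illustration; the paper's actual partitioning procedure is not described on this page.

```python
import hashlib


def assign_split(patient_id, val_fraction=0.2):
    """Map a patient to "train" or "val" deterministically from a hash of the ID.

    Every sequence from the same patient lands in the same split, so no
    patient contributes to both training and validation.
    """
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    # The first 32 bits of the digest, scaled into [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "val" if bucket < val_fraction else "train"


def split_sequences(sequences, val_fraction=0.2):
    """Group sequence records (dicts with a "patient_id" key) by split."""
    splits = {"train": [], "val": []}
    for seq in sequences:
        splits[assign_split(seq["patient_id"], val_fraction)].append(seq)
    return splits
```

Hashing keeps the assignment stable when new sequences for an existing patient arrive later, at the cost of the realized validation fraction only approximating the target.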

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. These points highlight opportunities to improve the self-contained nature of the abstract and to strengthen the evidence for our architectural contributions. We address each major comment below.

  1. Referee: [Abstract] The central performance claim (96% accuracy, F1>=0.94, ECE=0.025) is presented without any description of the data splitting strategy, patient count, study-disjointness, multi-center distribution, or external validation set. Because cine CMR datasets routinely contain multiple sequences per patient with correlated acquisition characteristics, the absence of explicit patient-level partitioning leaves open the possibility of leakage that would inflate both accuracy and the claimed robustness to scanner/protocol/motion variability.

    Authors: We agree that the abstract should be more self-contained on this point. The full manuscript details a patient-level disjoint split of the 150,974 sequences drawn from a multi-center cohort to eliminate leakage between training and validation. No separate external validation set was employed. To directly address the concern, we will revise the abstract to include a concise statement on the patient-disjoint partitioning, approximate patient count, and multi-center distribution. This will make the robustness claims transparent without altering the reported metrics.

    Revision: yes

  2. Referee: [Abstract] No baseline comparisons (e.g., to standard 2D/3D CNNs or vanilla transformers) or ablation studies on the masked reconstruction and uncertainty-weighted fusion components are reported. Without these controls it is impossible to determine whether the observed gains are attributable to the proposed ConvFormer3D-TAP design or to dataset scale and training tricks.

    Authors: We acknowledge that the absence of explicit baselines and ablations in the reported results makes it harder to attribute gains specifically to the proposed components. The current manuscript focuses on end-to-end performance but does not present these controls. In the revised version we will add a results subsection containing baseline comparisons against standard 2D/3D CNNs and vanilla transformers, together with ablations that isolate the masked spatiotemporal reconstruction and uncertainty-weighted multi-clip fusion. These additions will clarify the incremental value of the ConvFormer3D-TAP design.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical metrics from held-out validation on large cohort

The paper presents ConvFormer3D-TAP as a trained spatiotemporal model whose reported 96% accuracy, F1-scores, ECE, and Brier score are direct empirical measurements on a validation split of 150,974 sequences. No equations, parameters, or first-principles derivations are claimed; the architecture (3D conv tokenization + multiscale attention) and training (masked reconstruction + uncertainty-weighted fusion) are standard data-driven procedures whose outputs are evaluated on held-out data rather than being algebraically forced by the inputs. No self-definitional loops, fitted-input-as-prediction, or load-bearing self-citations appear in the provided text.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based solely on the abstract, the central performance claim rests on empirical training and validation rather than theoretical derivation. Key unstated assumptions include data representativeness and absence of leakage in the large cohort.

free parameters (1)
  • model architecture hyperparameters
    Layer counts, attention heads, fusion weights, and reconstruction loss coefficients are optimized during training on the dataset.
axioms (1)
  • domain assumption: The 150,974-sequence cohort is representative of routine clinical variability across vendors and protocols
    Invoked to support generalization claims for the reported accuracy and calibration.

pith-pipeline@v0.9.0 · 5590 in / 1432 out tokens · 53973 ms · 2026-05-10T15:52:57.192171+00:00 · methodology
