
arxiv: 2606.07775 · v1 · pith:PWBWF26Lnew · submitted 2026-06-05 · 💻 cs.CV

DALE-CT: Depth-Aware Foundation Models for Computed Tomography

Pith reviewed 2026-06-27 21:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords computed tomography · self-supervised learning · depth-aware pre-training · multi-abnormality detection · 2D foundation models · LeJEPA · medical imaging · multiple instance learning

The pith

A 2D slice-based CT model trained from scratch reaches 0.833 macro AUROC, near parity with 3D vision-language models on multi-abnormality detection

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops DALE-CT as a family of 2D models for CT volumes built entirely from scratch with LeJEPA self-supervised learning. It adds a depth-aware pre-training step that draws dense auxiliary signals from both automated anatomical masks and human-annotated abnormalities. When the resulting dual-supervised backbone is frozen and tested with multiple instance learning for multi-abnormality detection, it records a macro AUROC of 0.833. This score sits close to leading 3D vision-language models while using far less data and no text at all. The work therefore presents 2D slice processing as a workable substitute for native 3D architectures in volumetric CT tasks.
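
The evaluation step described above, a frozen backbone scored with multiple instance learning, can be sketched roughly as follows. The text available here does not say which MIL aggregator the paper uses, so this minimal sketch assumes max-pooling of per-slice logits from a linear head. The function names, toy embeddings, and weights are all illustrative and are not taken from the paper or its released code.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def slice_logits(embedding, weights, bias):
    """Linear head: one logit per abnormality class for a single slice embedding."""
    return [sum(w * e for w, e in zip(row, embedding)) + b
            for row, b in zip(weights, bias)]

def mil_volume_probs(slice_embeddings, weights, bias):
    """Assumed aggregator: max-pool per-slice logits over the volume, then sigmoid."""
    per_slice = [slice_logits(e, weights, bias) for e in slice_embeddings]
    n_classes = len(bias)
    pooled = [max(s[c] for s in per_slice) for c in range(n_classes)]
    return [sigmoid(z) for z in pooled]

# Toy volume: 3 slices, 2-dimensional frozen embeddings, 2 abnormality classes.
volume = [[0.0, 0.0], [2.0, 0.0], [0.0, -1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical linear-probe weights
b = [0.0, 0.0]
probs = mil_volume_probs(volume, W, b)
print(probs)
```

Max-pooling treats a volume as positive for a class when any single slice scores high, which suits focal findings; mean or attention pooling are common alternatives when findings are diffuse.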

Core claim

DALE-CT-2S, trained with the dual-supervised depth-aware strategy on the CT-RATE dataset, reaches a macro AUROC of 0.833 under linear-probe MIL evaluation for multi-abnormality detection. The paper reports this as near-parity with current state-of-the-art 3D vision-language models, despite training from scratch on less data and without any textual supervision.
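
For readers unfamiliar with the headline metric: macro AUROC is the unweighted mean of the per-class AUROC values, so rare and common abnormalities count equally. A minimal dependency-free sketch is shown below; the helper names and the toy labels and scores are made up for illustration and have no connection to the paper's data or results.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auroc(label_rows, score_rows):
    """Unweighted mean of per-class AUROC; rows are cases, columns are classes."""
    n_classes = len(label_rows[0])
    per_class = [
        auroc([row[c] for row in label_rows], [row[c] for row in score_rows])
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes

# Toy example: 4 cases, 2 abnormality classes.
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.4]]
result = macro_auroc(labels, scores)
print(result)  # class 0 gives 1.0, class 1 gives 0.75, so the mean is 0.875
```

The pairwise comparison is quadratic in the number of cases, which is fine for a sketch; a rank-based formulation is the usual choice at dataset scale.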

What carries the argument

The novel 3D depth-aware pre-training strategy that anchors 2D LeJEPA representations with dense auxiliary supervision from automated anatomical masks and human-annotated abnormalities.
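
The text here does not spell out how the dense auxiliary targets are built, so the following is only one plausible illustration, not the paper's method. It assumes a 3D label volume (for example, from an automated segmentation tool) stored as a list of 2D integer maps with 0 as background, samples a contiguous slab of slices along the depth axis, and derives a multi-hot "which structures are present" target for each slice. All names and the toy mask are hypothetical.

```python
def slice_presence(mask_slice, n_labels):
    """Multi-hot vector: entry k is 1 if label k+1 appears anywhere in the slice."""
    present = [0] * n_labels
    for row in mask_slice:
        for label in row:
            if label > 0:
                present[label - 1] = 1
    return present

def slab_targets(mask_volume, start, depth, n_labels):
    """Per-slice presence targets for `depth` adjacent slices starting at `start`."""
    slab = mask_volume[start:start + depth]
    if len(slab) < depth:
        raise ValueError("slab runs past the end of the volume")
    return [slice_presence(s, n_labels) for s in slab]

# Toy label volume: 4 slices of 2x2 pixels, 2 anatomical labels (1 and 2).
mask_volume = [
    [[0, 0], [0, 1]],
    [[1, 1], [0, 2]],
    [[0, 2], [2, 2]],
    [[0, 0], [0, 0]],
]
targets = slab_targets(mask_volume, start=1, depth=2, n_labels=2)
print(targets)  # [[1, 1], [0, 1]]
```

Because adjacent slices in a slab share or gradually change their targets, an auxiliary head trained on them gives the 2D encoder a signal that varies smoothly with depth, which is the general idea the authors' rebuttal describes.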

If this is right

  • 2D slice architectures become viable flexible alternatives to native 3D models for CT volume processing.
  • High downstream detection performance is attainable without large textual corpora or language-model integration.
  • Training entirely from scratch on moderate data volumes can approach the results of state-of-the-art 3D vision-language models.
  • Dual supervision from masks and annotations measurably improves representation quality for abnormality detection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same depth-aware auxiliary signals could improve 2D handling of other volumetric scans such as MRI.
  • Lower data requirements may allow training of custom models on smaller institutional or rare-disease datasets.
  • Public release of weights and code makes direct testing on new abnormality classes or clinical pipelines straightforward.
  • Lightweight 3D post-processing could be added later to close any remaining gap with full 3D models.

Load-bearing premise

That 2D slice processing plus the depth-aware pre-training strategy supplies enough volumetric spatial context for multi-abnormality detection to reach parity with native 3D models.

What would settle it

A native 3D model trained on the same CT-RATE data and supervision signals that substantially exceeds 0.833 macro AUROC would show the 2D approach misses critical 3D context.

Figures

Figures reproduced from arXiv: 2606.07775 by Caroline N. Leach, Emily B. Collier, Evan W. Damron, Mahmut S. Gokmen, Mitchell A. Klusty, V. K. Cody Bumgardner.

Figure 1. Overview of the DALE-CT framework for volumetric CT representation. The pipeline is divided into three primary stages: (1) [...]
Figure 2. Comparison of Latent Continuity between Depth-Aware slab sampling [...]

Original abstract

Recent breakthroughs in self-supervised learning (SSL), such as the Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA), alongside successes in integrating visual encoders with language models, have driven the demand for adaptable, high-capacity vision encoders in Computed Tomography (CT). In this work, we explore 2D slice-based architectures as a flexible alternative to native 3D models for processing volumetric CT data. Using the CT-RATE dataset, we trained DALE-CT (Depth-Aware Latent-Euclidean Computed Tomography), a 2D model family built entirely from scratch using LeJEPA, and compared its performance against a continually pre-trained DINOv2 baseline. To enhance representation quality, we developed a novel 3D depth-aware pre-training strategy anchored by dense auxiliary supervision from both automated anatomical masks and human-annotated abnormalities. Under linear probe evaluation with Multiple Instance Learning (MIL) for multi-abnormality detection, the frozen backbone of this dual-supervised model (DALE-CT-2S) achieves a Macro AUROC of 0.833. This performance demonstrates near-parity with state-of-the-art 3D vision-language models, achieved entirely from scratch with significantly less data and no textual supervision. To ensure reproducibility, all training code, evaluation scripts, and model weights have been made publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DALE-CT, a family of 2D slice-based models for CT volumes trained from scratch on the CT-RATE dataset using LeJEPA self-supervised learning. A novel depth-aware pre-training strategy is proposed that incorporates dense auxiliary supervision from automated anatomical masks and human-annotated abnormalities. The central empirical result is that the frozen dual-supervised backbone (DALE-CT-2S) achieves a Macro AUROC of 0.833 under linear probing with Multiple Instance Learning for multi-abnormality detection, reported as near-parity with state-of-the-art 3D vision-language models while using less data and no textual supervision. All code, evaluation scripts, and model weights are released publicly.

Significance. If the reported performance holds and the depth-aware component is shown to be responsible, the result would be significant for the field: it would establish that carefully supervised 2D slice encoders can deliver competitive volumetric CT performance without native 3D architectures or language supervision, at lower data and compute cost. The public release of code and weights is a clear strength that supports reproducibility and follow-on work.

major comments (2)
  1. [Abstract] The claim that the depth-aware pre-training strategy (LeJEPA plus automated masks and abnormality annotations) supplies the inter-slice volumetric context needed for 2D slices to reach near-parity with native 3D models is load-bearing for the central result, yet the manuscript supplies neither an ablation isolating this component versus standard slice-wise LeJEPA nor any explicit mechanism describing how depth information is injected into the 2D encoder.
  2. [Abstract] The Macro AUROC of 0.833 is presented without statistical tests, confidence intervals, variance across runs, or a table of direct numerical comparisons against the specific 3D vision-language baselines referenced, preventing assessment of whether the near-parity conclusion is robust.
minor comments (2)
  1. [Abstract] The abstract states the model was trained 'entirely from scratch with significantly less data' but provides no quantitative comparison of dataset size or compute against the referenced 3D VL models.
  2. No training curves, hyperparameter details, or ablation tables are referenced in the supplied text, which would be needed to substantiate the contribution of the dual-supervision signals.
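
The second major comment asks for confidence intervals on the macro AUROC. One common way to obtain them, sketched here under stated assumptions, is a case-level percentile bootstrap: resample test cases with replacement, recompute the metric, and read off the empirical quantiles. This is a generic technique, not a procedure taken from the paper; the function names, the toy data, and the default of 200 resamples are illustrative only.

```python
import random

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_auroc(label_rows, score_rows):
    n_classes = len(label_rows[0])
    return sum(
        auroc([r[c] for r in label_rows], [r[c] for r in score_rows])
        for c in range(n_classes)
    ) / n_classes

def bootstrap_ci(label_rows, score_rows, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for macro AUROC, resampling whole cases."""
    rng = random.Random(seed)
    n = len(label_rows)
    stats, attempts = [], 0
    while len(stats) < n_boot:
        attempts += 1
        if attempts > 50 * n_boot:
            raise RuntimeError("too many degenerate resamples")
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(macro_auroc([label_rows[i] for i in idx],
                                     [score_rows[i] for i in idx]))
        except ValueError:
            continue  # a class lost all positives or negatives; draw again
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

# Toy example: 8 cases, 2 abnormality classes (values are made up).
labels = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.4],
          [0.7, 0.1], [0.4, 0.9], [0.6, 0.6], [0.2, 0.5]]
lower, upper = bootstrap_ci(labels, scores)
print(lower, upper)
```

Resampling whole cases rather than individual labels keeps the correlation between abnormalities within a scan intact. Variance across training runs, which the referee also requests, is a separate quantity and would require retraining with different seeds.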

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of results and claims.

Point-by-point responses
  1. Referee: [Abstract] The claim that the depth-aware pre-training strategy (LeJEPA plus automated masks and abnormality annotations) supplies the inter-slice volumetric context needed for 2D slices to reach near-parity with native 3D models is load-bearing for the central result, yet the manuscript supplies neither an ablation isolating this component versus standard slice-wise LeJEPA nor any explicit mechanism describing how depth information is injected into the 2D encoder.

    Authors: We agree that an explicit ablation isolating the depth-aware component from standard LeJEPA would strengthen the central claim. We will add this ablation to the revised manuscript. The mechanism for injecting depth information is the use of dense 3D anatomical masks and abnormality annotations as auxiliary supervision targets during pre-training; these targets are derived from the full volume and thereby encourage the 2D encoder to learn representations that respect inter-slice relationships. We will expand the methods section to describe this process more explicitly.

    Revision: yes

  2. Referee: [Abstract] The Macro AUROC of 0.833 is presented without statistical tests, confidence intervals, variance across runs, or a table of direct numerical comparisons against the specific 3D vision-language baselines referenced, preventing assessment of whether the near-parity conclusion is robust.

    Authors: We acknowledge that the current presentation lacks the requested statistical details and direct comparisons. In the revision we will report confidence intervals, standard deviation across multiple runs, and a table with numerical results against the referenced 3D baselines to allow readers to assess the robustness of the near-parity claim.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: central result is measured AUROC on held-out data

Full rationale

The paper's primary claim is an empirical Macro AUROC of 0.833 obtained via linear-probe evaluation with MIL on held-out CT-RATE data for the frozen DALE-CT-2S backbone. This is a direct performance measurement against external labels, not an output of any equation or fitted parameter defined inside the paper. The pre-training description (LeJEPA with auxiliary anatomical masks) supplies the model weights but does not contain a derivation chain in which the reported AUROC reduces to a self-referential fit or self-citation. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the supplied text; the result remains falsifiable on independent test data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations or training details, so free parameters, axioms, and invented entities cannot be enumerated; the LeJEPA backbone and CT-RATE dataset are treated as external inputs.

pith-pipeline@v0.9.1-grok · 5792 in / 1227 out tokens · 16627 ms · 2026-06-27T21:59:36.545857+00:00 · methodology

