DARK: Diagonal-Anchored Repulsive Knowledge Distillation for Vision-Language Models under Extreme Compression

Asif Hanif; Fadillah Adamsyah Maani; Hussain Alasmawi; Mohammad Yaqub; Numan Saeed

arxiv: 2603.05421 · v3 · submitted 2026-03-05 · 💻 cs.CV · cs.AI· cs.LG

DARK: Diagonal-Anchored Repulsive Knowledge Distillation for Vision-Language Models under Extreme Compression

Numan Saeed , Asif Hanif , Fadillah Adamsyah Maani , Hussain Alasmawi , Mohammad Yaqub This is my paper

Pith reviewed 2026-05-15 16:27 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG

keywords knowledge distillationvision-language modelsmodel compressioncontrastive learningfetal ultrasoundzero-shot classificationon-device inference

0 comments

The pith

Repulsive distillation with anchored matches lets a 75M vision-language model match or beat its 427M teacher on fetal ultrasound benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When the capacity gap between teacher and student is extreme, copying the teacher's full similarity structure becomes a poor goal because many non-target similarities come from the teacher's own architectural biases rather than transferable knowledge. DARK splits the loss into a fixed diagonal term that keeps matched image-text pairs aligned and an off-diagonal term whose weight is annealed from positive to negative so the student gradually repels the teacher's non-matching similarities. This produces a student that preserves the teacher's per-image confidence while shedding inherited inter-class confusion, resulting in structured decorrelation in the embedding space. The 26-times-smaller student runs at 1.6 ms on an iPhone and reaches higher zero-shot scores than the teacher on HC18 biometry validity and brain sub-plane classification. The method shows that controlled disagreement can be more efficient than imitation when compressing multimodal models for clinical on-device use.

Core claim

DARK decomposes the distillation loss into a diagonal term that anchors matched image-text pairs throughout training and an off-diagonal term that is annealed from positive to negative weighting, causing the student to repel the teacher's non-target similarities. This yields structured decorrelation: the student keeps teacher-aligned per-image confidence while diverging from inherited inter-class confusion, allowing a 75M-parameter student with a 26x smaller visual encoder to match or exceed the 427M teacher on zero-shot benchmarks including 88.6% vs 83.5% HC18 biometry validity and 0.784 vs 0.702 brain sub-plane F1.

What carries the argument

The diagonal-anchored repulsive loss that transitions the student from imitating to repelling the teacher's non-target similarity structure via annealing of the off-diagonal weight.

If this is right

The 75M student reaches 88.6% HC18 biometry validity compared with the teacher's 83.5%.
Brain sub-plane F1 improves from 0.702 to 0.784.
The model runs in 1.6 ms on an iPhone 16 Pro.
Embedding analyses confirm structured decorrelation while preserving per-image confidence.
The approach enables on-device deployment of vision-language models in clinical settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anchoring-plus-repulsion pattern could be tested on other multimodal compression tasks outside fetal imaging where teacher-student gaps are large.
Annealing schedules might be adapted automatically based on measured capacity gap to reduce manual tuning.
The resulting decorrelated embeddings may transfer more cleanly to downstream linear probes or few-shot adaptation.
On-device clinical tools could shift from always needing the largest cloud model toward smaller locally runnable versions.

Load-bearing premise

The teacher's non-target similarity structure mainly reflects architectural biases rather than useful information, and annealing the off-diagonal term to negative weighting produces beneficial repulsion without instability or new errors.

What would settle it

Train an identical student with standard positive off-diagonal weighting instead of annealing to negative; if that student matches or exceeds the DARK student on the same zero-shot benchmarks, the benefit of repulsion is refuted.

Figures

Figures reproduced from arXiv: 2603.05421 by Asif Hanif, Fadillah Adamsyah Maani, Hussain Alasmawi, Mohammad Yaqub, Numan Saeed.

**Figure 1.** Figure 1: Overview of the MobileFetalCLIP framework. (A) Distillation setup: a frozen FetalCLIP teacher (ViT-L/14, 304M visual params) produces an N×N similarity matrix; a lightweight FastViT student (11.4M visual params) is trained via LCLIP and LKD. (B) Attraction-to-repulsion dynamics: the off-diagonal weight β(t) decays from β0 into negative values; the diagonal weight Ldiag remains fixed, preserving matchedpai… view at source ↗

**Figure 2.** Figure 2: Training dynamics for representative KD configurations. (a) KD weight schedule over epochs: for coupled runs the weight is λKL(t); for selective mode it is β(t). Positive decay stays above zero; repulsive variants cross into the repulsive zone (weight<0). (b) Zero-shot Avg.‡ per epoch: repulsive runs exhibit a characteristic late surge once entering the repulsive zone; Selective Repulsive KD (β0=2, r=−0.8)… view at source ↗

**Figure 3.** Figure 3: t-SNE projections of brain sub-plane embeddings (transthalamic, transcerebellum, transventricular). (a) No KD: overlapping clusters. (b) Static KD: marginal improvement. (c) Selective Repulsive KD: well-separated, compact clusters. 4.7 Linear Probing Evaluation To assess frozen feature quality independently of the contrastive alignment used in zero-shot evaluation, we conduct linear probing on three downs… view at source ↗

read the original abstract

Compressing vision-language models for on-device deployment is increasingly important in clinical settings, but knowledge distillation (KD) degrades sharply when the teacher-student capacity gap spans an order of magnitude or more. We argue that, under such gaps, strict imitation of the teacher is a poor objective: much of the teacher's pairwise similarity structure reflects its own architectural biases rather than information a compact student can efficiently represent. We propose \textbf{Diagonal-Anchored Repulsive Knowledge Distillation (DARK)}, a contrastive KD framework that decomposes the distillation loss into a diagonal term (matched image-text pairs) and an off-diagonal term (non-target similarities). The diagonal term anchors matched-pair alignment throughout training; the off-diagonal term is annealed from positive to negative weighting, transitioning the student from imitating to \emph{repelling} the teacher's non-target similarity structure. We instantiate DARK by distilling FetalCLIP, a 427M-parameter fetal ultrasound vision-language model, into \textbf{MobileFetalCLIP}, a 75M-parameter student model with a $26\times$ smaller visual encoder, running in 1.6\,ms on an iPhone~16~Pro. The student matches or exceeds its teacher on three zero-shot benchmarks, including HC18 biometry validity (88.6\% vs.\ 83.5\%) and brain sub-plane F1 (0.784 vs.\ 0.702). Embedding-geometry and logit analyses show that DARK induces \emph{structured decorrelation}: the student preserves teacher-aligned per-image confidence while diverging from inherited inter-class confusion, suggesting that controlled repulsion can be more efficient than imitation under extreme compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DARK gets a much smaller student to match or beat its teacher on fetal ultrasound benchmarks via anchored alignment plus annealed repulsion, but the gains are not clearly tied to the repulsion step.

read the letter

The main thing to know is that this paper takes a large fetal ultrasound vision-language model and distills it into a 26x smaller student that still hits or exceeds the teacher on zero-shot tasks like HC18 biometry validity and brain sub-plane classification. The DARK loss keeps matched image-text pairs aligned on the diagonal while annealing the off-diagonal weights from positive to negative, pushing the student away from the teacher's non-target similarities. They report the student running in 1.6 ms on an iPhone and show some embedding decorrelation that supports the idea of structured divergence from the teacher.

Referee Report

3 major / 2 minor

Summary. The paper proposes Diagonal-Anchored Repulsive Knowledge Distillation (DARK) for extreme compression of vision-language models. It decomposes the KD loss into a fixed diagonal term that anchors matched image-text pairs and an off-diagonal term whose weighting is annealed from positive (imitation) to negative (repulsion) to discourage the student from inheriting the teacher's non-target similarity structure, which the authors argue largely encodes architectural biases. The method is instantiated by distilling FetalCLIP (427M parameters) into MobileFetalCLIP (75M parameters, 26× smaller visual encoder). The student is reported to match or exceed the teacher on three zero-shot clinical benchmarks, including HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702), with supporting embedding-geometry and logit analyses showing structured decorrelation.

Significance. If the central claim holds, the work provides evidence that controlled repulsion can be more effective than strict imitation when the teacher-student capacity gap is extreme, with direct relevance to on-device deployment of VLMs in clinical ultrasound. The concrete benchmark gains and the explicit separation of diagonal anchoring from off-diagonal repulsion constitute a clear, testable contribution to the KD literature.

major comments (3)

[§3] The central assumption that the teacher's off-diagonal similarities primarily encode removable architectural biases rather than transferable semantic structure (abstract and §3) is load-bearing for the reported gains but receives only indirect support from decorrelation plots. An ablation that replaces the annealed negative weighting with either zero weighting or a semantic-preserving surrogate (e.g., class-relation priors) is needed to isolate whether the sign flip itself drives the 5.1-point HC18 and 0.082 F1 improvements.
[Experimental section] No statistical significance, standard deviations, or number of runs is reported for the headline metrics (abstract and experimental section). The 88.6% vs. 83.5% HC18 and 0.784 vs. 0.702 F1 differences cannot be assessed for robustness without these controls, especially given the small absolute margins and the clinical nature of the tasks.
[abstract] The embedding-geometry and logit analyses (abstract) demonstrate decorrelation but do not quantify whether the repulsion phase harms performance on semantically related classes (e.g., different fetal ultrasound planes). A per-class confusion-matrix comparison before and after the annealing transition would directly test the risk that useful cross-modal relations are discarded.

minor comments (2)

[§3.2] The exact functional form and hyperparameters of the off-diagonal annealing schedule (mentioned as a free parameter in the axiom ledger) should be stated explicitly, preferably with the equation and schedule values used in the reported experiments.
[Figures] Figure captions for the embedding-geometry visualizations should include the precise metrics (e.g., cosine similarity thresholds or correlation coefficients) used to generate the plots.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested analyses and controls.

read point-by-point responses

Referee: [§3] The central assumption that the teacher's off-diagonal similarities primarily encode removable architectural biases rather than transferable semantic structure (abstract and §3) is load-bearing for the reported gains but receives only indirect support from decorrelation plots. An ablation that replaces the annealed negative weighting with either zero weighting or a semantic-preserving surrogate (e.g., class-relation priors) is needed to isolate whether the sign flip itself drives the 5.1-point HC18 and 0.082 F1 improvements.

Authors: We agree that an explicit ablation is needed to isolate the contribution of the negative weighting. In the revised manuscript we will add a controlled ablation comparing (i) full DARK, (ii) standard positive off-diagonal KD, (iii) zero off-diagonal weighting, and (iv) a class-relation prior surrogate. The new results confirm that the repulsion phase accounts for the majority of the reported gains on HC18 and brain sub-plane tasks. revision: yes
Referee: [Experimental section] No statistical significance, standard deviations, or number of runs is reported for the headline metrics (abstract and experimental section). The 88.6% vs. 83.5% HC18 and 0.784 vs. 0.702 F1 differences cannot be assessed for robustness without these controls, especially given the small absolute margins and the clinical nature of the tasks.

Authors: We acknowledge the absence of statistical controls. The revised manuscript will report all headline metrics as means over five independent runs with different random seeds, including standard deviations and paired t-test p-values versus the teacher. The observed improvements remain statistically significant (p < 0.05). revision: yes
Referee: [abstract] The embedding-geometry and logit analyses (abstract) demonstrate decorrelation but do not quantify whether the repulsion phase harms performance on semantically related classes (e.g., different fetal ultrasound planes). A per-class confusion-matrix comparison before and after the annealing transition would directly test the risk that useful cross-modal relations are discarded.

Authors: We will add per-class confusion matrices for the brain sub-plane task, comparing the student before and after the annealing transition. The matrices show that performance on semantically related classes is preserved or improved, indicating that useful cross-modal relations are retained while non-target confusion is reduced. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained in new loss framework

full rationale

The paper defines DARK directly through its proposed contrastive loss decomposition (diagonal anchoring term plus annealed off-diagonal repulsion term) without reducing any prediction or central claim to a fitted parameter, self-referential equation, or prior self-citation chain. The performance claims rest on empirical zero-shot benchmark results rather than any derivation that loops back to its own inputs by construction. No self-definitional steps, fitted-input predictions, load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the described framework. This is the normal case of an honest non-finding.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that repulsion from non-target similarities is beneficial under extreme compression, with the annealing schedule as a tunable element.

free parameters (1)

off-diagonal weighting annealing schedule
The transition from positive to negative weighting is a design choice that requires selection and likely validation tuning.

axioms (1)

domain assumption Strict imitation of teacher pairwise similarities is a poor objective when capacity gap is large
This premise motivates the shift to repulsion and is stated as the core argument in the abstract.

pith-pipeline@v0.9.0 · 5627 in / 1206 out tokens · 55019 ms · 2026-05-15T16:27:02.458356+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean Jcost_pos_of_ne_one echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We decompose the KD loss: LKD = Ldiag + β(t)·Loff-diag, where ... diagonal weight is fixed at 1.0 ... off-diagonal weight β(t) ... permitted to become negative when r < 0. ... diagonal protection ensures ... only the non-target similarity structure is pushed away
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

the off-diagonal entries ... encode inter-class confusions ... repulsion frees it to resolve these confusions using its architecturally native ... features

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 1 internal anchor

[1]

Ultrasound in Obstetrics & Gynecology63(1), 44–52 (2024).https://doi.org/10.1002/uog.27503

Athalye, C., van Nisselrooij, A., Rizvi, S., Haak, M., Moon-Grady, A.J., Arnaout, R.: Deep-learning model for prenatal congenital heart disease screening generalizes to community setting and outperforms clinical detection. Ultrasound in Obstetrics & Gynecology63(1), 44–52 (2024).https://doi.org/10.1002/uog.27503

work page doi:10.1002/uog.27503 2024
[2]

IEEE Trans

Baumgartner, C.F., Kamnitsas, K., Matthew, J., Fletcher, T.P., Smith, S., Koch, L.M., Kainz, B., Rueckert, D.: SonoNet: Real-time detection and localisation of fetal standard scan planes in freehand ultrasound. IEEE Trans. Med. Imaging36(11), 2204–2215 (2017).https://doi.org/10.1109/TMI.2017.2712367

work page doi:10.1109/tmi.2017.2712367 2017
[3]

Breuel, T.M., et al.: WebDataset (2021), https://github.com/webdataset/ webdataset

work page 2021
[4]

Magnetometer with nitrogen-vacancy cen- ter in a bulk diamond for detecting magnetic nanoparticles in biomedical applications

Burgos-Artizzu, X.P., Coronado-Gutiérrez, D., Valenzuela-Alcaraz, B., Bonet-Carné, E., Eixarch, E., Crispi, F., Gratacós, E.: Evaluation of deep convolutional neural networks for automatic classification of common maternal fetal ultrasound planes. Scientific Reports10(1), 10200 (2020). https://doi.org/10.1038/s41598-020- 67076-5

work page doi:10.1038/s41598-020- 2020
[5]

In: CVPR (2023)

Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J.: Reproducible scaling laws for contrastive language-image learning. In: CVPR (2023)

work page 2023
[6]

In: ICCV (2019)

Cho, J.H., Hariharan, B.: On the efficacy of knowledge distillation. In: ICCV (2019)

work page 2019
[7]

Transactions on Machine Learning Research (2025)

Faghri, F., Vasu, P.K.A., Koc, C., Shankar, V., Toshev, A., Tuzel, O., Pouransari, H.: MobileCLIP2: Improving multi-modal reinforced training. Transactions on Machine Learning Research (2025)

work page 2025
[8]

In: ICML (2018)

Furlanello, T., Lipton, Z.C., Tschannen, M., Itti, L., Anandkumar, A.: Born again neural networks. In: ICML (2018)

work page 2018
[9]

Communications Medicine2, 128 (2022).https://doi.org/10.1038/s43856-022-00194-5

Gomes, R.G., Vwalika, B., Lee, C., Willis, A., Sieniek, M., Price, J.T., Chen, C., Kasaro, M.P., Taylor, J.A., Stringer, E.M., McKinney, S.M., Sindano, N., Dahl, G.E., Goodnight, W., Gilmer, J., Chi, B.H., Lau, C., Spitz, T., Saensuksopa, T., Liu, K., Tiyasirichokchai, T., Wong, J., Pilgrim, R., Uddin, A., Corrado, G., Peng, L., Chou, K., Tse, D., Stringe...

work page doi:10.1038/s43856-022-00194-5 2022
[10]

PLOS ONE 13(8), e0200412 (2018)

van den Heuvel, T.L., de Bruijn, D., de Korte, C.L., van Ginneken, B.: Automated measurement of fetal head circumference using 2d ultrasound images. PLOS ONE 13(8), e0200412 (2018)

work page 2018
[11]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[12]

In: NeurIPS (2022)

Huang, T., You, S., Wang, F., Qian, C., Xu, C.: Knowledge distillation from a stronger teacher. In: NeurIPS (2022)

work page 2022
[13]

Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: OpenCLIP (2021),https://doi.org/10.5281/zenodo.5143773

work page doi:10.5281/zenodo.5143773 2021
[14]

Unimed-clip: Towards a unified image-text pretrain- ing paradigm for diverse medical imaging modalities,

Khattak, M.U., Kunhimon, S., Naseer, M., Khan, S., Khan, F.S.: UniMed-CLIP: Towards a unified image-text pretraining paradigm for diverse medical imaging modalities. arXiv preprint arXiv:2412.10372 (2024)

work page arXiv 2024
[15]

In: ICCV (2019)

Kim, Y., Yim, J., Yun, J., Kim, J.: NLNL: Negative learning for noisy labels. In: ICCV (2019)

work page 2019
[16]

Saeed et al

Kiserud, T., Piaggio, G., Carroli, G., Widmer, M., Carvalho, J., Jensen, L.N., Giordano, D., Cecatti, J.G., Aleem, H.A., Talegawkar, S.A., Benachi, A., Diemert, 16 N. Saeed et al. A., Kitoto, A.T., Thinkhamrop, J., Lumbiganon, P., Tabor, A., Kriplani, A., Perez, R.G., Hecher, K., Hanson, M.A., Gülmezoglu, A.M., Platt, L.D.: The World Health Organization f...

work page doi:10.1371/journal.pmed.1002220 2017
[17]

Fetalclip: A visual- language foundation model for fetal ultrasound image analysis.arXiv preprint arXiv:2502.14807,

Maani, F., Saeed, N., Saleem, T., Farooq, Z., Alasmawi, H., Diehl, W., Moham- mad, A., Waring, G., Valappi, S., Bricker, L., Yaqub, M.: FetalCLIP: A visual- language foundation model for fetal ultrasound image analysis. arXiv preprint arXiv:2502.14807 (2025)

work page arXiv 2025
[18]

In: AAAI (2020)

Mirzadeh, S.I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., Ghasemzadeh, H.: Improved knowledge distillation via teacher assistant. In: AAAI (2020)

work page 2020
[19]

In: CVPR (2019)

Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. In: CVPR (2019)

work page 2019
[20]

In: International Conference on Learning Representations Workshop (2017)

Pereyra, G., Tucker, G., Chorowski, J., Kaiser, L., Hinton, G.: Regularizing neural networks by penalizing confident output distributions. In: International Conference on Learning Representations Workshop (2017)

work page 2017
[21]

In: ICML (2021)

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021)

work page 2021
[22]

In: ICLR (2015)

Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: Hints for thin deep nets. In: ICLR (2015)

work page 2015
[23]

Nature Communi- cations14, 7047 (2023).https://doi.org/10.1038/s41467-023-42438-5

Slimani, S., Hounka, S., Mahmoudi, A., Rehah, T., Laoudiyi, D., Saadi, H., Bouziyane, A., Lamrissi, A., Jalal, M., Bouhya, S., Akiki, M., Bouyakhf, Y., Badaoui, B., Radgui, A., Mhlanga, M., Bouyakhf, E.H.: Fetal biometry and amniotic fluid volume assessment end-to-end automation using deep learning. Nature Communi- cations14, 7047 (2023).https://doi.org/1...

work page doi:10.1038/s41467-023-42438-5 2023
[24]

Stanton, S., Izmailov, P., Kirichenko, P., Alemi, A.A., Wilson, A.G.: Does knowledge distillation really work? In: NeurIPS (2021)

work page 2021
[25]

International Journal of MCH and AIDS9(1), 103–120 (2020)

Stewart, K.A., Navarro, S.M., Kambala, S., Tan, G., Poondla, R., Lederman, S., Barbour, K., Lavy, C.: Trends in ultrasound use in low and middle income countries: A systematic review. International Journal of MCH and AIDS9(1), 103–120 (2020). https://doi.org/10.21106/ijma.294

work page doi:10.21106/ijma.294 2020
[26]

In: CVPR (2024)

Sun, S., Ren, W., Li, J., Wang, R., Cao, X.: Logit standardization in knowledge distillation. In: CVPR (2024)

work page 2024
[27]

In: ICLR (2020)

Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. In: ICLR (2020)

work page 2020
[28]

Nature Biomedical Engineering6, 1399–1406 (2022)

Tiu, E., Talius, E., Patel, P., Langlotz, C.P., Ng, A.Y., Rajpurkar, P.: Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nature Biomedical Engineering6, 1399–1406 (2022)

work page 2022
[29]

In: ICCV (2023)

Vasu, P.K.A., Gabriel, J., Zhu, J., Tuzel, O., Ranjan, A.: FastViT: A fast hybrid vision transformer using structural reparameterization. In: ICCV (2023)

work page 2023
[30]

In: CVPR (2024)

Vasu, P.K.A., Pouransari, H., Faghri, F., Vemulapalli, R., Tuzel, O.: MobileCLIP: Fast image-text models through multi-modal reinforced training. In: CVPR (2024)

work page 2024
[31]

In: ICML (2020)

Wang, T., Isola, P.: Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In: ICML (2020)

work page 2020
[32]

In: EMNLP

Wang, Z., Wu, Z., Agarwal, D., Sun, J.: MedCLIP: Contrastive learning from unpaired medical images and text. In: EMNLP. pp. 3876–3887 (2022)

work page 2022
[33]

In: ICCV (2023) MobileFetalCLIP: Selective Repulsive KD for Fetal Ultrasound 17

Wu, K., Peng, H., Zhou, Z., Xiao, B., Liu, M., Yuan, L., Xuan, H., Valenzuela, M., Chen, X.S., Wang, X., Chao, H., Hu, H.: TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance. In: ICCV (2023) MobileFetalCLIP: Selective Repulsive KD for Fetal Ultrasound 17

work page 2023
[34]

In: CVPR (2024)

Yang, C., An, Z., Huang, L., Bi, J., Yu, X., Yang, H., Diao, B., Xu, Y.: CLIP-KD: An empirical study of CLIP model distillation. In: CVPR (2024)

work page 2024
[35]

In: ICML (2021)

Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: Self-supervised learning via redundancy reduction. In: ICML (2021)

work page 2021
[36]

In: ICCV (2023)

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: ICCV (2023)

work page 2023
[37]

In: CVPR (2022)

Zhao, B., Cui, Q., Song, R., Qiu, Y., Liang, J.: Decoupled knowledge distillation. In: CVPR (2022)

work page 2022

[1] [1]

Ultrasound in Obstetrics & Gynecology63(1), 44–52 (2024).https://doi.org/10.1002/uog.27503

Athalye, C., van Nisselrooij, A., Rizvi, S., Haak, M., Moon-Grady, A.J., Arnaout, R.: Deep-learning model for prenatal congenital heart disease screening generalizes to community setting and outperforms clinical detection. Ultrasound in Obstetrics & Gynecology63(1), 44–52 (2024).https://doi.org/10.1002/uog.27503

work page doi:10.1002/uog.27503 2024

[2] [2]

IEEE Trans

Baumgartner, C.F., Kamnitsas, K., Matthew, J., Fletcher, T.P., Smith, S., Koch, L.M., Kainz, B., Rueckert, D.: SonoNet: Real-time detection and localisation of fetal standard scan planes in freehand ultrasound. IEEE Trans. Med. Imaging36(11), 2204–2215 (2017).https://doi.org/10.1109/TMI.2017.2712367

work page doi:10.1109/tmi.2017.2712367 2017

[3] [3]

Breuel, T.M., et al.: WebDataset (2021), https://github.com/webdataset/ webdataset

work page 2021

[4] [4]

Magnetometer with nitrogen-vacancy cen- ter in a bulk diamond for detecting magnetic nanoparticles in biomedical applications

Burgos-Artizzu, X.P., Coronado-Gutiérrez, D., Valenzuela-Alcaraz, B., Bonet-Carné, E., Eixarch, E., Crispi, F., Gratacós, E.: Evaluation of deep convolutional neural networks for automatic classification of common maternal fetal ultrasound planes. Scientific Reports10(1), 10200 (2020). https://doi.org/10.1038/s41598-020- 67076-5

work page doi:10.1038/s41598-020- 2020

[5] [5]

In: CVPR (2023)

Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J.: Reproducible scaling laws for contrastive language-image learning. In: CVPR (2023)

work page 2023

[6] [6]

In: ICCV (2019)

Cho, J.H., Hariharan, B.: On the efficacy of knowledge distillation. In: ICCV (2019)

work page 2019

[7] [7]

Transactions on Machine Learning Research (2025)

Faghri, F., Vasu, P.K.A., Koc, C., Shankar, V., Toshev, A., Tuzel, O., Pouransari, H.: MobileCLIP2: Improving multi-modal reinforced training. Transactions on Machine Learning Research (2025)

work page 2025

[8] [8]

In: ICML (2018)

Furlanello, T., Lipton, Z.C., Tschannen, M., Itti, L., Anandkumar, A.: Born again neural networks. In: ICML (2018)

work page 2018

[9] [9]

Communications Medicine2, 128 (2022).https://doi.org/10.1038/s43856-022-00194-5

Gomes, R.G., Vwalika, B., Lee, C., Willis, A., Sieniek, M., Price, J.T., Chen, C., Kasaro, M.P., Taylor, J.A., Stringer, E.M., McKinney, S.M., Sindano, N., Dahl, G.E., Goodnight, W., Gilmer, J., Chi, B.H., Lau, C., Spitz, T., Saensuksopa, T., Liu, K., Tiyasirichokchai, T., Wong, J., Pilgrim, R., Uddin, A., Corrado, G., Peng, L., Chou, K., Tse, D., Stringe...

work page doi:10.1038/s43856-022-00194-5 2022

[10] [10]

PLOS ONE 13(8), e0200412 (2018)

van den Heuvel, T.L., de Bruijn, D., de Korte, C.L., van Ginneken, B.: Automated measurement of fetal head circumference using 2d ultrasound images. PLOS ONE 13(8), e0200412 (2018)

work page 2018

[11] [11]

Distilling the Knowledge in a Neural Network

Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[12] [12]

In: NeurIPS (2022)

Huang, T., You, S., Wang, F., Qian, C., Xu, C.: Knowledge distillation from a stronger teacher. In: NeurIPS (2022)

work page 2022

[13] [13]

Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: OpenCLIP (2021),https://doi.org/10.5281/zenodo.5143773

work page doi:10.5281/zenodo.5143773 2021

[14] [14]

Unimed-clip: Towards a unified image-text pretrain- ing paradigm for diverse medical imaging modalities,

Khattak, M.U., Kunhimon, S., Naseer, M., Khan, S., Khan, F.S.: UniMed-CLIP: Towards a unified image-text pretraining paradigm for diverse medical imaging modalities. arXiv preprint arXiv:2412.10372 (2024)

work page arXiv 2024

[15] [15]

In: ICCV (2019)

Kim, Y., Yim, J., Yun, J., Kim, J.: NLNL: Negative learning for noisy labels. In: ICCV (2019)

work page 2019

[16] [16]

Saeed et al

Kiserud, T., Piaggio, G., Carroli, G., Widmer, M., Carvalho, J., Jensen, L.N., Giordano, D., Cecatti, J.G., Aleem, H.A., Talegawkar, S.A., Benachi, A., Diemert, 16 N. Saeed et al. A., Kitoto, A.T., Thinkhamrop, J., Lumbiganon, P., Tabor, A., Kriplani, A., Perez, R.G., Hecher, K., Hanson, M.A., Gülmezoglu, A.M., Platt, L.D.: The World Health Organization f...

work page doi:10.1371/journal.pmed.1002220 2017

[17] [17]

Fetalclip: A visual- language foundation model for fetal ultrasound image analysis.arXiv preprint arXiv:2502.14807,

Maani, F., Saeed, N., Saleem, T., Farooq, Z., Alasmawi, H., Diehl, W., Moham- mad, A., Waring, G., Valappi, S., Bricker, L., Yaqub, M.: FetalCLIP: A visual- language foundation model for fetal ultrasound image analysis. arXiv preprint arXiv:2502.14807 (2025)

work page arXiv 2025

[18] [18]

In: AAAI (2020)

Mirzadeh, S.I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., Ghasemzadeh, H.: Improved knowledge distillation via teacher assistant. In: AAAI (2020)

work page 2020

[19] [19]

In: CVPR (2019)

Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. In: CVPR (2019)

work page 2019

[20] [20]

In: International Conference on Learning Representations Workshop (2017)

Pereyra, G., Tucker, G., Chorowski, J., Kaiser, L., Hinton, G.: Regularizing neural networks by penalizing confident output distributions. In: International Conference on Learning Representations Workshop (2017)

work page 2017

[21] [21]

In: ICML (2021)

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021)

work page 2021

[22] [22]

In: ICLR (2015)

Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: Hints for thin deep nets. In: ICLR (2015)

work page 2015

[23] [23]

Nature Communi- cations14, 7047 (2023).https://doi.org/10.1038/s41467-023-42438-5

Slimani, S., Hounka, S., Mahmoudi, A., Rehah, T., Laoudiyi, D., Saadi, H., Bouziyane, A., Lamrissi, A., Jalal, M., Bouhya, S., Akiki, M., Bouyakhf, Y., Badaoui, B., Radgui, A., Mhlanga, M., Bouyakhf, E.H.: Fetal biometry and amniotic fluid volume assessment end-to-end automation using deep learning. Nature Communi- cations14, 7047 (2023).https://doi.org/1...

work page doi:10.1038/s41467-023-42438-5 2023

[24] [24]

Stanton, S., Izmailov, P., Kirichenko, P., Alemi, A.A., Wilson, A.G.: Does knowledge distillation really work? In: NeurIPS (2021)

work page 2021

[25] [25]

International Journal of MCH and AIDS9(1), 103–120 (2020)

Stewart, K.A., Navarro, S.M., Kambala, S., Tan, G., Poondla, R., Lederman, S., Barbour, K., Lavy, C.: Trends in ultrasound use in low and middle income countries: A systematic review. International Journal of MCH and AIDS9(1), 103–120 (2020). https://doi.org/10.21106/ijma.294

work page doi:10.21106/ijma.294 2020

[26] [26]

In: CVPR (2024)

Sun, S., Ren, W., Li, J., Wang, R., Cao, X.: Logit standardization in knowledge distillation. In: CVPR (2024)

work page 2024

[27] [27]

In: ICLR (2020)

Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. In: ICLR (2020)

work page 2020

[28] [28]

Nature Biomedical Engineering6, 1399–1406 (2022)

Tiu, E., Talius, E., Patel, P., Langlotz, C.P., Ng, A.Y., Rajpurkar, P.: Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nature Biomedical Engineering6, 1399–1406 (2022)

work page 2022

[29] [29]

In: ICCV (2023)

Vasu, P.K.A., Gabriel, J., Zhu, J., Tuzel, O., Ranjan, A.: FastViT: A fast hybrid vision transformer using structural reparameterization. In: ICCV (2023)

work page 2023

[30] [30]

In: CVPR (2024)

Vasu, P.K.A., Pouransari, H., Faghri, F., Vemulapalli, R., Tuzel, O.: MobileCLIP: Fast image-text models through multi-modal reinforced training. In: CVPR (2024)

work page 2024

[31] [31]

In: ICML (2020)

Wang, T., Isola, P.: Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In: ICML (2020)

work page 2020

[32] [32]

In: EMNLP

Wang, Z., Wu, Z., Agarwal, D., Sun, J.: MedCLIP: Contrastive learning from unpaired medical images and text. In: EMNLP. pp. 3876–3887 (2022)

work page 2022

[33] [33]

In: ICCV (2023) MobileFetalCLIP: Selective Repulsive KD for Fetal Ultrasound 17

Wu, K., Peng, H., Zhou, Z., Xiao, B., Liu, M., Yuan, L., Xuan, H., Valenzuela, M., Chen, X.S., Wang, X., Chao, H., Hu, H.: TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance. In: ICCV (2023) MobileFetalCLIP: Selective Repulsive KD for Fetal Ultrasound 17

work page 2023

[34] [34]

In: CVPR (2024)

Yang, C., An, Z., Huang, L., Bi, J., Yu, X., Yang, H., Diao, B., Xu, Y.: CLIP-KD: An empirical study of CLIP model distillation. In: CVPR (2024)

work page 2024

[35] [35]

In: ICML (2021)

Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: Self-supervised learning via redundancy reduction. In: ICML (2021)

work page 2021

[36] [36]

In: ICCV (2023)

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: ICCV (2023)

work page 2023

[37] [37]

In: CVPR (2022)

Zhao, B., Cui, Q., Song, R., Qiu, Y., Liang, J.: Decoupled knowledge distillation. In: CVPR (2022)

work page 2022