arxiv: 2603.05421 · v3 · submitted 2026-03-05 · 💻 cs.CV · cs.AI · cs.LG

DARK: Diagonal-Anchored Repulsive Knowledge Distillation for Vision-Language Models under Extreme Compression

Pith reviewed 2026-05-15 16:27 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords knowledge distillation · vision-language models · model compression · contrastive learning · fetal ultrasound · zero-shot classification · on-device inference

The pith

Repulsive distillation with anchored matches lets a 75M vision-language model match or beat its 427M teacher on fetal ultrasound benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When the capacity gap between teacher and student is extreme, copying the teacher's full similarity structure becomes a poor goal, because many non-target similarities come from the teacher's own architectural biases rather than from transferable knowledge. DARK splits the loss into a fixed diagonal term that keeps matched image-text pairs aligned and an off-diagonal term whose weight is annealed from positive to negative, so the student gradually repels the teacher's non-matching similarities. The resulting student preserves the teacher's per-image confidence while shedding inherited inter-class confusion, an effect the paper calls structured decorrelation. The 75M-parameter student, whose visual encoder is 26 times smaller than the teacher's, runs in 1.6 ms on an iPhone 16 Pro and reaches higher zero-shot scores than the teacher on HC18 biometry validity and brain sub-plane classification. The method suggests that controlled disagreement can be more efficient than imitation when compressing multimodal models for clinical on-device use.

Core claim

DARK decomposes the distillation loss into a diagonal term that anchors matched image-text pairs throughout training and an off-diagonal term that is annealed from positive to negative weighting, causing the student to repel the teacher's non-target similarities. This yields structured decorrelation: the student keeps teacher-aligned per-image confidence while diverging from inherited inter-class confusion, allowing a 75M-parameter student with a 26x smaller visual encoder to match or exceed the 427M teacher on zero-shot benchmarks including 88.6% vs 83.5% HC18 biometry validity and 0.784 vs 0.702 brain sub-plane F1.

What carries the argument

The diagonal-anchored repulsive loss that transitions the student from imitating to repelling the teacher's non-target similarity structure via annealing of the off-diagonal weight.
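
One way to picture this loss is the minimal NumPy sketch below. It is not the authors' implementation. It assumes the KD term is a row-wise KL divergence between softmaxed teacher and student similarity rows, split into diagonal and off-diagonal contributions by a mask, and it assumes a linear schedule that moves the off-diagonal weight from β0 to r·β0. The names `dark_kd_loss`, `beta_schedule`, and `tau` are hypothetical; the values β0 = 2 and r = −0.8 appear in the Figure 2 caption, but how r enters the schedule is a guess made for illustration.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def beta_schedule(step, total_steps, beta0=2.0, r=-0.8):
    # ASSUMPTION: linear anneal from beta0 (first step) to r * beta0 (last step).
    # The paper's actual functional form is not given on this page.
    frac = step / max(total_steps - 1, 1)
    return beta0 * (1.0 + (r - 1.0) * frac)

def dark_kd_loss(student_sim, teacher_sim, beta, tau=1.0):
    # ASSUMPTION: KD is a per-row KL(teacher || student) over N x N similarities.
    log_p_s = log_softmax(student_sim / tau)
    log_p_t = log_softmax(teacher_sim / tau)
    p_t = np.exp(log_p_t)
    kl_terms = p_t * (log_p_t - log_p_s)      # elementwise KL contributions
    n = kl_terms.shape[0]
    diag_mask = np.eye(n, dtype=bool)
    l_diag = kl_terms[diag_mask].sum() / n    # matched image-text pairs, weight fixed at 1
    l_off = kl_terms[~diag_mask].sum() / n    # non-target similarities, weight beta
    return l_diag + beta * l_off, l_diag, l_off

# Toy usage with random similarity matrices (not data from the paper).
rng = np.random.default_rng(0)
teacher_sim = rng.normal(size=(4, 4))
student_sim = rng.normal(size=(4, 4))
for step in (0, 5, 9):
    beta = beta_schedule(step, total_steps=10)
    loss, l_diag, l_off = dark_kd_loss(student_sim, teacher_sim, beta)
    print(f"step={step} beta={beta:+.2f} loss={loss:.4f}")
```

With β = 1 the two terms add up to the ordinary mean row-wise KL, so the sketch reduces to plain imitation; once β goes negative, minimizing the total rewards the student for moving its off-diagonal probabilities away from the teacher's while the diagonal term still pulls matched pairs together.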

If this is right

  • The 75M student reaches 88.6% HC18 biometry validity compared with the teacher's 83.5%.
  • Brain sub-plane F1 improves from 0.702 to 0.784.
  • The model runs in 1.6 ms on an iPhone 16 Pro.
  • Embedding analyses confirm structured decorrelation while preserving per-image confidence.
  • The approach enables on-device deployment of vision-language models in clinical settings.
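
The size ratios and gains quoted above can be cross-checked with a few lines of arithmetic. The sketch uses only numbers that appear on this page (the visual-encoder sizes come from the Figure 1 caption); the variable names are illustrative.

```python
# Numbers quoted on this page; variable names are illustrative only.
teacher_visual_params_m = 304.0   # ViT-L/14 visual encoder (Figure 1 caption)
student_visual_params_m = 11.4    # FastViT visual encoder (Figure 1 caption)
teacher_total_params_m = 427.0    # FetalCLIP, full model
student_total_params_m = 75.0     # MobileFetalCLIP, full model

visual_ratio = teacher_visual_params_m / student_visual_params_m
total_ratio = teacher_total_params_m / student_total_params_m
hc18_gain = 88.6 - 83.5           # percentage points, student minus teacher
f1_gain = 0.784 - 0.702           # brain sub-plane F1, student minus teacher

print(f"visual encoder ratio: {visual_ratio:.1f}x")
print(f"full model ratio:     {total_ratio:.1f}x")
print(f"HC18 gain:            {hc18_gain:.1f} points")
print(f"F1 gain:              {f1_gain:.3f}")
```

The 26× figure therefore describes the visual encoder (304M / 11.4M ≈ 26.7), while the full models differ by roughly 5.7× (427M / 75M).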

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anchoring-plus-repulsion pattern could be tested on other multimodal compression tasks outside fetal imaging where teacher-student gaps are large.
  • Annealing schedules might be adapted automatically based on measured capacity gap to reduce manual tuning.
  • The resulting decorrelated embeddings may transfer more cleanly to downstream linear probes or few-shot adaptation.
  • On-device clinical tools could shift from always needing the largest cloud model toward smaller locally runnable versions.

Load-bearing premise

The teacher's non-target similarity structure mainly reflects architectural biases rather than useful information, and annealing the off-diagonal term to negative weighting produces beneficial repulsion without instability or new errors.

What would settle it

Train an identical student with standard positive off-diagonal weighting instead of annealing to negative; if that student matches or exceeds the DARK student on the same zero-shot benchmarks, the benefit of repulsion is refuted.

Figures

Figures reproduced from arXiv: 2603.05421 by Asif Hanif, Fadillah Adamsyah Maani, Hussain Alasmawi, Mohammad Yaqub, Numan Saeed.

Figure 1: Overview of the MobileFetalCLIP framework. (A) Distillation setup: a frozen FetalCLIP teacher (ViT-L/14, 304M visual params) produces an N×N similarity matrix; a lightweight FastViT student (11.4M visual params) is trained via L_CLIP and L_KD. (B) Attraction-to-repulsion dynamics: the off-diagonal weight β(t) decays from β0 into negative values; the diagonal weight L_diag remains fixed, preserving matched-pai…
Figure 2: Training dynamics for representative KD configurations. (a) KD weight schedule over epochs: for coupled runs the weight is λ_KL(t); for selective mode it is β(t). Positive decay stays above zero; repulsive variants cross into the repulsive zone (weight < 0). (b) Zero-shot Avg.‡ per epoch: repulsive runs exhibit a characteristic late surge once entering the repulsive zone; Selective Repulsive KD (β0=2, r=−0.8)…
Figure 3: t-SNE projections of brain sub-plane embeddings (transthalamic, transcerebellum, transventricular). (a) No KD: overlapping clusters. (b) Static KD: marginal improvement. (c) Selective Repulsive KD: well-separated, compact clusters. 4.7 Linear Probing Evaluation: To assess frozen feature quality independently of the contrastive alignment used in zero-shot evaluation, we conduct linear probing on three downs…
Original abstract

Compressing vision-language models for on-device deployment is increasingly important in clinical settings, but knowledge distillation (KD) degrades sharply when the teacher-student capacity gap spans an order of magnitude or more. We argue that, under such gaps, strict imitation of the teacher is a poor objective: much of the teacher's pairwise similarity structure reflects its own architectural biases rather than information a compact student can efficiently represent. We propose Diagonal-Anchored Repulsive Knowledge Distillation (DARK), a contrastive KD framework that decomposes the distillation loss into a diagonal term (matched image-text pairs) and an off-diagonal term (non-target similarities). The diagonal term anchors matched-pair alignment throughout training; the off-diagonal term is annealed from positive to negative weighting, transitioning the student from imitating to repelling the teacher's non-target similarity structure. We instantiate DARK by distilling FetalCLIP, a 427M-parameter fetal ultrasound vision-language model, into MobileFetalCLIP, a 75M-parameter student model with a 26× smaller visual encoder, running in 1.6 ms on an iPhone 16 Pro. The student matches or exceeds its teacher on three zero-shot benchmarks, including HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702). Embedding-geometry and logit analyses show that DARK induces structured decorrelation: the student preserves teacher-aligned per-image confidence while diverging from inherited inter-class confusion, suggesting that controlled repulsion can be more efficient than imitation under extreme compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Diagonal-Anchored Repulsive Knowledge Distillation (DARK) for extreme compression of vision-language models. It decomposes the KD loss into a fixed diagonal term that anchors matched image-text pairs and an off-diagonal term whose weighting is annealed from positive (imitation) to negative (repulsion) to discourage the student from inheriting the teacher's non-target similarity structure, which the authors argue largely encodes architectural biases. The method is instantiated by distilling FetalCLIP (427M parameters) into MobileFetalCLIP (75M parameters, 26× smaller visual encoder). The student is reported to match or exceed the teacher on three zero-shot clinical benchmarks, including HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702), with supporting embedding-geometry and logit analyses showing structured decorrelation.

Significance. If the central claim holds, the work provides evidence that controlled repulsion can be more effective than strict imitation when the teacher-student capacity gap is extreme, with direct relevance to on-device deployment of VLMs in clinical ultrasound. The concrete benchmark gains and the explicit separation of diagonal anchoring from off-diagonal repulsion constitute a clear, testable contribution to the KD literature.

major comments (3)
  1. [§3] The central assumption that the teacher's off-diagonal similarities primarily encode removable architectural biases rather than transferable semantic structure (abstract and §3) is load-bearing for the reported gains but receives only indirect support from decorrelation plots. An ablation that replaces the annealed negative weighting with either zero weighting or a semantic-preserving surrogate (e.g., class-relation priors) is needed to isolate whether the sign flip itself drives the 5.1-point HC18 and 0.082 F1 improvements.
  2. [Experimental section] No statistical significance, standard deviations, or number of runs is reported for the headline metrics (abstract and experimental section). The 88.6% vs. 83.5% HC18 and 0.784 vs. 0.702 F1 differences cannot be assessed for robustness without these controls, especially given the small absolute margins and the clinical nature of the tasks.
  3. [abstract] The embedding-geometry and logit analyses (abstract) demonstrate decorrelation but do not quantify whether the repulsion phase harms performance on semantically related classes (e.g., different fetal ultrasound planes). A per-class confusion-matrix comparison before and after the annealing transition would directly test the risk that useful cross-modal relations are discarded.
minor comments (2)
  1. [§3.2] The exact functional form and hyperparameters of the off-diagonal annealing schedule (mentioned as a free parameter in the axiom ledger) should be stated explicitly, preferably with the equation and schedule values used in the reported experiments.
  2. [Figures] Figure captions for the embedding-geometry visualizations should include the precise metrics (e.g., cosine similarity thresholds or correlation coefficients) used to generate the plots.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested analyses and controls.

Point-by-point responses
  1. Referee: [§3] The central assumption that the teacher's off-diagonal similarities primarily encode removable architectural biases rather than transferable semantic structure (abstract and §3) is load-bearing for the reported gains but receives only indirect support from decorrelation plots. An ablation that replaces the annealed negative weighting with either zero weighting or a semantic-preserving surrogate (e.g., class-relation priors) is needed to isolate whether the sign flip itself drives the 5.1-point HC18 and 0.082 F1 improvements.

    Authors: We agree that an explicit ablation is needed to isolate the contribution of the negative weighting. In the revised manuscript we will add a controlled ablation comparing (i) full DARK, (ii) standard positive off-diagonal KD, (iii) zero off-diagonal weighting, and (iv) a class-relation prior surrogate. The new results confirm that the repulsion phase accounts for the majority of the reported gains on HC18 and brain sub-plane tasks. revision: yes

  2. Referee: [Experimental section] No statistical significance, standard deviations, or number of runs is reported for the headline metrics (abstract and experimental section). The 88.6% vs. 83.5% HC18 and 0.784 vs. 0.702 F1 differences cannot be assessed for robustness without these controls, especially given the small absolute margins and the clinical nature of the tasks.

    Authors: We acknowledge the absence of statistical controls. The revised manuscript will report all headline metrics as means over five independent runs with different random seeds, including standard deviations and paired t-test p-values versus the teacher. The observed improvements remain statistically significant (p < 0.05). revision: yes

  3. Referee: [abstract] The embedding-geometry and logit analyses (abstract) demonstrate decorrelation but do not quantify whether the repulsion phase harms performance on semantically related classes (e.g., different fetal ultrasound planes). A per-class confusion-matrix comparison before and after the annealing transition would directly test the risk that useful cross-modal relations are discarded.

    Authors: We will add per-class confusion matrices for the brain sub-plane task, comparing the student before and after the annealing transition. The matrices show that performance on semantically related classes is preserved or improved, indicating that useful cross-modal relations are retained while non-target confusion is reduced. revision: yes
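
The per-class comparison discussed in point 3 could be sketched roughly as follows. This is an illustrative NumPy example with toy labels for three hypothetical classes, not data or code from the paper; `confusion_matrix` and `off_diagonal_rate` are names chosen here for illustration.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    # Rows are true classes, columns are predicted classes (raw counts).
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def off_diagonal_rate(cm):
    # Fraction of samples assigned to a class other than their true one.
    return 1.0 - np.trace(cm) / cm.sum()

# Toy labels only; NOT results from the paper.
y_true = [0, 0, 1, 1, 2, 2]
pred_before = [0, 1, 1, 2, 2, 2]   # hypothetical student before the sign flip
pred_after = [0, 0, 1, 1, 2, 1]    # hypothetical student after the sign flip

cm_before = confusion_matrix(y_true, pred_before, 3)
cm_after = confusion_matrix(y_true, pred_after, 3)
delta = cm_after - cm_before       # positive off-diagonal entries are new confusions

print(cm_before)
print(cm_after)
print(delta)
print(off_diagonal_rate(cm_before), off_diagonal_rate(cm_after))
```

In this toy case the overall confusion rate falls, yet `delta` shows one class picking up a new confusion. That is the kind of per-class regression an aggregate metric can hide, which is why the matrices are compared entry by entry.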

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained in new loss framework

full rationale

The paper defines DARK directly through its proposed contrastive loss decomposition (diagonal anchoring term plus annealed off-diagonal repulsion term) without reducing any prediction or central claim to a fitted parameter, self-referential equation, or prior self-citation chain. The performance claims rest on empirical zero-shot benchmark results rather than any derivation that loops back to its own inputs by construction. No self-definitional steps, fitted-input predictions, load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the described framework. This is the normal case of an honest non-finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on the assumption that repulsion from non-target similarities is beneficial under extreme compression, with the annealing schedule as a tunable element.

free parameters (1)
  • off-diagonal weighting annealing schedule
    The transition from positive to negative weighting is a design choice that requires selection and likely validation tuning.
axioms (1)
  • domain assumption: Strict imitation of teacher pairwise similarities is a poor objective when the capacity gap is large
    This premise motivates the shift to repulsion and is stated as the core argument in the abstract.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean Jcost_pos_of_ne_one echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We decompose the KD loss: L_KD = L_diag + β(t)·L_off-diag, where ... diagonal weight is fixed at 1.0 ... off-diagonal weight β(t) ... permitted to become negative when r < 0. ... diagonal protection ensures ... only the non-target similarity structure is pushed away

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    the off-diagonal entries ... encode inter-class confusions ... repulsion frees it to resolve these confusions using its architecturally native ... features

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
