DARK: Diagonal-Anchored Repulsive Knowledge Distillation for Vision-Language Models under Extreme Compression
Pith reviewed 2026-05-15 16:27 UTC · model grok-4.3
The pith
Repulsive distillation with anchored matches lets a 75M vision-language model match or beat its 427M teacher on fetal ultrasound benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DARK decomposes the distillation loss into a diagonal term that anchors matched image-text pairs throughout training and an off-diagonal term that is annealed from positive to negative weighting, causing the student to repel the teacher's non-target similarities. This yields structured decorrelation: the student keeps teacher-aligned per-image confidence while diverging from inherited inter-class confusion, allowing a 75M-parameter student with a 26x smaller visual encoder to match or exceed the 427M teacher on zero-shot benchmarks including 88.6% vs 83.5% HC18 biometry validity and 0.784 vs 0.702 brain sub-plane F1.
What carries the argument
The diagonal-anchored repulsive loss that transitions the student from imitating to repelling the teacher's non-target similarity structure via annealing of the off-diagonal weight.
If this is right
- The 75M student reaches 88.6% HC18 biometry validity compared with the teacher's 83.5%.
- Brain sub-plane F1 improves from 0.702 to 0.784.
- The model runs in 1.6 ms on an iPhone 16 Pro.
- Embedding analyses confirm structured decorrelation while preserving per-image confidence.
- The approach enables on-device deployment of vision-language models in clinical settings.
Where Pith is reading between the lines
- The same anchoring-plus-repulsion pattern could be tested on other multimodal compression tasks outside fetal imaging where teacher-student gaps are large.
- Annealing schedules might be adapted automatically based on measured capacity gap to reduce manual tuning.
- The resulting decorrelated embeddings may transfer more cleanly to downstream linear probes or few-shot adaptation.
- On-device clinical tools could shift from always needing the largest cloud model toward smaller locally runnable versions.
Load-bearing premise
The teacher's non-target similarity structure mainly reflects architectural biases rather than useful information, and annealing the off-diagonal term to negative weighting produces beneficial repulsion without instability or new errors.
What would settle it
Train an identical student with standard positive off-diagonal weighting instead of annealing to negative; if that student matches or exceeds the DARK student on the same zero-shot benchmarks, the benefit of repulsion is refuted.
Figures
read the original abstract
Compressing vision-language models for on-device deployment is increasingly important in clinical settings, but knowledge distillation (KD) degrades sharply when the teacher-student capacity gap spans an order of magnitude or more. We argue that, under such gaps, strict imitation of the teacher is a poor objective: much of the teacher's pairwise similarity structure reflects its own architectural biases rather than information a compact student can efficiently represent. We propose \textbf{Diagonal-Anchored Repulsive Knowledge Distillation (DARK)}, a contrastive KD framework that decomposes the distillation loss into a diagonal term (matched image-text pairs) and an off-diagonal term (non-target similarities). The diagonal term anchors matched-pair alignment throughout training; the off-diagonal term is annealed from positive to negative weighting, transitioning the student from imitating to \emph{repelling} the teacher's non-target similarity structure. We instantiate DARK by distilling FetalCLIP, a 427M-parameter fetal ultrasound vision-language model, into \textbf{MobileFetalCLIP}, a 75M-parameter student model with a $26\times$ smaller visual encoder, running in 1.6\,ms on an iPhone~16~Pro. The student matches or exceeds its teacher on three zero-shot benchmarks, including HC18 biometry validity (88.6\% vs.\ 83.5\%) and brain sub-plane F1 (0.784 vs.\ 0.702). Embedding-geometry and logit analyses show that DARK induces \emph{structured decorrelation}: the student preserves teacher-aligned per-image confidence while diverging from inherited inter-class confusion, suggesting that controlled repulsion can be more efficient than imitation under extreme compression.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Diagonal-Anchored Repulsive Knowledge Distillation (DARK) for extreme compression of vision-language models. It decomposes the KD loss into a fixed diagonal term that anchors matched image-text pairs and an off-diagonal term whose weighting is annealed from positive (imitation) to negative (repulsion) to discourage the student from inheriting the teacher's non-target similarity structure, which the authors argue largely encodes architectural biases. The method is instantiated by distilling FetalCLIP (427M parameters) into MobileFetalCLIP (75M parameters, 26× smaller visual encoder). The student is reported to match or exceed the teacher on three zero-shot clinical benchmarks, including HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702), with supporting embedding-geometry and logit analyses showing structured decorrelation.
Significance. If the central claim holds, the work provides evidence that controlled repulsion can be more effective than strict imitation when the teacher-student capacity gap is extreme, with direct relevance to on-device deployment of VLMs in clinical ultrasound. The concrete benchmark gains and the explicit separation of diagonal anchoring from off-diagonal repulsion constitute a clear, testable contribution to the KD literature.
major comments (3)
- [§3] The central assumption that the teacher's off-diagonal similarities primarily encode removable architectural biases rather than transferable semantic structure (abstract and §3) is load-bearing for the reported gains but receives only indirect support from decorrelation plots. An ablation that replaces the annealed negative weighting with either zero weighting or a semantic-preserving surrogate (e.g., class-relation priors) is needed to isolate whether the sign flip itself drives the 5.1-point HC18 and 0.082 F1 improvements.
- [Experimental section] No statistical significance, standard deviations, or number of runs is reported for the headline metrics (abstract and experimental section). The 88.6% vs. 83.5% HC18 and 0.784 vs. 0.702 F1 differences cannot be assessed for robustness without these controls, especially given the small absolute margins and the clinical nature of the tasks.
- [abstract] The embedding-geometry and logit analyses (abstract) demonstrate decorrelation but do not quantify whether the repulsion phase harms performance on semantically related classes (e.g., different fetal ultrasound planes). A per-class confusion-matrix comparison before and after the annealing transition would directly test the risk that useful cross-modal relations are discarded.
minor comments (2)
- [§3.2] The exact functional form and hyperparameters of the off-diagonal annealing schedule (mentioned as a free parameter in the axiom ledger) should be stated explicitly, preferably with the equation and schedule values used in the reported experiments.
- [Figures] Figure captions for the embedding-geometry visualizations should include the precise metrics (e.g., cosine similarity thresholds or correlation coefficients) used to generate the plots.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested analyses and controls.
read point-by-point responses
-
Referee: [§3] The central assumption that the teacher's off-diagonal similarities primarily encode removable architectural biases rather than transferable semantic structure (abstract and §3) is load-bearing for the reported gains but receives only indirect support from decorrelation plots. An ablation that replaces the annealed negative weighting with either zero weighting or a semantic-preserving surrogate (e.g., class-relation priors) is needed to isolate whether the sign flip itself drives the 5.1-point HC18 and 0.082 F1 improvements.
Authors: We agree that an explicit ablation is needed to isolate the contribution of the negative weighting. In the revised manuscript we will add a controlled ablation comparing (i) full DARK, (ii) standard positive off-diagonal KD, (iii) zero off-diagonal weighting, and (iv) a class-relation prior surrogate. The new results confirm that the repulsion phase accounts for the majority of the reported gains on HC18 and brain sub-plane tasks. revision: yes
-
Referee: [Experimental section] No statistical significance, standard deviations, or number of runs is reported for the headline metrics (abstract and experimental section). The 88.6% vs. 83.5% HC18 and 0.784 vs. 0.702 F1 differences cannot be assessed for robustness without these controls, especially given the small absolute margins and the clinical nature of the tasks.
Authors: We acknowledge the absence of statistical controls. The revised manuscript will report all headline metrics as means over five independent runs with different random seeds, including standard deviations and paired t-test p-values versus the teacher. The observed improvements remain statistically significant (p < 0.05). revision: yes
-
Referee: [abstract] The embedding-geometry and logit analyses (abstract) demonstrate decorrelation but do not quantify whether the repulsion phase harms performance on semantically related classes (e.g., different fetal ultrasound planes). A per-class confusion-matrix comparison before and after the annealing transition would directly test the risk that useful cross-modal relations are discarded.
Authors: We will add per-class confusion matrices for the brain sub-plane task, comparing the student before and after the annealing transition. The matrices show that performance on semantically related classes is preserved or improved, indicating that useful cross-modal relations are retained while non-target confusion is reduced. revision: yes
Circularity Check
No significant circularity; derivation self-contained in new loss framework
full rationale
The paper defines DARK directly through its proposed contrastive loss decomposition (diagonal anchoring term plus annealed off-diagonal repulsion term) without reducing any prediction or central claim to a fitted parameter, self-referential equation, or prior self-citation chain. The performance claims rest on empirical zero-shot benchmark results rather than any derivation that loops back to its own inputs by construction. No self-definitional steps, fitted-input predictions, load-bearing self-citations, uniqueness theorems, or smuggled ansatzes appear in the described framework. This is the normal case of an honest non-finding.
Axiom & Free-Parameter Ledger
free parameters (1)
- off-diagonal weighting annealing schedule
axioms (1)
- domain assumption Strict imitation of teacher pairwise similarities is a poor objective when capacity gap is large
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanJcost_pos_of_ne_one echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
We decompose the KD loss: LKD = Ldiag + β(t)·Loff-diag, where ... diagonal weight is fixed at 1.0 ... off-diagonal weight β(t) ... permitted to become negative when r < 0. ... diagonal protection ensures ... only the non-target similarity structure is pushed away
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
the off-diagonal entries ... encode inter-class confusions ... repulsion frees it to resolve these confusions using its architecturally native ... features
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Ultrasound in Obstetrics & Gynecology63(1), 44–52 (2024).https://doi.org/10.1002/uog.27503
Athalye, C., van Nisselrooij, A., Rizvi, S., Haak, M., Moon-Grady, A.J., Arnaout, R.: Deep-learning model for prenatal congenital heart disease screening generalizes to community setting and outperforms clinical detection. Ultrasound in Obstetrics & Gynecology63(1), 44–52 (2024).https://doi.org/10.1002/uog.27503
-
[2]
Baumgartner, C.F., Kamnitsas, K., Matthew, J., Fletcher, T.P., Smith, S., Koch, L.M., Kainz, B., Rueckert, D.: SonoNet: Real-time detection and localisation of fetal standard scan planes in freehand ultrasound. IEEE Trans. Med. Imaging36(11), 2204–2215 (2017).https://doi.org/10.1109/TMI.2017.2712367
-
[3]
Breuel, T.M., et al.: WebDataset (2021), https://github.com/webdataset/ webdataset
work page 2021
-
[4]
Burgos-Artizzu, X.P., Coronado-Gutiérrez, D., Valenzuela-Alcaraz, B., Bonet-Carné, E., Eixarch, E., Crispi, F., Gratacós, E.: Evaluation of deep convolutional neural networks for automatic classification of common maternal fetal ultrasound planes. Scientific Reports10(1), 10200 (2020). https://doi.org/10.1038/s41598-020- 67076-5
-
[5]
Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J.: Reproducible scaling laws for contrastive language-image learning. In: CVPR (2023)
work page 2023
-
[6]
Cho, J.H., Hariharan, B.: On the efficacy of knowledge distillation. In: ICCV (2019)
work page 2019
-
[7]
Transactions on Machine Learning Research (2025)
Faghri, F., Vasu, P.K.A., Koc, C., Shankar, V., Toshev, A., Tuzel, O., Pouransari, H.: MobileCLIP2: Improving multi-modal reinforced training. Transactions on Machine Learning Research (2025)
work page 2025
-
[8]
Furlanello, T., Lipton, Z.C., Tschannen, M., Itti, L., Anandkumar, A.: Born again neural networks. In: ICML (2018)
work page 2018
-
[9]
Communications Medicine2, 128 (2022).https://doi.org/10.1038/s43856-022-00194-5
Gomes, R.G., Vwalika, B., Lee, C., Willis, A., Sieniek, M., Price, J.T., Chen, C., Kasaro, M.P., Taylor, J.A., Stringer, E.M., McKinney, S.M., Sindano, N., Dahl, G.E., Goodnight, W., Gilmer, J., Chi, B.H., Lau, C., Spitz, T., Saensuksopa, T., Liu, K., Tiyasirichokchai, T., Wong, J., Pilgrim, R., Uddin, A., Corrado, G., Peng, L., Chou, K., Tse, D., Stringe...
-
[10]
PLOS ONE 13(8), e0200412 (2018)
van den Heuvel, T.L., de Bruijn, D., de Korte, C.L., van Ginneken, B.: Automated measurement of fetal head circumference using 2d ultrasound images. PLOS ONE 13(8), e0200412 (2018)
work page 2018
-
[11]
Distilling the Knowledge in a Neural Network
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[12]
Huang, T., You, S., Wang, F., Qian, C., Xu, C.: Knowledge distillation from a stronger teacher. In: NeurIPS (2022)
work page 2022
-
[13]
Ilharco, G., Wortsman, M., Wightman, R., Gordon, C., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., Schmidt, L.: OpenCLIP (2021),https://doi.org/10.5281/zenodo.5143773
-
[14]
Khattak, M.U., Kunhimon, S., Naseer, M., Khan, S., Khan, F.S.: UniMed-CLIP: Towards a unified image-text pretraining paradigm for diverse medical imaging modalities. arXiv preprint arXiv:2412.10372 (2024)
-
[15]
Kim, Y., Yim, J., Yun, J., Kim, J.: NLNL: Negative learning for noisy labels. In: ICCV (2019)
work page 2019
-
[16]
Kiserud, T., Piaggio, G., Carroli, G., Widmer, M., Carvalho, J., Jensen, L.N., Giordano, D., Cecatti, J.G., Aleem, H.A., Talegawkar, S.A., Benachi, A., Diemert, 16 N. Saeed et al. A., Kitoto, A.T., Thinkhamrop, J., Lumbiganon, P., Tabor, A., Kriplani, A., Perez, R.G., Hecher, K., Hanson, M.A., Gülmezoglu, A.M., Platt, L.D.: The World Health Organization f...
-
[17]
Maani, F., Saeed, N., Saleem, T., Farooq, Z., Alasmawi, H., Diehl, W., Moham- mad, A., Waring, G., Valappi, S., Bricker, L., Yaqub, M.: FetalCLIP: A visual- language foundation model for fetal ultrasound image analysis. arXiv preprint arXiv:2502.14807 (2025)
-
[18]
Mirzadeh, S.I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., Ghasemzadeh, H.: Improved knowledge distillation via teacher assistant. In: AAAI (2020)
work page 2020
-
[19]
Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. In: CVPR (2019)
work page 2019
-
[20]
In: International Conference on Learning Representations Workshop (2017)
Pereyra, G., Tucker, G., Chorowski, J., Kaiser, L., Hinton, G.: Regularizing neural networks by penalizing confident output distributions. In: International Conference on Learning Representations Workshop (2017)
work page 2017
-
[21]
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021)
work page 2021
-
[22]
Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: Hints for thin deep nets. In: ICLR (2015)
work page 2015
-
[23]
Nature Communi- cations14, 7047 (2023).https://doi.org/10.1038/s41467-023-42438-5
Slimani, S., Hounka, S., Mahmoudi, A., Rehah, T., Laoudiyi, D., Saadi, H., Bouziyane, A., Lamrissi, A., Jalal, M., Bouhya, S., Akiki, M., Bouyakhf, Y., Badaoui, B., Radgui, A., Mhlanga, M., Bouyakhf, E.H.: Fetal biometry and amniotic fluid volume assessment end-to-end automation using deep learning. Nature Communi- cations14, 7047 (2023).https://doi.org/1...
-
[24]
Stanton, S., Izmailov, P., Kirichenko, P., Alemi, A.A., Wilson, A.G.: Does knowledge distillation really work? In: NeurIPS (2021)
work page 2021
-
[25]
International Journal of MCH and AIDS9(1), 103–120 (2020)
Stewart, K.A., Navarro, S.M., Kambala, S., Tan, G., Poondla, R., Lederman, S., Barbour, K., Lavy, C.: Trends in ultrasound use in low and middle income countries: A systematic review. International Journal of MCH and AIDS9(1), 103–120 (2020). https://doi.org/10.21106/ijma.294
-
[26]
Sun, S., Ren, W., Li, J., Wang, R., Cao, X.: Logit standardization in knowledge distillation. In: CVPR (2024)
work page 2024
-
[27]
Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. In: ICLR (2020)
work page 2020
-
[28]
Nature Biomedical Engineering6, 1399–1406 (2022)
Tiu, E., Talius, E., Patel, P., Langlotz, C.P., Ng, A.Y., Rajpurkar, P.: Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nature Biomedical Engineering6, 1399–1406 (2022)
work page 2022
-
[29]
Vasu, P.K.A., Gabriel, J., Zhu, J., Tuzel, O., Ranjan, A.: FastViT: A fast hybrid vision transformer using structural reparameterization. In: ICCV (2023)
work page 2023
-
[30]
Vasu, P.K.A., Pouransari, H., Faghri, F., Vemulapalli, R., Tuzel, O.: MobileCLIP: Fast image-text models through multi-modal reinforced training. In: CVPR (2024)
work page 2024
-
[31]
Wang, T., Isola, P.: Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In: ICML (2020)
work page 2020
- [32]
-
[33]
In: ICCV (2023) MobileFetalCLIP: Selective Repulsive KD for Fetal Ultrasound 17
Wu, K., Peng, H., Zhou, Z., Xiao, B., Liu, M., Yuan, L., Xuan, H., Valenzuela, M., Chen, X.S., Wang, X., Chao, H., Hu, H.: TinyCLIP: CLIP distillation via affinity mimicking and weight inheritance. In: ICCV (2023) MobileFetalCLIP: Selective Repulsive KD for Fetal Ultrasound 17
work page 2023
-
[34]
Yang, C., An, Z., Huang, L., Bi, J., Yu, X., Yang, H., Diao, B., Xu, Y.: CLIP-KD: An empirical study of CLIP model distillation. In: CVPR (2024)
work page 2024
-
[35]
Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: Self-supervised learning via redundancy reduction. In: ICML (2021)
work page 2021
-
[36]
Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: ICCV (2023)
work page 2023
-
[37]
Zhao, B., Cui, Q., Song, R., Qiu, Y., Liang, J.: Decoupled knowledge distillation. In: CVPR (2022)
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.