Recognition: 2 theorem links
Training a Student Expert via Semi-Supervised Foundation Model Distillation
Pith reviewed 2026-05-13 16:48 UTC · model grok-4.3
The pith
Semi-supervised distillation compresses vision foundation models into compact experts that surpass their teachers on instance segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that maintaining an instance-aware pixel-wise contrastive loss across self-training adaptation and distillation stages aligns teacher and student embeddings, enabling a compact student to leverage abundant unlabeled images and achieve higher instance segmentation accuracy than its larger vision foundation model teachers on Cityscapes and ADE20K.
What carries the argument
An instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins, applied consistently in both adaptation and distillation.
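As a concrete reading of this mechanism, here is a minimal sketch of a pixel-wise NT-Xent-style loss in which a fused mask-and-class score ranks candidate negatives. The function name, the top-k selection, and the single `neg_scores` vector are illustrative assumptions, and the explicit inter-instance margin term is omitted; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(embed, inst_ids, neg_scores, tau=0.1, k=64):
    """NT-Xent-style contrastive loss over sampled pixel embeddings.

    embed:      (N, D) pixel embeddings
    inst_ids:   (N,)   instance id per pixel; pixels sharing an id are positives
    neg_scores: (N,)   fused mask*class score used to rank candidate negatives
    """
    z = F.normalize(embed, dim=1)
    sim = (z @ z.t()) / tau                        # (N, N) scaled cosine similarity
    pos = inst_ids[:, None].eq(inst_ids[None, :])
    pos.fill_diagonal_(False)                      # exclude self-pairs
    neg = ~pos
    neg.fill_diagonal_(False)

    # Keep only the k highest-scoring negatives per anchor ("informative" negatives).
    ranked = neg_scores.expand(sim.size(0), -1).masked_fill(~neg, float('-inf'))
    idx = ranked.topk(min(k, sim.size(0) - 1), dim=1).indices
    keep = torch.zeros_like(neg).scatter_(1, idx, True) & neg

    exp_sim = sim.exp()
    pos_term = (exp_sim * pos).sum(1)              # similarity mass on positives
    denom = pos_term + (exp_sim * keep).sum(1)     # positives + selected negatives
    valid = pos_term > 0                           # anchors with at least one positive
    return -(pos_term[valid] / denom[valid]).log().mean()
```

Ranking negatives by the fused score, rather than sampling them uniformly, is what would make the negatives "informative" in the paper's sense.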
If this is right
- The ≈11× smaller student improves over its zero-shot VFM teachers by +11.9 AP on Cityscapes and +8.6 AP on ADE20K.
- It surpasses the adapted teachers by +3.4 AP on Cityscapes and +1.5 AP on ADE20K.
- It outperforms state-of-the-art semi-supervised knowledge distillation methods on the evaluated benchmarks.
- The framework enables effective exploitation of unlabeled images to offset the high cost of per-pixel annotations.
Where Pith is reading between the lines
- The same contrastive calibration and distillation stages could be tested on other dense prediction tasks such as semantic segmentation or monocular depth estimation.
- Consistent embedding alignment might reduce sensitivity to domain shifts when the student is deployed on data distributions differing from the adaptation set.
- The refinement stage could be examined for its effect on even smaller model scales or on foundation models with different backbone architectures.
Load-bearing premise
The pseudo-labels generated during self-training with contrastive calibration remain sufficiently accurate and unbiased across domains so that distillation and refinement can improve upon them rather than amplify errors.
What would settle it
Running the full pipeline on a new dataset where contrastive calibration produces sharply inaccurate pseudo-labels and checking whether the final student accuracy falls below the teacher's zero-shot performance.
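Operationally the test reduces to a single comparison; a sketch, where both inputs are hypothetical measured AP values:

```python
def premise_falsified(student_ap: float, teacher_zero_shot_ap: float) -> bool:
    """True if distillation amplified pseudo-label errors badly enough
    that the final student fell below the teacher's zero-shot AP."""
    return student_ap < teacher_zero_shot_ap
```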
Original abstract
Foundation models deliver strong perception but are often too computationally heavy to deploy, and adapting them typically requires costly annotations. We introduce a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data, and instantiate it for instance segmentation where per-pixel labels are particularly expensive. The framework unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to our approach is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and more effectively leverage unlabeled images. On Cityscapes and ADE20K, our $\approx 11\times$ smaller student improves over its zero-shot VFM teacher(s) by +11.9 and +8.6 AP, surpasses adapted teacher(s) by +3.4 and +1.5 AP, and outperforms state-of-the-art SSKD methods on benchmarks.
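For orientation, here is a minimal runnable skeleton of the three-stage recipe the abstract describes, with per-pixel cross-entropy standing in for the full instance segmentation objectives. The tiny convolutional stand-ins, the 0.5 confidence threshold, and the loss weights are all assumptions, not the authors' settings, and the contrastive calibration term is elided.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-ins: the real teacher is a VFM and the student a compact instance
# segmenter; per-pixel cross-entropy here replaces the full mask losses.
teacher = nn.Conv2d(3, 8, kernel_size=1)
student = nn.Conv2d(3, 8, kernel_size=1)
opt_t = torch.optim.SGD(teacher.parameters(), lr=1e-3)
opt_s = torch.optim.SGD(student.parameters(), lr=1e-3)

labeled = [(torch.randn(2, 3, 32, 32), torch.randint(0, 8, (2, 32, 32)))]
unlabeled = [torch.randn(2, 3, 32, 32)]

def step(opt, loss):
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 1: adapt the teacher by self-training on confident pseudo-labels.
for x in unlabeled:
    logits = teacher(x)
    conf, pseudo = logits.detach().softmax(dim=1).max(dim=1)
    weight = (conf > 0.5).float()                  # assumed confidence filter
    ce = F.cross_entropy(logits, pseudo, reduction='none')
    step(opt_t, (ce * weight).mean())

# Stage 2: knowledge transfer with a unified multi-objective loss
# (pseudo-label CE plus soft logit distillation; weights are assumptions).
for x in unlabeled:
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)
    kd = F.kl_div(s_logits.log_softmax(dim=1), t_logits.softmax(dim=1),
                  reduction='batchmean')
    ce = F.cross_entropy(s_logits, t_logits.argmax(dim=1))
    step(opt_s, ce + 0.5 * kd)

# Stage 3: refine the student on the limited labeled set to wash out
# residual pseudo-label bias.
for x, y in labeled:
    step(opt_s, F.cross_entropy(student(x), y))
```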
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a three-stage semi-supervised knowledge distillation framework to compress vision foundation models into compact instance segmentation experts. Stage 1 performs domain adaptation of the teacher(s) via self-training with an instance-aware pixel-wise contrastive loss that fuses mask and class scores; stage 2 transfers knowledge through a unified multi-objective loss; stage 3 refines the student to mitigate residual pseudo-label bias. On Cityscapes and ADE20K the ~11× smaller student is reported to improve over zero-shot teachers by +11.9 and +8.6 AP, over adapted teachers by +3.4 and +1.5 AP, and to surpass prior SSKD methods.
Significance. If the central assumption holds, the work would demonstrate a practical route to adapting and distilling large VFMs for dense prediction under limited annotation budgets, with the cross-stage contrastive signal as a potentially reusable mechanism for controlling label noise.
major comments (2)
- §3.1 (Domain Adaptation): the claim that the instance-aware pixel-wise contrastive loss keeps pseudo-labels sufficiently accurate and unbiased to enable net student improvement is load-bearing for the +3.4 AP gain over adapted teachers on Cityscapes, yet the manuscript supplies no direct quantitative check (e.g., pseudo-label mAP or per-class precision on held-out ground truth) that label noise remains below the recovery threshold of the subsequent distillation stages.
- §4 (Experiments): the headline AP improvements are presented without error bars, multiple random seeds, or ablation on the free parameters (loss weighting coefficients, contrastive temperature, and margin), making it impossible to determine whether the reported margins over adapted teachers and SOTA SSKD baselines are robust or sensitive to hyper-parameter choice.
minor comments (2)
- Abstract: the phrase '≈11× smaller student' should be accompanied by explicit parameter counts or FLOPs for both teacher and student to permit immediate assessment of the compression ratio.
- §2 (Related Work): the discussion of prior SSKD methods could more explicitly contrast the proposed cross-stage contrastive calibration against existing pixel-wise or mask-aware contrastive losses.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the paper.
Point-by-point responses
Referee: §3.1 (Domain Adaptation): the claim that the instance-aware pixel-wise contrastive loss keeps pseudo-labels sufficiently accurate and unbiased to enable net student improvement is load-bearing for the +3.4 AP gain over adapted teachers on Cityscapes, yet the manuscript supplies no direct quantitative check (e.g., pseudo-label mAP or per-class precision on held-out ground truth) that label noise remains below the recovery threshold of the subsequent distillation stages.
Authors: We agree that direct quantitative validation of pseudo-label quality would make the contribution of the contrastive loss more transparent. In the revised manuscript we will add a new table reporting pseudo-label mAP (and per-class precision/recall) on a held-out portion of the labeled training data for the adapted teacher, comparing the full instance-aware contrastive loss against an ablation that removes the contrastive term. This will directly quantify the reduction in label noise and support the claim that the loss keeps pseudo-labels within the recovery range of the later stages. Revision: yes.
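A sketch of the kind of check this response proposes: per-class pixel precision/recall of pseudo-labels against held-out ground truth. The helper is hypothetical, and instance-level pseudo-label mAP would instead require COCO-style matching of predicted and ground-truth instances.

```python
import numpy as np

def pseudo_label_quality(pseudo, gt, num_classes, ignore_index=255):
    """Per-class pixel precision/recall of pseudo-labels vs. ground truth.
    pseudo, gt: integer label maps of identical shape (e.g. H x W arrays)."""
    valid = gt != ignore_index
    p, g = pseudo[valid], gt[valid]
    stats = {}
    for c in range(num_classes):
        tp = int(np.sum((p == c) & (g == c)))
        precision = tp / max(int(np.sum(p == c)), 1)
        recall = tp / max(int(np.sum(g == c)), 1)
        stats[c] = (precision, recall)
    return stats

# Example on random label maps (placeholder data, 8 classes):
rng = np.random.default_rng(0)
print(pseudo_label_quality(rng.integers(0, 8, (64, 64)),
                           rng.integers(0, 8, (64, 64)), num_classes=8))
```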
Referee: §4 (Experiments): the headline AP improvements are presented without error bars, multiple random seeds, or ablation on the free parameters (loss weighting coefficients, contrastive temperature, and margin), making it impossible to determine whether the reported margins over adapted teachers and SOTA SSKD baselines are robust or sensitive to hyper-parameter choice.
Authors: We acknowledge that the current experimental section lacks statistical robustness indicators. In the revision we will rerun the main Cityscapes and ADE20K experiments across five random seeds and report mean AP together with standard deviation. We will also add a dedicated ablation subsection that varies the loss-weighting coefficients, contrastive temperature, and margin over reasonable ranges, showing that the reported gains remain stable and that the chosen operating point is not an outlier. revision: yes
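A minimal sketch of the promised seed aggregation; the five AP values below are hypothetical placeholders, not results from the paper.

```python
import statistics

def summarize_seeds(ap_per_seed):
    """Mean AP and sample standard deviation across random seeds."""
    mean = statistics.mean(ap_per_seed)
    std = statistics.stdev(ap_per_seed) if len(ap_per_seed) > 1 else 0.0
    return mean, std

# Five hypothetical seed results (placeholders, not the paper's numbers):
mean_ap, std_ap = summarize_seeds([38.1, 38.4, 37.9, 38.6, 38.2])
print(f"AP = {mean_ap:.2f} +/- {std_ap:.2f}")
```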
Circularity Check
No circularity: empirical benchmark gains are measured outcomes, not reductions to fitted inputs or self-citations
Full rationale
The paper outlines a three-stage SSKD framework (domain adaptation via self-training with contrastive calibration, unified multi-objective distillation, and refinement) centered on an instance-aware pixel-wise contrastive loss. However, the headline claims consist of measured AP improvements on held-out Cityscapes and ADE20K test sets (+11.9, +8.6, +3.4, +1.5 AP), presented as experimental results rather than quantities derived by construction from parameters fitted inside the same chain. No equations, self-citations, or uniqueness theorems are invoked that would reduce the reported gains to tautological redefinitions of the inputs. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- loss weighting coefficients
- contrastive temperature and margin (collected in the config sketch after this ledger)
axioms (1)
- Domain assumption: Pseudo-labels generated by the adapted teacher are sufficiently accurate on unlabeled data to serve as supervision for the student.
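The ledger's free parameters, gathered into a config sketch with a sensitivity sweep of the kind the rebuttal proposes. All default values are illustrative assumptions, not the paper's reported settings.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class SSKDConfig:
    # Free parameters from the ledger; defaults are illustrative assumptions.
    lambda_task: float = 1.0      # supervised / pseudo-label loss weight
    lambda_kd: float = 0.5        # distillation loss weight
    lambda_con: float = 0.1       # contrastive loss weight
    temperature: float = 0.1      # contrastive (NT-Xent) temperature
    margin: float = 0.5           # inter-instance margin

# Grid over the contrastive free parameters:
sweep = [SSKDConfig(temperature=t, margin=m)
         for t, m in product([0.05, 0.1, 0.2], [0.25, 0.5, 1.0])]
print(len(sweep), "configurations")
```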
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "instance-aware pixel-wise contrastive loss that fuses mask and class scores... NT-Xent objective... Assumption 3.1 (Negative Sampling Guarantee)"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "three-stage pipeline: (1) domain adaptation via self-training with contrastive calibration, (2) unified multi-objective loss, (3) student refinement"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Inigo Alonso, Alberto Sabater, David Ferstl, Luis Montesano, and Ana C. Murillo. Semi-supervised semantic segmentation with pixel-level contrastive learning from a class-wise memory bank. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8219–8228, 2021.
- [2] Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Foundation models defining a new era in vision: a survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
- [3] Tariq Berrada, Camille Couprie, Karteek Alahari, and Jakob Verbeek. Guided distillation for semi-supervised instance segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 475–483, 2024.
- [4] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. Depth Pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024.
- [5] Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- [6] Paola Cascante-Bonilla, Fuwen Tan, Yanjun Qi, and Vicente Ordonez. Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6912–6920, 2021.
- [7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.
- [8] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33:22243–22255, 2020.
- [9] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649, 2021.
- [10] Xiaokang Chen, Yuhui Yuan, Gang Zeng, and Jingdong Wang. Semi-supervised semantic segmentation with cross pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2613–2622, 2021.
- [11] Xin Chen, Jie Hu, Xiawu Zheng, Jianghang Lin, Liujuan Cao, and Rongrong Ji. Depth-guided semi-supervised instance segmentation. arXiv preprint arXiv:2406.17413, 2024.
- [12] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022.
- [13] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [14] Jiawei Fan, Chao Li, Xiaolong Liu, Meina Song, and Anbang Yao. Augmentation-free dense contrastive knowledge distillation for efficient semantic segmentation. Advances in Neural Information Processing Systems, 36:51359–51370, 2023.
- [15] Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, and Zicheng Liu. SEED: Self-supervised distillation for visual representation. arXiv preprint arXiv:2101.04731, 2021.
- [16] Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, et al. Foundation models in robotics: Applications, challenges, and the future. The International Journal of Robotics Research, page 02783649241281508, 2023.
- [17] Kai Gan and Tong Wei. Erasing the bias: Fine-tuning foundation models for semi-supervised learning. arXiv preprint arXiv:2405.11756, 2024.
- [18] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
- [19] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [20] Jie Hu, Chen Chen, Liujuan Cao, Shengchuan Zhang, Annan Shu, Guannan Jiang, and Rongrong Ji. Pseudo-label alignment for semi-supervised instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16337–16347, 2023.
- [21] Junqiang Huang and Zichao Guo. Pixel-wise contrastive distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16359–16369, 2023.
- [22] Jinseong Jang, Chunfei Ma, and Byeongwon Lee. VL2Lite: Task-specific knowledge distillation from large vision-language models to lightweight networks. In 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 30073–30083, 2025.
- [23] Yuxuan Jiang, Chen Feng, Fan Zhang, and David Bull. MTKD: Multi-teacher knowledge distillation for image super-resolution. In European Conference on Computer Vision, pages 364–382. Springer, 2024.
- [24] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment Anything. arXiv:2304.02643, 2023.
- [25] Jungsoo Lee, Debasmit Das, Munawar Hayat, Sungha Choi, Kyuwoong Hwang, and Fatih Porikli. CustomKD: Customizing large vision foundation for edge model improvement via knowledge distillation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 25176–25186, 2025.
- [26] Pengchen Liang, Haishan Huang, Bin Pu, Jianguo Chen, Xiang Hua, Jing Zhang, Weibo Ma, Zhuangzhuang Chen, Yiwei Li, and Qing Chang. Task-specific knowledge distillation from the vision foundation model for enhanced medical image segmentation. arXiv preprint arXiv:2503.06976, 2025.
- [27] Jianghang Lin, Yilin Lu, Yunhang Shen, Chaoyang Zhu, Shengchuan Zhang, Liujuan Cao, and Rongrong Ji. Pseudo-label quality decoupling and correction for semi-supervised instance segmentation. arXiv preprint arXiv:2505.11075, 2025.
- [28] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.
- [29]
- [30] S. Lu, Y. Chen, Y. Chen, et al. General lightweight framework for vision foundation model supporting multi-task and multi-center medical image analysis. Nature Communications, 16:2097, 2025.
- [31] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [32] Jathushan Rajasegaran, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Mubarak Shah. Self-supervised knowledge distillation for few-shot learning. arXiv preprint arXiv:2006.09785, 2020.
- [33] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.
- [34] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
- [35] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, et al. Grounded SAM: Assembling open-world models for diverse visual tasks. arXiv preprint arXiv:2401.14159, 2024.
- [36] Changyong Shu, Yifan Liu, Jianfei Gao, Zheng Yan, and Chunhua Shen. Channel-wise knowledge distillation for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5311–5320, 2021.
- [37] Redwan Sony, Parisa Farmanifard, Arun Ross, and Anil K. Jain. Foundation versus domain-specific models: Performance comparison, fusion, and explainability in face recognition. arXiv preprint arXiv:2507.03541, 2025.
- [38] Ximeng Sun, Pengchuan Zhang, Peizhao Zhang, Hardik Shah, Kate Saenko, and Xide Xia. DIME-FM: Distilling multimodal and efficient foundation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15521–15533, 2023.
- [39] Ximeng Tao, Pardis Taghavi, Dimitar Filev, Reza Langari, and Gaurav Pandey. Navidrivevlm: Decoupling high-level reasoning and motion planning for autonomous driving. arXiv preprint arXiv:2603.07901, 2026.
- [40] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems, 30, 2017.
- [41] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. arXiv preprint arXiv:1910.10699, 2019.
- [42] Raviteja Vemulapalli, Hadi Pouransari, Fartash Faghri, Sachin Mehta, Mehrdad Farajtabar, Mohammad Rastegari, and Oncel Tuzel. Knowledge transfer from vision foundation models for efficient training of small task-specific models. ICML, 2024.
- [43] Haoxiang Wang, Pavan Kumar Anasosalu Vasu, Fartash Faghri, Raviteja Vemulapalli, Mehrdad Farajtabar, Sachin Mehta, Mohammad Rastegari, Oncel Tuzel, and Hadi Pouransari. SAM-CLIP: Merging vision foundation models towards semantic and spatial understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3635–3647, 2024.
- [44] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3024–3033, 2021.
- [45] Xuehui Wang, Kai Zhao, Ruixin Zhang, Shouhong Ding, Yan Wang, and Wei Shen. ContrastMask: Contrastive learning to segment every thing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11604–11613, 2022.
- [46] Enze Xie, Jian Ding, Wenhai Wang, Xiaohang Zhan, Hang Xu, Peize Sun, Zhenguo Li, and Ping Luo. DetCo: Unsupervised contrastive learning for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8392–8401, 2021.
- [47] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687–10698, 2020.
- [48] Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16684–16693, 2021.
- [49] Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, and Tianyi Zhou. A survey on knowledge distillation of large language models. arXiv preprint arXiv:2402.13116, 2024.
- [50] Xu Yan, Haiming Zhang, Yingjie Cai, Jingming Guo, Weichao Qiu, Bin Gao, Kaiqiang Zhou, Yue Zhao, Huan Jin, Jiantao Gao, et al. Forging vision foundation models for autonomous driving: Challenges, methodologies, and opportunities. arXiv preprint arXiv:2401.08045, 2024.
- [51] Chuanguang Yang, Helong Zhou, Zhulin An, Xue Jiang, Yongjun Xu, and Qian Zhang. Cross-image relational knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12319–12328, 2022.
- [52] Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, and Yongjun Xu. CLIP-KD: An empirical study of CLIP model distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15952–15962, 2024.
- [53] Chuanguang Yang, Xinqiang Yu, Han Yang, Zhulin An, Chengqing Yu, Libo Huang, and Yongjun Xu. Multi-teacher knowledge distillation with reinforcement learning for visual recognition. arXiv preprint arXiv:2502.18510, 2025.
- [54] Lihe Yang, Lei Qi, Litong Feng, Wayne Zhang, and Yinghuan Shi. Revisiting weak-to-strong consistency in semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7236–7246, 2023.
- [55] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2. arXiv preprint arXiv:2406.09414, 2024.
- [56] Lewei Yao, Renjie Pi, Hang Xu, Wei Zhang, Zhenguo Li, and Tong Zhang. G-DetKD: Towards general distillation framework for object detectors via contrastive and semantic-guided feature imitation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3591–3600, 2021.
- [57] Heeji Yoon, Heeseong Shin, Eunbeen Hong, Hyunwook Choi, Hansang Cho, Daun Jeong, and Seungryong Kim. S^4M: Boosting semi-supervised instance segmentation with SAM. arXiv preprint arXiv:2504.05301, 2025.
- [58] Haobo Yuan, Xiangtai Li, Tao Zhang, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, and Ming-Hsuan Yang. Sa2VA: Marrying SAM2 with LLaVA for dense grounded understanding of images and videos. arXiv preprint arXiv:2501.04001, 2025.
- [59] Yitian Zhang, Xu Ma, Yue Bai, Huan Wang, and Yun Fu. Accessing vision foundation models via ImageNet-1K. In The Thirteenth International Conference on Learning Representations, 2025.
- [60] Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, and Haian Huang. An open and comprehensive pipeline for unified object grounding and detection. arXiv preprint arXiv:2401.02361, 2024.
- [61] Yuanyi Zhong, Bodi Yuan, Hong Wu, Zhiqiang Yuan, Jian Peng, and Yu-Xiong Wang. Pixel contrastive-consistent semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7273–7282, 2021.
- [62] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision, 127:302–321, 2019.
- [63] Jinguo Zhu, Shixiang Tang, Dapeng Chen, Shijie Yu, Yakun Liu, Mingzhe Rong, Aijun Yang, and Xiaohua Wang. Complementary relation contrastive distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9260–9269, 2021.
- [64] Weiming Zhuang, Chen Chen, Zhizhong Li, Sina Sajadmanesh, Jingtao Li, Jiabo Huang, Vikash Sehwag, Vivek Sharma, Hirotaka Shinozaki, Felan Carlo Garcia, et al. Argus: A compact and versatile foundation model for vision. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4418–4429, 2025.