The Surprising Effectiveness of Canonical Knowledge Distillation for Semantic Segmentation

Kevin Alexander Laube; Lukas Schott; Madan Ravi Ganesh; Muhammad Ali; Niclas Popp; Thomas Brox

arxiv: 2604.25530 · v2 · submitted 2026-04-28 · 💻 cs.CV · cs.AI

Muhammad Ali, Kevin Alexander Laube, Madan Ravi Ganesh, Lukas Schott, Niclas Popp, Thomas Brox

Pith reviewed 2026-05-07 16:53 UTC · model grok-4.3

keywords knowledge distillation · semantic segmentation · Cityscapes · ADE20K · ResNet · model compression · wall-clock compute · distillation objectives

The pith

Canonical knowledge distillation outperforms recent complex methods for semantic segmentation under equal wall-clock training time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that complex knowledge distillation techniques developed specifically for semantic segmentation often underperform simpler, canonical methods when training budgets are compared fairly using wall-clock time rather than fixed iteration counts. This is because the newer methods add significant per-iteration computational overhead. When budgets are matched this way, standard logit-based and feature-based distillation prove superior. With extended training, feature-based distillation sets new records for lightweight ResNet-18 models on Cityscapes and ADE20K, allowing a student model with one-quarter the parameters to reach 99 percent of a large teacher's accuracy on Cityscapes.

Core claim

Under matched wall-clock compute, canonical logit- and feature-based knowledge distillation outperform recent segmentation-specific distillation methods. With extended training, feature-based distillation achieves state-of-the-art performance for ResNet-18 on Cityscapes and ADE20K. A PSPNet with ResNet-18 student reaches 79.0 mIoU on Cityscapes (99% of its ResNet-101 teacher's 79.8) and 92% relative performance on ADE20K despite using only one quarter the parameters.
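The relative-performance figure quoted above is a simple ratio of student to teacher mIoU. A minimal sketch using only the two Cityscapes values stated in the text; the function name `relative_performance` is ours, for illustration. The ADE20K absolute scores are not given here, so only Cityscapes is computed.

```python
def relative_performance(student_miou: float, teacher_miou: float) -> float:
    """Student mIoU as a percentage of teacher mIoU."""
    return 100.0 * student_miou / teacher_miou


# Cityscapes values quoted in the text: student 79.0, teacher 79.8.
cityscapes_relative = relative_performance(79.0, 79.8)
cityscapes_gap = 79.8 - 79.0  # absolute gap in mIoU points

print(f"relative: {cityscapes_relative:.1f}%  gap: {cityscapes_gap:.1f} mIoU")
```

The ratio rounds to 99%, which corresponds to an absolute gap of 0.8 mIoU.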

What carries the argument

Wall-clock time matching for training budget comparison, applied to standard logit and feature distillation to fairly assess their effectiveness against complex custom objectives.

If this is right

Equal iteration counts do not equate to equal training effort when methods have different per-step costs.
Feature-based distillation can produce state-of-the-art lightweight semantic segmentation models with sufficient training time.
Small student models can closely approach the performance of much larger teachers using basic distillation techniques.
Future method design in knowledge distillation should emphasize scaling training compute over creating elaborate task-specific losses.
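The first point above can be made concrete with a small calculation: given a fixed wall-clock budget, a method with a higher per-step cost gets fewer iterations. This is a sketch under assumed numbers; the per-iteration timings and the 40,000-iteration baseline below are hypothetical and are not taken from the paper, and `matched_iterations` is an illustrative name.

```python
import math


def matched_iterations(baseline_iters: int,
                       baseline_sec_per_iter: float,
                       method_sec_per_iter: float) -> int:
    """Iterations a method can run within the baseline's wall-clock budget."""
    budget_seconds = baseline_iters * baseline_sec_per_iter
    return math.floor(budget_seconds / method_sec_per_iter)


# Hypothetical timings: a canonical KD step takes 0.50 s,
# a heavier segmentation-specific objective takes 0.75 s.
canonical_iters = matched_iterations(40_000, 0.50, 0.50)
complex_iters = matched_iterations(40_000, 0.50, 0.75)

print(canonical_iters, complex_iters)  # prints "40000 26666"
```

Read the other way, running both methods for 40,000 iterations would give the heavier objective 1.5 times the wall-clock budget under these assumed timings.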

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Re-evaluating other vision tasks with compute-matched comparisons could reveal similar patterns favoring simpler methods.
Practitioners may achieve better results by extending training of basic distillation rather than adopting the latest complex techniques.
This finding suggests that the overhead of custom objectives may not be justified unless they provide substantial gains at equal compute.

Load-bearing premise

Wall-clock time measurements accurately represent the true training costs without being skewed by implementation differences or hardware variations across methods.

What would settle it

A re-evaluation where the complex methods are re-implemented and optimized to have minimal additional per-iteration cost, then compared again under strict wall-clock budgets to check if they still lag behind canonical KD.

Figures

Figures reproduced from arXiv: 2604.25530 by Kevin Alexander Laube, Lukas Schott, Madan Ravi Ganesh, Muhammad Ali, Niclas Popp, Thomas Brox.

**Figure 1.** Iteration-matched vs. compute-matched comparison.

**Figure 2.** mIoU vs training budget on Cityscapes under long

**Figure 3.** Semi-supervised KD on Cityscapes with 20k additional

Original abstract

Recent knowledge distillation (KD) methods for semantic segmentation introduce increasingly complex hand-crafted objectives, yet are typically evaluated under fixed iteration schedules. These objectives substantially increase per-iteration cost, meaning equal iteration counts do not correspond to equal training budgets. It is therefore unclear whether reported gains reflect stronger distillation signals or simply greater compute. We show that iteration-based comparisons are misleading: when wall-clock compute is matched, canonical logit- and feature-based KD outperform recent segmentation-specific methods. Under extended training, feature-based distillation achieves state-of-the-art ResNet-18 performance on Cityscapes and ADE20K. A PSPNet ResNet-18 student closely approaches its ResNet-101 teacher despite using only one quarter of the parameters, reaching 99% of the teacher's mIoU on Cityscapes (79.0 vs 79.8) and 92% on ADE20K. Our results challenge the prevailing assumption that KD for segmentation requires task-specific mechanisms and suggest that scaling, rather than complex hand-crafted objectives, should guide future method design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Canonical KD holds its own or beats complex segmentation-specific methods once wall-clock time is matched, and gets close to the teacher with extended training.

The main thing to know is that this paper shows basic logit and feature distillation can match or exceed recent hand-crafted KD losses for semantic segmentation when you equalize actual training compute instead of just iteration count. Under matched wall-clock budgets the simple baselines win, and with longer training a ResNet-18 student reaches 79.0 mIoU on Cityscapes (99% of the ResNet-101 teacher's 79.8) and 92% of teacher performance on ADE20K. That is a concrete empirical result worth paying attention to for anyone building efficient models.

The authors correctly flag that complex objectives raise per-iteration cost, so prior fixed-iteration comparisons were biased toward the newer methods. By re-running the comparisons with time-matched budgets they expose that gap and show scaling training time can be more effective than adding task-specific machinery. The numbers on the PSPNet ResNet-18 student are solid and directly useful for practitioners who want strong segmentation without heavy teachers.

The soft spot is the compute-matching protocol itself. If the complex baselines had their iteration counts cut to match wall-clock but their poly learning-rate schedules were not adjusted proportionally, those methods effectively saw a shorter training horizon, which would inflate the apparent win for canonical KD. The paper needs to report exact per-method iteration counts, the precise timing measurement setup, and confirmation that schedules were scaled. Any unaccounted differences in data loading or GPU efficiency between re-implementations would also matter.

This is aimed at computer vision researchers working on distillation or efficient segmentation. It is a useful corrective on evaluation practices rather than a new framework. The central claim is plausible from the abstract and the reported numbers, so it deserves a serious referee who can check the experimental controls in detail. I would send it to review with a request for those schedule and timing details.
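The schedule concern raised in the letter can be illustrated with the usual polynomial-decay formula. This is a sketch under assumptions: the power of 0.9 and the iteration counts are hypothetical choices for illustration, the base learning rate of 0.01 is the value quoted in the authors' rebuttal, and `poly_lr` is our own helper, not code from the paper.

```python
def poly_lr(iteration: int, max_iters: int, base_lr: float = 0.01,
            power: float = 0.9, end_lr: float = 0.0) -> float:
    """Polynomial learning-rate decay from base_lr to end_lr over max_iters."""
    progress = min(iteration, max_iters) / max_iters
    return (base_lr - end_lr) * (1.0 - progress) ** power + end_lr


# Hypothetical case: a method's budget is cut from 40,000 to 25,000
# iterations to match wall-clock time.
stop_at = 25_000

# Schedule still tied to the original horizon: training stops mid-decay.
lr_unscaled = poly_lr(stop_at, max_iters=40_000)

# Schedule rescaled to the reduced horizon: training ends fully decayed.
lr_rescaled = poly_lr(stop_at, max_iters=stop_at)

print(f"unscaled: {lr_unscaled:.5f}  rescaled: {lr_rescaled:.5f}")
```

With the unscaled schedule the run ends while the learning rate is still around 40% of its initial value, so the final low-learning-rate phase never happens. Rescaling `max_iters` to the reduced budget restores the full decay.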

Referee Report

2 major / 2 minor

Summary. The paper claims that recent knowledge distillation methods for semantic segmentation rely on complex hand-crafted objectives evaluated under fixed iteration schedules, which fail to account for their higher per-iteration costs and thus do not reflect equal training budgets. When comparisons are instead performed under matched wall-clock compute, canonical logit- and feature-based KD outperform these methods. With extended training, feature-based distillation reaches state-of-the-art ResNet-18 performance on Cityscapes and ADE20K; a PSPNet ResNet-18 student achieves 99% of its ResNet-101 teacher's mIoU on Cityscapes (79.0 vs. 79.8) and 92% on ADE20K while using only one-quarter the parameters.

Significance. If the wall-clock-matched results hold, the work is significant because it provides concrete empirical counter-evidence to the trend of increasingly elaborate task-specific KD losses in semantic segmentation. It demonstrates that simple canonical approaches, when given comparable compute, are competitive or superior, and that scaling training budget can yield near-teacher performance with compact students. This could redirect research emphasis from loss engineering toward reproducible training protocols and efficiency, with direct implications for deploying accurate segmentation models on resource-constrained hardware.

major comments (2)

[§4] §4 (Experimental Protocol): The central claim that canonical KD outperforms complex methods under matched wall-clock time is load-bearing and requires explicit documentation of per-method iteration counts, the precise timing methodology (including how per-iteration overheads for complex losses were measured), and confirmation that learning-rate schedules (commonly polynomial decay tied to max-iterations) were scaled when iteration budgets were reduced to equalize wall-clock time. Without these details and a table listing matched times and adjusted schedules, it remains possible that baselines received effectively shorter training horizons, undermining the fairness of the comparison.
[Table 2] Table 2 / Cityscapes results: The reported 79.0 mIoU for the PSPNet ResNet-18 student (99% of the ResNet-101 teacher's 79.8) is a key supporting result, yet the paper must clarify whether the student and teacher were trained with identical data augmentations, batch sizes, and optimizer settings or whether the student benefited from additional hyperparameter search; any asymmetry would weaken the interpretation that canonical KD alone enables near-teacher performance at 1/4 parameter count.

minor comments (2)

[§3] The abstract and §3 should include a short, explicit definition or loss equation for the 'canonical logit-based' and 'feature-based' KD baselines used, to make the contrast with complex methods immediately clear to readers unfamiliar with the segmentation KD literature.
[Figure 1] Figure 1 and associated captions would benefit from annotating the exact wall-clock times at which each method was evaluated, rather than only iteration counts, to visually reinforce the matched-compute protocol.
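For readers who want the kind of definition the first minor comment asks for, the two canonical objectives are commonly written as a temperature-softened KL divergence on logits (in the style of Hinton et al.) and a mean squared error on intermediate features (in the style of FitNets). The sketch below shows those textbook forms for a single pixel in plain Python. It is an assumption that the paper uses these forms; its exact formulations, loss weights, and any projection layer for aligning feature channels are not given in this review.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over class logits at one pixel."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    return temperature ** 2 * kl


def feature_kd_loss(student_feat, teacher_feat):
    """Mean squared error between channel-aligned feature vectors."""
    diffs = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(diffs) / len(diffs)


# Toy values for one pixel with three classes and a two-channel feature.
kd_logit = logit_kd_loss([1.0, 0.5, -0.5], [2.0, 0.0, -1.0], temperature=2.0)
kd_feature = feature_kd_loss([0.0, 0.0], [1.0, 1.0])
print(kd_logit, kd_feature)
```

In a segmentation setting both losses would be averaged over all pixels (or feature-map locations) and added to the usual cross-entropy term with a weighting factor.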

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below and will incorporate the requested clarifications into the revised version to strengthen the experimental protocol description.

Referee: [§4] §4 (Experimental Protocol): The central claim that canonical KD outperforms complex methods under matched wall-clock time is load-bearing and requires explicit documentation of per-method iteration counts, the precise timing methodology (including how per-iteration overheads for complex losses were measured), and confirmation that learning-rate schedules (commonly polynomial decay tied to max-iterations) were scaled when iteration budgets were reduced to equalize wall-clock time. Without these details and a table listing matched times and adjusted schedules, it remains possible that baselines received effectively shorter training horizons, undermining the fairness of the comparison.

Authors: We agree that more explicit documentation is needed for full reproducibility. In the revision we will expand §4 with a dedicated paragraph on the timing methodology: we measured per-iteration wall-clock time on the same hardware (NVIDIA V100 GPUs) by averaging 100 forward-backward passes after warm-up, separately for each loss function; complex losses were timed with their full overhead included. We will add a table (new Table 3) that reports, for every method, the original iteration count, the wall-clock-matched iteration count, the measured per-iteration time, and the resulting total training time. We will also confirm that the polynomial learning-rate schedule was rescaled by adjusting the maximum iteration count while keeping the power and end-learning-rate unchanged, ensuring the effective training horizon remains comparable. revision: yes
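The timing procedure described in this response (warm-up, then an average over 100 forward-backward passes) can be sketched as a small harness. The warm-up length of 10 steps and the helper name `mean_step_time` are assumptions for illustration; the response does not state them. The dummy step stands in for a real training step.

```python
import time


def mean_step_time(step_fn, warmup: int = 10, measured: int = 100) -> float:
    """Average wall-clock seconds per call of step_fn, after warm-up."""
    for _ in range(warmup):
        step_fn()  # excluded from timing (caches, allocator, autotuning)
    start = time.perf_counter()
    for _ in range(measured):
        step_fn()
    # On a GPU, the device should be synchronized before reading the clock
    # (for example torch.cuda.synchronize() in PyTorch); omitted here
    # because this sketch runs on the CPU only.
    return (time.perf_counter() - start) / measured


call_log = []


def dummy_training_step():
    """Placeholder for one forward-backward pass including the KD loss."""
    call_log.append(sum(i * i for i in range(1000)))


seconds_per_iter = mean_step_time(dummy_training_step)
print(f"{seconds_per_iter * 1e3:.3f} ms per iteration")
```

Timing each loss function separately with the same harness on the same hardware gives the per-iteration costs needed to convert a wall-clock budget into per-method iteration counts.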
Referee: [Table 2] Table 2 / Cityscapes results: The reported 79.0 mIoU for the PSPNet ResNet-18 student (99% of the ResNet-101 teacher's 79.8) is a key supporting result, yet the paper must clarify whether the student and teacher were trained with identical data augmentations, batch sizes, and optimizer settings or whether the student benefited from additional hyperparameter search; any asymmetry would weaken the interpretation that canonical KD alone enables near-teacher performance at 1/4 parameter count.

Authors: The student and teacher were trained under identical conditions. Both used the same data-augmentation pipeline (random scale [0.5,2.0], random crop 512×1024, horizontal flip), batch size of 8, SGD optimizer with momentum 0.9 and weight decay 5e-4, and the same initial learning rate of 0.01 with polynomial decay. No separate hyperparameter search was conducted for the student; all settings were taken directly from the teacher training protocol described in §4. We will add an explicit statement to this effect in the caption of Table 2 and in the experimental-setup paragraph to remove any ambiguity. revision: yes
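The shared settings listed in this response can be collected in one configuration object, which makes the "identical conditions" claim easy to check mechanically. This is only a sketch: the class and field names are ours, and reading the crop as (height, width) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Training settings as listed in the authors' response."""
    scale_range: tuple = (0.5, 2.0)   # random scale
    crop_size: tuple = (512, 1024)    # assumed to mean (height, width)
    horizontal_flip: bool = True
    batch_size: int = 8
    optimizer: str = "sgd"
    momentum: float = 0.9
    weight_decay: float = 5e-4
    base_lr: float = 0.01
    lr_schedule: str = "poly"


teacher_cfg = TrainConfig()
student_cfg = TrainConfig()  # no separate hyperparameter search

print(student_cfg == teacher_cfg)  # prints "True"
```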

Circularity Check

0 steps flagged

No circularity: purely empirical comparison on external benchmarks

The paper reports direct experimental results comparing knowledge distillation methods on Cityscapes and ADE20K using standard metrics (mIoU) under wall-clock-matched training budgets. No mathematical derivation, ansatz, or fitted parameter is defined in terms of the target result. All claims rest on re-implemented baselines and measured runtimes rather than any self-referential reduction. Self-citations, if present, are not load-bearing for the central empirical finding.
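Since every result here is reported in mIoU, a short reminder of how the metric is computed may help. The sketch below uses the standard definition (per-class intersection over union from a confusion matrix, averaged over classes that occur); the toy matrix is made up for illustration and the evaluation code used in the paper is not shown in this review.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix (rows: truth, cols: prediction)."""
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        true_pos = confusion[c][c]
        false_neg = sum(confusion[c]) - true_pos
        false_pos = sum(confusion[r][c] for r in range(num_classes)) - true_pos
        union = true_pos + false_pos + false_neg
        if union > 0:  # skip classes absent from both truth and prediction
            ious.append(true_pos / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example: IoU is 3/5 for class 0 and 5/7 for class 1.
toy_confusion = [[3, 1],
                 [1, 5]]
toy_miou = mean_iou(toy_confusion)
print(f"{100 * toy_miou:.1f}")  # prints "65.7"
```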

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a purely empirical study that relies on standard supervised training, common benchmarks, and existing KD formulations without introducing new free parameters, axioms, or postulated entities.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

[1] Lucas Beyer, Xiaohua Zhai, Amélie Royer, Larisa Markeeva, Rohan Anil, and Alexander Kolesnikov. Knowledge distillation: A good teacher is patient and consistent. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10925–10934, 2022.

[2] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022.

[3] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.

[4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

[5] Zhiwei Hao, Jianyuan Guo, Kai Han, Han Hu, Chang Xu, and Yunhe Wang. Revisit the power of vanilla knowledge distillation: from small scale to large scale. In Neural Information Processing Systems, 2023.

[6] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2015.

[7] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[8] Liyang Liu, Zihan Wang, Minh Hieu Phan, Bowen Zhang, Jinchao Ge, and Yifan Liu. Bpkd: Boundary privileged knowledge distillation for semantic segmentation. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 1062–1072, 2024.

[9] Tao Liu, Chenshu Chen, Xi Yang, and Wenming Tan. Rethinking knowledge distillation with raw features for semantic segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1155–1164, 2024.

[10] Yifan Liu, Ke Chen, Chris Liu, Zengchang Qin, Zhenbo Luo, and Jingdong Wang. Structured knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2604–2613, 2019.

[11] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

[12] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3967–3976, 2019.

[13] Baoyun Peng, Xiao Jin, Jiaheng Liu, Dongsheng Li, Yichao Wu, Yu Liu, Shunfeng Zhou, and Zhaoning Zhang. Correlation congruence for knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5007–5016, 2019.

[14] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. In International Conference on Learning Representations (ICLR), 2015.

[15] Changyong Shu, Yifan Liu, Jianfei Gao, Zheng Yan, and Chunhua Shen. Channel-wise knowledge distillation for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5311–5320,

[16] Rich Sutton. The bitter lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html, 2019.

[17] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In International Conference on Learning Representations (ICLR), 2020.

[18] Chuanguang Yang, Helong Zhou, Zhulin An, Xue Jiang, Yongjun Xu, and Qian Zhang. Cross-image relational knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12319–12328, 2022.

[19] Zhendong Yang, Zhe Li, Mingqi Shao, Dachuan Shi, Zehuan Yuan, and Chun Yuan. Masked generative distillation. In European Conference on Computer Vision, pages 53–69. Springer, 2022.

[20] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.

[21] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 633–641,
