arxiv: 2606.04595 · v1 · pith:KWMXQGIMnew · submitted 2026-06-03 · 📡 eess.IV

KD-NVC: A Search-and-Distill Framework to Accelerate Neural Video Coding

Pith reviewed 2026-06-28 04:14 UTC · model grok-4.3

classification 📡 eess.IV
keywords neural video coding, knowledge distillation, neural architecture search, model acceleration, rate-distortion performance, video compression, edge device decoding, feature energy sparsity

The pith

A two-stage search-and-distill pipeline produces lightweight neural video codecs that decode 1080p video at 69 FPS while keeping rate-distortion performance comparable to VTM-LDB.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural video coding achieves strong compression yet remains too slow for real-time use on edge devices because of high model complexity. Existing knowledge distillation approaches fall short for these codecs for two reasons: the modules are heterogeneous, so uniform reduction is suboptimal, and the rate constraint induces sparse feature-energy patterns that must be retained for good compression. The paper presents KD-NVC, which first runs an acceleration-efficiency-based neural architecture search (AE-NAS) to allocate the acceleration budget across modules without training every candidate, then applies an energy-aware feature distillation (EFD) loss that aligns spatially-aggregated feature-energy signatures to pass the sparsity patterns to the student. The resulting models decode 1080p content at 69 frames per second on an RTX 5060 while keeping rate-distortion performance comparable to the original teacher and to the VTM-LDB anchor.

Core claim

The authors establish that an acceleration-efficiency-based neural architecture search can identify per-module student architectures by exploring module-wise Pareto frontiers, using an acceleration-efficiency metric to avoid exhaustive training. An energy-aware feature distillation loss that aligns spatially-aggregated feature-energy signatures then transfers the rate-induced sparsity patterns to the student. Together, the two stages let student codecs reach real-time decoding speeds with rate-distortion performance comparable to the teacher and to VTM-LDB.

What carries the argument

The acceleration-efficiency-based neural architecture search (AE-NAS) that determines module-wise architectures via Pareto frontiers and an acceleration-efficiency metric, together with the energy-aware feature distillation (EFD) loss that aligns spatially-aggregated feature-energy signatures.
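As a rough illustration of the search stage described above, the following Python sketch filters each module's candidates down to a Pareto frontier and then picks the combination with the best efficiency score under a latency budget. The paper's actual metric is not given in the text available here, so `acceleration_efficiency` (relative latency saved per unit of estimated RD penalty), the `Candidate` fields, the module names, and all numbers are assumptions made for illustration only.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Candidate:
    """One reduced variant of a single module (all values are hypothetical)."""
    name: str
    latency_ms: float  # decoding latency of this module variant
    rd_penalty: float  # estimated RD degradation vs. the teacher module, in percent


def pareto_front(cands):
    """Keep candidates that no other candidate beats on both latency and RD penalty."""
    front = []
    for c in cands:
        dominated = any(
            o.latency_ms <= c.latency_ms
            and o.rd_penalty <= c.rd_penalty
            and (o.latency_ms < c.latency_ms or o.rd_penalty < c.rd_penalty)
            for o in cands
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.latency_ms)


def acceleration_efficiency(combo, teacher_latency_ms, eps=1e-6):
    """Assumed metric: fraction of teacher latency saved per unit of summed RD penalty."""
    latency = sum(c.latency_ms for c in combo)
    penalty = sum(c.rd_penalty for c in combo)
    saved = (teacher_latency_ms - latency) / teacher_latency_ms
    return saved / (penalty + eps)


def select_student(modules, teacher_latency_ms, latency_budget_ms):
    """Choose one Pareto point per module, maximizing efficiency within the budget."""
    fronts = [pareto_front(cands) for cands in modules.values()]
    feasible = [
        combo for combo in product(*fronts)
        if sum(c.latency_ms for c in combo) <= latency_budget_ms
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda combo: acceleration_efficiency(combo, teacher_latency_ms))


# Hypothetical two-module codec; estimates stand in for fully trained candidates.
modules = {
    "motion": [
        Candidate("motion-full", 6.0, 0.0),
        Candidate("motion-half", 3.0, 1.0),
        Candidate("motion-bad", 4.0, 3.0),  # dominated by motion-half
    ],
    "decoder": [
        Candidate("dec-full", 10.0, 0.0),
        Candidate("dec-half", 6.0, 2.0),
        Candidate("dec-quarter", 4.0, 6.0),
    ],
}
best = select_student(modules, teacher_latency_ms=16.0, latency_budget_ms=10.0)
print([c.name for c in best])  # ['motion-half', 'dec-half']
```

In this toy setup the search prefers a moderate cut to both modules over an aggressive cut to one, which is the kind of per-module allocation the summary attributes to AE-NAS. The sketch enumerates all combinations of Pareto points, which is only practical for small search spaces.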

If this is right

  • The KD-NVC framework outperforms prior codec-oriented distillation methods on rate-distortion-speed trade-offs.
  • The resulting student models reach 69 FPS decoding for 1080p video on an RTX 5060.
  • Rate-distortion performance stays comparable to both the original teacher model and the VTM-LDB anchor.
  • Module-wise rather than uniform architecture reduction yields better overall efficiency for heterogeneous NVC sub-modules.
  • The two-stage separation of search and distillation lowers the cost of finding suitable lightweight architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same energy-signature alignment could be tested on neural image codecs or point-cloud codecs that also operate under rate constraints.
  • The acceleration-efficiency metric could be adapted to target different hardware platforms by changing the cost model inside the search.
  • If the sparsity transfer holds, similar distillation losses might improve other compression-aware student models without requiring full retraining.
  • Measuring energy signatures on intermediate feature maps from different video content classes would test whether the transferred patterns generalize beyond the training distribution.

Load-bearing premise

Aligning the spatially-aggregated feature-energy signatures between teacher and student transfers the rate-constraint-induced sparsity patterns that are necessary to preserve compression performance.
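To make the premise concrete, here is a minimal Python sketch of what aligning such signatures could look like. It is not the paper's EFD loss, whose exact form is not given in the text available here. The sketch assumes that the signature is the per-channel mean of squared activations over all spatial positions, normalized to sum to one, that teacher and student features have the same channel count at the distillation point, and that the mismatch is penalized with a mean squared error. The function names are hypothetical.

```python
def energy_signature(feat):
    """Per-channel energy share of a C x H x W feature given as nested lists.

    Assumption: energy is the mean squared activation over all spatial
    positions of a channel, normalized so the signature sums to 1.
    """
    energies = []
    for channel in feat:
        values = [v for row in channel for v in row]
        energies.append(sum(v * v for v in values) / len(values))
    total = sum(energies)
    if total == 0.0:
        return [0.0] * len(energies)
    return [e / total for e in energies]


def efd_loss_sketch(teacher_feat, student_feat):
    """Mean squared difference between teacher and student energy signatures."""
    t = energy_signature(teacher_feat)
    s = energy_signature(student_feat)
    if len(t) != len(s):
        raise ValueError("teacher and student must have the same channel count")
    return sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)


# Toy 2-channel, 1x2 features: the teacher puts all energy in channel 0.
teacher = [[[2.0, 0.0]], [[0.0, 0.0]]]
sparse_student = [[[1.0, 0.0]], [[0.0, 0.0]]]  # same sparsity pattern, smaller scale
dense_student = [[[1.0, 1.0]], [[1.0, 1.0]]]   # energy spread evenly

print(efd_loss_sketch(teacher, sparse_student))  # 0.0
print(efd_loss_sketch(teacher, dense_student))   # 0.25
```

Because the signature is normalized, this variant only penalizes how energy is distributed across channels, not its absolute scale. Note that averaging over space discards where in the frame the energy sits, which is the concern the referee report raises about local rate-allocation details.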

What would settle it

Running the student architecture both with and without the EFD loss on the same training data and measuring whether the version without EFD shows a clear increase in rate-distortion cost or a mismatch in measured feature-energy sparsity on a held-out test set.
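One way to quantify the sparsity mismatch in such a comparison is sketched below in Python. It uses Hoyer's sparseness measure on per-channel energies, which is a stand-in chosen for this illustration and is not taken from the paper. The energy values are made up, and `sparsity_gap` is a hypothetical helper.

```python
import math


def hoyer_sparsity(energies):
    """Hoyer sparseness: 1.0 for a single non-zero entry, 0.0 for a uniform vector."""
    n = len(energies)
    if n < 2:
        raise ValueError("need at least two entries")
    l1 = sum(abs(e) for e in energies)
    l2 = math.sqrt(sum(e * e for e in energies))
    if l2 == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


def sparsity_gap(teacher_energies, student_energies):
    """Absolute difference in sparseness between teacher and student."""
    return abs(hoyer_sparsity(teacher_energies) - hoyer_sparsity(student_energies))


# Made-up per-channel energies measured on a held-out clip.
teacher_energies = [4.0, 0.0, 0.0, 0.0]
student_with_efd = [3.0, 1.0, 0.0, 0.0]
student_without_efd = [1.0, 1.0, 1.0, 1.0]

gap_with = sparsity_gap(teacher_energies, student_with_efd)
gap_without = sparsity_gap(teacher_energies, student_without_efd)
print(gap_with < gap_without)  # True
```

In a real ablation the two students would share the same architecture and training data, and the gap would be reported alongside the rate-distortion cost rather than in place of it.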

Figures

Figures reproduced from arXiv: 2606.04595 by Chao Yao, Hui Xiang, Jian Jin, Jingran Wu, Meiqin Liu, Weisi Lin, Xianguo Zhang, Yao Zhao, Yuxiao Sun.

Figure 1: Acceleration and rate-distortion performance of compared methods.
Figure 2: Three key observations about using NAS and KD on video coding.
Figure 3: Pipeline of the proposed KD-NVC framework, which contains two stages. First, the architecture of the student codec
Figure 4: Decoding speed-up and rate-distortion performance degradation of
Figure 5: All architecture-level candidates in the final search space
Figure 6: Theoretical complexity (kMACs/pixel) versus practical latency (ms)
Figure 7: Left: Comparison between uniform architecture reduction and the
Figure 8: Rate-distortion performance degradation under different distillation
Figure 9: Visualization on HEVC test sequences of the original frame, VTM-LDB-23.11, the teacher codec DCVC-RT, and the proposed KD-NVC-S/T. KD-NVC
Original abstract

While neural video coding (NVC) has achieved remarkable rate-distortion performance, real-time decoding on edge devices has become an important demand but remains limited by high complexity. Knowledge distillation (KD) is widely used for model acceleration, yet its application to NVC faces critical challenges. Specifically, the heterogeneity of NVC sub-modules renders uniform architectural reduction suboptimal, necessitating a per-module design for better rate-distortion-speed trade-off. However, searching for diverse architectures via existing neural architecture search (NAS) algorithms is unaffordable due to the expensive training cost of neural video codecs. Moreover, after the lightweight architecture is determined, existing distillation methods overlook the feature-energy sparsity induced by the rate-constraint, which is essential for maintaining compression performance. To address these issues, we propose a two-stage distillation framework KD-NVC. In the first stage, we introduce an acceleration-efficiency-based neural architecture search (AE-NAS) algorithm. It explores the module-wise Pareto frontier to adaptively allocate the acceleration budget across heterogeneous modules. Also, it introduces the acceleration-efficiency metric to determine the final student architecture without practically training all architecture-level candidates. In the second stage, we design an energy-aware feature distillation (EFD) loss that aligns the spatially-aggregated feature-energy signatures between the teacher and student codecs, transferring the rate-induced sparsity patterns for better compression efficiency. Experimental results demonstrate that the proposed framework consistently outperforms existing codec-oriented distillation methods, and achieves 69 FPS decoding at 1080p on RTX 5060 while maintaining comparable RD performance to VTM-LDB.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes KD-NVC, a two-stage search-and-distill framework for accelerating neural video codecs. Stage 1 uses an acceleration-efficiency-based NAS (AE-NAS) that explores module-wise Pareto frontiers and an acceleration-efficiency metric to select student architectures without training all candidates. Stage 2 introduces an energy-aware feature distillation (EFD) loss that aligns spatially-aggregated feature-energy signatures to transfer rate-constraint-induced sparsity patterns. The central empirical claim is that this consistently outperforms prior codec-oriented distillation methods while achieving 69 FPS 1080p decoding on an RTX 5060 with RD performance comparable to VTM-LDB.

Significance. If the results hold, the work would be significant for practical deployment of neural video coding on edge devices, where real-time decoding remains a bottleneck. The per-module NAS allocation and the targeted handling of rate-induced feature sparsity address domain-specific challenges that uniform KD approaches overlook. The AE-NAS efficiency metric is a pragmatic engineering contribution that could reduce search cost in similar heterogeneous codec settings.

major comments (2)
  1. [EFD loss (second stage)] EFD loss description (second stage): the claim that aligning spatially-aggregated feature-energy signatures transfers rate-induced sparsity patterns and is essential for maintaining compression performance is load-bearing for the outperformance claim, yet the manuscript provides no ablation or causal validation (e.g., comparing RD when sparsity patterns are matched vs. mismatched while holding architecture fixed) to show that spatial aggregation preserves the local rate-allocation details rather than discarding them.
  2. [Experimental results] Experimental results section: the reported 69 FPS at 1080p and consistent outperformance over existing distillation methods are presented without the supporting details (baselines, error bars, ablation tables isolating AE-NAS vs. EFD, or hardware measurement protocol) needed to verify that the gains are attributable to the proposed components rather than training schedule or architecture choice alone.
minor comments (2)
  1. [AE-NAS description] Notation for the acceleration-efficiency metric should be defined explicitly with its formula before being used to select the final architecture.
  2. [Abstract and method] The abstract and method sections would benefit from a clear statement of the precise VTM-LDB configuration (profile, GOP structure) used for the RD comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address each major comment point-by-point below, outlining revisions that will strengthen the manuscript while preserving its core contributions.

Point-by-point responses
  1. Referee: [EFD loss (second stage)] EFD loss description (second stage): the claim that aligning spatially-aggregated feature-energy signatures transfers rate-induced sparsity patterns and is essential for maintaining compression performance is load-bearing for the outperformance claim, yet the manuscript provides no ablation or causal validation (e.g., comparing RD when sparsity patterns are matched vs. mismatched while holding architecture fixed) to show that spatial aggregation preserves the local rate-allocation details rather than discarding them.

    Authors: We thank the referee for this observation. The EFD loss is motivated by the need to transfer rate-constraint-induced sparsity patterns, with spatial aggregation intended to retain local rate-allocation information in a compact form. While the manuscript shows overall gains relative to prior distillation methods, we agree that an explicit causal ablation would provide stronger validation. In the revised version, we will add an ablation comparing RD performance under matched versus mismatched sparsity patterns (architecture fixed) to demonstrate that spatial aggregation preserves the relevant details. revision: yes

  2. Referee: [Experimental results] Experimental results section: the reported 69 FPS at 1080p and consistent outperformance over existing distillation methods are presented without the supporting details (baselines, error bars, ablation tables isolating AE-NAS vs. EFD, or hardware measurement protocol) needed to verify that the gains are attributable to the proposed components rather than training schedule or architecture choice alone.

    Authors: We agree that additional experimental details are required for full reproducibility and attribution of gains. The revised manuscript will expand this section to include: explicit baseline specifications and comparisons, error bars from repeated runs, ablation tables that isolate AE-NAS from EFD contributions, and a precise description of the hardware measurement protocol used to obtain the 69 FPS 1080p decoding result on RTX 5060. These changes will clarify that improvements stem from the proposed components. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical engineering framework with no derivations or self-referential reductions

Full rationale

The manuscript describes a two-stage KD-NVC framework (AE-NAS for architecture search followed by EFD loss for feature-energy alignment) as an empirical contribution. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text. Performance claims rest on experimental results rather than any chain that reduces to its own inputs by construction. This matches the default expectation for non-theoretical papers; the reader's assessment of score 2.0 is consistent with minor or absent circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only; the central claims rest on domain assumptions about module heterogeneity and rate-induced sparsity that are stated but not evidenced here. No free parameters or invented entities are quantifiable from the provided text.

axioms (2)
  • domain assumption Heterogeneity of NVC sub-modules renders uniform architectural reduction suboptimal, necessitating per-module design
    Explicitly stated as a critical challenge in the abstract.
  • domain assumption Feature-energy sparsity induced by the rate-constraint is essential for maintaining compression performance
    Presented as the key oversight of existing distillation methods.

pith-pipeline@v0.9.1-grok · 5841 in / 1341 out tokens · 29798 ms · 2026-06-28T04:14:53.242045+00:00 · methodology
