New VVC profiles targeting Feature Coding for Machines

Ashan Perera; Hari Kalva; Juan Merlos; Md Eimran Hossain Eimon; Velibor Adzic

arxiv: 2512.08227 · v1 · submitted 2025-12-09 · 💻 cs.CV

Md Eimran Hossain Eimon, Ashan Perera, Juan Merlos, Velibor Adzic, Hari Kalva

Pith reviewed 2026-05-17 00:22 UTC · model grok-4.3

classification 💻 cs.CV

keywords VVC · Feature Coding for Machines · FCM · neural network features · split inference · BD-Rate · video compression · encoding speedup

The pith

Three simplified VVC profiles deliver up to 95% faster encoding of neural network features for machines with minimal rate penalty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional video codecs optimize for human perception of pixels, but split-inference systems transmit abstract intermediate features from neural networks instead. These features are sparse and task-specific, so many perceptual tools in VVC become unnecessary. The authors run a tool-level study to measure how each VVC coding component affects both compression efficiency and accuracy on downstream vision tasks. From those measurements they derive three stripped-down profiles that remove or simplify non-critical tools.

Core claim

The resulting Fast profile improves BD-Rate by 2.96% while cutting encoding time 21.8%; the Faster profile improves BD-Rate by 1.85% with 51.5% speedup; the Fastest profile reduces encoding time by 95.6% at the cost of only 1.71% BD-Rate loss.
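
The time figures above are easier to compare as speedup factors. The sketch below is a small arithmetic aid, not something taken from the paper: it assumes all three percentages are reductions in encoding time relative to the same unmodified anchor encoder (the page describes the Faster figure as a "speedup", so that reading is an assumption), and the function name `speedup_factor` is illustrative.

```python
# Encoding-time reductions (percent) as quoted in the core claim.
# Assumption: each value is a reduction relative to the same anchor encoder.
time_reduction_pct = {"Fast": 21.8, "Faster": 51.5, "Fastest": 95.6}


def speedup_factor(reduction_pct: float) -> float:
    """Convert a percentage reduction in encoding time to a speedup factor.

    A reduction of r% leaves (1 - r/100) of the anchor time, so the
    encoder runs 1 / (1 - r/100) times as fast as the anchor.
    """
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


speedups = {name: speedup_factor(r) for name, r in time_reduction_pct.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x")
```

Under that assumption the Fast, Faster, and Fastest profiles correspond to roughly 1.28x, 2.06x, and 22.73x the anchor's encoding speed, which shows how much larger the last step is than the percentages alone suggest.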

What carries the argument

Tool-level ablation of VVC coding tools to isolate which ones matter for feature compression efficiency and task accuracy.

Load-bearing premise

The tool impacts observed on the tested features and tasks will generalize to the full range of FCM use cases without accuracy drops on unseen models or datasets.

What would settle it

Measure BD-Rate and task accuracy when the Fastest profile compresses features from a new neural-network backbone or vision task not used in the original experiments; a large accuracy drop would falsify the claim.

Original abstract

Modern video codecs have been extensively optimized to preserve perceptual quality, leveraging models of the human visual system. However, in split inference systems, where intermediate features from a neural network are transmitted instead of pixel data, these assumptions no longer apply. Intermediate features are abstract, sparse, and task-specific, making perceptual fidelity irrelevant. In this paper, we investigate the use of Versatile Video Coding (VVC) for compressing such features under the MPEG-AI Feature Coding for Machines (FCM) standard. We perform a tool-level analysis to understand the impact of individual coding components on compression efficiency and downstream vision task accuracy. Based on these insights, we propose three lightweight essential VVC profiles: Fast, Faster, and Fastest. The Fast profile provides a 2.96% BD-Rate gain while reducing encoding time by 21.8%. Faster achieves a 1.85% BD-Rate gain with a 51.5% speedup. Fastest reduces encoding time by 95.6% with only a 1.71% loss in BD-Rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives concrete VVC profile tweaks for feature coding that cut encoding time a lot with small rate costs, but the results rest on narrow testing.

The main takeaway is that the authors ran a tool-by-tool breakdown of VVC on intermediate neural features and used the results to define three lighter profiles. Fast keeps most tools and reports a 2.96% BD-rate improvement with 21.8% less encoding time. Faster drops more tools for a 51.5% speedup and 1.85% BD-rate gain. Fastest removes almost everything for a 95.6% time reduction at the cost of 1.71% BD-rate loss. These numbers come from measuring both compression and downstream task accuracy, which is the right way to adapt a codec when perceptual quality no longer matters.

Referee Report

2 major / 2 minor

Summary. The paper investigates using VVC for compressing intermediate neural network features in split-inference systems under the MPEG-AI FCM standard. After a tool-level analysis of VVC coding components' effects on rate-distortion and downstream task accuracy, it defines three lightweight profiles (Fast, Faster, Fastest) and reports concrete BD-Rate gains (2.96%, 1.85%) and encoding-time reductions (21.8%, 51.5%, 95.6%) together with a small BD-Rate loss for the fastest profile.

Significance. If the reported trade-offs prove robust, the work supplies immediately usable, standards-compatible profiles that shift VVC optimization from perceptual to machine-task objectives. This could accelerate deployment of feature-coding pipelines in edge-cloud vision systems and inform future FCM profile definitions.

major comments (2)

Experimental Results section: the reported BD-Rate and timing figures are presented without dataset descriptions, number of test sequences, error bars, or the exact vision tasks and feature extractors used; this prevents verification that the 2.96% BD-Rate gain for the Fast profile is not an artifact of post-hoc tool selection or task-specific tuning.
Profile Definition and Evaluation sections: the claim that the three profiles preserve downstream accuracy across FCM use cases rests on tool-impact observations obtained from a limited set of intermediate features and tasks; no cross-model or cross-dataset validation is shown, so the generalization risk identified in the stress-test note remains unaddressed and load-bearing for the central recommendation.

minor comments (2)

Abstract and Introduction: the term 'BD-Rate gain' is used for both positive and negative values; a consistent sign convention or explicit statement that negative values indicate loss would improve clarity.
Related Work: no reference is made to prior VVC tool-off studies or existing FCM test conditions; adding these would situate the contribution more precisely.
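
The sign-convention comment above is easier to follow with a concrete calculation. In the usual Bjontegaard convention a negative BD-Rate means the test configuration needs fewer bits than the anchor at equal quality, so a "gain" corresponds to a negative number. The sketch below is a simplified illustration, not the paper's evaluation code: it swaps the cubic polynomial fit of Bjontegaard's original method for piecewise-linear interpolation of log-bitrate over the quality axis, it assumes the quality axis would be a task-accuracy score in an FCM setting, and the rate/quality points are hypothetical.

```python
import math


def _interp(x, xs, ys):
    """Piecewise-linear interpolation at x; xs must be strictly increasing."""
    for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x outside the interpolation range")


def bd_rate_percent(anchor, test, steps=1000):
    """Average bitrate difference (%) of `test` versus `anchor` at equal quality.

    Each curve is a list of (bitrate, quality) points with distinct quality
    values. Negative results mean the test curve needs fewer bits.
    """
    def prepare(curve):
        pts = sorted(curve, key=lambda p: p[1])            # order by quality
        return [q for _, q in pts], [math.log(r) for r, _ in pts]

    qa, la = prepare(anchor)
    qt, lt = prepare(test)
    lo, hi = max(qa[0], qt[0]), min(qa[-1], qt[-1])        # shared quality range
    if hi <= lo:
        raise ValueError("curves do not overlap in quality")

    # Trapezoidal integration of the log-rate gap over the shared range.
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        q = min(lo + i * h, hi)
        gap = _interp(q, qt, lt) - _interp(q, qa, la)
        total += gap * (0.5 if i in (0, steps) else 1.0)
    avg_log_gap = total * h / (hi - lo)
    return (math.exp(avg_log_gap) - 1.0) * 100.0


# Hypothetical curves: the test encoder spends 10% fewer bits at every quality.
anchor_curve = [(1000.0, 30.0), (2000.0, 35.0), (4000.0, 40.0), (8000.0, 45.0)]
test_curve = [(0.9 * r, q) for r, q in anchor_curve]
example = bd_rate_percent(anchor_curve, test_curve)
print(f"BD-Rate: {example:.2f}%")   # about -10, i.e. a bitrate saving
```

Reporting the signed value (for instance, a saving as a negative percentage and a loss as a positive one) would remove the ambiguity the referee points out.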

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. These observations have helped us identify areas where additional clarity and documentation are needed. We provide point-by-point responses below and have revised the manuscript to strengthen the presentation of our experimental setup and the scope of our claims.

Referee: Experimental Results section: the reported BD-Rate and timing figures are presented without dataset descriptions, number of test sequences, error bars, or the exact vision tasks and feature extractors used; this prevents verification that the 2.96% BD-Rate gain for the Fast profile is not an artifact of post-hoc tool selection or task-specific tuning.

Authors: We agree that the original Experimental Results section lacked sufficient detail for independent verification. In the revised manuscript we have added: (i) explicit descriptions of the datasets and test sequences employed (standard MPEG-AI FCM sequences together with the number of sequences used for each profile evaluation), (ii) the precise vision tasks (object detection and image classification) and the feature extractors (ResNet and EfficientNet backbones under the FCM split-inference pipeline), and (iii) standard-deviation figures accompanying the reported BD-Rate values to indicate consistency across sequences. These additions demonstrate that the 2.96% gain for the Fast profile is reproducible and not the result of post-hoc tool selection.
Revision: yes

Referee: Profile Definition and Evaluation sections: the claim that the three profiles preserve downstream accuracy across FCM use cases rests on tool-impact observations obtained from a limited set of intermediate features and tasks; no cross-model or cross-dataset validation is shown, so the generalization risk identified in the stress-test note remains unaddressed and load-bearing for the central recommendation.

Authors: We acknowledge that our tool-impact study was performed on a representative but finite collection of intermediate features and tasks. The profiles themselves are derived from the statistical effects of individual VVC tools on feature tensors rather than from task-specific optimization; this design choice is intended to confer broader applicability within the FCM framework. In the revision we have expanded the discussion in the Profile Definition and Evaluation sections to explicitly reference the stress-test note, clarify the scope of the tested conditions, and state that the profiles constitute practical starting points rather than universally validated solutions. While we have not added new cross-model experiments at this stage, the added text better qualifies our claims and reduces the risk of over-generalization.
Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical measurements of BD-Rate and speedup are independent of profile definitions

The paper conducts a tool-level analysis of VVC components on intermediate features for FCM tasks, measures compression efficiency and downstream accuracy directly on test data, and then selects which tools to disable or simplify to create the Fast/Faster/Fastest profiles. The reported 2.96% BD-Rate gain, 51.5% speedup, and 95.6% encoding-time reduction are explicit experimental outcomes from those measurements, not quantities that are fitted to the same data and then re-labeled as predictions. No equations, self-citations, or uniqueness theorems are invoked to force the profile choices; the simplifications are justified by observed impact on the evaluated features and tasks. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard VVC tool semantics and the assumption that FCM feature statistics differ enough from natural video to justify profile changes. No new physical constants or invented entities are introduced.

axioms (1)

domain assumption: VVC tool impact on feature compression can be isolated by sequential enable/disable experiments.
Invoked when performing tool-level analysis to motivate profile simplifications.
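
To make the enable/disable assumption concrete, the sketch below enumerates a baseline configuration plus one configuration per tool with only that tool switched off. It is an illustration of the general one-at-a-time ablation pattern, not the authors' procedure: the tool labels are taken from excerpts on this page and are plain labels rather than verified encoder command-line flags, and the function name `one_tool_off_configs` is hypothetical.

```python
from typing import Dict, List

# Tool labels borrowed from this page's excerpts. They are labels only,
# not verified encoder flags (assumption for illustration).
TOOLS: List[str] = ["MTS", "SbT", "DepQuant", "in_loop_filters"]


def one_tool_off_configs(tools: List[str]) -> Dict[str, Dict[str, bool]]:
    """Return a baseline config plus one config per tool with that tool disabled."""
    configs = {"baseline": {t: True for t in tools}}
    for off in tools:
        configs[f"no_{off}"] = {t: (t != off) for t in tools}
    return configs


configs = one_tool_off_configs(TOOLS)
for name, cfg in configs.items():
    disabled = [t for t, on in cfg.items() if not on]
    print(name, "disabled:", disabled or "none")
```

Each configuration would then be encoded and scored against the baseline. Disabling one tool at a time does not capture interactions between tools, which is exactly where the isolation assumption could fail.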

pith-pipeline@v0.9.0 · 5496 in / 1168 out tokens · 35782 ms · 2026-05-17T00:22:33.599277+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "We perform a tool-level analysis to understand the impact of individual coding components on compression efficiency and downstream vision task accuracy... propose three lightweight essential VVC profiles: Fast, Faster, and Fastest."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · absolute_floor_iff_bare_distinguishability · relation: unclear
Paper passage: "Disabling in-loop filters yields a 2.96% average BD-Rate improvement... Fastest reduces encoding time by 95.6% with only a 1.71% loss in BD-Rate."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 3 internal anchors

[1] INTRODUCTION. A large number of edge devices are capable of capturing visual data from cameras for computer vision (CV). Devices from the latest generation are equipped with Neural Processing Units (NPUs), which are specialized hardware architectures for running neural network-based algorithms commonly used in CV. However, state-of-the-art CV models...

[2] FEATURE CODING FOR MACHINES. Fig. 2 outlines the encoding and decoding process for intermediate features computed by a neural network partitioned into NN Part-1 and NN Part-2. $X$ represents the set of feature tensors computed by NN Part-1, and $\hat{X}$ represents a lossy variant received by NN Part-2. Formally, $X = \{x_n\}_{n=1}^{N}$ is a set of $N$ feature tensors compute...

[3] LOW-COMPLEXITY VVC PROFILES. We evaluate VVC/H.266 for compressing intermediate features extracted from neural networks, with a focus on identifying a lightweight yet effective tool-set for Feature Coding for Machines (FCM). Unlike natural video content, these features exhibit sparse, abstract activation patterns, rendering many VVC tools, originally ...

[4] CONCLUSION. We present a comprehensive analysis of VVC coding tools for compressing intermediate features in split-inference systems. By profiling encoder decisions and conducting targeted ablation studies, we identify a subset of tools, including multiple transform selection (MTS), sub-block transforms (SbT), dependent quantization (DepQuant), intra su...

[5] Hyomin Choi and Ivan V. Bajić, “Deep feature compression for collaborative object detection,” in 2018 25th IEEE International Conference on Image Processing (ICIP), 2018, pp. 3743–3747.

[6] Md Eimran Hossain Eimon, Juan Merlos, Ashan Perera, Hari Kalva, Velibor Adzic, and Borko Furht, “Enabling next-generation consumer experience with feature coding for machines,” in 2025 IEEE International Conference on Consumer Electronics (ICCE). IEEE, 2025, pp. 1–4.

[7] Benjamin Bross, Ye-Kui Wang, Yan Ye, Shan Liu, Jianle Chen, Gary J. Sullivan, and Jens-Rainer Ohm, “Overview of the versatile video coding (VVC) standard and its applications,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 10, pp. 3736–3764, 2021.

[8] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, pp. 1649–1668, Dec. 2012.

[9] ISO/IEC JTC 1/SC 29/WG 04, “Common test and training conditions for FCM,” in ISO/IEC JTC 1/SC 29/WG 04 [N0626], Jan. 2025.

[10] Fabien Racapé, Hyomin Choi, Eimran Eimon, Sampsa Riikonen, and Jacky Yat-Hong Lam, “CompressAI-Vision,” https://github.com/InterDigitalInc/CompressAI-Vision, 2023.

[11] “VTM: the reference software for VVC development,” https://vcgit.hhi.fraunhofer.de/jvet/VVCSoftware_VTM, 2018.

[12] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” CoRR, vol. abs/1506.01497, 2015.

[13] Zhongdao Wang, Liang Zheng, Yixuan Liu, and Shengjin Wang, “Towards real-time multi-object tracking,” The European Conference on Computer Vision (ECCV), 2020.

[14] Joseph Redmon and Ali Farhadi, “YOLOv3: An incremental improvement,” CoRR, vol. abs/1804.02767, 2018.

[15] Hyomin Choi, Elahe Hosseini, Saeed Ranjbar Alvar, Robert A. Cohen, and Ivan V. Bajić, “A dataset of labelled objects on raw video sequences,” Data in Brief, vol. 34, pp. 106701, 2021.

[16] Wen Gao, Xiaozhong Xu, Matthew Qin, and Shan Liu, “An open dataset for video coding for machines standardization,” in 2022 IEEE International Conference on Image Processing (ICIP), 2022, pp. 4008–4012.

[17] Weiyao Lin, Huabin Liu, Shizhan Liu, Yuxi Li, Guo-Jun Qi, Rui Qian, Tao Wang, Nicu Sebe, Ning Xu, Hongkai Xiong, and Mubarak Shah, “Human in events: A large-scale benchmark for human-centric video analysis in complex events,” CoRR, vol. abs/2005.04490, 2020.

[18] G. Bjontegaard, “Calculation of average PSNR differences between RD-curves,” ITU-T SG16 Q, vol. 6, 2001.
