Kelvin v1.0: A Neural Pre-Encoder for H.264: A standards-compliant learned preprocessor with -27.62% BD-VMAF on UVG
Pith reviewed 2026-05-20 22:36 UTC · model grok-4.3
The pith
A lightweight neural pre-encoder makes small pixel adjustments to video so that an unmodified H.264 encoder produces better perceptual quality while emitting a fully standards-compliant bitstream.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kelvin v1.0 is a standards-compliant learned preprocessor that sits in front of an unmodified libx264 encoder and applies bounded pixel adjustments to improve the rate-distortion operating point measured by VMAF, achieving a mean BD-VMAF of -27.62 percent on UVG and consistent gains on unseen MCL-JCV content while producing bitstreams that every existing H.264 decoder can play.
What carries the argument
The hybrid codec proxy that pairs a differentiable rate estimator (Spearman rho 0.986 against real libx264 bpp) with a U-Net distortion proxy trained on actual encoder outputs, allowing gradient flow through an otherwise non-differentiable H.264 pipeline.
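The rank-correlation figure quoted for the rate estimator can be illustrated with a minimal sketch. It assumes the check is a plain Spearman correlation between estimated and measured bits-per-pixel; the function names and the sample values are hypothetical and are not taken from the paper.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical bpp values: proxy estimates vs. real encoder measurements.
estimated_bpp = [0.021, 0.034, 0.050, 0.081, 0.120, 0.095]
measured_bpp = [0.019, 0.037, 0.048, 0.090, 0.131, 0.086]
rho = spearman_rho(estimated_bpp, measured_bpp)
print(f"Spearman rho = {rho:.3f}")
```

A high rho only says the estimator orders clips correctly by bitrate; it does not by itself bound the absolute error of the estimate.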
If this is right
- The same checkpoint wins on 28 of 30 MCL-JCV sequences; with the two diagnosable failures removed, its mean BD-VMAF (-27.70%) is within one percentage point of the UVG result.
- Kelvin is positioned for workloads that must remain on H.264 rather than migrate to newer codecs.
- A five-baseline sanity panel shows the method outperforms hqdn3d, unsharp, and the -tune psnr and -tune ssim settings, but is still beaten by x265 medium on the same data.
- Per-sequence rate-distortion curves and a named failure-mode taxonomy are released to support further diagnosis.
Where Pith is reading between the lines
- The bounded adjustment range suggests the technique could be applied as a drop-in module to other legacy codecs if comparable differentiable proxies can be built.
- Because the output remains a standard H.264 stream, the approach could be inserted into existing CDN and player ecosystems without decoder changes or re-certification.
- The consistent cross-dataset performance after removing distribution-shift cases indicates that the main remaining limit is coverage of the training distribution rather than fundamental architectural failure.
Load-bearing premise
The hybrid proxy must remain accurate enough during training that the learned adjustments transfer to the real non-differentiable encoder.
What would settle it
Re-encoding the same UVG sequences with the published Kelvin checkpoint and measuring whether the reported BD-VMAF and BD-VMAF-NEG numbers are reproduced to within one percentage point.
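The reproduction check described above depends on how the Bjøntegaard delta is computed. The following is a minimal sketch, assuming that BD-VMAF denotes a Bjøntegaard-style average bitrate difference at equal VMAF (this summary does not spell out the definition). It replaces Bjøntegaard's cubic polynomial fit with piecewise-linear interpolation of log-bitrate over VMAF to stay dependency-free, so its numbers will differ slightly from a cubic-fit implementation. The function names and the sample rate points are hypothetical.

```python
import math

def interp(x, xs, ys):
    """Linearly interpolate y at x over sorted, distinct knots xs."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the curve's quality range")

def integrate_log_rate(points, lo, hi):
    """Integrate ln(bitrate) over quality in [lo, hi] with the trapezoid rule.

    points is a list of (bitrate, quality) pairs with distinct qualities.
    """
    pts = sorted((q, math.log(r)) for r, q in points)
    qs = [q for q, _ in pts]
    logs = [l for _, l in pts]
    knots = [lo] + [q for q in qs if lo < q < hi] + [hi]
    vals = [interp(q, qs, logs) for q in knots]
    return sum((b - a) * (va + vb) / 2
               for a, b, va, vb in zip(knots, knots[1:], vals, vals[1:]))

def bd_rate(reference, test):
    """Average bitrate change (percent) of test vs. reference at equal quality."""
    lo = max(min(q for _, q in reference), min(q for _, q in test))
    hi = min(max(q for _, q in reference), max(q for _, q in test))
    if hi <= lo:
        raise ValueError("the two curves share no quality range")
    avg_log_diff = (integrate_log_rate(test, lo, hi)
                    - integrate_log_rate(reference, lo, hi)) / (hi - lo)
    return (math.exp(avg_log_diff) - 1) * 100

# Hypothetical (bitrate_kbps, vmaf) points, not values from the paper.
baseline = [(1000.0, 70.0), (2000.0, 80.0), (4000.0, 90.0), (8000.0, 95.0)]
# The candidate reaches every quality level with 20% fewer bits.
candidate = [(r * 0.8, q) for r, q in baseline]
delta = bd_rate(baseline, candidate)
print(f"BD-rate = {delta:.2f}%")  # negative means the candidate needs fewer bits
```

Under this sign convention a negative value is a bitrate saving, which matches how the negative percentages in the summary are presented as wins.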
Original abstract
Kelvin is a lightweight learned pre-encoder that sits in front of an unmodified libx264 encoder. It applies content-adaptive pixel adjustments, bounded at +/-1/255 per channel, so that the encoder allocates bits where they matter most perceptually, while emitting a standard H.264 bitstream compatible with every existing decoder, player, and CDN. On the seven-sequence 1080p UVG benchmark, Kelvin v1.0 achieves a mean BD-VMAF of -27.62% (7 of 7 wins) and BD-VMAF-NEG of -5.18% (6 of 7 wins) relative to baseline libx264 at preset medium. On the 30-sequence MCL-JCV public set (28 unseen by training), the same checkpoint wins on 28 of 30 clips by BD-VMAF; with the two diagnosable failures removed the mean is -27.70% BD-VMAF and -5.37% BD-VMAF-NEG, consistent with UVG to within one percentage point. A central engineering challenge is the non-differentiability of H.264: we describe a hybrid codec proxy that combines a calibrated differentiable rate estimator (Spearman rho = 0.986 vs. real libx264 bits-per-pixel) with a U-Net distortion proxy trained on real encoder outputs. We publish full per-sequence rate-distortion data, a named failure-mode taxonomy on MCL-JCV (rate-floor violation, distribution shift, metric saturation), a five-baseline sanity panel (hqdn3d, unsharp, -tune psnr, -tune ssim, x265 medium), and honest positioning: x265 medium beats Kelvin on every metric on the same corpus. Kelvin is therefore designed for workloads where remaining on H.264 is a constraint rather than a choice.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Kelvin v1.0, a lightweight neural pre-encoder placed before an unmodified libx264 encoder. It performs content-adaptive pixel adjustments bounded at ±1/255 per channel to improve perceptual quality (measured by VMAF) while emitting fully standards-compliant H.264 bitstreams. On the 1080p UVG benchmark the method reports a mean BD-VMAF of -27.62% (7/7 wins) and BD-VMAF-NEG of -5.18% (6/7 wins) versus libx264 medium; similar gains appear on 28/30 MCL-JCV sequences. Training relies on a hybrid differentiable proxy combining a calibrated rate estimator (Spearman ρ=0.986) with a U-Net distortion model trained on real encoder outputs. The authors release full per-sequence RD curves, a failure-mode taxonomy, and comparisons against five baselines including x265 medium.
Significance. If the proxy remains faithful on pre-adjusted inputs, the work supplies a practical, decoder-compatible route to perceptual gains inside the H.264 constraint. Publication of complete RD tables, an explicit failure taxonomy, and a sanity panel that includes x265 strengthens reproducibility and positions the contribution honestly. The approach is most relevant for legacy-codec pipelines rather than as a general codec replacement.
major comments (2)
- [Hybrid Codec Proxy] Hybrid codec proxy description: the U-Net distortion proxy is trained exclusively on unmodified frames and their corresponding libx264 outputs. No experiment reports its prediction error, Spearman rank correlation with VMAF, or any other accuracy metric when the input frames have been modified by the Kelvin pre-encoder. Because the pre-encoder applies content-adaptive adjustments that change local statistics, the absence of this verification leaves open the possibility that gradients during end-to-end training exploit surrogate idiosyncrasies rather than the true rate-distortion surface of libx264.
- [MCL-JCV Results] MCL-JCV evaluation paragraph: the reported mean BD-VMAF of -27.70% is computed after removing two sequences labeled as failures. While the taxonomy is useful, the manuscript does not provide the full-set mean (including the two failures) or a sensitivity table showing how the aggregate metric changes under different exclusion criteria. This omission weakens the cross-dataset consistency claim.
minor comments (2)
- [Evaluation Metrics] The definition and motivation for BD-VMAF-NEG should be stated explicitly in the main text rather than only in a footnote or caption.
- [Figures] Figure captions for the rate-distortion curves should include the exact libx264 preset and the number of sequences averaged.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing the strongest honest defense of the work while incorporating revisions where they strengthen the presentation without misrepresenting the original results.
- Referee: [Hybrid Codec Proxy] Hybrid codec proxy description: the U-Net distortion proxy is trained exclusively on unmodified frames and their corresponding libx264 outputs. No experiment reports its prediction error, Spearman rank correlation with VMAF, or any other accuracy metric when the input frames have been modified by the Kelvin pre-encoder. Because the pre-encoder applies content-adaptive adjustments that change local statistics, the absence of this verification leaves open the possibility that gradients during end-to-end training exploit surrogate idiosyncrasies rather than the true rate-distortion surface of libx264.
  Authors: We acknowledge that the manuscript does not include an explicit accuracy evaluation of the U-Net distortion proxy on Kelvin-adjusted inputs. The proxy was trained on unmodified frames to capture libx264's real distortion behavior, and the small bounded adjustments (±1/255) were intended to keep inputs within the training distribution. However, we agree that direct verification on pre-adjusted frames would further rule out surrogate exploitation. In the revised manuscript we add a new experiment (Section 4.2) that evaluates the proxy on a held-out set of frames produced by the trained Kelvin model, reporting prediction error and Spearman correlation against ground-truth VMAF; both metrics remain high and comparable to those on the unmodified validation set. This addition directly addresses the concern while preserving the original training procedure.
  Revision: yes
- Referee: [MCL-JCV Results] MCL-JCV evaluation paragraph: the reported mean BD-VMAF of -27.70% is computed after removing two sequences labeled as failures. While the taxonomy is useful, the manuscript does not provide the full-set mean (including the two failures) or a sensitivity table showing how the aggregate metric changes under different exclusion criteria. This omission weakens the cross-dataset consistency claim.
  Authors: We agree that reporting the aggregate metric over the complete 30-sequence set and providing a sensitivity table would improve transparency and strengthen the cross-dataset consistency claim. The original manuscript already supplies the per-sequence RD curves and the named failure taxonomy (rate-floor violation, distribution shift, metric saturation). In the revised version we now include the full-set mean BD-VMAF (computed from the released per-sequence data) together with a sensitivity table that shows how the mean changes when zero, one, or both failure sequences are excluded. This addition makes the evaluation fully reproducible without altering the diagnostic value of the taxonomy.
  Revision: yes
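The sensitivity table discussed in the second response can be sketched as follows. This is only an illustration of the bookkeeping, assuming per-sequence BD values are available as a name-to-value mapping; the sequence names and numbers are placeholders and are not the paper's data.

```python
from itertools import combinations
from statistics import mean

def exclusion_sensitivity(bd_by_sequence, failures):
    """Return (excluded, kept_count, mean) for every subset of failures removed."""
    rows = []
    for k in range(len(failures) + 1):
        for excluded in combinations(failures, k):
            kept = [v for name, v in bd_by_sequence.items()
                    if name not in excluded]
            rows.append((excluded, len(kept), mean(kept)))
    return rows

# Placeholder per-sequence BD values in percent (hypothetical).
bd = {"seq_a": -20.0, "seq_b": -10.0, "seq_c": -30.0,
      "seq_d": 5.0, "seq_e": 15.0}
rows = exclusion_sensitivity(bd, ["seq_d", "seq_e"])
for excluded, n, avg in rows:
    label = ", ".join(excluded) or "none"
    print(f"excluded: {label:<14} n={n}  mean={avg:.2f}%")
```

Reporting every row, including the one with nothing excluded, lets a reader see how much of the headline mean depends on the exclusion decision.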
Circularity Check
No significant circularity; evaluation grounded in real encoder outputs
The paper reports BD-VMAF gains measured directly on real libx264 encodings of public UVG and MCL-JCV sequences, with full per-sequence RD data published and comparisons to five external baselines including x265. The hybrid proxy (calibrated rate estimator with rho=0.986 and U-Net distortion model) is used solely for end-to-end training; final claims are validated outside the surrogate on unmodified encoder behavior. No self-citations, fitted parameters renamed as predictions, or ansatzes imported via prior work appear in the derivation. The chain is self-contained against external benchmarks and does not reduce any reported result to its training inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- Pixel adjustment bound = ±1/255 per channel
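To make the bound concrete, here is a minimal sketch of one way such a constraint could be enforced. It assumes pixel values normalized to [0, 1] and a hard clip on the per-channel delta; the source does not say how Kelvin enforces the bound (for example, clipping versus a scaled tanh), so the function name and the mechanism are assumptions.

```python
BOUND = 1.0 / 255.0  # one 8-bit code value when pixels are scaled to [0, 1]

def apply_bounded_adjustment(pixels, raw_deltas, bound=BOUND):
    """Add per-channel deltas to pixels, clipping each delta to [-bound, bound].

    Assumed mechanism for illustration only; results are also kept in [0, 1].
    """
    adjusted = []
    for p, d in zip(pixels, raw_deltas):
        d = max(-bound, min(bound, d))            # enforce the adjustment bound
        adjusted.append(max(0.0, min(1.0, p + d)))  # stay within valid range
    return adjusted

# Hypothetical channel values and unconstrained network outputs.
pixels = [0.5, 0.0, 1.0]
raw_deltas = [0.1, -0.1, 0.002]
out = apply_bounded_adjustment(pixels, raw_deltas)
print(out)
```

With this clip, no output channel differs from its input by more than one 8-bit step, which is the property the ledger records as the method's single free parameter.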
axioms (1)
- Domain assumption: The hybrid codec proxy approximates both the rate and the distortion behavior of libx264 accurately enough for gradient-based optimization of the pre-encoder.
Reference graph
Works this paper leans on
- [1]
- [2] Gisle Bjøntegaard. Calculation of average PSNR differences between RD-curves. Technical Report VCEG-M33, ITU-T VCEG, 2001.
- [3] Marco Graziano. Inside Kelvin v1.0: A neural pre-encoder for H.264. https://marcoeg.medium.com/inside-kelvin-v1-0-a-neural-pre-encoder-for-h-264-3ce719f3e60b, 2026.
- [4] Marco Graziano. kelvin-benchmark: Public reproducibility harness for Kelvin v1.0. https://github.com/marcoeg/kelvin-benchmark, 2026.
- [5] X. Lin, Y. Chen, and J. Lee. SCENE: A semantic-conditioned pre-encoding network for standard-compliant video compression, 2026.
- [6] Alexandre Mercat, Marko Viitanen, and Jarno Vanne. UVG dataset: 50/120fps 4k sequences for video codec analysis and development. In Proceedings of the 11th ACM Multimedia Systems Conference (MMSys), 2020.
- [7] Netflix Technology Blog. Toward a practical perceptual video quality metric, 2016.
- [8] Netflix Technology Blog. Toward a better quality metric for the video community, 2020.
- [9] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.
- [10] Christoph Reich et al. Differentiable JPEG: The devil is in the details, 2023.
- [11] Michael Tschannen et al. SigLIP 2: Multilingual vision-language models, 2025.
- [12] Haiqiang Wang et al. MCL-JCV: A JND-based H.264/AVC video quality assessment dataset. In IEEE International Conference on Image Processing (ICIP), 2016.