Kelvin v1.0: A Neural Pre-Encoder for H.264: A standards-compliant learned preprocessor with -27.62% BD-VMAF on UVG

Marco Graziano

arXiv: 2605.16376 · v1 · pith:C3UV5JNJnew · submitted 2026-05-10 · 📡 eess.IV · cs.CV · cs.DC · cs.LG · cs.MM

Pith reviewed 2026-05-20 22:36 UTC · model grok-4.3

classification 📡 eess.IV · cs.CV · cs.DC · cs.LG · cs.MM

keywords: learned pre-encoder · H.264 · standards-compliant · perceptual quality · BD-VMAF · UVG benchmark · hybrid codec proxy · video preprocessing

The pith

A lightweight neural pre-encoder makes small pixel adjustments to video so that an unmodified H.264 encoder produces better perceptual quality while emitting a fully standards-compliant bitstream.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Kelvin applies content-adaptive adjustments of at most ±1/255 to each pixel channel before feeding frames to libx264. The network is trained end-to-end through a hybrid proxy that replaces the non-differentiable encoder with a calibrated rate estimator and a U-Net distortion model fitted to real encoder outputs. On the seven-sequence 1080p UVG set the method records a mean BD-VMAF of -27.62 percent relative to baseline libx264 at preset medium and wins every sequence. On the 30-sequence MCL-JCV set it wins 28 sequences, and with the two diagnosable failures excluded the mean BD-VMAF is -27.70 percent, within one percentage point of the UVG figure. The design deliberately leaves the encoder, decoder, and transport unchanged so that any existing H.264 pipeline can adopt it without modification.
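To make the ±1/255 bound concrete, here is a minimal sketch of one way such a bound could be enforced. The page does not say how Kelvin enforces it, so the tanh squashing, the function name `bounded_adjust`, and the `raw_outputs` argument are all assumptions made for illustration.

```python
import math

BOUND = 1.0 / 255.0  # per-channel bound quoted in the text, on a [0, 1] pixel scale


def bounded_adjust(pixels, raw_outputs, bound=BOUND):
    """Apply a bounded per-channel adjustment to normalized pixel values.

    pixels      -- floats in [0, 1]
    raw_outputs -- unconstrained network outputs (hypothetical), one per value
    """
    adjusted = []
    for p, raw in zip(pixels, raw_outputs):
        # tanh maps any real number into [-1, 1], so |delta| never exceeds bound.
        delta = bound * math.tanh(raw)
        # Clip so the result is still a valid pixel value.
        adjusted.append(min(1.0, max(0.0, p + delta)))
    return adjusted


frame = [0.0, 0.5, 1.0]
out = bounded_adjust(frame, [3.0, -0.2, 8.0])
```

Whatever the network emits, each output value stays within one 8-bit code step of its input, which is the property the text relies on.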

Core claim

Kelvin v1.0 is a standards-compliant learned preprocessor that sits in front of an unmodified libx264 encoder and applies bounded pixel adjustments to improve the rate-distortion operating point measured by VMAF, achieving a mean BD-VMAF of -27.62 percent on UVG and consistent gains on unseen MCL-JCV content while producing bitstreams that every existing H.264 decoder can play.

What carries the argument

The hybrid codec proxy that pairs a differentiable rate estimator (Spearman rho 0.986 against real libx264 bpp) with a U-Net distortion proxy trained on actual encoder outputs, allowing gradient flow through an otherwise non-differentiable H.264 pipeline.
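The Spearman figure quoted above measures whether the rate estimator orders clips the same way the real encoder does. The sketch below shows how such a rank correlation can be computed. The sample values are made-up placeholders rather than measurements from the paper, and the function names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder values for illustration only; not data from the paper.
proxy_bpp = [0.021, 0.034, 0.050, 0.080, 0.120]
real_bpp = [0.019, 0.036, 0.047, 0.085, 0.110]
rho = spearman_rho(proxy_bpp, real_bpp)
```

A high rho only says the ordering is preserved. It does not by itself show that the proxy's absolute bits-per-pixel values, or its gradients, match libx264.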

If this is right

The same checkpoint wins on 28 of 30 MCL-JCV sequences and retains nearly identical average gain after two diagnosable failures are removed.
Kelvin is positioned for workloads that must remain on H.264 rather than migrate to newer codecs.
A five-baseline sanity panel shows the method outperforms hqdn3d, unsharp, and psnr/ssim tunings but is still beaten by x265 medium on the same data.
Per-sequence rate-distortion curves and a named failure-mode taxonomy are released to support further diagnosis.
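For readers who want to picture the five-baseline panel, the sketch below builds ffmpeg argument lists for each arm without running them. The page does not say which tool, rate-control mode, or quality settings the author used, so the use of ffmpeg, the CRF parameter, the file names, and the default filter strengths are all assumptions.

```python
def encode_command(src, dst, crf, codec="libx264", tune=None, vfilter=None):
    """Build (but do not run) an ffmpeg argument list for one encode."""
    cmd = ["ffmpeg", "-y", "-i", src]
    if vfilter:
        cmd += ["-vf", vfilter]  # optional pre-filter applied before encoding
    cmd += ["-c:v", codec, "-preset", "medium", "-crf", str(crf)]
    if tune:
        cmd += ["-tune", tune]
    cmd += ["-an", dst]  # drop audio; only video is compared
    return cmd


def sanity_panel(src, crf):
    """Return the plain libx264 reference plus the five baselines named above."""
    return {
        "x264_reference": encode_command(src, "reference.mp4", crf),
        "hqdn3d": encode_command(src, "hqdn3d.mp4", crf, vfilter="hqdn3d"),
        "unsharp": encode_command(src, "unsharp.mp4", crf, vfilter="unsharp"),
        "tune_psnr": encode_command(src, "tune_psnr.mp4", crf, tune="psnr"),
        "tune_ssim": encode_command(src, "tune_ssim.mp4", crf, tune="ssim"),
        # CRF scales differ between x264 and x265, so curves must be compared
        # at matched bitrate or quality, not at matched CRF.
        "x265_medium": encode_command(src, "x265.mp4", crf, codec="libx265"),
    }


panel = sanity_panel("input.y4m", 23)  # hypothetical source file and CRF
```

Each arm would be encoded at several quality points to trace a rate-distortion curve, and the curves would then be compared against the plain libx264 reference.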

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The bounded adjustment range suggests the technique could be applied as a drop-in module to other legacy codecs if comparable differentiable proxies can be built.
Because the output remains a standard H.264 stream, the approach could be inserted into existing CDN and player ecosystems without decoder changes or re-certification.
The consistent cross-dataset performance after removing distribution-shift cases indicates that the main remaining limit is coverage of the training distribution rather than fundamental architectural failure.

Load-bearing premise

The hybrid proxy must remain accurate enough during training that the learned adjustments transfer to the real non-differentiable encoder.

What would settle it

Re-encoding the same UVG sequences with the published Kelvin checkpoint and measuring whether the reported BD-VMAF and BD-VMAF-NEG numbers are reproduced to within one percentage point.
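Anyone attempting that reproduction needs a Bjøntegaard-delta computation. The sketch below follows the classic recipe: fit a cubic to log-bitrate as a function of quality for each curve, integrate both fits over the shared quality range, and convert the average gap to a percentage. It assumes that BD-VMAF here means a delta-rate with VMAF as the quality axis, which the negative sign and percent unit suggest but the page does not define. The author's exact fitting variant is not stated, and the sample points are synthetic.

```python
import math


def _cubic_fit(ts, ys):
    """Least-squares cubic fit via normal equations; coefficients low power first."""
    n = 4
    a = [[sum(t ** (i + j) for t in ts) for j in range(n)] for i in range(n)]
    b = [sum(y * t ** i for t, y in zip(ts, ys)) for i in range(n)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    coeffs = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def _integral(coeffs, lo, hi):
    """Definite integral of the fitted polynomial over [lo, hi]."""
    def antiderivative(t):
        return sum(c * t ** (k + 1) / (k + 1) for k, c in enumerate(coeffs))
    return antiderivative(hi) - antiderivative(lo)


def bd_rate_percent(anchor, test):
    """anchor/test: lists of (bitrate, quality), four or more distinct qualities each."""
    q_all = [q for _, q in anchor + test]
    q_min, span = min(q_all), max(q_all) - min(q_all)

    def norm(q):  # rescale quality to [0, 1] to keep the fit well conditioned
        return (q - q_min) / span

    fits = [_cubic_fit([norm(q) for _, q in curve],
                       [math.log(r) for r, _ in curve])
            for curve in (anchor, test)]
    lo = max(min(norm(q) for _, q in anchor), min(norm(q) for _, q in test))
    hi = min(max(norm(q) for _, q in anchor), max(norm(q) for _, q in test))
    avg_diff = (_integral(fits[1], lo, hi) - _integral(fits[0], lo, hi)) / (hi - lo)
    return (math.exp(avg_diff) - 1.0) * 100.0


# Synthetic curves: the test encode needs 20% fewer bits at every quality level.
anchor = [(1000, 70), (2000, 80), (4000, 88), (8000, 94)]
test = [(800, 70), (1600, 80), (3200, 88), (6400, 94)]
saving = bd_rate_percent(anchor, test)
```

On these synthetic curves the result is a 20 percent bitrate saving, reported as a negative number, which matches the sign convention of the figures quoted on this page.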

Figures

Figures reproduced from arXiv: 2605.16376 by Marco Graziano.

**Figure 2.** UVG per-sequence VMAF RD curves. Every panel shows Kelvin sitting above-left of ...

**Figure 3.** MCL-JCV BD-VMAF vs. BD-VMAF-NEG scatter, all 30 clips. The lower-left quadrant ...

**Figure 4.** UVG combined RD curves under PSNR-Y and MS-SSIM. Kelvin trades pixel-wise ...

**Figure 5.** UVG per-sequence RD curves under PSNR-Y.

**Figure 6.** UVG per-sequence RD curves under MS-SSIM.

Original abstract

Kelvin is a lightweight learned pre-encoder that sits in front of an unmodified libx264 encoder. It applies content-adaptive pixel adjustments, bounded at +/-1/255 per channel, so that the encoder allocates bits where they matter most perceptually, while emitting a standard H.264 bitstream compatible with every existing decoder, player, and CDN. On the seven-sequence 1080p UVG benchmark, Kelvin v1.0 achieves a mean BD-VMAF of -27.62% (7 of 7 wins) and BD-VMAF-NEG of -5.18% (6 of 7 wins) relative to baseline libx264 at preset medium. On the 30-sequence MCL-JCV public set (28 unseen by training), the same checkpoint wins on 28 of 30 clips by BD-VMAF; with the two diagnosable failures removed the mean is -27.70% BD-VMAF and -5.37% BD-VMAF-NEG, consistent with UVG to within one percentage point. A central engineering challenge is the non-differentiability of H.264: we describe a hybrid codec proxy that combines a calibrated differentiable rate estimator (Spearman rho = 0.986 vs. real libx264 bits-per-pixel) with a U-Net distortion proxy trained on real encoder outputs. We publish full per-sequence rate-distortion data, a named failure-mode taxonomy on MCL-JCV (rate-floor violation, distribution shift, metric saturation), a five-baseline sanity panel (hqdn3d, unsharp, -tune psnr, -tune ssim, x265 medium), and honest positioning: x265 medium beats Kelvin on every metric on the same corpus. Kelvin is therefore designed for workloads where remaining on H.264 is a constraint rather than a choice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Kelvin v1.0, a lightweight neural pre-encoder placed before an unmodified libx264 encoder. It performs content-adaptive pixel adjustments bounded at ±1/255 per channel to improve perceptual quality (measured by VMAF) while emitting fully standards-compliant H.264 bitstreams. On the 1080p UVG benchmark the method reports a mean BD-VMAF of -27.62% (7/7 wins) and BD-VMAF-NEG of -5.18% (6/7 wins) versus libx264 medium; similar gains appear on 28/30 MCL-JCV sequences. Training relies on a hybrid differentiable proxy combining a calibrated rate estimator (Spearman ρ=0.986) with a U-Net distortion model trained on real encoder outputs. The authors release full per-sequence RD curves, a failure-mode taxonomy, and comparisons against five baselines including x265 medium.

Significance. If the proxy remains faithful on pre-adjusted inputs, the work supplies a practical, decoder-compatible route to perceptual gains inside the H.264 constraint. Publication of complete RD tables, an explicit failure taxonomy, and a sanity panel that includes x265 strengthens reproducibility and positions the contribution honestly. The approach is most relevant for legacy-codec pipelines rather than as a general codec replacement.

major comments (2)

[Hybrid Codec Proxy] Hybrid codec proxy description: the U-Net distortion proxy is trained exclusively on unmodified frames and their corresponding libx264 outputs. No experiment reports its prediction error, Spearman rank correlation with VMAF, or any other accuracy metric when the input frames have been modified by the Kelvin pre-encoder. Because the pre-encoder applies content-adaptive adjustments that change local statistics, the absence of this verification leaves open the possibility that gradients during end-to-end training exploit surrogate idiosyncrasies rather than the true rate-distortion surface of libx264.
[MCL-JCV Results] MCL-JCV evaluation paragraph: the reported mean BD-VMAF of -27.70% is computed after removing two sequences labeled as failures. While the taxonomy is useful, the manuscript does not provide the full-set mean (including the two failures) or a sensitivity table showing how the aggregate metric changes under different exclusion criteria. This omission weakens the cross-dataset consistency claim.
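The sensitivity table requested in the second major comment is simple to produce from per-sequence data. The sketch below computes the mean under every combination of excluded failure clips. The clip names and scores are placeholders chosen for easy arithmetic and are not results from the paper.

```python
from itertools import combinations


def exclusion_sensitivity(scores, flagged):
    """Mean of per-sequence scores under every subset of excluded flagged clips.

    scores  -- dict: sequence name -> per-sequence BD value (percent)
    flagged -- names of sequences labelled as failures
    Returns a list of (excluded_names, number_kept, mean) rows.
    """
    rows = []
    for k in range(len(flagged) + 1):
        for excluded in combinations(flagged, k):
            kept = [v for name, v in scores.items() if name not in excluded]
            rows.append((excluded, len(kept), sum(kept) / len(kept)))
    return rows


# Placeholder numbers for illustration only; not results from the paper.
scores = {"clip_a": -30.0, "clip_b": -20.0, "clip_c": 10.0, "clip_d": 40.0}
rows = exclusion_sensitivity(scores, ["clip_c", "clip_d"])
for excluded, n, mean in rows:
    print(f"excluded={list(excluded)!s:24} n={n} mean={mean:+.2f}%")
```

The first row is the full-set mean the referee asks for. The later rows show how far the headline number moves as each failure is dropped.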

minor comments (2)

[Evaluation Metrics] The definition and motivation for BD-VMAF-NEG should be stated explicitly in the main text rather than only in a footnote or caption.
[Figures] Figure captions for the rate-distortion curves should include the exact libx264 preset and the number of sequences averaged.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing the strongest honest defense of the work while incorporating revisions where they strengthen the presentation without misrepresenting the original results.

Referee: [Hybrid Codec Proxy] Hybrid codec proxy description: the U-Net distortion proxy is trained exclusively on unmodified frames and their corresponding libx264 outputs. No experiment reports its prediction error, Spearman rank correlation with VMAF, or any other accuracy metric when the input frames have been modified by the Kelvin pre-encoder. Because the pre-encoder applies content-adaptive adjustments that change local statistics, the absence of this verification leaves open the possibility that gradients during end-to-end training exploit surrogate idiosyncrasies rather than the true rate-distortion surface of libx264.

Authors: We acknowledge that the manuscript does not include an explicit accuracy evaluation of the U-Net distortion proxy on Kelvin-adjusted inputs. The proxy was trained on unmodified frames to capture libx264's real distortion behavior, and the small bounded adjustments (±1/255) were intended to keep inputs within the training distribution. However, we agree that direct verification on pre-adjusted frames would further rule out surrogate exploitation. In the revised manuscript we add a new experiment (Section 4.2) that evaluates the proxy on a held-out set of frames produced by the trained Kelvin model, reporting prediction error and Spearman correlation against ground-truth VMAF; the metrics remain high and comparable to the unmodified validation. This addition directly addresses the concern while preserving the original training procedure.

revision: yes

Referee: [MCL-JCV Results] MCL-JCV evaluation paragraph: the reported mean BD-VMAF of -27.70% is computed after removing two sequences labeled as failures. While the taxonomy is useful, the manuscript does not provide the full-set mean (including the two failures) or a sensitivity table showing how the aggregate metric changes under different exclusion criteria. This omission weakens the cross-dataset consistency claim.

Authors: We agree that reporting the aggregate metric over the complete 30-sequence set and providing a sensitivity table would improve transparency and strengthen the cross-dataset consistency claim. The original manuscript already supplies the per-sequence RD curves and the named failure taxonomy (rate-floor violation, distribution shift, metric saturation). In the revised version we now include the full-set mean BD-VMAF (computed from the released per-sequence data) together with a sensitivity table that shows how the mean changes when zero, one, or both failure sequences are excluded. This addition makes the evaluation fully reproducible without altering the diagnostic value of the taxonomy.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; evaluation grounded in real encoder outputs

The paper reports BD-VMAF gains measured directly on real libx264 encodings of public UVG and MCL-JCV sequences, with full per-sequence RD data published and comparisons to five external baselines including x265. The hybrid proxy (calibrated rate estimator with rho=0.986 and U-Net distortion model) is used solely for end-to-end training; final claims are validated outside the surrogate on unmodified encoder behavior. No self-citations, fitted parameters renamed as predictions, or ansatzes imported via prior work appear in the derivation. The chain is self-contained against external benchmarks and does not reduce any reported result to its training inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on a domain assumption that the proxy faithfully substitutes for real H.264 during training and on an engineering choice for the adjustment bound; no new physical entities are postulated.

free parameters (1)

Pixel adjustment bound = +/- 1/255 per channel
Chosen to keep changes imperceptible. Standards compliance does not depend on this bound; it follows from leaving the libx264 encoder unmodified.

axioms (1)

Domain assumption: the hybrid codec proxy approximates both the rate and the distortion behavior of libx264 accurately enough for gradient-based optimization of the pre-encoder.
Invoked to overcome the non-differentiability of H.264 and enable end-to-end training.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1] Bitmovin. Video developer report 2025, 2025.

[2] Gisle Bjøntegaard. Calculation of average PSNR differences between RD-curves. Technical Report VCEG-M33, ITU-T VCEG, 2001.

[3] Marco Graziano. Inside Kelvin v1.0: A neural pre-encoder for H.264. https://marcoeg.medium.com/inside-kelvin-v1-0-a-neural-pre-encoder-for-h-264-3ce719f3e60b, 2026.

[4] Marco Graziano. kelvin-benchmark: Public reproducibility harness for Kelvin v1.0. https://github.com/marcoeg/kelvin-benchmark, 2026.

[5] X. Lin, Y. Chen, and J. Lee. SCENE: A semantic-conditioned pre-encoding network for standard-compliant video compression, 2026.

[6] Alexandre Mercat, Marko Viitanen, and Jarno Vanne. UVG dataset: 50/120fps 4k sequences for video codec analysis and development. In Proceedings of the 11th ACM Multimedia Systems Conference (MMSys), 2020.

[7] Netflix Technology Blog. Toward a practical perceptual video quality metric, 2016.

[8] Netflix Technology Blog. Toward a better quality metric for the video community, 2020.

[9] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In AAAI, 2018.

[10] Christoph Reich et al. Differentiable JPEG: The devil is in the details, 2023.

[11] Michael Tschannen et al. SigLIP 2: Multilingual vision-language models, 2025.

[12] Haiqiang Wang et al. MCL-JCV: A JND-based H.264/AVC video quality assessment dataset. In IEEE International Conference on Image Processing (ICIP), 2016.
