RT-NeRV: Rethinking Hybrid Neural Representations for Video via Residual Tokenization

Alan Wee-Chung Liew; Chengkai Wang; Xiang Feng; Xuefei Yin; Yanming Zhu; Yunjie Xu

arxiv: 2403.12401 · v2 · pith:RGIIOWYBnew · submitted 2024-03-19 · 💻 cs.CV

RT-NeRV: Rethinking Hybrid Neural Representations for Video via Residual Tokenization

Yunjie Xu , Xiang Feng , Chengkai Wang , Alan Wee-Chung Liew , Xuefei Yin , Yanming Zhu This is my paper

Pith reviewed 2026-05-25 08:29 UTC · model grok-4.3

classification 💻 cs.CV

keywords residual tokenizationhybrid NeRVvideo compressionneural video representationsdetail preservationcodebook learningvideo regressionlow-bitrate reconstruction

0 comments

The pith

Discretizing shallow residual features into compact tokens lets hybrid NeRV transmit detail-preserving information efficiently at low bitrates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RT-NeRV to fix a bottleneck in hybrid neural video representations: shallow residual support helps reconstruction but costs too much to send in continuous form, so it stays underused at low bitrates. It proposes turning those residual features and inter-frame cues into discrete compact tokens through a dedicated tokenizer and codebook strategy. This change lets the decoder exploit the support without high transmission overhead. When plugged into existing hybrid NeRV hosts, the method raises detail preservation and overall reconstruction quality while improving bitrate trade-offs. Experiments across video regression and restoration tasks show it beats strong baselines and stays competitive with other INR compression approaches.

Core claim

The central claim is that discretizing shallow residual features and inter-frame residual cues into compact residual tokens via a residual tokenizer and residual-aware codebook learning strategy transmits informative reconstruction support efficiently at low bitrates, allowing the decoder to exploit it and thereby improving detail preservation, reconstruction quality, and bitrate-quality trade-offs when integrated into hybrid NeRV hosts.

What carries the argument

A residual tokenizer paired with residual-aware codebook learning that converts continuous shallow residual features into discrete compact tokens for efficient transmission and decoder use.

If this is right

Integration into existing hybrid NeRV architectures raises detail preservation without redesigning the host decoder.
Reconstruction quality improves across video regression tasks at the same bitrate.
Bitrate-quality trade-offs shift favorably compared with prior hybrid NeRV baselines.
Performance remains competitive with recent INR-based video compression methods on the same tasks.
The same tokenization step applies to related restoration tasks and yields similar gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The tokenization step could be tested as a drop-in module for other neural video codecs that already use residual signals.
If token utilization stays high across diverse content, the approach might reduce reliance on high-dimensional continuous embeddings in bandwidth-constrained settings.
Extending the codebook learning to handle temporal consistency across longer video sequences could further stabilize training on dynamic scenes.
Measuring decoder runtime with the added tokenizer would show whether the efficiency gain at transmission time carries through to real-time playback.

Load-bearing premise

Discretizing the shallow residual features and cues into tokens keeps enough of their informative value for reconstruction without meaningful loss of fidelity.

What would settle it

A direct comparison at low bitrates where the token-discretized version shows no gain or a clear drop in fine-detail PSNR or perceptual metrics relative to the continuous-residual hybrid baseline.

Figures

Figures reproduced from arXiv: 2403.12401 by Alan Wee-Chung Liew, Chengkai Wang, Xiang Feng, Xuefei Yin, Yanming Zhu, Yunjie Xu.

**Figure 2.** Figure 2: Overview of the proposed method VQ-NeRV. The upper figure is the video encoding [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of VQ-NeRV Block Architecture. down-sampling modules to reduce the dimensions of shallow residual features from x ∈ R C0×H0×W0 to x ∈ R 64C0×H0/8×W0/8 . Following this, the VQ-NeRV Block utilizes an invertible block to map x ∈ R 64C0×H0/8×W0/8 to the shallow codebook’s discretized residual feature, as well as Z ′ , which is case agnostic feature. By replacing the original residual feature with the… view at source ↗

**Figure 4.** Figure 4: (a) The selection criteria for Exponential Moving Average (EMA) updates within a batch [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Visualization comparing VQ-NeRV with other state-of-the-art methods for several patches [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: (a) Compression results on the bunny dataset. (b) Compression results on the UVG dataset. [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

read the original abstract

Neural Representations for Videos(NeRV) have emerged as a promising paradigm for video compression by representing videos as compact neural networks with efficient decoding. Hybrid NeRV methods further improve reconstruction quality through content adaptive embeddings, but still struggle to preserve fine details at low bitrates. A key limitation is that shallow residual support in formation, although highly beneficial for reconstruction, is costly to transmit in its continuous form and is therefore underutilized. In this paper, we rethink hybrid NeRV and present RT-NeRV, a residual tokenization framework for hybrid neural video representations. The core idea is to discretize shallow residual features and inter-frame residual cues into compact residual tokens, allowing informative reconstruction support to be transmitted efficiently and exploited by the decoder. To this end, we design a residual tokenizer together with a residual-aware codebook learning strategy that improves token utilization and stabilizes training. RT-NeRV can be readily integrated into modern hybrid NeRV hosts, consistently enhancing detail preservation, reconstruction quality, and bitrate quality trade-offs. Extensive experiments on video regression and related restoration tasks show that RT-NeRV outperforms strong hybrid NeRV baselines and remains competitive with recent INR based video compression methods. These results demonstrate that residual tokenization is an effective and complementary direction for advancing hybrid neural video representations

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RT-NeRV adds residual tokenization to hybrid NeRV to transmit shallow residuals more efficiently, a straightforward extension that could help at low bitrates if the gains hold up in the experiments.

read the letter

RT-NeRV's main contribution is a residual tokenization step that turns shallow residual features and inter-frame cues into discrete tokens via a learned codebook. This lets hybrid NeRV methods send more reconstruction support without the cost of continuous residuals, which the paper identifies as a bottleneck at low bitrates. They also add a residual-aware codebook learning strategy meant to improve token use and training stability. The approach is designed to plug into existing hybrid NeRV hosts, and the abstract claims consistent improvements in detail preservation and quality-bitrate tradeoffs on regression and restoration tasks while staying competitive with other INR compression work. The tokenization idea itself is presented as new relative to the cited prior NeRV papers, and it directly targets the underutilization problem they describe. That part reads as a reasonable, incremental fix rather than a complete overhaul. The soft spot is the lack of visible numbers, datasets, or ablations in the summary material. The central assumption—that discretizing the residuals keeps enough fidelity for the decoder to exploit—needs concrete checks on codebook size, token utilization rates, and how much information is actually lost. Without those details it's hard to tell if the reported outperformance is meaningful or mostly small. The construction does not show obvious circularity or internal contradictions from what is described. This is for people already working on neural video codecs and INR-based compression. A reader focused on practical bitrate improvements in hybrid setups would find the integration details and codebook design useful. It deserves a serious referee because the limitation it targets is real and the proposed mechanism is specific enough to evaluate, even if the empirical case needs strengthening. Recommendation: send it to review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces RT-NeRV, a residual tokenization framework for hybrid neural video representations (NeRV). It addresses the underutilization of shallow residual support at low bitrates by discretizing shallow residual features and inter-frame residual cues into compact residual tokens via a residual tokenizer and a residual-aware codebook learning strategy. The approach is designed for plug-in integration into existing hybrid NeRV hosts, with the goal of improving detail preservation, reconstruction quality, and rate-distortion trade-offs. Experiments on video regression and restoration tasks are reported to show consistent outperformance over strong hybrid NeRV baselines while remaining competitive with recent INR-based video compression methods.

Significance. If the empirical claims hold, the work provides a complementary and practical direction for hybrid NeRV by converting costly continuous residual information into efficiently transmissible discrete tokens. The residual-aware codebook learning that improves utilization and stabilizes training is a concrete technical contribution that could be adopted more broadly in INR video compression pipelines.

major comments (2)

[§3] §3 (core method): the central claim that discretization 'preserves the informative reconstruction support without significant loss of fidelity' is load-bearing for the entire contribution, yet the manuscript provides no analysis (e.g., token reconstruction error, information-theoretic bounds, or ablation on codebook size) quantifying the fidelity loss introduced by the learned codebook relative to the continuous residual features.
[§4] §4 (experiments): the abstract and introduction assert 'consistent outperformance' and 'extensive experiments,' but the reported tables lack per-sequence breakdowns, statistical significance tests, or bitrate-specific ablations that would substantiate the claim that residual tokenization is the decisive factor rather than other implementation details of the host NeRV model.

minor comments (2)

[§3.1] Notation: the distinction between 'residual tokens' and the output of the residual tokenizer is not clearly defined in the first occurrence; a short equation or diagram label would remove ambiguity.
[Figure 2] Figure 2 (architecture): the flow from inter-frame residual cues into the codebook is difficult to trace; adding an explicit arrow label or caption sentence would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on RT-NeRV. The comments highlight important aspects of the core method and experimental rigor. We address each point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: [§3] §3 (core method): the central claim that discretization 'preserves the informative reconstruction support without significant loss of fidelity' is load-bearing for the entire contribution, yet the manuscript provides no analysis (e.g., token reconstruction error, information-theoretic bounds, or ablation on codebook size) quantifying the fidelity loss introduced by the learned codebook relative to the continuous residual features.

Authors: We agree that explicit quantification of fidelity loss from discretization is necessary to support the central claim. The current manuscript relies on end-to-end rate-distortion results and qualitative visualizations but does not isolate the tokenizer's reconstruction error or provide codebook-size ablations. We will add these analyses in the revision, including per-layer token reconstruction PSNR, codebook utilization statistics, and an information-theoretic comparison (e.g., mutual information between continuous residuals and quantized tokens) to demonstrate that the loss remains small relative to the bitrate savings. revision: yes
Referee: [§4] §4 (experiments): the abstract and introduction assert 'consistent outperformance' and 'extensive experiments,' but the reported tables lack per-sequence breakdowns, statistical significance tests, or bitrate-specific ablations that would substantiate the claim that residual tokenization is the decisive factor rather than other implementation details of the host NeRV model.

Authors: The referee correctly notes that the current tables aggregate results without per-sequence detail or statistical tests. While the manuscript already contains some bitrate sweeps and comparisons against multiple hybrid NeRV hosts, these do not isolate the contribution of residual tokenization via controlled ablations (e.g., with vs. without the tokenizer at fixed host architecture). We will expand the experimental section with per-sequence tables, paired t-tests or Wilcoxon tests for significance, and additional ablations that swap only the residual representation while keeping all other components identical. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces RT-NeRV as a residual tokenization framework that discretizes shallow residual features into tokens via a learned codebook for integration into hybrid NeRV hosts. The abstract and described construction present this as an empirical design choice with a residual-aware codebook strategy, validated through experiments on video regression and restoration tasks showing gains over baselines. No equations, fitted parameters, or self-citations are shown that reduce any central claim or prediction to a definitional equivalence or input by construction. The methodology remains self-contained against external benchmarks, with the core improvement presented as a complementary direction rather than a forced renaming or self-referential fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities can be extracted beyond the high-level description of the residual tokenizer and codebook.

invented entities (1)

residual tokens no independent evidence
purpose: discretized representation of shallow residual features and inter-frame cues for efficient transmission
Core idea stated in abstract; no independent evidence supplied

pith-pipeline@v0.9.0 · 5769 in / 1195 out tokens · 20228 ms · 2026-05-25T08:29:06.855551+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

[1]

Agustsson, D

E. Agustsson, D. Minnen, N. Johnston, J. Balle, S. J. Hwang, and G. Toderici. Scale-space flow for end-to-end optimized video compression. In CVPR, pages 8503–8512, 2020

work page 2020
[2]

Ahmed, T

N. Ahmed, T. Natarajan, and K. R. Rao. Discrete cosine transform. TC, 100(1):90–93, 1974

work page 1974
[3]

Y . Bai, C. Dong, C. Wang, and C. Yuan. Ps-nerv: Patch-wise stylized neural representations for videos. In ICIP, pages 41–45. IEEE, 2023

work page 2023
[4]

H. Chen, B. He, H. Wang, Y . Ren, S. N. Lim, and A. Shrivastava. Nerv: Neural representations for videos. NeurIPS, 34:21557–21568, 2021

work page 2021
[5]

H. Chen, M. Gwilliam, S.-N. Lim, and A. Shrivastava. Hnerv: A hybrid neural representation for videos. In ICCV, pages 10270–10279, 2023

work page 2023
[6]

X. Feng, Y . He, Y . Wang, C. Wang, Z. Kuang, J. Ding, F. Qin, J. Yu, and J. Fan. Zs-srt: An efficient zero-shot super-resolution training method for neural radiance fields. arXiv preprint arXiv:2312.12122, 2023

work page arXiv 2023
[7]

Habibian, T

A. Habibian, T. v. Rozendaal, J. M. Tomczak, and T. S. Cohen. Video compression with rate-distortion autoencoders. In ICCV, pages 7033–7042, 2019

work page 2019
[8]

M. Huh, B. Cheung, P. Agrawal, and P. Isola. Straightening out the straight-through esti- mator: Overcoming optimization challenges in vector quantized networks. arXiv preprint arXiv:2305.08842, 2023

work page arXiv 2023
[9]

Khani, V

M. Khani, V . Sivaraman, and M. Alizadeh. Efficient video compression via content-adaptive super-resolution. In ICCV, pages 4521–4530, 2021

work page 2021
[10]

H. M. Kwan, G. Gao, F. Zhang, A. Gower, and D. Bull. Hinerv: Video compression with hierarchical encoding-based neural representation. NeurIPS, 36, 2024

work page 2024
[11]

J. C. Lee, D. Rho, J. H. Ko, and E. Park. Ffnerv: Flow-guided frame-wise neural representations for videos. In ACMMM, pages 7859–7870, 2023

work page 2023
[12]

J. Li, B. Li, and Y . Lu. Deep contextual video compression. NeurIPS, 34:18114–18125, 2021

work page 2021
[13]

Z. Li, M. Wang, H. Pi, K. Xu, J. Mei, and Y . Liu. E-nerv: Expedite neural video representation with disentangled spatial-temporal context. In ECCV, pages 267–284. Springer, 2022

work page 2022
[14]

J. Liu, S. Wang, W.-C. Ma, M. Shah, R. Hu, P. Dhawan, and R. Urtasun. Conditional entropy coding for efficient video compression. In ECCV, pages 453–468. Springer, 2020

work page 2020
[15]

Y . Liu, Z. Qin, S. Anwar, S. Caldwell, and T. Gedeon. Are deep neural architectures losing information? invertibility is indispensable. In ICONIP, pages 172–184. Springer, 2020

work page 2020
[16]

Z. Liu, H. Mao, C.-Y . Wu, C. Feichtenhofer, T. Darrell, and S. Xie. A convnet for the 2020s. In CVPR, pages 11976–11986, 2022

work page 2022
[17]

G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao. Dvc: An end-to-end deep video compression framework. In CVPR, pages 11006–11015, 2019

work page 2019
[18]

Mercat, M

A. Mercat, M. Viitanen, and J. Vanne. Uvg dataset: 50/120fps 4k sequences for video codec analysis and development. In ACMMM, pages 297–302, 2020

work page 2020
[19]

Oswal, A

S. Oswal, A. Singh, and K. Kumari. Deflate compression algorithm. International Journal of Engineering Research and General Science , 4(1):430–436, 2016

work page 2016
[20]

J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In CVPR, pages 165–174, 2019

work page 2019
[21]

Rippel, A

O. Rippel, A. G. Anderson, K. Tatwawadi, S. Nair, C. Lytle, and L. Bourdev. Elf-vc: Efficient learned flexible-rate video coding. In CVPR, pages 14479–14488, 2021

work page 2021
[22]

Ronneberger, P

O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015

work page 2015
[23]

Roosendaal

T. Roosendaal. Big buck bunny. In SIGGRAPH, pages 62–62. 2008. 11

work page 2008
[24]

Sitzmann, J

V . Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein. Implicit neural representations with periodic activation functions. NeurIPS, 33:7462–7473, 2020

work page 2020
[25]

G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand. Overview of the high efficiency video coding (hevc) standard. TCSVT, 22(12):1649–1668, 2012

work page 2012
[26]

Van Den Oord, O

A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. NeurIPS, 30, 2017

work page 2017
[27]

H. Wang, W. Gan, S. Hu, J. Y . Lin, L. Jin, L. Song, P. Wang, I. Katsavounidis, A. Aaron, and C.-C. J. Kuo. Mcl-jcv: a jnd-based h. 264/avc video quality assessment dataset. In ICIP, pages 1509–1513. IEEE, 2016

work page 2016
[28]

Wiegand, G

T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the h. 264/avc video coding standard. IEEE TIP, 13(7):560–576, 2003

work page 2003
[29]

C.-Y . Wu, N. Singhal, and P. Krahenbuhl. Video compression through image interpolation. In ECCV, pages 416–431, 2018

work page 2018
[30]

M. Xiao, S. Zheng, C. Liu, Y . Wang, D. He, G. Ke, J. Bian, Z. Lin, and T.-Y . Liu. Invertible image rescaling. In ECCV, pages 126–144. Springer, 2020

work page 2020
[31]

Q. Zhao, M. S. Asif, and Z. Ma. Dnerv: Modeling inherent dynamics via difference neural representation for videos. In CVPR, pages 2031–2040, 2023. 12

work page 2031

[1] [1]

Agustsson, D

E. Agustsson, D. Minnen, N. Johnston, J. Balle, S. J. Hwang, and G. Toderici. Scale-space flow for end-to-end optimized video compression. In CVPR, pages 8503–8512, 2020

work page 2020

[2] [2]

Ahmed, T

N. Ahmed, T. Natarajan, and K. R. Rao. Discrete cosine transform. TC, 100(1):90–93, 1974

work page 1974

[3] [3]

Y . Bai, C. Dong, C. Wang, and C. Yuan. Ps-nerv: Patch-wise stylized neural representations for videos. In ICIP, pages 41–45. IEEE, 2023

work page 2023

[4] [4]

H. Chen, B. He, H. Wang, Y . Ren, S. N. Lim, and A. Shrivastava. Nerv: Neural representations for videos. NeurIPS, 34:21557–21568, 2021

work page 2021

[5] [5]

H. Chen, M. Gwilliam, S.-N. Lim, and A. Shrivastava. Hnerv: A hybrid neural representation for videos. In ICCV, pages 10270–10279, 2023

work page 2023

[6] [6]

X. Feng, Y . He, Y . Wang, C. Wang, Z. Kuang, J. Ding, F. Qin, J. Yu, and J. Fan. Zs-srt: An efficient zero-shot super-resolution training method for neural radiance fields. arXiv preprint arXiv:2312.12122, 2023

work page arXiv 2023

[7] [7]

Habibian, T

A. Habibian, T. v. Rozendaal, J. M. Tomczak, and T. S. Cohen. Video compression with rate-distortion autoencoders. In ICCV, pages 7033–7042, 2019

work page 2019

[8] [8]

M. Huh, B. Cheung, P. Agrawal, and P. Isola. Straightening out the straight-through esti- mator: Overcoming optimization challenges in vector quantized networks. arXiv preprint arXiv:2305.08842, 2023

work page arXiv 2023

[9] [9]

Khani, V

M. Khani, V . Sivaraman, and M. Alizadeh. Efficient video compression via content-adaptive super-resolution. In ICCV, pages 4521–4530, 2021

work page 2021

[10] [10]

H. M. Kwan, G. Gao, F. Zhang, A. Gower, and D. Bull. Hinerv: Video compression with hierarchical encoding-based neural representation. NeurIPS, 36, 2024

work page 2024

[11] [11]

J. C. Lee, D. Rho, J. H. Ko, and E. Park. Ffnerv: Flow-guided frame-wise neural representations for videos. In ACMMM, pages 7859–7870, 2023

work page 2023

[12] [12]

J. Li, B. Li, and Y . Lu. Deep contextual video compression. NeurIPS, 34:18114–18125, 2021

work page 2021

[13] [13]

Z. Li, M. Wang, H. Pi, K. Xu, J. Mei, and Y . Liu. E-nerv: Expedite neural video representation with disentangled spatial-temporal context. In ECCV, pages 267–284. Springer, 2022

work page 2022

[14] [14]

J. Liu, S. Wang, W.-C. Ma, M. Shah, R. Hu, P. Dhawan, and R. Urtasun. Conditional entropy coding for efficient video compression. In ECCV, pages 453–468. Springer, 2020

work page 2020

[15] [15]

Y . Liu, Z. Qin, S. Anwar, S. Caldwell, and T. Gedeon. Are deep neural architectures losing information? invertibility is indispensable. In ICONIP, pages 172–184. Springer, 2020

work page 2020

[16] [16]

Z. Liu, H. Mao, C.-Y . Wu, C. Feichtenhofer, T. Darrell, and S. Xie. A convnet for the 2020s. In CVPR, pages 11976–11986, 2022

work page 2022

[17] [17]

G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao. Dvc: An end-to-end deep video compression framework. In CVPR, pages 11006–11015, 2019

work page 2019

[18] [18]

Mercat, M

A. Mercat, M. Viitanen, and J. Vanne. Uvg dataset: 50/120fps 4k sequences for video codec analysis and development. In ACMMM, pages 297–302, 2020

work page 2020

[19] [19]

Oswal, A

S. Oswal, A. Singh, and K. Kumari. Deflate compression algorithm. International Journal of Engineering Research and General Science , 4(1):430–436, 2016

work page 2016

[20] [20]

J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In CVPR, pages 165–174, 2019

work page 2019

[21] [21]

Rippel, A

O. Rippel, A. G. Anderson, K. Tatwawadi, S. Nair, C. Lytle, and L. Bourdev. Elf-vc: Efficient learned flexible-rate video coding. In CVPR, pages 14479–14488, 2021

work page 2021

[22] [22]

Ronneberger, P

O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015

work page 2015

[23] [23]

Roosendaal

T. Roosendaal. Big buck bunny. In SIGGRAPH, pages 62–62. 2008. 11

work page 2008

[24] [24]

Sitzmann, J

V . Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein. Implicit neural representations with periodic activation functions. NeurIPS, 33:7462–7473, 2020

work page 2020

[25] [25]

G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand. Overview of the high efficiency video coding (hevc) standard. TCSVT, 22(12):1649–1668, 2012

work page 2012

[26] [26]

Van Den Oord, O

A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. NeurIPS, 30, 2017

work page 2017

[27] [27]

H. Wang, W. Gan, S. Hu, J. Y . Lin, L. Jin, L. Song, P. Wang, I. Katsavounidis, A. Aaron, and C.-C. J. Kuo. Mcl-jcv: a jnd-based h. 264/avc video quality assessment dataset. In ICIP, pages 1509–1513. IEEE, 2016

work page 2016

[28] [28]

Wiegand, G

T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the h. 264/avc video coding standard. IEEE TIP, 13(7):560–576, 2003

work page 2003

[29] [29]

C.-Y . Wu, N. Singhal, and P. Krahenbuhl. Video compression through image interpolation. In ECCV, pages 416–431, 2018

work page 2018

[30] [30]

M. Xiao, S. Zheng, C. Liu, Y . Wang, D. He, G. Ke, J. Bian, Z. Lin, and T.-Y . Liu. Invertible image rescaling. In ECCV, pages 126–144. Springer, 2020

work page 2020

[31] [31]

Q. Zhao, M. S. Asif, and Z. Ma. Dnerv: Modeling inherent dynamics via difference neural representation for videos. In CVPR, pages 2031–2040, 2023. 12

work page 2031