arxiv: 2605.18054 · v1 · pith:XO5ZDRVRnew · submitted 2026-05-18 · 📡 eess.IV · cs.CV · cs.MM

CATRF: Codec-Adaptive TriPlane Radiance Fields for Volumetric Content Delivery

Pith reviewed 2026-05-20 00:32 UTC · model grok-4.3

classification 📡 eess.IV · cs.CV · cs.MM
keywords volumetric compression · radiance fields · triplane representation · codec-adaptive training · rate-distortion optimization · standard video codecs · free-viewpoint video · straight-through estimator

The pith

Training triplane radiance fields with real codec roundtrips lets volumetric content reach better rate-distortion performance than codec-agnostic baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CATRF as a compression framework that folds standard codecs such as HEVC and AV1 directly into the training of plane-factorized radiance fields. Feature planes are quantized, packed into codec-friendly canvases, encoded and decoded, then unpacked and dequantized before rendering; a straight-through estimator lets gradients flow through the non-differentiable codec so the features can adapt to the actual client-side distortions. The resulting representations are tested on both static and dynamic volumetric benchmarks and compared against codec-agnostic training and against recent compressed 3D Gaussian splatting methods. A sympathetic reader cares because the method keeps the representation compatible with widely deployed video codecs while still improving efficiency and decoding speed, which matters for practical free-viewpoint video delivery.

Core claim

CATRF trains triplane radiance fields by quantizing and packing the 2D feature planes into canvases, running a full roundtrip through a chosen standard codec, unpacking the decoded features, and using a straight-through estimator to back-propagate through the entire non-differentiable pipeline, so that the learned features become resilient to the specific quantization and coding artifacts that the target codec will introduce at inference time.

What carries the argument

Codec-in-the-loop training with straight-through estimator, which simulates the complete quantization-packing-encoding-decoding-unpacking pipeline on the triplane features so they can adapt to real codec distortions without any learned codec parameters.
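The straight-through idea can be illustrated with a small, framework-free sketch. This is not the authors' implementation: `fake_codec` is a hypothetical stand-in (coarse uniform quantization) for a real JPEG/VP9/HEVC/AV1 roundtrip, `render` is a toy linear "renderer" with made-up weights, and the gradient is written out by hand. The forward pass sees the distorted features; the backward pass treats the codec as the identity.

```python
# Minimal straight-through-estimator (STE) sketch in plain Python.
# Assumptions (not from the paper): `fake_codec` replaces a real codec
# roundtrip, and `render` is a toy weighted sum with hypothetical weights.

def fake_codec(features, levels=8):
    """Non-differentiable stand-in: snap each value in [0, 1] to a coarse grid."""
    step = 1.0 / (levels - 1)
    return [min(1.0, max(0.0, round(f / step) * step)) for f in features]

WEIGHTS = [0.5, 0.3, 0.2]  # hypothetical fixed renderer weights

def render(features):
    return sum(w * f for w, f in zip(WEIGHTS, features))

def loss_after_codec(features, target):
    return (render(fake_codec(features)) - target) ** 2

def train(features, target, lr=0.1, steps=200):
    features = list(features)
    for _ in range(steps):
        decoded = fake_codec(features)             # forward: features as the client would see them
        err = render(decoded) - target             # loss = err ** 2
        grad_decoded = [2.0 * err * w for w in WEIGHTS]
        grad_features = grad_decoded               # STE: pretend d(decoded)/d(features) is the identity
        features = [min(1.0, max(0.0, f - lr * g))
                    for f, g in zip(features, grad_features)]
    return features

start = [0.9, 0.1, 0.5]
target = 0.4
trained = train(start, target)
print(loss_after_codec(start, target), loss_after_codec(trained, target))
```

Because the stand-in codec is piecewise constant, its true gradient is zero almost everywhere; the identity substitution is what lets the features move at all, at the cost of possible oscillation around quantization boundaries.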

If this is right

  • CATRF achieves a better rate-distortion trade-off than both codec-agnostic and learned-codec baselines on static and dynamic volumetric benchmarks.
  • The method also outperforms recent compressed 3D Gaussian splatting approaches in both compression efficiency and decoding speed.
  • The approach supplies a practical route to low-bitrate, compression-resilient volumetric representations suitable for free-viewpoint video streaming.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adaptation strategy could be applied to other plane-based or hybrid implicit representations without changing the codec pipeline.
  • Client-side decoding speed gains may enable higher frame-rate or multi-user free-viewpoint experiences on consumer hardware.
  • Because the method uses only standard codecs, it can be deployed immediately on existing video infrastructure while still benefiting from neural representations.

Load-bearing premise

The straight-through estimator lets the radiance-field features adapt to the non-differentiable distortions of standard codecs without training instability or unmodeled quantization artifacts dominating final quality.

What would settle it

Running the same static and dynamic benchmarks and finding that CATRF's rate-distortion curves lie below or on top of the codec-agnostic baselines at all operating points would falsify the central claim.
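One way to make this test concrete is to compare two rate-distortion curves over their shared bitrate range. The sketch below is an illustration only: the sample points are made up, the units are arbitrary, and the helper names `psnr_at` and `dominates` are hypothetical. It interpolates PSNR linearly in log-rate and checks whether one curve is strictly above the other at every sampled operating point.

```python
import math

def psnr_at(curve, rate):
    """Interpolate PSNR at `rate` (linear in log-rate); `curve` is a rate-sorted list of (rate, psnr)."""
    for (r0, p0), (r1, p1) in zip(curve, curve[1:]):
        if r0 <= rate <= r1:
            t = (math.log(rate) - math.log(r0)) / (math.log(r1) - math.log(r0))
            return p0 + t * (p1 - p0)
    raise ValueError("rate outside the curve's range")

def dominates(curve_a, curve_b, samples=50):
    """True if curve_a has strictly higher PSNR than curve_b across the shared rate range."""
    lo = max(curve_a[0][0], curve_b[0][0])
    hi = min(curve_a[-1][0], curve_b[-1][0])
    if lo >= hi:
        raise ValueError("curves do not overlap in rate")
    for i in range(samples + 1):
        rate = math.exp(math.log(lo) + (math.log(hi) - math.log(lo)) * i / samples)
        rate = min(max(rate, lo), hi)  # guard against floating-point drift at the ends
        if psnr_at(curve_a, rate) <= psnr_at(curve_b, rate):
            return False
    return True

# Made-up (rate, PSNR in dB) points, purely illustrative.
in_loop = [(0.1, 29.0), (0.3, 31.5), (1.0, 33.0)]
agnostic = [(0.1, 27.5), (0.3, 30.0), (1.0, 32.5)]
print(dominates(in_loop, agnostic), dominates(agnostic, in_loop))
```

Under this framing, the central claim would be in trouble if the codec-in-the-loop curve failed to sit above the codec-agnostic one anywhere in the shared range. A Bjontegaard-delta comparison would summarize the same gap as a single number.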

Figures

Figures reproduced from arXiv: 2605.18054 by Lingdong Wang, Ramesh K. Sitaraman, Subhransu Maji, Tung-I Chen.

Figure 1: Overview of codec-integrated NeRF compression
Figure 2: Illustration of the encode–decode codec round trip. In …
Figure 3: Comparison with baselines on the NeRF Synthetic (left) and Tanks and Temples (right) benchmarks. On NeRF Synthetic, …
Figure 4: Qualitative comparisons on Chair from NeRF Synthetic and Family from Tanks and Temples. Compared with recent learned-codec and 3DGS compression baselines, CATRF-JPEG offers a flexible operating range: it can achieve substantially lower bitrate with only modest quality loss, or deliver sharper details and fewer artifacts at a slightly higher rate.
Figure 5: Rate-distortion (RD) curves. We compare CATRF with codec-agnostic TeTriRF [ …
Figure 6: Qualitative comparisons on the Neural 3D Video benchmark. At high bitrate, both CA and SCL methods produce comparable …
Figure 7: Visualization of appearance-plane canvases under dif …
Figure 8: Training diagnostics of CATRF with STE as the gradient …
Figure 9: Comparison of visualized appearance planes packed with …
Figure 10: More qualitative comparisons of NeRF Synthetic and Tanks and Temples benchmarks.
Figure 11: More qualitative comparisons of Neural 3D Video and NHR benchmarks.

Original abstract

Volumetric media promises next-generation content delivery applications, but its bandwidth demand remains a key bottleneck. Implicit and hybrid volumetric representations reduce model sizes, yet still require careful coding to reach 2D video-like bitrates. We present CATRF, a standard-codec-in-the-loop compression framework for plane-factorized radiance fields. During training, we quantize and pack 2D feature planes into codec-friendly canvases, run a standard codec roundtrip (JPEG/VP9/HEVC/AV1), then unpack and dequantize the decoded features before volume rendering. We use a straight-through estimator (STE) to insert the non-differentiable, standard codec pipeline into the training loop, allowing radiance-field features to adapt directly to the real, client-side codec distortions without introducing any learned codec parameters. On both static and dynamic benchmarks, CATRF consistently achieves a better rate-distortion trade-off over codec-agnostic and learned-codec-in-the-loop baselines, and also outperforms recent compressed 3DGS methods in both compression efficiency and decoding speed. These results highlight a practical path toward low-bitrate, compression-resilient volumetric representations for free-viewpoint video streaming.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CATRF, a standard-codec-in-the-loop compression method for triplane radiance fields. Feature planes are quantized and packed into codec-friendly canvases, passed through a non-differentiable roundtrip of JPEG/VP9/HEVC/AV1, unpacked and dequantized, then used for volume rendering. A straight-through estimator enables end-to-end training so that the radiance-field features adapt to the actual client-side codec distortions. The authors report consistent rate-distortion gains over codec-agnostic and learned-codec baselines on both static and dynamic benchmarks, plus better compression efficiency and decoding speed than recent compressed 3DGS approaches.

Significance. If the adaptation mechanism proves robust, the work provides a practical route to low-bitrate volumetric delivery that exploits mature, hardware-supported codecs rather than requiring learned compression modules. The reported gains in rate-distortion trade-off and decoding speed on both static and dynamic scenes would be relevant for free-viewpoint video streaming applications.

major comments (2)
  1. [§3] §3 (Method), around the STE insertion: the manuscript does not supply sufficient detail on the precise quantization thresholds, canvas packing layout, or the exact STE implementation (e.g., whether gradient clipping or scaling is applied). These choices directly affect whether the identity-gradient approximation remains informative for the highly nonlinear rate-distortion behavior of HEVC/AV1; without them the central claim that features adapt to real codec distortions cannot be fully evaluated.
  2. [§4] §4 (Experiments), rate-distortion curves and tables: the paper should include an ablation that trains the same triplanes codec-agnostically and then applies the identical quantization/packing/codec roundtrip at test time. If the reported gains largely disappear in this setting, the adaptation benefit attributed to STE training would be undermined.
minor comments (2)
  1. [Figure 3] Figure 3 (canvas packing illustration): the diagram would benefit from explicit annotation of the quantization step sizes and the exact spatial arrangement used for each codec.
  2. [Related Work] Related-work section: the comparison to compressed 3DGS methods should cite the specific decoding-time measurements (e.g., FPS on the same hardware) to support the speed claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of reproducibility and experimental validation. We address each point below and have updated the manuscript to incorporate the suggested clarifications and additional analysis.

  1. Referee: [§3] §3 (Method), around the STE insertion: the manuscript does not supply sufficient detail on the precise quantization thresholds, canvas packing layout, or the exact STE implementation (e.g., whether gradient clipping or scaling is applied). These choices directly affect whether the identity-gradient approximation remains informative for the highly nonlinear rate-distortion behavior of HEVC/AV1; without them the central claim that features adapt to real codec distortions cannot be fully evaluated.

    Authors: We agree that more precise implementation details are required for full reproducibility and to allow readers to assess the STE approximation under nonlinear codecs. In the revised manuscript we have expanded §3 with the following: uniform quantization uses a fixed step size of 1/255 on normalized feature values in [0,1]; the canvas packing layout tiles the three feature planes (each 256×256×C) into a single 512×768 canvas with explicit row/column offsets and zero-padding to codec-friendly dimensions; the STE employs the identity function in the forward pass with straight-through gradient (no clipping or scaling applied, as empirical tests showed stable convergence). We also add a short discussion of why the identity approximation remains informative despite codec nonlinearity, supported by gradient-norm statistics collected during training. revision: yes

  2. Referee: [§4] §4 (Experiments), rate-distortion curves and tables: the paper should include an ablation that trains the same triplanes codec-agnostically and then applies the identical quantization/packing/codec roundtrip at test time. If the reported gains largely disappear in this setting, the adaptation benefit attributed to STE training would be undermined.

    Authors: We appreciate this suggestion for isolating the contribution of in-loop adaptation. We have added the requested ablation to §4: the identical triplane architecture is trained without any codec in the loop (codec-agnostic baseline) and then subjected to the exact same quantization, canvas packing, and codec roundtrip (JPEG/VP9/HEVC/AV1) at test time. The new results, presented in an additional row of Table 2 and as dashed curves in Figure 4, show a consistent drop in rate-distortion performance relative to CATRF (average BD-rate increase of 18–27% across codecs). This confirms that the observed gains are attributable to feature adaptation during STE training rather than to the quantization/packing procedure alone. revision: yes
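The quantize, pack, unpack, and dequantize steps described in the first response can be sketched as follows. This is a hedged illustration, not the paper's code: it uses the 1/255 step quoted in the simulated rebuttal, but the row-major tile layout, the tiny 2×2 planes, and the function names `quantize`, `dequantize`, `pack`, and `unpack` are all assumptions made for the example. In a real pipeline the codec roundtrip would sit between `pack` and `unpack`.

```python
def quantize(plane, step=1.0 / 255):
    """Map floats in [0, 1] to integer codes 0..255 with a uniform step."""
    return [[int(round(min(1.0, max(0.0, v)) / step)) for v in row] for row in plane]

def dequantize(codes, step=1.0 / 255):
    return [[c * step for c in row] for row in codes]

def pack(tiles, cols, canvas_h, canvas_w):
    """Place equally sized tiles row-major on a zero-padded canvas; return canvas and offsets."""
    th, tw = len(tiles[0]), len(tiles[0][0])
    canvas = [[0] * canvas_w for _ in range(canvas_h)]
    offsets = []
    for i, tile in enumerate(tiles):
        top, left = (i // cols) * th, (i % cols) * tw
        if top + th > canvas_h or left + tw > canvas_w:
            raise ValueError("canvas too small for this layout")
        for r in range(th):
            canvas[top + r][left:left + tw] = tile[r]
        offsets.append((top, left))
    return canvas, offsets

def unpack(canvas, offsets, th, tw):
    """Cut the tiles back out of the canvas using the stored offsets."""
    return [[canvas[top + r][left:left + tw] for r in range(th)] for top, left in offsets]

# Three toy 2x2 single-channel planes (hypothetical values).
planes = [
    [[0.0, 0.5], [1.0, 0.25]],
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.9, 0.8], [0.7, 0.6]],
]
codes = [quantize(p) for p in planes]
canvas, offsets = pack(codes, cols=2, canvas_h=4, canvas_w=6)  # last two columns are padding
restored = [dequantize(t) for t in unpack(canvas, offsets, 2, 2)]
print(offsets)
```

Without a codec in between, the only error in `restored` is the quantization error, bounded by half a step. Any extra distortion in a real run would come from the codec itself.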

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents CATRF as an empirical compression framework that inserts a standard codec roundtrip into the training loop via straight-through estimator. All reported gains are obtained from direct comparisons against external codec-agnostic baselines, learned-codec baselines, and compressed 3DGS methods on static and dynamic benchmarks. No equations, fitted parameters, or self-citations are shown to reduce the claimed rate-distortion improvements to quantities defined on the same test data or to prior results by the same authors. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities beyond the standard assumptions of radiance-field training and the use of a straight-through estimator.

pith-pipeline@v0.9.0 · 5755 in / 1173 out tokens · 32276 ms · 2026-05-20T00:32:02.959870+00:00 · methodology

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 1 internal anchor

  1. [1]

    End-to-end optimized image compression

    Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. End-to-end optimized image compression. In ICLR, 2017. 3

  2. [2]

    Variational image compression with a scale hyperprior

    Johannes Ballé, David Minnen, Saurabh Singh, Sung Jin Hwang Johnston, and Eero P. Simoncelli. Variational image compression with a scale hyperprior. IEEE Transactions on Image Processing, 2018. 3

  3. [3]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013. 3, 6, 4

  4. [4]

    Low latency live streaming implementation in dash and hls

    Abdelhak Bentaleb, Zhengdao Zhan, Farzad Tashtarian, May Lim, Saad Harous, Christian Timmerer, Hermann Hellwagner, and Roger Zimmermann. Low latency live streaming implementation in dash and hls. In Proceedings of the 30th ACM International Conference on Multimedia, pages 7343–7346, 2022. 1

  5. [5]

    Abdelhak Bentaleb, May Lim, Sarra Hammoudi, Saad Harous, and Roger Zimmermann. Solutions, challenges, and opportunities in volumetric video streaming: an architectural perspective. ACM Transactions on Multimedia Computing, Communications and Applications, 21(7):1–35, 2025. 1

  6. [6]

    Calculation of average psnr differences between rd-curves

    Gisle Bjontegaard. Calculation of average psnr differences between rd-curves. ITU-T SG16, Doc. VCEG-M33, 2001. 7

  7. [7]

    Proxylessnas: Direct neural architecture search on target task and hardware

    Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Direct neural architecture search on target task and hardware. In ICLR, 2019. 3

  8. [8]

    Efficient geometry-aware 3d generative adversarial networks

    Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16123–16133, 2022. 1, 3

  9. [9]

    Tensorf: Tensorial radiance fields

    Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European conference on computer vision, pages 333–350. Springer, 2022. 1, 3, 6, 7, 8, 5

  10. [10]

    How far can we compress instant-ngp-based nerf?

    Yihang Chen, Qianyi Wu, Mehrtash Harandi, and Jianfei Cai. How far can we compress instant-ngp-based nerf? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20321–20330, 2024. 2, 3, 4, 6, 7, 8, 1, 5

  11. [11]

    Hac: Hash-grid assisted context for 3d gaussian splatting compression

    Yihang Chen, Qianyi Wu, Weiyao Lin, Mehrtash Harandi, and Jianfei Cai. Hac: Hash-grid assisted context for 3d gaussian splatting compression. In European Conference on Computer Vision, pages 422–438. Springer, 2024. 1, 3

  12. [12]

    Hac++: Towards 100x compression of 3d gaussian splatting

    Yihang Chen, Qianyi Wu, Weiyao Lin, Mehrtash Harandi, and Jianfei Cai. Hac++: Towards 100x compression of 3d gaussian splatting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. 2, 3, 4, 7, 8, 5

  13. [13]

    High-quality streamable free-viewpoint video

    Alvaro Collet, Ming Chuang, Pat Sweeney, Don Gillett, Dennis Evseev, David Calabrese, Hugues Hoppe, Adam Kirk, and Steve Sullivan. High-quality streamable free-viewpoint video. ACM Transactions on Graphics (ToG), 34(4):1–13,

  14. [14]

    Binaryconnect: Training deep neural networks with binary weights during propagations

    Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NeurIPS, 2015. 3

  15. [15]

    Plenoxels: Radiance fields without neural networks

    Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5501–5510, 2022. 3, 6

  16. [16]

    K-planes: Explicit radiance fields in space, time, and appearance

    Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12479–12488, 2023. 3, 6, 7

  17. [17]

    Queen: Quantized efficient encoding of dynamic gaussians for streaming free-viewpoint videos

    Sharath Girish, Tianye Li, Amrita Mazumdar, Abhinav Shrivastava, Shalini De Mello, et al. Queen: Quantized efficient encoding of dynamic gaussians for streaming free-viewpoint videos. Advances in Neural Information Processing Systems, 37:43435–43467, 2024. 3, 7

  18. [18]

    Danillo Graziosi, Ohji Nakagami, Satoru Kuma, Alexandre Zaghetto, Teruhiko Suzuki, and Ali Tabatabai. An overview of ongoing point cloud compression standardization activities: Video-based (v-pcc) and geometry-based (g-pcc). APSIPA Transactions on Signal and Information Processing, 9:e13, 2020. 2

  19. [19]

    3dgen: Triplane latent diffusion for textured mesh generation

    Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 3dgen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371, 2023. 3

  20. [20]

    Vrvvc: Variable-rate nerf-based volumetric video compression

    Qiang Hu, Houqiang Zhong, Zihan Zheng, Xiaoyun Zhang, Zhengxue Cheng, Li Song, Guangtao Zhai, and Yanfeng Wang. Vrvvc: Variable-rate nerf-based volumetric video compression. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3563–3571, 2025. 2, 3, 6

  21. [21]

    Sandwiched video compression: Efficiently extending the reach of standard codecs with neural wrappers

    Berivan Isik, Onur G Guleryuz, Danhang Tang, Jonathan Taylor, and Philip A Chou. Sandwiched video compression: Efficiently extending the reach of standard codecs with neural wrappers. In 2023 IEEE International Conference on Image Processing (ICIP), pages 2055–2059. IEEE, 2023. 2, 3

  22. [22]

    Towards practical real-time neural video compression

    Zhaoyang Jia, Bin Li, Jiahao Li, Wenxuan Xie, Linfeng Qi, Houqiang Li, and Yan Lu. Towards practical real-time neural video compression. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12543–12552,

  23. [23]

    From capture to display: A survey on volumetric video

    Yili Jin, Kaiyuan Hu, Junhua Liu, Fangxin Wang, and Xue Liu. From capture to display: A survey on volumetric video. arXiv preprint arXiv:2309.05658, 2023. 1

  24. [24]

    Codecnerf: Toward fast encoding and decoding, compact, and high-quality novel-view synthesis

    Gyeongjin Kang, Younggeun Lee, Seungjun Oh, and Eunbyung Park. Codecnerf: Toward fast encoding and decoding, compact, and high-quality novel-view synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4203–4211, 2025. 2

  25. [25]

    Plenoptic png: Real-time neural radiance fields in 150 kb

    Jae Yong Lee, Yuqun Wu, Chuhang Zou, Derek Hoiem, and Shenlong Wang. Plenoptic png: Real-time neural radiance fields in 150 kb. In 2025 International Conference on 3D Vision (3DV), pages 502–511. IEEE, 2025. 7

  26. [26]

    Ecrf: Entropy-constrained neural radiance fields compression with frequency domain optimization

    Soonbin Lee, Fangwen Shu, Yago Sanchez, Thomas Schierl, and Cornelius Hellge. Ecrf: Entropy-constrained neural radiance fields compression with frequency domain optimization. In 2024 IEEE 26th International Workshop on Multimedia Signal Processing (MMSP), pages 1–6. IEEE, 2024. 2, 6, 5

  27. [27]

    Compression of 3d gaussian splatting with optimized feature planes and standard video codecs

    Soonbin Lee, Fangwen Shu, Yago Sanchez, Thomas Schierl, and Cornelius Hellge. Compression of 3d gaussian splatting with optimized feature planes and standard video codecs. arXiv preprint arXiv:2501.03399, 2025. 2, 3

  28. [28]

    Gifstream: 4d gaussian-based immersive video with feature stream

    Hao Li, Sicheng Li, Xiang Gao, Abudouaihati Batuer, Lu Yu, and Yiyi Liao. Gifstream: 4d gaussian-based immersive video with feature stream. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21761–21770, 2025. 2, 3, 5, 7, 8

  29. [29]

    Neural video compression with feature modulation

    Jiahao Li, Bin Li, and Yan Lu. Neural video compression with feature modulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26099–26108, 2024. 2, 3

  30. [30]

    Streaming radiance fields for 3d video synthesis

    Lingzhi Li, Zhen Shen, Zhongshu Wang, Li Shen, and Ping Tan. Streaming radiance fields for 3d video synthesis. Advances in Neural Information Processing Systems, 35:13485–13498, 2022. 7

  31. [31]

    Instant3d: Instant text-to-3d generation

    Ming Li, Pan Zhou, Jia-Wei Liu, Jussi Keppo, Min Lin, Shuicheng Yan, and Xiangyu Xu. Instant3d: Instant text-to-3d generation. International Journal of Computer Vision, 132(10):4456–4472, 2024. 3

  32. [32]

    Nerfcodec: Neural feature compression meets neural radiance fields for memory-efficient scene representation

    Sicheng Li, Hao Li, Yiyi Liao, and Lu Yu. Nerfcodec: Neural feature compression meets neural radiance fields for memory-efficient scene representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21274–21283, 2024. 2, 3, 6, 7, 8, 1, 5

  33. [33]

    Neural 3d video synthesis from multi-view video

    Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3d video synthesis from multi-view video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5521–5531, 2022. 6

  34. [34]

    Spacetime gaussian feature splatting for real-time dynamic view synthesis

    Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. Spacetime gaussian feature splatting for real-time dynamic view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8508–8520, 2024. 3, 7

  35. [35]

    Neural sparse voxel fields

    Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, Tat-Seng Chua, and Christian Theobalt. Neural sparse voxel fields. Advances in Neural Information Processing Systems, 33:15651–15663,

  36. [36]

    Efficient evaluation of quantization-effects in neural codecs

    Wolfgang Mack, Ahmed Mustafa, Rafał Łaganowski, and Samer Hijazy. Efficient evaluation of quantization-effects in neural codecs. arXiv preprint arXiv:2502.04770, 2025. 6, 3, 4

  37. [37]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021. 1, 6, 5

  38. [38]

    Instant neural graphics primitives with a multiresolution hash encoding

    Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics (TOG), 41(4):1–15, 2022. 3, 6

  39. [39]

    Compressed 3d gaussian splatting for accelerated novel view synthesis

    Simon Niedermayr, Josef Stumpfegger, and Rüdiger Westermann. Compressed 3d gaussian splatting for accelerated novel view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10349–10358, 2024. 1

  40. [40]

    Holoportation: Virtual 3d teleportation in real-time

    Sergio Orts-Escolano, Christoph Rhemann, Sean Fanello, Wayne Chang, Adarsh Kowdle, Yury Degtyarev, David Kim, Philip L Davidson, Sameh Khamis, Mingsong Dou, et al. Holoportation: Virtual 3d teleportation in real-time. In Proceedings of the 29th annual symposium on user interface software and technology, pages 741–754, 2016. 1

  41. [41]

    Differentiable signal processing with black-box audio effects

    Marco A Martínez Ramírez, Oliver Wang, Paris Smaragdis, and Nicholas J Bryan. Differentiable signal processing with black-box audio effects. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 66–70. IEEE, 2021. 3

  42. [42]

    Adaptive bitrate selection: A survey

    Yusuf Sani, Andreas Mauthe, and Christopher Edwards. Adaptive bitrate selection: A survey. IEEE Communications Surveys & Tutorials, 19(4):2985–3014, 2017. 2

  43. [43]

    Swings: sliding windows for dynamic 3d gaussian splatting

    Richard Shaw, Michal Nazarczuk, Jifei Song, Arthur Moreau, Sibi Catley-Chandar, Helisa Dhamo, and Eduardo Pérez-Pellitero. Swings: sliding windows for dynamic 3d gaussian splatting. In European Conference on Computer Vision, pages 37–54. Springer, 2024. 1

  44. [44]

    Binary radiance fields

    Seungjoo Shin and Jaesik Park. Binary radiance fields. Advances in neural information processing systems, 36:55919–55931, 2023. 3, 6

  45. [45]

    3d neural field generation using triplane diffusion

    J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field generation using triplane diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20875–20886, 2023. 3

  46. [46]

    The mpeg-dash standard for multimedia streaming over the internet

    Iraj Sodagar. The mpeg-dash standard for multimedia streaming over the internet. IEEE multimedia, 18(4):62–67,

  47. [47]

    Nerfplayer: A streamable dynamic scene representation with decomposed neural radiance fields

    Liangchen Song, Anpei Chen, Zhong Li, Zhang Chen, Lele Chen, Junsong Yuan, Yi Xu, and Andreas Geiger. Nerfplayer: A streamable dynamic scene representation with decomposed neural radiance fields. IEEE Transactions on Visualization and Computer Graphics, 29(5):2732–2742, 2023. 6, 7

  48. [48]

    An overview of the simultaneous perturbation method for efficient optimization

    James C Spall. An overview of the simultaneous perturbation method for efficient optimization. Johns Hopkins apl technical digest, 19(4):482–492, 1998. 6, 3, 4

  49. [49]

    Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction

    Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5459–5469, 2022. 3, 6, 7, 8

  50. [50]

    3dgstream: On-the-fly training of 3d gaussians for efficient streaming of photo-realistic free-viewpoint videos

    Jiakai Sun, Han Jiao, Guangyuan Li, Zhanjie Zhang, Lei Zhao, and Wei Xing. 3dgstream: On-the-fly training of 3d gaussians for efficient streaming of photo-realistic free-viewpoint videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20675–20685, 2024. 3, 7

  51. [51]

    Videorf: Rendering dynamic radiance fields as 2d feature video streams

    Liao Wang, Kaixin Yao, Chengcheng Guo, Zhirui Zhang, Qiang Hu, Jingyi Yu, Lan Xu, and Minye Wu. Videorf: Rendering dynamic radiance fields as 2d feature video streams. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 470–481, 2024. 2

  52. [52]

    V^3: Viewing volumetric videos on mobiles via streamable 2d dynamic gaussians

    Penghao Wang, Zhirui Zhang, Liao Wang, Kaixin Yao, Siyuan Xie, Jingyi Yu, Minye Wu, and Lan Xu. V^3: Viewing volumetric videos on mobiles via streamable 2d dynamic gaussians. ACM Transactions on Graphics (TOG), 43(6):1–13, 2024. 2

  53. [53]

    Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20310–20320, 2024.

  54. [54]

    Minye Wu, Yuehao Wang, Qiang Hu, and Jingyi Yu. Multi-view neural human rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1682–1691, 2020.

  55. [55]

    Minye Wu, Zehao Wang, Georgios Kouros, and Tinne Tuytelaars. Tetrirf: Temporal tri-plane radiance fields for efficient free-viewpoint video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6487–6496, 2024.

  56. [56]

    Ningfeng Yang and Tor M Aamodt. Improving the straight-through estimator with zeroth-order information. arXiv preprint arXiv:2510.23926, 2025.

  57. [57]

    Hyunho Yeo, Youngmok Jung, Jaehong Kim, Jinwoo Shin, and Dongsu Han. Neural adaptive content-aware internet video delivery. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 645–661, 2018.

  58. [58]

    Hyunho Yeo, Chan Ju Chong, Youngmok Jung, Juncheol Ye, and Dongsu Han. Nemo: enabling neural-enhanced video streaming on commodity mobile devices. In Proceedings of the 26th Annual International Conference on Mobile Computing and Networking, pages 1–14, 2020.

  59. [59]

    Zhiyu Zhang, Guo Lu, Huanxiong Liang, Zhengxue Cheng, Anni Tang, and Li Song. Rate-aware compression for nerf-based volumetric video. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 3974–3983,

  60. [60]

    Zihan Zheng, Houqiang Zhong, Qiang Hu, Xiaoyun Zhang, Li Song, Ya Zhang, and Yanfeng Wang. Jointrf: end-to-end joint optimization for dynamic neural radiance field representation and compression. In 2024 IEEE International Conference on Image Processing (ICIP), pages 3292–3298. IEEE, 2024.

  1. Sample a frame index $t$, rays $\{(o, d)\}$, and camera pose $\pi$.

  2. If the cache is empty or $g - g_{\text{cache}} \ge M$, mark refresh as true.

  3. If the relative change between the current $(P^{ax}_t, D_t)$ and their cached snapshots exceeds a threshold $\epsilon$, also mark refresh as true.

  4. If refresh is true, run the encode–decode codec round trip (as illustrated in Fig. 2) for the video segment.

  5. STE substitution using cached reconstructions:
     (a) For each axis $ax \in \{xy, xz, yz\}$: $\hat{P}^{ax}_t \leftarrow$ cached decoded plane, $\tilde{P}^{ax}_t \leftarrow \hat{P}^{ax}_t + P^{ax}_t - \operatorname{detach}(P^{ax}_t)$.
     (b) For density: $\hat{D}_t \leftarrow$ cached decoded density, $\tilde{D}_t \leftarrow \hat{D}_t + D_t - \operatorname{detach}(D_t)$.

  6. Render and compute losses: $I \leftarrow R(\tilde{P}_t, \tilde{D}_t, \pi; \phi)$.

    Tab. 4 shows that increasing the refresh interval $M$ yields a favorable trade-off between accuracy and training efficiency. It suggests that relatively infrequent cache updates (e.g., $M = 128$) already capture most of the benefit of SCL training, while keeping the overhead of expensive codec round trips manageable...
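
    The refresh rule in the steps above (a periodic refresh every M iterations plus a change-triggered refresh) can be sketched in plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: `CodecCache`, `relative_change`, and `toy_codec` are hypothetical names, g is assumed to be the training iteration counter, the relative change is assumed to be an L2 ratio (the excerpt does not specify the metric), and a rounding quantizer stands in for the real encode–decode round trip. The STE substitution itself needs an autograd framework and is left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Plane = List[float]  # flattened feature values; a stand-in for a real tensor


def relative_change(current: Plane, snapshot: Plane, tiny: float = 1e-12) -> float:
    """L2 norm of the difference relative to the snapshot's L2 norm (assumed metric)."""
    num = sum((c - s) ** 2 for c, s in zip(current, snapshot)) ** 0.5
    den = sum(s ** 2 for s in snapshot) ** 0.5
    return num / (den + tiny)


@dataclass
class CodecCache:
    refresh_interval: int                    # M in the text
    threshold: float                         # epsilon in the text
    codec: Callable[[Plane], Plane]          # hypothetical encode-decode round trip
    last_refresh_step: Optional[int] = None  # g_cache in the text
    snapshot: Optional[Plane] = None         # features at the last refresh
    decoded: Optional[Plane] = None          # cached codec reconstruction
    num_round_trips: int = 0                 # how often the codec was actually run

    def needs_refresh(self, step: int, features: Plane) -> bool:
        if self.decoded is None:  # cache is empty
            return True
        if step - self.last_refresh_step >= self.refresh_interval:
            return True  # periodic refresh: g - g_cache >= M
        # change-triggered refresh: features drifted too far from the snapshot
        return relative_change(features, self.snapshot) > self.threshold

    def lookup(self, step: int, features: Plane) -> Plane:
        """Return the cached reconstruction, running the codec only when needed."""
        if self.needs_refresh(step, features):
            self.decoded = self.codec(features)
            self.snapshot = list(features)
            self.last_refresh_step = step
            self.num_round_trips += 1
        return self.decoded


def toy_codec(plane: Plane) -> Plane:
    """Stand-in for a real codec: uniform quantization to steps of 1/8."""
    return [round(v * 8) / 8 for v in plane]


# Slowly drifting features: only the periodic rule fires.
cache = CodecCache(refresh_interval=128, threshold=0.05, codec=toy_codec)
features = [0.30, -0.72, 1.10]
for step in range(512):
    decoded = cache.lookup(step, features)
    features = [v + 1e-5 for v in features]  # mimic a small optimizer update

# A large jump between consecutive steps: the change-triggered rule fires.
jumpy = CodecCache(refresh_interval=128, threshold=0.05, codec=toy_codec)
jumpy.lookup(0, [1.0, 0.0])
jumpy.lookup(1, [2.0, 0.0])
```

    In this toy run the slowly drifting features trigger the codec only at the periodic refresh points, while the abrupt jump forces an immediate refresh. Every other iteration reuses the cached reconstruction, which is where the training-time saving from a larger M would come from.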