GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance

Hujun Bao; Jiale Shi; Jiarui Hu; Kaixuan Luan; Zesong Yang; Zhaopeng Cui

arxiv: 2605.18252 · v1 · pith:F6EIV2ETnew · submitted 2026-05-18 · 💻 cs.CV

Jiale Shi, Jiarui Hu, Zesong Yang, Kaixuan Luan, Hujun Bao, Zhaopeng Cui

Pith reviewed 2026-05-20 10:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D Gaussian Splatting · Zoom-in Reconstruction · Super-Resolution · Level-of-Detail · Generative 3D Modeling · Multi-view Consistency · Semantic Guidance

The pith

GaussianZoom enables high-fidelity extreme zoom-in rendering of 3D scenes from low-resolution inputs using progressive Gaussian splatting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GaussianZoom as a system for generating detailed close-up views of 3D scenes that exceed the resolution of the original input images. It builds an iterative process that refines the scene model step by step, combining geometric consistency with semantic understanding to add plausible fine details. A dedicated super-resolution step uses depth information to align features across views and vision-language guidance to synthesize new textures. A continuous level-of-detail structure keeps the representation efficient and smooth as the magnification increases. If the approach holds, it would let users explore reconstructed environments at arbitrary scales without needing higher-resolution source data.

Core claim

GaussianZoom is an iterative progressive framework for generative zoom-in 3D reconstruction that integrates geometry-consistent scene modeling and multi-scale semantic reasoning. It introduces a multi-view consistent super-resolution module that applies depth-based feature warping and VLM-driven detail synthesis to enrich appearance beyond the observed resolution while preserving correspondence. An expandable continuous Level-of-Detail hierarchy dynamically adjusts Gaussian visibility to support alias-free rendering across large magnification ranges. On Mip-NeRF360 and Tanks&Temples, the method reports better perceptual quality, multi-view consistency, and stability under extreme zoom.

What carries the argument

The multi-view consistent super-resolution module with depth-based feature warping and VLM-driven detail synthesis that enriches fine-scale appearance while keeping cross-view alignment, together with the expandable continuous Level-of-Detail hierarchy that modulates Gaussian visibility for smooth scaling.
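
As a rough illustration of what depth-based warping means in general, the sketch below reprojects every pixel of a source view into a target view using a depth map, pinhole intrinsics, and a relative camera pose. It is not the paper's module (which warps features and is not specified in the text available here); the function name `warp_pixels_by_depth`, the shared intrinsics `K`, and the pose convention `(R, t)` are all assumptions made for the example.

```python
import numpy as np

def warp_pixels_by_depth(depth, K, R, t):
    """Map every source pixel to target-view pixel coordinates.

    depth: (H, W) positive depth along the source camera's z-axis.
    K:     (3, 3) pinhole intrinsics, assumed shared by both views.
    R, t:  rotation (3, 3) and translation (3,) taking source-camera
           coordinates to target-camera coordinates (assumed convention).
    Returns an (H, W, 2) array of (u, v) coordinates in the target view.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # u: column, v: row
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    pix = pix.reshape(-1, 3).astype(np.float64)      # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                  # back-project to z = 1
    pts_src = rays * depth.reshape(-1, 1)            # 3D points, source frame
    pts_tgt = pts_src @ R.T + t                      # 3D points, target frame
    proj = pts_tgt @ K.T                             # apply intrinsics
    uv = proj[:, :2] / proj[:, 2:3]                  # perspective divide
    return uv.reshape(H, W, 2)
```

With an identity pose the function returns the original pixel grid; with a pure sideways translation and constant depth, every pixel shifts by focal length times translation divided by depth, which is the usual parallax relation. Errors in the depth map translate directly into misplaced correspondences, which is why the review treats depth quality as part of the load-bearing premise.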

If this is right

Achieves higher perceptual quality in zoomed renderings compared with prior 3D Gaussian methods.
Preserves multi-view consistency even when magnification exceeds the input resolution by large factors.
Remains stable without aliasing or popping when the viewer moves continuously across wide scale ranges.
Provides a working baseline that later methods for generative zoom-in reconstruction can improve upon.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same progressive refinement loop could be adapted to add user-specified details or correct specific regions after initial reconstruction.
Combining the level-of-detail hierarchy with real-time rendering engines might support interactive exploration in virtual environments from casual photo sets.
Extending the semantic guidance to handle time-varying scenes could open applications in video-based 3D zoom-in.

Load-bearing premise

The multi-view consistent super-resolution module with depth-based feature warping and VLM-driven detail synthesis can accurately enrich fine-scale appearance beyond the observed resolution while maintaining multi-view correspondence.

What would settle it

Ground-truth high-resolution images captured at extreme magnification on the same scenes showing that the synthesized details misalign across views or introduce visible artifacts not present in real data.
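
One simple way to quantify cross-view misalignment of this kind is a photometric reprojection error: look up, for each pixel in one rendered view, the pixel it should correspond to in another view and compare colors. The sketch below is a generic proxy under stated assumptions, not a metric taken from the paper; `cross_view_error` and its `uv` input (per-pixel correspondences, for instance derived from depth and camera poses) are hypothetical names.

```python
import numpy as np

def cross_view_error(img_src, img_tgt, uv):
    """Mean absolute difference between source pixels and the target
    pixels they land on.

    img_src, img_tgt: (H, W) or (H, W, C) arrays rendered from two views.
    uv: (H, W, 2) array giving, for each source pixel, its (u, v) location
        in the target view (hypothetical input).
    Occlusions and view-dependent shading are ignored, so this is only a
    rough proxy for multi-view consistency.
    """
    H, W = img_tgt.shape[:2]
    u = np.rint(uv[..., 0]).astype(int)   # nearest-neighbour lookup
    v = np.rint(uv[..., 1]).astype(int)
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    if not valid.any():
        return float("nan")               # no overlap between the views
    diff = np.abs(img_src[valid].astype(np.float64)
                  - img_tgt[v[valid], u[valid]].astype(np.float64))
    return float(diff.mean())
```

A low error only shows that two views agree with each other; it does not show that the synthesized detail is correct, which is why real high-resolution captures would still be needed to settle the question.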

Figures

Figures reproduced from arXiv: 2605.18252 by Hujun Bao, Jiale Shi, Jiarui Hu, Kaixuan Luan, Zesong Yang, Zhaopeng Cui.

**Figure 1.** GaussianZoom progressively magnifies 3D scenes from low-resolution inputs, reconstructing them into multi-view consistent and detail-rich representations. The expandable continuous Level-of-Detail hierarchy organizes primitives across scales, enabling smooth and alias-free rendering throughout the zoom-in process. Please refer to the supp. material for more vivid video demonstrations.

**Figure 2.** Comparison between flow-based and depth-based warp…

**Figure 3.** Method overview. Our framework jointly leverages geometry-aware alignment, semantic priors, and a continuous Level-of-Detail (LoD) representation to perform generative zoom-in reconstruction. Starting from a coarse 3D Gaussian Splatting model, we derive per-view depth maps that enable depth-based feature warping, providing accurate multi-view correspondence. In parallel, coarse and zoomed-in renderings are…

**Figure 4.** Qualitative comparison of 4× super-resolution results. Mip-Splatting reduces aliasing but lacks fine details; SuperGaussian, SRGS, and Sequence Matters produce blurry textures; our method reconstructs sharper textures, cleaner edges, and more coherent structures across views, closely approaching the ground truth. [Table fragment captured with this figure: columns Method, then PSNR↑, SSIM↑, LPIPS↓, FID↓ for each of Mip-NeRF360 and Tanks&Temples; first row begins "3DGS [10] 20…"]

**Figure 5.** Qualitative comparison under extreme zoom-in across multiple focal levels and viewpoints. Competing methods exhibit blurry, …

**Figure 6.** Effectiveness of VLM guidance in detail synthesis. With…

**Figure 7.** Effectiveness of continuous LoD. Without LoD, opti…

Original abstract

We introduce GaussianZoom, a generative zoom-in 3D reconstruction system with an iterative progressive framework that combines geometry-consistent scene modeling and multi-scale semantic reasoning to enable high-fidelity extreme zoom-in rendering from low-resolution inputs. To achieve this, we develop a novel multi-view consistent super-resolution module with depth-based feature warping and VLM-driven detail synthesis, ensuring accurate multi-view correspondence while enriching fine-scale appearance beyond the observed resolution. To support zooming across large magnification ranges, we further introduce a new expandable continuous Level-of-Detail hierarchy that dynamically modulates Gaussian visibility for smooth, alias-free cross-scale rendering. Experiments on Mip-NeRF360 and Tanks&Temples demonstrate that GaussianZoom achieves superior perceptual quality, multi-view consistency, and robustness under extreme magnification, establishing a strong baseline for generative zoom-in 3D scene reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GaussianZoom adds a progressive VLM-guided super-resolution step and continuous LoD hierarchy to 3D Gaussian splatting for extreme zoom-in, but the consistency claims rest on unproven detail synthesis.

GaussianZoom is a system that starts from low-resolution inputs and progressively builds higher-detail 3D Gaussian scenes for extreme zoom rendering. The core additions are a multi-view consistent super-resolution module that warps features by depth and uses a VLM to synthesize missing fine-scale appearance, plus an expandable continuous Level-of-Detail hierarchy that adjusts Gaussian visibility to avoid aliasing across scales. These pieces target a practical need in graphics and reconstruction where current methods lose coherence when you push magnification far beyond the input resolution. The experiments on Mip-NeRF360 and Tanks&Temples are presented as showing better perceptual quality and robustness than prior work.
That combination of iterative geometry-consistent modeling with semantic guidance from VLMs is the actual new element here. The paper does a reasonable job laying out the pipeline and explaining how the modules fit together to support large zoom ranges without obvious tearing or popping.
The soft spot is the reliance on VLM-driven synthesis for high-frequency content. Depth warping gives coarse alignment, but VLMs frequently produce plausible-looking details that do not match the underlying geometry or stay consistent across views once you go to 8x or 16x. Without ground-truth high-resolution data at those scales, it is easy for inconsistencies to slip through, and the abstract does not include the quantitative breakdowns or ablation numbers that would show how much the new modules actually move the needle versus tuning. If the full paper has solid metrics and failure-case analysis, that would strengthen the case.
This work is aimed at people already working on 3D Gaussian splatting or novel-view synthesis who need to handle low-res capture and then zoom in for detail. A reader looking for concrete implementation ideas around progressive refinement and LoD control could extract useful pieces even without adopting the whole system.
I would send it to peer review. The technical framing is clear enough and the problem is real, so referees can check the numbers and the consistency claims directly.

Referee Report

2 major / 2 minor

Summary. The paper introduces GaussianZoom, a progressive zoom-in generative 3D Gaussian Splatting framework that combines geometry-consistent scene modeling with multi-scale semantic reasoning. It proposes a multi-view consistent super-resolution module using depth-based feature warping and VLM-driven detail synthesis to enrich fine-scale appearance, plus an expandable continuous Level-of-Detail hierarchy for alias-free cross-scale rendering. Experiments on Mip-NeRF360 and Tanks&Temples are claimed to show superior perceptual quality, multi-view consistency, and robustness under extreme magnification.

Significance. If the central claims hold, the work would provide a useful baseline for generative zoom-in 3D reconstruction, particularly by integrating geometric guidance with VLM-based semantic detail synthesis and addressing multi-scale rendering via the proposed LoD hierarchy. This could advance applications in high-fidelity rendering from low-resolution inputs where extreme magnification is required.

major comments (2)

[Abstract] The manuscript asserts superior performance on Mip-NeRF360 and Tanks&Temples benchmarks with respect to perceptual quality, multi-view consistency, and robustness under extreme magnification, yet supplies no quantitative results, error bars, ablation studies, or specific metrics in the provided text. This absence directly weakens evaluation of the load-bearing claims.
[Multi-view consistent super-resolution module] The VLM-driven detail synthesis is presented as the mechanism for enriching fine-scale appearance beyond observed resolution while preserving correspondence via depth-based feature warping. However, depth warping supplies only coarse alignment and does not address potential semantic or textural hallucinations produced by VLMs at 8-16x zoom factors; without explicit consistency metrics or ground-truth high-frequency validation, this undermines the multi-view consistency and robustness assertions.

minor comments (2)

The description of the expandable continuous Level-of-Detail hierarchy would benefit from a clearer statement of how Gaussian visibility is modulated and any associated computational overhead.
Notation for the progressive iterative framework could be standardized earlier to improve readability of the geometric and semantic guidance components.
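
To make the first minor comment concrete, here is one generic way a continuous Level-of-Detail scheme could modulate visibility: each Gaussian carries a level, and its opacity is scaled by a smooth weight that ramps from 0 to 1 as the zoom approaches that level. This is a hypothetical scheme written for illustration; the abstract does not state the paper's actual modulation rule, and `lod_visibility`, the integer levels, and the log2 zoom scale are all assumptions.

```python
import numpy as np

def lod_visibility(gaussian_levels, magnification):
    """Per-Gaussian opacity multiplier in [0, 1].

    Assumption (hypothetical scheme): a Gaussian at level l fades in while
    log2(magnification) rises from l - 1 to l, so finer primitives appear
    gradually instead of popping. magnification must be positive.
    """
    z = np.log2(magnification)                       # continuous zoom level
    levels = np.asarray(gaussian_levels, dtype=np.float64)
    x = np.clip(z - levels + 1.0, 0.0, 1.0)          # linear ramp per level
    return x * x * (3.0 - 2.0 * x)                   # smoothstep easing

# Example: four Gaussians at levels 0..3, viewed at about 2.83x zoom.
weights = lod_visibility([0, 1, 2, 3], 2 ** 1.5)
```

The overhead the referee asks about would, in a scheme like this, be one scalar evaluation per Gaussian per frame; whether the paper's hierarchy costs more depends on details not available here.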

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications from the full manuscript and indicating revisions where they strengthen the presentation of results and technical details.

Referee: [Abstract] The manuscript asserts superior performance on Mip-NeRF360 and Tanks&Temples benchmarks with respect to perceptual quality, multi-view consistency, and robustness under extreme magnification, yet supplies no quantitative results, error bars, ablation studies, or specific metrics in the provided text. This absence directly weakens evaluation of the load-bearing claims.

Authors: The abstract serves as a concise summary and conventionally omits specific numerical values, which are instead reported in full in the Experiments section. There we present quantitative comparisons on Mip-NeRF360 and Tanks&Temples using standard metrics (PSNR, SSIM, LPIPS) for perceptual quality, dedicated multi-view consistency scores, and robustness measures under extreme magnification, together with ablation studies and error bars on the reported plots. To make the abstract claims more self-contained, we will revise it to briefly reference these quantitative evaluations and direct readers to the detailed tables and figures. revision: yes
Referee: [Multi-view consistent super-resolution module] The VLM-driven detail synthesis is presented as the mechanism for enriching fine-scale appearance beyond observed resolution while preserving correspondence via depth-based feature warping. However, depth warping supplies only coarse alignment and does not address potential semantic or textural hallucinations produced by VLMs at 8-16x zoom factors; without explicit consistency metrics or ground-truth high-frequency validation, this undermines the multi-view consistency and robustness assertions.

Authors: We agree that depth-based warping alone supplies only coarse geometric alignment. Multi-view consistency in our framework is additionally enforced by the geometry-consistent scene modeling, iterative progressive optimization across views, and the continuous LoD hierarchy. We evaluate this using explicit multi-view consistency metrics (cross-view feature similarity and perceptual consistency scores) reported in the experiments. The VLM synthesis is constrained by both geometric and semantic guidance to limit hallucinations. We acknowledge that ground-truth high-frequency references at 8-16x magnification are unavailable in the benchmarks, making direct validation difficult; we therefore rely on perceptual user studies and cross-method comparisons. We will expand the manuscript with a dedicated paragraph on the consistency metrics and a limitations discussion of potential VLM-induced artifacts. revision: partial
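
For readers less familiar with the fidelity metrics named in the rebuttal, PSNR is the simplest: it is 10·log10(MAX² / MSE), where MSE is the mean squared error against a reference image and MAX is the largest possible pixel value. The sketch below is a plain NumPy version; the function name and the default `max_value=1.0` (images scaled to [0, 1]) are assumptions for the example, not details from the paper.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)   # mean squared error
    if mse == 0:
        return float("inf")                      # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Because PSNR needs a pixel-aligned reference, it can score the 4× setting where held-out high-resolution images exist, but it cannot say anything about 8-16x zoom where, as the rebuttal concedes, no ground truth is available.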

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents a novel technical framework combining progressive Gaussian Splatting, depth-based feature warping, VLM-driven detail synthesis, and a continuous LoD hierarchy for zoom-in rendering. All load-bearing components are introduced as new modules whose behavior is defined by explicit algorithmic choices rather than by fitting parameters to the target metrics or by reducing to self-citations. Experiments on external datasets (Mip-NeRF360, Tanks&Temples) provide independent evaluation; no derivation step equates a claimed prediction or uniqueness result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based solely on the abstract; no explicit free parameters, axioms, or invented entities are detailed in the available text.

pith-pipeline@v0.9.0 · 5691 in / 1170 out tokens · 37467 ms · 2026-05-20T10:42:05.367582+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: multi-view consistent super-resolution module with depth-based feature warping and VLM-driven detail synthesis... expandable continuous Level-of-Detail hierarchy that dynamically modulates Gaussian visibility

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 3 internal anchors

[1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
[2] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022.
[3] Kelvin CK Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Basicvsr: The search for essential components in video super-resolution and beyond. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4947–4956, 2021.
[4] Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Basicvsr++: Improving video super-resolution with enhanced propagation and alignment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5972–5981, 2022.
[5] Yi-Ting Chen, Ting-Hsuan Liao, Pengsheng Guo, Alexander Schwing, and Jia-Bin Huang. Bridging diffusion models and 3d representations: A 3d consistent super-resolution framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13481–13490, 2025.
[6] Xiang Feng, Yongbo He, Yubo Wang, Yan Yang, Wen Li, Yifei Chen, Zhenzhong Kuang, Jianping Fan, Yu Jun, et al. Srgs: Super-resolution 3d gaussian splatting. arXiv preprint arXiv:2404.10318, 2024.
[7] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
[8] Quan Huynh-Thu and Mohammed Ghanbari. Scope of validity of psnr in image/video quality assessment. Electronics letters, 44(13):800–801, 2008.
[9] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021.
[10] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,
[11] Bernhard Kerbl, Andreas Meuleman, Georgios Kopanas, Michael Wimmer, Alexandre Lanvin, and George Drettakis. A hierarchical 3d gaussian representation for real-time rendering of very large datasets. ACM Transactions on Graphics (TOG), 43(4):1–15, 2024.
[12] Bryan Sangwoo Kim, Jeongsol Kim, and Jong Chul Ye. Chain-of-zoom: Extreme super-resolution via scale autoregression and preference alignment. arXiv preprint arXiv:2505.18600, 2025.
[13] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG), 36(4):1–13, 2017.
[14] Hyun-kyu Ko, Dongheok Park, Youngin Park, Byeonghyeon Lee, Juhee Han, and Eunbyung Park. Sequence matters: Harnessing video models in 3d super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4356–4364, 2025.
[15] Jonas Kulhanek, Marie-Julie Rakotosaona, Fabian Manhardt, Christina Tsalicoglou, Michael Niemeyer, Torsten Sattler, Songyou Peng, and Federico Tombari. Lodge: Level-of-detail large-scale gaussian splatting with efficient rendering. arXiv preprint arXiv:2505.23158, 2025.
[16] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690,
[17] Jie Long Lee, Chen Li, and Gim Hee Lee. Disr-nerf: Diffusion-guided view-consistent super-resolution nerf. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20561–20570, 2024.
[18] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1833–1844,
[19] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017.
[20] Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a “completely blind” image quality analyzer. IEEE Signal processing letters, 20(3):209–212, 2012.
[21] Anurag Ranjan and Michael J Black. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4161–4170, 2017.
[22] Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians. arXiv preprint arXiv:2403.17898, 2024.
[23] Yunji Seo, Young Sun Choi, Hyun Seung Son, and Youngjung Uh. Flod: Integrating flexible level of detail into 3d gaussian splatting for customizable rendering. arXiv preprint arXiv:2408.12894, 2024.
[24] Yuan Shen, Duygu Ceylan, Paul Guerrero, Zexiang Xu, Niloy J Mitra, Shenlong Wang, and Anna Frühstück. Supergaussian: Repurposing video models for 3d super resolution. In European Conference on Computer Vision, pages 215–233. Springer, 2024.
[25] Shuwei Shi, Jinjin Gu, Liangbin Xie, Xintao Wang, Yujiu Yang, and Chao Dong. Rethinking alignment in video super-resolution transformers. Advances in Neural Information Processing Systems, 35:36081–36093, 2022.
[26] Yujing Sun, Lingchen Sun, Shuaizheng Liu, Rongyuan Wu, Zhengqiang Zhang, and Lei Zhang. One-step diffusion for detail-rich and temporally consistent video super-resolution. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
[27] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
[28] Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI conference on artificial intelligence, pages 2555–2563, 2023.
[29] Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin CK Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution. International Journal of Computer Vision, 132(12):5929–5949, 2024.
[30] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European conference on computer vision (ECCV) workshops, pages 0–0, 2018.
[31] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.
[32] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1905–1914,
[33] Xiaojuan Wang, Janne Kontkanen, Brian Curless, Steven M Seitz, Ira Kemelmacher-Shlizerman, Ben Mildenhall, Pratul Srinivasan, Dor Verbin, and Aleksander Holynski. Generative powers of ten. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7173–7182, 2024.
[34] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[35] Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world image super-resolution. Advances in Neural Information Processing Systems, 37:92529–92553, 2024.
[36] Shiyun Xie, Zhiru Wang, Yinghao Zhu, and Chengwei Pan. Supergs: Super-resolution 3d gaussian splatting via latent feature field and gradient-guided splitting. arXiv preprint arXiv:2410.02571, 1, 2024.
[37] Yiran Xu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Jia-Bin Huang, and Difan Liu. Videogigagan: Towards detail-rich video super-resolution. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2139–2149, 2025.
[38] Xiqian Yu, Hanxin Zhu, Tianyu He, and Zhibo Chen. Gaussiansr: 3d gaussian super-resolution with 2d diffusion priors. arXiv preprint arXiv:2406.10111, 2024.
[39] Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19447–19456,
[40] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Resshift: Efficient diffusion model for image super-resolution by residual shifting. Advances in Neural Information Processing Systems, 36:13294–13307, 2023.
[41] Baowen Zhang, Chuan Fang, Rakesh Shrestha, Yixun Liang, Xiaoxiao Long, and Ping Tan. Rade-gs: Rasterizing depth in gaussian splatting. arXiv preprint arXiv:2406.01467, 2024.
[42] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
[43] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV), pages 286–301, 2018.
[44]

Upscale-a-video: Temporal- consistent diffusion model for real-world video super- resolution

Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, and Chen Change Loy. Upscale-a-video: Temporal- consistent diffusion model for real-world video super- resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2535– 2545, 2024. 2

work page 2024

[1] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.

[2] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022.

[3] Kelvin CK Chan, Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Basicvsr: The search for essential components in video super-resolution and beyond. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4947–4956, 2021.

[4] Kelvin CK Chan, Shangchen Zhou, Xiangyu Xu, and Chen Change Loy. Basicvsr++: Improving video super-resolution with enhanced propagation and alignment. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5972–5981, 2022.

[5] Yi-Ting Chen, Ting-Hsuan Liao, Pengsheng Guo, Alexander Schwing, and Jia-Bin Huang. Bridging diffusion models and 3d representations: A 3d consistent super-resolution framework. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13481–13490, 2025.

[6] Xiang Feng, Yongbo He, Yubo Wang, Yan Yang, Wen Li, Yifei Chen, Zhenzhong Kuang, Jianping Fan, Yu Jun, et al. Srgs: Super-resolution 3d gaussian splatting. arXiv preprint arXiv:2404.10318, 2024.

[7] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.

[8] Quan Huynh-Thu and Mohammed Ghanbari. Scope of validity of psnr in image/video quality assessment. Electronics letters, 44(13):800–801, 2008.

[9] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021.

[10] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1,

[11] Bernhard Kerbl, Andreas Meuleman, Georgios Kopanas, Michael Wimmer, Alexandre Lanvin, and George Drettakis. A hierarchical 3d gaussian representation for real-time rendering of very large datasets. ACM Transactions on Graphics (TOG), 43(4):1–15, 2024.

[12] Bryan Sangwoo Kim, Jeongsol Kim, and Jong Chul Ye. Chain-of-zoom: Extreme super-resolution via scale autoregression and preference alignment. arXiv preprint arXiv:2505.18600, 2025.

[13] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG), 36(4):1–13, 2017.

[14] Hyun-kyu Ko, Dongheok Park, Youngin Park, Byeonghyeon Lee, Juhee Han, and Eunbyung Park. Sequence matters: Harnessing video models in 3d super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4356–4364, 2025.

[15] Jonas Kulhanek, Marie-Julie Rakotosaona, Fabian Manhardt, Christina Tsalicoglou, Michael Niemeyer, Torsten Sattler, Songyou Peng, and Federico Tombari. Lodge: Level-of-detail large-scale gaussian splatting with efficient rendering. arXiv preprint arXiv:2505.23158, 2025.

[16] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690,

[17] Jie Long Lee, Chen Li, and Gim Hee Lee. Disr-nerf: Diffusion-guided view-consistent super-resolution nerf. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20561–20570, 2024.

[18] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1833–1844,

[19] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017.

[20] Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. Making a “completely blind” image quality analyzer. IEEE Signal processing letters, 20(3):209–212, 2012.

[21] Anurag Ranjan and Michael J Black. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4161–4170, 2017.

[22] Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. Octree-gs: Towards consistent real-time rendering with lod-structured 3d gaussians. arXiv preprint arXiv:2403.17898, 2024.

[23] Yunji Seo, Young Sun Choi, Hyun Seung Son, and Youngjung Uh. Flod: Integrating flexible level of detail into 3d gaussian splatting for customizable rendering. arXiv preprint arXiv:2408.12894, 2024.

[24] Yuan Shen, Duygu Ceylan, Paul Guerrero, Zexiang Xu, Niloy J Mitra, Shenlong Wang, and Anna Frühstück. Supergaussian: Repurposing video models for 3d super resolution. In European Conference on Computer Vision, pages 215–233. Springer, 2024.

[25] Shuwei Shi, Jinjin Gu, Liangbin Xie, Xintao Wang, Yujiu Yang, and Chao Dong. Rethinking alignment in video super-resolution transformers. Advances in Neural Information Processing Systems, 35:36081–36093, 2022.

[26] Yujing Sun, Lingchen Sun, Shuaizheng Liu, Rongyuan Wu, Zhengqiang Zhang, and Lei Zhang. One-step diffusion for detail-rich and temporally consistent video super-resolution. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

[27] Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.

[28] Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI conference on artificial intelligence, pages 2555–2563, 2023.

[29] Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin CK Chan, and Chen Change Loy. Exploiting diffusion prior for real-world image super-resolution. International Journal of Computer Vision, 132(12):5929–5949, 2024.

[30] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European conference on computer vision (ECCV) workshops, pages 0–0, 2018.

[31] Xintao Wang, Kelvin CK Chan, Ke Yu, Chao Dong, and Chen Change Loy. Edvr: Video restoration with enhanced deformable convolutional networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 0–0, 2019.

[32] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1905–1914,

[33] Xiaojuan Wang, Janne Kontkanen, Brian Curless, Steven M Seitz, Ira Kemelmacher-Shlizerman, Ben Mildenhall, Pratul Srinivasan, Dor Verbin, and Aleksander Holynski. Generative powers of ten. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7173–7182, 2024.

[34] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.

[35] Rongyuan Wu, Lingchen Sun, Zhiyuan Ma, and Lei Zhang. One-step effective diffusion network for real-world image super-resolution. Advances in Neural Information Processing Systems, 37:92529–92553, 2024.

[36] Shiyun Xie, Zhiru Wang, Yinghao Zhu, and Chengwei Pan. Supergs: Super-resolution 3d gaussian splatting via latent feature field and gradient-guided splitting. arXiv preprint arXiv:2410.02571, 1, 2024.

[37] Yiran Xu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Jia-Bin Huang, and Difan Liu. Videogigagan: Towards detail-rich video super-resolution. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2139–2149, 2025.

[38] Xiqian Yu, Hanxin Zhu, Tianyu He, and Zhibo Chen. Gaussiansr: 3d gaussian super-resolution with 2d diffusion priors. arXiv preprint arXiv:2406.10111, 2024.

[39] Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19447–19456,

[40] Zongsheng Yue, Jianyi Wang, and Chen Change Loy. Resshift: Efficient diffusion model for image super-resolution by residual shifting. Advances in Neural Information Processing Systems, 36:13294–13307, 2023.

[41] Baowen Zhang, Chuan Fang, Rakesh Shrestha, Yixun Liang, Xiaoxiao Long, and Ping Tan. Rade-gs: Rasterizing depth in gaussian splatting. arXiv preprint arXiv:2406.01467, 2024.

[42] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.

[43] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV), pages 286–301, 2018.

[44] Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, and Chen Change Loy. Upscale-a-video: Temporal-consistent diffusion model for real-world video super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2535–2545, 2024.