GS-STVSR achieves state-of-the-art continuous spatio-temporal video super-resolution quality with nearly constant inference time at standard scales and over 3x speedup at extreme scales using 2D Gaussian Splatting.
Infvsr: Breaking length limits of generic video super-resolution
5 Pith papers cite this work. Polarity classification is still indexing.
abstract
Real-world videos often extend over thousands of frames. Existing generative video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency due to the heavy cost of multi-step denoising for full-length sequences; and (2) poor consistency is hindered by temporal decomposition that causes artifacts and discontinuities. To break these limits, we propose InfVSR, which reformulates VSR as an autoregressive-one-step-diffusion paradigm, and enables streaming inference with video diffusion priors. First, we adapt the pretrained DiT into a causal structure, maintaining both local and global coherence via rolling KV-cache and joint visual guidance. Second, we distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and further introduce semantic-level metrics to comprehensively assess temporal consistency. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to 58x speed-up over existing methods such as MGLD-VSR. Our code and models are available at https://github.com/Kai-Liu001/InfVSR.
citation-role summary
citation-polarity summary
fields
cs.CV 5roles
other 1polarities
unclear 1representative citing papers
Stream-DiffVSR enables practical low-latency video super-resolution by combining a four-step distilled denoiser, auto-regressive temporal guidance, and a temporal processor in a strictly causal pipeline.
DiffST delivers state-of-the-art real-world space-time video super-resolution with 17x faster inference than prior diffusion methods by using one-step sampling, cross-frame context aggregation, and video representation guidance.
DVFace uses a spatio-temporal dual-codebook and asymmetric fusion in a one-step diffusion model to deliver better video face restoration quality, temporal consistency, and identity preservation than recent methods.
TIGER is a tri-prior fusion method for face video restoration using identity, geometry, and generative priors with progressive training to achieve SOTA identity fidelity and temporal stability on a new large-scale dataset.
citing papers explorer
-
GS-STVSR: Ultra-Efficient Continuous Spatio-Temporal Video Super-Resolution via 2D Gaussian Splatting
GS-STVSR achieves state-of-the-art continuous spatio-temporal video super-resolution quality with nearly constant inference time at standard scales and over 3x speedup at extreme scales using 2D Gaussian Splatting.
-
DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution
DiffST delivers state-of-the-art real-world space-time video super-resolution with 17x faster inference than prior diffusion methods by using one-step sampling, cross-frame context aggregation, and video representation guidance.
-
DVFace: Spatio-Temporal Dual-Prior Diffusion for Video Face Restoration
DVFace uses a spatio-temporal dual-codebook and asymmetric fusion in a one-step diffusion model to deliver better video face restoration quality, temporal consistency, and identity preservation than recent methods.
-
TIGER: Taming Identity, Geometry, and Generative Priors for High-Quality Face Video Restoration
TIGER is a tri-prior fusion method for face video restoration using identity, geometry, and generative priors with progressive training to achieve SOTA identity fidelity and temporal stability on a new large-scale dataset.