{"paper":{"title":"DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"DiffST adapts pre-trained diffusion models for one-step whole-video sampling to lead real-world space-time super-resolution while running 17 times faster.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Chunming He, Dehua Song, Jin Han, Ruofan Yang, Yong Guo, Yulun Zhang, Zheng Chen, Zichen Zou","submitted_at":"2026-05-13T08:41:48Z","abstract_excerpt":"Diffusion-based models have shown strong performance in video super-resolution (VSR) and video frame interpolation (VFI). However, their role in the coupled space-time video super-resolution (STVSR) setting remains limited. Existing diffusion-based STVSR approaches suffer from two issues: (1) low inference efficiency and (2) insufficient utilization of spatiotemporal information. These limitations impede deployment. To address these issues, we introduce DiffST, an efficient spatiotemporal-aware video diffusion framework for real-world STVSR. To improve efficiency, we adapt a pre-trained diffus"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Extensive experiments show that DiffST obtains leading results on real-world STVSR tasks. It also maintains high inference efficiency, running about 17× faster than previous diffusion-based STVSR methods.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That adapting a pre-trained image diffusion model to one-step sampling on entire videos, combined with the proposed CFCA and VRG modules, will preserve or improve quality without introducing artifacts specific to real-world degradations.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"DiffST delivers state-of-the-art real-world space-time video super-resolution with 17x faster inference than prior diffusion methods by using one-step sampling, cross-frame context aggregation, and video representation guidance.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"DiffST adapts pre-trained diffusion models for one-step whole-video sampling to lead real-world space-time super-resolution while running 17 times faster.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"04c61a46452e9658c5dcc5ea3aa4a0e97a0cc40b05b40c0b37df4ba8395e74ee"},"source":{"id":"2605.13182","kind":"arxiv","version":1},"verdict":{"id":"d836683e-0fa2-4d77-ac11-5a0872d955b2","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T20:26:21.353988Z","strongest_claim":"Extensive experiments show that DiffST obtains leading results on real-world STVSR tasks. It also maintains high inference efficiency, running about 17× faster than previous diffusion-based STVSR methods.","one_line_summary":"DiffST delivers state-of-the-art real-world space-time video super-resolution with 17x faster inference than prior diffusion methods by using one-step sampling, cross-frame context aggregation, and video representation guidance.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That adapting a pre-trained image diffusion model to one-step sampling on entire videos, combined with the proposed CFCA and VRG modules, will preserve or improve quality without introducing artifacts specific to real-world degradations.","pith_extraction_headline":"DiffST adapts pre-trained diffusion models for one-step whole-video sampling to lead real-world space-time super-resolution while running 17 times faster."},"references":{"count":65,"sample":[{"doi":"","year":2022,"title":"Towards interpretable video super-resolution via alternating optimization","work_id":"b6438ad2-a647-4d09-b19b-9c01aec48269","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Basicvsr: The search for essential components in video super-resolution and beyond","work_id":"56e6bda2-3197-4b81-baa1-b8b0b25fb8c6","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Basicvsr++: Improving video super-resolution with enhanced propagation and alignment","work_id":"d9b04ae3-7d4d-4aa6-8be9-398d4ef7f072","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Investigating tradeoffs in real-world video super-resolution","work_id":"778cf319-6407-47f3-8cda-2ca81518103a","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Motif: Learning motion trajectories with local implicit neural functions for continuous space-time video super-resolution","work_id":"957198b7-cb16-4a1a-8dbe-f9270fbbfb98","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":65,"snapshot_sha256":"934d442b18378dedbe13ac19cbffff1c8a454cefea264feb9141e2c13bb02570","internal_anchors":2},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}