
arxiv: 2603.21002 · v2 · pith:5BNTP4IMnew · submitted 2025-11-25 · 💻 cs.GR

SURF: Signature-Retained Fast Video Generation

Pith reviewed 2026-05-21 18:06 UTC · model grok-4.3

classification 💻 cs.GR
keywords video generation · fast inference · high resolution · diffusion models · model acceleration · refiner · noise reshifting

The pith

SURF generates high-resolution videos up to 12.5 times faster by creating quick low-resolution previews with noise reshifting and then using a Refiner to restore full detail while staying close to the original model's signatures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the slow inference that limits high-resolution video generation: a single 720p clip can take over 50 minutes with a model such as Wan2.1. It splits generation into two stages. The preview stage runs the pretrained model at its original resolution for the initial denoising steps and then switches to a lower resolution, so that layout, semantics, and motion are not lost. The refine stage trains a Refiner to upscale the preview in far fewer denoising steps. The approach is presented as a plug-in compatible with existing models and other acceleration techniques. The reported speedups are 12.5× for 5-second, 16fps, 720p Wan 2.1 videos and 8.7× for 5-second, 24fps, 720p HunyuanVideo videos.
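To make the headline figure concrete, the following minimal sketch converts a baseline runtime and a speedup factor into an accelerated runtime. The function name is hypothetical, and the 50-minute input is only the lower bound quoted for Wan2.1 ("over 50 minutes"), so the result is illustrative rather than a measured runtime. No baseline runtime is quoted for HunyuanVideo, so none is computed here.

```python
def accelerated_minutes(baseline_minutes: float, speedup: float) -> float:
    """Return the runtime after acceleration for a given baseline and speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_minutes / speedup


# Assumption: use the quoted lower bound of 50 minutes as the Wan2.1 baseline.
wan_minutes = accelerated_minutes(50.0, 12.5)
print(f"Wan 2.1 at 12.5x: about {wan_minutes:.1f} minutes")
```

Because the quoted baseline is a lower bound, the actual accelerated runtime would scale up with whatever the true baseline is.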

Core claim

SURF divides video generation into two stages. The preview stage applies noise reshifting: the initial denoising steps run at the model's original resolution, and the later steps run at low resolution, which produces fast previews that retain the base model's signatures. The refine stage uses a Refiner, trained with shifting windows, to map the preview to the high-resolution output; this reduces the number of denoising steps needed while keeping the output as close as possible to the signatures of the given pretrained model.
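The resolution switch in the preview stage can be sketched as a loop that downsamples the latent once, partway through denoising. This is a toy illustration under stated assumptions, not the authors' implementation: `toy_denoise_step`, `downsample_latent`, and `preview` are hypothetical names, the denoiser is a placeholder, average pooling is an assumed downsampling choice, and any noise-level adjustment the paper applies at the switch is not modelled.

```python
import numpy as np


def toy_denoise_step(latent: np.ndarray, step: int, total_steps: int) -> np.ndarray:
    """Hypothetical stand-in for one call to a pretrained denoiser."""
    return 0.9 * latent


def downsample_latent(latent: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool the spatial axes of a (frames, height, width) latent."""
    t, h, w = latent.shape
    if h % factor or w % factor:
        raise ValueError("height and width must be divisible by factor")
    return latent.reshape(t, h // factor, factor, w // factor, factor).mean(axis=(2, 4))


def preview(noise, total_steps, full_res_steps, factor, denoise_step=toy_denoise_step):
    """Run full_res_steps at the original resolution, then the rest at low resolution."""
    if not 0 <= full_res_steps < total_steps:
        raise ValueError("full_res_steps must be in [0, total_steps)")
    latent = noise
    for step in range(total_steps):
        if step == full_res_steps:
            # Switch point: every later step runs on the smaller latent.
            latent = downsample_latent(latent, factor)
        latent = denoise_step(latent, step, total_steps)
    return latent


rng = np.random.default_rng(0)
low_res_preview = preview(rng.standard_normal((4, 16, 16)), total_steps=10,
                          full_res_steps=3, factor=2)
print(low_res_preview.shape)
```

Setting `full_res_steps=0` in this sketch corresponds to direct low-resolution inference, the case the paper reports as losing signatures.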

What carries the argument

Noise reshifting in the preview stage to mitigate signature loss during low-resolution inference, paired with a Refiner network trained to map previews to high-resolution targets using fewer denoising steps.

If this is right

  • High-resolution video generation becomes practical on standard hardware instead of requiring long runtimes.
  • The distinctive signatures such as layout, semantics, and motion from the base model appear in the final output.
  • The framework can be added to existing models and combined with other acceleration methods as a plug-in.
  • Denoising steps in the high-resolution phase drop significantly once a good preview is available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same preview-plus-refiner split could be tested on image or audio generation tasks that face similar resolution and fidelity trade-offs.
  • Adaptive choice of when to switch resolutions based on prompt complexity might further improve results beyond the fixed procedure described.
  • Longer video clips could expose whether temporal consistency holds across the preview-to-refine boundary.

Load-bearing premise

That performing a few initial denoising steps at full resolution before switching to low resolution is enough to prevent loss of the model's distinctive layout, semantics, and motion, and that the Refiner can learn a reliable mapping that cuts denoising steps without introducing artifacts or reducing fidelity.

What would settle it

Generate identical prompts with both the original pretrained model and the full SURF pipeline, then compare the resulting videos for differences in object layout, semantic content, or motion patterns; substantial mismatches would show that signatures were not retained.
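A comparison of this kind could start from simple proxy scores such as the ones sketched below. These are stand-in measures chosen for illustration, not the metrics used in the paper: `frame_mse` is a crude proxy for layout agreement, and `motion_correlation` compares how much each video changes from frame to frame. All function names are hypothetical, and videos are assumed to be float arrays of shape (frames, height, width).

```python
import numpy as np


def frame_mse(video_a: np.ndarray, video_b: np.ndarray) -> float:
    """Mean squared pixel difference between two videos of identical shape."""
    if video_a.shape != video_b.shape:
        raise ValueError("videos must have the same shape")
    return float(np.mean((video_a - video_b) ** 2))


def motion_profile(video: np.ndarray) -> np.ndarray:
    """Mean absolute change between consecutive frames, one value per frame pair."""
    return np.abs(np.diff(video, axis=0)).mean(axis=(1, 2))


def motion_correlation(video_a: np.ndarray, video_b: np.ndarray) -> float:
    """Pearson correlation of the two motion profiles.

    Needs at least three frames and non-constant profiles to be well defined.
    """
    return float(np.corrcoef(motion_profile(video_a), motion_profile(video_b))[0, 1])


rng = np.random.default_rng(0)
base = rng.random((8, 16, 16))                               # placeholder for base-model output
candidate = base + 0.01 * rng.standard_normal(base.shape)    # placeholder for accelerated output
print("frame MSE:", frame_mse(base, candidate))
print("motion correlation:", motion_correlation(base, candidate))
```

Scores like these only flag gross mismatches; semantic agreement would need a learned embedding, which this sketch does not attempt.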

Figures

Figures reproduced from arXiv: 2603.21002 by Hengshuang Zhao, Kaixin Ding, Liang Hou, Sihui Ji, Xi Chen, Xin Tao, Yuan Gao.

Figure 1: SURF generates high-resolution and high-quality videos efficiently. It acts as a simple plug-in and retains the signatures (i.e., layout, semantic, motion, etc.) of the baseline model with more than 12× speed-up. More results are available on Project Page.
Figure 2: Demonstration for signature retention. Distilled model loses the signatures of the base model, which could lead to misaligned body parts and weak semantic consistency. Our method maintains the layout, semantics, and motion of base model with huge speed up. 1. Introduction Video generation [19, 34, 43] has witnessed remarkable advancements in recent years, with sophisticated models continually pushing the b…
Figure 3: Overview of the SURF. Our framework employs a powerful pretrained model (e.g., Wan 2.1) and a lightweight Refiner network. Together, they execute the denoising process through an OptimRes-LowRes-HighRes flow (SURF), producing outputs with signature (e.g., layout, semantic, motion) closely match those of the base model (Wan 2.1), while achieving huge speed up. described by the following iterative update: z0…
Figure 4: Shift window across adjacent blocks. Training paradigm. We first obtain large amounts of high-quality videos and apply both pixel- and latent-level degradations to simulate low-quality video values. Pixel-level degradation simulates blur, while latent-level degradation prevents the task from becoming a trivial super-resolution problem, encouraging the model to exploit its generative ability. These low-qua…
Figure 5: User study. We report pair-wise preference rates. SURF achieves comparable quality to Wan 2.1 with huge speed up
Figure 6: Qualitative comparisons. Our method achieves up to 12× speedup while maintaining signatures of base model. Unreasonable contents are marked in orange. Rather than aimless camera panning, SURF generates high fidelity videos with semantically aligned motion. (d) A panda drinking coffee in a café watercolor painting (c) A boat sailing along the Seine River with the Eiffel Tower in background Van Gogh style (a…
Figure 7: Generation results for each stage. We mark regions with artifacts and lacking detail in the preview videos using yellow boxes, while improvements from the Refiner are highlighted in red. Zoom in for a better view. 4.3. Stage-wise Visualization This section presents visualizations of SURF, providing an intuitive comparison between the preview and the final results in overall layout similarity and refinement…
Figure 8: Qualitative results for ablation. SW denotes shift-window attention. The Refiner runtime on 1080p video is shown in the bottom-right corner. Our method achieves shorter runtime while preserving finer details. fingernail contours and appropriate wrinkles, regardless of whether the shift-window mechanism is applied. Additionally, the clarity of distant trees is noticeably enhanced. Our findings suggest that…
Original abstract

The demand for high-resolution video generation is growing rapidly. However, the generation resolution is severely constrained by slow inference speeds. For instance, Wan2.1 requires over 50 minutes to generate a single 720p video. While previous works explore accelerating video generation from various aspects, most of them compromise the distinctive signatures (e.g., layout, semantic, motion) of the original model. In this work, we propose SURF, an efficient framework for generating high-resolution videos, while maximally keeping the signatures. Specifically, SURF divides video generation into two stages: First, we leverage the pretrained model to infer at optimal resolution and downsample latent to generate low-resolution previews in fast speed; then we design a Refiner to upscale the preview. In the preview stage, we identify that directly inferring a model (trained with higher resolution) on lower resolution causes severe losses in signatures. So we introduce noise reshifting, a training-free technique that mitigates this issue by conducting initial denoising steps on the original resolution and switching to low resolution in later steps. In the refine stage, we establish a mapping relationship between the preview and the high-resolution target, which significantly reduces the denoising steps. We further integrate shifting windows and carefully design the training paradigm to get a powerful and efficient Refiner. In this way, SURF enables generating high-resolution videos efficiently while maximally closer to the signatures of the given pretrained model. SURF is conceptually simple and could serve as a plug-in that is compatible with various base model and acceleration methods. For example, it achieves 12.5x speedup for generating 5-second, 16fps, 720p Wan 2.1 videos and 8.7x speedup for generating 5-second, 24fps, 720p HunyuanVideo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SURF, an efficient framework for high-resolution video generation that aims to preserve the signatures of pretrained models. It divides the process into a preview stage, where noise reshifting is used to generate low-resolution previews quickly without severe signature loss, and a refine stage employing a Refiner to produce high-resolution outputs with fewer denoising steps. The approach is claimed to achieve 12.5x speedup for Wan 2.1 and 8.7x for HunyuanVideo while being compatible with various base models.

Significance. Should the empirical validation confirm effective signature retention, this could represent a practical advance in accelerating video diffusion models without sacrificing model-specific traits. The training-free aspect of noise reshifting and the plug-in design are positive features that could facilitate adoption. Currently, however, the lack of supporting quantitative data tempers the assessed significance.

major comments (2)
  1. The abstract reports specific speedups but supplies no quantitative signature metrics, ablation studies, or error analysis; therefore the data do not yet demonstrate that the central claim of retaining signatures holds.
  2. The claim that noise reshifting sufficiently prevents signature loss when running high-resolution-trained models at lower resolution lacks supporting metrics or ablations demonstrating restoration of fidelity in layout, semantic, and motion.
minor comments (1)
  1. Some figures illustrating the noise reshifting process or Refiner architecture would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and have revised the manuscript to strengthen the quantitative support for signature retention.

  1. Referee: The abstract reports specific speedups but supplies no quantitative signature metrics, ablation studies, or error analysis; therefore the data do not yet demonstrate that the central claim of retaining signatures holds.

    Authors: We acknowledge that the abstract prioritizes the reported speedups. The full manuscript already presents quantitative signature metrics (including CLIP similarity for semantics, SSIM and layout consistency scores, and optical-flow-based motion metrics), along with ablations and error analysis in Sections 4 and 5. To ensure the abstract itself better demonstrates the central claim, we have revised it to briefly include key signature-retention numbers alongside the speedup figures and have added a short reference to the supporting ablations. revision: yes

  2. Referee: The claim that noise reshifting sufficiently prevents signature loss when running high-resolution-trained models at lower resolution lacks supporting metrics or ablations demonstrating restoration of fidelity in layout, semantic, and motion.

    Authors: We agree that explicit per-aspect metrics strengthen the claim. The original manuscript already shows that noise reshifting outperforms direct low-resolution inference through both qualitative examples and aggregate quantitative comparisons. In the revision we have added a dedicated ablation subsection with separate metrics: layout fidelity (SSIM and bounding-box overlap), semantic fidelity (CLIP score), and motion fidelity (endpoint error on optical flow). These results quantify the restoration achieved by the reshifting schedule. revision: yes
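Two of the measures named in the rebuttal have standard definitions and can be sketched directly: endpoint error between two optical-flow fields, and intersection-over-union for bounding-box overlap. This is a minimal sketch with hypothetical function names; it assumes flow fields stored as (height, width, 2) arrays and boxes given as (x_min, y_min, x_max, y_max), and it does not cover how the flows or boxes would be extracted from video.

```python
import numpy as np


def endpoint_error(flow_a: np.ndarray, flow_b: np.ndarray) -> float:
    """Mean Euclidean distance between two flow fields of shape (H, W, 2)."""
    if flow_a.shape != flow_b.shape or flow_a.shape[-1] != 2:
        raise ValueError("flows must share a shape of (H, W, 2)")
    return float(np.linalg.norm(flow_a - flow_b, axis=-1).mean())


def box_iou(box_a, box_b) -> float:
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Degenerate boxes with zero area have no meaningful overlap.
    return inter / union if union > 0 else 0.0


still = np.zeros((4, 4, 2))
moving = np.tile(np.array([3.0, 4.0]), (4, 4, 1))
print("endpoint error:", endpoint_error(still, moving))
print("box IoU:", box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Lower endpoint error and higher IoU both indicate closer agreement between the base model's output and the accelerated output.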

Circularity Check

0 steps flagged

No circularity in SURF derivation; techniques are independent additions to pretrained models


The paper describes a two-stage pipeline (preview via noise reshifting on pretrained models, followed by a separately trained Refiner) without any equations, fitted parameters, or predictions that reduce by construction to the method's own inputs. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled via prior work, and no known results are merely renamed. The speedup and signature-retention claims rest on empirical application of the described procedures rather than tautological re-derivations, making the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The approach rests on standard diffusion-model assumptions plus two new procedural inventions whose effectiveness is asserted but not derived from prior literature.

free parameters (1)
  • number of initial high-resolution denoising steps
    Chosen to balance signature retention against speed; exact count not stated in abstract.
axioms (1)
  • domain assumption: A high-resolution-trained video diffusion model can be run at lower resolution after initial denoising steps without catastrophic signature loss.
    Invoked to justify the preview stage and noise reshifting.
invented entities (1)
  • Refiner (no independent evidence)
    purpose: Learns a direct mapping from low-resolution preview to high-resolution output to reduce total denoising steps
    New component introduced in the refine stage; no independent evidence supplied.



Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 11 internal anchors

  1. [1]

    VideoPhy: Evaluating Physical Commonsense for Video Generation

    Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation. arXiv preprint arXiv:2406.03520,

  2. [2]

    Multidiffusion: Fusing diffusion paths for controlled image generation

    Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation, 2023. URL https://arxiv.org/abs/2302.08113. 3

  3. [3]

    Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction

    Han Cai, Junyan Li, Muyan Hu, Chuang Gan, and Song Han. Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 17302–17313, 2023. 3

  4. [4]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9650–9660, 2021. 5

  5. [5]

    Pixart-sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation

    Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-sigma: Weak-to-strong training of diffusion transformer for 4k text-to-image generation, 2024. URL https://arxiv.org/abs/2403.04692. 3

  6. [6]

    Pixelflow: Pixel-space generative models with flow

    Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow. arXiv preprint arXiv:2504.07963,

  7. [7]

    Resadapter: Domain consistent resolution adapter for diffusion models

    Jiaxiang Cheng, Pan Xie, Xin Xia, Jiashi Li, Jie Wu, Yuxi Ren, Huixia Li, Xuefeng Xiao, Shilei Wen, and Lean Fu. Resadapter: Domain consistent resolution adapter for diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 2438–2446, 2025. 3

  8. [8]

    Rethinking Attention with Performers

    Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020. 3

  9. [9]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022. 3

  10. [10]

    Efficient-vdit: Efficient video diffusion transformers with attention tile

    Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, and Hao Zhang. Efficient-vdit: Efficient video diffusion transformers with attention tile, 2025. URL https://arxiv.org/abs/2502.06155. 2

  11. [11]

    DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

    Zihan Ding, Chi Jin, Difan Liu, Haitian Zheng, Krishna Kumar Singh, Qiang Zhang, Yan Kang, Zhe Lin, and Yuchen Liu. Dollar: Few-step video generation via distillation and latent reward optimization. arXiv preprint arXiv:2412.15689, 2024. 3

  12. [12]

    Demofusion: Democratising high-resolution image generation with no $$$

    Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. Demofusion: Democratising high-resolution image generation with no $$$. In CVPR, 2024. 3

  13. [13]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024. 3

  14. [14]

    Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models

    Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. In The Twelfth International Conference on Learning Representations, 2024. 3

  15. [15]

    Fouriscale: A frequency perspective on training-free high-resolution image synthesis

    Linjiang Huang, Rongyao Fang, Aiping Zhang, Guanglu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution image synthesis, 2024. URL https://arxiv.org/abs/2403.12963. 3

  16. [16]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21807–21818, 2024. 5

  17. [17]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp. 5156–5165. PMLR, 2020. 3

  18. [18]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024. 2, 6

  19. [19]

    Kling ai

    Kuaishou. Kling ai. https://klingai.com/,

  20. [20]

    Syncdiffusion: Coherent montage via synchronized joint diffusions

    Yuseung Lee, Kunho Kim, Hyunjin Kim, and Minhyuk Sung. Syncdiffusion: Coherent montage via synchronized joint diffusions. In Thirty-seventh Conference on Neural Information Processing Systems,

  21. [21]

    Hiprompt: Tuning-free higher-resolution generation with hierarchical mllm prompts

    Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, Wenhan Luo, Qifeng Liu, and Yike Guo. Hiprompt: Tuning-free higher-resolution generation with hierarchical mllm prompts, 2024. URL https://arxiv.org/abs/2409.02919. 3

  22. [22]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023. 2

  23. [23]

    Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

    Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation. arXiv preprint arXiv:2410.05363,

  24. [24]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021. 5

  25. [26]

    Ultrapixel: Advancing ultra-high-resolution image synthesis to new peaks

    Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, and Lei Zhu. Ultrapixel: Advancing ultra-high-resolution image synthesis to new peaks, 2024. URL https://arxiv.org/abs/2407.02158. 3

  26. [27]

    Turbo2k: Towards ultra-efficient and high-quality 2k video synthesis

    Jingjing Ren, Wenbo Li, Zhongdao Wang, Haoze Sun, Bangzhen Liu, Haoyu Chen, Jiaqi Xu, Aoxue Li, Shifeng Zhang, Bin Shao, Yong Guo, and Lei Zhu. Turbo2k: Towards ultra-efficient and high-quality 2k video synthesis, 2025. URL https://arxiv.org/abs/2504.14470. 2

  27. [28]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022. 5

  28. [29]

    Resmaster: Mastering high-resolution image generation via structural and fine-grained guidance

    Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, and Yinqiang Zheng. Resmaster: Mastering high-resolution image generation via structural and fine-grained guidance, 2024. URL https://arxiv.org/abs/2406.16476. 3

  29. [30]

    Scale-wise distillation of diffusion models

    Nikita Starodubcev, Denis Kuznedelev, Artem Babenko, and Dmitry Baranchuk. Scale-wise distillation of diffusion models. arXiv preprint arXiv:2503.16397, 2025. 2

  30. [31]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. 5

  31. [32]

    Fastvideo: A unified framework for accelerated video generation

    The FastVideo Team. Fastvideo: A unified framework for accelerated video generation, April 2024. URL https://github.com/hao-ai-lab/FastVideo. 5, 6

  32. [33]

    Training-free diffusion acceleration with bottleneck sampling

    Ye Tian, Xin Xia, Yuxi Ren, Shanchuan Lin, Xing Wang, Xuefeng Xiao, Yunhai Tong, Ling Yang, and Bin Cui. Training-free diffusion acceleration with bottleneck sampling. arXiv preprint arXiv:2503.18940,

  33. [34]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing...

  34. [35]

    Seedvr: Seeding infinity in diffusion transformer towards generic video restoration

    Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Fei Xiao, Chen Change Loy, and Lu Jiang. Seedvr: Seeding infinity in diffusion transformer towards generic video restoration. arXiv preprint arXiv:2501.01320, 2025. 3

  35. [36]

    Linformer: Self-Attention with Linear Complexity

    Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768,

  36. [37]

    Real-esrgan: Training real-world blind super-resolution with pure synthetic data

    Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In International Conference on Computer Vision Workshops (ICCVW). 5

  37. [38]

    Exploring video quality assessment on user generated contents from aesthetic and technical perspectives

    Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20144–20154, 2023. 5

  38. [39]

    Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity

    Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, et al. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity. arXiv preprint arXiv:2502.01776, 2025. 3, 5, 6

  39. [40]

    Training-free and adaptive sparse attention for efficient long video generation

    Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, and Bin Cui. Training-free and adaptive sparse attention for efficient long video generation. arXiv preprint arXiv:2502.21079, 2025. 3

  40. [41]

    SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

    Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629, 2024. 3

  41. [42]

    Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

    Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, et al. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation. arXiv preprint arXiv:2505.18875, 2025. 3

  42. [43]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072,

  43. [44]

    Metaformer is actually what you need for vision

    Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10819–10829, 2022. 3

  44. [45]

    Accvideo: Accelerating video diffusion model with synthetic dataset

    Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, and Yu Qiao. Accvideo: Accelerating video diffusion model with synthetic dataset. arXiv preprint arXiv:2503.19462, 2025. 2, 6

  45. [46]

    Spargeattn: Accurate sparse attention accelerating any model inference

    Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattn: Accurate sparse attention accelerating any model inference. arXiv preprint arXiv:2502.18137,

  46. [47]

    Fast video generation with sliding tile attention

    Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, and Hao Zhang. Fast video generation with sliding tile attention. arXiv preprint arXiv:2502.04507, 2025. 3

  47. [48]

    Hidiffusion: Unlocking higher-resolution creativity and efficiency in pretrained diffusion models

    Shen Zhang, Zhaowei Chen, Zhenyu Zhao, Yuhao Chen, Yao Tang, and Jiajun Liang. Hidiffusion: Unlocking higher-resolution creativity and efficiency in pretrained diffusion models. In European Conference on Computer Vision, pp. 145–161. Springer, 2025. 3

  48. [49]

    Flashvideo: Flowing fidelity to detail for efficient high-resolution video generation

    Shilong Zhang, Wenbo Li, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, and Ping Luo. Flashvideo: Flowing fidelity to detail for efficient high-resolution video generation. arXiv preprint arXiv:2502.05179, 2025. 3

  49. [50]

    Training-free efficient video generation via dynamic token carving

    Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, and Jiaya Jia. Training-free efficient video generation via dynamic token carving. arXiv preprint arXiv:2505.16864, 2025. 3, 6