SURF: Signature-Retained Fast Video Generation
Pith reviewed 2026-05-21 18:06 UTC · model grok-4.3
The pith
SURF generates high-resolution videos up to 12.5 times faster by creating quick low-resolution previews with noise reshifting and then using a Refiner to restore full detail while staying close to the original model's signatures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SURF divides video generation into two stages: a preview stage that applies noise reshifting by running initial denoising at the model's original resolution before switching to low resolution to generate fast previews that retain signatures, and a refine stage that establishes a mapping from preview to high-resolution output via a Refiner trained with shifting windows, which reduces the number of denoising steps needed while keeping the output maximally close to the signatures of the given pretrained model.
What carries the argument
Noise reshifting in the preview stage to mitigate signature loss during low-resolution inference, paired with a Refiner network trained to map previews to high-resolution targets using fewer denoising steps.
If this is right
- High-resolution video generation becomes practical on standard hardware instead of requiring long runtimes.
- The distinctive signatures such as layout, semantics, and motion from the base model appear in the final output.
- The framework can be added to existing models and combined with other acceleration methods as a plug-in.
- Denoising steps in the high-resolution phase drop significantly once a good preview is available.
Where Pith is reading between the lines
- The same preview-plus-refiner split could be tested on image or audio generation tasks that face similar resolution and fidelity trade-offs.
- Adaptive choice of when to switch resolutions based on prompt complexity might further improve results beyond the fixed procedure described.
- Longer video clips could expose whether temporal consistency holds across the preview-to-refine boundary.
Load-bearing premise
That performing a few initial denoising steps at full resolution before switching to low resolution is enough to prevent loss of the model's distinctive layout, semantics, and motion, and that the Refiner can learn a reliable mapping that cuts denoising steps without introducing artifacts or reducing fidelity.
What would settle it
Generate identical prompts with both the original pretrained model and the full SURF pipeline, then compare the resulting videos for differences in object layout, semantic content, or motion patterns; substantial mismatches would show that signatures were not retained.
Figures
read the original abstract
The demand for high-resolution video generation is growing rapidly. However, the generation resolution is severely constrained by slow inference speeds. For instance, Wan2.1 requires over 50 minutes to generate a single 720p video. While previous works explore accelerating video generation from various aspects, most of them compromise the distinctive signatures (e.g., layout, semantic, motion) of the original model. In this work, we propose SURF, an efficient framework for generating high-resolution videos, while maximally keeping the signatures. Specifically, SURF divides video generation into two stages: First, we leverage the pretrained model to infer at optimal resolution and downsample latent to generate low-resolution previews in fast speed; then we design a Refiner to upscale the preview. In the preview stage, we identify that directly inferring a model (trained with higher resolution) on lower resolution causes severe losses in signatures. So we introduce noise reshifting, a training-free technique that mitigates this issue by conducting initial denoising steps on the original resolution and switching to low resolution in later steps. In the refine stage, we establish a mapping relationship between the preview and the high-resolution target, which significantly reduces the denoising steps. We further integrate shifting windows and carefully design the training paradigm to get a powerful and efficient Refiner. In this way, SURF enables generating high-resolution videos efficiently while maximally closer to the signatures of the given pretrained model. SURF is conceptually simple and could serve as a plug-in that is compatible with various base model and acceleration methods. For example, it achieves 12.5x speedup for generating 5-second, 16fps, 720p Wan 2.1 videos and 8.7x speedup for generating 5-second, 24fps, 720p HunyuanVideo.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SURF, an efficient framework for high-resolution video generation that aims to preserve the signatures of pretrained models. It divides the process into a preview stage, where noise reshifting is used to generate low-resolution previews quickly without severe signature loss, and a refine stage employing a Refiner to produce high-resolution outputs with fewer denoising steps. The approach is claimed to achieve 12.5x speedup for Wan 2.1 and 8.7x for HunyuanVideo while being compatible with various base models.
Significance. Should the empirical validation confirm effective signature retention, this could represent a practical advance in accelerating video diffusion models without sacrificing model-specific traits. The training-free aspect of noise reshifting and the plug-in design are positive features that could facilitate adoption. Currently, however, the lack of supporting quantitative data tempers the assessed significance.
major comments (2)
- The abstract reports specific speedups but supplies no quantitative signature metrics, ablation studies, or error analysis; therefore the data do not yet demonstrate that the central claim of retaining signatures holds.
- The claim that noise reshifting sufficiently prevents signature loss when running high-resolution-trained models at lower resolution lacks supporting metrics or ablations demonstrating restoration of fidelity in layout, semantic, and motion.
minor comments (1)
- Some figures illustrating the noise reshifting process or Refiner architecture would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major comment below and have revised the manuscript to strengthen the quantitative support for signature retention.
read point-by-point responses
-
Referee: The abstract reports specific speedups but supplies no quantitative signature metrics, ablation studies, or error analysis; therefore the data do not yet demonstrate that the central claim of retaining signatures holds.
Authors: We acknowledge that the abstract prioritizes the reported speedups. The full manuscript already presents quantitative signature metrics (including CLIP similarity for semantics, SSIM and layout consistency scores, and optical-flow-based motion metrics), along with ablations and error analysis in Sections 4 and 5. To ensure the abstract itself better demonstrates the central claim, we have revised it to briefly include key signature-retention numbers alongside the speedup figures and have added a short reference to the supporting ablations. revision: yes
-
Referee: The claim that noise reshifting sufficiently prevents signature loss when running high-resolution-trained models at lower resolution lacks supporting metrics or ablations demonstrating restoration of fidelity in layout, semantic, and motion.
Authors: We agree that explicit per-aspect metrics strengthen the claim. The original manuscript already shows that noise reshifting outperforms direct low-resolution inference through both qualitative examples and aggregate quantitative comparisons. In the revision we have added a dedicated ablation subsection with separate metrics: layout fidelity (SSIM and bounding-box overlap), semantic fidelity (CLIP score), and motion fidelity (endpoint error on optical flow). These results quantify the restoration achieved by the reshifting schedule. revision: yes
Circularity Check
No circularity in SURF derivation; techniques are independent additions to pretrained models
full rationale
The paper describes a two-stage pipeline (preview via noise reshifting on pretrained models, followed by a separately trained Refiner) without any equations, fitted parameters, or predictions that reduce by construction to the method's own inputs. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled via prior work, and no known results are merely renamed. The speedup and signature-retention claims rest on empirical application of the described procedures rather than tautological re-derivations, making the chain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of initial high-resolution denoising steps
axioms (1)
- domain assumption A high-resolution-trained video diffusion model can be run at lower resolution after initial denoising steps without catastrophic signature loss.
invented entities (1)
-
Refiner
no independent evidence
Reference graph
Works this paper leans on
-
[1]
VideoPhy: Evaluating Physical Commonsense for Video Generation
Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Multidiffusion: Fusing diffusion paths for controlled image generation, 2023
Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation, 2023. URLhttps: //arxiv.org/abs/2302.08113. 3
-
[3]
Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction
Han Cai, Junyan Li, Muyan Hu, Chuang Gan, and Song Han. Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction. InPro- ceedings of the IEEE/CVF international conference on computer vision, pp. 17302–17313, 2023. 3
work page 2023
-
[4]
Emerging properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF in- ternational conference on computer vision, pp. 9650– 9660, 2021. 5
work page 2021
-
[5]
Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-sigma: Weak- to-strong training of diffusion transformer for 4k text- to-image generation, 2024. URLhttps://arxiv. org/abs/2403.04692. 3
-
[6]
Pixelflow: Pixel-space generative models with flow.arXiv preprint arXiv:2504.07963,
Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow.arXiv preprint arXiv:2504.07963,
-
[7]
Resadapter: Domain consistent resolution adapter for diffusion models
Jiaxiang Cheng, Pan Xie, Xin Xia, Jiashi Li, Jie Wu, Yuxi Ren, Huixia Li, Xuefeng Xiao, Shilei Wen, and Lean Fu. Resadapter: Domain consistent resolution adapter for diffusion models. InProceedings of the AAAI Conference on Artificial Intelligence, vol- ume 39, pp. 2438–2446, 2025. 3
work page 2025
-
[8]
Rethinking Attention with Performers
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sar- los, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers.arXiv preprint arXiv:2009.14794, 2020. 3
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[9]
Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e
Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e. FlashAttention: Fast and memory- efficient exact attention with IO-awareness. InAd- vances in Neural Information Processing Systems (NeurIPS), 2022. 3
work page 2022
-
[10]
Efficient-vdit: Efficient video diffusion transformers with attention tile, 2025
Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, and Hao Zhang. Efficient-vdit: Efficient video diffusion transformers with attention tile, 2025. URLhttps://arxiv. org/abs/2502.06155. 2
-
[11]
DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization
Zihan Ding, Chi Jin, Difan Liu, Haitian Zheng, Kr- ishna Kumar Singh, Qiang Zhang, Yan Kang, Zhe Lin, and Yuchen Liu. Dollar: Few-step video generation via distillation and latent reward optimization.arXiv preprint arXiv:2412.15689, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
Demofusion: Democratising high-resolution image generation with no $$$
Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. Demofusion: Democratising high-resolution image generation with no $$$. InCVPR, 2024. 3
work page 2024
-
[13]
Scaling rectified flow transformers for high- resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high- resolution image synthesis. InForty-first international conference on machine learning, 2024. 3
work page 2024
-
[14]
Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models
Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. InThe Twelfth International Con- ference on Learning Representations, 2024. 3
work page 2024
-
[15]
Fouriscale: A frequency perspective on training-free high-resolution image synthesis, 2024
Linjiang Huang, Rongyao Fang, Aiping Zhang, Guan- glu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution image synthesis, 2024. URLhttps: //arxiv.org/abs/2403.12963. 3
-
[16]
Vbench: Comprehensive benchmark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianx- ing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pp. 21807–21818, 2024. 5
work page 2024
-
[17]
Transformers are rnns: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pap- pas, and Franc ¸ois Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning, pp. 5156–5165. PMLR, 2020. 3
work page 2020
-
[18]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A sys- tematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024. 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [19]
-
[20]
Syncdiffusion: Coherent montage via synchronized joint diffusions
Yuseung Lee, Kunho Kim, Hyunjin Kim, and Min- hyuk Sung. Syncdiffusion: Coherent montage via synchronized joint diffusions. InThirty-seventh Con- ference on Neural Information Processing Systems,
-
[21]
Hiprompt: Tuning-free higher-resolution generation with hierarchical mllm prompts, 2024
Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, Wenhan Luo, Qifeng Liu, and Yike Guo. Hiprompt: Tuning-free higher-resolution generation with hierarchical mllm prompts, 2024. URLhttps: //arxiv.org/abs/2409.02919. 3
-
[22]
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation.arXiv preprint arXiv:2410.05363,
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas- try, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pp. 8748–8763. PmLR, 2021. 5
work page 2021
-
[26]
Ultrapixel: Advancing ultra-high-resolution image synthesis to new peaks, 2024
Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, and Lei Zhu. Ultrapixel: Advancing ultra-high-resolution image synthesis to new peaks, 2024. URLhttps: //arxiv.org/abs/2407.02158. 3
-
[27]
Turbo2k: Towards ultra-efficient and high-quality 2k video synthesis, 2025
Jingjing Ren, Wenbo Li, Zhongdao Wang, Haoze Sun, Bangzhen Liu, Haoyu Chen, Jiaqi Xu, Aoxue Li, Shifeng Zhang, Bin Shao, Yong Guo, and Lei Zhu. Turbo2k: Towards ultra-efficient and high-quality 2k video synthesis, 2025. URLhttps://arxiv. org/abs/2504.14470. 2
-
[28]
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large- scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022. 5
work page 2022
-
[29]
Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, and Yinqiang Zheng. Resmaster: Mas- tering high-resolution image generation via structural and fine-grained guidance, 2024. URLhttps:// arxiv.org/abs/2406.16476. 3
-
[30]
Scale-wise distillation of diffusion models.arXiv preprint arXiv:2503.16397, 2025
Nikita Starodubcev, Denis Kuznedelev, Artem Babenko, and Dmitry Baranchuk. Scale-wise distillation of diffusion models.arXiv preprint arXiv:2503.16397, 2025. 2
-
[31]
Roformer: Enhanced transformer with rotary position embedding.Neuro- computing, 568:127063, 2024
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neuro- computing, 568:127063, 2024. 5
work page 2024
-
[32]
Fastvideo: A unified frame- work for accelerated video generation, April 2024
The FastVideo Team. Fastvideo: A unified frame- work for accelerated video generation, April 2024. URLhttps://github.com/hao- ai- lab/ FastVideo. 5, 6
work page 2024
-
[33]
Training-free diffusion acceleration with bot- tleneck sampling.arXiv preprint arXiv:2503.18940,
Ye Tian, Xin Xia, Yuxi Ren, Shanchuan Lin, Xing Wang, Xuefeng Xiao, Yunhai Tong, Ling Yang, and Bin Cui. Training-free diffusion acceleration with bot- tleneck sampling.arXiv preprint arXiv:2503.18940,
-
[34]
Wan: Open and Advanced Large-Scale Video Generative Models
Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen- Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Rui- hang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Fei Xiao, Chen Change Loy, and Lu Jiang. Seedvr: Seeding infinity in diffusion transformer towards generic video restoration.arXiv preprint arXiv:2501.01320, 2025. 3
-
[36]
Linformer: Self-Attention with Linear Complexity
Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768,
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[37]
Real-esrgan: Training real-world blind super- resolution with pure synthetic data
Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super- resolution with pure synthetic data. InInternational Conference on Computer Vision Workshops (ICCVW). 5
-
[38]
Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pp. 20144– 20154, 2023. 5
work page 2023
-
[39]
Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, et al. Sparse videogen: Accelerat- ing video diffusion transformers with spatial-temporal sparsity.arXiv preprint arXiv:2502.01776, 2025. 3, 5, 6
-
[40]
Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, and Bin Cui. Training-free and adaptive sparse attention for efficient long video generation.arXiv preprint arXiv:2502.21079, 2025. 3
-
[41]
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers.arXiv preprint arXiv:2410.10629, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, et al. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation.arXiv preprint arXiv:2505.18875, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072,
work page internal anchor Pith review Pith/arXiv arXiv
-
[44]
Metaformer is actually what you need for vision
Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer is actually what you need for vision. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10819– 10829, 2022. 3
work page 2022
-
[45]
Accvideo: Accel- erating video diffusion model with synthetic dataset
Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, and Yu Qiao. Accvideo: Accel- erating video diffusion model with synthetic dataset. arXiv preprint arXiv:2503.19462, 2025. 2, 6
-
[46]
Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattn: Accurate sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137,
-
[47]
Fast video generation with sliding tile attention.arXiv preprint arXiv:2502.04507, 2025
Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, and Hao Zhang. Fast video generation with sliding tile attention.arXiv preprint arXiv:2502.04507, 2025. 3
-
[48]
Hidiffusion: Un- locking higher-resolution creativity and efficiency in pretrained diffusion models
Shen Zhang, Zhaowei Chen, Zhenyu Zhao, Yuhao Chen, Yao Tang, and Jiajun Liang. Hidiffusion: Un- locking higher-resolution creativity and efficiency in pretrained diffusion models. InEuropean Conference on Computer Vision, pp. 145–161. Springer, 2025. 3
work page 2025
-
[49]
Shilong Zhang, Wenbo Li, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, and Ping Luo. Flashvideo: Flowing fidelity to detail for efficient high-resolution video generation.arXiv preprint arXiv:2502.05179, 2025. 3
-
[50]
Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, and Jiaya Jia. Training-free efficient video gener- ation via dynamic token carving.arXiv preprint arXiv:2505.16864, 2025. 3, 6
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.