SURF: Signature-Retained Fast Video Generation

Hengshuang Zhao; Kaixin Ding; Liang Hou; Sihui Ji; Xi Chen; Xin Tao; Yuan Gao

arxiv: 2603.21002 · v2 · pith:5BNTP4IMnew · submitted 2025-11-25 · 💻 cs.GR

SURF: Signature-Retained Fast Video Generation

Kaixin Ding , Xi Chen , Sihui Ji , Yuan Gao , Liang Hou , Xin Tao , Hengshuang Zhao This is my paper

Pith reviewed 2026-05-21 18:06 UTC · model grok-4.3

classification 💻 cs.GR

keywords video generationfast inferencehigh resolutiondiffusion modelsmodel accelerationrefinernoise reshifting

0 comments

The pith

SURF generates high-resolution videos up to 12.5 times faster by creating quick low-resolution previews with noise reshifting and then using a Refiner to restore full detail while staying close to the original model's signatures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to solve the slow inference speeds that limit high-resolution video generation, where a single 720p clip can take over 50 minutes with models like Wan2.1. It splits the task into a preview stage that runs the pretrained model at lower resolution after initial full-resolution denoising steps to avoid losing layout, semantics, and motion, and a refine stage that trains a Refiner to upscale the preview with far fewer steps. The approach is presented as a plug-in compatible with existing models and other acceleration techniques. It reports concrete speedups of 12.5 times for 5-second 16fps 720p Wan 2.1 videos and 8.7 times for HunyuanVideo at similar specs.

Core claim

SURF divides video generation into two stages: a preview stage that applies noise reshifting by running initial denoising at the model's original resolution before switching to low resolution to generate fast previews that retain signatures, and a refine stage that establishes a mapping from preview to high-resolution output via a Refiner trained with shifting windows, which reduces the number of denoising steps needed while keeping the output maximally close to the signatures of the given pretrained model.

What carries the argument

Noise reshifting in the preview stage to mitigate signature loss during low-resolution inference, paired with a Refiner network trained to map previews to high-resolution targets using fewer denoising steps.

If this is right

High-resolution video generation becomes practical on standard hardware instead of requiring long runtimes.
The distinctive signatures such as layout, semantics, and motion from the base model appear in the final output.
The framework can be added to existing models and combined with other acceleration methods as a plug-in.
Denoising steps in the high-resolution phase drop significantly once a good preview is available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same preview-plus-refiner split could be tested on image or audio generation tasks that face similar resolution and fidelity trade-offs.
Adaptive choice of when to switch resolutions based on prompt complexity might further improve results beyond the fixed procedure described.
Longer video clips could expose whether temporal consistency holds across the preview-to-refine boundary.

Load-bearing premise

That performing a few initial denoising steps at full resolution before switching to low resolution is enough to prevent loss of the model's distinctive layout, semantics, and motion, and that the Refiner can learn a reliable mapping that cuts denoising steps without introducing artifacts or reducing fidelity.

What would settle it

Generate identical prompts with both the original pretrained model and the full SURF pipeline, then compare the resulting videos for differences in object layout, semantic content, or motion patterns; substantial mismatches would show that signatures were not retained.

Figures

Figures reproduced from arXiv: 2603.21002 by Hengshuang Zhao, Kaixin Ding, Liang Hou, Sihui Ji, Xi Chen, Xin Tao, Yuan Gao.

**Figure 1.** Figure 1: SURF generates high-resolution and high-quality videos efficiently. It acts as a simple plug-in and retains the signatures (i.e., layout, semantic, motion, etc.) of the baseline model with more than 12× speed-up. More results are available on Project Page. Abstract The demand for high-resolution video generation is growing rapidly. However, the generation resolution is severely constrained by slow inferen… view at source ↗

**Figure 2.** Figure 2: Demonstration for signature retention. Distilled model loses the signatures of the base model, which could lead to misaligned body parts and weak semantic consistency. Our method maintains the layout, semantics, and motion of base model with huge speed up. 1. Introduction Video generation [19, 34, 43] has witnessed remarkable advancements in recent years, with sophisticated models continually pushing the b… view at source ↗

**Figure 3.** Figure 3: Overview of the SURF. Our framework employs a powerful pretrained model (e.g., Wan 2.1) and a lightweight Refiner network. Together, they execute the denoising process through an OptimRes-LowRes-HighRes flow (SURF), producing outputs with signature (e.g., layout, semantic, motion) closely match those of the base model (Wan 2.1), while achieving huge speed up. described by the following iterative update: z0… view at source ↗

**Figure 4.** Figure 4: Shift window across adjacent blocks. Training paradigm. We first obtain large amounts of high-quality videos and apply both pixel- and latent-level degradations to simulate low-quality video values. Pixellevel degradation simulates blur, while latent-level degradation prevents the task from becoming a trivial superresolution problem, encouraging the model to exploit its generative ability. These low-qua… view at source ↗

**Figure 5.** Figure 5: User study. We report pair-wise preference rates. SURF achieves comparable quality to Wan 2.1 with huge speed up [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative comparisons. Our method achieves up to 12× speedup while maintaining signatures of base model. Unreasonable contents are marked in orange. Rather than aimless camera panning, SURF generates high fidelity videos with semantically aligned motion. (d) A panda drinking coffee in a café watercolor painting (c) A boat sailing along the Seine River with the Eiffel Tower in background Van Gogh style (a… view at source ↗

**Figure 7.** Figure 7: Generation results for each stage. We mark regions with artifacts and lacking detail in the preview videos using yellow boxes, while improvements from the Refiner are highlighted in red. Zoom in for a better view. 4.3. Stage-wise Visualization This section presents visualizations of SURF, providing an intuitive comparison between the preview and the final results in overall layout similarity and refinement… view at source ↗

**Figure 8.** Figure 8: Qualitative results for ablation. SW denotes shiftwindow attention. The Refiner runtime on 1080p video is shown in the bottom-right corner. Our method achieves shorter runtime while preserving finer details. fingernail contours and appropriate wrinkles, regardless of whether the shift-window mechanism is applied. Additionally, the clarity of distant trees is noticeably enhanced. our findings suggest that… view at source ↗

read the original abstract

The demand for high-resolution video generation is growing rapidly. However, the generation resolution is severely constrained by slow inference speeds. For instance, Wan2.1 requires over 50 minutes to generate a single 720p video. While previous works explore accelerating video generation from various aspects, most of them compromise the distinctive signatures (e.g., layout, semantic, motion) of the original model. In this work, we propose SURF, an efficient framework for generating high-resolution videos, while maximally keeping the signatures. Specifically, SURF divides video generation into two stages: First, we leverage the pretrained model to infer at optimal resolution and downsample latent to generate low-resolution previews in fast speed; then we design a Refiner to upscale the preview. In the preview stage, we identify that directly inferring a model (trained with higher resolution) on lower resolution causes severe losses in signatures. So we introduce noise reshifting, a training-free technique that mitigates this issue by conducting initial denoising steps on the original resolution and switching to low resolution in later steps. In the refine stage, we establish a mapping relationship between the preview and the high-resolution target, which significantly reduces the denoising steps. We further integrate shifting windows and carefully design the training paradigm to get a powerful and efficient Refiner. In this way, SURF enables generating high-resolution videos efficiently while maximally closer to the signatures of the given pretrained model. SURF is conceptually simple and could serve as a plug-in that is compatible with various base model and acceleration methods. For example, it achieves 12.5x speedup for generating 5-second, 16fps, 720p Wan 2.1 videos and 8.7x speedup for generating 5-second, 24fps, 720p HunyuanVideo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SURF gives a simple two-stage plug-in to speed up high-res video generation from existing models while trying to hold onto their signatures, but the abstract shows no metrics or ablations to confirm the signatures are actually retained.

read the letter

Hi, the main point is that SURF splits the process into a fast low-resolution preview using noise reshifting, then a Refiner to produce the final high-res output with fewer denoising steps. This is meant to deliver speedups like 12.5x on Wan 2.1 and 8.7x on HunyuanVideo without the signature losses seen in other acceleration tricks. The noise reshifting starts denoising at native resolution before switching to low res, and the Refiner uses shifting windows in its training to learn the preview-to-target mapping. These pieces are presented as training-free or lightweight additions that work as a plug-in across base models. The paper does a solid job naming the real deployment problem and keeping the method straightforward rather than requiring full retraining. It also positions the work as compatible with other acceleration techniques, which is a practical plus. The soft spots sit in the evidence. The description flags the signature loss issue when running high-res models at lower resolution, yet there are no quantitative metrics on layout, semantic, or motion fidelity, no ablations on the reshifting steps or number of initial high-res steps, and no checks on whether the Refiner introduces new artifacts or holds fidelity when cutting steps. Those gaps make it hard to judge if the central claim holds up or if the speed comes with hidden quality costs. This is aimed at engineers and researchers who need faster inference for 720p video from pretrained diffusion models. A reader working on practical deployment would find the ideas worth testing. The work shows clear engagement with the tradeoff and has enough of a concrete method to deserve a serious referee, though it will need stronger experimental backing. I would recommend sending it to peer review with a focus on adding the missing signature metrics and ablations.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SURF, an efficient framework for high-resolution video generation that aims to preserve the signatures of pretrained models. It divides the process into a preview stage, where noise reshifting is used to generate low-resolution previews quickly without severe signature loss, and a refine stage employing a Refiner to produce high-resolution outputs with fewer denoising steps. The approach is claimed to achieve 12.5x speedup for Wan 2.1 and 8.7x for HunyuanVideo while being compatible with various base models.

Significance. Should the empirical validation confirm effective signature retention, this could represent a practical advance in accelerating video diffusion models without sacrificing model-specific traits. The training-free aspect of noise reshifting and the plug-in design are positive features that could facilitate adoption. Currently, however, the lack of supporting quantitative data tempers the assessed significance.

major comments (2)

The abstract reports specific speedups but supplies no quantitative signature metrics, ablation studies, or error analysis; therefore the data do not yet demonstrate that the central claim of retaining signatures holds.
The claim that noise reshifting sufficiently prevents signature loss when running high-resolution-trained models at lower resolution lacks supporting metrics or ablations demonstrating restoration of fidelity in layout, semantic, and motion.

minor comments (1)

Some figures illustrating the noise reshifting process or Refiner architecture would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and have revised the manuscript to strengthen the quantitative support for signature retention.

read point-by-point responses

Referee: The abstract reports specific speedups but supplies no quantitative signature metrics, ablation studies, or error analysis; therefore the data do not yet demonstrate that the central claim of retaining signatures holds.

Authors: We acknowledge that the abstract prioritizes the reported speedups. The full manuscript already presents quantitative signature metrics (including CLIP similarity for semantics, SSIM and layout consistency scores, and optical-flow-based motion metrics), along with ablations and error analysis in Sections 4 and 5. To ensure the abstract itself better demonstrates the central claim, we have revised it to briefly include key signature-retention numbers alongside the speedup figures and have added a short reference to the supporting ablations. revision: yes
Referee: The claim that noise reshifting sufficiently prevents signature loss when running high-resolution-trained models at lower resolution lacks supporting metrics or ablations demonstrating restoration of fidelity in layout, semantic, and motion.

Authors: We agree that explicit per-aspect metrics strengthen the claim. The original manuscript already shows that noise reshifting outperforms direct low-resolution inference through both qualitative examples and aggregate quantitative comparisons. In the revision we have added a dedicated ablation subsection with separate metrics: layout fidelity (SSIM and bounding-box overlap), semantic fidelity (CLIP score), and motion fidelity (endpoint error on optical flow). These results quantify the restoration achieved by the reshifting schedule. revision: yes

Circularity Check

0 steps flagged

No circularity in SURF derivation; techniques are independent additions to pretrained models

full rationale

The paper describes a two-stage pipeline (preview via noise reshifting on pretrained models, followed by a separately trained Refiner) without any equations, fitted parameters, or predictions that reduce by construction to the method's own inputs. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled via prior work, and no known results are merely renamed. The speedup and signature-retention claims rest on empirical application of the described procedures rather than tautological re-derivations, making the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The approach rests on standard diffusion-model assumptions plus two new procedural inventions whose effectiveness is asserted but not derived from prior literature.

free parameters (1)

number of initial high-resolution denoising steps
Chosen to balance signature retention against speed; exact count not stated in abstract.

axioms (1)

domain assumption A high-resolution-trained video diffusion model can be run at lower resolution after initial denoising steps without catastrophic signature loss.
Invoked to justify the preview stage and noise reshifting.

invented entities (1)

Refiner no independent evidence
purpose: Learns a direct mapping from low-resolution preview to high-resolution output to reduce total denoising steps
New component introduced in the refine stage; no independent evidence supplied.

pith-pipeline@v0.9.0 · 5873 in / 1408 out tokens · 69305 ms · 2026-05-21T18:06:02.762391+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 11 internal anchors

[1]

VideoPhy: Evaluating Physical Commonsense for Video Generation

Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Multidiffusion: Fusing diffusion paths for controlled image generation, 2023

Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation, 2023. URLhttps: //arxiv.org/abs/2302.08113. 3

work page arXiv 2023
[3]

Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction

Han Cai, Junyan Li, Muyan Hu, Chuang Gan, and Song Han. Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction. InPro- ceedings of the IEEE/CVF international conference on computer vision, pp. 17302–17313, 2023. 3

work page 2023
[4]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF in- ternational conference on computer vision, pp. 9650– 9660, 2021. 5

work page 2021
[5]

Pixart-sigma: Weak- to-strong training of diffusion transformer for 4k text- to-image generation, 2024

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-sigma: Weak- to-strong training of diffusion transformer for 4k text- to-image generation, 2024. URLhttps://arxiv. org/abs/2403.04692. 3

work page arXiv 2024
[6]

Pixelflow: Pixel-space generative models with flow.arXiv preprint arXiv:2504.07963,

Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow.arXiv preprint arXiv:2504.07963,

work page arXiv
[7]

Resadapter: Domain consistent resolution adapter for diffusion models

Jiaxiang Cheng, Pan Xie, Xin Xia, Jiashi Li, Jie Wu, Yuxi Ren, Huixia Li, Xuefeng Xiao, Shilei Wen, and Lean Fu. Resadapter: Domain consistent resolution adapter for diffusion models. InProceedings of the AAAI Conference on Artificial Intelligence, vol- ume 39, pp. 2438–2446, 2025. 3

work page 2025
[8]

Rethinking Attention with Performers

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sar- los, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers.arXiv preprint arXiv:2009.14794, 2020. 3

work page internal anchor Pith review Pith/arXiv arXiv 2009
[9]

Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e. FlashAttention: Fast and memory- efficient exact attention with IO-awareness. InAd- vances in Neural Information Processing Systems (NeurIPS), 2022. 3

work page 2022
[10]

Efficient-vdit: Efficient video diffusion transformers with attention tile, 2025

Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, and Hao Zhang. Efficient-vdit: Efficient video diffusion transformers with attention tile, 2025. URLhttps://arxiv. org/abs/2502.06155. 2

work page arXiv 2025
[11]

DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

Zihan Ding, Chi Jin, Difan Liu, Haitian Zheng, Kr- ishna Kumar Singh, Qiang Zhang, Yan Kang, Zhe Lin, and Yuchen Liu. Dollar: Few-step video generation via distillation and latent reward optimization.arXiv preprint arXiv:2412.15689, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Demofusion: Democratising high-resolution image generation with no $$$

Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. Demofusion: Democratising high-resolution image generation with no $$$. InCVPR, 2024. 3

work page 2024
[13]

Scaling rectified flow transformers for high- resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high- resolution image synthesis. InForty-first international conference on machine learning, 2024. 3

work page 2024
[14]

Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models

Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. InThe Twelfth International Con- ference on Learning Representations, 2024. 3

work page 2024
[15]

Fouriscale: A frequency perspective on training-free high-resolution image synthesis, 2024

Linjiang Huang, Rongyao Fang, Aiping Zhang, Guan- glu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution image synthesis, 2024. URLhttps: //arxiv.org/abs/2403.12963. 3

work page arXiv 2024
[16]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianx- ing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pp. 21807–21818, 2024. 5

work page 2024
[17]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pap- pas, and Franc ¸ois Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning, pp. 5156–5165. PMLR, 2020. 3

work page 2020
[18]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A sys- tematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024. 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Kling ai.https://klingai.com/,

Kuaishou. Kling ai.https://klingai.com/,

work page
[20]

Syncdiffusion: Coherent montage via synchronized joint diffusions

Yuseung Lee, Kunho Kim, Hyunjin Kim, and Min- hyuk Sung. Syncdiffusion: Coherent montage via synchronized joint diffusions. InThirty-seventh Con- ference on Neural Information Processing Systems,

work page
[21]

Hiprompt: Tuning-free higher-resolution generation with hierarchical mllm prompts, 2024

Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, Wenhan Luo, Qifeng Liu, and Yike Guo. Hiprompt: Tuning-free higher-resolution generation with hierarchical mllm prompts, 2024. URLhttps: //arxiv.org/abs/2409.02919. 3

work page arXiv 2024
[22]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation.arXiv preprint arXiv:2410.05363,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas- try, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pp. 8748–8763. PmLR, 2021. 5

work page 2021
[26]

Ultrapixel: Advancing ultra-high-resolution image synthesis to new peaks, 2024

Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, and Lei Zhu. Ultrapixel: Advancing ultra-high-resolution image synthesis to new peaks, 2024. URLhttps: //arxiv.org/abs/2407.02158. 3

work page arXiv 2024
[27]

Turbo2k: Towards ultra-efficient and high-quality 2k video synthesis, 2025

Jingjing Ren, Wenbo Li, Zhongdao Wang, Haoze Sun, Bangzhen Liu, Haoyu Chen, Jiaqi Xu, Aoxue Li, Shifeng Zhang, Bin Shao, Yong Guo, and Lei Zhu. Turbo2k: Towards ultra-efficient and high-quality 2k video synthesis, 2025. URLhttps://arxiv. org/abs/2504.14470. 2

work page arXiv 2025
[28]

Laion-5b: An open large- scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large- scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022. 5

work page 2022
[29]

Resmaster: Mas- tering high-resolution image generation via structural and fine-grained guidance, 2024

Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, and Yinqiang Zheng. Resmaster: Mas- tering high-resolution image generation via structural and fine-grained guidance, 2024. URLhttps:// arxiv.org/abs/2406.16476. 3

work page arXiv 2024
[30]

Scale-wise distillation of diffusion models.arXiv preprint arXiv:2503.16397, 2025

Nikita Starodubcev, Denis Kuznedelev, Artem Babenko, and Dmitry Baranchuk. Scale-wise distillation of diffusion models.arXiv preprint arXiv:2503.16397, 2025. 2

work page arXiv 2025
[31]

Roformer: Enhanced transformer with rotary position embedding.Neuro- computing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neuro- computing, 568:127063, 2024. 5

work page 2024
[32]

Fastvideo: A unified frame- work for accelerated video generation, April 2024

The FastVideo Team. Fastvideo: A unified frame- work for accelerated video generation, April 2024. URLhttps://github.com/hao- ai- lab/ FastVideo. 5, 6

work page 2024
[33]

Training-free diffusion acceleration with bot- tleneck sampling.arXiv preprint arXiv:2503.18940,

Ye Tian, Xin Xia, Yuxi Ren, Shanchuan Lin, Xing Wang, Xuefeng Xiao, Yunhai Tong, Ling Yang, and Bin Cui. Training-free diffusion acceleration with bot- tleneck sampling.arXiv preprint arXiv:2503.18940,

work page arXiv
[34]

Wan: Open and Advanced Large-Scale Video Generative Models

Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen- Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Rui- hang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Seedvr: Seeding infinity in diffusion transformer towards generic video restoration.arXiv preprint arXiv:2501.01320, 2025

Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Fei Xiao, Chen Change Loy, and Lu Jiang. Seedvr: Seeding infinity in diffusion transformer towards generic video restoration.arXiv preprint arXiv:2501.01320, 2025. 3

work page arXiv 2025
[36]

Linformer: Self-Attention with Linear Complexity

Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768,

work page internal anchor Pith review Pith/arXiv arXiv 2006
[37]

Real-esrgan: Training real-world blind super- resolution with pure synthetic data

Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super- resolution with pure synthetic data. InInternational Conference on Computer Vision Workshops (ICCVW). 5

work page
[38]

Exploring video quality assessment on user generated contents from aesthetic and technical perspectives

Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pp. 20144– 20154, 2023. 5

work page 2023
[39]

Sparse videogen: Accelerat- ing video diffusion transformers with spatial-temporal sparsity.arXiv preprint arXiv:2502.01776, 2025

Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, et al. Sparse videogen: Accelerat- ing video diffusion transformers with spatial-temporal sparsity.arXiv preprint arXiv:2502.01776, 2025. 3, 5, 6

work page arXiv 2025
[40]

Training-free and adaptive sparse attention for efficient long video generation.arXiv preprint arXiv:2502.21079, 2025

Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, and Bin Cui. Training-free and adaptive sparse attention for efficient long video generation.arXiv preprint arXiv:2502.21079, 2025. 3

work page arXiv 2025
[41]

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers.arXiv preprint arXiv:2410.10629, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, et al. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation.arXiv preprint arXiv:2505.18875, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072,

work page internal anchor Pith review Pith/arXiv arXiv
[44]

Metaformer is actually what you need for vision

Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer is actually what you need for vision. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10819– 10829, 2022. 3

work page 2022
[45]

Accvideo: Accel- erating video diffusion model with synthetic dataset

Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, and Yu Qiao. Accvideo: Accel- erating video diffusion model with synthetic dataset. arXiv preprint arXiv:2503.19462, 2025. 2, 6

work page arXiv 2025
[46]

Spargeattn: Accurate sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137, 2025

Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattn: Accurate sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137,

work page arXiv
[47]

Fast video generation with sliding tile attention.arXiv preprint arXiv:2502.04507, 2025

Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, and Hao Zhang. Fast video generation with sliding tile attention.arXiv preprint arXiv:2502.04507, 2025. 3

work page arXiv 2025
[48]

Hidiffusion: Un- locking higher-resolution creativity and efficiency in pretrained diffusion models

Shen Zhang, Zhaowei Chen, Zhenyu Zhao, Yuhao Chen, Yao Tang, and Jiajun Liang. Hidiffusion: Un- locking higher-resolution creativity and efficiency in pretrained diffusion models. InEuropean Conference on Computer Vision, pp. 145–161. Springer, 2025. 3

work page 2025
[49]

Flashvideo: Flowing fidelity to detail for efficient high-resolution video generation.arXiv preprint arXiv:2502.05179, 2025

Shilong Zhang, Wenbo Li, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, and Ping Luo. Flashvideo: Flowing fidelity to detail for efficient high-resolution video generation.arXiv preprint arXiv:2502.05179, 2025. 3

work page arXiv 2025
[50]

Training-free efficient video gener- ation via dynamic token carving.arXiv preprint arXiv:2505.16864, 2025

Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, and Jiaya Jia. Training-free efficient video gener- ation via dynamic token carving.arXiv preprint arXiv:2505.16864, 2025. 3, 6

work page arXiv 2025

[1] [1]

VideoPhy: Evaluating Physical Commonsense for Video Generation

Hritik Bansal, Zongyu Lin, Tianyi Xie, Zeshun Zong, Michal Yarom, Yonatan Bitton, Chenfanfu Jiang, Yizhou Sun, Kai-Wei Chang, and Aditya Grover. Videophy: Evaluating physical commonsense for video generation.arXiv preprint arXiv:2406.03520,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Multidiffusion: Fusing diffusion paths for controlled image generation, 2023

Omer Bar-Tal, Lior Yariv, Yaron Lipman, and Tali Dekel. Multidiffusion: Fusing diffusion paths for controlled image generation, 2023. URLhttps: //arxiv.org/abs/2302.08113. 3

work page arXiv 2023

[3] [3]

Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction

Han Cai, Junyan Li, Muyan Hu, Chuang Gan, and Song Han. Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction. InPro- ceedings of the IEEE/CVF international conference on computer vision, pp. 17302–17313, 2023. 3

work page 2023

[4] [4]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv ´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InProceedings of the IEEE/CVF in- ternational conference on computer vision, pp. 9650– 9660, 2021. 5

work page 2021

[5] [5]

Pixart-sigma: Weak- to-strong training of diffusion transformer for 4k text- to-image generation, 2024

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-sigma: Weak- to-strong training of diffusion transformer for 4k text- to-image generation, 2024. URLhttps://arxiv. org/abs/2403.04692. 3

work page arXiv 2024

[6] [6]

Pixelflow: Pixel-space generative models with flow.arXiv preprint arXiv:2504.07963,

Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, and Ping Luo. Pixelflow: Pixel-space generative models with flow.arXiv preprint arXiv:2504.07963,

work page arXiv

[7] [7]

Resadapter: Domain consistent resolution adapter for diffusion models

Jiaxiang Cheng, Pan Xie, Xin Xia, Jiashi Li, Jie Wu, Yuxi Ren, Huixia Li, Xuefeng Xiao, Shilei Wen, and Lean Fu. Resadapter: Domain consistent resolution adapter for diffusion models. InProceedings of the AAAI Conference on Artificial Intelligence, vol- ume 39, pp. 2438–2446, 2025. 3

work page 2025

[8] [8]

Rethinking Attention with Performers

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sar- los, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers.arXiv preprint arXiv:2009.14794, 2020. 3

work page internal anchor Pith review Pith/arXiv arXiv 2009

[9] [9]

Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e

Tri Dao, Daniel Y . Fu, Stefano Ermon, Atri Rudra, and Christopher R ´e. FlashAttention: Fast and memory- efficient exact attention with IO-awareness. InAd- vances in Neural Information Processing Systems (NeurIPS), 2022. 3

work page 2022

[10] [10]

Efficient-vdit: Efficient video diffusion transformers with attention tile, 2025

Hangliang Ding, Dacheng Li, Runlong Su, Peiyuan Zhang, Zhijie Deng, Ion Stoica, and Hao Zhang. Efficient-vdit: Efficient video diffusion transformers with attention tile, 2025. URLhttps://arxiv. org/abs/2502.06155. 2

work page arXiv 2025

[11] [11]

DOLLAR: Few-Step Video Generation via Distillation and Latent Reward Optimization

Zihan Ding, Chi Jin, Difan Liu, Haitian Zheng, Kr- ishna Kumar Singh, Qiang Zhang, Yan Kang, Zhe Lin, and Yuchen Liu. Dollar: Few-step video generation via distillation and latent reward optimization.arXiv preprint arXiv:2412.15689, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Demofusion: Democratising high-resolution image generation with no $$$

Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, and Zhanyu Ma. Demofusion: Democratising high-resolution image generation with no $$$. InCVPR, 2024. 3

work page 2024

[13] [13]

Scaling rectified flow transformers for high- resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high- resolution image synthesis. InForty-first international conference on machine learning, 2024. 3

work page 2024

[14] [14]

Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models

Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, and Ying Shan. Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models. InThe Twelfth International Con- ference on Learning Representations, 2024. 3

work page 2024

[15] [15]

Fouriscale: A frequency perspective on training-free high-resolution image synthesis, 2024

Linjiang Huang, Rongyao Fang, Aiping Zhang, Guan- glu Song, Si Liu, Yu Liu, and Hongsheng Li. Fouriscale: A frequency perspective on training-free high-resolution image synthesis, 2024. URLhttps: //arxiv.org/abs/2403.12963. 3

work page arXiv 2024

[16] [16]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianx- ing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pp. 21807–21818, 2024. 5

work page 2024

[17] [17]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pap- pas, and Franc ¸ois Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning, pp. 5156–5165. PMLR, 2020. 3

work page 2020

[18] [18]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A sys- tematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024. 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

Kling ai.https://klingai.com/,

Kuaishou. Kling ai.https://klingai.com/,

work page

[20] [20]

Syncdiffusion: Coherent montage via synchronized joint diffusions

Yuseung Lee, Kunho Kim, Hyunjin Kim, and Min- hyuk Sung. Syncdiffusion: Coherent montage via synchronized joint diffusions. InThirty-seventh Con- ference on Neural Information Processing Systems,

work page

[21] [21]

Hiprompt: Tuning-free higher-resolution generation with hierarchical mllm prompts, 2024

Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, Wenhan Luo, Qifeng Liu, and Yike Guo. Hiprompt: Tuning-free higher-resolution generation with hierarchical mllm prompts, 2024. URLhttps: //arxiv.org/abs/2409.02919. 3

work page arXiv 2024

[22] [22]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.arXiv preprint arXiv:2310.04378, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, and Ping Luo. Towards world simulator: Crafting physical commonsense-based benchmark for video generation.arXiv preprint arXiv:2410.05363,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas- try, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pp. 8748–8763. PmLR, 2021. 5

work page 2021

[25] [26]

Ultrapixel: Advancing ultra-high-resolution image synthesis to new peaks, 2024

Jingjing Ren, Wenbo Li, Haoyu Chen, Renjing Pei, Bin Shao, Yong Guo, Long Peng, Fenglong Song, and Lei Zhu. Ultrapixel: Advancing ultra-high-resolution image synthesis to new peaks, 2024. URLhttps: //arxiv.org/abs/2407.02158. 3

work page arXiv 2024

[26] [27]

Turbo2k: Towards ultra-efficient and high-quality 2k video synthesis, 2025

Jingjing Ren, Wenbo Li, Zhongdao Wang, Haoze Sun, Bangzhen Liu, Haoyu Chen, Jiaqi Xu, Aoxue Li, Shifeng Zhang, Bin Shao, Yong Guo, and Lei Zhu. Turbo2k: Towards ultra-efficient and high-quality 2k video synthesis, 2025. URLhttps://arxiv. org/abs/2504.14470. 2

work page arXiv 2025

[27] [28]

Laion-5b: An open large- scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large- scale dataset for training next generation image-text models.Advances in neural information processing systems, 35:25278–25294, 2022. 5

work page 2022

[28] [29]

Resmaster: Mas- tering high-resolution image generation via structural and fine-grained guidance, 2024

Shuwei Shi, Wenbo Li, Yuechen Zhang, Jingwen He, Biao Gong, and Yinqiang Zheng. Resmaster: Mas- tering high-resolution image generation via structural and fine-grained guidance, 2024. URLhttps:// arxiv.org/abs/2406.16476. 3

work page arXiv 2024

[29] [30]

Scale-wise distillation of diffusion models.arXiv preprint arXiv:2503.16397, 2025

Nikita Starodubcev, Denis Kuznedelev, Artem Babenko, and Dmitry Baranchuk. Scale-wise distillation of diffusion models.arXiv preprint arXiv:2503.16397, 2025. 2

work page arXiv 2025

[30] [31]

Roformer: Enhanced transformer with rotary position embedding.Neuro- computing, 568:127063, 2024

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neuro- computing, 568:127063, 2024. 5

work page 2024

[31] [32]

Fastvideo: A unified frame- work for accelerated video generation, April 2024

The FastVideo Team. Fastvideo: A unified frame- work for accelerated video generation, April 2024. URLhttps://github.com/hao- ai- lab/ FastVideo. 5, 6

work page 2024

[32] [33]

Training-free diffusion acceleration with bot- tleneck sampling.arXiv preprint arXiv:2503.18940,

Ye Tian, Xin Xia, Yuxi Ren, Shanchuan Lin, Xing Wang, Xuefeng Xiao, Yunhai Tong, Ling Yang, and Bin Cui. Training-free diffusion acceleration with bot- tleneck sampling.arXiv preprint arXiv:2503.18940,

work page arXiv

[33] [34]

Wan: Open and Advanced Large-Scale Video Generative Models

Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen- Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Rui- hang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Tianxing...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [35]

Seedvr: Seeding infinity in diffusion transformer towards generic video restoration.arXiv preprint arXiv:2501.01320, 2025

Jianyi Wang, Zhijie Lin, Meng Wei, Yang Zhao, Ceyuan Yang, Fei Xiao, Chen Change Loy, and Lu Jiang. Seedvr: Seeding infinity in diffusion transformer towards generic video restoration.arXiv preprint arXiv:2501.01320, 2025. 3

work page arXiv 2025

[35] [36]

Linformer: Self-Attention with Linear Complexity

Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768,

work page internal anchor Pith review Pith/arXiv arXiv 2006

[36] [37]

Real-esrgan: Training real-world blind super- resolution with pure synthetic data

Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super- resolution with pure synthetic data. InInternational Conference on Computer Vision Workshops (ICCVW). 5

work page

[37] [38]

Exploring video quality assessment on user generated contents from aesthetic and technical perspectives

Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pp. 20144– 20154, 2023. 5

work page 2023

[38] [39]

Sparse videogen: Accelerat- ing video diffusion transformers with spatial-temporal sparsity.arXiv preprint arXiv:2502.01776, 2025

Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, et al. Sparse videogen: Accelerat- ing video diffusion transformers with spatial-temporal sparsity.arXiv preprint arXiv:2502.01776, 2025. 3, 5, 6

work page arXiv 2025

[39] [40]

Training-free and adaptive sparse attention for efficient long video generation.arXiv preprint arXiv:2502.21079, 2025

Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, and Bin Cui. Training-free and adaptive sparse attention for efficient long video generation.arXiv preprint arXiv:2502.21079, 2025. 3

work page arXiv 2025

[40] [41]

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers.arXiv preprint arXiv:2410.10629, 2024. 3

work page internal anchor Pith review Pith/arXiv arXiv 2024

[41] [42]

Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, et al. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation.arXiv preprint arXiv:2505.18875, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [43]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072,

work page internal anchor Pith review Pith/arXiv arXiv

[43] [44]

Metaformer is actually what you need for vision

Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer is actually what you need for vision. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10819– 10829, 2022. 3

work page 2022

[44] [45]

Accvideo: Accel- erating video diffusion model with synthetic dataset

Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, and Yu Qiao. Accvideo: Accel- erating video diffusion model with synthetic dataset. arXiv preprint arXiv:2503.19462, 2025. 2, 6

work page arXiv 2025

[45] [46]

Spargeattn: Accurate sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137, 2025

Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. Spargeattn: Accurate sparse attention accelerating any model inference.arXiv preprint arXiv:2502.18137,

work page arXiv

[46] [47]

Fast video generation with sliding tile attention.arXiv preprint arXiv:2502.04507, 2025

Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhenghong Liu, and Hao Zhang. Fast video generation with sliding tile attention.arXiv preprint arXiv:2502.04507, 2025. 3

work page arXiv 2025

[47] [48]

Hidiffusion: Un- locking higher-resolution creativity and efficiency in pretrained diffusion models

Shen Zhang, Zhaowei Chen, Zhenyu Zhao, Yuhao Chen, Yao Tang, and Jiajun Liang. Hidiffusion: Un- locking higher-resolution creativity and efficiency in pretrained diffusion models. InEuropean Conference on Computer Vision, pp. 145–161. Springer, 2025. 3

work page 2025

[48] [49]

Flashvideo: Flowing fidelity to detail for efficient high-resolution video generation.arXiv preprint arXiv:2502.05179, 2025

Shilong Zhang, Wenbo Li, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, and Ping Luo. Flashvideo: Flowing fidelity to detail for efficient high-resolution video generation.arXiv preprint arXiv:2502.05179, 2025. 3

work page arXiv 2025

[49] [50]

Training-free efficient video gener- ation via dynamic token carving.arXiv preprint arXiv:2505.16864, 2025

Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, and Jiaya Jia. Training-free efficient video gener- ation via dynamic token carving.arXiv preprint arXiv:2505.16864, 2025. 3, 6

work page arXiv 2025