pith. machine review for the scientific record.

arxiv: 2605.06509 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 13:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords long video generation · training-free methods · video diffusion models · singular value decomposition · temporal consistency · content drift · spectral reconstruction · feature decomposition

The pith

FreeSpec uses singular value decomposition to fuse global low-rank guidance with local high-rank details, extending video diffusion models to long sequences without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that long-video problems in diffusion models stem from enlarged attention windows that concentrate spectral energy into a few low-rank directions, preserving coarse structure but losing spatial details and motion variations. It shows that previous global-plus-local methods rely on rigid feature splits that fail when appearance and action are coupled. FreeSpec instead applies singular value decomposition across branches, letting the global part supply low-rank spectral guidance and the local part supply a high-rank reconstruction basis. This spectrum-level fusion keeps long-range consistency while recovering the suppressed high-rank components. A sympathetic reader would care because the approach offers a training-free way to stretch short-video models to longer outputs with less drift and smoother dynamics.

Core claim

Enlarged self-attention windows induce spectral concentration in which energy is dominated by a few low-rank singular directions, suppressing high-rank spatial details and motion-rich temporal variations. FreeSpec decomposes global and local features with singular value decomposition, using the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis. This spectrum-level fusion avoids the rigid partitioning of earlier rules and preserves long-range consistency while better retaining spatial details and temporal dynamics.
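
One way to make "spectral concentration" concrete is the entropy-based effective rank of an attention layer's output: when a few singular directions carry most of the energy, the effective rank drops. A minimal sketch of that measure follows; it is a standard proxy and not necessarily the exact quantity the paper computes for its Figure 2 analysis.

    import torch

    def effective_rank(features: torch.Tensor) -> torch.Tensor:
        """Entropy-based effective rank of a (tokens, dim) feature matrix.

        A low value means the spectral energy is concentrated in a few
        singular directions -- the "spectral concentration" described above.
        """
        s = torch.linalg.svdvals(features)           # singular values
        p = s / s.sum()                               # normalized spectrum
        entropy = -(p * torch.log(p + 1e-12)).sum()   # entropy of the spectrum
        return torch.exp(entropy)                     # effective number of directions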

What carries the argument

Singular value decomposition applied to the global and local feature branches, treating the global output as low-rank spectral guidance and the local output as a high-rank reconstruction basis for spectrum-level fusion.
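
As a rough illustration of what a spectrum-level fusion can look like, here is a minimal PyTorch sketch. The function name, the hard rank cutoff k, and the additive recombination are illustrative assumptions; the paper's actual singular-spectrum modulation, local-basis reconstruction, and global residual (Figure 3) are more involved.

    import torch

    def spectral_fusion(global_feat: torch.Tensor,
                        local_feat: torch.Tensor,
                        k: int) -> torch.Tensor:
        """Hypothetical spectrum-level fusion of two attention branches.

        global_feat, local_feat: (tokens, dim) outputs of the full-window
        (global) and sliding-window (local) attention branches.
        k: number of singular directions treated as the low-rank band.
        """
        Ug, Sg, Vgh = torch.linalg.svd(global_feat, full_matrices=False)
        Ul, Sl, Vlh = torch.linalg.svd(local_feat, full_matrices=False)

        # Low-rank spectral guidance from the global branch: the top-k
        # structure that carries long-range consistency.
        low = Ug[:, :k] @ torch.diag(Sg[:k]) @ Vgh[:k, :]

        # High-rank reconstruction from the local branch: the detail- and
        # motion-rich components that spectral concentration suppresses.
        high = Ul[:, k:] @ torch.diag(Sl[k:]) @ Vlh[k:, :]

        return low + high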

If this is right

  • Long videos maintain both global coherence and local temporal variations when generated from short-video diffusion backbones.
  • Spatial details and action progression are recovered without requiring separate appearance or motion branches.
  • Existing models can be extended to longer durations by simple inference-time feature recombination rather than retraining.
  • Rigid hand-crafted partitioning rules become unnecessary when fusion occurs at the spectrum level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same low-rank versus high-rank separation could be tested on other attention-heavy generative tasks such as long audio or 3-D synthesis.
  • Adaptive choice of how many singular directions count as low-rank versus high-rank might further improve results on videos with varying motion complexity (one such rule is sketched after this list).
  • If spectral concentration proves general, attention-window scaling rules in future architectures might be redesigned to limit rank collapse from the start.
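
The adaptive-cutoff idea in the second bullet could, for instance, tie the split to cumulative spectral energy rather than a fixed index. A minimal sketch under that assumption, with a purely illustrative 90% energy threshold:

    import torch

    def adaptive_rank_cutoff(singular_values: torch.Tensor,
                             energy_ratio: float = 0.9) -> int:
        """Smallest k whose top-k singular values hold `energy_ratio` of the
        total spectral energy -- a hypothetical rule for choosing the
        low-rank / high-rank split per video or per denoising step."""
        energy = singular_values ** 2
        cumulative = torch.cumsum(energy, dim=0) / energy.sum()
        return int((cumulative < energy_ratio).sum().item()) + 1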

Load-bearing premise

That spectral concentration from enlarged attention windows is the main driver of content drift and over-smoothed dynamics, and that SVD-based low-rank and high-rank fusion separates them cleanly without creating fresh artifacts.

What would settle it

A side-by-side comparison in which FreeSpec-generated long videos still show measurable content drift or loss of fine motion at extended lengths, or in which a non-SVD decomposition achieves comparable consistency and detail without addressing spectral concentration.

Figures

Figures reproduced from arXiv: 2605.06509 by Chuanfu Xu, Fangda Chen, Long Lan, Longrong Yang, Shanshan Zhao, Zhigang Luo.

Figure 1. Long-video examples generated on Wan2.1 [7] with 4× the native training length. Existing training-free methods preserve stable appearance but may weaken continuous camera trajectories in the forest case and collapse sequential motocross actions into repetitive motion.
Figure 2. Spectral and qualitative analysis of enlarged self-attention windows. Here, W = f × h × w denotes the native self-attention token length, where f, h, and w are the temporal length, height, and width of the video latent, respectively. (a) shows the effective-rank dynamics across denoising timesteps under different window sizes. (b) reports the effective rank at representative timesteps, showing that enlarge…
Figure 3. Overview of FreeSpec. FreeSpec extends a frozen short-video diffusion model to long-video generation by replacing self-attention with SVD-guided dual-branch self-attention during inference. It combines global full-window guidance and local sliding-window priors through singular-spectrum modulation, followed by local-basis reconstruction and a lightweight global residual.
Figure 4. Qualitative comparison under 4× length extension on Wan2.1 and LTX-Video.
Figure 5. Qualitative ablation results of FreeSpec on Wan2.1. Yellow boxes highlight representative…
Figure 6. Failure cases. FreeSpec preserves temporal dynamics but fails to infer implicit scene transitions, such as underwater-to-air and diving-platform-to-pool transitions.
Original abstract

Video diffusion models perform well in short-video synthesis, but their training-free extension to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a global branch with a local branch, but they often further decompose appearance consistency and temporal dynamics within each branch using predefined criteria. This assignment is unreliable when appearance and action progression are tightly coupled, such as in camera motion and sequential motion. We analyze the video temporal extension issue from a singular-spectrum perspective and show that enlarged self-attention windows induce spectral concentration: spectral energy becomes dominated by a few low-rank singular directions, preserving coarse structure but suppressing high-rank spatial details and motion-rich temporal variations. To mitigate this problem, we propose FreeSpec, a training-free spectral reconstruction framework for long-video generation. FreeSpec decomposes global and local features with singular value decomposition, and uses the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis. This spectrum-level fusion avoids the rigid feature partitioning of previous decomposition rules, preserving long-range consistency while better retaining spatial details and temporal dynamics. Experiments on Wan2.1 and LTX-Video demonstrate that FreeSpec improves long-video generation, especially for temporal dynamics, while maintaining strong visual quality and temporal consistency. Project demo: https://fdchen24.github.io/FreeSpec-Website/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FreeSpec, a training-free framework for extending video diffusion models to long videos. It analyzes temporal inconsistency and content drift from a singular-spectrum viewpoint, showing that enlarged self-attention windows induce spectral concentration (energy dominated by low-rank singular directions, suppressing high-rank spatial details and temporal variations). FreeSpec decomposes global and local features via SVD, treating the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis. This spectrum-level fusion is claimed to avoid rigid feature partitioning of prior methods. Qualitative experiments on Wan2.1 and LTX-Video report improved temporal dynamics and consistency while preserving visual quality.

Significance. If the claimed mechanism holds, FreeSpec offers a simple, training-free spectral reconstruction technique that could meaningfully advance long-video synthesis in diffusion models by addressing drift through low-rank/high-rank fusion rather than heuristic decompositions. The singular-spectrum analysis provides a fresh perspective on attention-induced artifacts. The approach is parameter-free in its core reconstruction step and could be broadly applicable as a plug-in module.

major comments (2)
  1. [Experiments] Experiments section: The evaluation provides only qualitative comparisons and visual examples on Wan2.1 and LTX-Video. No quantitative metrics (FVD, CLIP similarity, temporal consistency scores), error bars, or statistical significance tests are reported. This weakens the central claim of improvement in dynamics and consistency, as visual inspection alone cannot isolate the contribution of the SVD fusion.
  2. [Method] Method section: The analysis links enlarged attention windows to singular-value decay and spectral concentration, but the manuscript does not isolate this as the dominant cause of drift (versus positional encoding drift or noise scheduling). No ablation holds attention window size and other factors fixed while varying only the low-rank/high-rank spectral split to test whether the fusion re-injects high-rank components without new artifacts or inconsistency.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly state the video lengths and motion types (e.g., camera motion vs. object motion) used in the qualitative examples for reproducibility.
  2. Figure captions should detail the exact baseline methods and attention window sizes being compared to allow readers to assess the visual differences more precisely.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper to incorporate additional experiments and ablations that strengthen the claims.

Point-by-point responses
  1. Referee: Experiments section: The evaluation provides only qualitative comparisons and visual examples on Wan2.1 and LTX-Video. No quantitative metrics (FVD, CLIP similarity, temporal consistency scores), error bars, or statistical significance tests are reported. This weakens the central claim of improvement in dynamics and consistency, as visual inspection alone cannot isolate the contribution of the SVD fusion.

    Authors: We agree that quantitative metrics would strengthen the evidence. In the revised manuscript we will add FVD, CLIP similarity, and temporal consistency scores (e.g., frame-to-frame optical-flow consistency) computed on the same Wan2.1 and LTX-Video examples, together with baseline comparisons. Where feasible we will also report results across multiple random seeds to include error bars. revision: yes

  2. Referee: Method section: The analysis links enlarged attention windows to singular-value decay and spectral concentration, but the manuscript does not isolate this as the dominant cause of drift (versus positional encoding drift or noise scheduling). No ablation holds attention window size and other factors fixed while varying only the low-rank/high-rank spectral split to test whether the fusion re-injects high-rank components without new artifacts or inconsistency.

    Authors: The singular-spectrum analysis is motivated by direct observation of the attention-window effect inside the diffusion U-Net; we view spectral concentration as a primary mechanism, though we acknowledge other factors can interact. To isolate the fusion step we will add a controlled ablation that fixes attention-window size, positional encodings, and noise schedule while varying only the low-rank/high-rank singular-value split. The new results will quantify whether the spectrum-level reconstruction re-injects high-rank components without introducing artifacts or inconsistency. revision: yes
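
On the metrics question in response 1, one commonly used temporal-consistency proxy is the mean cosine similarity of CLIP image embeddings between consecutive frames. The sketch below assumes the open-source OpenAI clip package and a list of frame image paths; it is a reference point for what such a score measures, not necessarily the metric the authors will report.

    import torch
    import clip                      # https://github.com/openai/CLIP
    from PIL import Image

    def clip_temporal_consistency(frame_paths, device="cuda"):
        """Mean cosine similarity of CLIP embeddings for consecutive frames;
        higher values indicate steadier appearance over time."""
        model, preprocess = clip.load("ViT-B/32", device=device)
        with torch.no_grad():
            embs = []
            for path in frame_paths:
                image = preprocess(Image.open(path)).unsqueeze(0).to(device)
                embs.append(model.encode_image(image).squeeze(0))
            embs = torch.stack(embs)
            embs = embs / embs.norm(dim=-1, keepdim=True)
            # Cosine similarity between each frame and its successor.
            sims = (embs[:-1] * embs[1:]).sum(dim=-1)
        return sims.mean().item()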

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's core chain consists of an empirical observation (enlarged attention windows concentrate singular energy) followed by an independent proposal (SVD decomposition of global/local branches with low-rank guidance + high-rank basis fusion). No equations, fitted parameters, or self-citations are shown that reduce the claimed improvements in dynamics/consistency to quantities defined by the method itself. The framework is presented as a new reconstruction step rather than a renaming or self-referential fit. This matches the default expectation for non-circular papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that video features exhibit a clean low-rank/high-rank spectral split that can be fused without loss, plus the unstated premise that the observed improvements stem from this spectral mechanism rather than other factors.

axioms (2)
  • domain assumption Enlarged self-attention windows induce spectral concentration that suppresses high-rank spatial details and motion-rich temporal variations.
    Invoked in the analysis of the video temporal extension issue.
  • domain assumption The global branch provides reliable low-rank spectral guidance and the local branch provides a reliable high-rank reconstruction basis.
    Core of the proposed fusion strategy.

pith-pipeline@v0.9.0 · 5555 in / 1343 out tokens · 31231 ms · 2026-05-08T13:08:19.216273+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

40 extracted references · 18 canonical work pages · 9 internal anchors

  1. [1]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023

  2. [2]

    Lavie: High-quality video generation with cascaded latent diffusion models

    Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision, 133(5):3059–3078, 2025

  3. [3]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7310–7320, 2024

  4. [4]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Representations, 2025

  5. [5]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  6. [6]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024

  7. [7]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  8. [8]

    Jointtuner: Appearance-motion adaptive joint training for customized video generation

    Fangda Chen, Shanshan Zhao, Chuanfu Xu, and Long Lan. Jointtuner: Appearance-motion adaptive joint training for customized video generation. arXiv preprint arXiv:2503.23951, 2025

  9. [9]

    Analysis of video quality datasets via design of minimalistic video quality models

    Wei Sun, Wen Wen, Xiongkuo Min, Long Lan, Guangtao Zhai, and Kede Ma. Analysis of video quality datasets via design of minimalistic video quality models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(11):7056–7071, 2024

  10. [10]

    Streamingt2v: Consistent, dynamic, and extendable long video generation from text

    Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2568–2577, 2025

  11. [11]

    Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models

    Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024

  12. [12]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025

  13. [13]

    Gen-L-Video: Multi-text to long video generation via temporal co-denoising

    Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising. arXiv preprint arXiv:2305.18264, 2023

  14. [14]

    Freenoise: Tuning-free longer video diffusion via noise rescheduling

    Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling. In The Twelfth International Conference on Learning Representations, 2024

  15. [15]

    Fifo-diffusion: Generating infinite videos from text without training

    Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. Fifo-diffusion: Generating infinite videos from text without training. Advances in Neural Information Processing Systems, 37:89834–89868, 2024

  16. [16]

    Riflex: A free lunch for length extrapolation in video diffusion transformers

    Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, and Jun Zhu. Riflex: A free lunch for length extrapolation in video diffusion transformers. In Forty-second International Conference on Machine Learning, 2025

  17. [17]

    Longdiff: Training-free long video generation in one go

    Zhuoling Li, Hossein Rahmani, Qiuhong Ke, and Jun Liu. Longdiff: Training-free long video generation in one go. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17789–17798, 2025

  18. [18]

    Free-lunch long video generation via layer-adaptive ood correction

    Jiahao Tian, Chenxi Song, Wei Cheng, and Chi Zhang. Free-lunch long video generation via layer-adaptive ood correction. arXiv preprint arXiv:2603.25209, 2026

  19. [19]

    Freelong: Training-free long video generation with spectralblend temporal attention

    Yu Lu, Yuanzhi Liang, Linchao Zhu, and Yi Yang. Freelong: Training-free long video generation with spectralblend temporal attention. Advances in Neural Information Processing Systems, 37:131434–131455, 2024

  20. [20]

    Freelong++: Training-free long video generation via multi-band spectralfusion

    Yu Lu and Yi Yang. Freelong++: Training-free long video generation via multi-band spectralfusion. arXiv preprint arXiv:2507.00162, 2025

  21. [21]

    Freepca: Integrating consistency information across long-short frames in training-free long video generation via principal component analysis

    Jiangtong Tan, Hu Yu, Jie Huang, Jie Xiao, and Feng Zhao. Freepca: Integrating consistency information across long-short frames in training-free long video generation via principal component analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27979–27988, 2025

  22. [22]

    Mind the gap: a spectral analysis of rank collapse and signal propagation in attention layers

    Thiziri Nait Saada, Alireza Naderi, and Jared Tanner. Mind the gap: a spectral analysis of rank collapse and signal propagation in attention layers. In Forty-second International Conference on Machine Learning, 2025

  23. [23]

    Critical attention scaling in long-context transformers

    Shi Chen, Zhengjiang Lin, Yury Polyanskiy, and Philippe Rigollet. Critical attention scaling in long-context transformers. arXiv preprint arXiv:2510.05554, 2025

  24. [24]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  25. [25]

    Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In 12th International Conference on Learning Representations, ICLR 2024, 2024

  26. [26]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023

  27. [27]

    Genmo Team. Mochi 1. https://github.com/genmoai/models, 2024

  28. [28]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

  29. [29]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025

  30. [30]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025

  31. [31]

    Rolling forcing: Autoregressive long video diffusion in real time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025

  32. [32]

    Self-forcing++: Towards minute-scale high-quality video generation

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025

  33. [33]

    VBench++: Comprehensive and versatile benchmark suite for video generative models

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  34. [34]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  35. [35]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  36. [36]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  37. [37]

    Amt: All-pairs multi-field transforms for efficient frame interpolation

    Zhen Li, Zuo-Liang Zhu, Ling-Hao Han, Qibin Hou, Chun-Le Guo, and Ming-Ming Cheng. Amt: All-pairs multi-field transforms for efficient frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9801–9810, 2023

  38. [38]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision, pages 402–419. Springer, 2020

  39. [39]

    aesthetic-predictor, 2022

    LAION-AI. aesthetic-predictor, 2022. https://github.com/LAION-AI/aesthetic-predictor

  40. [40]

    Musiq: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021