pith. machine review for the scientific record.

arxiv: 2605.06509 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 13:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords long video generation · training-free methods · video diffusion models · singular value decomposition · temporal consistency · content drift · spectral reconstruction · feature decomposition

The pith

FreeSpec uses singular value decomposition to fuse global low-rank guidance with local high-rank details, extending video diffusion models to long sequences without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that long-video problems in diffusion models stem from enlarged attention windows that concentrate spectral energy into a few low-rank directions, preserving coarse structure but losing spatial details and motion variations. It shows that previous global-plus-local methods rely on rigid feature splits that fail when appearance and action are coupled. FreeSpec instead applies singular value decomposition across branches, letting the global part supply low-rank spectral guidance and the local part supply a high-rank reconstruction basis. This spectrum-level fusion keeps long-range consistency while recovering the suppressed high-rank components. A sympathetic reader would care because the approach offers a training-free way to stretch short-video models to longer outputs with less drift and smoother dynamics.

Core claim

Enlarged self-attention windows induce spectral concentration in which energy is dominated by a few low-rank singular directions, suppressing high-rank spatial details and motion-rich temporal variations. FreeSpec decomposes global and local features with singular value decomposition, using the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis. This spectrum-level fusion avoids the rigid partitioning of earlier rules and preserves long-range consistency while better retaining spatial details and temporal dynamics.
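
One way to make "spectral concentration" concrete is the entropy-based effective rank of an attention layer's output: when a few singular directions carry most of the energy, the effective rank drops. A minimal sketch of that measure follows; it is a standard proxy and not necessarily the exact quantity the paper computes for its Figure 2 analysis.

    import torch

    def effective_rank(features: torch.Tensor) -> torch.Tensor:
        """Entropy-based effective rank of a (tokens, dim) feature matrix.

        A low value means the spectral energy is concentrated in a few
        singular directions -- the "spectral concentration" described above.
        """
        s = torch.linalg.svdvals(features)           # singular values
        p = s / s.sum()                               # normalized spectrum
        entropy = -(p * torch.log(p + 1e-12)).sum()   # entropy of the spectrum
        return torch.exp(entropy)                     # effective number of directions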

What carries the argument

Singular value decomposition applied to the global and local feature branches, treating the global output as low-rank spectral guidance and the local output as a high-rank reconstruction basis for spectrum-level fusion.
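
As a rough illustration of what a spectrum-level fusion can look like, here is a minimal PyTorch sketch. The function name, the hard rank cutoff k, and the additive recombination are illustrative assumptions; the paper's actual singular-spectrum modulation, local-basis reconstruction, and global residual (Figure 3) are more involved.

    import torch

    def spectral_fusion(global_feat: torch.Tensor,
                        local_feat: torch.Tensor,
                        k: int) -> torch.Tensor:
        """Hypothetical spectrum-level fusion of two attention branches.

        global_feat, local_feat: (tokens, dim) outputs of the full-window
        (global) and sliding-window (local) attention branches.
        k: number of singular directions treated as the low-rank band.
        """
        Ug, Sg, Vgh = torch.linalg.svd(global_feat, full_matrices=False)
        Ul, Sl, Vlh = torch.linalg.svd(local_feat, full_matrices=False)

        # Low-rank spectral guidance from the global branch: the top-k
        # structure that carries long-range consistency.
        low = Ug[:, :k] @ torch.diag(Sg[:k]) @ Vgh[:k, :]

        # High-rank reconstruction from the local branch: the detail- and
        # motion-rich components that spectral concentration suppresses.
        high = Ul[:, k:] @ torch.diag(Sl[k:]) @ Vlh[k:, :]

        return low + high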

If this is right

  • Long videos maintain both global coherence and local temporal variations when generated from short-video diffusion backbones.
  • Spatial details and action progression are recovered without requiring separate appearance or motion branches.
  • Existing models can be extended to longer durations by simple inference-time feature recombination rather than retraining.
  • Rigid hand-crafted partitioning rules become unnecessary when fusion occurs at the spectrum level.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same low-rank versus high-rank separation could be tested on other attention-heavy generative tasks such as long audio or 3-D synthesis.
  • Adaptive choice of how many singular directions count as low-rank versus high-rank might further improve results on videos with varying motion complexity (one such rule is sketched after this list).
  • If spectral concentration proves general, attention-window scaling rules in future architectures might be redesigned to limit rank collapse from the start.
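
The adaptive-cutoff idea in the second bullet could, for instance, tie the split to cumulative spectral energy rather than a fixed index. A minimal sketch under that assumption, with a purely illustrative 90% energy threshold:

    import torch

    def adaptive_rank_cutoff(singular_values: torch.Tensor,
                             energy_ratio: float = 0.9) -> int:
        """Smallest k whose top-k singular values hold `energy_ratio` of the
        total spectral energy -- a hypothetical rule for choosing the
        low-rank / high-rank split per video or per denoising step."""
        energy = singular_values ** 2
        cumulative = torch.cumsum(energy, dim=0) / energy.sum()
        return int((cumulative < energy_ratio).sum().item()) + 1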

Load-bearing premise

That spectral concentration from enlarged attention windows is the main driver of content drift and over-smoothed dynamics, and that SVD-based low-rank and high-rank fusion separates them cleanly without creating fresh artifacts.

What would settle it

A side-by-side comparison in which FreeSpec-generated long videos still show measurable content drift or loss of fine motion at extended lengths, or in which a non-SVD decomposition achieves comparable consistency and detail without addressing spectral concentration.

Figures

Figures reproduced from arXiv: 2605.06509 by Chuanfu Xu, Fangda Chen, Long Lan, Longrong Yang, Shanshan Zhao, Zhigang Luo.

Figure 1. Long-video examples generated on Wan2.1 [7] with 4× the native training length. Existing training-free methods preserve stable appearance but may weaken continuous camera trajectories in the forest case and collapse sequential motocross actions into repetitive motion.
Figure 2. Spectral and qualitative analysis of enlarged self-attention windows. Here, W = f × h × w denotes the native self-attention token length, where f, h, and w are the temporal length, height, and width of the video latent, respectively. (a) shows the effective-rank dynamics across denoising timesteps under different window sizes. (b) reports the effective rank at representative timesteps, showing that enlarge…
Figure 3. Overview of FreeSpec. FreeSpec extends a frozen short-video diffusion model to long-video generation by replacing self-attention with SVD-guided dual-branch self-attention during inference. It combines global full-window guidance and local sliding-window priors through singular-spectrum modulation, followed by local-basis reconstruction and a lightweight global residual.
Figure 4. Qualitative comparison under 4× length extension on Wan2.1 and LTX-Video.
Figure 5. Qualitative ablation results of FreeSpec on Wan2.1. Yellow boxes highlight representative…
Figure 6. Failure cases. FreeSpec preserves temporal dynamics but fails to infer implicit scene transitions, such as underwater-to-air and diving-platform-to-pool transitions.
Original abstract

Video diffusion models perform well in short-video synthesis, but their training-free extension to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a global branch with a local branch, but they often further decompose appearance consistency and temporal dynamics within each branch using predefined criteria. This assignment is unreliable when appearance and action progression are tightly coupled, such as in camera motion and sequential motion. We analyze the video temporal extension issue from a singular-spectrum perspective and show that enlarged self-attention windows induce spectral concentration: spectral energy becomes dominated by a few low-rank singular directions, preserving coarse structure but suppressing high-rank spatial details and motion-rich temporal variations. To mitigate this problem, we propose FreeSpec, a training-free spectral reconstruction framework for long-video generation. FreeSpec decomposes global and local features with singular value decomposition, and uses the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis. This spectrum-level fusion avoids the rigid feature partitioning of previous decomposition rules, preserving long-range consistency while better retaining spatial details and temporal dynamics. Experiments on Wan2.1 and LTX-Video demonstrate that FreeSpec improves long-video generation, especially for temporal dynamics, while maintaining strong visual quality and temporal consistency. Project demo: https://fdchen24.github.io/FreeSpec-Website/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FreeSpec, a training-free framework for extending video diffusion models to long videos. It analyzes temporal inconsistency and content drift from a singular-spectrum viewpoint, showing that enlarged self-attention windows induce spectral concentration (energy dominated by low-rank singular directions, suppressing high-rank spatial details and temporal variations). FreeSpec decomposes global and local features via SVD, treating the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis. This spectrum-level fusion is claimed to avoid rigid feature partitioning of prior methods. Qualitative experiments on Wan2.1 and LTX-Video report improved temporal dynamics and consistency while preserving visual quality.

Significance. If the claimed mechanism holds, FreeSpec offers a simple, training-free spectral reconstruction technique that could meaningfully advance long-video synthesis in diffusion models by addressing drift through low-rank/high-rank fusion rather than heuristic decompositions. The singular-spectrum analysis provides a fresh perspective on attention-induced artifacts. The approach is parameter-free in its core reconstruction step and could be broadly applicable as a plug-in module.

major comments (2)
  1. [Experiments] Experiments section: The evaluation provides only qualitative comparisons and visual examples on Wan2.1 and LTX-Video. No quantitative metrics (FVD, CLIP similarity, temporal consistency scores), error bars, or statistical significance tests are reported. This weakens the central claim of improvement in dynamics and consistency, as visual inspection alone cannot isolate the contribution of the SVD fusion.
  2. [Method] Method section: The analysis links enlarged attention windows to singular-value decay and spectral concentration, but the manuscript does not isolate this as the dominant cause of drift (versus positional encoding drift or noise scheduling). No ablation holds attention window size and other factors fixed while varying only the low-rank/high-rank spectral split to test whether the fusion re-injects high-rank components without new artifacts or inconsistency.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly state the video lengths and motion types (e.g., camera motion vs. object motion) used in the qualitative examples for reproducibility.
  2. Figure captions should detail the exact baseline methods and attention window sizes being compared to allow readers to assess the visual differences more precisely.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper to incorporate additional experiments and ablations that strengthen the claims.

Point-by-point responses
  1. Referee: Experiments section: The evaluation provides only qualitative comparisons and visual examples on Wan2.1 and LTX-Video. No quantitative metrics (FVD, CLIP similarity, temporal consistency scores), error bars, or statistical significance tests are reported. This weakens the central claim of improvement in dynamics and consistency, as visual inspection alone cannot isolate the contribution of the SVD fusion.

    Authors: We agree that quantitative metrics would strengthen the evidence. In the revised manuscript we will add FVD, CLIP similarity, and temporal consistency scores (e.g., frame-to-frame optical-flow consistency) computed on the same Wan2.1 and LTX-Video examples, together with baseline comparisons. Where feasible we will also report results across multiple random seeds to include error bars. revision: yes

  2. Referee: Method section: The analysis links enlarged attention windows to singular-value decay and spectral concentration, but the manuscript does not isolate this as the dominant cause of drift (versus positional encoding drift or noise scheduling). No ablation holds attention window size and other factors fixed while varying only the low-rank/high-rank spectral split to test whether the fusion re-injects high-rank components without new artifacts or inconsistency.

    Authors: The singular-spectrum analysis is motivated by direct observation of the attention-window effect inside the diffusion U-Net; we view spectral concentration as a primary mechanism, though we acknowledge other factors can interact. To isolate the fusion step we will add a controlled ablation that fixes attention-window size, positional encodings, and noise schedule while varying only the low-rank/high-rank singular-value split. The new results will quantify whether the spectrum-level reconstruction re-injects high-rank components without introducing artifacts or inconsistency. revision: yes
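
On the metrics question in response 1, one commonly used temporal-consistency proxy is the mean cosine similarity of CLIP image embeddings between consecutive frames. The sketch below assumes the open-source OpenAI clip package and a list of frame image paths; it is a reference point for what such a score measures, not necessarily the metric the authors will report.

    import torch
    import clip                      # https://github.com/openai/CLIP
    from PIL import Image

    def clip_temporal_consistency(frame_paths, device="cuda"):
        """Mean cosine similarity of CLIP embeddings for consecutive frames;
        higher values indicate steadier appearance over time."""
        model, preprocess = clip.load("ViT-B/32", device=device)
        with torch.no_grad():
            embs = []
            for path in frame_paths:
                image = preprocess(Image.open(path)).unsqueeze(0).to(device)
                embs.append(model.encode_image(image).squeeze(0))
            embs = torch.stack(embs)
            embs = embs / embs.norm(dim=-1, keepdim=True)
            # Cosine similarity between each frame and its successor.
            sims = (embs[:-1] * embs[1:]).sum(dim=-1)
        return sims.mean().item()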

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's core chain consists of an empirical observation (enlarged attention windows concentrate singular energy) followed by an independent proposal (SVD decomposition of global/local branches with low-rank guidance + high-rank basis fusion). No equations, fitted parameters, or self-citations are shown that reduce the claimed improvements in dynamics/consistency to quantities defined by the method itself. The framework is presented as a new reconstruction step rather than a renaming or self-referential fit. This matches the default expectation for non-circular papers.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that video features exhibit a clean low-rank/high-rank spectral split that can be fused without loss, plus the unstated premise that the observed improvements stem from this spectral mechanism rather than other factors.

axioms (2)
  • domain assumption Enlarged self-attention windows induce spectral concentration that suppresses high-rank spatial details and motion-rich temporal variations.
    Invoked in the analysis of the video temporal extension issue.
  • domain assumption The global branch provides reliable low-rank spectral guidance and the local branch provides a reliable high-rank reconstruction basis.
    Core of the proposed fusion strategy.

pith-pipeline@v0.9.0 · 5555 in / 1343 out tokens · 31231 ms · 2026-05-08T13:08:19.216273+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

40 extracted references · 18 canonical work pages · 9 internal anchors

  1. [1]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023

  2. [2]

    Lavie: High-quality video generation with cascaded latent diffusion models

    Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. International Journal of Computer Vision, 133(5):3059–3078, 2025

  3. [3]

    Videocrafter2: Overcoming data limitations for high-quality video diffusion models

    Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7310–7320, 2024

  4. [4]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Representations, 2025

  5. [5]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  6. [6]

    LTX-Video: Realtime Video Latent Diffusion

    Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion. arXiv preprint arXiv:2501.00103, 2024

  7. [7]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  8. [8]

    Jointtuner: Appearance-motion adaptive joint training for customized video generation

    Fangda Chen, Shanshan Zhao, Chuanfu Xu, and Long Lan. Jointtuner: Appearance-motion adaptive joint training for customized video generation. arXiv preprint arXiv:2503.23951, 2025

  9. [9]

    Analysis of video quality datasets via design of minimalistic video quality models

    Wei Sun, Wen Wen, Xiongkuo Min, Long Lan, Guangtao Zhai, and Kede Ma. Analysis of video quality datasets via design of minimalistic video quality models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(11):7056–7071, 2024

  10. [10]

    Streamingt2v: Consistent, dynamic, and extendable long video generation from text

    Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2568–2577, 2025

  11. [11]

    Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models

    Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233, 2024

  12. [12]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025

  13. [13]

    Gen-L-Video: Multi-text to long video generation via temporal co-denoising

    Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising. arXiv preprint arXiv:2305.18264, 2023

  14. [14]

    Freenoise: Tuning-free longer video diffusion via noise rescheduling

    Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling. In The Twelfth International Conference on Learning Representations, 2024

  15. [15]

    Fifo-diffusion: Generating infinite videos from text without training

    Jihwan Kim, Junoh Kang, Jinyoung Choi, and Bohyung Han. Fifo-diffusion: Generating infinite videos from text without training. Advances in Neural Information Processing Systems, 37:89834–89868, 2024

  16. [16]

    Riflex: A free lunch for length extrapolation in video diffusion transformers

    Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, and Jun Zhu. Riflex: A free lunch for length extrapolation in video diffusion transformers. In Forty-second International Conference on Machine Learning, 2025

  17. [17]

    Longdiff: Training-free long video generation in one go

    Zhuoling Li, Hossein Rahmani, Qiuhong Ke, and Jun Liu. Longdiff: Training-free long video generation in one go. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17789–17798, 2025

  18. [18]

    Free-lunch long video generation via layer-adaptive ood correction

    Jiahao Tian, Chenxi Song, Wei Cheng, and Chi Zhang. Free-lunch long video generation via layer-adaptive ood correction. arXiv preprint arXiv:2603.25209, 2026

  19. [19]

    Freelong: Training-free long video generation with spectralblend temporal attention

    Yu Lu, Yuanzhi Liang, Linchao Zhu, and Yi Yang. Freelong: Training-free long video generation with spectralblend temporal attention. Advances in Neural Information Processing Systems, 37:131434–131455, 2024

  20. [20]

    Freelong++: Training-free long video generation via multi-band spectralfusion

    Yu Lu and Yi Yang. Freelong++: Training-free long video generation via multi-band spectralfusion. arXiv preprint arXiv:2507.00162, 2025

  21. [21]

    Freepca: Integrating consistency information across long-short frames in training-free long video generation via principal component analysis

    Jiangtong Tan, Hu Yu, Jie Huang, Jie Xiao, and Feng Zhao. Freepca: Integrating consistency information across long-short frames in training-free long video generation via principal component analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27979–27988, 2025

  22. [22]

    Mind the gap: a spectral analysis of rank collapse and signal propagation in attention layers

    Thiziri Nait Saada, Alireza Naderi, and Jared Tanner. Mind the gap: a spectral analysis of rank collapse and signal propagation in attention layers. In Forty-second International Conference on Machine Learning, 2025

  23. [23]

    Critical attention scaling in long-context transformers

    Shi Chen, Zhengjiang Lin, Yury Polyanskiy, and Philippe Rigollet. Critical attention scaling in long-context transformers. arXiv preprint arXiv:2510.05554, 2025

  24. [24]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023

  25. [25]

    Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In 12th International Conference on Learning Representations, ICLR 2024, 2024

  26. [26]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023

  27. [27]

    Genmo Team. Mochi 1. https://github.com/genmoai/models, 2024

  28. [28]

    Movie Gen: A Cast of Media Foundation Models

    Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

  29. [29]

    SkyReels-V2: Infinite-length Film Generative Model

    Guibin Chen, Dixuan Lin, Jiangping Yang, Chunze Lin, Junchen Zhu, Mingyuan Fan, Hao Zhang, Sheng Chen, Zheng Chen, Chengcheng Ma, et al. Skyreels-v2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074, 2025

  30. [30]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025

  31. [31]

    Rolling forcing: Autoregressive long video diffusion in real time

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161, 2025

  32. [32]

    Self-forcing++: Towards minute-scale high-quality video generation

    Justin Cui, Jie Wu, Ming Li, Tao Yang, Xiaojie Li, Rui Wang, Andrew Bai, Yuanhao Ban, and Cho-Jui Hsieh. Self-forcing++: Towards minute-scale high-quality video generation. arXiv preprint arXiv:2510.02283, 2025

  33. [33]

    VBench++: Comprehensive and versatile benchmark suite for video generative models

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  34. [34]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  35. [35]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  36. [36]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  37. [37]

    Amt: All-pairs multi-field transforms for efficient frame interpolation

    Zhen Li, Zuo-Liang Zhu, Ling-Hao Han, Qibin Hou, Chun-Le Guo, and Ming-Ming Cheng. Amt: All-pairs multi-field transforms for efficient frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9801–9810, 2023

  38. [38]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the European Conference on Computer Vision, pages 402–419. Springer, 2020

  39. [39]

    aesthetic-predictor, 2022

    LAION-AI. aesthetic-predictor, 2022. https://github.com/LAION-AI/aesthetic-predictor

  40. [40]

    Musiq: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021